# Apify Documentation

## Apify API

- [Apify API](https://docs.apify.com/api.md)
- [Apify API](https://docs.apify.com/api/v2.md): The Apify API (version 2) provides programmatic access to the [Apify
- [Abort build](https://docs.apify.com/api/v2/act-build-abort-post.md): **[DEPRECATED]** API endpoints related to build of the Actor were moved
- [Get default build](https://docs.apify.com/api/v2/act-build-default-get.md): Get the default build for an Actor.
- [Get build](https://docs.apify.com/api/v2/act-build-get.md): By passing the optional `waitForFinish` parameter the API endpoint will
- [Get list of builds](https://docs.apify.com/api/v2/act-builds-get.md): Gets the list of builds of a specific Actor. The response is a JSON with the
- [Build Actor](https://docs.apify.com/api/v2/act-builds-post.md): Builds an Actor.
- [Delete Actor](https://docs.apify.com/api/v2/act-delete.md): Deletes an Actor.
- [Get Actor](https://docs.apify.com/api/v2/act-get.md): Gets an object that contains all the details about a specific Actor.
- [Get OpenAPI definition](https://docs.apify.com/api/v2/act-openapi-json-get.md)
- [Update Actor](https://docs.apify.com/api/v2/act-put.md): Updates settings of an Actor using values specified by an Actor object
- [Abort run](https://docs.apify.com/api/v2/act-run-abort-post.md): **[DEPRECATED]** API endpoints related to run of the Actor were moved under
- [Get run](https://docs.apify.com/api/v2/act-run-get.md): **[DEPRECATED]** API endpoints related to run of the Actor were moved under
- [Metamorph run](https://docs.apify.com/api/v2/act-run-metamorph-post.md): **[DEPRECATED]** API endpoints related to run of the Actor were moved under
- [Resurrect run](https://docs.apify.com/api/v2/act-run-resurrect-post.md): **[DEPRECATED]** API endpoints related to run of the Actor were moved under
- [Without input](https://docs.apify.com/api/v2/act-run-sync-get.md): Runs a specific Actor and returns its output.
- [Run Actor synchronously without input and get dataset items](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-get.md): Runs a specific Actor and returns its dataset items.
- [Run Actor synchronously with input and get dataset items](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md): Runs a specific Actor and returns its dataset items.
- [Run Actor synchronously with input and return output](https://docs.apify.com/api/v2/act-run-sync-post.md): Runs a specific Actor and returns its output.
- [Get list of runs](https://docs.apify.com/api/v2/act-runs-get.md): Gets the list of runs of a specific Actor. The response is a list of
- [Get last run](https://docs.apify.com/api/v2/act-runs-last-get.md): This is not a single endpoint, but an entire group of endpoints that lets you
- [Run Actor](https://docs.apify.com/api/v2/act-runs-post.md): Runs an Actor and immediately returns without waiting for the run to finish.
- [Delete version](https://docs.apify.com/api/v2/act-version-delete.md): Deletes a specific version of Actor's source code.
- [Delete environment variable](https://docs.apify.com/api/v2/act-version-env-var-delete.md): Deletes a specific environment variable.
- [Get environment variable](https://docs.apify.com/api/v2/act-version-env-var-get.md): Gets an [EnvVar object](#/reference/actors/environment-variable-object) that
- [Update environment variable](https://docs.apify.com/api/v2/act-version-env-var-put.md): Updates Actor environment variable using values specified by an [EnvVar
- [Get list of environment variables](https://docs.apify.com/api/v2/act-version-env-vars-get.md): Gets the list of environment variables for a specific version of an Actor.
- [Create environment variable](https://docs.apify.com/api/v2/act-version-env-vars-post.md): Creates an environment variable of an Actor using values specified in a
- [Get version](https://docs.apify.com/api/v2/act-version-get.md): Gets a [Version object](#/reference/actors/version-object) that contains all the details about a specific version of an Actor.
- [Update version](https://docs.apify.com/api/v2/act-version-put.md): Updates Actor version using values specified by a [Version object](#/reference/actors/version-object) passed as JSON in the POST payload.
- [Get list of versions](https://docs.apify.com/api/v2/act-versions-get.md): Gets the list of versions of a specific Actor. The response is a JSON object
- [Create version](https://docs.apify.com/api/v2/act-versions-post.md): Creates a version of an Actor using values specified in a [Version
- [Get list of webhooks](https://docs.apify.com/api/v2/act-webhooks-get.md): Gets the list of webhooks of a specific Actor. The response is a JSON with
- [Abort build](https://docs.apify.com/api/v2/actor-build-abort-post.md): Aborts an Actor build and returns an object that contains all the details
- [Delete build](https://docs.apify.com/api/v2/actor-build-delete.md): Delete the build. The build that is the current default build for the Actor
- [Get build](https://docs.apify.com/api/v2/actor-build-get.md): Gets an object that contains all the details about a specific build of an
- [Get log](https://docs.apify.com/api/v2/actor-build-log-get.md): Check out [Logs](#/reference/logs) for full reference.
- [Get OpenAPI definition](https://docs.apify.com/api/v2/actor-build-openapi-json-get.md): Get the OpenAPI definition for Actor builds. Two similar endpoints are available:
- [Actor builds - Introduction](https://docs.apify.com/api/v2/actor-builds.md): Actor builds - Introduction
- [Get user builds list](https://docs.apify.com/api/v2/actor-builds-get.md): Gets a list of all builds for a user. The response is a JSON array of
- [Abort run](https://docs.apify.com/api/v2/actor-run-abort-post.md): Aborts an Actor run and returns an object that contains all the details
- [Delete run](https://docs.apify.com/api/v2/actor-run-delete.md): Delete the run. Only finished runs can be deleted. Only the person or
- [Get run](https://docs.apify.com/api/v2/actor-run-get.md): This is not a single endpoint, but an entire group of endpoints that lets
- [Metamorph run](https://docs.apify.com/api/v2/actor-run-metamorph-post.md): Transforms an Actor run into a run of another Actor with a new input.
- [Update status message](https://docs.apify.com/api/v2/actor-run-put.md): You can set a single status message on your run that will be displayed in
- [Reboot run](https://docs.apify.com/api/v2/actor-run-reboot-post.md): Reboots an Actor run and returns an object that contains all the details
- [Actor runs - Introduction](https://docs.apify.com/api/v2/actor-runs.md): Actor runs - Introduction
- [Get user runs list](https://docs.apify.com/api/v2/actor-runs-get.md): Gets a list of all runs for a user. The response is a list of objects, where
- [Delete task](https://docs.apify.com/api/v2/actor-task-delete.md): Delete the task specified through the `actorTaskId` parameter.
- [Get task](https://docs.apify.com/api/v2/actor-task-get.md): Get an object that contains all the details about a task.
- [Get task input](https://docs.apify.com/api/v2/actor-task-input-get.md): Returns the input of a given task.
- [Update task input](https://docs.apify.com/api/v2/actor-task-input-put.md): Updates the input of a task using values specified by an object passed as
- [Update task](https://docs.apify.com/api/v2/actor-task-put.md): Update settings of a task using values specified by an object passed as JSON
- [Run task synchronously](https://docs.apify.com/api/v2/actor-task-run-sync-get.md): Run a specific task and return its output.
- [Run task synchronously and get dataset items](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-get.md): Run a specific task and return its dataset items.
- [Run task synchronously and get dataset items](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md): Runs an Actor task and synchronously returns its dataset items.
- [Run task synchronously](https://docs.apify.com/api/v2/actor-task-run-sync-post.md): Runs an Actor task and synchronously returns its output.
- [Get list of task runs](https://docs.apify.com/api/v2/actor-task-runs-get.md): Get a list of runs of a specific task. The response is a list of objects,
- [Get last run](https://docs.apify.com/api/v2/actor-task-runs-last-get.md): This is not a single endpoint, but an entire group of endpoints that lets you
- [Run task](https://docs.apify.com/api/v2/actor-task-runs-post.md): Runs an Actor task and immediately returns without waiting for the run to
- [Get list of webhooks](https://docs.apify.com/api/v2/actor-task-webhooks-get.md): Gets the list of webhooks of a specific Actor task. The response is a JSON
- [Actor tasks - Introduction](https://docs.apify.com/api/v2/actor-tasks.md): Actor tasks - Introduction
- [Get list of tasks](https://docs.apify.com/api/v2/actor-tasks-get.md): Gets the complete list of tasks that a user has created or used.
- [Create task](https://docs.apify.com/api/v2/actor-tasks-post.md): Create a new task with settings specified by the object passed as JSON in
- [Actors - Introduction](https://docs.apify.com/api/v2/actors.md): Actors - Introduction
- [Actor builds - Introduction](https://docs.apify.com/api/v2/actors-actor-builds.md): Actor builds - Introduction
- [Actor runs - Introduction](https://docs.apify.com/api/v2/actors-actor-runs.md): Actor runs - Introduction
- [Actor versions - Introduction](https://docs.apify.com/api/v2/actors-actor-versions.md): Actor versions - Introduction
- [Webhook collection - Introduction](https://docs.apify.com/api/v2/actors-webhook-collection.md): Webhook collection - Introduction
- [Get list of Actors](https://docs.apify.com/api/v2/acts-get.md): Gets the list of all Actors that the user created or used. The response is a
- [Create Actor](https://docs.apify.com/api/v2/acts-post.md): Creates a new Actor with settings specified in an Actor object passed as
- [Delete dataset](https://docs.apify.com/api/v2/dataset-delete.md): Deletes a specific dataset.
- [Get dataset](https://docs.apify.com/api/v2/dataset-get.md): Returns dataset object for given dataset ID.
- [Get items](https://docs.apify.com/api/v2/dataset-items-get.md): Returns data stored in the dataset in a desired format.
- [Store items](https://docs.apify.com/api/v2/dataset-items-post.md): Appends an item or an array of items to the end of the dataset.
- [Update dataset](https://docs.apify.com/api/v2/dataset-put.md): Updates a dataset's name using a value specified by a JSON object passed in the PUT payload.
- [Get dataset statistics](https://docs.apify.com/api/v2/dataset-statistics-get.md): Returns statistics for given dataset.
- [Get list of datasets](https://docs.apify.com/api/v2/datasets-get.md): Lists all of a user's datasets.
- [Create dataset](https://docs.apify.com/api/v2/datasets-post.md): Creates a dataset and returns its object.
- [Getting started with Apify API](https://docs.apify.com/api/v2/getting-started.md): The Apify API provides programmatic access to the Apify platform. The API is organized around RESTful HTTP endpoints (a minimal request sketch follows this reference list).
- [Delete store](https://docs.apify.com/api/v2/key-value-store-delete.md): Deletes a key-value store.
- [Get store](https://docs.apify.com/api/v2/key-value-store-get.md): Gets an object that contains all the details about a specific key-value
- [Get list of keys](https://docs.apify.com/api/v2/key-value-store-keys-get.md): Returns a list of objects describing keys of a given key-value store, as
- [Update store](https://docs.apify.com/api/v2/key-value-store-put.md): Updates a key-value store's name using a value specified by a JSON object
- [Delete record](https://docs.apify.com/api/v2/key-value-store-record-delete.md): Removes a record specified by a key from the key-value store.
- [Get record](https://docs.apify.com/api/v2/key-value-store-record-get.md): Gets a value stored in the key-value store under a specific key.
- [Check if a record exists](https://docs.apify.com/api/v2/key-value-store-record-head.md): Check if a value is stored in the key-value store under a specific key.
- [Store record](https://docs.apify.com/api/v2/key-value-store-record-put.md): Stores a value under a specific key to the key-value store.
- [Get list of key-value stores](https://docs.apify.com/api/v2/key-value-stores-get.md): Gets the list of key-value stores owned by the user.
- [Create key-value store](https://docs.apify.com/api/v2/key-value-stores-post.md): Creates a key-value store and returns its object. The response is the same
- [Get log](https://docs.apify.com/api/v2/log-get.md): Retrieves logs for a specific Actor build or run.
- [Logs - Introduction](https://docs.apify.com/api/v2/logs.md): Logs - Introduction
- [Charge events in run](https://docs.apify.com/api/v2/post-charge-run.md): Charge for events in the run of your [pay per event Actor](https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event).
- [Resurrect run](https://docs.apify.com/api/v2/post-resurrect-run.md): Resurrects a finished Actor run and returns an object that contains all the details about the resurrected run.
- [Delete request queue](https://docs.apify.com/api/v2/request-queue-delete.md): Deletes given queue.
- [Get request queue](https://docs.apify.com/api/v2/request-queue-get.md): Returns queue object for given queue ID.
- [Get head](https://docs.apify.com/api/v2/request-queue-head-get.md): Returns given number of first requests from the queue.
- [Get head and lock](https://docs.apify.com/api/v2/request-queue-head-lock-post.md): Returns the given number of first requests from the queue and locks them for
- [Update request queue](https://docs.apify.com/api/v2/request-queue-put.md): Updates a request queue's name using a value specified by a JSON object
- [Delete request](https://docs.apify.com/api/v2/request-queue-request-delete.md): Deletes given request from queue.
- [Get request](https://docs.apify.com/api/v2/request-queue-request-get.md): Returns request from queue.
- [Delete request lock](https://docs.apify.com/api/v2/request-queue-request-lock-delete.md): Deletes a request lock. The request lock can be deleted only by the client
- [Prolong request lock](https://docs.apify.com/api/v2/request-queue-request-lock-put.md): Prolongs request lock. The request lock can be prolonged only by the client
- [Update request](https://docs.apify.com/api/v2/request-queue-request-put.md): Updates a request in a queue. Mark request as handled by setting
- [Delete requests](https://docs.apify.com/api/v2/request-queue-requests-batch-delete.md): Batch-deletes given requests from the queue. The number of requests in a
- [Add requests](https://docs.apify.com/api/v2/request-queue-requests-batch-post.md): Adds requests to the queue in batch. The maximum requests in batch is limit
- [List requests](https://docs.apify.com/api/v2/request-queue-requests-get.md): Returns a list of requests. This endpoint is paginated using
- [Add request](https://docs.apify.com/api/v2/request-queue-requests-post.md): Adds request to the queue. Response contains ID of the request and info if
- [Unlock requests](https://docs.apify.com/api/v2/request-queue-requests-unlock-post.md): Unlocks requests in the queue that are currently locked by the client.
- [Get list of request queues](https://docs.apify.com/api/v2/request-queues-get.md): Lists all of a user's request queues. The response is a JSON array of
- [Create request queue](https://docs.apify.com/api/v2/request-queues-post.md): Creates a request queue and returns its object.
- [Delete schedule](https://docs.apify.com/api/v2/schedule-delete.md): Deletes a schedule.
- [Get schedule](https://docs.apify.com/api/v2/schedule-get.md): Gets the schedule object with all details.
- [Get schedule log](https://docs.apify.com/api/v2/schedule-log-get.md): Gets the schedule log as a JSON array containing information about up to a
- [Update schedule](https://docs.apify.com/api/v2/schedule-put.md): Updates a schedule using values specified by a schedule object passed as
- [Schedules - Introduction](https://docs.apify.com/api/v2/schedules.md): Schedules - Introduction
- [Get list of schedules](https://docs.apify.com/api/v2/schedules-get.md): Gets the list of schedules that the user created.
- [Create schedule](https://docs.apify.com/api/v2/schedules-post.md): Creates a new schedule with settings provided by the schedule object passed
- [Datasets - Introduction](https://docs.apify.com/api/v2/storage-datasets.md): Datasets - Introduction
- [Key-value stores - Introduction](https://docs.apify.com/api/v2/storage-key-value-stores.md): Key-value stores - Introduction
- [Request queues - Introduction](https://docs.apify.com/api/v2/storage-request-queues.md): Request queues - Introduction
- [Requests - Introduction](https://docs.apify.com/api/v2/storage-request-queues-requests.md): Requests - Introduction
- [Requests locks - Introduction](https://docs.apify.com/api/v2/storage-request-queues-requests-locks.md): Requests locks - Introduction
- [Store - Introduction](https://docs.apify.com/api/v2/store.md): Store - Introduction
- [Get list of Actors in store](https://docs.apify.com/api/v2/store-get.md): Gets the list of public Actors in Apify Store. You can use `search`
- [Get public user data](https://docs.apify.com/api/v2/user-get.md): Returns public information about a specific user account, similar to what
- [Users - Introduction](https://docs.apify.com/api/v2/users.md): Users - Introduction
- [Get private user data](https://docs.apify.com/api/v2/users-me-get.md): Returns information about the current user account, including both public
- [Get limits](https://docs.apify.com/api/v2/users-me-limits-get.md): Returns a complete summary of your account's limits. It is the same
- [Update limits](https://docs.apify.com/api/v2/users-me-limits-put.md): Updates the account's limits manageable on your account's [Limits page](https://console.apify.com/billing#/limits).
- [Get monthly usage](https://docs.apify.com/api/v2/users-me-usage-monthly-get.md): Returns a complete summary of your usage for the current usage cycle,
- [Delete webhook](https://docs.apify.com/api/v2/webhook-delete.md): Deletes a webhook.
- [Get webhook dispatch](https://docs.apify.com/api/v2/webhook-dispatch-get.md): Gets webhook dispatch object with all details.
- [Get collection](https://docs.apify.com/api/v2/webhook-dispatches-get.md): Gets a given webhook's list of dispatches.
- [Get webhook](https://docs.apify.com/api/v2/webhook-get.md): Gets webhook object with all details.
- [Update webhook](https://docs.apify.com/api/v2/webhook-put.md): Updates a webhook using values specified by a webhook object passed as JSON
- [Test webhook](https://docs.apify.com/api/v2/webhook-test-post.md): Tests a webhook. Creates a webhook dispatch with a dummy payload.
- [Get list of webhooks](https://docs.apify.com/api/v2/webhooks-get.md): Gets the list of webhooks that the user created.
- [Create webhook](https://docs.apify.com/api/v2/webhooks-post.md): Creates a new webhook with settings provided by the webhook object passed as
- [Webhook dispatches - Introduction](https://docs.apify.com/api/v2/webhooks-webhook-dispatches.md): Webhook dispatches - Introduction
- [Webhooks - Introduction](https://docs.apify.com/api/v2/webhooks-webhooks.md): Webhooks - Introduction
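As a quick orientation to the endpoints listed above, here is a minimal sketch of one common flow: running an Actor synchronously over plain HTTP and reading its dataset items. The Actor ID, input fields, and token handling below are placeholder assumptions; the endpoint reference pages above remain the authoritative source for parameters and response shapes.

```typescript
// Minimal sketch (TypeScript, Node 18+ with global fetch): run an Actor
// synchronously and read its dataset items via the Apify API. The Actor ID
// and input fields are hypothetical placeholders; see the endpoint reference above.
const token = process.env.APIFY_TOKEN; // personal API token
const actorId = 'username~actor-name'; // placeholder Actor ID

async function runActorAndGetItems(input: Record<string, unknown>): Promise<unknown[]> {
  const url = `https://api.apify.com/v2/acts/${actorId}/run-sync-get-dataset-items?token=${token}`;
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(input), // Actor input, passed as JSON in the POST payload
  });
  if (!response.ok) {
    throw new Error(`Apify API request failed with status ${response.status}`);
  }
  // Once the run finishes, the endpoint responds with the run's dataset items.
  return (await response.json()) as unknown[];
}

runActorAndGetItems({ someInputField: 'value' })
  .then((items) => console.log(`Received ${items.length} dataset items`))
  .catch(console.error);
```

The same token and base URL apply to the other resources in this list (runs, storages, webhooks, schedules); the apify-client package mentioned later in this index, available for both JavaScript and Python, wraps these endpoints as well.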
## open-source

- [Apify open source](https://docs.apify.com/open-source.md)

## sdk

- [Apify SDK](https://docs.apify.com/sdk.md)

## search

- [Search the documentation](https://docs.apify.com/search.md)

## Apify academy

- [Web Scraping Academy](https://docs.apify.com/academy.md): Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer.
- [Actor marketing playbook](https://docs.apify.com/academy/actor-marketing-playbook.md): Learn how to optimize and monetize your Actors on Apify Store by sharing them with other platform users.
- [Actor description & SEO description](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description.md): Learn about Actor description and meta description. Where to set them and best practices for both content and length.
- [Actors & emojis](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actors-and-emojis.md): Discover how emojis can boost your Actors by grabbing attention, simplifying navigation, and enhancing clarity. Improve user experience and engagement on Apify Store.
- [How to create an Actor README](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md): Learn how to write a comprehensive README to help users better navigate, understand and run public Actors in Apify Store.
- [Importance of Actor URL](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md): Learn how to set your Actor’s URL (technical name) and name effectively when creating it on Apify. Follow best practices to optimize your Actor’s web presence and ensure it stands out on Apify Store.
- [Name your Actor](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/name-your-actor.md): Learn Apify’s standards for naming Actors and how to choose the right name for your scraping and automation tools and maximize visibility on Apify Store.
- [Emails to Actor users](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/emails-to-actor-users.md): Email communication is a key tool to keep users engaged and satisfied. Learn when and how to email your users effectively to build loyalty and strengthen relationships with this practical guide.
- [Issues tab](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/issues-tab.md): Learn how the Issues tab can help you improve your Actor, engage with users, and build a reliable, user-friendly solution.
- [Your Store bio](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/your-store-bio.md): Your Apify Store bio is all about helping you promote your tools & skills.
- [Actor bundles](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/actor-bundles.md): Learn what an Actor bundle is, explore existing examples, and discover how to promote them.
- [How to create a great input schema](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/how-to-create-a-great-input-schema.md): Optimizing your input schema. Learn to design and refine your input schema with best practices for a better user experience.
- [Blogs and blog resources](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/blogs-and-blog-resources.md): Blogs are still a powerful way to promote your Actors and build authority. By sharing expertise, engaging users, and driving organic traffic, blogging remains a key strategy to complement social media, SEO, and other platforms in growing your audience.
- [Parasite SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/parasite-seo.md): Explore parasite SEO, a unique strategy that leverages third-party sites to boost rankings and drive traffic to your tools.
- [Product Hunt](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/product-hunt.md): Boost your Actor’s visibility by launching it on Product Hunt, a top platform for tech innovations. Attract early adopters, developers, and businesses while showcasing your tool’s value through visuals or demos.
- [SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md): Learn how to optimize your content to rank higher on search engines like Google and Bing, attract more users, and drive long-term traffic - all for free.
- [Social media](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/social-media.md): Leverage social media to connect with users and grow your Actor’s audience. Learn how to showcase features, engage with users, and avoid common pitfalls.
- [Video tutorials](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/video-tutorials.md): Use video tutorials to demonstrate features, offer tutorials, and connect with users in real time, building trust and driving interest in your tools.
- [Webinars](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/webinars.md): Webinars and live streams are powerful tools to showcase your Actor’s features. Learn how to plan, host, and maximize the impact of your webinar.
- [Actor success stories](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/actor-success-stories.md): Learn about developers who successfully make passive income from their Actors.
- [How Actor monetization works](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md): Discover how to share your tools and explore monetization options to earn from your automation expertise.
- [How Apify Store works](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-store-works.md): Learn how to create and publish your own Actor, and join a growing community of innovators in scraping and web automation.
- [How to build Actors](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-to-build-actors.md): Learn how to create web scrapers and automation tools on Apify. Use universal scrapers for quick setup, code templates for a head start, or SDKs and libraries for full control.
- [Ideas page and its use](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/ideas-page.md): Learn where you can draw inspiration for your Actors.
- [Actorization playbook](https://docs.apify.com/academy/actorization.md): A guide to converting your applications, scripts, and open-source projects into monetizable, cloud-based tools on the Apify platform.
- [Advanced web scraping](https://docs.apify.com/academy/advanced-web-scraping.md): Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
- [Crawling sitemaps](https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-sitemaps.md): Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
- [Scraping websites with search](https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-with-search.md): Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
- [Sitemaps vs search](https://docs.apify.com/academy/advanced-web-scraping/crawling/sitemaps-vs-search.md): Learn how to extract all of a website's listings even if they limit the number of results pages.
- [Tips and tricks for robustness](https://docs.apify.com/academy/advanced-web-scraping/tips-and-tricks-robustness.md): Learn how to make your automated processes more effective. Avoid common pitfalls, future-proof your programs and improve your processes.
- [Build and monetize AI agents on Apify](https://docs.apify.com/academy/ai/ai-agents.md): This guide shows you how to create an AI agent using the CrewAI Python framework and the Apify platform. You will build an Instagram analysis agent that integrates with large language models (LLMs) and web scrapers.
- [Anti-scraping protections](https://docs.apify.com/academy/anti-scraping.md): Understand the various anti-scraping measures different sites use to prevent bots from accessing them, and how to appear more human to fix these issues.
- [Anti-scraping mitigation](https://docs.apify.com/academy/anti-scraping/mitigation.md): After learning about the various anti-scraping techniques websites use, learn how to mitigate them with a few different techniques.
- [Bypassing Cloudflare browser check](https://docs.apify.com/academy/anti-scraping/mitigation/cloudflare-challenge.md.md): Learn how to bypass the Cloudflare browser challenge with Crawlee.
- [Generating fingerprints](https://docs.apify.com/academy/anti-scraping/mitigation/generating-fingerprints.md): Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page.
- [Proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md): Learn all about proxies, how they work, and how they can be leveraged in a scraper to avoid blocking and other anti-scraping tactics.
- [Using proxies](https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md): Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.
- [Anti-scraping techniques](https://docs.apify.com/academy/anti-scraping/techniques.md): Understand the various common (and obscure) anti-scraping techniques used by websites to prevent bots from accessing their content.
- [Browser challenges](https://docs.apify.com/academy/anti-scraping/techniques/browser-challenges.md): Learn how to navigate browser challenges like Cloudflare's to effectively scrape data from protected websites.
- [Captchas](https://docs.apify.com/academy/anti-scraping/techniques/captchas.md): Learn about the reasons a bot might be presented a captcha, the best ways to avoid captchas in the first place, and how to programmatically solve them.
- [Fingerprinting](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md): Understand browser fingerprinting, an advanced technique used by websites to track user data and even block bots from accessing them.
- [Firewalls](https://docs.apify.com/academy/anti-scraping/techniques/firewalls.md): Understand what web-application firewalls are, how they work, and the various common techniques for avoiding them altogether.
- [Geolocation](https://docs.apify.com/academy/anti-scraping/techniques/geolocation.md): Learn about the geolocation techniques used to determine where requests are coming from, and a bit about how to avoid being blocked based on geolocation.
- [Rate-limiting](https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md): Learn about rate-limiting, a common tactic websites use to prevent a large, non-human rate of requests coming from a single IP address.
- [API Tutorials 💻📚](https://docs.apify.com/academy/api.md): A collection of various tutorials explaining how to interact with the Apify platform programmatically using its API.
- [API scraping](https://docs.apify.com/academy/api-scraping.md): Learn how professionals scrape various types of APIs with different configurations, parameters, and requirements.
- [General API scraping](https://docs.apify.com/academy/api-scraping/general-api-scraping.md): Learn the benefits and drawbacks of API scraping, how to locate an API, how to utilize its features, and how to work around common roadblocks.
- [Dealing with headers, cookies, and tokens](https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md): Learn how some APIs require certain cookies, headers, and/or tokens to be present in a request in order for data to be received.
- [Handling pagination](https://docs.apify.com/academy/api-scraping/general-api-scraping/handling-pagination.md): Learn about the three most popular API pagination techniques and how to handle each of them when scraping an API with pagination.
- [Locating API endpoints](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md): Learn how to effectively locate a website's API endpoints, and how to use them to get the data you want faster and more reliably.
- [GraphQL scraping](https://docs.apify.com/academy/api-scraping/graphql-scraping.md): Dig into scraping APIs that use the latest and greatest API technology, GraphQL. GraphQL APIs are very different from regular REST APIs.
- [Custom queries](https://docs.apify.com/academy/api-scraping/graphql-scraping/custom-queries.md): Learn how to write custom GraphQL queries, how to pass input values into GraphQL requests as variables, and how to retrieve and output the data from a scraper.
- [Introspection](https://docs.apify.com/academy/api-scraping/graphql-scraping/introspection.md): Understand what introspection is, and how it can help you understand a GraphQL API to take advantage of the features it has to offer before writing any code.
- [Modifying variables](https://docs.apify.com/academy/api-scraping/graphql-scraping/modifying-variables.md): Learn how to modify the variables of a JSON-format GraphQL query to use the API without needing to write any GraphQL language or create custom queries.
- [How to retry failed requests](https://docs.apify.com/academy/api/retry-failed-requests.md): Learn how to resurrect your run by retrying only failed requests.
- [Run Actor and retrieve data via API](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md): Learn how to run an Actor/task via the Apify API, wait for the job to finish, and retrieve its output data. Your key to integrating Actors with your projects.
- [Introduction to the Apify platform](https://docs.apify.com/academy/apify-platform.md): Learn all about the Apify platform, all of the tools it offers, and how it can improve your overall development experience.
- [Scraping with Apify](https://docs.apify.com/academy/apify-scrapers.md): Discover Apify's ready-made web scraping and automation tools. Compare Web Scraper, Cheerio Scraper and Puppeteer Scraper to decide which is right for you.
- [Scraping with Cheerio Scraper](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md): Learn how to scrape a website using Apify's Cheerio Scraper. Build an Actor's page function, extract information from a web page and download your data.
- [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md): Step-by-step tutorial that will help you get started with all Apify Scrapers. Learn the foundations of scraping the web with Apify and creating your own Actors.
- [Scraping with Puppeteer Scraper](https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md): Learn how to scrape a website using Apify's Puppeteer Scraper. Build an Actor's page function, extract information from a web page and download your data.
- [Scraping with Web Scraper](https://docs.apify.com/academy/apify-scrapers/web-scraper.md): Learn how to scrape a website using Apify's Web Scraper. Build an Actor's page function, extract information from a web page and download your data.
- [Concepts 🤔](https://docs.apify.com/academy/concepts.md): Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.
- [CSS selectors](https://docs.apify.com/academy/concepts/css-selectors.md): Learn about CSS selectors. What they are, their types, why they are important for web scraping and how to use them in the browser Console with JavaScript.
- [Dynamic pages and single-page applications (SPAs)](https://docs.apify.com/academy/concepts/dynamic-pages.md): Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it.
- [HTML elements](https://docs.apify.com/academy/concepts/html-elements.md): Learn about HTML elements. What they are, their types and how to work with them in a browser environment using JavaScript.
- [HTTP cookies](https://docs.apify.com/academy/concepts/http-cookies.md): Learn a bit about what cookies are, and how they are utilized in scrapers to appear logged-in, view specific data, or even avoid blocking.
- [HTTP headers](https://docs.apify.com/academy/concepts/http-headers.md): Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers.
- [Querying elements](https://docs.apify.com/academy/concepts/querying-css-selectors.md): Learn how to query DOM elements using CSS selectors with the document.querySelector() and document.querySelectorAll() functions.
- [What is robotic process automation (RPA)?](https://docs.apify.com/academy/concepts/robotic-process-automation.md): Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks.
- [Deploying your code to Apify](https://docs.apify.com/academy/deploying-your-code.md): In this course, learn how to take an existing project of yours and deploy it to the Apify platform as an Actor.
- [Creating dataset schema](https://docs.apify.com/academy/deploying-your-code/dataset-schema.md): Learn how to generate an appealing Overview table interface to preview your Actor results in real time on the Apify platform.
- [Publishing your Actor](https://docs.apify.com/academy/deploying-your-code/deploying.md): Push local code to the platform, or create a new Actor in the Console and integrate it with a Git repository so that new changes can optionally be rebuilt automatically.
- [Creating Actor Dockerfile](https://docs.apify.com/academy/deploying-your-code/docker-file.md): Understand how to write a Dockerfile (Docker image blueprint) for your project so that it can be run within a Docker container on the Apify platform.
- [How to write Actor input schema](https://docs.apify.com/academy/deploying-your-code/input-schema.md): Learn how to generate a user interface on the platform for your Actor's input with a single file - the INPUT_SCHEMA.json file.
- [Managing Actor inputs and outputs](https://docs.apify.com/academy/deploying-your-code/inputs-outputs.md): Learn to accept input into your Actor, do something with it, and then return output. Actors can be written in any language, so this concept is language agnostic.
- [Expert scraping with Apify](https://docs.apify.com/academy/expert-scraping-with-apify.md): After learning the basics of Actors and Apify, learn to develop pro-level scrapers on the Apify platform with this advanced course.
- [Webhooks & advanced Actor overview](https://docs.apify.com/academy/expert-scraping-with-apify/actors-webhooks.md): Learn more advanced details about Actors, how they work, and the default configurations they can take. Also, learn how to integrate your Actor with webhooks.
- [Apify API & client](https://docs.apify.com/academy/expert-scraping-with-apify/apify-api-and-client.md): Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.
- [Bypassing anti-scraping methods](https://docs.apify.com/academy/expert-scraping-with-apify/bypassing-anti-scraping.md): Learn about bypassing anti-scraping methods using proxies and proxy/session rotation together with Crawlee and the Apify SDK.
- [Managing source code](https://docs.apify.com/academy/expert-scraping-with-apify/managing-source-code.md): Learn how to manage your Actor's source code more efficiently by integrating it with a GitHub repository. This is standard on the Apify platform.
- [Migrations & maintaining state](https://docs.apify.com/academy/expert-scraping-with-apify/migrations-maintaining-state.md): Learn about what Actor migrations are and how to handle them properly so that state is not lost and runs can safely be resurrected.
- [Saving useful run statistics](https://docs.apify.com/academy/expert-scraping-with-apify/saving-useful-stats.md): Understand how to save statistics about an Actor's run, what types of statistics you can save, and why you might want to save them for a large-scale scraper.
- [Solutions](https://docs.apify.com/academy/expert-scraping-with-apify/solutions.md): View all of the solutions for all of the activities and tasks of this course. Please try to complete each task on your own before reading the solution!
- [Handling migrations](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/handling-migrations.md): Get real-world experience of maintaining a stateful object stored in memory, which will be persisted through migrations and even graceful aborts.
- [Integrating webhooks](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/integrating-webhooks.md): Learn how to integrate webhooks into your Actors. Webhooks are a super powerful tool, and can be used to do almost anything!
- [Managing source](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/managing-source.md): View in-depth answers for all three of the quiz questions that were provided in the corresponding lesson about managing source code.
- [Rotating proxies/sessions](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/rotating-proxies.md): Learn firsthand how to rotate proxies and sessions in order to avoid the majority of the most common anti-scraping protections.
- [Saving run stats](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/saving-stats.md): Implement the saving of general statistics about an Actor's run, as well as adding request-specific statistics to dataset items.
- [Using the Apify API & JavaScript client](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-api-and-client.md): Learn how to interact with the Apify API directly through the well-documented RESTful routes, or by using the proprietary Apify JavaScript client.
- [Using storage & creating tasks](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-storage-creating-tasks.md): Get quiz answers and explanations for the lesson about using storage and creating tasks on the Apify platform.
- [Tasks & storage](https://docs.apify.com/academy/expert-scraping-with-apify/tasks-and-storage.md): Understand how to save the configurations for Actors with Actor tasks. Also, learn about storage and the different types Apify offers.
- [Monetizing your Actor](https://docs.apify.com/academy/get-most-of-actors/monetizing-your-actor.md): Learn how you can monetize your web scraping and automation projects by publishing Actors to users in Apify Store.
- [Getting started](https://docs.apify.com/academy/getting-started.md): Get started with the Apify platform by creating an account and learning about the Apify Console, which is where all Apify Actors are born!
- [Actors](https://docs.apify.com/academy/getting-started/actors.md): What is an Actor? How do we create them? Learn the basics of what Actors are, how they work, and try out an Actor yourself right on the Apify platform!
- [The Apify API](https://docs.apify.com/academy/getting-started/apify-api.md): Learn how to use the Apify API to programmatically call your Actors, retrieve data stored on the platform, view Actor logs, and more!
- [Apify client](https://docs.apify.com/academy/getting-started/apify-client.md): Interact with the Apify API in your code by using the apify-client package, which is available for both JavaScript and Python.
- [Creating Actors](https://docs.apify.com/academy/getting-started/creating-actors.md): Build and run your very first Actor directly in Apify Console from a template. This lesson provides hands-on experience with building and running Actors.
- [Inputs & outputs](https://docs.apify.com/academy/getting-started/inputs-outputs.md): Create an Actor from scratch which takes an input, processes that input, and then outputs a result that can be used elsewhere.
- [Why a glossary?](https://docs.apify.com/academy/glossary.md): Browse important web scraping concepts, tools and topics in succinct articles explaining common web development terms in a web scraping and automation context.
- [Node.js Tutorials 💻📚](https://docs.apify.com/academy/node-js.md): A collection of various Node.js tutorials on scraping sitemaps, optimizing your scrapers, using popular Node.js web scraping libraries, and more.
- [How to add external libraries to Web Scraper](https://docs.apify.com/academy/node-js/add-external-libraries-web-scraper.md): Learn how to load external JavaScript libraries in Apify's Web Scraper Actor.
- [How to analyze and fix errors when scraping a website](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors.md): Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.
- [Apify's free Google SERP API](https://docs.apify.com/academy/node-js/apify-free-google-serp-api.md): How to stay up to date on search results with a Google SERP API.
- [Avoid EACCES error in Actor builds with a custom Dockerfile](https://docs.apify.com/academy/node-js/avoid-eacces-error-in-actor-builds.md): Learn how to work around an issue where Actor builds with a custom Dockerfile fail to copy files due to write access errors.
- [Block requests in Puppeteer](https://docs.apify.com/academy/node-js/block-requests-puppeteer.md): Why and how to block requests in Puppeteer.
- [How to optimize Puppeteer by caching responses](https://docs.apify.com/academy/node-js/caching-responses-in-puppeteer.md): Learn why it is important for performance to cache responses in memory when intercepting requests in Puppeteer and how to implement it in your code.
- [How to choose the right scraper for the job](https://docs.apify.com/academy/node-js/choosing-the-right-scraper.md): Learn basic web scraping concepts to help you analyze a website and choose the best scraper for your particular use case.
- [How to scrape from dynamic pages](https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md): Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?
- [Running code in a browser console](https://docs.apify.com/academy/node-js/debugging-web-scraper.md): Test your Page Function's code directly in your browser's console.
- [Filter out blocked proxies using sessions](https://docs.apify.com/academy/node-js/filter-blocked-requests-using-sessions.md): Handling blocked requests efficiently using sessions.
- [BasicCrawler](https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer.md): Getting around website defense mechanisms when crawling.
- [How to fix 'Target closed' error in Puppeteer and Playwright](https://docs.apify.com/academy/node-js/how_to_fix_target-closed.md): Learn about common causes for the 'Target closed' error in your browser automation workflow and what you can do to fix it.
- [How to save screenshots from Puppeteer](https://docs.apify.com/academy/node-js/how-to-save-screenshots-puppeteer.md): Code example for how to save screenshots from Puppeteer to the Apify key-value store.
- [How to scrape hidden JavaScript objects in HTML](https://docs.apify.com/academy/node-js/js-in-html.md): Learn about "hidden" data found within the JavaScript of certain pages, which can increase scraper reliability and improve your development experience.
- [Scrape website in parallel with multiple Actor runs](https://docs.apify.com/academy/node-js/multiple-runs-scrape.md): Learn how to run multiple instances of an Actor to scrape a website faster. This tutorial will guide you through the process of setting up your scraper.
- [How to optimize and speed up your web scraper](https://docs.apify.com/academy/node-js/optimizing-scrapers.md): We all want our scrapers to run as cost-effectively as possible. Learn how to think about performance in the context of web scraping and automation.
- [Enqueuing start pages for all keywords](https://docs.apify.com/academy/node-js/processing-multiple-pages-web-scraper.md): Solving a common problem where the scraper automatically deduplicates the same URLs.
- [Request labels and how to pass data to other requests](https://docs.apify.com/academy/node-js/request-labels-in-apify-actors.md): How to handle request labels in Apify Actors with Cheerio or Puppeteer Crawler.
- [How to scrape from sitemaps](https://docs.apify.com/academy/node-js/scraping-from-sitemaps.md): The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn an easier way to extract data from websites using Crawlee.
- [How to scrape sites with a shadow DOM](https://docs.apify.com/academy/node-js/scraping-shadow-doms.md): The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround.
- [Scraping a list of URLs from a Google Sheets document](https://docs.apify.com/academy/node-js/scraping-urls-list-from-google-sheets.md): Learn how to crawl a list of URLs specified in a Google Sheets document using one of the Apify web scraping Actors.
- [Downloading the file to memory](https://docs.apify.com/academy/node-js/submitting-form-with-file-attachment.md): How to submit a form with an attachment using request-promise.
- [Submitting forms on .ASPX pages](https://docs.apify.com/academy/node-js/submitting-forms-on-aspx-pages.md): How to handle pages created with ASP.NET in Web Scraper.
- [Using man-in-the-middle proxy to intercept requests in Puppeteer](https://docs.apify.com/academy/node-js/using-proxy-to-intercept-requests-puppeteer.md): This article demonstrates how to set up a reliable interception of HTTP requests in headless Chrome / Puppeteer using a local proxy.
- [Waiting for dynamic content](https://docs.apify.com/academy/node-js/waiting-for-dynamic-content.md): You load the page. You execute the correct selectors. Everything should work. It doesn't? Learn how to wait for dynamic loading.
- [When to use Puppeteer Scraper](https://docs.apify.com/academy/node-js/when-to-use-puppeteer-scraper.md): Choosing between Web Scraper and Puppeteer Scraper can be difficult. We explain the important differences to help you pick the right tool.
- [PHP Tutorials 👨‍💻📚](https://docs.apify.com/academy/php.md): A collection of PHP tutorials to aid you in your journey to becoming a master web scraping and automation developer.
- [How to use Apify from PHP](https://docs.apify.com/academy/php/use-apify-from-php.md): Learn how to access Apify's REST API endpoints from your PHP projects using the Guzzle package. Follow a tutorial to run an Actor and download its data.
- [Puppeteer & Playwright course](https://docs.apify.com/academy/puppeteer-playwright.md): Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.
- [Browser](https://docs.apify.com/academy/puppeteer-playwright/browser.md): Understand what the Browser object is in Puppeteer/Playwright, how to create one, and a bit about how to interact with one.
- [Creating multiple browser contexts](https://docs.apify.com/academy/puppeteer-playwright/browser-contexts.md): Learn what a browser context is, how to create one, how to emulate devices, and how to use browser contexts to automate multiple sessions at one time.
- [Common use cases](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases.md): Learn about some of the most common use cases of Playwright and Puppeteer, and how to handle these use cases when you run into them.
- [Downloading files](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/downloading-files.md): Learn how to automatically download and save files to the disk using two of the most popular web automation libraries, Puppeteer and Playwright.
- [Logging into a website](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/logging-into-a-website.md): Understand the "login flow" - logging into a website, then maintaining a logged-in status within different browser contexts for an efficient automation process.
- [Paginating through results](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/paginating-through-results.md): Learn how to paginate through results on websites that use either page number-based pagination or dynamic lazy-loading pagination.
- [Scraping iFrames](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/scraping-iframes.md): Extracting data from iFrames can be frustrating. In this tutorial, we will learn how to scrape information from iFrames using Puppeteer or Playwright.
- [Submitting a form with a file attachment](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment.md): Understand how to download a file, attach it to a form using a headless browser in Playwright or Puppeteer, then submit the form.
- [Executing scripts](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts.md): Understand the two different contexts which your code can be run in, and how to run custom scripts in the context of the browser.
- [Extracting data](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/collecting-data.md): Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio.
- [Injecting code](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code.md): Learn how to inject scripts prior to a page's load (pre-injecting), as well as how to expose functions to be run at a later time on the page.
- [Opening a page](https://docs.apify.com/academy/puppeteer-playwright/page.md): Learn how to create and open a Page with a Browser, and how to use it to visit and programmatically interact with a website.
- [Interacting with a page](https://docs.apify.com/academy/puppeteer-playwright/page/interacting-with-a-page.md): Learn how to programmatically do actions on a page such as clicking, typing, and pressing keys. Also, discover a common roadblock that comes up when automating.
- [Page methods](https://docs.apify.com/academy/puppeteer-playwright/page/page-methods.md): Understand that the Page object has many different methods to offer, and learn how to use two of them to capture a page's title and take a screenshot.
- [Waiting for elements and events](https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md): Learn the importance of waiting for content and events before running interaction or extraction code, as well as the best practices for doing so.
- [Using proxies](https://docs.apify.com/academy/puppeteer-playwright/proxies.md): Understand how to use proxies in your Puppeteer and Playwright requests, as well as a couple of the most common use cases for proxies.
- [Reading & intercepting requests](https://docs.apify.com/academy/puppeteer-playwright/reading-intercepting-requests.md): You can use DevTools, but did you know that you can do all the same stuff (plus more) programmatically? Read and intercept requests in Puppeteer/Playwright.
- [Python Tutorials 🐍📚](https://docs.apify.com/academy/python.md): A collection of various Python tutorials to aid you in your journey to becoming a master web scraping and automation developer.
- [How to process data in Python using Pandas](https://docs.apify.com/academy/python/process-data-using-python.md): Learn how to process the resulting data of a web scraper in Python using the Pandas library, and how to visualize the processed data using Matplotlib.
- [How to scrape data in Python using Beautiful Soup](https://docs.apify.com/academy/python/scrape-data-python.md): Learn how to create a Python Actor and use Python libraries to scrape, process and visualize data extracted from the web.
- [Running a web server on the Apify platform](https://docs.apify.com/academy/running-a-web-server.md): A web server running in an Actor can act as a communication channel with the outside world. Learn how to set one up with Node.js.
- [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/scraping-basics-javascript2.md): Learn how to use JavaScript to extract information from websites in this practical course, starting from the absolute basics.
- [Crawling websites with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/crawling.md): Lesson about building a Node.js application for watching prices. Using the Fetch API to follow links to individual product pages.
- [Extracting data from a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-javascript2/devtools-extracting-data.md): Lesson about using the browser tools for developers to manually extract product data from an e-commerce website.
- [Inspecting web pages with browser DevTools](https://docs.apify.com/academy/scraping-basics-javascript2/devtools-inspecting.md): Lesson about using the browser tools for developers to inspect and manipulate the structure of a website.
- [Locating HTML elements on a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-javascript2/devtools-locating-elements.md): Lesson about using the browser tools for developers to manually find products on an e-commerce website.
- [Downloading HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/downloading-html.md): Lesson about building a Node.js application for watching prices. Using the Fetch API to download HTML code of a product listing page. - [Extracting data from HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/extracting-data.md): Lesson about building a Node.js application for watching prices. Using string manipulation to extract and clean data scraped from the product listing page. - [Using a scraping framework with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/framework.md): Lesson about building a Node.js application for watching prices. Using the Crawlee framework to simplify creating a scraper. - [Getting links from HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/getting-links.md): Lesson about building a Node.js application for watching prices. Using the Cheerio library to locate links to individual product pages. - [Locating HTML elements with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/locating-elements.md): Lesson about building a Node.js application for watching prices. Using the Cheerio library to locate products on the product listing page. - [Parsing HTML with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/parsing-html.md): Lesson about building a Node.js application for watching prices. Using the Cheerio library to parse HTML code of a product listing page. - [Using a scraping platform with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/platform.md): Lesson about building a Node.js application for watching prices. Using the Apify platform to deploy a scraper. - [Saving data with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/saving-data.md): Lesson about building a Node.js application for watching prices. Using the json2csv library to save data scraped from product listing pages in both JSON and CSV. - [Scraping product variants with Node.js](https://docs.apify.com/academy/scraping-basics-javascript2/scraping-variants.md): Lesson about building a Node.js application for watching prices. Using browser DevTools to figure out how to extract product variants and exporting them as separate items. - [Web scraping basics for Python devs](https://docs.apify.com/academy/scraping-basics-python.md): Learn how to use Python to extract information from websites in this practical course, starting from the absolute basics. - [Crawling websites with Python](https://docs.apify.com/academy/scraping-basics-python/crawling.md): Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages. - [Extracting data from a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-python/devtools-extracting-data.md): Lesson about using the browser tools for developers to manually extract product data from an e-commerce website. - [Inspecting web pages with browser DevTools](https://docs.apify.com/academy/scraping-basics-python/devtools-inspecting.md): Lesson about using the browser tools for developers to inspect and manipulate the structure of a website. - [Locating HTML elements on a web page with browser DevTools](https://docs.apify.com/academy/scraping-basics-python/devtools-locating-elements.md): Lesson about using the browser tools for developers to manually find products on an e-commerce website. 
- [Downloading HTML with Python](https://docs.apify.com/academy/scraping-basics-python/downloading-html.md): Lesson about building a Python application for watching prices. Using the HTTPX library to download HTML code of a product listing page. - [Extracting data from HTML with Python](https://docs.apify.com/academy/scraping-basics-python/extracting-data.md): Lesson about building a Python application for watching prices. Using string manipulation to extract and clean data scraped from the product listing page. - [Using a scraping framework with Python](https://docs.apify.com/academy/scraping-basics-python/framework.md): Lesson about building a Python application for watching prices. Using the Crawlee framework to simplify creating a scraper. - [Getting links from HTML with Python](https://docs.apify.com/academy/scraping-basics-python/getting-links.md): Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate links to individual product pages. - [Locating HTML elements with Python](https://docs.apify.com/academy/scraping-basics-python/locating-elements.md): Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page. - [Parsing HTML with Python](https://docs.apify.com/academy/scraping-basics-python/parsing-html.md): Lesson about building a Python application for watching prices. Using the Beautiful Soup library to parse HTML code of a product listing page. - [Using a scraping platform with Python](https://docs.apify.com/academy/scraping-basics-python/platform.md): Lesson about building a Python application for watching prices. Using the Apify platform to deploy a scraper. - [Saving data with Python](https://docs.apify.com/academy/scraping-basics-python/saving-data.md): Lesson about building a Python application for watching prices. Using standard library to save data scraped from product listing pages in popular formats such as CSV or JSON. - [Scraping product variants with Python](https://docs.apify.com/academy/scraping-basics-python/scraping-variants.md): Lesson about building a Python application for watching prices. Using browser DevTools to figure out how to extract product variants and exporting them as separate items. - [Tools 🔧](https://docs.apify.com/academy/tools.md): Discover a variety of tools that can be used to enhance the scraper development process, or even unlock doors to new scraping possibilities. - [The Apify CLI](https://docs.apify.com/academy/tools/apify-cli.md): Learn about, install, and log into the Apify CLI - your best friend for interacting with the Apify platform via your terminal. - [What's EditThisCookie?](https://docs.apify.com/academy/tools/edit-this-cookie.md): Learn how to add, delete, and modify different cookies in your browser for testing purposes using the EditThisCookie Chrome extension. - [What is Insomnia](https://docs.apify.com/academy/tools/insomnia.md): Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers. - [What is ModHeader?](https://docs.apify.com/academy/tools/modheader.md): Discover a super useful Chrome extension called ModHeader, which allows you to modify your browser's HTTP request headers. - [What is Postman?](https://docs.apify.com/academy/tools/postman.md): Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers. 
- [What's Proxyman?](https://docs.apify.com/academy/tools/proxyman.md): Learn about Proxyman, a tool for viewing all network requests that are coming through your system. Filter by response type, by a keyword, or by application. - [Quick JavaScript Switcher](https://docs.apify.com/academy/tools/quick-javascript-switcher.md): Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs. - [What is SwitchyOmega?](https://docs.apify.com/academy/tools/switchyomega.md): Discover SwitchyOmega, a Chrome extension to manage and switch between proxies, which is extremely useful when testing proxies for a scraper. - [User-Agent Switcher](https://docs.apify.com/academy/tools/user-agent-switcher.md): Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes. - [Tutorials 📚](https://docs.apify.com/academy/tutorials.md): Learn about specific topics related to web scraping and web automation with the Apify Academy tutorial lessons! - [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md): Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place. - [Best practices when writing scrapers](https://docs.apify.com/academy/web-scraping-for-beginners/best-practices.md): Understand the standards and best practices that we here at Apify abide by to write readable, scalable, and maintainable code. - [Challenge](https://docs.apify.com/academy/web-scraping-for-beginners/challenge.md): Test your knowledge acquired in the previous sections of this course by building an Amazon scraper using Crawlee's CheerioCrawler! - [Initialization & setting up](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/initializing-and-setting-up.md): Kick off the challenge by initializing a new Crawlee project and setting up the structure of your Amazon scraper. - [Modularity](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/modularity.md): Before you build your first web scraper with Crawlee, it is important to understand the concept of modularity in programming. - [Scraping Amazon](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/scraping-amazon.md): Put your knowledge into practice by scraping product data from Amazon with Crawlee's CheerioCrawler. - [Basics of crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling.md): Learn how to crawl the web with your scraper. How to extract links and URLs from web pages and how to manage the collected links to visit new pages. - [Exporting data](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/exporting-data.md): Learn how to export the data you scraped using Crawlee to CSV or JSON. - [Filtering links](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/filtering-links.md): When you extract links from a web page, you often end up with a lot of irrelevant URLs. Learn how to filter the links to only keep the ones you need. - [Finding links](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/finding-links.md): Learn what a link looks like in HTML and how to find and extract their URLs when web scraping. Using both DevTools and Node.js. 
- [Your first crawl](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/first-crawl.md): Learn how to crawl the web using Node.js, Cheerio and an HTTP client. Extract URLs from pages and use them to visit more websites. - [Headless browsers](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/headless-browser.md): Learn how to scrape the web with a headless browser using only a few lines of code. Chrome, Firefox, Safari, Edge - all are supported. - [Professional scraping 👷](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md): Learn how to build scrapers quicker and get better and more robust results by using Crawlee, an open-source library for scraping in Node.js. - [Recap of data extraction basics](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/recap-extraction-basics.md): Review our e-commerce website scraper and refresh our memory about its code and the programming techniques we used to extract and save the data. - [Relative URLs](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/relative-urls.md): Learn about absolute and relative URLs used on web pages and how to work with them when parsing HTML with Cheerio in your scraper. - [Scraping data](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/scraping-the-data.md): Learn how to add data extraction logic to your crawler, which will allow you to extract data from all the websites you crawled. - [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md): Learn about HTML, CSS, and JavaScript, the basic building blocks of a website, and how to use them in web scraping and data extraction. - [Starting with browser DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/browser-devtools.md): Learn about browser DevTools, a valuable tool in the world of web scraping, and how you can use them to extract data from a website. - [Prepare your computer for programming](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/computer-preparation.md): Set up your computer to be able to code scrapers with Node.js and JavaScript. Download Node.js and npm and run a Hello World script. - [Extracting data with DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/devtools-continued.md): Continue learning how to extract data from a website using browser DevTools, CSS selectors, and JavaScript via the DevTools console. - [Extracting data with Node.js](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/node-continued.md): Continue learning how to create a web scraper with Node.js and Cheerio. Learn how to parse HTML and print the results of the data your scraper has collected. - [Scraping with Node.js](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/node-js-scraper.md): Learn how to use JavaScript and Node.js to create a web scraper, plus take advantage of the Cheerio and Got-scraping libraries to make your job easier. - [Setting up your project](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/project-setup.md): Create a new project with npm and Node.js. Install necessary libraries, and test that everything works before starting the next lesson. 
- [Saving results to CSV](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/save-to-csv.md): Learn how to save your scraper's collected data to a CSV file that can be opened in Excel, Google Sheets, or any other spreadsheet program. - [Finding elements with DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools.md): Learn how to use browser DevTools, CSS selectors, and JavaScript via the DevTools console to extract data from a website. - [Introduction](https://docs.apify.com/academy/web-scraping-for-beginners/introduction.md): Start learning about web scraping, web crawling, data extraction, and popular tools to start developing your own scraper. ## Legal documents - [Apify Legal](https://docs.apify.com/legal.md): This is an index of Apify's public-facing policies, terms of use, and legal documents. - [Apify Acceptable Use Policy](https://docs.apify.com/legal/acceptable-use-policy.md): Apify's acceptable use policy describes activities that are prohibited on the Apify platform and on our websites. - [Apify Affiliate Program Terms and Conditions](https://docs.apify.com/legal/affiliate-program-terms-and-conditions.md): Apify Affiliate Program Terms and Conditions govern Apify's affiliate partnership program. - [Apify Candidate Referral Program](https://docs.apify.com/legal/candidate-referral-program-terms.md): Apify Candidate Referral Program is a public promise of remuneration for referred candidates. - [Apify Cookie Policy](https://docs.apify.com/legal/cookie-policy.md): Apify Cookie Policy describes how we handle cookies on our website and platform. - [Apify Data Processing Addendum](https://docs.apify.com/legal/data-processing-addendum.md): Apify Data Processing Addendum serves as a framework for processing of personal data on behalf of Apify customers. - [Apify Open Source Fair Share Program Terms and Conditions](https://docs.apify.com/legal/fair-share-program-terms-and-conditions.md): Apify Open Source Fair Share Program Terms and Conditions govern Apify's Open Source Fair Share Program. - [Apify GDPR Information](https://docs.apify.com/legal/gdpr-information.md): This Apify GDPR Information document describes how Apify complies with GDPR and its requirements. - [Apify General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md): Apify General Terms and Conditions govern the use of Apify's website, platform and services. - [Apify General Terms and Conditions October 2022](https://docs.apify.com/legal/old/general-terms-and-conditions-october-2022.md): Outdated general terms and conditions that governed the use of the Apify website, platform and services until May 2024. - [Apify Store Publishing Terms and Conditions December 2022](https://docs.apify.com/legal/old/store-publishing-terms-and-conditions-december-2022.md): Outdated Apify Store publishing terms and conditions that governed publishing of Actors in the Apify Store until May 2024. - [Apify Privacy Policy](https://docs.apify.com/legal/privacy-policy.md): Apify Privacy Policy describes how we handle your personal data and how you can exercise your personal data rights. - [Apify Store Publishing Terms and Conditions](https://docs.apify.com/legal/store-publishing-terms-and-conditions.md): Apify Store Publishing Terms and Conditions govern publishing of Actors in the Apify Store and payments for monetized Actors. 
- [Apify Whistleblowing Policy](https://docs.apify.com/legal/whistleblowing-policy.md): Apify's whistleblowing policy describes how illegal activities can be reported, as required by law. ## Platform documentation - [Apify platform](https://docs.apify.com/platform.md): Apify is your one-stop shop for web scraping, data extraction, and RPA. Automate anything you can do manually in a browser. - [Actors](https://docs.apify.com/platform/actors.md): Learn how to develop, run and share serverless cloud programs. Create your own web scraping and automation tools and publish them on the Apify platform. - [Actor development](https://docs.apify.com/platform/actors/development.md): Read about the technical part of building Apify Actors. Learn to define Actor inputs, build new versions, persist Actor state, and choose base Docker images. - [Actor definition](https://docs.apify.com/platform/actors/development/actor-definition.md): Learn how to turn your arbitrary code into an Actor simply by adding an Actor definition directory. - [actor.json](https://docs.apify.com/platform/actors/development/actor-definition/actor-json.md): Learn how to write the main Actor config in the `.actor/actor.json` file. - [Dataset Schema Specification](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md): Learn how to define and present your dataset schema in a user-friendly output UI. - [Dataset validation](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation.md): Specify the dataset schema within the Actors so you can add monitoring and validation at the field level. - [Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md): Learn about the available Docker images you can use as a base for your Apify Actors. Choose the right base image based on your Actor's requirements and the programming language you're using. - [Actor input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md): Learn how to define and validate a schema for your Actor's input with code examples. Provide an autogenerated input UI for your Actor's users. - [Secret input](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/secret-input.md): Learn about making some Actor input fields secret and encrypted. Ideal for passing passwords, API tokens, or login cookies to Actors. - [Actor input schema specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1.md): Learn how to define and validate a schema for your Actor's input with code examples. Provide an autogenerated input UI for your Actor's users. - [Key-value Store Schema Specification](https://docs.apify.com/platform/actors/development/actor-definition/key-value-store-schema.md): Learn how to define and present your key-value store schema to organize records into collections. - [Source code](https://docs.apify.com/platform/actors/development/actor-definition/source-code.md): Learn about the Actor's source code placement and its structure. - [Automated tests](https://docs.apify.com/platform/actors/development/automated-tests.md): Learn how to automate ongoing testing and make sure your Actors perform over time. See code examples for configuring the Actor Testing Actor. - [Builds and runs](https://docs.apify.com/platform/actors/development/builds-and-runs.md): Learn about Actor builds and runs, their lifecycle, versioning, and other properties. 
- [Builds](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md): Learn about Actor build numbers, versioning, and how to use a specific Actor version in runs. Understand an Actor's lifecycle and manage its cache. - [Runs](https://docs.apify.com/platform/actors/development/builds-and-runs/runs.md): Learn about Actor runs, how to start them, and how to manage them. - [State persistence](https://docs.apify.com/platform/actors/development/builds-and-runs/state-persistence.md): Learn how to maintain an Actor's state to prevent data loss during unexpected restarts. Includes code examples for handling server migrations. - [Deployment](https://docs.apify.com/platform/actors/development/deployment.md): Learn how to deploy your Actors to the Apify platform and build them. - [Continuous integration for Actors](https://docs.apify.com/platform/actors/development/deployment/continuous-integration.md): Learn how to set up automated builds, deploys, and testing for your Actors. - [Source types](https://docs.apify.com/platform/actors/development/deployment/source-types.md): Learn about Apify Actor source types and how to deploy an Actor from GitHub using CLI or Gist. - [Performance](https://docs.apify.com/platform/actors/development/performance.md): Learn how to get the maximum value out of your Actors, minimize costs, and maximize results. - [Programming interface](https://docs.apify.com/platform/actors/development/programming-interface.md): Learn about the programming interface of Apify Actors, important commands and features provided by the Apify SDK, and how to use them in your Actors. - [Basic commands](https://docs.apify.com/platform/actors/development/programming-interface/basic-commands.md): Learn how to use basic commands of the Apify SDK for both JavaScript and Python. - [Container web server](https://docs.apify.com/platform/actors/development/programming-interface/container-web-server.md): Learn how to run a web server inside your Actor, which enables you to communicate with the outside world via both UI and API. - [Actor environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md): Learn how to provide your Actor with context that determines its behavior through a plethora of pre-defined environment variables offered by the Apify SDK. - [Metamorph](https://docs.apify.com/platform/actors/development/programming-interface/metamorph.md): The metamorph operation transforms an Actor run into the run of another Actor with a new input. - [Standby mode](https://docs.apify.com/platform/actors/development/programming-interface/standby.md): Use the Actor as a real-time API server. - [Status messages](https://docs.apify.com/platform/actors/development/programming-interface/status-messages.md): Learn how to use custom status messages to inform users about the progress of an Actor. - [System events in Apify Actors](https://docs.apify.com/platform/actors/development/programming-interface/system-events.md): Learn about system events sent to your Actor and how to benefit from them. - [Quick start](https://docs.apify.com/platform/actors/development/quick-start.md): Create your first Actor using the Apify Web IDE or locally in your IDE. 
- [Build with AI](https://docs.apify.com/platform/actors/development/quick-start/build-with-ai.md): Learn how to set up your environment, choose the right tools, and establish workflows for effective vibe coding. - [Local development](https://docs.apify.com/platform/actors/development/quick-start/locally.md): Create your first Actor locally on your machine, deploy it to the Apify platform, and run it in the cloud. - [Web IDE](https://docs.apify.com/platform/actors/development/quick-start/web-ide.md): Create your first Actor using the web IDE in Apify Console. - [Publishing and monetization](https://docs.apify.com/platform/actors/publishing.md): Learn about publishing and monetizing your Actors on the Apify platform. - [Monetize your Actor](https://docs.apify.com/platform/actors/publishing/monetize.md): Learn how you can monetize your web scraping and automation projects by publishing Actors to users in Apify Store. - [Publish your Actor](https://docs.apify.com/platform/actors/publishing/publish.md): Prepare your Actor for Apify Store with a description and README file, and learn how to make your Actor available to the public. - [Actor status badge](https://docs.apify.com/platform/actors/publishing/status-badge.md): The Actor status badge can be embedded in the README or documentation to show users the current status and usage of your Actor on the Apify platform. - [Automated testing](https://docs.apify.com/platform/actors/publishing/test.md): Apify has a QA system that regularly runs automated tests to ensure that all Actors in the store are functional. - [Running Actors](https://docs.apify.com/platform/actors/running.md): Start an Actor from Apify Console or via API. Learn about Actor lifecycles, how to specify settings and version, provide input, and resurrect finished runs. - [Actors in Store](https://docs.apify.com/platform/actors/running/actors-in-store.md): Apify Store is home to thousands of public Actors available to the Apify community. It's the easiest way for you to start with Apify. - [Input and output](https://docs.apify.com/platform/actors/running/input-and-output.md): Configure your Actor's input parameters using Apify Console, locally or via API. Access parameters in key-value stores from your Actor's code. - [Runs and builds](https://docs.apify.com/platform/actors/running/runs-and-builds.md): Learn about Actor builds and runs, their lifecycle, sharing, and data retention policy. - [Standby mode](https://docs.apify.com/platform/actors/running/standby.md): Use an Actor as a real-time API server. - [Actor tasks](https://docs.apify.com/platform/actors/running/tasks.md): Create and save reusable configurations of Apify Actors tailored to specific use cases. - [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources.md): Learn about your Actors' memory and processing power requirements, their relationship with Docker resources, minimum requirements for different use cases, and their impact on cost. - [Collaboration](https://docs.apify.com/platform/collaboration.md): Learn how to collaborate with other users and manage permissions for organizations or private resources such as Actors, Actor runs, and storages. - [Access rights](https://docs.apify.com/platform/collaboration/access-rights.md): Manage permissions for your private resources such as Actors, Actor runs, and storages. Allow other users to read, run, modify, or build new versions. 
- [List of permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md): Learn about the access rights you can grant to other users. See a list of all access options for Apify resources such as Actors, Actor runs/tasks and storage. - [Organization account](https://docs.apify.com/platform/collaboration/organization-account.md): Create a specialized account for your organization to encourage collaboration and manage permissions. Convert an existing account, or create one from scratch. - [Using the organization account](https://docs.apify.com/platform/collaboration/organization-account/how-to-use.md): Learn to use and manage your organization account using the Apify Console or API. View the organizations you are in and manage your memberships. - [Setup](https://docs.apify.com/platform/collaboration/organization-account/setup.md): Configure your organization account by inviting new members and assigning their roles. Manage team members' access permissions to the organization's resources. - [Apify Console](https://docs.apify.com/platform/console.md): Learn about Apify Console's easy account creation and user-friendly homepage for efficient web scraping management. - [Billing](https://docs.apify.com/platform/console/billing.md): The Billing page is the central place for all information about your invoices, current usage, historical usage, subscriptions, and limits. - [Account settings](https://docs.apify.com/platform/console/settings.md): Learn how to manage your Apify account, configure integrations, create and manage organizations, and set notification preferences in the Settings tab. - [Apify Store](https://docs.apify.com/platform/console/store.md): Explore Apify Store, browse and select Actors, search by criteria, sort by relevance, and adjust settings for immediate or future runs. - [Two-factor authentication setup](https://docs.apify.com/platform/console/two-factor-authentication.md): Learn about Apify Console's two-factor authentication process and how to set it up. - [Integrations](https://docs.apify.com/platform/integrations.md): Learn how to integrate the Apify platform with other services, your systems, data pipelines, and other web automation workflows. - [What are Actor integrations?](https://docs.apify.com/platform/integrations/actors.md): Learn how to integrate with other Actors and tasks. - [Integrating Actors via API](https://docs.apify.com/platform/integrations/actors/integrating-actors-via-api.md): Learn how to integrate with other Actors and tasks using the Apify API. - [Integration-ready Actors](https://docs.apify.com/platform/integrations/actors/integration-ready-actors.md): Learn how to create Actors that are ready to be integrated with other Actors and tasks. - [Agno Integration](https://docs.apify.com/platform/integrations/agno.md): Integrate Apify with Agno to power AI agents with web scraping, automation, and data insights. - [Airbyte integration](https://docs.apify.com/platform/integrations/airbyte.md): Learn how to integrate your Apify datasets with Airbyte. - [Airtable integration](https://docs.apify.com/platform/integrations/airtable.md): Learn how to integrate Apify with Airtable. - [API integration](https://docs.apify.com/platform/integrations/api.md): Learn how to integrate with Apify via API. 
- [Amazon Bedrock integrations](https://docs.apify.com/platform/integrations/aws_bedrock.md): Learn how to integrate Apify with Amazon Bedrock Agents to provide web data for AI agents. - [🤖🚀 CrewAI integration](https://docs.apify.com/platform/integrations/crewai.md): Learn how to build AI Agents with Apify and CrewAI 🤖🚀. - [Google Drive integration](https://docs.apify.com/platform/integrations/drive.md): Learn how to integrate Apify with Google Drive. - [Flowise integration](https://docs.apify.com/platform/integrations/flowise.md): Learn how to integrate Apify with Flowise. - [GitHub integration](https://docs.apify.com/platform/integrations/github.md): Learn how to integrate your Apify Actors with GitHub. This article shows you how to automatically create an issue in your repo when an Actor run fails. - [Gmail integration](https://docs.apify.com/platform/integrations/gmail.md): Learn how to integrate Apify with Gmail. - [Haystack integration](https://docs.apify.com/platform/integrations/haystack.md): Learn how to integrate Apify with Haystack to work with web data in the Haystack ecosystem. - [IFTTT integration](https://docs.apify.com/platform/integrations/ifttt.md): Connect Apify Actors with IFTTT to automate workflows using Actor run events, data queries, and task actions. - [Integrate with Apify](https://docs.apify.com/platform/integrations/integrate.md): Learn how to integrate your service with Apify to benefit from a mutual integration. - [Keboola integration](https://docs.apify.com/platform/integrations/keboola.md): Learn how to integrate your Apify datasets with Keboola. - [🦜🔗 LangChain integration](https://docs.apify.com/platform/integrations/langchain.md): Learn how to integrate Apify with 🦜🔗 LangChain, in order to feed vector databases and LLMs with data crawled from the web. - [Langflow integration](https://docs.apify.com/platform/integrations/langflow.md): Learn how to integrate Apify with the Langflow low-code tool to build powerful AI agents and workflows that can use any API, model, or database. - [🦜🔘➡️ LangGraph integration](https://docs.apify.com/platform/integrations/langgraph.md): Learn how to build AI Agents with Apify and LangGraph 🦜🔘➡️. - [Lindy integration](https://docs.apify.com/platform/integrations/lindy.md): Learn how to integrate Apify with Lindy. - [LlamaIndex integration](https://docs.apify.com/platform/integrations/llama.md): Learn how to integrate Apify with LlamaIndex in order to feed vector databases and LLMs with data crawled from the web. - [Make integration](https://docs.apify.com/platform/integrations/make.md): Learn how to integrate your Apify Actors with Make. - [Make - AI crawling Actor integration](https://docs.apify.com/platform/integrations/make/ai-crawling.md): Learn about AI Crawling scraper modules. - [Make - Amazon Actor integration](https://docs.apify.com/platform/integrations/make/amazon.md): Learn about Amazon scraper modules. Extract product, search, or category data from Amazon. - [Make - Facebook Actor integration](https://docs.apify.com/platform/integrations/make/facebook.md): Learn about Facebook scraper modules. Extract posts, comments, and profile data from Facebook. - [Make - Instagram Actor integration](https://docs.apify.com/platform/integrations/make/instagram.md): Learn about Instagram scraper modules. Extract posts, comments, and profile data. - [Make - Google Maps Leads Actor integration](https://docs.apify.com/platform/integrations/make/maps.md): Learn about Google Maps scraper modules. 
- [Make - Google Search Actor integration](https://docs.apify.com/platform/integrations/make/search.md): Learn about Google Search scraper modules. - [Make - TikTok Actor integration](https://docs.apify.com/platform/integrations/make/tiktok.md): Learn about TikTok scraper modules. Extract posts, comments, and profile data. - [Make - YouTube Actor integration](https://docs.apify.com/platform/integrations/make/youtube.md): Learn about YouTube scraper modules. Extract channel, video, streams, shorts, and search data from YouTube. - [Mastra MCP integration](https://docs.apify.com/platform/integrations/mastra.md): Learn how to build AI Agents with Mastra via the Apify Actors MCP server. - [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md): Learn how to use the Apify MCP server to integrate Apify Actors into your AI agents or applications. - [Milvus integration](https://docs.apify.com/platform/integrations/milvus.md): Learn how to integrate Apify with Milvus (Zilliz) to save data scraped from websites into the Milvus vector database. - [n8n integration](https://docs.apify.com/platform/integrations/n8n.md): Connect Apify with n8n to automate workflows by running Actors, extracting data, and responding to Actor or task events. - [OpenAI Assistants integration](https://docs.apify.com/platform/integrations/openai-assistants.md): Learn how to integrate Apify with OpenAI Assistants to provide real-time search data and save it into the OpenAI Vector Store. - [Pinecone integration](https://docs.apify.com/platform/integrations/pinecone.md): Learn how to integrate Apify with Pinecone to feed data crawled from the web into the Pinecone vector database. - [Qdrant integration](https://docs.apify.com/platform/integrations/qdrant.md): Learn how to integrate Apify with Qdrant to feed data crawled from the web into the Qdrant vector database. - [Slack integration](https://docs.apify.com/platform/integrations/slack.md): Learn how to integrate your Apify Actors with Slack. This article guides you from installation through to automating your whole workflow in Slack. - [Telegram integration through Zapier](https://docs.apify.com/platform/integrations/telegram.md): Learn how to integrate your Apify Actors with Telegram through Zapier. - [Webhook integration](https://docs.apify.com/platform/integrations/webhooks.md): Learn how to integrate multiple Apify Actors or external systems with your Actor or task run. Send alerts when your Actor run succeeds or fails. - [Webhook actions](https://docs.apify.com/platform/integrations/webhooks/actions.md): Send notifications when specific events occur in your Actor/task run or build. Dynamically add data to the notification payload. - [Ad-hoc webhooks](https://docs.apify.com/platform/integrations/webhooks/ad-hoc-webhooks.md): Set up one-time webhooks for Actor runs initiated through the Apify API or from the Actor's code. Trigger events when the run reaches a specific state. - [Events types for webhooks](https://docs.apify.com/platform/integrations/webhooks/events.md): Specify the types of events that trigger a webhook in an Actor or task run. Trigger an action on Actor or task run creation, success, failure, termination, or timeout. - [Zapier integration](https://docs.apify.com/platform/integrations/zapier.md): Learn how to integrate your Apify Actors with Zapier. - [Limits](https://docs.apify.com/platform/limits.md): Learn the Apify platform's resource capabilities and limitations, such as maximum memory, disk size, and the number of Actors and tasks per user. 
- [Monitoring](https://docs.apify.com/platform/monitoring.md): Learn how to continuously make sure that your Actors and tasks perform as expected and retrieve correct results. Receive alerts when your jobs or their metrics are not as you expect. - [Proxy](https://docs.apify.com/platform/proxy.md): Learn to anonymously access websites in scraping/automation jobs. Improve data outputs and efficiency of bots, and access websites from various geographies. - [Datacenter proxy](https://docs.apify.com/platform/proxy/datacenter-proxy.md): Learn how to reduce blocking when web scraping using IP address rotation. See proxy parameters and learn to implement Apify Proxy in an application. - [Google SERP proxy](https://docs.apify.com/platform/proxy/google-serp-proxy.md): Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany. - [Residential proxy](https://docs.apify.com/platform/proxy/residential-proxy.md): Achieve a higher level of anonymity using IP addresses from human users. Access a wider pool of proxies and reduce blocking by websites' anti-scraping measures. - [Proxy usage](https://docs.apify.com/platform/proxy/usage.md): Learn how to configure and use Apify Proxy. See the required parameters such as the correct username and password. - [Using your own proxies](https://docs.apify.com/platform/proxy/using-your-own-proxies.md): Learn how to use your own proxies while using the Apify platform. - [Schedules](https://docs.apify.com/platform/schedules.md): Learn how to automatically start your Actor and task runs and the basics of cron expressions. Set up and manage your schedules from Apify Console or via API. - [Security](https://docs.apify.com/platform/security.md): Learn more about Apify's security practices and data protection measures that are used to protect your Actors, their data, and the Apify platform in general. - [Storage](https://docs.apify.com/platform/storage.md): Store anything from images and key-value pairs to structured output data. Learn how to access and manage your stored data from the Apify platform or via API. - [Dataset](https://docs.apify.com/platform/storage/dataset.md): Store and export web scraping, crawling or data processing job results. Learn how to access and manage datasets in Apify Console or via API. - [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md): Store anything from Actor or task run results, JSON documents, or images. Learn how to access and manage key-value stores from Apify Console or via API. - [Request queue](https://docs.apify.com/platform/storage/request-queue.md): Queue URLs for an Actor to visit in its run. Learn how to share your queues between Actor runs. Access and manage request queues from Apify Console or via API. - [Storage usage](https://docs.apify.com/platform/storage/usage.md): Learn how to effectively use Apify's storage options. Understand key aspects of data retention, rate limiting, and secure sharing. 
---

# Full Documentation Content

# Apify API

The Apify API provides programmatic access to the [Apify platform](https://docs.apify.com/).

## API reference

The Apify API allows developers to interact programmatically with apps using HTTP requests. The Apify API is built around [REST](https://en.wikipedia.org/wiki/REST). The API has predictable resource-oriented URLs, returns JSON-encoded responses, and uses standard HTTP response codes, authentication, and verbs.

[Check API reference](https://docs.apify.com/api/v2.md)

cURL

```
# Prepare Actor input and run it synchronously
echo '{ "searchStringsArray": ["Apify"] }' | curl -X POST -d @- \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer ' \
  -L 'https://api.apify.com/v2/acts/compass~crawler-google-places/run-sync-get-dataset-items'
```

## API client

The official libraries to interact with the Apify API: a JavaScript client and a Python client.

### JavaScript API client

The official library to interact with the Apify API from a web browser, Node.js, JavaScript, or TypeScript applications. [Star](https://github.com/apify/apify-client-js)

[Get started](https://docs.apify.com/api/client/js/docs)[JavaScript client reference](https://docs.apify.com/api/client/js/reference)

```
npm install apify-client
```

```
// Easily run Actors, await them to finish using the convenient .call() method, and retrieve results from the resulting dataset.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'MY-APIFY-TOKEN',
});

// Starts an actor and waits for it to finish.
const { defaultDatasetId } = await client.actor('john-doe/my-cool-actor').call();

// Fetches results from the actor's dataset.
const { items } = await client.dataset(defaultDatasetId).listItems();
```
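A Python client is available as well. The original page only shows the JavaScript snippet, so the following is an illustrative sketch of the equivalent flow with the official `apify-client` package for Python, reusing the same placeholder token and example Actor ID:

```
# Illustrative sketch only (not from the original page): run an Actor with the
# Python API client and read its default dataset. The token and Actor ID are placeholders.
from apify_client import ApifyClient

client = ApifyClient('MY-APIFY-TOKEN')

# Start the Actor and wait for it to finish.
run = client.actor('john-doe/my-cool-actor').call()

# Fetch results from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```

Install it with `pip install apify-client`; see the [Python client reference](https://docs.apify.com/api/client/python/) for the full interface.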
## Related articles

[Web scraping with client-side Vanilla JavaScript](https://blog.apify.com/web-scraping-with-client-side-vanilla-javascript/)

[Apify ❤️ Python, so we’re releasing a Python API client](https://blog.apify.com/apify-python-api-client/)

[API for dummies](https://blog.apify.com/api-for-dummies/)

---

# Apify open source

Open-source tools and libraries created and maintained by Apify experts to help you with web scraping, browser automation, and proxy management.

## Crawlee

Crawlee is a fully open-source web scraping and browser automation library that helps you build reliable crawlers.

### [JavaScript](https://crawlee.dev/)

### [Python](https://crawlee.dev/python/)

## Other

### [Fingerprint suite](https://github.com/apify/fingerprint-suite)

Generate and inject browser fingerprints to avoid detection and improve scraper stealth.
### [Got scraping](https://github.com/apify/got-scraping)

A powerful extension for sending browser-like requests and blending in with web traffic.

### [Proxy chain](https://github.com/apify/proxy-chain)

A Node.js proxy server with support for SSL, authentication, upstream proxy chaining, custom HTTP responses, and traffic statistics.

## Actor templates

Actor templates help you quickly set up your web scraping projects. Save development time and get immediate access to all the features of the Apify platform. [Browse all templates](https://apify.com/templates)

---

# Apify SDK

The Apify SDK is a toolkit for building Actors—serverless microservices running (not only) on the Apify platform. Apify comes with first-class support for JavaScript/TypeScript and Python, but you can run any containerized code on the Apify platform.

## SDK for JavaScript

Toolkit for building Actors—serverless microservices running (not only) on the Apify platform. [Star](https://github.com/apify/apify-sdk-js)

[Get started](https://docs.apify.com/sdk/js/docs/guides/apify-platform)[JavaScript SDK Reference](https://docs.apify.com/sdk/js/reference)

```
npx apify-cli create my-crawler
```

```
// The Apify SDK makes it easy to initialize the actor on the platform with the Actor.init() method,
// and to save the scraped data from your Actors to a dataset by simply using the Actor.pushData() method.
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`Title of ${request.loadedUrl} is '${title}'`);
        await Actor.pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
await Actor.exit();
```

## SDK for Python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling. [Star](https://github.com/apify/apify-sdk-python)

[Get started](https://docs.apify.com/sdk/python/docs/overview/introduction)[Python SDK Reference](https://docs.apify.com/sdk/python/reference)

```
apify create my-python-actor
```

```
# The Apify SDK makes it easy to read the actor input with the Actor.get_input() method,
# and to save the scraped data from your Actors to a dataset by simply using the Actor.push_data() method.
from apify import Actor
from bs4 import BeautifulSoup
import requests

async def main():
    async with Actor:
        actor_input = await Actor.get_input()
        response = requests.get(actor_input['url'])
        soup = BeautifulSoup(response.content, 'html.parser')
        await Actor.push_data({
            'url': actor_input['url'],
            'title': soup.title.string,
        })
```
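To make the lifecycle management and local storage emulation mentioned above more concrete, here is a small illustrative sketch (not from the original page; the `GREETING` key and log message are made-up examples). When the Actor runs locally, the stored values are kept on disk instead of on the platform:

```
# Illustrative sketch of the Actor lifecycle and storage helpers; names below are examples only.
import asyncio

from apify import Actor

async def main():
    # Entering the context initializes the Actor; leaving it exits cleanly (lifecycle management).
    async with Actor:
        Actor.log.info('Actor started')
        # Key-value store access; emulated with local files when run outside the platform.
        await Actor.set_value('GREETING', {'hello': 'world'})
        stored = await Actor.get_value('GREETING')
        # Append a record to the default dataset.
        await Actor.push_data(stored)

if __name__ == '__main__':
    asyncio.run(main())
```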
---

# Web Scraping Academy

Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer.

## Beginner courses

#### [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md)

Learn how to use JavaScript to extract information from websites in this practical course, starting from the absolute basics.

#### [Web scraping basics for Python devs](https://docs.apify.com/academy/scraping-basics-python.md)

Learn how to use Python to extract information from websites in this practical course, starting from the absolute basics.

#### [Apify platform](https://docs.apify.com/academy/apify-platform.md)

The Apify platform is the best place to run your scrapers and automations in the cloud. Learn what an actor is, how to turn your program into an actor, and how to deploy it.

## Advanced web scraping courses

#### [API scraping](https://docs.apify.com/academy/api-scraping.md)

Learn all about how the professionals scrape various types of APIs with various configurations, parameters, and requirements.

#### [Anti-scraping protections](https://docs.apify.com/academy/anti-scraping.md)

Understand the various anti-scraping measures different sites use to prevent bots from accessing them, and how to appear more human to fix these issues.

#### [Expert scraping with Apify](https://docs.apify.com/academy/expert-scraping-with-apify.md)

After learning the basics of actors, learn to develop pro-level scrapers on the Apify platform with this advanced course.

---

# Actor marketing playbook

**Learn how to optimize and monetize your Actors on Apify Store by sharing them with other platform users.**

***

[Apify Store](https://apify.com/store) is a marketplace featuring thousands of ready-made automation tools called Actors. As a developer, you can publish your own Actors and generate revenue through our [monetization program](https://apify.com/partners/actor-developers). To help you succeed, we've created a comprehensive Actor marketing playbook.
You'll learn how to: * Optimize your Actor's visibility on Apify Store * Create compelling descriptions and documentation * Build your developer brand * Promote your work to potential customers * Analyze performance metrics * Engage with the Apify community ## Apify Store basics[​](#apify-store-basics "Direct link to Apify Store basics") #### [How Apify Store works](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-store-works.md) [Apify Store is where users discover, run, and purchase Actors, making it the best place to showcase your work. This section explains how Store is structured, how Actors get visibility, and what factors contribute to success.](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-store-works.md) #### [How to build Actors](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-to-build-actors.md) [Building a successful Actor isn’t just about writing code. You need to make your Actor reliable, scalable, and easy to use. Here, you’ll learn best practices for structuring your code, handling input/output, and making sure that everything runs smoothly.](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-to-build-actors.md) #### [How Actor monetization works](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md) [Apify Store lets you earn revenue from your Actors, but success depends on pricing, user engagement, and quality. This section covers how subscriptions work, how to set competitive pricing, and what it takes to generate sustainable income.](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md) #### [Ideas page and its use](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/ideas-page.md) [Not sure what to build next? The Ideas page helps you find high-demand Actor concepts based on community requests and trending topics.](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/ideas-page.md) #### [Actor success stories](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/actor-success-stories.md) [Some developers have turned their Actors into profitable tools with steady user bases. This section shares real-world examples of what works, from product strategy to marketing tactics.](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/actor-success-stories.md) ## Actor basics[​](#actor-basics "Direct link to Actor basics") #### [Naming your Actor](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/name-your-actor.md) [A clear, descriptive name helps users understand what your Actor does at a glance. Learn how to craft a name that’s both informative and easy to search for.](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/name-your-actor.md) #### [Importance of Actor URL](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md) [Your Actor’s URL isn’t just a link. You should also see it as a branding and SEO tool. This section explains why a simple, memorable URL can improve discoverability and credibility.](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md) #### [Actor description and SEO description](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description.md) [Your Actor’s description determines whether users understand its value in seconds. 
Learn how to write compelling copy that highlights key features while improving search rankings.](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description.md) #### [How to create an Actor README](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md) [A strong README makes it easy for users to get started with your Actor. This guide walks you through the essential sections to include, from installation to usage examples.](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md) #### [Actors and emojis](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actors-and-emojis.md) [Emojis can make your Actor stand out, but using them strategically is key. This section covers when and how to use emojis effectively without overdoing it.](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actors-and-emojis.md) ## Promoting your Actor[​](#promoting-your-actor "Direct link to Promoting your Actor") #### [SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md) [Search engines play a huge role in driving users to your Actor. Learn the basics of keyword optimization, metadata, and other techniques that improve your ranking on Google and Apify’s search.](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md) #### [Social media](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/social-media.md) [Your Actor won’t promote itself. This section covers simple, effective ways to showcase your Actor on platforms like X, LinkedIn, and Reddit.](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/social-media.md) #### [Parasite SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/parasite-seo.md) [Ranking on Google is tough, but piggybacking on high-authority platforms can help. This guide explains how to use sites like Medium, GitHub, and Stack Overflow to increase visibility.](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/parasite-seo.md) #### [Product Hunt](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/product-hunt.md) [A successful Product Hunt launch can drive early traction and long-term growth. Learn how to position your Actor, time your launch, and engage with the community for the best results.](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/product-hunt.md) #### [Blogs and blog resources](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/blogs-and-blog-resources.md) [Writing about your Actor helps users discover and understand it. This section walks you through how to write effective blog posts, where to publish them, and how to attract readers.](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/blogs-and-blog-resources.md) #### [Video tutorials](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/video-tutorials.md) [Video content makes it easier for users to grasp your Actor’s functionality. 
Learn how to create short, engaging tutorials that explain key features and use cases.](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/video-tutorials.md) #### [Webinars](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/webinars.md) [Hosting a webinar lets you connect directly with potential users and answer their questions. This section outlines how to plan, promote, and run a successful live session.](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/webinars.md) ## Interacting with users[​](#interacting-with-users "Direct link to Interacting with users") #### [Emails to Actor users](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/emails-to-actor-users.md) [Good email communication keeps users engaged and informed. Find out when and how to send helpful emails, from onboarding to feature updates.](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/emails-to-actor-users.md) #### [Issues tab](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/issues-tab.md) [User feedback is critical for improving your Actor. Learn how to track, respond to, and resolve issues efficiently through Apify’s Issues tab.](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/issues-tab.md) #### [Your Store bio](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/your-store-bio.md) [Your bio is where users learn who you are and why they should trust your Actors. This guide helps you write a professional yet approachable bio that builds credibility.](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/your-store-bio.md) ## Product optimization[​](#product-optimization "Direct link to Product optimization") #### [How to create a great input schema](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/how-to-create-a-great-input-schema.md) [A well-designed input schema makes your Actor easy to use. Learn how to create clear, logical inputs that guide users without overwhelming them.](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/how-to-create-a-great-input-schema.md) #### [Actor bundles](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/actor-bundles.md) [Bundling Actors together can increase their value and appeal. Learn how to package multiple Actors into a single, cohesive product that covers unique use cases.](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/actor-bundles.md) Ready to grow your presence on the Apify platform? Check out our guide to [publishing your first Actor](https://docs.apify.com/platform/actors/publishing.md). --- # Actor description & SEO description Learn about Actor description and meta description. Where to set them and best practices for both content and length. *** ## What is an Actor description?[​](#what-is-an-actor-description "Direct link to What is an Actor description?") First impressions are important, especially when it comes to tools. Actor descriptions are the first connection potential users have with your Actor. You can set two kinds of descriptions: *regular description* (in Apify Store) and *SEO description* (on Google search), along with their respective names: regular name and SEO name. tip You can change descriptions and names as many times as you want. ## Regular description vs. 
SEO description[](#regular-description-vs-seo-description "Direct link to Regular description vs. SEO description")

| | Actor description & name | SEO description & name |
| ------------------ | ------------------------ | ---------------------- |
| Name length | 40-50 characters | 40-50 characters |
| Description length | 300 characters | 145-155 characters |
| Visibility | Visible on Store | Visible on Google |

### Description & Actor name[](#description--actor-name "Direct link to Description & Actor name") Actor description is what users see on the Actor's web page in Apify Store, along with the Actor's name and URL. When creating an Actor description, a “warm” visitor experience is prioritized (more on that later). ![actor name \& description](/assets/images/actor-description-name-bea8b2060a01d4c5d190cb2445a9a6c6.png) Actor description is also present in Apify Console and across Apify Store. ![actor description in store](/assets/images/actor-description-store-bda4a42f8f8a0ca572e2fca5ce79d4b1.png) ### SEO description & SEO name[](#seo-description--seo-name "Direct link to SEO description & SEO name") Actor SEO description is a tool description visible on Google. It is shorter and SEO-optimized (keywords matter here). When creating the SEO description, a “cold” visitor experience is prioritized. ![seo description](/assets/images/seo_description-12e904f852b518923f228bd2ef68a534.png) Usually the way the potential user interacts with both these descriptions goes like this: SEO first, regular description second. Is there any benefit in them being different? ### Is there any benefit in the description and meta description being different?[](#is-there-any-benefit-in-the-description-and-meta-description-being-different "Direct link to Is there any benefit in the description and meta description being different?") Different descriptions give you a chance to target different stages of user acquisition and make sure the acquisition actually takes place. *SEO description (and SEO name)* targets a “cold” potential user who knows nothing about your tool yet and just came across it on Google search. They’re searching to solve a problem or use case. The goal of the meta description is to convince that visitor to click on your tool's page among other similar search results on Google. While it's shorter, the SEO description is also the space to optimize your language for search engines and match the most relevant search intent. *Description (and name)* targets a “warm” potential user who is already curious about your tool. They have clicked on the tool's page and have a few seconds to understand how complex the tool is and what it can do for them. Here you can forget SEO optimization and speak directly to the user. The regular description also has a longer character limit, which means you can expand on your Actor’s features. Learn more about search intent here: [SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md) ## Where can Actor descriptions be set?[](#where-can-actor-descriptions-be-set "Direct link to Where can Actor descriptions be set?") Both descriptions can be found and edited in the rightmost **Publication tab → Display information.** It has to be done separately for each Actor. note Setting the SEO description and SEO name is optional. If not set, the description will just be duplicated.
![changing seo name](/assets/images/changing__SEO_name-b739468d580f0dfd5aa0c82cb477f581.png) ![changing actor name and seo name](/assets/images/changing_Actor_name_and_SEO_name-fd56498d2164f1715ff6495538f5690d.png) Actor description specifically can also be quick-edited in this pop-up on the Actor's page in Apify Console. Open the **Actor's page**, then click on **…** in the top right corner, and choose ✎ **Edit name or description**. Then edit the description and click **Save**. ![changing actor description](/assets/images/change_Actor_description-703d7e6db0ba521eed798cd719c25a2a.png) ## Tips and recommendations on how to write descriptions[](#tips-and-recommendations-on-how-to-write-descriptions "Direct link to Tips and recommendations on how to write descriptions") When writing a description, less is more. You only have a few seconds to capture attention and communicate what your Actor can do. To make the most of that time, follow these guidelines used by Apify (these apply to both types of descriptions): ### Use variations and experiment 🔄[](#use-variations-and-experiment- "Direct link to Use variations and experiment 🔄") * *SEO name vs. regular name*: * name: Airbnb Scraper * SEO name: Airbnb Data Scraper * *Keywords on the web page*: Include variations, e.g. Airbnb API, Airbnb data, Airbnb data scraper, Airbnb rentals, Airbnb listings * No-code scraping tool to extract Airbnb data: host info, prices, dates, location, and reviews. * Scrape Airbnb listings without official Airbnb API! * *Scraping/automation process variations*: Use terms, e.g. crawl, crawler, scraping tool, finder, scraper, data extraction tool, extract data, get data * Scrape XYZ data, scraped data, data scraper, data crawler. ### Choose how to start your sentences 📝[](#choose-how-to-start-your-sentences- "Direct link to Choose how to start your sentences 📝") * *Noun-first (descriptive)*: * Data extraction tool to extract Airbnb data: host info, prices, dates, location, and reviews. * *Imperative-first (motivating)*: * Try a free web scraping tool to extract Airbnb data: host info, prices, dates, location, and reviews. ### Keep it short and SEO-focused ✂️[](#keep-it-short-and-seo-focused-️ "Direct link to Keep it short and SEO-focused ✂️") * *Be concise and direct*: clearly state what your Actor does. Avoid unnecessary fluff and boilerplate text. * ✅ Scrapes job listings from Indeed and gathers... * ❌ This Actor scrapes job listings from Indeed in order to gather... * *Optimize for search engines*: include popular keywords related to your Actor’s functionality that users might search for. * ✅ This Indeed scraper helps you collect job data efficiently. Use the tool to gather... * ❌ This tool will search through job listings on Indeed and offers you... ### List the data your Actor works with 📝[](#list-the-data-your-actor-works-with- "Direct link to List the data your Actor works with 📝") * Data extraction tool to extract Airbnb data: host info, prices, dates, location, and reviews. * Get hashtags, usernames, mentions, URLs, comments, images, likes, locations without the official Instagram API. ### Use keywords or the language of the target website 🗣️[](#use-keywords-or-the-language-of-the-target-website-️ "Direct link to Use keywords or the language of the target website 🗣️") * Extract data from hundreds of Airbnb home rentals in seconds. * Extract data from chosen tik-toks.
Just add a TikTok URL and get TikTok video and profile data: URLs, numbers of shares, followers, hashtags, hearts, video, and music metadata. * Scrape Booking with this hotels scraper and get data about accommodation on Booking.com. ### Highlight your strong suits 🌟[​](#highlight-your-strong-suits- "Direct link to Highlight your strong suits 🌟") * Ease of use, no coding, user-friendly: * Easy scraping tool to extract Airbnb data. * Fast and scalable: * Scrape whole cities or extract data from hundreds of Airbnb rentals in seconds. * Free (only if the trial run can cover $5 free credits): * Try a free scraping tool to extract Airbnb data: host info, prices, dates, location, and reviews. * Extract host information, locations, availability, stars, reviews, images, and host/guest details for free. * Available platform features (various formats, API, integrations, scheduling): * Export scraped data in formats like HTML, JSON, and Excel. * Additional tips: * Avoid ending lists with etc. * Consider adding relevant emojis for visual appeal. ### Break it down 🔠[​](#break-it-down- "Direct link to Break it down 🔠") Descriptions typically fit into 2-3 sentences. Don't try to jam everything into one. Examples: 1. Scrape whole cities or extract data from hundreds of Airbnb rentals in seconds. 2. Extract host information, addresses, locations, prices, availability, stars, reviews, images, and host/guest details. 3. Export scraped data, run the scraper via API, schedule and monitor runs, or integrate with other tools. ## FAQ[​](#faq "Direct link to FAQ") #### Can the Actor's meta description and description be the same?[​](#can-the-actors-meta-description-and-description-be-the-same "Direct link to Can the Actor's meta description and description be the same?") Yes, they can, as long as they have the same (shorter) length (under 150 characters). But they can also be different - there's no harm in that. #### How different can description and meta description be?[​](#how-different-can-description-and-meta-description-be "Direct link to How different can description and meta description be?") They can be vastly different and target different angles of your Actor. You can experiment by setting up different SEO descriptions for a period of time and seeing if the click-through rate rises. #### I set a custom SEO description but Google doesn't show it[​](#i-set-a-custom-seo-description-but-google-doesnt-show-it "Direct link to I set a custom SEO description but Google doesn't show it") Sometimes Google picks up a part of the README as the SEO description. It's heavily dependent on the search query. Sometimes what you see on Google might look differently compared to how you set the SEO description. It's all a part of how Google customizes search results. --- # Actors & emojis Using emojis in Actors is a science on its own. Learn how emojis enhance the user experience in Actors by grabbing attention, simplifying navigation, and making information clearer. ## On the use of emojis in Actors[​](#on-the-use-of-emojis-in-actors "Direct link to On the use of emojis in Actors") We started using emojis in Actors for several reasons. First, tech today often uses emojis to make things look more user-friendly. Second, people don’t read as much as we’d like. You only have a few seconds to grab their attention, and text alone can feel overwhelming. Third, we don’t have many opportunities or space to explain things about Actors, and we want to avoid users needing to open extra tabs or pages. 
Clarity should come instantly, so we turned to emojis. When evaluating a new tool, those first 5 seconds are critical. That’s why we use emojis extensively with our Actors. They’re part of the Actor SEO title and description to help the tool stand out in Google search results, although Google doesn't always display them. In READMEs, they serve as shortcuts to different sections and help users quickly understand the type of data they’ll get. In complex input schemas, we rely on emojis to guide users and help them navigate the tool more efficiently. ## Emoji science[​](#emoji-science "Direct link to Emoji science") Believe it or not, there’s a science to emoji usage. When we use emojis in Actors and related content, we tap into the brain's iconic and working memory. Iconic memory holds information for less than a second - this is unconscious processing, where attributes like color, size, and location are instantly recognized. This part is where emojis guide the person's attention in the sea of text. They signify that something important is here. Emojis help with that immediate first impression and create a sense of clarity. After that, the brain shifts to working memory, where it combines information into visual chunks. Since we can only hold about 3-4 chunks at once, emojis help reinforce key points, thus reducing cognitive load. Consistent emoji use across the Actor ecosystem ensures users can quickly connect information without getting overwhelmed. As an example of this whole process, first, the user notices the emojis used in the field titles (pre-attentive processing). They learn to associate the emojis with those titles (attentive processing). Later, when they encounter the same emojis in a README section, they’ll make the connection, making it easier to navigate without drowning in a sea of text. ## Caveats to emojis[​](#caveats-to-emojis "Direct link to Caveats to emojis") 1. Don't overuse them, and don’t rely on emojis for critical information. Emojis should support the text, not replace key explanations or instructions. They're a crutch for concise copywriting, not a universal solution. 2. Use them consistently. Choose one and stick with it across all content: descriptions, parts of input schema, mentions in README, blog posts, etc. 3. Some emojis have multiple meanings, so choose the safest one. It could be general internet knowledge or cultural differences, so make sure the ones you choose won’t confuse or offend users in other markets. 4. Some emojis don’t render well on Windows or older devices. Try to choose ones that display correctly on Mac, Windows, and mobile platforms. Besides, emoji-heavy content can be harder for screen readers and accessibility tools to interpret. Make sure the information is still clear without the emojis. 5. It's okay not to use them. --- # How to create an Actor README **Learn how to write a comprehensive README to help users better navigate, understand and run public Actors in Apify Store.** *** ## What's a README in the Apify sense?[​](#whats-a-readme-in-the-apify-sense "Direct link to What's a README in the Apify sense?") At Apify, when we talk about a README, we don’t mean a guide mainly aimed at developers that explains what a project is, how to set it up, or how to contribute to it. At least, not in its traditional sense. 
You could argue our notion of README is closer to this [one described on GitHub](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-readmes): README files typically include information on: * What the project does * Why the project is useful * How users can get started with the project * Where users can get help with your project We mean all of this and even more. At Apify, when we talk about READMEs, we refer to the public Actor detail page on Apify Store. Specifically, its first tab. The README exists in the same form both on the web and in Console. So what is it for? Before we dive in, a little disclaimer: you don't need your Apify README to fulfill all its purposes. Technically, you could even publish an Actor with just a single word in the README. But you'd be missing out if you did that. Your Actor’s README has at least four functions: 1. *SEO* - If your README is well-structured and includes important keywords — both in headings and across the text — it has a high chance of being noticed and promoted by Google. Organic search brings the most motivated type of potential users. If you win this game, you've won most of the SEO game. 2. *First impression* - Your README is one of the first points of contact with a potential user. If you come across as convincing, clear, and reassuring, it could be the factor that makes a user try your Actor for their task. 3. *Extended instruction* - The README is also the space that explains specific complex input settings. For example, special input formatting, anything coding-related, or extended functions. Of course, you could put that all in a blog post as well, but the README should be their first point of contact. 4. *Support* - Your users come back to the README when they face issues. So use it as a space to let them know where they can find links to tutorials if they run into issues, describe common troubleshooting techniques, share tricks, or warn them about known bugs. ## README elements theory[](#readme-elements-theory "Direct link to README elements theory") These are the most important elements of the README. This structure is also not to be followed to a “t”. Of course, what you want to say to your potential users and how you want to promote your Actor will differ case by case. These are just the most common practices we have for our Actor READMEs. Beware that the headings are written with SEO in mind, which is why you see certain keywords repeated over and over. Aim for sections 1–6 below and try to include at least 300 words. You can move the sections around to some extent if it makes sense, e.g. 3 might come after 6. Consider using emojis as bullet points or otherwise trying to break up the text. ### Intro and features[](#intro-and-features "Direct link to Intro and features") What is \[Actor]? * Explain in two or three sentences what the Actor does and the easiest way to try it. Mention briefly what kind of data it can extract and any other tangible goal the tool can achieve. Describe the input in one sentence. Highlight the most important words in bold. What can this \[Actor] do? * List the main features of this tool. List multiple ways of providing input if applicable. List platform advantages. If it's a bundle, mention the steps that the Actor will do for you, mention specific obstacles this tool is able to overcome, and say upfront how many results you can get for free. Remember the Apify platform! Your Actor + the Apify platform. They come as a package.
Don't forget to flaunt all the advantages that the platform gives to your solution. Imagine if there was a solution that is identical to yours but without the platform advantages such as monitoring, access to API, scheduling, possibility of integrations, proxy rotation. Now, if that tool suddenly gained all those advantages it would surely make a selling point out of it. This is how you should be thinking about your tool — as a solution boosted by the Apify platform. Don't ever forget that advantage. What data can \[Actor] extract? What data can you extract from \[target website] * Create a table that represents the main data points that the Actor can extract. You don't have to list every single one, just list the most understandable and relatable ones. Depending on the complexity of your Actor, you might include one or all three of these sections. It will also depend on what your Actor does. If your Actor has simple input but does a lot of steps for the user under the hood (like a bundle would), you might like to include the "What can this Actor do?" section. If your Actor extracts data, it makes sense to include a section with a table. ### Tutorial section[​](#tutorial-section "Direct link to Tutorial section") This could be a simple listed step-by-step section or a paragraph with a link to a tutorial on a blog. A step-by-step section is reassuring for the user, and it can be a section optimized for Google. How do I use \[Actor] to scrape website data? ### Pricing[​](#pricing "Direct link to Pricing") How much will it cost to scrape \[target site]? How much will scraping \[target site] cost? Is scraping \[target site] free? How much does it cost to extract \[target site] data? Web scraping can be very unpredictable because there are a lot of elements involved in order for the process to be successful: the complexity of the website, proxies, cookies, etc. This is why it's important to set the pricing and scraping volume expectations for your users. You might think the part above the Actor detail page already indicates pricing. But this paragraph can still be useful. First of all, cost-related questions can show up in Google, if they are SEO optimized. Second, you can use this space to inform and reassure the user about the pricing, give more details about it, or entice them with the promise of very scalable scraping. * If it's a consumption pricing model (only consumed CUs), you can use this space to set expectations and explain what it means to pay for Compute Units. Similarly, if it's a rental Actor, you can also use this paragraph to set expectations. Talk about the average amount of data that can be scraped per given price. Make it easy for users to imagine how much they will pay for a given dataset. This will also make it easier for them to compare your solution with others on the market price-wise and value-wise. * If it's price per result, you can extrapolate how many results a user can get on a free plan and also entice them with a larger plan and how many thousands of results they can get with that. * If it's a bundle that consists of a couple of Actors that are priced differently, you can use this section to talk about the difference between all the Actors involved and how that will affect the final price of a run. In any case, on top of setting expectations and reassuring users, this paragraph can get into Google. 
If somebody is Googling "How much does it cost to scrape \[website]", they might come across this part of your README and it will lead them from Google search directly to your Actor's detail page. So you don't want to miss that opportunity. ![readme example](/assets/images/readme-7f2dd6436cb16cefbbfcc9c83e10bb98.png) ### Input and output examples[](#input-and-output-examples "Direct link to Input and output examples") This is what people click on the most in the table of contents of the README. After they are done scrolling through the first part of the README, users are interested in how difficult the input is, what it looks like, and what kind of information they can expect. **Input**: often a screenshot of the input schema. This is also a way for people to see the platform even before they create an account. **Output**: can be shown as a screenshot if your output schema looks like something you would want to promote to users. You can also just include a JSON example containing a few objects. Even better if there's continuity between the input example and output example. If your datasets come out too complex and you want to save your users some scrolling, you can also show multiple output examples: one for reviews, one for contact details, one for ads, etc. ### Other Actors[](#other-actors "Direct link to Other Actors") Don't forget to promote your other Actors. While our system for Actor recommendation works - you can see related Actors at the bottom of the README - it only works within the same category or similar names. It won't recommend a completely different Actor from the same creator. So make sure to interconnect your work by taking the initiative yourself. You can mention your other Actors in a list or as a table. ### FAQ, disclaimers, and support[](#faq-disclaimers-and-support "Direct link to FAQ, disclaimers, and support") The FAQ is a section where you can keep all the secondary questions that might still come up. Here are just a few things we usually push to the FAQ section. * disclaimers and legality * comparison table between your Actor and similar solutions * information about the official API and how the scraper is a stand-in for it (SEO) * questions brought up by the users * tips on how best to use the Actor * troubleshooting and mentioning known bugs * mentioning the Issues tab and highlighting that you're open to feedback and actively collecting it * mentioning being open to creating a custom solution based on the current one and showing a way to contact you * interlinking * mentioning the possibility of transferring data using an API — API tab * possibility for integrations * use cases for the data scraped, success stories exemplifying the use of data ## Format of the README[](#format-of-the-readme "Direct link to Format of the README") ### Markdown[](#markdown "Direct link to Markdown") The README has to be written in Markdown. The most important elements are H2 and H3 headings, links to pages, links to images, and tables. For specific formatting, you can try using basic HTML. That will also work. CSS won’t. ### HTML use[](#html-use "Direct link to HTML use") You can mix HTML with Markdown interchangeably. The Actor README will display correctly on the Apify platform either way. That gives you more freedom to use HTML when needed. Remember, don't try CSS. ### Tone of the README[](#tone-of-the-readme "Direct link to Tone of the README") Apify Store has many Actors in its stock, and it's only growing.
The advantage of an Actor is that an Actor can be anything, as versatile or complex as needed - from a single-URL input to advanced features that give the user customized control over the input parameters. There are Actors that are intended for users who aren't familiar with coding and don't have any experience with it. Ideally, the README should reflect the level of skill needed to use the Actor. The tone of the README should make it immediately obvious who the tool is aimed at. If your tool's input includes glob patterns or requires looking for selectors, that should be immediately visible from the README, before the user even tries the tool. Trying to simplify this information using simple words with ChatGPT can be misleading to the user. You will attract the wrong audience, and they will end up churning or asking you too many questions. And vice versa. If your target audience is people with little to no coding skills, who just prefer point-and-click solutions, this should be visible from the README. Speak in regular terms, and avoid code blocks or complex information at the beginning unless it's absolutely necessary. This means that, when people land on your Actor detail page, they will have their expectations set from the get-go. ### Length of a README[](#length-of-a-readme "Direct link to Length of a README") When working on improving a README, we regularly look at heatmaps that show us where our website visitors spend most of their time. From our experience, most first-time visitors don't scroll past the first 25% of a README. That means that the first quarter of the README is where you want to focus most of your attention if you're trying to persuade the page visitor to try your Actor. From the point of view of acquisition, the first few sections should make it immediately obvious what the tool is about, how hard it is to use, and who it is created for. This is why, in Apify's READMEs, you can see our first few paragraphs are built in such a way as to explain these things and reassure the visitors that anyone can use these tools. From the point of view of retention, that doesn't mean you can't have long or complex READMEs or that the information beyond the 25% mark doesn't matter. Since the README is also intended to be used as a backup when something goes wrong or the user needs more guidance, your users will come back to it multiple times. ### Images and videos[](#images-and-videos "Direct link to Images and videos") As for using screenshots and GIFs, host them with some sort of image hosting service. Your own GitHub repository would be best because you have full control over it. Name the images with SEO in mind and try to keep them compressed but at good enough quality. You don't want an image or GIF to take too long to load. One trick is not only to add images but also to make them clickable. For some reason, people like clicking on images, at least they try to when we look at the heatmaps. You can lead the screenshot clicks towards a signup page, which is possible with Markdown. If your screenshot seems too big or occupies too much space, you can make it smaller by using HTML. To embed a YouTube video, all you have to do is include its URL. No further formatting is needed; the thumbnail will render itself on the README page. Try Carbon for code If you want to add snippets of code anywhere in your README, you can use [Carbon](https://github.com/carbon-app/carbon).
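To tie the image and video tips above together, here's a minimal Markdown sketch. The repository path, link target, and video URL are placeholders, not real assets:

```
<!-- Clickable screenshot: the image itself links to the sign-up page -->
[![Example Scraper input](https://raw.githubusercontent.com/your-org/your-repo/main/docs/input-screenshot.png)](https://console.apify.com)

<!-- Smaller image via basic HTML when the full-size screenshot takes up too much space -->
<img src="https://raw.githubusercontent.com/your-org/your-repo/main/docs/output-example.png" alt="Example Scraper output" width="500">

<!-- A YouTube URL on its own line renders as an embedded player -->
https://www.youtube.com/watch?v=your-video-id
```

Hosting the image files in your own GitHub repository, as suggested above, keeps the URLs stable and fully under your control.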
If you need quick Markdown guidance, check out a Markdown cheat sheet. ## README and SEO[](#readme-and-seo "Direct link to README and SEO") Your README is your landing page. If there were only one thing to remember about READMEs on Apify Store, it would be this. A README on Apify Store is not just dry instructions on how to use your Actor. It has much more potential than that. In the eyes of Google, your Actor's detail page, aka README, is a full-fledged landing page containing all the most important information to be found and understood by users. Of course, that all only counts if your README is both well formatted and contains keywords. We'll talk about that part later on. What makes a good README? A good README has to strike a balance between what you want your page visitors to know, what your users will turn to when they run into trouble, and what Google will register when it indexes pages and decides which ones deserve to rank higher. ### Table of contents[](#table-of-contents "Direct link to Table of contents") The H1 of your page is the Actor name, so you don't have to set that up. Don't add more H1s. README headings should be H2 or H3. H2 headings will make up the table of contents on the right. So if you don't want the table to be too crowded, keep the H2s to the basics and push all the longer phrases and questions to H3s. H3s will stay hidden in the accordion in the default state until the visitor hovers their cursor over it. H4 headings can also be included, of course, but they won't show up as a part of the table of contents. ### Keyword opportunities[](#keyword-opportunities "Direct link to Keyword opportunities") Do SEO research for keywords and see how they can fit organically into the text. Prioritize H2s and H3s, then the regular text. Add new keyword-heavy paragraphs if you see an opportunity. The easiest sections to include keywords in are, for example: * API, as in Instagram API * data, as in extract Instagram data * Python, as in extract data in Python * scrape, as in how to scrape X * scraping, as in scraping X Now, could every H2 just say exactly what it is about, without SEO? Of course. You don't have to optimize your H2s and H3s, and you're free to call them simply Features, How it works, Pricing, Support, etc., or not to have many H2s at all and keep it all as one page. However, the H2s and H3s are what sometimes get into the Google Search results. If you're familiar with the People Also Ask section, that's the best place to match your H2s. They can also get highlighted in the Sitelinks of Google Search Results. Any part of your README can make it onto Google pages. The intro sentence describing what your Actor is about, a video, a random question. Each one can become a good candidate for those prime Google pages. That's why it's important to structure and write your README with SEO in mind. ### Importance of including a video[](#importance-of-including-a-video "Direct link to Importance of including a video") If your page has a video, it has a better chance of ranking higher in Google. ## README and input schema[](#readme-and-input-schema "Direct link to README and input schema") The README should serve as a fallback for your users if something isn't immediately obvious in the input schema. There's also only so much space in the input schema and the tooltips, so naturally, if you want to provide more details about something, e.g. input, formatting, or expectations, you should put it in the README and refer to it from the relevant place in the input schema.
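Before moving on to the element-by-element template below, here's how the heading and keyword advice above might look as a bare Markdown outline. The headings are purely illustrative (an imaginary "Example Scraper" targeting example.com), not a required structure:

```
<!-- The H1 is generated from the Actor name, so start with H2s -->
## What does Example Scraper do?

## How do I use Example Scraper to scrape example.com?

### Is it legal to scrape example.com?

## How much does it cost to scrape example.com?

## Input and output examples

## FAQ and support
```

The H2s above would form the table of contents, while the keyword-heavy questions sit in H3s that stay collapsed until a visitor hovers over the table of contents.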
Learn about [How to create a great input schema](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/how-to-create-a-great-input-schema.md) ## Readme elements template[](#readme-elements-template "Direct link to Readme elements template") 1. What does (Actor name) do? * in 1–2 sentences describe what the Actor does and what it does not do * consider adding keywords like API, e.g. Instagram API * always have a link to the target website in this section 2. Why use (Actor name)? or Why scrape (target site)? * How it can be beneficial for the user * Business use cases * Link to a success story, a business use case, or a blog post. 3. How to scrape (target site) * Link to "How to…" blogs, if one exists (or suggest one if it doesn't) * Add a video tutorial or GIF from an ideal Actor run. Embedding YouTube videos For better user experience, Apify Console automatically renders every YouTube URL as an embedded video player. Simply add a separate line with the URL of your YouTube video. * Consider adding a short numbered tutorial, as Google will sometimes pick these up as rich snippets. Remember that this might be in search results, so you can repeat the name of the Actor and give a link, e.g. 4. Is it legal to scrape (target site)? * This can be used as a boilerplate text for the legal section, but you should use your own judgment and also customize it with the site name. > Our scrapers are ethical and do not extract any private user data, such as email addresses, gender, or location. They only extract what the user has chosen to share publicly. We therefore believe that our scrapers, when used for ethical purposes by Apify users, are safe. However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. You can also read our blog post on the legality of web scraping 5. Input * Each Actor detail page has an input tab, so you just need to refer to that. If you like, you can add a screenshot showing the user what the input fields will look like. * This is an example of how to refer to the input tab: > Twitter Scraper has the following input options. Click on the input tab for more information. 6. Output * Mention "You can download the dataset extracted by (Actor name) in various formats such as JSON, HTML, CSV, or Excel." * Add a simplified JSON dataset example, like here  7. Tips or Advanced options section * Share any tips on how to best run the Actor, such as how to limit compute unit usage, get more accurate results, or improve speed. If you want some general tips on how to make a GitHub README that stands out, check out these guides. Not everything in there will be suitable for an Apify Actor README, so you should cherry-pick what you like and use your imagination. ## Resources[](#resources "Direct link to Resources") [Build a Stunning README For Your GitHub Profile](https://towardsdatascience.com/build-a-stunning-readme-for-your-github-profile-9b80434fe5d7) [How to Create a Beautiful README for Your GitHub Profile](https://yushi95.medium.com/how-to-create-a-beautiful-readme-for-your-github-profile-36957caa711c) --- # Importance of Actor URL **Actor URL (or technical name, as we call it) is the page URL of the Actor shown on the web.
When you're creating an Actor, you can set the URL yourself along with the Actor name. Here are best practices on how to do it well.** ![actor url example](/assets/images/what-is-actor-url-7560efc6bb6906af078c2cef44100b93.png) *** ## Why is Actor URL so important?[](#why-is-actor-url-so-important "Direct link to Why is Actor URL so important?") The Actor URL plays a crucial role in SEO. Google doesn't just read the Actor's name or README; it also analyzes the URL. The *URL is one of the first signals to Google about the content of your page* - whether it's a product listing, a tool, a blog post, a landing page for a specific offering, or something else entirely. Therefore, it's important to know how to use this shorthand to your advantage and clearly communicate to Google what your page offers. Choose the URL carefully This part of the manual is only applicable to new Actors. *Once set, existing Actor URLs shouldn't change*. ## How to choose a URL[](#how-to-choose-a-url "Direct link to How to choose a URL") The right naming can propel or hinder the success of the Actor on Google Search. Just as naming your Actor is important, so is choosing its URL. The only difference is, once set, the URL is intended to be permanent (more on this [later](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md)). What's the formula for the best Actor URL? ### Brainstorming[](#brainstorming "Direct link to Brainstorming") What does your Actor do? Does it scrape, find, extract, automate, connect? Think of these when you are looking for a name. You might already have a code name in mind, but it’s essential to ensure it stands out and is distinct from similar names—both on Google and on Apify Store. ### Matching URL and name[](#matching-url-and-name "Direct link to Matching URL and name") The easiest way is to make sure the Actor name and the technical name match. As in TikTok Scraper (tiktok-scraper) or Facebook Data Extractor (facebook-data-extractor). But they can also be different. ### SEO[](#seo "Direct link to SEO") The name should reflect not only what the Actor does (or what website it targets), but also what words people use when they search for it. This is why it's also important to do SEO research to see which keywords work best for the topic. Ideally, the URL should include a keyword that has low complexity (low competition) but high traffic (high demand). Learn more about SEO research and the best tools for it here: [SEO](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/seo.md) ### Inspiration in Apify Store[](#inspiration-in-apify-store "Direct link to Inspiration in Apify Store") Explore Store URLs of similar Actors. But avoid naming your Actor too similarly to what already exists, for these two reasons: 1. There’s evidence that new URLs that are similar to existing ones can have drastically different levels of success. The first URL might thrive while a similar one published later struggles to gain traction. For example, *onedev/pentagon-scraper* was published first and has almost 100x more traction than *justanotherdev/pentagon-scraper*. It will be very hard for the latter to beat the former. The reason for this is that Google operates on a “first come, first served” basis, and once it's set, it is very hard to make Google change its ways and make it pay attention to new pages with a similar name. 2. As Apify Store is growing, it's important to differentiate yourself from the competition.
A different URL is just one more way to do that. If a person is doing research on Store, they will be less likely to get confused between two tools with the same name. ### Length of URL[](#length-of-url "Direct link to Length of URL") Ideally, keep it under four words. As in, *Facebook Data Extractor* (*facebook-data-extractor*), not (*facebook-data-meta-online-extractor-light*). If the name is long and you're trying to match it with your URL, keep only the most essential words for the URL. ### Variations[](#variations "Direct link to Variations") It can be a long-tail keyword with the tool type in it: scraper, finder, extractor. But you can also consider keywords that include terms like API, data, and even variations of the website name. Check out what keywords competitors outside of Apify Store are using for similar tools. ### Nouns and adjectives[](#nouns-and-adjectives "Direct link to Nouns and adjectives") One last tip on this topic is to *avoid adjectives and verbs*. Your page is about a tool, so keep it to nouns. Anything regarding what the tool does (scrape, automate, import) and what it's like (fast, light, best) can be expressed in the Actor's name, not the Actor's URL. Adding an adjective or verb like that either does nothing for SEO or might even damage the SEO chances of the page. ## Why you shouldn’t change your Actor URL[](#why-you-shouldnt-change-your-actor-url "Direct link to Why you shouldn’t change your Actor URL") Don't change the URL There's only one rule about Actor URL: don't change the URL. The Actor's name, however, can be changed without any problems. Once set, the page URL should not be changed, for two important reasons: * Google dislikes changes to URLs. Once your Actor has built up keyword associations and familiarity with Google, regaining that standing after a URL change can be challenging. You will have to start from scratch. * Current integrations will break for your Actor's users. This is essential for maintaining functionality. If you absolutely have to change the URL, you will have to communicate that fact to your users. 💡 Learn more about the easiest ways to communicate with your users: \[Emails to Actor users] ## How and where to set the Actor URL[](#how-and-where-to-set-the-actor-url "Direct link to How and where to set the Actor URL") In Console. Open the **Actor's page**, then click on **…** in the top right corner, and choose ✎ **Edit name or description**. Then set the URL in the **Unique name** ✎ field and click **Save**. ![set actor url in console](/assets/images/how-and-where-to-set-the-actor-url-console-18c354804a82c1ab93f59d39cabfcc97.png) ![set the actor url](/assets/images/how-and-where-to-set-the-actor-url-5f4f6293d3389f468863c78d086c97ee.png) ## FAQ[](#faq "Direct link to FAQ") #### Can Actor URL be different from Actor name?[](#can-actor-url-be-different-from-actor-name "Direct link to Can Actor URL be different from Actor name?") Yes. While they can be the same, they don’t have to be. For the best user experience, keeping them identical is recommended, but you can experiment with the Actor's name. Just avoid changing the Actor URL. #### Can I change a very fresh Actor URL?[](#can-i-change-a-very-fresh-actor-url "Direct link to Can I change a very fresh Actor URL?") Yes, but act quickly. It takes Google a few days to start recognizing your page. For this reason, if you really have to, *it is best to change the Actor's URL in the first few days*, before you build a steady user base and rapport with Google.
#### How long does it take Google to pick up on the new URL?[​](#how-long-does-it-take-google-to-pick-up-on-the-new-url "Direct link to How long does it take Google to pick up on the new URL?") Google reindexes Apify web pages almost every day. It might take anywhere from 3-7 days for it to pick up a new URL. Or it might happen within a day. #### Can I use the identical technical name as this other Actor?[​](#can-i-use-the-identical-technical-name-as-this-other-actor "Direct link to Can I use the identical technical name as this other Actor?") Yes, you can. But it will most likely lower your chances of being noticed by Google. #### Does changing my Apify account name affect the Actor URL?[​](#does-changing-my-apify-account-name-affect-the-actor-url "Direct link to Does changing my Apify account name affect the Actor URL?") Yes. If you're changing from *justanotherdev/pentagon-scraper* to *dev/pentagon-scraper*, it counts as a new page. Essentially, the consequences are the same as after changing the technical name of the Actor. --- # Name your Actor **Apify's standards for Actor naming. Learn how to choose the right name for scraping and automation Actors and how to optimize your Actor for search engines.** *** Naming your Actor can be tricky, especially after you’ve worked hard on it. To help people find your Actor and make it stand out, we’ve set some naming guidelines. These will help your Actor rank better on Google and keep things consistent on [Apify Store](https://apify.com/store). Ideally, you should choose a name that clearly shows what your Actor does and includes keywords people might use to search for it. ## Parts of Actor naming[​](#parts-of-actor-naming "Direct link to Parts of Actor naming") Your Actor's name consists of four parts: actual name, SEO name, URL, and GitHub repository name. * Actor name (name shown in Apify Store), e.g. *Booking Scraper*. * Actor SEO name (name shown on Google Search, optional), e.g. *Booking.com Hotel Data Scraper*. * If the SEO name is not set, the Actor name will be the default name shown on Google. * Actor URL (technical name), e.g. *booking-scraper*. * More on it on [Importance of Actor URL](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md) page. * GitHub repository name (best to keep it similar to the other ones, for convenience), e.g. *actor-booking-scraper*. ## Actor name[​](#actor-name "Direct link to Actor name") The Actor name provides a human-readable name. The name is the most important real estate from an SEO standpoint. It should exactly match the most likely search query that potential users of your Actor will use. At the same time, it should give your Actor a clear name for people who will use it every day. tip Your Actor's name should be *40-50 characters* long. You can change your Actor name freely in Apify Console. ### Actor name vs. SEO name[​](#actor-name-vs-seo-name "Direct link to Actor name vs. SEO name") There's an option to step away from your Actor's name for the sake of search engine optimization — the Actor SEO name. The Actor name and Actor SEO name serve different purposes: * *Actor name*: this is the name visible in Apify Store and Console. It should be easy for users to understand and quickly show what your Actor does. It’s about attracting users who browse the Store. ![actor name example](/assets/images/actor-name-68e32093948ca0b704dda5e5672bf4d2.png) * *Actor SEO name*: this is the name that appears in search engine results. 
It should include keywords people might search for to find your Actor. It’s about improving visibility on search engines and encouraging users to click on your link. ![actor seo name example](/assets/images/actor-seo-name-1a71276bdf8a0d33b3be5d33ba264288.png) For example: * *Actor name*: YouTube Scraper * *Actor SEO name*: YouTube data extraction tool for video analysis Here, the SEO name uses extra keywords to help people find it through search engines, while the Actor name is simpler and easier for users to understand and find on Apify Store. 💡 When creating the SEO name, focus on using relevant keywords that potential users might search for. It should still match what your Actor does. More about SEO name and description: \[Actor description and SEO description] ### Actor name vs. technical name[​](#actor-name-vs-technical-name "Direct link to Actor name vs. technical name") The Actor name and technical name (or URL) have different uses: * *Actor name*: this is the name users see on Apify Store and Console. It’s designed to be user-friendly and should make the Actor's purpose clear to anyone browsing or searching for it. * *Technical name*: this is a simplified, URL-friendly version used in technical contexts like API calls and scripts. This name should be concise and easily readable. Once set, it should not be changed as it can affect existing integrations and cause broken links. For example: * *Actor name*: Google Search Scraper * *Technical name*: google-search-scraper The Actor name is user-friendly and descriptive, while the technical name is a clean, URL-compatible version. Note that the technical name does not include spaces or special characters to ensure it functions properly in technical contexts. important This is important for SEO! Once set, the technical name should not be changed. Make sure you finalize this name early in development. More on why here: \[Importance of Actor URL] ## Best practices for naming[​](#best-practices-for-naming "Direct link to Best practices for naming") ### Brainstorming[​](#brainstorming "Direct link to Brainstorming") What does your Actor do? Does it scrape, find, extract, automate, connect, or upload? When choosing a name, ensure it stands out and is distinct from similar names both on Google and on Apify Store. * *Use nouns and variations*: use nouns like "scraper", "extractor", “downloader”, “checker”, or "API" to describe what your Actor does. You can also include terms like API, data, or variations of the website name. * *Include key features*: mention unique features or benefits to highlight what sets your Actor apart. * *Check for uniqueness*: ensure your name isn’t too similar to existing Actors to avoid confusion and help with SEO. ### Match name and URL[​](#match-name-and-url "Direct link to Match name and URL") The simplest approach is to make all names match. For example, TikTok Ads Scraper (tiktok-ads-scraper) or Facebook Data Extractor (facebook-data-extractor). However, variations are acceptable. ### Name length[​](#name-length "Direct link to Name length") Keep the name concise, ideally less than four words. For instance, Facebook Data Extractor is preferable to Facebook Meta Data Extractor Light. ### Check Apify Store for inspiration[​](#check-apify-store-for-inspiration "Direct link to Check Apify Store for inspiration") Look at the names of similar Actors on Apify Store, but avoid naming your Actor too similarly. By choosing a unique name, you can stand out from the competition. 
This will also reduce confusion and help users easily distinguish your Actor. ### Keep SEO in mind[​](#keep-seo-in-mind "Direct link to Keep SEO in mind") Even though you can set a different variation for SEO name specifically, consider doing a bit of research when setting the regular name as well. The name should reflect what the Actor does and the keywords people use when searching for it. If the keywords you find sound too robotic, save them for the SEO name. But if they sound like something you'd search for, it's a good candidate for a name. You can also check the keywords competitors use for similar tools outside Apify Store. ### Occasionally experiment[​](#occasionally-experiment "Direct link to Occasionally experiment") You can test and refine your SEO assumptions by occasionally changing the SEO name. This allows you to track how changes to names affect search rankings and user engagement. Changing the regular name is not forbidden but still less desirable since it can confuse your existing users and also affect SEO. ## Naming examples[​](#naming-examples "Direct link to Naming examples") ### Scraping Actors[​](#scraping-actors "Direct link to Scraping Actors") ✅: * Technical name (Actor's name in the [Apify Console](https://console.apify.com/)): `${domain}-scraper`, e.g. youtube-scraper. * Actor name: `${Domain} Scraper`, e.g. YouTube Scraper. * Name of the GitHub repository: `actor-${domain}-scraper`, e.g. actor-youtube-scraper. ❌: * Technical name: `the-scraper-of-${domain}`, e.g. the-scraper-of-youtube. * Actor name: `The Scraper of ${Domain}`, e.g. The Scraper of YouTube. * GitHub repository: `actor-the-scraper-of-${domain}`, e.g. actor-the-scraper-of-youtube. If your Actor only caters to a specific service on a domain (and you don't plan on extending it), add the service to the Actor's name. For example, * Technical name: `${domain}-${service}-scraper`, e.g. google-search-scraper. * Actor name: `${Domain} ${Service} Scraper`, e.g. [Google Search Scraper](https://apify.com/apify/google-search-scraper). * GitHub repository: `actor-${domain}-${service}-scraper`, e.g. actor-google-search-scraper. ### Non-scraping Actors[​](#non-scraping-actors "Direct link to Non-scraping Actors") Naming for non-scraping Actors is more liberal. Being creative and considering SEO and user experience are good places to start. Think about what your users will type into a search engine when looking for your Actor. What is your Actor's function? Below are examples for the [Google Sheets](https://apify.com/lukaskrivka/google-sheets) Actor. ✅: * Technical name: google-sheets. * Actor name: Google Sheets Import & Export. * GitHub repository: actor-google-sheets. ❌: * Technical name: import-to-and-export-from-google-sheets. * Actor name: Actor for Importing to and Exporting from Google Sheets. * GitHub repository: actor-for-import-and-export-google-sheets. Renaming your Actor You may rename your Actor freely, except when it comes to the Actor URL. Remember to read [Importance of Actor URL](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/importance-of-actor-url.md) to find out why! --- # Emails to Actor users **Getting users is one thing, but keeping them is another. While emailing your users might not seem like a typical marketing task, any seasoned marketer will tell you it’s essential. It’s much easier to keep your current users happy and engaged than to find new ones. 
This guide will help you understand when and how to email your users effectively.** *** ## Whom and where to email[​](#whom-and-where-to-email "Direct link to Whom and where to email") You can email the audience of a specific Actor directly from Apify Console. Go to **Messaging > Emails > Compose new**. From there, select the Actor whose users you want to email, write a subject line, and craft your message. An automatic signature will be added to the end of your email. ## How to write a good email[​](#how-to-write-a-good-email "Direct link to How to write a good email") Emails can include text, formatting, images, GIFs, and links. Here are four main rules for crafting effective emails: 1. Don’t email users without a clear purpose. 2. Keep your message concise and friendly. 3. Make the subject line direct and to the point. Consider adding an emoji to give users a hint about the email’s content. 4. Use formatting to your advantage. Console emails support Markdown, so use bold, italics, and lists to highlight important details. Additional tips: * Show, don’t tell — use screenshots with arrows to illustrate your points. * If you’re asking users to take action, include a direct link to what you're referring to. * Provide alternatives if it suits the situation. * Always send a preview to yourself before sending the email to all your users. ## When to email users[​](#when-to-email-users "Direct link to When to email users") Our general policy is to avoid spamming users with unnecessary emails. We contact them only if there's a valid reason. Here’s the list of regular good reasons to contact users of the Actor: ### 1. Introducing a new feature of the Actor[​](#1-introducing-a-new-feature-of-the-actor "Direct link to 1. Introducing a new feature of the Actor") New filter, faster scraping, changes in input schema, in output schema, a new Integration, etc. > ✉️ 🏙️ Introducing Deep city search for Tripadvisor scrapers > > Hi, > > Tired of Tripadvisor's 3000 hotels-per-search limit? We've got your back. Say hello to our latest baked-in feature: Deep city search. Now, to get all results from a country-wide search you need to just set Max search results above 3000, and watch the magic happen. > > A bit of context: while Tripadvisor never limited the search for restaurants or attractions, hotel search was a different case; it always capped at 3000. Our smart search is designed to overcome that limit by including every city within your chosen location. We scrape hotels from each one, ensuring no hidden gems slip through the cracks. This feature is available for [Tripadvisor Scraper](https://console.apify.com/actors/dbEyMBriog95Fv8CW/console) and [Tripadvisor Hotels Scraper](https://console.apify.com/actors/qx7G70MC4WBE273SM/console). > > So get ready for an unbeatable hotel-hunting experience. Give it a spin, and let us know what you think! Introduce and explain the features, add a screenshot of a feature if it will show in the input schema, and ask for feedback. ### 2. Actor adapting to the changes of the website it scrapes[​](#2-actor-adapting-to-the-changes-of-the-website-it-scrapes "Direct link to 2. Actor adapting to the changes of the website it scrapes") A common situation in web scraping that's out of your control. > ✉️ 📣 Output changes for Facebook Ads Scraper > > Hi, > > We've got some news regarding your favorite Actor – [Facebook Ads Scraper](https://console.apify.com/actors/JJghSZmShuco4j9gJ/console). Recently, Facebook Ads have changed their data format. 
To keep our Actor running smoothly, we'll be adapting to these changes by slightly tweaking the Actor Output. Don't worry; it's a breeze! Some of the output data might just appear under new titles. > > This change will take place on October 10; please make sure to remap your integrations accordingly. > > Need a hand or have questions? Our support team is just one friendly message away. Inform users about the reason for changes and how the changes impact them and the Actor + give them a date when the change takes effect. ### 3. Actor changing its payment model (from rental to pay-per-result, for example)[​](#3-actor-changing-its-payment-model-from-rental-to-pay-per-result-for-example "Direct link to 3. Actor changing its payment model (from rental to pay-per-result, for example)") Email 1 (before the change, warning about deprecation). > ✉️ 🛎 Changes to Booking Scraper > > Hi, > > We’ve got news regarding the Booking scraper you have been using. This change will happen in two steps: > > 1. On September 22, we will deprecate it, i.e., new users will not be able to find it in Store. You will still be able to use it though. > 2. At the end of October, we will unpublish this Actor, and from that point on, you will not be able to use it anymore. > > Please use this time to change your integrations to our new [Booking Scraper](https://apify.com/voyager/booking-scraper). > > That’s it! If you have any questions or need more information, don’t hesitate to reach out. Warn the users about the deprecation and future unpublishing + add extra information about related Actors if applicable + give them steps and the date when the change takes effect. Email 2 (after the change, warning about unpublishing). > ✉️ **📢 Deprecated Booking Scraper will stop working as announced 📢** > > Hi, > > Just a heads-up: today, the deprecated [Booking Scraper](https://console.apify.com/actors/5T5NTHWpvetjeRo3i/console) you have been using will be completely unpublished as announced, and you will not be able to use it anymore. > > If you want to continue to scrape Booking.com, make sure to switch to the [latest Actor version](https://apify.com/voyager/booking-scraper). > > For any assistance or questions, don't hesitate to reach out to our support team. Remind users to switch to the Actor with a new model. ### 4. After a major issue[​](#4-after-a-major-issue "Direct link to 4. After a major issue") Actor downtime, performance issues, Actor directly influenced by platform hiccups. > ✉️ **🛠️ Update on Google Maps Scraper: fixed and ready to go** > > Hi, > > We've got a quick update on the Google Maps Scraper for you. If you've been running the Actor this week, you might have noticed some hiccups — scraping was failing for certain places, causing retries and overall slowness. > > We apologize for any inconvenience this may have caused you. The **good news is those performance issues are now resolved**. So feel free to resurrect any affected runs using the "latest" build; it should work like a charm now. > > Need a hand or have questions? Feel free to reply to this email. Apologize to users and/or let them know you're working on it or that everything is fixed now. This approach helps maintain trust and reassures users that you're addressing the situation. tip It might be an obvious tip, but if you're not great at emails, just write a short draft and ask ChatGPT to polish it. Play with the style until you find the one that suits you. You can even create templates for each situation.
If ChatGPT is being too wordy, you can ask it to write at 9th or 10th-grade level, and it will use simpler words and sentences. ## Emails vs. newsletters[​](#emails-vs-newsletters "Direct link to Emails vs. newsletters") While sending an email is usually a quick way to address immediate needs or support for your users, newsletters can be a great way to keep everyone in the loop on a regular basis. Instead of reaching out every time something small happens, newsletters let you bundle updates together. Unless it's urgent, it’s better to wait until you have 2 or 3 pieces of news and share them all at once. Even if those updates span across different Actors, it’s perfectly fine to send one newsletter to all relevant users. Here are a few things you can include in your newsletter: * updates or new features for your Actors or Actor-to-Actor Integrations * an invitation to a live webinar or tutorial session * asking your users to upvote your Actor, leave a review or a star * a quick feedback request after introducing new features * spotlighting a helpful blog post or guide you wrote or found * sharing success stories or use cases from other users * announcing a promotion or a limited-time discount * links to your latest YouTube videos or tutorials Newsletters are a great way to keep your users engaged without overwhelming them. Plus, it's an opportunity to build a more personal connection by showing them you’re actively working to improve the tools they rely on. ## Emailing a separate user[​](#emailing-a-separate-user "Direct link to Emailing a separate user") There may be times when you need to reach out to a specific user — whether it’s to address a unique situation, ask a question that doesn’t fit the public forum of the **Issue tab**, or explore a collaboration opportunity. While there isn’t a quick way to do this through Apify Console just yet, you can ensure users can contact you by **adding your email or other contact info to your Store bio**. This makes it easy for them to reach out directly. ✍🏻 Learn best practices on how to use your Store bio to connect with your users [Your Store bio](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/your-store-bio.md). --- # Issues tab **Once you publish your Actor in Apify Store, it opens the door to new users, feedback, and… issue reports. Users can create issues and add comments after trying your Actor. But why is this space so important?** *** ## What is the Issues tab?[​](#what-is-the-issues-tab "Direct link to What is the Issues tab?") The Issues tab is a dedicated section on your Actor’s page where signed-in users can report problems, share feedback, ask questions, and have conversations with you. You can manage each issue thread individually, and the whole thread is visible to everyone. The tab is divided into three categories: **Open**, **Closed**, and **All**, and it shows how long each response has been there. While only signed-in users can post and reply, all visitors can see the interactions, giving your page a transparent and welcoming vibe. Keep active 🕑 On the web, your average 🕑 **Response time** is calculated and shown in your Actor Metrics. The purpose of this metric is to make it easy for potential users to see how active you are and how well-maintained the Actor is. You can view all the issues related to your Actors by going to **Actors** > [**Issues**](https://console.apify.com/actors?tab=issues) in Apify Console. 
Users can get automatic updates on their reported issues or subscribe to issues they are interested in, so they stay informed about any responses. When users report an issue, they’re encouraged to share their run, which helps you get the full context and solve the problem more efficiently. Note that shared runs aren’t visible on the public Actor page. ## What is the Issues tab for?[​](#what-is-the-issues-tab-for "Direct link to What is the Issues tab for?") The tab is a series of conversations between you and your users. There are existing systems like GitHub for that. So why create a separate system like an Issues tab? Since the Issues tab exists both in private space (Console) and public space (Actor's page on the web), it can fulfill two different sets of purposes. ### Issues tab in Apify Console[​](#issues-tab-in-apify-console "Direct link to Issues tab in Apify Console") Originally, the Issues tab was only available in Apify Console, and its main goals were: * Convenience: a single space to hold the communication between you and your users. * Unity and efficiency: make sure multiple users don't submit the same issue through multiple channels or multiple times. * Transparency: make sure users have their issues addressed publicly and professionally. You can’t delete issues, you can only close them, so there's a clear record of what's been resolved and how. * Quality of service and innovation: make sure the Actor gets fixed and continuously improved, and users get the quality scraping services they pay for. ### Issues tab on the web[​](#issues-tab-on-the-web "Direct link to Issues tab on the web") Now that the Issues tab is public and on the web, it also serves other goals: * Credibility: new users can check how active and reliable you are by looking at the issues and your average 🕑 **Response time** even before trying your Actor. It also sets expectations for when to expect a response from you. * Collaboration: developers can learn from each other’s support styles, which motivates everyone to maintain good interactions and keep up good quality work. * SEO boost: every issue now generates its own URL, potentially driving more keyword traffic to your Actor's page ## Example of a well-managed Issues tab[​](#example-of-a-well-managed-issues-tab "Direct link to Example of a well-managed Issues tab") Check out how the team behind the **Apollo.io leads scraper** manages their [Issues tab](https://apify.com/curious_coder/apollo-io-scraper/issues/open) for a great example of professional responses and quick problem-solving. Note that this Actor is a rental, so users expect a high-quality service. ![issues tab example](/assets/images/issues-tab-example-f6201ae99bc15f12f5e04c19857711fa.png) warning Once your Actor is public, you’re required to have an Issues tab. ## SEO for the Issues tab[​](#seo-for-the-issues-tab "Direct link to SEO for the Issues tab") Yes, you read that right! The public Issues tab can boost your search engine visibility. Each issue now has its own URL, which means every report could help your Actor rank for relevant keywords. When we made the tab public, we took inspiration from StackOverflow’s SEO strategy. Even though StackOverflow started as a Q\&A forum, its strong SEO has been key to its success. Similarly, your Actor’s Issues tab can help bring in more traffic, with each question and answer potentially generating more visibility. This makes it easier for users to find solutions quickly. 
## Tips for handling Actor issues[​](#tips-for-handling-actor-issues "Direct link to Tips for handling Actor issues") 1. *Don’t stay silent* Respond quickly, even if it’s just a short note. If an issue takes weeks to resolve, keep the user in the loop. A quick update prevents frustration and shows the user (and others following it) that you’re actively working on solving the issue. 2. *Encourage search to avoid duplication* Save time by encouraging users to search for existing issues before submitting new ones. If a similar issue exists, they can follow that thread for updates instead of creating a new one. 3. *Encourage reporters to be specific* The more context, the better! Ask users to share details about their run, which helps you diagnose issues faster. If needed, remind them that runs are shared privately, so sensitive data won’t be exposed. 4. *Use screenshots and links* The same goes for your side. Screenshots and links to specific runs make your answers much clearer. It’s easier to walk the user through a solution if they can see what you’re referencing. 5. *Structure issue reporting* As you get more experienced, you’ll notice common types of issues: bugs, feature requests, questions, reports, misc. This way, you can prioritize and respond faster based on the category. 6. *Have ready answers for common categories* Once you recognize recurring types of issues, have pre-prepared responses. For example, if it’s a bug report, you might already have a troubleshooting guide you can link to, or if it’s a feature request, you can figure out the development timeline. 7. *Be polite and precise* Politeness goes a long way! Make sure your responses are respectful and straight to the point. It helps to keep things professional, even if the issue seems minor. --- # Your Store bio ## Your Apify Store bio and Store “README”[​](#your-apify-store-bio-and-store-readme "Direct link to Your Apify Store bio and Store “README”") To help our community showcase their talents and projects, we introduced public profile pages for developers. On a dedicated page, you can showcase contact info, a summary of important Actor metrics (like total users, response time, and success rates), and all of their public Actors. We took inspiration from freelance platforms. This space is all about helping you shine and promote your tools and skills. Here’s how you can use it to your advantage: * Share your contact email, website, GitHub, X (Twitter), LinkedIn, or Discord handles. * Summarize what you’ve been doing in Apify Store, your main skills, big achievements, and any relevant experience. * Offer more ways for people to connect with you, such as links for booking a meeting, discounts, a subscription option for your email newsletter, or your YouTube channel or blog. * You can even add a Linktree to keep things neat. * Highlight your other tools on different platforms. * Get creative by adding banners and GIFs to give your profile some personality. Everything is neatly available under a single URL, making it easy to share. Need some inspiration? Check out examples of how others are using their Store bio and README. You can set yours up by heading to **Settings > Account > Profile.** --- # Actor bundles **Learn what an Actor bundle is, explore existing examples, and discover how to promote them.** *** ## What is an Actor bundle?[​](#what-is-an-actor-bundle "Direct link to What is an Actor bundle?") If an Actor is an example of web automation software, what is an Actor bundle? 
An Actor bundle is basically a chain of multiple Actors unified by a common use case. Bundles can include both scrapers and automation tools, and they are usually designed to achieve an overarching goal related to scraping or automation. The concept of an Actor bundle originated from frequent customer requests for comprehensive tools. For example, someone would ask for a Twitter scraper that also performs additional tasks, or for a way to find all profiles of the same public figure across multiple social media platforms without needing to use each platform separately. For example, consider a bundle that scrapes company reviews from multiple platforms, such as Glassdoor, LinkedIn, and Indeed. Typically, you would need to use several different scrapers and then consolidate the results. But this bundle would do it all in one run, once provided with the name of the company. Or consider a bundle that scrapes all posts and comments of a given profile, and then produces a sentiment score for each scraped comment. The main advantage of an Actor bundle is its ease of use. The user inputs a keyword or a URL, and the Actor triggers all the necessary Actors sequentially to achieve the desired result. The user is not expected to use each Actor separately and then process and filter the results themselves. ### Examples of bundles[​](#examples-of-bundles "Direct link to Examples of bundles") 🔍 [Social Media Finder](https://apify.com/tri_angle/social-media-finder) searches for profiles on 13 social media sites provided just the (nick)name. 🍝 [Restaurant Review Aggregator](https://apify.com/tri_angle/restaurant-review-aggregator) gets restaurant reviews from Google Maps, DoorDash, Uber Eats, Yelp, Tripadvisor, and Facebook in one place. 🤔 [Social Media Sentiment Analysis Tool](https://apify.com/tri_angle/social-media-sentiment-analysis-tool) not only collects comments from Facebook, Instagram, and TikTok but also performs sentiment analysis on them. It unites post scrapers, comments scrapers and a text analysis tool. 🦾 [Website Content Crawler + Pinecone bundle](https://apify.com/tri_angle/wcc-pinecone-integration) scrapes a website and stores the data in a Pinecone database to build and improve your own AI chatbot assistant. 🤖 [Pinecone GPT Chatbot](https://apify.com/tri_angle/pinecone-gpt-chatbot) combines OpenAI's GPT models with Pinecone's vector database, which simplifies creating a GPT Chatbot. As you can see, they vary in complexity and range. *** ## Caveats[​](#caveats "Direct link to Caveats") ### Pricing model[​](#pricing-model "Direct link to Pricing model") Since bundles are still relatively experimental, profitability is not guaranteed and will depend heavily on the complexity of the bundle. However, if you have a solid idea for a bundle, don’t hesitate to reach out. Prepare your case, write to our support team, and we’ll help determine if it’s worth it. ### Specifics of bundle promotion[​](#specifics-of-bundle-promotion "Direct link to Specifics of bundle promotion") First of all, when playing with the idea of creating a bundle, always check the keyword potential. Sometimes, there are true keyword gems just waiting to be discovered, with high search volume and little competition. However, bundles may face the challenge of being "top-of-the-funnel" solutions. People might not search for them directly because they don't have a specific keyword in mind. 
For instance, someone is more likely to search for an Instagram comment scraper than imagine a bundle that scrapes comments from 10 different platforms, including Instagram. Additionally, Google tends to favor tools with rather focused descriptions. If your tool offers multiple functions, it can send mixed signals that may conflict with each other rather than accumulate. Sometimes, even though a bundle can be a very innovative tool product-wise, it can be hard to market from an SEO perspective and match the search intent. In such cases, you may need to try different marketing and promotion strategies. Once you’ve exhausted every angle of SEO research, be prepared to explore non-organic marketing channels like Product Hunt, email campaigns, community engagement, Reddit, other social media, your existing customer base, word-of-mouth promotion, etc. Remember, bundles originated as customized solutions for specific use cases - they were not primarily designed to be easily found. This is also an opportunity to tell a story rather than just presenting a tool. Consider writing a blog post about how you created this tool, recording a video, or hosting a live webinar. If you go this route, it’s important to emphasize how the tool was created and what a technical feat it represents. That said, don’t abandon SEO entirely. You can still capture some SEO value by referencing the bundle in the READMEs of the individual Actors that comprise it. For example, if a bundle collects reviews from multiple platforms, potential users are likely to search for review scrapers for each specific platform—Google Maps reviews scraper, Tripadvisor reviews scraper, Booking reviews scraper, etc. These keywords may not lead directly to your review scraping bundle, but they can guide users to the individual scrapers, where you can then present the bundle as a more comprehensive solution. *** ## Resources[​](#resources "Direct link to Resources") Learn more about Actor Bundles: --- # How to create a great input schema Optimizing your input schema. Learn to design and refine your input schema with best practices for a better user experience. *** ## What is an input schema?[​](#what-is-an-input-schema "Direct link to What is an input schema?") So you've succeeded: your user has 1. found your Actor on Google, 2. explored the Actor's landing page, 3. decided to try it, and 4. created an Apify account. Now they’re on your Actor's page in Apify Console. The SEO fight is over. What’s next? Your user is finally one-on-one with your Actor — specifically, its input schema. This is the moment when they try your Actor and decide whether to stick with it. The input schema is your representative here, and you want it to work in your favor. Technically, the input schema is a `JSON` object with various field types supported by the Apify platform, designed to simplify the use of the Actor. Based on the input schema you define, the Apify platform automatically generates a *user interface* for your Actor. Of course, you can create an Actor without setting up an elaborate input schema. If your Actor is designed for users who don't need a good interface (e.g. they’ll use a JSON object and call it via API), you can skip this guide. But most users engage with Actors in Manual mode, aka the Actor interface. So, if your Actor is complex or you’re targeting regular users who need an intuitive interface, it's essential to consider their experience. 
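To make this concrete, here is a minimal sketch of what such a schema file might look like. All the field names and values below (`startUrls`, `maxResults`, `skipClosedPlaces`, the example URL) are made up for illustration; only the structure follows the [input schema format](https://docs.apify.com/platform/actors/development/actor-definition/input-schema). Each property becomes a form field in Apify Console, and keys like `prefill`, `default`, and `sectionCaption` are discussed later in this article:

```json
{
    "title": "Example hotels scraper input",
    "type": "object",
    "schemaVersion": 1,
    "description": "Keep the defaults for a quick test run, or add your own URLs. See the <a href='https://docs.apify.com/platform/actors/development/actor-definition/input-schema' target='_blank'>input schema docs</a> for all options.",
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources",
            "description": "Add one or more URLs of the pages you want to scrape.",
            "prefill": [{ "url": "https://www.example.com/hotels/prague" }]
        },
        "maxResults": {
            "title": "Max results",
            "type": "integer",
            "description": "Limit how many places will be scraped. Keep it low for the first run.",
            "prefill": 20
        },
        "skipClosedPlaces": {
            "title": "Skip closed places",
            "type": "boolean",
            "description": "Enable to filter out places that are permanently closed.",
            "default": false,
            "sectionCaption": "🔍 Filters",
            "sectionDescription": "Optional filters. The default run works without touching anything here."
        }
    },
    "required": ["startUrls"]
}
```

Notice how the prefilled number is small and the optional filter is disabled by default, so a default run stays cheap and predictable, which reflects the recommendations in the rest of this article.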
In this article, *we’ll refer to the input schema as the user interface* of your Actor and focus exclusively on it. Understand input schemas To fully understand the recommendations in this blog post, you’ll first need to familiarize yourself with the [technical aspects of the input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema). This context is essential to make good use of the insights shared here. ## The importance of a good input schema[​](#the-importance-of-a-good-input-schema "Direct link to The importance of a good input schema") It can feel intimidating when facing the Apify platform for the first time. You only have a few seconds for a user to assess the ease of using your Actor. If something goes wrong or is unclear with the input, an ideal user will first turn to the tooltips in the input schema. Next, they might check the README or tutorials, and finally, they’ll reach out to you through the **Issues** tab. However, many users won’t go through all these steps — they may simply get overwhelmed and abandon the tool altogether. A well-designed input schema is all about managing user expectations, reducing cognitive load, and preventing frustration. Ideally, a good input schema, as your first line of interaction, should: * Make the tool as easy to use as possible * Reduce the user’s cognitive load and make them feel confident about using and paying for it * Give users enough information and control to figure things out on their own * Save you time on support by providing clear guidance * Prevent incorrect or harmful tool usage, like overcharges or scraping personal information by default ### Reasons to rework an input schema[​](#reasons-to-rework-an-input-schema "Direct link to Reasons to rework an input schema") * Your Actor is complex and has many input fields * Your Actor offers multiple ways to set up input (by URL, search, profile, etc.) * You’re adding new features to your Actor * Certain uses of the Actor have caveats that need to be communicated immediately * Users frequently ask questions about specific fields 👀 Input schema can be formatted using basic HTML. ## Most important elements of the input schema[​](#most-important-elements-of-the-input-schema "Direct link to Most important elements of the input schema") You can see the full list of elements and their technical characteristics in [Docs](https://docs.apify.com/academy/deploying-your-code/input-schema): titles, tooltips, toggles, prefills, etc. That's not what this guide is about. It's not enough to just create an input schema, you should ideally aim to place and word its elements to the user's advantage: to alleviate the user's cognitive load and make the acquaintance and usage of your tool as smooth as possible. Unfortunately, when it comes to UX, there's only so much you can achieve armed with HTML alone. So here are the best elements to focus on, along with some best practices for using them effectively: * **`description` at the top** * As the first thing users see, the description needs to provide crucial information and a sense of reassurance if things go wrong. Key points to mention: the easiest way to try the Actor, links to a guide, and any disclaimers or other similar Actors to try. ![Input schema description example](/assets/images/description-sshot-4a31a900bc58209d44032f409cf8eed6.png) * Descriptions can include multiple paragraphs. 
If you're adding a link, it’s best to use the `target="_blank"` attribute so your user doesn’t lose the original Actor page when clicking. * **`title` of the field (regular bold text)** * This is the default way to name a field. * Keep it brief. The user’s flow should be 1. title → 2. tooltip → 3. link in the tooltip. Ideally, the title alone should provide enough clarity. However, avoid overloading the title with too much information. Instead, make the title as concise as possible, expand details in the tooltip, and include a link in the tooltip for full instructions. ![Input schema input example](/assets/images/title-sshot-59c5431c3d78f35f398c1c55d930b806.png) * **`prefill`, the default input** * This is your chance to show rather than tell. * Keep the **prefilled number** low. Set it to 0 if it's irrelevant for a default run. * Make the **prefilled text** example simple and easy to remember. * If your Actor accepts various URL formats, add a few different **prefilled URLs** to show that possibility. * Use the **prefilled date** format that the user is expected to follow. This way, they can learn the correct format without needing to check the tooltip. * There’s also a type of field that looks like a prefill but isn’t — usually a `default` field. It’s not counted as actual input but serves as a mock input to show users what to type or paste. It is gray and disappears after clicking on it. Use this to your advantage. * **toggle** * The toggle is a boolean field, which represents a yes/no choice. * So how would you word this toggle: **Skip closed places** or **Scrape open places only**? And should the toggle be enabled or disabled by default? ![Input schema toggle example](/assets/images/toggle-sshot-b27af75e3ef46c83a61ef2bad6670a56.png) * You have to consider this when you're choosing how to word the toggle button and which choice to set as the default. If you make this more complex than needed (e.g. by using negation as the ‘yes’ choice), you're increasing your user's cognitive load. You also might cause them to receive far less, or far more, data than they need from a default run. * In our example, we assume the default user wants to scrape all places but still have the option to filter out closed ones. However, they have to make that choice consciously, so we keep the toggle disabled by default. If the toggle were enabled by default, users might not notice it, leading them to think the tool isn't working properly when it returns fewer results than expected. * **sections or `sectionCaption` (BIG bold text) and `sectionDescription`** * A section looks like a wrapped toggle list. ![Input schema sections example](/assets/images/sections-sshot-fc6cbd06170d0a33c1c9ab909bd8d6d1.png) * It is useful to section off non-default ways of input or extra features. If your tool is complex, don't leave all fields in the first section. Just group them by topic and section them off (see the screenshot above ⬆️). * You can add a description to every section. Use `sectionDescription` only if you need to provide extra information about the section (see the screenshot below ⬇️). * Sometimes `sectionDescription` is used as a space for disclaimers so the user is informed of the risks from the outset instead of having to click on the tooltip. ![Input schema section description example](/assets/images/section-description-sshot-3f2616cb044875c2841e131fe408554c.png) * **tooltips or `description` of the title** * To see the tooltip's text, the user needs to click on the `?` icon.
* This is your space to explain the title and what's going to happen in that field: any terminology, referrals to other fields of the tool, examples that don't fit the prefill, or caveats can be detailed here. Using HTML, you can add links, line breaks, code, and other regular formatting here. Use this space to add links to relevant guides, video tutorials, screenshots, issues, or readme parts if needed. * Wording in titles vs. tooltips. Titles are usually nouns. They have a neutral tone and simply inform on what content this field is accepting (**Usernames**). * Tooltips to those titles are usually verbs in the imperative that tell the user what to do (*Add, enter, use*). * This division is not set in stone, but the reason why the tooltip is an imperative verb is because, if the user is clicking on the tooltip, we assume they are looking for clarifications or instructions on what to do. ![Input schema tooltips example](/assets/images/tooltips-sshot-956de479172bfe492e0e8b98a06e6e01.png) * emojis (visual component) * Use them to attract attention or as visual shortcuts. Use emojis consistently to invoke a user's iconic memory. The visual language should match across the whole input schema (and README) so the user can understand what section or field is referred to without reading the whole title. * Don't overload the schema with emojis. They attract attention, so you need to use them sparingly. tip Read more on the use of emojis: \[Actors and emojis] ## Example of an improved input schema[​](#example-of-an-improved-input-schema "Direct link to Example of an improved input schema") 1. A well-used `description` space. The description briefly introduces possible scraping options, visual language (sections represented by emojis), the easiest way to try the tool, and a link to a tutorial in case of issues. The description isn't too long, uses different formatting, and looks reassuring. 2. The main section is introduced and visually separated from the rest. This is the space for the user to try the first run before they can discover the other options. 3. The title says right away that this field refers to multiple other fields, not only the first section. 4. `prefill` is a small number (so in case users run the tool with default settings, it doesn't take too long and isn't expensive for them) and uses the language of the target website (not results or posts, *videos*). 5. The tooltip expands with more details and refers to other sections it's applicable to using matching emojis. 6. Section names are short. Sections are grouped by content type. 7. More technical parameters lack emojis. They are formatted this way to attract less attention and visually inform the user that this section is the most optional to set. 8. Visual language is unified across the whole input schema. Emojis are used as a shortcut for the user to understand what section or field is referred to without actually reading the whole title. ![Input schema example](/assets/images/improved-input-schema-example-193dcc1c44cbcc8db6016ced168d8dc5.png) ### Example of a worse input schema[​](#example-of-a-worse-input-schema "Direct link to Example of a worse input schema") The version above was the improved input schema. Here's what this tool's input schema looked like before: 1. Brief and dry description, with little value for the user, easy to miss. Most likely, the user already knows this info because what this Actor does is described in the Actor SEO description, description, and README. 2. 
The field title is wordy and reads a bit techie: it uses terminology that's not the most accurate for the target website (*posts*) and limiting terms (*max*). The field applies to scraping by hashtags (field above) and by profile (section below), which is an easy detail to miss. 3. The prefilled number is too high. If the user runs the Actor with default settings, they might spend a lot of money, and it will take some time. Users often just leave if an Actor takes a long time to complete on the first try. 4. The tooltip simply reiterates what is said in the title. This could have been avoided if the language of the title weren't so complex. 5. Merging two possible input types into one (profiles and URLs) can cause confusion. It's verbose and reminds the user about an unrelated field (hashtags). 6. This section refers to profiles but is separate, so the user has to make extra effort to scrape profiles. They have to move across three sections (Max posts from section 1, Profiles input from section 2, and Date sorting filters from section 3). 7. The proxy and browser section invites users to explore it even though it's not needed for a default run. It's more technical to set up and can give the impression that you need to know how to configure it for the tool to work. ![Input schema example](/assets/images/worse-input-schema-f6354139a96611112dbeb1f9882ab2e9.png) ## Best practices[​](#best-practices "Direct link to Best practices") 1. Keep it short. Don’t rely too much on text - most users prefer to read as little as possible. 2. Use formatting to your advantage (bold, italic, underline), links, and breaks to highlight key points. 3. Use specific terminology (e.g., posts, images, tweets) from the target website instead of generic terms like "results" or "pages." 4. Group related items for clarity and ease of use. 5. Use emojis as shortcuts and visual anchors to guide attention. 6. Avoid technical jargon — keep the language simple. 7. Minimize cognitive load wherever possible. ## Signs and tools for improving input schema[​](#signs-and-tools-for-improving-input-schema "Direct link to Signs and tools for improving input schema") * *User feedback*. If they're asking obvious things, complaining, or consistently making silly mistakes with input, take notes. Feedback from users can help you understand their experience and identify areas for improvement. * *High churn rates*. If your users are trying your tool but quickly abandon it, this is a sign they are having difficulties with your schema. * *Input Schema Viewer*. Write your base schema in any code editor, then copy the file and paste it into the [Input Schema Viewer](https://console.apify.com/actors/UHTe5Bcb4OUEkeahZ/source). This tool helps you visualize your input schema before you add it to your Actor and build it. Seeing how your edits look in Apify Console right away will make editing the fields in code easier. ## Resources[​](#resources "Direct link to Resources") * Basics of input schema: * Specifications of input schema: --- # Blogs and blog resources **Blogs remain a powerful tool for promoting your Actors and establishing authority in the field. With social media, SEO, and other platforms, you might wonder if blogging is still relevant. The answer is a big yes. Writing blog posts can help you engage your users, share expertise, and drive organic traffic to your Actor.** ## Why blogs still matter[​](#why-blogs-still-matter "Direct link to Why blogs still matter") 1. SEO.
Blog posts are great for boosting your Actor’s search engine ranking. Well-written content with relevant keywords can attract users searching for web scraping or automation solutions. For example, a blog about “how to scrape social media profiles” could drive people to your Actor who might not find it on Google otherwise. 2. Establishing authority. When you write thoughtful, well-researched blog posts, you position yourself as an expert in your niche. This builds trust and makes it more likely users will adopt your Actors. 3. Long-form content. Blogs give you the space to explain the value of your Actor in-depth. This is especially useful for complex tools that need more context than what can fit into a README or product description. 4. Driving traffic. Blog posts can be shared across social media, linked in webinars, and included in your Actor’s README. This creates multiple avenues for potential users to discover your Actor. ## Good topics for blog posts[​](#good-topics-for-blog-posts "Direct link to Good topics for blog posts") 1. Problem-solving guides. Write about the specific problems your Actor solves. For example, if you’ve created an Actor that scrapes e-commerce reviews, write a post titled "How to automate e-commerce review scraping in 5 minutes". Focus on the pain points your tool alleviates. 2. Actor use cases. Show real-world examples of how your Actor can be applied. These can be case studies or hypothetical scenarios like "Using web scraping to track competitor pricing." 3. Tutorials and step-by-step guides. Tutorials showing how to use your Actor or similar tools are always helpful. Step-by-step guides make it easier for beginners to start using your Actor with minimal hassle. 4. Trends. If you’ve noticed emerging trends in web scraping or automation, write about them. Tie your Actor into these trends to highlight its relevance. 5. Feature announcements or updates. Have you recently added new features to your Actor? Write a blog post explaining how these features work and what makes them valuable. 🪄 These days, blog posts always need to be written with SEO in mind. Yeah, it's annoying to use keywords, but think of it this way: even if there's the most interesting customer story and amazing programming insights, but nobody can find it, it won't have the impact you want. Do try to optimize your posts with relevant keywords and phrases — across text, structure, and even images — to ensure they reach your target audience. *** ## Factors to consider when writing a blog[​](#factors-to-consider-when-writing-a-blog "Direct link to Factors to consider when writing a blog") 1. Audience. Know your target audience. Are they developers, small business owners, or data analysts? Tailor your writing to match their technical level and needs. 2. SEO. Incorporate relevant keywords naturally throughout your post. Don’t overstuff your content, but make sure it ranks for search queries like "web scraping tools", "automation solutions", or "how to scrape LinkedIn profiles". Remember to include keywords in H2 and H3 headings. 3. Clarity and simplicity. Avoid jargon, especially if your target audience includes non-technical users. Use simple language to explain how your Actor works and why it’s beneficial. 4. Visuals. Include screenshots, GIFs, or even videos to demonstrate your Actor’s functionality. Visual content makes your blog more engaging and easier to follow. 5. Call to action (CTA). Always end your blog with a clear CTA. 
Whether it’s "try our Actor today" or "download the demo", guide your readers to the next step. 6. Engage with comments. If readers leave comments or questions, engage with them. Answer their queries and use the feedback to improve both your blog and Actor. *** ## Best places to publish blogs[​](#best-places-to-publish-blogs "Direct link to Best places to publish blogs") There are a variety of platforms where you can publish your blog posts to reach the right audience: 1. [Dev.to](http://dev.to/): It's a developer-friendly platform where technical content gets a lot of visibility, and a great place to publish how-to guides, tutorials, and technical breakdowns of your Actor. 2. Medium: Allows you to reach a broader, less technical audience. It’s also good for writing about general topics like automation trends or how to improve data scraping practices. 3. ScrapeDiary: Run by Apify, [scrapediary.com](http://scrapediary.com) is a blog specifically geared toward Apify community devs and web scraping topics. Publishing here is a great way to reach users already interested in scraping and automation. Contact us if you want to publish a blog post there. 4. Personal blogs or company websites. If you have your own blog or a company site, post there. It’s the most direct way to control your content and engage your established audience. *** ## Not-so-obvious SEO tips for blog posts[​](#not-so-obvious-seo-tips-for-blog-posts "Direct link to Not-so-obvious SEO tips for blog posts") Everybody knows you should include keywords wherever it looks natural. Some people know the structure of the blog post should be hierarchical and follow an H1 - H2 - H3 - H4 structure with only one possible H1. Here are some unobvious SEO tips for writing a blog post that can help boost its visibility and ranking potential: ### 1. Keep URL length concise and strategic[​](#1-keep-url-length-concise-and-strategic "Direct link to 1. Keep URL length concise and strategic") Optimal length. Keep your URL short and descriptive. URLs between 50-60 characters perform best, so aim for 3-4 words. Avoid unnecessary words like "and", "of", or long prepositions. Include keywords. Ensure your primary keyword is naturally integrated into the URL. This signals relevance to both users and search engines. Avoid dates. Don’t include dates or numbers in the URL to keep the content evergreen, as dates can make the post seem outdated over time. ### 2. Feature a video at the top of the post[​](#2-feature-a-video-at-the-top-of-the-post "Direct link to 2. Feature a video at the top of the post") Engagement boost. Videos significantly increase the time users spend on a page, positively influencing SEO rankings. Blog posts with videos in them generally do better SEO-wise. Thumbnail optimization. Use an optimized thumbnail with a clear title and engaging image to increase click-through rates. ### 3. Alt text for images with a keyword focus[​](#3-alt-text-for-images-with-a-keyword-focus "Direct link to 3. Alt text for images with a keyword focus") Descriptive alt text. Include a short, descriptive alt text for every image with one or two keywords where it makes sense. This also improves accessibility. Optimize file names. Name your images with SEO-friendly keywords before uploading (e.g., "web-scraping-tools.png" rather than "IMG12345\_screenshot1.png"). This helps search engines understand the content of your images. File format and size. Use web-optimized formats like WebP or compressed JPEGs/PNGs to ensure fast page loading, which is a key SEO factor. 
Lazy loading images. Use lazy loading to only load images when the user scrolls to them, reducing initial page load times, which can help your SEO ranking. ### 4. Interlinking for better user experience and SEO[​](#4-interlinking-for-better-user-experience-and-seo "Direct link to 4. Interlinking for better user experience and SEO") Internal links. Use contextual links to other relevant blog posts or product pages on your site. This not only helps with SEO but also keeps users engaged longer on your site, reducing bounce rates. Anchor text. When linking internally, use keyword-rich anchor text that describes what users will find on the linked page. Content depth. By interlinking, you can show Google that your site has a strong internal structure and is a hub of related, authoritative content. ### 5. Target the 'People Also Ask' section of Google results with an FAQ[​](#5-target-the-people-also-ask-section-of-google-results-with-an-faq "Direct link to 5. Target the 'People Also Ask' section of Google results with an FAQ") Answer common questions. Including an FAQ section that answers questions people search for can help you rank in the "People Also Ask" section of Google. Research questions that come up in this feature related to your topic and address them in your content. Provide clear, concise answers to the FAQs, typically between 40-60 words, since these match the format used in "People Also Ask". Don't bother using FAQ schema. Google doesn't react to those anymore unless you’re a .gov or .edu domain. ### 6. Optimize for readability and structure[​](#6-optimize-for-readability-and-structure "Direct link to 6. Optimize for readability and structure") Short paragraphs and subheadings. Make your blog post easy to scan by using short paragraphs and meaningful subheadings that contain keywords. Bullet points and lists. Include bullet points and numbered lists to break up content and make it more digestible. Search engines prioritize well-structured content. Readability tools. Use tools like Hemingway Editor or Grammarly to improve readability. Content that is easy to read tends to rank higher, as it keeps readers engaged. ## Referring to blogs in your Actor’s ecosystem[​](#referring-to-blogs-in-your-actors-ecosystem "Direct link to Referring to blogs in your Actor’s ecosystem") To drive traffic to your blog and keep users engaged, reference your blog posts across various touchpoints: 1. README. Add links to your blog posts in your Actor’s README. If you’ve written a tutorial or feature guide, include it under a "Further reading" section. 2. Input schema. Use your input schema to link to blog posts. For instance, if a certain field in your Actor has complex configurations, add a link to a blog post that explains how to use it. 3. YouTube videos. If you’ve created tutorial videos about your Actor, link them in your blog and vice versa. Cross-promoting these assets will increase your overall engagement. 4. Webinars and live streams. Mention your blog posts during webinars, especially if you’re covering a topic that’s closely related. Include the links in follow-up emails after the event. 5. Social media. Share your blog posts on Twitter, LinkedIn, or other social platforms. Include snippets or key takeaways to entice users to click through. 🔄 Remember, you can always turn your blog into a video and vice versa. You can also use parts of blog posts for social media promotion. 
## Additional tips for blog success[​](#additional-tips-for-blog-success "Direct link to Additional tips for blog success") 1. Consistency. Regular posting helps build an audience and makes sure you keep at it. Try to stick to a consistent schedule, whether it’s weekly, bi-weekly, or monthly. As Woody Allen said, “80 percent of success is showing up”. 2. Guest blogging. Reach out to other blogs or platforms like [Dev.to](http://dev.to/) for guest blogging opportunities. This helps you tap into new audiences. 3. Repurpose content. Once you’ve written a blog post, repurpose it. Turn it into a YouTube video, break it down into social media posts, or use it as the base for a webinar. 4. Monitor performance. Use analytics to track how your blog is performing. Are people reading it? Is it driving traffic to your Actor? What keywords is it ranking for? Who are your competitors? Use this data to refine your content strategy. --- # Parasite SEO **Do you want to attract more users to your Actors? Consider parasite SEO, a non-conventional method of ranking that leverages third-party sites.** *** Here’s a full definition, from Authority Hackers: > Parasite SEO involves publishing a quality piece of content on an established, high-authority external site to rank on search engines. This gives you the benefit of the host’s high traffic, boosting your chances for leads and successful conversions. These high DR websites have a lot of authority and trust in the eyes of Google As you can see, you’re leveraging the existing authority of a third-party site where you can publish content promoting your Actors, and the content should rank better and faster as you publish it on an established site. You can do parasite SEO for free, but you can also pay for guest posts on high-authority sites to post your articles promoting the Actors. Let’s keep things simple and practical for this guide, so you can start immediately. We will cover only the free options, which should give you enough exposure to get started. If you want to learn more, we recommend the following reading about parasite SEO: * [Authority Hackers](https://www.authorityhacker.com/parasite-seo/) * [Ahrefs](https://ahrefs.com/blog/parasite-seo/) In this guide, we will cover the following sites that you can use for parasite SEO for free: * Medium * LinkedIn * Reddit * Quora ## Medium[​](#medium "Direct link to Medium") You probably know [Medium](https://medium.com/). But you might not know that Google quite likes Medium, and you have a good chance of ranking high in Google with articles you publish there. 1. You need a Medium account. It’s free and easy to create. 2. Now, you need to do keyword research. Go to [Ahrefs Free Keyword Generator](https://ahrefs.com/keyword-generator/?country=us), enter your main keyword (e.g. Airbnb scraper), and check what keyword has the highest search volume. 3. Search for that keyword in Google. Use incognito mode and a US VPN if you can. Analyze the results and check what type of content you need to create. Is it a how-to guide on how to create an Airbnb scraper? Or is it a list of the best Airbnb scrapers? Or perhaps it’s a review or just a landing page. 4. Now, you should have a good idea of the article you have to write. Write the article and try to mimic the structure of the first results. 5. Once you’re done with the article, don’t forget to include a few calls to action linking to your Actor on Apify Store. Don’t be too pushy, but mention all the benefits of your Actor. 6. Publish the article. 
Make sure your title and URL have the main keyword and that the main keyword is also in the first paragraph of the article. Also, try to use relevant tags for your Actor. ## LinkedIn Pulse[​](#linkedin-pulse "Direct link to LinkedIn Pulse") LinkedIn Pulse is similar to Medium, so we won’t go into too much detail. The entire process is the same as with Medium; only the way you publish the article differs. [Here is a full guide](https://www.linkedin.com/pulse/how-publish-content-linkedin-pulse-hamza-sarfraz/) for publishing your article on LinkedIn Pulse. ## Reddit[​](#reddit "Direct link to Reddit") 1. You need a Reddit account that you can use to comment in relevant subreddits. 2. Go to Google and perform this search: `site:reddit.com <your topic>`, where you replace `<your topic>` with the main topic of your Actor. 3. Now, list the relevant Reddit threads that Google gives you. For an Airbnb scraper, this might be a good thread: [Has anybody have an latest Airbnb scraper code?](https://www.reddit.com/r/webscraping/comments/m650ol/has_anybody_have_an_latest_airbnb_scraper_code/) 4. To prioritize threads from the list, you can check the traffic they get from Google in [Ahrefs Traffic Checker](https://ahrefs.com/traffic-checker). Just paste the URL, and the tool will give you a traffic estimate. You can use this number to prioritize your list. If the volume exceeds 10, the thread usually has some traffic potential. 5. Now, the last step is to craft a helpful comment that also promotes your Actor. Try to do that subtly. People on Reddit usually don’t like people who promote their stuff, but you should be fine if you’re being genuinely helpful. ## Quora[​](#quora "Direct link to Quora") Quora is similar to Reddit, so again we won’t go into too much detail. The entire process is the same. You just have to use a different search phrase in Google, which is `site:quora.com <your topic>`. --- # Product Hunt Product Hunt is one of the best platforms for introducing new tools, especially in the tech community. It attracts a crowd of early adopters, startup enthusiasts, and developers eager to discover the latest innovations. Even [Apify itself](https://www.producthunt.com/products/apify) was on Product Hunt. If you're looking to build awareness and generate short-term traffic, Product Hunt can be a powerful tool in your marketing strategy. It's a chance to attract a wide audience, including developers, startups, and businesses looking for automation. If your Actor solves a common problem, automates a tedious process, or enhances productivity, it's a perfect candidate for Product Hunt. Product Hunt is also great for tools with a strong visual component or demo potential. If you can show the value of your Actor in action, you’re more likely to grab attention and drive engagement. *** ## How to promote your Actor on Product Hunt[​](#how-to-promote-your-actor-on-product-hunt "Direct link to How to promote your Actor on Product Hunt") ### Create a compelling launch[​](#create-a-compelling-launch "Direct link to Create a compelling launch") Launching your Actor on Product Hunt requires thoughtful planning. Start by creating a product page that clearly explains what your Actor does and why it’s valuable. You’ll need: * *A catchy tagline*. Keep it short and to the point. Think of something that captures your Actor's essence in just a few words. * *Eye-catching visuals*. Screenshots, GIFs, or short videos that demonstrate your Actor in action are essential. Show users what they’ll get, how it works, and why it’s awesome. * *Concise description*.
Write a brief description of what your Actor does, who it’s for, and the problem it solves. Use plain language to appeal to a wide audience, even if they aren’t developers. * *Demo video*. A short video that shows how your Actor works in a real-life scenario will resonate with potential users. Once your page is set up, you’ll need to choose the right day to launch. Product Hunt is most active on weekdays, with Tuesday and Wednesday being the most popular launch days. Avoid launching on weekends or holidays when traffic is lower. ### Build momentum before launch[​](#build-momentum-before-launch "Direct link to Build momentum before launch") Start building awareness before your launch day. This is where your social media channels and community engagement come into play. Share teasers about your upcoming Product Hunt launch on Twitter (X), Discord, LinkedIn, and even StackOverflow, where other developers might take an interest. Highlight key features or the problems your Actor solves. If you have a mailing list, give your subscribers a heads-up about your launch date. Encourage them to visit Product Hunt and support your launch by upvoting and commenting. This pre-launch activity helps create early momentum on launch day. ### Timing your launch[​](#timing-your-launch "Direct link to Timing your launch") The timing of your Product Hunt launch matters a lot. Since Product Hunt operates on a daily ranking system, getting in early gives your product more time to gain votes. Aim to launch between 12:01 AM and 2:00 AM PST, as this will give your product a full day to collect upvotes. Once you’ve launched, be ready to engage with the community throughout the day. Respond to comments, answer questions, and thank users for their support. Product Hunt users appreciate creators who are active and communicative, and this can help drive more visibility for your Actor. ### Engage with your audience[​](#engage-with-your-audience "Direct link to Engage with your audience") The first few hours after your launch are crucial for gaining traction. Engage with users who comment on your product page, answer any questions, and address any concerns they might have. The more interaction you generate, the more likely you are to climb the daily rankings. Be transparent and friendly in your responses. If users point out potential improvements or bugs, acknowledge them and make a commitment to improve your Actor. Product Hunt users are often open to giving feedback, and this can help you iterate on your product quickly. If possible, have team members or collaborators available to help respond to comments. The more responsive and helpful you are, the better the overall experience will be for users checking out your Actor. Leverage Apify: you can also give a shout-out to Apify. This way, your Actor will also get noticed by the Apify community on Product Hunt. ## Expectations and results[​](#expectations-and-results "Direct link to Expectations and results") Launching on Product Hunt can provide a massive spike in short-term traffic and visibility. However, it’s important to manage your expectations. Not every launch will result in hundreds of upvotes or immediate sales. Here’s what you can realistically expect: * *Short-term traffic boost*. Your Actor might see a surge in visitors, especially on the day of the launch. If your Actor resonates with users, this traffic may extend for a few more days. * *Potential long-term benefits*. 
While the short-term traffic is exciting, the long-term value lies in the relationships you build with early users. Some of them may convert into paying customers or become advocates for your Actor. * *SEO boost*. Product Hunt is a high-authority site with a 91 [domain rating](https://help.ahrefs.com/en/articles/1409408-what-is-domain-rating-dr). Having your product listed can provide an SEO boost and help your Actor's page rank higher in search engines. * *User feedback*. Product Hunt is a great place to gather feedback. Users may point out bugs, request features, or suggest improvements. ## Tricks for a successful launch[​](#tricks-for-a-successful-launch "Direct link to Tricks for a successful launch") 1. *Leverage your network*. Ask friends, colleagues, and early users to support your launch. Ask the Apify community. Ask your users. Encourage them to upvote, comment, and share your product on social media. 2. *Prepare for feedback*. Product Hunt users can be critical, but this is an opportunity to gather valuable insights. Be open to suggestions and use them to improve your Actor. 3. *Use a consistent brand voice*. Make sure your messaging is consistent across all platforms when you're responding to comments and promoting your launch on social media. 4. *Offer a special launch deal*. Incentivize users to try your Actor by offering a discount or exclusive access for Product Hunt users. This can drive early adoption and build momentum. ## Caveats to Product Hunt promotion[​](#caveats-to-product-hunt-promotion "Direct link to Caveats to Product Hunt promotion") * *Not every Actor is a good fit*. Product Hunt is best for tools with broad appeal or innovative features. If your Actor is highly specialized or niche, it may not perform as well. * *High competition*. Product Hunt is a popular platform, and your Actor will be competing with many other launches. A strong marketing strategy is essential to stand out. * *Short-term focus*. While the traffic spike is great, Product Hunt tends to focus on short-term visibility. To maintain long-term growth, you’ll need to continue promoting your Actor through other channels. --- # SEO SEO means optimizing your content to rank high for your target queries in search engines such as Google, Bing, etc. SEO is a great way to get more users for your Actors. It’s also free, and it can bring you traffic for years. This guide will give you a simple framework to rank better for your targeted queries. ## Search intent[​](#search-intent "Direct link to Search intent") Matching the search intent of potential users is super important when creating your Actor's README. The information you include should directly address the problems or needs that led users to search for a solution like yours. For example: * *User goals*: What are users trying to accomplish? * *Pain points*: What challenges are they facing? * *Specific use cases*: How might they use your Actor? Make sure your README demonstrates how your Actor aligns with the search intent. This alignment helps users quickly recognize your Actor's value and helps Google understand your Actor and rank you better. *Example:* Let’s say you want to create a “YouTube Hashtag Scraper” Actor. After you search YouTube HashTag Scraper, you see that most people searching for it want to extract hashtags from YouTube videos, not download videos using a certain hashtag. ## Keyword research[​](#keyword-research "Direct link to Keyword research") Keyword research is a very important part of your SEO success. 
Without that, you won’t know which keywords you should target with your Actor, and you might be leaving traffic on the table by not targeting all the angles or targeting the wrong one. We will do keyword research with free tools, but if you want to take this seriously, we highly recommend [Ahrefs](https://ahrefs.com/). ### Google autocomplete suggestions[​](#google-autocomplete-suggestions "Direct link to Google autocomplete suggestions") Start by typing your Actor's main function or purpose into Google. As you type, Google will suggest popular search terms. These suggestions are based on common user queries and can provide insight into what your potential users are searching for. *Example:* Let's say you've created an Actor for scraping product reviews. Type "product review scraper" into Google and note the suggestions: * product review scraper free * product review scraper amazon * product review scraper python * product review scraper api These suggestions reveal potential features or use cases to highlight in your README. ### Alphabet soup method[​](#alphabet-soup-method "Direct link to Alphabet soup method") This technique is similar to the previous one, but it involves adding each letter of the alphabet after your main keyword to discover more specific and long-tail keywords. *Example*: Continue with "product review scraper" and add each letter of the alphabet: * product review scraper a (autocomplete might suggest "api") * product review scraper b (might suggest "best") * product review scraper c (might suggest "chrome extension") ...and so on through the alphabet. ### People Also Ask[​](#people-also-ask "Direct link to People Also Ask") Search for your Actor's main function or purpose on Google. Scroll down to find the "People Also Ask" section, which contains related questions. *Example*: For a "product review scraper" Actor: * How do I scrape product reviews? * Is it legal to scrape product reviews? * What is the best tool for scraping reviews? * How can I automate product review collection? Now, you can expand the “People Also Ask” questions. Click on each question to reveal the answer and generate more related questions you can use in your README. ### Google Keyword Planner[​](#google-keyword-planner "Direct link to Google Keyword Planner") Another way to collect more keywords is to use the official Google Keyword Planner. Go to [Google Keyword Planner](https://ads.google.com/home/tools/keyword-planner/) and open the tool. You need a Google Ads account, so just create one for free if you don’t have one already. After you’re in the tool, click on “Discover new keywords”, make sure you’re in the “Start with keywords” tab, enter your Actor's main function or purpose, and then select the United States as the region and English as the language. Click “Get results” to see keywords related to your actor. Write them down. ### Ahrefs Keyword Generator[​](#ahrefs-keyword-generator "Direct link to Ahrefs Keyword Generator") Go to [Ahrefs Keyword Generator](https://ahrefs.com/keyword-generator), enter your Actor's main function or purpose, and click “Find keywords.” You should see a list of keywords related to your actor. Write them down. ## What to do with the keywords[​](#what-to-do-with-the-keywords "Direct link to What to do with the keywords") First, remove any duplicates that you might have on your list. You can use an online tool [like this one](https://dedupelist.com/) for that. After that, we need to get search volumes for your keywords. 
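If you'd rather script this clean-up step than use an online tool, here's a minimal Node.js sketch (the keywords and volumes are made-up placeholders) that dedupes a keyword list and, once you've filled in the volumes, sorts it by search volume:

```javascript
// Minimal sketch: dedupe a keyword list and sort it by search volume.
// The keywords and volumes below are made-up placeholders - use your own research data.
const keywords = [
  { keyword: 'product review scraper', volume: 1900 },
  { keyword: 'Product review scraper ', volume: 1900 }, // duplicate after normalization
  { keyword: 'product review scraper free', volume: 400 },
  { keyword: 'product review scraper api', volume: 250 },
];

// Keep one entry per normalized keyword, then sort by volume, highest first.
const deduped = [
  ...new Map(keywords.map((k) => [k.keyword.trim().toLowerCase(), k])).values(),
];
deduped.sort((a, b) => b.volume - a.volume);

console.table(deduped);
```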
Put all your keywords in a spreadsheet, with one column being the keyword and the second one being the search volume. Go to the [Keyword Tool](https://backlinko.com/tools/keyword), enter the keyword, and write down the search volume. You will also see other related keywords, so you might as well write them down if you don’t have them on your list yet. At the end, you should have a list of keywords together with their search volumes that you can use to prioritize the keywords, use the keywords to name your Actor, choose the URL, etc. ### Headings[​](#headings "Direct link to Headings") If it makes sense, consider using keywords with the biggest search volume and the most relevant for your Actor as H2 headings in your README. Put the most relevant keyword at the beginning of the heading when possible. Also, remember to use a clear hierarchy. The main features are H2, sub-features are H3, etc. ### Content[​](#content "Direct link to Content") When putting keywords in your Actor’s README, it's important to maintain a natural, informative tone. Your primary goal should be to create valuable, easily understandable content for your users. Aim to use your most important keyword in the first paragraph of your README. This helps both search engines and users quickly understand what your Actor does. But avoid forcing keywords where they don't fit naturally. In your content, you can use the keywords you gathered before where they make sense. We want to include those keywords naturally in your README. If there are relevant questions in your keyword list, you can always cover them within an “FAQ” section of your Actor. Remember that while including keywords is important, always prioritize readability and user experience. Your content should flow naturally and provide real value to the reader. ## Learn more about SEO[​](#learn-more-about-seo "Direct link to Learn more about SEO") If you want to learn more about SEO, these two free courses will get you started: * [SEO Course for Beginners](https://ahrefs.com/academy/seo-training-course) by Ahrefs * [SEO Courses](https://www.semrush.com/academy/courses/seo/) by Semrush The [Ahrefs YouTube channel](https://www.youtube.com/@AhrefsCom/featured) is also a great resource. You can start with [this video](https://www.youtube.com/watch?v=xsVTqzratPs). --- # Social media **Social media is a powerful way to connect with your Actor users and potential users. Whether your tool focuses on web scraping or automation, social platforms can help you showcase its features, answer user questions, and grow your audience. This guide will show you how to use social media effectively, what to share, and how to avoid common mistakes along the way.** Now, before we start listing social media platforms, it might be important to acknowledge something. Developers are notorious for not using social media that much. Or they use social media exclusively in the context of their own interests: that won’t find them new users, but rather colleagues or collaborators. That's a good start, and maybe it's enough. A developer that can also “do” social media is a unicorn. These are super rare. And if you want to really promote your Actor, you'll need to become that unicorn. Before we start, you need to understand the benefits of this activity. *** ## Why be active on social media[​](#why-be-active-on-social-media "Direct link to Why be active on social media") Engaging with your users on social media offers a lot of benefits beyond just promoting your Actor. 
Let’s look at some of the main reasons why being active online can be a game-changer for your Actor’s success: 1. Social platforms make it easy to gather real-time feedback and provide support in real time. You can quickly learn what users love, what they struggle with, and what features they’d like to see. This can guide your Actor’s future development. It also allows you to build trust and credibility with your audience. 2. Shot in the dark: social media exposes your Actor to new users who might not find you through search engines alone. A shared post or retweet can dramatically expand your reach, helping you grow your user base. 3. Consistent activity on social platforms creates more backlinks to your Actor’s page, which can improve its search engine ranking and drive organic traffic. ## Where to engage: Choosing the right platforms[​](#where-to-engage-choosing-the-right-platforms "Direct link to Where to engage: Choosing the right platforms") Choosing the right platforms is key to reaching your target audience. Here's a breakdown of the best places for developers to promote their web scraping and automation tools: * *Discord*: We started with an easy one. Create a community around your Actor to engage with users directly. Offering quick support and discussing the features of your Actor in a real-time chat setting can lead to deeper user engagement. Use Apify's Discord: you can also promote your tools through [Apify's Discord](https://discord.com/invite/crawlee-apify-801163717915574323). * *Twitter (X)*: Good for short updates, feature announcements, and quick interactions with users. The tech community on Twitter is very active, which makes it a great spot for sharing tips and getting noticed. * *Reddit*: In theory, subreddits like r/webscraping, r/automation, and r/programming allow you to share expertise, engage in discussions, and present your Actor as a solution. However, in reality, you have to be quite careful with promotion there. Be very mindful of subreddit rules to avoid spamming or over-promoting. For Reddit, personal stories on how you built the tool + a roadblock you might be facing right now are the safest formula. If a tool is already finished and perfected, it will be treated as promotional content. But if you're asking for advice - now that's a community activity. * *TikTok*: Might not be an obvious choice, but that’s where most young people spend time. They discuss a myriad of topics, laugh at the newest memes, and create trends that take weeks to get to Reels and Shorts. If you want to create educational, fun, short video content (and be among the first to talk about web scraping), this is your place for experiments and taking algorithm guesses. * *YouTube*: Ideal for tutorials and demos. A visual walk-through of how to use your Actor can attract users who prefer watching videos to reading tutorials or READMEs. It's also good for Shorts and short, funny content. * *StackOverflow*: While not a traditional social media platform, StackOverflow is a great space to answer technical questions and demonstrate your expertise. Offering help related to web scraping or automation can build credibility, and you can subtly mention your Actor if it directly solves the issue (as long as it adheres to community guidelines). * *LinkedIn*: If your Actor solves problems for professionals or automates business tasks, LinkedIn is the place to explain how your tool provides value to an industry or business. 
*** ## Best practices for promoting your Actor on social media[​](#best-practices-for-promoting-your-actor-on-social-media "Direct link to Best practices for promoting your Actor on social media") Now that you know where to engage and why it’s important, here are some best practices to help you make the most of social media: 1. *Offer value beyond promotion*: If you look around, you'll see that the golden rule of social media these days is to educate and entertain. Focus on sharing useful information related to your Actor. Post tips on automation, web scraping techniques, or industry insights that can help your audience. When you do promote your Actor, users will see it as part of a valuable exchange, not just an ad. Besides, constantly posting promotional content turns anybody off. 2. *Post consistently*: The most important rule for social media is to show up. Whether it’s a weekly post about new features or daily tips for using your Actor more effectively, maintaining a regular posting schedule keeps your audience connected. 3. *Visuals matter*: Screenshots, GIFs, and short videos can explain more than text ever could. Show users how your Actor works, the results it scrapes, or how automation saves time. 4. *Widen your reach*: Web scraping is a niche topic. Find ways to talk about it more widely. If you stumble upon ways to relate it to wider topics: news, science, research, even politics and art, use it. Or you can go more technical and talk about various libraries and languages you can use to build it. 5. *Use relevant hashtags*: Hashtags like #webscraping, #automation, #programming, and #IT help you reach a wider audience on platforms like Twitter and TikTok. Stick to a few relevant hashtags per post to avoid clutter. 6. *Engage actively*: Social media is a two-way street. Reply to comments, thank users for sharing your content, create stitches, and answer questions. Building relationships with your users helps foster loyalty and builds a sense of community around your Actor. 7. *Use polls and Q\&As*: Interactive content like polls or Q\&A sessions can drive engagement. Ask users what features they’d like to see next or run a live Q\&A to answer questions about using your Actor. These tools encourage participation and provide valuable insights. 8. *Collaborate with other creators*. ## Caveats to social media engagement[​](#caveats-to-social-media-engagement "Direct link to Caveats to social media engagement") 1. *Over-promotion*: Constantly pushing your Actor without offering value can turn users away. Balance your promotional content with educational posts, interesting links, or insights into the development process. Users are more likely to engage when they feel like they’re learning something, rather than just being sold to. 2. *Handling negative feedback*: Social media is a public forum, and not all feedback will be positive. Be prepared to address user concerns or criticism professionally. Responding kindly (or funnily) to criticism shows you’re committed to improving your tool and addressing users' needs. 3. *Managing multiple platforms*: Social media management can be time-consuming, especially if you’re active on multiple platforms. Focus on one or two platforms that matter most to your audience instead of spreading yourself too thin. 4. *Algorithm changes*: Social media platforms often tweak their algorithms, which can impact your content’s visibility. Stay updated on these changes, and adjust your strategy accordingly. 
If a post doesn’t perform well, experiment with different formats (videos, visuals, polls) to see what resonates with your audience. 5. *Privacy and compliance*: It's very important to be mindful of sharing user data or results, especially if your Actor handles sensitive information. Make sure your posts comply with privacy laws and don’t inadvertently expose any personal data. ## For inspiration[​](#for-inspiration "Direct link to For inspiration") It's sometimes hard to think of a good reason to scream into the void that is social media. Here are 23 scenarios where you might use social media to promote your Actor or your work: 1. *Funny interaction with a user*: Share a humorous tweet or post about a quirky question or feedback from a user that highlights your Actor’s unique features. 2. *Roadblock story*: Post about a challenging bug you encountered while developing your Actor and how you solved it, including a screenshot or snippet of code. 3. *Success story*: Share a post detailing how a user’s feedback led to a new feature in your Actor and thank them for their suggestion. 4. *Tutorial video*: Create and share a short video demonstrating how to use a specific feature of your Actor effectively. 5. *Before-and-after example*: Post a visual comparison showing the impact of your Actor’s automation on a task or process. 6. *Feature announcement*: Announce a new feature or update in your Actor with a brief description and a call-to-action for users to try it out. 7. *User testimonial*: Share a positive review or testimonial from a user who benefited from your Actor, including their quote and a link to your tool. 8. *Live Q\&A*: Host a live Q\&A session on a platform like Twitter or Reddit, answering questions about your Actor and its capabilities. 9. *Behind-the-scenes look*: Post a behind-the-scenes photo or video of your development process or team working on your Actor. 10. *Debugging tip*: Share a tip or trick related to debugging or troubleshooting common issues with web scraping or automation. 11. *Integration highlight*: Post about how your Actor integrates with other popular tools or platforms, showcasing its versatility. Don't forget to tag them. 12. *Case study*: Share a case study or success story showing how a business or individual used your Actor to achieve specific results. 13. *Commentary on a news piece*: Offer your perspective on a recent news story related to technology, scraping, or automation. If possible, explain how it relates to your Actor. 14. *User-generated content*: Share content created by your users, such as screenshots or examples of how they’re using your Actor. 15. *Memes*: Post a relevant meme about the challenges of web scraping or automation. 16. *Milestone celebration*: Announce and celebrate reaching a milestone, such as a certain number of users or downloads for your Actor. 17. *Quick tip*: Share a short, useful tip or hack related to using your Actor more efficiently. 18. *Throwback post*: Share a throwback post about the early development stages of your Actor, including any challenges or milestones you achieved. 19. *Collaboration announcement*: Announce a new collaboration with another developer or tool, explaining how it enhances your Actor’s functionality. 20. *Community shout-out*: Give a shout-out to a user or community member who has been particularly supportive or helpful. 21. *Demo invitation*: Invite your followers to a live demo or webinar where you’ll showcase your Actor and answer questions. 22. 
*Feedback request*: Ask your audience for feedback on a recent update or feature release, and encourage them to share their thoughts. 23. *Book or resource recommendation*: Share a recommendation for a book or resource that helped you in developing your Actor, and explain its relevance. --- # Video tutorials **Videos and live streams are powerful tools for connecting with users and potential users, especially when promoting your Actors. You can use them to demonstrate functionality, provide tutorials, or engage with your audience in real time.** *** ## Why videos and live streams matter[​](#why-videos-and-live-streams-matter "Direct link to Why videos and live streams matter") 1. *Visual engagement*. Videos allow you to show rather than just tell. Demonstrating how your Actor works or solving a problem makes the content more engaging and easier to understand. For complex tools, visual explanations can be much more effective than text alone. 2. *Enhanced communication*. Live streams offer a unique opportunity for direct interaction. You can answer questions, address concerns, and gather immediate feedback from your audience, creating a more dynamic and personal connection. 3. *Increased reach*. Platforms like YouTube and TikTok have massive user bases, giving you access to a broad audience. Videos can also be shared across various social media channels, extending your reach even further. Learn more about the rules of live streams in our next section: [Webinars](https://docs.apify.com/academy/actor-marketing-playbook/promote-your-actor/webinars.md) ## Optimizing videos for SEO[​](#optimizing-videos-for-seo "Direct link to Optimizing videos for SEO") 1. *Keywords and titles*. Use relevant keywords in your video titles and descriptions. For instance, if your Actor is a web scraping tool, include terms like “web scraping tutorial” or “how to use web scraping tools” to help users find your content. 2. *Engaging thumbnails*. Create eye-catching thumbnails that accurately represent the content of your video. Thumbnails are often the first thing users see, so make sure they are visually appealing and relevant. 3. *Transcriptions and captions*. Adding transcripts and captions to your videos improves accessibility and can enhance SEO. They allow search engines to index your content more effectively and help users who prefer reading or have hearing impairments. ## YouTube vs. TikTok[​](#youtube-vs-tiktok "Direct link to YouTube vs. TikTok") 1. *YouTube*. YouTube is an excellent platform for longer, detailed videos. Create a channel dedicated to your Actors and regularly upload content such as tutorials, feature walkthroughs, and industry insights. Utilize YouTube’s SEO features by optimizing video descriptions, tags, and titles with relevant keywords. Engage with your audience through comments and encourage them to subscribe for updates. Collaborating with other YouTubers or influencers in the tech space can also help grow your channel. 2. *TikTok*. TikTok is ideal for short, engaging videos. Use it to share quick tips, demo snippets, or behind-the-scenes content about your Actors. The platform’s algorithm favors high engagement, so create catchy content that encourages viewers to interact. Use trending hashtags and participate in challenges relevant to your niche to increase visibility. Consistency is key, so post regularly and monitor which types of content resonate most with your audience. ## Growing your channels[​](#growing-your-channels "Direct link to Growing your channels") 1. 
*Regular content*. Consistently upload content to keep your audience engaged and attract new viewers. Create a content calendar to plan and maintain a regular posting schedule. 2. *Cross-promotion*. Share your videos across your social media channels, blogs, and newsletters. This cross-promotion helps drive traffic to your videos and increases your reach. 3. *Engage with your audience*. Respond to comments and feedback on your videos. Engaging with viewers builds a community around your content and encourages ongoing interaction. 4. *Analyze performance*. Use analytics tools provided by YouTube and TikTok to track the performance of your videos. Monitor metrics like watch time, engagement rates, and viewer demographics to refine your content strategy. *** ## Where to mention videos across your Actor ecosystem[​](#where-to-mention-videos-across-your-actor-ecosystem "Direct link to Where to mention videos across your Actor ecosystem") 1. *README*: include links to your videos in your Actor’s README file. For example, if you have a tutorial video, mention it in a "How to scrape X" or "Resources" section to guide users. 2. *Input schema*: if your Actor’s input schema includes complex fields, link to a video that explains how to configure these fields. This can be especially helpful for users who prefer visual guides. 3. *Social media*: share your videos on platforms like Twitter, LinkedIn, and Facebook. Use engaging snippets or highlights to attract users to watch the full video. 4. *Blog posts*: embed videos in your blog posts for a richer user experience. If you write a tutorial or feature update, include a video to provide additional context. 5. *Webinars and live streams*: mention your videos during webinars or live streams. If you’re covering a topic related to a video you’ve posted, refer to it as a supplemental resource. --- # Webinars Webinars and live streams are a fantastic way to connect with your audience, showcase your Actor's capabilities, and gather feedback from users. Though the term webinar might sound outdated these days, the concept of a live video tutorial is alive and well in the world of marketing and promotion. Whether you're introducing a new feature, answering questions, or walking through a common use case, a live event can create more personal engagement, boost user trust, and open the door for valuable two-way communication. But how do you get started? Here's a friendly guide on where to host, how to prepare, and what to do before, during, and after your webinar. *** ## Why host a live stream?[​](#why-host-a-live-stream "Direct link to Why host a live stream?") Here are a few reasons why live streams are ideal for promoting your Actor: * *Demo*. You can show your Actor in action and highlight its most powerful features. You can tell a story about how you built it. You can also show how your Actor interacts with other tools and platforms and what its best uses are. A live demo lets users see immediately how your tool solves their problems. * *Building trust and rapport*. Interacting directly with your users builds trust and rapport. Even showing up and showing your face/voice, it's a chance to let your users meet you and get a feel for the team behind the Actor. * *Live Q\&A*. Users often have questions that can be hard to fully address in documentation, README, or tutorials. A live session allows for Q\&A, so you can explain complex features and demonstrate how to overcome common issues. * *Tutorial or training*. 
If you don't have time for complex graphics, this is an easy replacement for a video tutorial until you do. Remember that some platforms (YouTube) give the option of publishing the webinar after it's over. You can reuse it later in other content or as a standalone guide. Also, if you’ve noticed users struggling with particular features, a webinar is a great way to teach them directly. Webinars help build a community around your Actor and turn one-time users into loyal advocates. ## Where to host your webinar or live stream[​](#where-to-host-your-webinar-or-live-stream "Direct link to Where to host your webinar or live stream") It all goes back to where you have or would like to have your audience and whether you want to have the webinar available on the web later. 1. Social media: 1. *YouTube*: ideal for reaching a broad audience. It’s free and easy to set up. You can also make recordings available for future viewing. 2. *TikTok*: same, ideal for reaching a broad audience, free and easy to set up. However, live video will disappear once the broadcast has ended. TikTok does allow you to save your livestreams. You won't be able to republish them to the platform (we assume your live stream will be longer than 10 minutes). But you can later re-upload it elsewhere. 3. *Twitch*: Known for gaming, Twitch has become a space for tech demos, coding live streams, and webinars. If your target audience enjoys an interactive and casual format, Twitch might be a good fit. 4. *LinkedIn*: If your audience is more professional, LinkedIn Live could be a good fit to present your Actor there. Once a stream is complete, it will remain on the feed of your LinkedIn Page or profile as a video that was ‘previously recorded live’. 5. *Facebook*: Not recommended. 2. General platforms: 1. *Zoom* or *Google Meet*: More personal, these are great for smaller webinars where you might want closer interaction. They also give you control over who attends. Pick a platform where your users are most likely to hang out. If your audience is primarily tech-savvy, YouTube or Twitch could work. If your Actor serves businesses, LinkedIn might be the best spot. ## Webinar/live stream prep[​](#webinarlive-stream-prep "Direct link to Webinar/live stream prep") ### Promote your webinar and get your users[​](#promote-your-webinar-and-get-your-users "Direct link to Promote your webinar and get your users") * Send an email blast: if you have an email list of users or potential users, send a friendly invite. Include details about what you’ll cover and how they can benefit from attending. * Social media promotion on Twitter (X), LinkedIn, or other platforms. Highlight what people will learn and any special features you’ll be demonstrating. Do it a few times - 2 weeks before the webinar, 1 week before, a day before, and the day of. Don't forget to announce on Apify’s Discord. These are places where your potential audience is likely hanging out. Let them know you’re hosting an event and what they can expect. * Use every piece of real estate on Apify Store and Actor pages. Add a banner or notification to your Actor’s page (top of the README). This can be a great way to notify people who are already looking at your Actor. A simple “join us for a live demo on DATE” message works well. Add something like that to your Store bio and its README. Mention it at the top description of your Actor's input schema. Use UTM tags: when creating a link to share for the webinar, you can add different UTM tags for each place where you will insert the link, as in the sketch below. 
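For example, here's a minimal Node.js sketch that builds one tagged link per channel. The sign-up URL and campaign name are hypothetical placeholders; only the standard `utm_source`, `utm_medium`, and `utm_campaign` parameters are assumed:

```javascript
// Build one UTM-tagged webinar sign-up link per promotion channel.
// The base URL and campaign name are hypothetical placeholders.
const base = 'https://example.com/actor-webinar-signup';

const channels = {
  readme: { utm_source: 'actor-readme', utm_medium: 'referral', utm_campaign: 'webinar-launch' },
  discord: { utm_source: 'discord', utm_medium: 'social', utm_campaign: 'webinar-launch' },
  newsletter: { utm_source: 'newsletter', utm_medium: 'email', utm_campaign: 'webinar-launch' },
};

// Print one ready-to-paste link per channel.
for (const [name, params] of Object.entries(channels)) {
  console.log(`${name}: ${base}?${new URLSearchParams(params)}`);
}
```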
That way, you can later learn which space brought the most webinar sign-ups. * Collaborate with other developers. If you can team up with someone in the Apify community, you’ll double your reach. Cross-promotion can bring in users from both sides. *** ### Plan the content[​](#plan-the-content "Direct link to Plan the content") Think carefully about what you’ll cover. Focus on what’s most relevant for your audience: * *Decide on your content*. What will you cover? A demo? A deep dive into Actor configurations? Create a flow and timeline to keep yourself organized. * *Prepare visuals*. Slides, product demos, and examples help explain complex ideas clearly. * *Feature highlights*. Demonstrate the key features of your Actor. Walk users through common use cases and be ready to show live examples. * *Input schema*. If your Actor has a complex input schema, spend time explaining how to use it effectively. Highlight tips that will save users time and frustration. You can incorporate your knowledge from the issues tab. * *Q\&A session*. Leave time for questions at the end. Make sure to keep this flexible, as it’s often where users will engage the most. Don't forget to add an intro with an agenda and an outro with your contact details. Consider timezones: when deciding when to run the webinar, focus on the timezone of the majority of your users. ### Prepare technically[​](#prepare-technically "Direct link to Prepare technically") Test your setup before going live. Here’s what to focus on: * *Stable internet connection*. This one’s obvious but essential. Test your stream quality ahead of time. * *Test the Actor live*. If you're demoing your Actor, ensure it works smoothly. Avoid running scripts that take too long or have potential bugs during the live session. * *Audio quality*. People are far more likely to tolerate a blurry video than bad audio. Use a good-quality microphone to ensure you’re heard clearly. * *Screen sharing*. If you’re doing a live demo, make sure you know how to seamlessly switch between windows and share your screen effectively. * *Backup plan*. Have a backup plan in case something goes wrong. This could be as simple as a recorded version of your presentation to share if things go south during the live session. * *Make it interactive*. Consider using polls or a live Q\&A session to keep the audience engaged. Maybe have a support person assisting with that side of things while you're speaking. ## Best practices during the live stream[​](#best-practices-during-the-live-stream "Direct link to Best practices during the live stream") When the time comes, here’s how to make the most of your webinar or live stream: * *Start with an introduction*. Begin with a brief introduction of yourself, the Actor you’re showcasing, and what attendees can expect to learn. This sets expectations and gives context. It's also best if you have a slide that shows the agenda. * *Try to stay on time*. Stick to the agenda. Users appreciate when events run on schedule. * *Show a live demo*. Walk through a live demo of your Actor. Show it solving a problem from start to finish. * *Explain as you go*. Be mindful that some people might be unfamiliar with technical terms or processes. Try to explain things simply and offer helpful tips as you demonstrate, but don't go off on a tangent. * *Invite questions and engage your audience*. Encourage users to ask questions throughout the session. This creates a more conversational tone and helps you address their concerns in real time. 
You can also ask a simple question or poll to get the chat going. Try to direct the Q\&A into one place so you don't have to switch tabs. Throughout the presentation, pause for questions and make sure you're addressing any confusion in real time. * *Wrap up with a clear call to action*. Whether it’s to try your Actor, leave a review, or sign up for a future live session, finish with a clear CTA. Let them know the next step to take. These practices apply whether you're running a simple tutorial walkthrough or a code-along session. ## After the live session[​](#after-the-live-session "Direct link to After the live session") Once your live session wraps up, there are still ways to benefit from it: * *Make it public and share the recording*. Not everyone who wanted to attend will have been able to make it. Send a recording to all attendees whose emails you have and make it publicly available on your channels (emails, README, social media, etc.). Upload the recorded session to YouTube and your Actor’s documentation. If it's on YouTube, you can also ask Apify's video team to add it to their Community playlist. Make it easy for people to revisit the content or share it with others. * *Follow up with attendees, thank them, and ask for feedback*. Send a follow-up email thanking people for attending. Include a link to the recording, additional resources, and ways to get in touch if they have more questions. Share any special offers or discount codes if relevant. If you don’t have the attendees' emails, include a link in your newsletter and publish it on your channels. Ask for feedback on what they liked and what could be improved. This can guide your next webinar or help fine-tune your Actor. * *Answer lingering questions*. If any questions didn’t get answered live, take the time to address them in the follow-up email. * *Create a blog post or article*. Summarize the key points of your webinar in a written format. This can boost your SEO and help users find answers in the future. * *Review your performance*. Analyze the data from your webinar, if available. How many people attended? Which platform brought the most sign-ups? How many questions did you receive? Were there any technical difficulties? This helps refine your approach for future events. * *Share snippets from the webinar or interesting takeaways on social media*. Encourage people to watch the recording and let them know when you’ll be hosting another event. --- # Actor success stories *Web scraping freelance financial freedom with microworlds.* Discover how Caleb David, founder of `microworlds`, achieved financial freedom through freelance web scraping. His journey showcases how mastering the craft with tools like Crawlee and creating a Twitter scraper transformed his career. See the full story [here](https://blog.apify.com/web-scraping-freelance-financial-freedom/) and learn from his success. *Web scraping for freelance success – insights from Tugkan.* In this success story, our first community dev Tugkan shares how his journey into freelancing via Apify changed his life. Learn about his process, challenges, and how his paid Actors have brought him financial rewards and freedom. Check out his story [here](https://apify.com/success-stories/paid-actor-journey-apify-freelancer-tugkan) for inspiration. Interested in sharing your story? Reach out to our marketing team for a case study to showcase your journey. 
--- # How Actor monetization works **You can turn your web scrapers into a source of income by publishing them on Apify Store. Learn how it's done and what monetization options you have.** *** ## Monetizing your Actor[​](#monetizing-your-actor "Direct link to Monetizing your Actor") Monetizing your Actor on the Apify platform involves several key steps: 1. *Development*: create and refine your Actor. 2. *Testing*: ensure your Actor works reliably. 3. *Publication & monetization*: publish your Actor and set up its monetization model. 4. *Promotion*: attract users to your Actor. *** ## Monetization models[​](#monetization-models "Direct link to Monetization models") ### Rental pricing model[​](#rental-pricing-model "Direct link to Rental pricing model") ![rental model example](/assets/images/rental-model-727e0b838b54bbd57b7e6095cddd90a7.png) * *How it works*: you offer a free trial period and set a monthly fee. Users on Apify paid plans can continue using the Actor after the trial. You earn 80% of the monthly rental fees. * *Example*: you set a 7-day free trial and $30/month rental. If 3 users start using your Actor: * 1st user on a paid plan pays $30 after the trial (you earn $24). * 2nd user starts their trial but pays next month. * 3rd user on a free plan finishes the trial without upgrading to a paid plan and can’t use the Actor further. Learn more about the rental pricing model in our [documentation](https://docs.apify.com/platform/actors/publishing/monetize.md#rental-pricing-model). ### Pay-per-result pricing model[​](#pay-per-result-pricing-model "Direct link to Pay-per-result pricing model") ![pay per result model example](/assets/images/ppr-model-c7cd05e9f4a2a973bb8101fed2eaab67.png) * *How it works*: you charge users based on the number of results your Actor generates. You earn 80% of the revenue minus platform usage costs. * *Profit calculation*: `profit = (0.8 * revenue) - platform usage costs` * *Cost breakdown*: * Compute unit: $0.4 per CU * Residential proxies: $13 per GB * SERPs proxy: $3 per 1,000 SERPs * Data transfer (external): $0.20 per GB * Dataset storage: $1 per 1,000 GB-hours * *Example*: you set a price of $1 per 1,000 results. Two users generate 50,000 and 20,000 results, paying $50 and $20, respectively. If the platform usage costs are $5 and $2, your profit is $49. Learn more about the pay-per-result pricing model in our [documentation](https://docs.apify.com/platform/actors/publishing/monetize.md#pay-per-result-pricing-model). ### Pay-per-event pricing model[​](#pay-per-event-pricing-model "Direct link to Pay-per-event pricing model") ![pay per event model example](/assets/images/ppe-model-0e4ba61669f4bffb1fe144b4d225e3c2.png) * *How it works*: you charge users based on specific events triggered programmatically by your Actor's code. You earn 80% of the revenue minus platform usage costs. 
* *Profit calculation*: `profit = (0.8 * revenue) - platform usage costs` * *Event cost example*: you set the following events for your Actor: * `Actor start per 1 GB of memory` at $0.005 * `Pages scraped` at $0.002 * `Page opened with residential proxy` at $0.002 - this is on top of `Pages scraped` * `Page opened with a browser` at $0.002 - this is on top of `Pages scraped` * *Example*: * User A: * Started the Actor 10 times = $0.05 * Scraped 1,000 pages = $2.00 * 500 of those were scraped using residential proxy = $1.00 * 300 of those were scraped using browser = $0.60 * This comes up to $3.65 of total revenue * User B: * Started the Actor 5 times = $0.025 * Scraped 500 pages = $1.00 * 200 of those were scraped using residential proxy = $0.40 * 100 of those were scraped using browser = $0.20 * This comes up to $1.625 of total revenue * That means that if platform usage costs are $0.365 for user A and $0.162 for user B, your profit per the formula above is 0.8 * ($3.65 + $1.625) - ($0.365 + $0.162) = $3.693 Learn more about the pay-per-event pricing model in our [documentation](https://docs.apify.com/platform/actors/publishing/monetize.md#pay-per-event-pricing-model). ## Setting up monetization[​](#setting-up-monetization "Direct link to Setting up monetization") 1. *Go to your Actor page*: navigate to the **Publication** tab and open the **Monetization** section. 2. *Fill in billing details*: set up your payment details for payouts. 3. *Choose your pricing model*: use the monetization wizard to select your model and set fees. ### Changing monetization[​](#changing-monetization "Direct link to Changing monetization") Adjustments to monetization settings take 14 days to take effect and can be made once per month. ### Tracking and promotion[​](#tracking-and-promotion "Direct link to Tracking and promotion") * *Track profit*: review payout invoices and statistics in Apify Console (**Monitoring** tab). * *Promote your Actor*: optimize your Actor’s description for SEO, share on social media, and consider creating tutorials or articles to attract users. ## Marketing tips for defining the price for your Actor[​](#marketing-tips-for-defining-the-price-for-your-actor "Direct link to Marketing tips for defining the price for your Actor") It's up to you to set the pricing, of course. It can be as high or low as you wish; you can even make your Actor free. But if you're generally aiming for a successful, popular Actor, here are a few directions: ### Do market research outside Apify Store[​](#do-market-research-outside-apify-store "Direct link to Do market research outside Apify Store") The easiest way to understand your tool's value is to look around. Are there similar tools on the market? What do they offer, and how much do they charge? What added value does your tool provide compared to theirs? What features can your tool borrow from theirs for the future? Try competitor tools yourself (to assess the value and the quality they provide), check their SEO (to see how much traffic they get), and note ballpark figures. Think about what your Actor can do that competitors might be missing. Also, remember that your Actor is a package deal with the Apify platform. So all the platform's features automatically transfer onto your Actor and its value. Scheduling, monitoring runs, ways of exporting data, proxies, and integrations can all add value to your Actor (on top of its own functionalities). Be sure to factor this into your tool's value proposition and communicate that to the potential user. 
### Do research in Apify Store[​](#do-research-in-apify-store "Direct link to Do research in Apify Store") Apify Store is like any other marketplace, so take a look at your competition there. Are you the first in your lane, or are there other similar tools? What makes yours stand out? Remember, your README is your first impression — communicate your tool's benefits clearly and offer something unique. Competing with other developers is great, but collaborations can drive even better results 😉 Learn more about what makes a good README here: [How to create an Actor README](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md) ### Rental, pay-per-result (PPR), or pay-per-event (PPE)[​](#rental-pay-per-result-ppr-or-pay-per-event-ppe "Direct link to Rental, pay-per-result (PPR), or pay-per-event (PPE)") Rental pricing is technically easier: you set the rental fee, and the user covers their CU usage. So all you have to define is how much you want to charge the users. With pay-per-result, you’ll need to include both CU usage and your margin. So you have to calculate how much the average run is going to cost the user and define how much you want to charge on top. To figure out the average cost per run for users, just run a few test runs and look at the statistics in the Actor [**Analytics**](https://console.apify.com/actors?tab=analytics) tab. From an average user's perspective, pay-per-result is often easier to grasp — $25 for a thousand pages, $5 for a thousand videos, $1 for a thousand images, etc. It gives users a clearer idea of what they’re paying for and allows them to estimate faster. But rental pricing has its fans, too — if your tool provides high value, users will come. Pay-per-event (PPE) lets you define pricing for individual events. You can charge for specific events directly from your Actor by calling our PPE charging API. The most common events will most likely be Actor start, dataset item, external API calls, etc. PPE is great for users who want to optimize their costs and value transparency. PPE is also a fairer pricing model for integration and AI-driven use cases, where dataset-based pricing doesn’t make sense. ### Adapt when needed[​](#adapt-when-needed "Direct link to Adapt when needed") Don’t be afraid to experiment with pricing, especially at the start. You can monitor your results in the dashboard and adjust if necessary. Keep an eye on SEO as well. If you monitor the volume of the keywords your Actor is targeting as well as how well your Actor's page is ranking for those keywords, you can estimate the number of people who actually end up trying your tool (aka conversion rate). If your keywords are getting volume, but conversions are lower than expected, it might point to a few issues. It could be due to your pricing, a verbose README, or complex input. If users are bouncing right away, it makes sense to check out your pricing and your closest competitors to see where adjustments might help. ### Summary & a basic plan[​](#summary--a-basic-plan "Direct link to Summary & a basic plan") Pick a pricing model, run some tests, and calculate your preliminary costs (**Analytics** tab in Console). Then check your costs against similar solutions in the Store and the market (try Google search or other marketplaces), and set a price that gives you some margin. It’s also normal to adjust pricing as you get more demand. For context, most prices on Apify Store range between $1-10 per 1,000 results. 
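As a quick sanity check on a pay-per-result price, you can plug your numbers into the profit formula from above. Here is a minimal Node.js sketch with placeholder figures (pull real platform costs from your Actor's **Analytics** tab):

```javascript
// Pay-per-result sanity check: profit = (0.8 * revenue) - platform usage costs.
// All figures are placeholders - use real numbers from your Actor's Analytics tab.
const pricePer1000Results = 5.0;        // what you plan to charge per 1,000 results (USD)
const platformCostPer1000Results = 0.7; // average platform usage cost per 1,000 results (USD)

const profitPer1000 = 0.8 * pricePer1000Results - platformCostPer1000Results;
const margin = profitPer1000 / pricePer1000Results;

console.log(`Profit per 1,000 results: $${profitPer1000.toFixed(2)} (${Math.round(margin * 100)}% of the listed price)`);
```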
Example of useful pricing estimates from the **Analytics** tab: ![example of pricing estimates in analytics tab](/assets/images/analytisc-example-e5005177826fdce533bedec8beb29b4e.png) Use emails! 📫 Don't forget to set an email sequence to warn and remind your users about pricing changes. Learn more about emailing your users here: [Emails to Actor users] ## Resources[​](#resources "Direct link to Resources") * Learn about [incentives behind monetization](https://apify.com/partners/actor-developers) * Detailed guide to [setting up monetization models](https://docs.apify.com/academy/get-most-of-actors/monetizing-your-actor) * Guide to [publishing Actors](https://docs.apify.com/platform/actors/publishing) * Watch our webinar on how to [build, publish and monetize Actors](https://www.youtube.com/watch?v=4nxStxC1BJM) * Read a blog post from our CEO on the [reasoning behind monetizing Actors](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) * Learn about the [Creator plan](https://apify.com/pricing/creator-plan), which allows you to create and freely test your own Actors for $1 --- # How Apify Store works **Out of the thousands of Actors on the [Apify Store](https://apify.com/store) marketplace, most were created by developers just like you. Let's get acquainted with the concept of Apify Store and what it takes to publish an Actor there.** *** ## What are Actors (and why they're called that)?[​](#what-are-actors-and-why-theyre-called-that "Direct link to What are Actors (and why they're called that)?") [Actors](https://apify.com/actors) are serverless cloud applications that run on the Apify platform, capable of performing various computing tasks on the web, such as crawling websites or sending automated emails. They are developed by independent developers all over the world, and *you can be one of them*. The term "Actor" is used because, like human actors, these programs follow a script. This naming convention unifies both web scraping and web automation solutions, including AI agents, under a single term. Actors can range in complexity and function, targeting different websites or performing multiple tasks, which makes the umbrella term very useful. ## What is Apify Store?[​](#what-is-apify-store "Direct link to What is Apify Store?") [Apify Store](https://apify.com/store) is a public library of Actors that is constantly growing and evolving. It's basically a publicly visible (and searchable) part of the Apify platform. Of the thousands of Actors currently available, most are created and maintained by the community. Actors that consistently perform well remain on Apify Store, while those reported as malfunctioning or under maintenance are eventually removed. This keeps the tools in our ecosystem reliable, effective, and competitive. ### Types of Actors[​](#types-of-actors "Direct link to Types of Actors") * *Web scraping Actors*: for instance, [Twitter (X) Scraper](https://apify.com/apidojo/twitter-user-scraper) extracts data from Twitter. * *Automation Actors*: for example, [Content Checker](https://apify.com/jakubbalada/content-checker) monitors website content for changes and emails you once a change occurs. * *Bundles*: chains of multiple Actors united by a common data point or target website. For example, [Restaurant Review Aggregator](https://apify.com/tri_angle/restaurant-review-aggregator) can scrape reviews from six platforms at once. 
Learn more about bundles here: [Actor bundles](https://docs.apify.com/academy/actor-marketing-playbook/product-optimization/actor-bundles.md) ## Public and private Actors[​](#public-and-private-actors "Direct link to Public and private Actors") Actors on Apify Store can be public or private: * *Private Actors*: these are only accessible to you in Apify Console. You can use them without exposing them to the web. However, you can still share the results they produce. * *Public Actors*: these are available to everyone on Apify Store. You can choose to make them free or set a price. By publishing your web scrapers and automation solutions, you can attract users and generate income. ## How Actor monetization works (briefly)[​](#how-actor-monetization-works-briefly "Direct link to How Actor monetization works (briefly)") You can monetize your Actors using the following pricing models: * Pay for usage: charge based on how much the Actor is used. * Pay per result: the price is based on the number of results produced, with the first few free. * Pay per event: the price is based on specific events triggered by the Actor. * Monthly billing: set a fixed monthly rental rate for using the Actor. For detailed information on which pricing model might work for your Actor, refer to [How Actor monetization works](https://docs.apify.com/academy/actor-marketing-playbook/store-basics/how-actor-monetization-works.md). ## Actor ownership on Store[​](#actor-ownership-on-store "Direct link to Actor ownership on Store") Actors are either created and maintained by Apify or by members of the community: * *Maintained by Apify*: created and supported by the Apify team. * *Maintained by Community*: created and managed by independent developers from the community. To see who maintains an Actor, check the upper-right corner of the Actor's page. When it comes to managing Actors on Apify, it’s important that every potential community developer understands the differences between Apify-maintained and Community-maintained Actors. Here’s what you need to know to navigate the platform effectively and ensure your work stands out. ### Community-maintained Actors[​](#community-maintained-actors "Direct link to Community-maintained Actors") ✨ *Features and functionality*: offers a broader range of use cases and features, often tailored to specific needs. Great for exploring unique or niche applications. 🧑‍💻 *Ownership*: created and maintained by independent developers like you. 🛠 *Maintenance*: you’re responsible for all updates, bug fixes, and ongoing maintenance. Apify hosts your Actor but does not manage its code. 👷‍♀️ *Reliability and testing*: it’s up to you to ensure your Actor’s reliability and performance. ☝️ *Support and Issues*: Apify does not provide direct support for Community-maintained Actors. You must manage issues through the Issues tab, where you handle user queries and problems yourself. ✍️ *Documentation*: you’re responsible for creating and maintaining documentation for your Actor. Make sure it’s clear and helpful for users. Test your Actor! For the best results, make sure your Actor is well-documented and thoroughly tested. Engage with users through the Issues tab to address any problems promptly. By maintaining high standards and being proactive, you’ll enhance your Actor’s reputation and usability in Apify Store. 
## Importance of Actor testing and reliability[​](#importance-of-actor-testing-and-reliability "Direct link to Importance of Actor testing and reliability")

It's essential to test your Actors and make sure they work as intended. Apify tests Actors automatically on its side, and you should do the same on yours.

Apify runs automated tests daily to ensure all Actors on Apify Store are functional and reliable. These tests check *whether an Actor can successfully run with its default input within 5 minutes*. If an Actor fails for three consecutive days, it's labeled as under maintenance, and the developer is notified. Continuous failures for another 28 days lead to deprecation.

To restore an Actor's health, developers should fix and rebuild it. The testing system will automatically recognize the changes within 24 hours. If your Actor requires longer run times or authentication, contact support to explain why it should be excluded from tests. For more control, you can implement your own tests using the Actor Testing tool available on Apify Store.

### Actor metrics and reliability score[​](#actor-metrics-and-reliability-score "Direct link to Actor metrics and reliability score")

On the right panel of each Actor on Store, you can see a list of Actor metrics. Actor metrics such as the number of monthly users, star ratings, success rates, response times, creation dates, and recent modifications collectively offer insights into its reliability. Basically, they serve as a *shorthand for potential users to assess your Actor's reliability* before even trying it out.

A high number of monthly users indicates widespread trust and effective performance, while a high star rating reflects user satisfaction. A success rate nearing 100% demonstrates consistent performance. Short response times in the Issues tab show a commitment to addressing problems promptly. A recent creation date suggests modern features and ongoing development, while recent modifications point to active maintenance and continuous improvements. Together, these metrics provide a comprehensive view of an Actor's reliability and quality.

### Reporting Issues in Actors[​](#reporting-issues-in-actors "Direct link to Reporting Issues in Actors")

Each Actor has an **Issues** tab in Apify Console and on the web. Here, users can open an issue (ticket) and engage in discussions with the Actor's creator, platform admins, and other users. The tab is ideal for asking questions, requesting new features, or providing feedback.

Since the **Issues** tab is public, the level of activity (or lack thereof) can be observed by potential users and may serve as an indicator of the Actor's reliability. A well-maintained Issues tab with prompt responses suggests an active and dependable Actor.

Learn more about how to handle the [Issues tab](https://docs.apify.com/academy/actor-marketing-playbook/interact-with-users/issues-tab.md).

## Resources[​](#resources "Direct link to Resources")

* Best practices on setting up [testing for your Actor](https://docs.apify.com/platform/actors/publishing/test)
* What are Apify-maintained and [Community-maintained Actors](https://help.apify.com/en/articles/6999799-what-are-apify-maintained-and-community-maintained-actors)?
On ownership, maintenance, features, and support * Step-by-step guide on how to [publish your Actor](https://docs.apify.com/platform/actors/publishing) * Watch our webinar on how to [build, publish and monetize Actors](https://www.youtube.com/watch?v=4nxStxC1BJM) * Detailed [guide on pricing models](https://docs.apify.com/platform/actors/running/actors-in-store) for Actors in Store --- # How to build Actors At Apify, we try to make building web scraping and automation straightforward. You can customize our universal scrapers with JavaScript for quick tweaks, use our code templates for rapid setup in JavaScript, TypeScript, or Python, or build from scratch using our JavaScript and Python SDKs or Crawlee libraries for Node.js and Python for ultimate flexibility and control. This guide offers a quick overview of our tools to help you find the right fit for your needs. ## Three ways to build Actors[​](#three-ways-to-build-actors "Direct link to Three ways to build Actors") 1. [Our universal scrapers](https://apify.com/scrapers/universal-web-scrapers) — customize our boilerplate tools to your needs with a bit of JavaScript and setup. 2. [Our code templates](https://apify.com/templates) for web scraping projects — for a quick project setup to save you development time (includes JavaScript, TypeScript, and Python templates). 3. Open-source libraries and SDKs 1. [JavaScript SDK](https://docs.apify.com/sdk/js/) & [Python SDK](https://docs.apify.com/sdk/python/) — for creating your own solution from scratch on the Apify platform using our free development kits. Involves more coding but offers infinite flexibility. 2. [Crawlee](https://crawlee.dev/) and [Crawlee for Python](https://crawlee.dev/python) — for creating your own solutions from scratch using our free web automation libraries. Involves even more coding but offers infinite flexibility. There’s also no need to host these on the platform. ## Universal scrapers & what are they for[​](#universal-scrapers--what-are-they-for "Direct link to Universal scrapers & what are they for") [Universal scrapers](https://apify.com/scrapers/universal-web-scrapers) were built to provide an intuitive UI plus configuration that will help you start extracting data as quickly as possible. Usually, you just provide a [simple JavaScript function](https://docs.apify.com/tutorials/apify-scrapers/getting-started#the-page-function) and set up one or two parameters, and you're good to go. Since scraping and automation come in various forms, we decided to build not just one, but *six* scrapers. This way, you can always pick the right tool for the job. Let's take a look at each particular tool and its advantages and disadvantages. 
| Scraper | Technology | Advantages | Disadvantages | Best for |
| --- | --- | --- | --- | --- |
| 🌐 Web Scraper | Headless Chrome Browser | Simple, fully JavaScript-rendered pages | Executes only client-side JavaScript | Websites with heavy client-side JavaScript |
| 👐 Puppeteer Scraper | Headless Chrome Browser | Powerful Puppeteer functions, executes both server-side and client-side JavaScript | More complex | Advanced scraping with client/server-side JS |
| 🎭 Playwright Scraper | Cross-browser support with the Playwright library | Cross-browser support, executes both server-side and client-side JavaScript | More complex | Cross-browser scraping with advanced features |
| 🍩 Cheerio Scraper | HTTP requests + Cheerio parser (jQuery-like, for servers) | Simple, fast, cost-effective | Pages may not be fully rendered (lacks JavaScript rendering), executes only server-side JavaScript | High-speed, cost-effective scraping |
| ⚠️ JSDOM Scraper | JSDOM library (browser-like DOM API) | Handles client-side JavaScript, faster than full-browser solutions, ideal for light scripting | Not for heavy dynamic JavaScript, executes server-side code only, depends on pre-installed NPM modules | Speedy scraping with light client-side JS |
| 🍲 BeautifulSoup Scraper | Python-based, HTTP requests + BeautifulSoup parser | Python-based, supports recursive crawling and URL lists | No full-featured web browser, not suitable for dynamic JavaScript-rendered pages | Python users needing simple, recursive crawling |

### How do I choose the right universal web scraper to start with?[​](#how-do-i-choose-the-right-universal-web-scraper-to-start-with "Direct link to How do I choose the right universal web scraper to start with?")

🎯 Decision points:

* Use 🌐 [Web Scraper](https://apify.com/apify/web-scraper) if you need simplicity with full browser capabilities and client-side JavaScript rendering.
* Use 🍩 [Cheerio Scraper](https://apify.com/apify/cheerio-scraper) for fast, cost-effective scraping of static pages with simple server-side JavaScript execution.
* Use 🎭 [Playwright Scraper](https://apify.com/apify/playwright-scraper) when cross-browser compatibility is crucial.
* Use 👐 [Puppeteer Scraper](https://apify.com/apify/puppeteer-scraper) for advanced, powerful scraping where you need both client-side and server-side JavaScript handling.
* Use ⚠️ [JSDOM Scraper](https://apify.com/apify/jsdom-scraper) for lightweight, speedy scraping with minimal client-side JavaScript requirements.
* Use 🍲 [BeautifulSoup Scraper](https://apify.com/apify/beautifulsoup-scraper) for Python-based scraping, especially with recursive crawling and processing URL lists.

To make it easier, here's a short questionnaire that guides you on selecting the best scraper based on your specific use case:

Questionnaire

1. Is the website content rendered with a lot of client-side JavaScript?
   * Yes:
     * Do you need full browser capabilities?
       * Yes: use Web Scraper or Playwright Scraper
       * No, but I still want advanced features: use Puppeteer Scraper
   * No:
     * Do you prioritize speed and cost-effectiveness?
       * Yes: use Cheerio Scraper
       * No: use JSDOM Scraper
2. Do you need cross-browser support for scraping?
   * Yes: use Playwright Scraper
   * No: continue to the next step.
3. Is your preferred scripting language Python?
   * Yes: use BeautifulSoup Scraper
   * No: continue to the next step.
4. Are you dealing with static pages or lightweight client-side JavaScript?
   * Static pages: use Cheerio Scraper or BeautifulSoup Scraper
   * Light client-side JavaScript:
     * Do you want a balance between speed and client-side JavaScript handling?
       * Yes: use JSDOM Scraper
       * No: use Web Scraper or Puppeteer Scraper
5. Do you need to support recursive crawling or process lists of URLs?
   * Yes, and I prefer Python: use BeautifulSoup Scraper
   * Yes, and I prefer JavaScript: use Web Scraper or Cheerio Scraper
   * No: choose based on other criteria above.

This should help you navigate through the options and choose the right scraper based on the website's complexity, your scripting language preference, and your need for speed or advanced features.

📚 Resources:

* How to use [Web Scraper](https://www.youtube.com/watch?v=5kcaHAuGxmY) to scrape any website
* How to use [Beautiful Soup](https://www.youtube.com/watch?v=1KqLLuIW6MA) to scrape the web
* Learn about our $1/month [Creator plan](https://apify.com/pricing/creator-plan) that encourages devs to build Actors based on universal scrapers

## Web scraping code templates[​](#web-scraping-code-templates "Direct link to Web scraping code templates")

Similar to our universal scrapers, our [code templates](https://apify.com/templates) also provide a quick start for developing web scrapers, automation scripts, and testing tools. Built on popular libraries like BeautifulSoup for Python or Playwright for JavaScript, they save time on setup, allowing you to focus on customization. Though they require more coding than universal scrapers, they're ideal for those who want a flexible foundation while still needing room to tailor their solutions.

| Code template | Supported libraries | Purpose | Pros | Cons |
| --- | --- | --- | --- | --- |
| 🐍 Python | Requests, BeautifulSoup, Scrapy, Selenium, Playwright | Creating scrapers, automation, testing tools | Simplifies setup, supports major Python libraries | Requires more manual coding (than universal scrapers), may be restrictive for complex tasks |
| ☕️ JavaScript | Playwright, Selenium, Cheerio, Cypress, LangChain | Creating scrapers, automation, testing tools | Eases development with pre-set configurations, flexibility with JavaScript and TypeScript | Requires more manual coding (than universal scrapers), may be restrictive for tasks needing full control |

📚 Resources:

* [How to build a scraper](https://www.youtube.com/watch?v=u-i-Korzf8w) using a web scraper template.

## Toolkits and libraries[​](#toolkits-and-libraries "Direct link to Toolkits and libraries")

### Apify JavaScript and Python SDKs[​](#apify-javascript-and-python-sdks "Direct link to Apify JavaScript and Python SDKs")

[Apify SDKs](https://docs.apify.com/sdk/js/) are designed for developers who want to interact directly with the Apify platform. They allow you to perform tasks like saving data in Apify Datasets, running Apify Actors, and accessing the key-value store.
Ideal for those who are familiar with [Node.js](https://docs.apify.com/sdk/js/) and [Python](https://docs.apify.com/sdk/python/), the SDKs provide the tools needed to develop software specifically for the Apify platform, offering complete freedom and flexibility within the JavaScript and Python ecosystems.

* *Best for*: interacting with the Apify platform (e.g., saving data, running Actors, etc.)
* *Pros*: full control over platform-specific operations, integrates seamlessly with Apify services
* *Cons*: requires writing boilerplate code, higher complexity with more room for errors

### Crawlee[​](#crawlee "Direct link to Crawlee")

[Crawlee](https://crawlee.dev/) (for both Node.js and [Python](https://crawlee.dev/python)) is a powerful web scraping library that focuses on tasks like extracting data from web pages, automating browser interactions, and managing complex scraping workflows. Unlike the Apify SDK, Crawlee does not require the Apify platform and can be used independently for web scraping tasks. It handles complex operations like concurrency management, auto-scaling, and request queuing, allowing you to concentrate on the actual scraping tasks.

* *Best for*: web scraping and automation (e.g., scraping paragraphs, automating clicks)
* *Pros*: full flexibility in web scraping tasks, does not require the Apify platform, leverages the JavaScript ecosystem
* *Cons*: requires more setup and coding, higher chance of mistakes with complex operations

### Combining Apify SDK and Crawlee[​](#combining-apify-sdk-and-crawlee "Direct link to Combining Apify SDK and Crawlee")

While these tools are distinct, they can be combined. For example, you can use Crawlee to scrape data from a page and then use the Apify SDK to save that data in an Apify dataset. This integration allows developers to make use of the strengths of both tools while working within the Apify ecosystem (see the short sketch below).

📚 Resources:

* Introduction to [Crawlee](https://www.youtube.com/watch?v=g1Ll9OlFwEQ)
* Crawlee [blog](https://crawlee.dev/blog)
* Webinar on scraping with [Crawlee 101](https://www.youtube.com/watch?v=iAk1mb3v5iI): how to create scrapers in JavaScript and TypeScript
* Step-by-step video guide: [building an Amazon Scraper](https://www.youtube.com/watch?v=yTRHomGg9uQ) in Node.js with Crawlee
* Webinar on how to use [Crawlee Python](https://www.youtube.com/watch?v=ip8Ii0eLfRY)
* Introduction to Apify's [Python SDK](https://www.youtube.com/watch?v=C8DmvJQS3jk)

## Code templates vs. universal scrapers vs. libraries[​](#code-templates-vs-universal-scrapers-vs-libraries "Direct link to Code templates vs. universal scrapers vs. libraries")

Basically, the choice here depends on how much flexibility you need and how much coding you're willing to do. More flexibility → more coding.

[Universal scrapers](https://apify.com/scrapers/universal-web-scrapers) are simple to set up but are less flexible and configurable. Our [libraries](https://crawlee.dev/), on the other hand, enable the development of a standard [Node.js](https://nodejs.org/) or Python application, so be prepared to write a little more code. The reward for that is almost infinite flexibility.

[Code templates](https://apify.com/templates) are a middle ground between scrapers and libraries. But since they are built on the libraries, they still sit closer to the more-coding end of the spectrum: they only give you starter code to begin with.

So please take this into account when choosing the way to build your scraper, and if in doubt, just ask us and we'll help you out.
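To make that combination concrete, here is a minimal sketch of a Crawlee crawler that stores its results through the Apify SDK. The start URL and the `title` selector are placeholders; swap in whatever your target site needs.

```
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Crawlee takes care of the crawling; extract whatever you need from the page
        const title = $('title').text();

        // The Apify SDK saves the result into the run's default dataset
        await Actor.pushData({ url: request.url, title });
    },
});

// Placeholder start URL
await crawler.run(['https://example.com']);

await Actor.exit();
```

Run locally, the SDK writes the results to a local storage folder; run on the Apify platform, the same code stores them in an Apify dataset without any changes.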
## Switching sides: how to transfer an existing solution from another platform[​](#switching-sides-how-to-transfer-an-existing-solution-from-another-platform "Direct link to Switching sides: how to transfer an existing solution from another platform") You can also take advantage of the Apify platform's features without having to modify your existing scraping or automation solutions. ### Integrating Scrapy spiders[​](#integrating-scrapy-spiders "Direct link to Integrating Scrapy spiders") The Apify platform fully supports Scrapy spiders. By [deploying your existing Scrapy code to Apify](https://apify.com/run-scrapy-in-cloud), you can take advantage of features like scheduling, monitoring, scaling, and API access, all without needing to modify your original spider. This process is made easy with the [Apify CLI](https://docs.apify.com/cli/), which allows you to convert your Scrapy spider into an Apify Actor with just a few commands. Once deployed, your spider can run in the cloud, offering a reliable and scalable solution for your web scraping needs. Additionally, you can monetize your spiders by [publishing them as Actors](https://apify.com/partners/actor-developers) on Apify Store, potentially earning passive income from your work while benefiting from the platform’s extensive features. ### ScrapingBee, ScrapingAnt, ScraperAPI[​](#scrapingbee-scrapingant-scraperapi "Direct link to ScrapingBee, ScrapingAnt, ScraperAPI") To make the transition from these platforms easier, we've also created [SuperScraper API](https://apify.com/apify/super-scraper-api). This API is an open-source REST API designed for scraping websites by simply passing a URL and receiving the rendered HTML content in return. This service functions as a cost-effective alternative to other scraping services like ScrapingBee, ScrapingAnt, and ScraperAPI. It supports dynamic content rendering with a headless browser, can use various proxies to avoid blocking, and offers features such as capturing screenshots of web pages. It is ideal for large-scale scraping tasks due to its scalable nature. To use SuperScraper API, you can deploy it with an Apify API token and access it via HTTP requests. The API supports multiple parameters for fine-tuning your scraping tasks, including options for rendering JavaScript, waiting for specific elements, and handling cookies and proxies. It also allows for custom data extraction rules and JavaScript execution on the scraped pages. Pricing is based on actual usage, which can be cheaper or more expensive than competitors, depending on the configuration. 📚 Resources: * [How to integrate Scrapy projects](https://docs.apify.com/cli/docs/integrating-scrapy) * Scrapy monitoring: how to [manage your Scrapy spider on Apify](https://blog.apify.com/scrapy-monitoring-spidermon/) * Run ScrapingBee, ScraperAPI, and ScrapingAnt on Apify — [SuperScraper API Tutorial](https://www.youtube.com/watch?v=YKs-I-2K1Rg) ## General resources[​](#general-resources "Direct link to General resources") * Creating your Actor: [Actor sources](https://docs.apify.com/academy/getting-started/creating-actors) * Use it, build it or buy it? 
[Choosing the right solution on Apify](https://help.apify.com/en/articles/3024655-choosing-the-right-solution)
* How to programmatically retrieve data with the [Apify API](https://www.youtube.com/watch?v=ViYYDHSBAKM&t=0s)
* Improved way to [build your scrapers from a Git repo](https://www.youtube.com/watch?v=8QJetr-BYdQ)
* Webinar on [how to build and monetize Actors](https://www.youtube.com/watch?v=4nxStxC1BJM) on Apify Store
* 6 things you should know before buying or [building a web scraper](https://blog.apify.com/6-things-to-know-about-web-scraping/)
* For a comprehensive guide on creating your first Actor, visit the [Apify Academy](https://docs.apify.com/academy).

---

# Ideas page and its use

So you want to build an Actor and publish it on Apify Store. Where should you start? How can you make people want to use it? To generate new Actor ideas, you can draw from your experience. You can also use SEO tools to discover relevant search terms and explore sites related to web scraping, automation, or integrations. But for direct inspiration straight from Apify, check out our Actor [Ideas page](https://apify.com/ideas) to see what data extraction tools are trending in the Apify community. Let's see how you can both use and contribute to this valuable resource.

***

## What's the Ideas page?[​](#whats-the-ideas-page "Direct link to What's the Ideas page?")

The [Ideas page](https://apify.com/ideas) is where users can submit and explore potential projects for Actors, including scrapers, integrations, and automations. It serves as a collaborative space for proposing new tool ideas and finding inspiration for building and developing web scraping and automation solutions.

## How you, as a developer, can use the Ideas page[​](#how-you-as-a-developer-can-use-the-ideas-page "Direct link to How you, as a developer, can use the Ideas page")

Got an innovative Actor idea or unsure what to build next? The Apify Ideas page is your go-to destination for submitting, developing, and claiming Actor concepts. If you're a developer ready to build an Actor using the Apify Ideas page, here's how you can get involved:

1. *Browse the Ideas page* Check out the [Ideas page](https://apify.com/ideas) to find ideas that interest you. Look for ideas that align with your skills and the kind of Actor you want to build.
2. *Select an idea* Once you've found a promising idea, review the details and requirements provided. If you see an idea you want to develop, make sure to check its current status. If it's marked as **Open to develop**, you're good to go.
3. *Develop your Actor* Start building your Actor based on the idea. You don't need to notify Apify about your development process. Focus on creating a functional and well-documented tool.
4. *Prepare for launch* Once your Actor is ready, ensure it meets all quality standards and has a comprehensive README. This documentation should include installation instructions, usage details, and any other relevant information.
5. *Publish your Actor* Deploy your Actor on Apify Store. Make sure it's live and accessible for users.
6. *Claim your idea* After your Actor is published, send us an email with the URL of your Actor and the original idea. This will allow us to tag the idea as Completed and link it to your new Actor, giving you credit and visibility.
7. *Monitor and optimize* Make sure to monitor your Actor's performance and user feedback. Use this information to make improvements and keep your Actor up to date.
By following these steps, you'll be able to contribute to the community while also gaining recognition for your work.

## Criteria for claiming an idea[​](#criteria-for-claiming-an-idea "Direct link to Criteria for claiming an idea")

To claim an idea, ensure that:

1. Your Actor is functional.
2. Your README contains relevant information.
3. Your Actor closely aligns with the original idea.

## Giving back to the Ideas page[​](#giving-back-to-the-ideas-page "Direct link to Giving back to the Ideas page")

The Ideas page at Apify offers a variety of concepts for scrapers, integrations, and automations, and is a great place to find inspiration or solutions. It's also a platform where you can contribute your own ideas to drive innovation and growth in our community.

1. *Submit your Ideas* Got a great Actor concept? Share it with us through the [Ideas form](https://apify.typeform.com/to/BNON8poB#source=ideas). Provide clear details about what your tool should do and how it should work.
2. *Engage with the community* Upvote ideas you find intriguing. The more support an idea receives, the more likely it is to catch a developer's eye and move forward.
3. *Don't forget to claim your idea* Once your Actor is up and running, claim your idea by emailing us your Actor's URL and the original idea. We'll mark your idea as **Completed** and link it to your Actor, a signal to other developers that this tool already exists on Apify Store.

## Multiple developers for one idea[​](#multiple-developers-for-one-idea "Direct link to Multiple developers for one idea")

No problem! Apify Store can host multiple Actors with similar functions. However, we go by the "first come, first served" rule, so the first developer to claim an idea will receive the **Completed** tag and a link from the Ideas page.

Remember that Apify Store is just like any other marketplace. We believe that competition helps developers thrive and improve upon their code, especially when there are similar scrapers on the horizon! You can still build the Actor, but try to be imaginative when it comes to its set of features.

---

# Actorization playbook

Apify is a cloud platform with a [marketplace](https://apify.com/store) of 5,000+ web scraping and automation tools called *Actors*. These tools are used for extracting data from social media, search engines, maps, e-commerce sites, travel portals, and general websites. Most Actors are developed by a global creator community, and some are developed by Apify. We have 18k monthly active users/developers on the platform (growing 138% YoY). Last month, we paid out $170k to creators (growing 118% YoY), and in total, over the program's history, we paid out almost $2M to them.

## What are Actors[​](#what-are-actors "Direct link to What are Actors")

Under the hood, Actors are programs packaged as Docker images that accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output. This makes it easy to auto-generate user interfaces for Actors and integrate them with one another or with external systems. For example, we have user-friendly integrations with Zapier, Make, LangChain, MCP, OpenAPI, and SDKs for TypeScript/Python, CLI, etc.

Actors are a new way to build reusable serverless micro-apps that are easy to develop, share, integrate, and build upon—and, importantly, monetize. While Actors are our invention, we're in the process of making them an open standard. Learn more at [https://whitepaper.actor](https://whitepaper.actor/).
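To illustrate that input/output contract, here is what a hypothetical product-scraping Actor might accept and produce; the field names are examples, not a required schema. The input is rendered as a form in the Actor's UI:

```
{
    "startUrls": [{ "url": "https://example.com/category/shoes" }],
    "maxItems": 100
}
```

and each item the Actor pushes to its output dataset might look like:

```
{
    "url": "https://example.com/product/123",
    "title": "Example running shoe",
    "price": 79.99
}
```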
While most Actors on our marketplace are web scrapers or crawlers, there are more and more Actors for other use cases, including data processing, web automation, API backends, and [AI agents](https://apify.com/store/categories/agents). In fact, any piece of software that accepts input, performs a job, and can run in Docker can be *Actorized* simply by adding an `.actor` directory to it with a couple of JSON files.

## Why Actorize[​](#why-actorize "Direct link to Why Actorize")

By publishing your service or project on [Apify Store](https://apify.com/store), your project will benefit from:

1. *Expanded reach*: Your tool instantly becomes available to Apify's user community and connects with popular automation platforms like [Make](https://www.make.com), [n8n](https://n8n.io/), and [Zapier](https://zapier.com/).
2. *Multiple monetization paths*: Choose from flexible pricing models (monthly subscriptions, pay-per-result, or pay-per-event).
3. *AI integration*: Your Actor can serve as a tool for AI agents through Apify's MCP (Model Context Protocol) server, creating new use cases and opportunities while you earn 80% of all revenues.

**Open-source benefits**: For open-source developers, Actorization adds value without extra costs:

* Host your code in the cloud for easy user trials (no local installs needed).
* Avoid managing cloud infrastructure—users cover the costs.
* Earn income through [Apify’s Open Source Fair Share program](https://apify.com/partners/open-source-fair-share) via GitHub Sponsors or direct payouts.
* Publish and monetize 10x faster than building a micro-SaaS, with Apify handling infra, billing, and access to 700,000+ monthly visitors and 70,000 signups.

For example, IBM's [Docling project](https://github.com/docling-project/docling) merged our pull request that actorized their open-source GitHub repo (24k stars) and added the Apify Actor badge to the README:

![Docling Apify badge](/assets/images/docling-apify-badge-3b6ad8beefffa23d0ffcc9bc92d593bb.png)

### Example Actorized projects[​](#example-actorized-projects "Direct link to Example Actorized projects")

You can Actorize a wide range of projects, from open-source libraries and existing SaaS services to MCP servers:

| Name | Type |
| --- | --- |
| Parsera | SaaS service |
| Monolith | Open source library |
| Crawl4AI | Open source library |
| Docling | Open source library |
| Playwright MCP | Open source MCP server |
| Browserbase MCP | SaaS MCP server |

### What projects are suitable for Actorization[​](#what-projects-are-suitable-for-actorization "Direct link to What projects are suitable for Actorization")

Use these criteria to decide if your project is a good candidate for Actorization:

1. *Is it self-contained?* Does the project work non-interactively, with a well-defined, preferably structured input and output format? Positive examples include various data processing utilities, web scrapers, and other automation scripts. Negative examples are GUI applications or applications that run indefinitely. If you want to run HTTP APIs on Apify, you can do so using [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby.md).
2. *Can the state be stored in Apify storages?* If the application has state that can be stored in a small number of files, it can use the [key-value store](https://docs.apify.com/platform/storage/key-value-store.md); if it processes a series of records, those can be stored in Apify's [request queue](https://docs.apify.com/platform/storage/request-queue.md). If the output consists of one or many similar JSON objects, it can use a [dataset](https://docs.apify.com/platform/storage/dataset.md).
3. *Can it be containerized?* The project needs to be able to run in a Docker container. Apify currently does not support GPU workloads. External services (e.g., databases) need to be managed by the developer.
4. *Can it use Apify tooling?* JavaScript/TypeScript and Python applications can be Actorized with the help of the [Apify SDK](https://docs.apify.com/sdk.md), which makes it easy for your code to interact with the Apify platform. Applications that can be run from the command line can also be Actorized by writing a simple shell script that retrieves the user input using the [Apify CLI](https://docs.apify.com/cli), runs your application, and sends the results back to Apify (also using the CLI). If your application is implemented differently, you can still call the [Apify API](https://docs.apify.com/api/v2.md) directly - it's just HTTP, and pretty much every language has support for that, but the implementation is less straightforward.

## Actorization guide[​](#actorization-guide "Direct link to Actorization guide")

This guide outlines the steps to convert your application into an Apify [Actor](https://docs.apify.com/platform/actors.md). Follow the documentation links for detailed information - this guide provides an overview rather than exhaustive instructions.

### 1. Add Actor metadata - the `.actor` folder[​](#1-add-actor-metadata---the-actor-folder "Direct link to 1-add-actor-metadata---the-actor-folder")

The Apify platform requires your Actor repository to have a `.actor` folder at the root level, which contains the metadata needed to build and run the Actor. For existing projects, you can add the `.actor` folder using the [`apify init` CLI command](https://docs.apify.com/cli/docs/reference#apify-init-actorname). In case you're starting a new project, we strongly advise starting with a [template](https://apify.com/templates) using the [`apify create` CLI command](https://docs.apify.com/cli/docs/reference#apify-create-actorname) based on your use case:

* [TypeScript template](https://apify.com/templates/ts-empty)
* [Python template](https://apify.com/templates/python-empty)
* [CLI template](https://apify.com/templates/cli-start)
* [MCP server template](https://apify.com/templates/python-mcp-server)
* …and many others; see [https://apify.com/templates](https://apify.com/templates) for a comprehensive list

**Quick Start for beginners**: For a step-by-step introduction to creating your first Actor (including tech stack choices and development paths), see [Quick Start](https://docs.apify.com/platform/actors/development/quick-start.md).

The newly created `.actor` folder contains an `actor.json` file - a manifest of the Actor. See the [documentation](https://docs.apify.com/platform/actors/development/actor-definition/actor-json.md) for more details.

You must also make sure your Actor has a Dockerfile and that it installs everything needed to successfully run your application. Check out the [Dockerfile documentation](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md) by Apify.
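For illustration only, a minimal `.actor/actor.json` manifest for a Node.js project might look like this; the name and file paths are examples, and the authoritative list of fields is in the documentation linked above:

```
{
    "actorSpecification": 1,
    "name": "my-actorized-tool",
    "version": "0.1",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile"
}
```

The referenced `input_schema.json` describes the Actor input (covered in the next step). A matching Dockerfile can be as small as the following sketch, which assumes a hypothetical `main.js` entry point and one of Apify's Node.js base images:

```
# Apify's Node.js base image; any image that can run your application works too
FROM apify/actor-node:20

# Install production dependencies first to make use of Docker layer caching
COPY package*.json ./
RUN npm install --omit=dev

# Copy the rest of the source code
COPY . ./

# Launch the wrapper script when the Actor starts
CMD ["node", "main.js"]
```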
If you don't want to use Apify's base images, you are free to use any image as the base of your Actor. When launching the Actor, the Apify platform will simply run your Docker image. This means that a) you need to configure the `ENTRYPOINT` and `CMD` directives so that they launch your application, and b) you can test your image locally using Docker.

These steps are the bare minimum you need to run your code on Apify. The rest of the guide will help you flesh it out.

### 2. Define input and output[​](#2-define-input-and-output "Direct link to 2. Define input and output")

Most Actors accept an input and produce an output. As part of Actorization, you need to define the input and output structure of your application. For detailed information, read the docs for [input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md), [dataset schema](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md), and general [storage](https://docs.apify.com/platform/storage.md).

#### Design guidelines[​](#design-guidelines "Direct link to Design guidelines")

1. If your application has some arguments or options, those should be part of the input defined by the input schema.
2. If there is a configuration file, or if your application is configured with environment variables, those should also be part of the input. Ideally, nested structures should be "unpacked", i.e., try not to accept deeply nested structures in your input. Start with fewer input options and expand later.
3. If the output is a single file, you'll probably want your Actor to output a single dataset item that contains a public URL to the output file stored in the Apify key-value store.
4. If the output has a table-like structure or a series of JSON-serializable objects, you should output each row or object as a separate dataset item.
5. If the output is a single key-value record, your Actor should return a single dataset item.

### 3. Handle state persistence (optional)[​](#3-handle-state-persistence-optional "Direct link to 3. Handle state persistence (optional)")

If your application performs a number of well-defined subtasks, the [request queue](https://docs.apify.com/platform/storage/request-queue.md) lets you pause and resume execution on job restart. This is important for long-running jobs that might be migrated between servers at some point. In addition, this allows the Apify platform to display the progress to your users in the UI.

A lightweight alternative to the request queue is simply storing the state of your application as a JSON object in the key-value store and checking for it when your Actor is starting. Fully-fledged Actors will often combine these two approaches for maximum reliability. You can find more on this topic in the [state persistence](https://docs.apify.com/platform/actors/development/builds-and-runs/state-persistence.md) article.

### 4. Write Actorization code[​](#4-write-actorization-code "Direct link to 4. Write Actorization code")

Perhaps the most important part of the Actorization process is writing the code that will be executed when the Apify platform launches your Actor. Unless you're writing an application targeted directly at the Apify platform, this will typically be a thin wrapper script that calls your existing code and integrates it with the Apify storages.
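As a minimal sketch (assuming a hypothetical `runMyTool()` function exported by your existing codebase), such a wrapper in JavaScript could look like this:

```
import { Actor } from 'apify';
// Hypothetical entry point of your existing application
import { runMyTool } from './my-tool.js';

await Actor.init();

// Read the input defined by your input schema
const input = await Actor.getInput();

// Optionally restore state saved by a previous run (see the state persistence step above)
const previousState = (await Actor.getValue('STATE')) ?? {};

// Run your existing code
const results = await runMyTool(input, previousState);

// Store the structured results in the default dataset
await Actor.pushData(results);

await Actor.exit();
```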
Apify provides SDKs for [JavaScript](https://docs.apify.com/sdk/js) and [Python](https://docs.apify.com/sdk/python), plus the [Apify CLI](https://docs.apify.com/cli), which allows easy interaction with the Apify platform from the command line. Check out the [programming interface](https://docs.apify.com/platform/actors/development/programming-interface.md) documentation article for details on interacting with the Apify platform in your Actor's code.

### 5. Deploy the Actor[​](#5-deploy-the-actor "Direct link to 5. Deploy the Actor")

You can deploy your Actor to the Apify platform with the `apify push` command of the [Apify CLI](https://docs.apify.com/cli). For details, see the [deployment](https://docs.apify.com/platform/actors/development/deployment.md) documentation.

### 6. Publish and monetize[​](#6-publish-and-monetize "Direct link to 6. Publish and monetize")

For details on publishing the Actor in [Apify Store](https://apify.com/store), see [Publishing and monetization](https://docs.apify.com/platform/actors/publishing.md). You can also follow our guide on [How to create an Actor README](https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/how-to-create-an-actor-readme.md) and the [Actor marketing playbook](https://docs.apify.com/academy/actor-marketing-playbook.md).

---

# Advanced web scraping

In the [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md) course, we learned the basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us solve most of the problems we will face.

In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.

## What does production-ready mean[​](#what-does-production-ready-mean "Direct link to What does production-ready mean")

To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages, and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are.

We will also touch on monitoring, performance, anti-scraping protections, and debugging. If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎

## First up[​](#first-up "Direct link to First up")

First, we will explore the [advanced crawling section](https://docs.apify.com/academy/advanced-web-scraping/crawling/sitemaps-vs-search.md), which will help us find all pages or products on the website.

---

# Crawling sitemaps

In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will go in depth on how to crawl sitemaps. We will look at the following topics:

* How to find sitemap URLs
* How to set up HTTP requests to download sitemaps
* How to parse URLs from sitemaps
* Using Crawlee to get all URLs in a few lines of code

## How to find sitemap URLs[​](#how-to-find-sitemap-urls "Direct link to How to find sitemap URLs")

Sitemaps are commonly restricted to a maximum of 50,000 URLs, so usually there will be a whole list of them.
There can be a master sitemap containing URLs of all other sitemaps, or the sitemaps might simply be indexed in `robots.txt` and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.

### Google[​](#google "Direct link to Google")

You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself, which is rather uncommon.

### robots.txt[​](#robots-txt "Direct link to robots.txt")

If the website has a `robots.txt` file, it often contains sitemap URLs. The sitemap URLs are usually listed under the `Sitemap:` directive.

### Common URL paths[​](#common-url-paths "Direct link to Common URL paths")

You can check some common URL paths, such as the following:

* `/sitemap.xml`
* `/product_index.xml`
* `/product_template.xml`
* `/sitemap_index.xml`
* `/sitemaps/sitemap_index.xml`
* `/sitemap/product_index.xml`
* `/media/sitemap.xml`
* `/media/sitemap/sitemap.xml`
* `/media/sitemap/index.xml`

Also make sure you test the list with `.gz`, `.tar.gz`, and `.tgz` extensions and by capitalizing the words (e.g. `/Sitemap_index.xml.tar.gz`).

Some websites also provide an HTML version, to help indexing bots find new content. Those include:

* `/sitemap`
* `/category-sitemap`
* `/sitemap.html`
* `/sitemap_index`

Apify provides the [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), an open-source Actor that scans these URL variations automatically for you so that you don't have to check them manually.

## How to set up HTTP requests to download sitemaps[​](#how-to-set-up-http-requests-to-download-sitemaps "Direct link to How to set up HTTP requests to download sitemaps")

For most sitemaps, you can make a single HTTP request and parse the downloaded XML text. Some sitemaps are compressed and have to be streamed and decompressed. The code can get fairly complicated, but scraping frameworks, such as [Crawlee](#using-crawlee), can do this out of the box.

## How to parse URLs from sitemaps[​](#how-to-parse-urls-from-sitemaps "Direct link to How to parse URLs from sitemaps")

Use your favorite XML parser to extract the URLs from inside the `<loc>` tags. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact`, or various special category sections). For specific code examples, see [our Node.js guide](https://docs.apify.com/academy/node-js/scraping-from-sitemaps.md).

## Using Crawlee[​](#using-crawlee "Direct link to Using Crawlee")

Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), a scraping framework which has rich traversing and parsing support for sitemaps. It can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all the URLs in a few lines of code:

```
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://www.mysite.com');

const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
```

## Next up[​](#next-up "Direct link to Next up")

That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic - search, filters, and pagination.

---

# Scraping websites with search

In this lesson, we will start with a simpler example of scraping HTML-based websites with limited pagination. Limiting pagination is a common practice on e-commerce sites.
It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.

*Example: pagination on a Google search results page.*

> In a rush? Skip the tutorial and get the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters).

## How to overcome the limit[​](#how-to-overcome-the-limit "Direct link to How to overcome the limit")

Websites usually limit the pagination of a single (sub)category to somewhere between 1,000 and 20,000 listings, while the site might have over a million listings in total. Without a proven algorithm, scraping all listings would be a very manual process and nearly impossible to complete. We will first look at a couple of ideas that don't work so well and then present the [final robust solution](#using-filter-ranges).
### Going deeper into subcategories[​](#going-deeper-into-subcategories "Direct link to Going deeper into subcategories") This is usually the first solution that comes to mind. You traverse the smallest subcategories and hope that those are below the pagination limits. Unfortunately, there are two big problems with this approach: 1. Any subcategory might be bigger than the pagination limit. 2. Some listings from the parent category might not be present in any subcategory. While you can often manually test if the second problem is true on the site, the first problem is a hard blocker. You might be just lucky, and it may work on this site but usually, traversing subcategories is not enough. It can be used as a first step of the solution but not as the solution itself. ### Using filters[​](#using-filters "Direct link to Using filters") Most websites also provide a way for the user to select search filters. These allow a more granular level of search than categories and can be combined with them. Common filters allow you to select a **color**, **size**, **location** and similar attributes. At first, it might seem like an easy solution. Enqueue all possible filter combinations and that should be so granular that it will never hit a pagination limit. Unfortunately, this solution is still far from good. 1. No guarantee that some products won't slip through the chosen filter combinations. 2. The resulting split might be too granular and end up having too many tiny paginations with many duplicate products. This leads to scraping a lot more pages than necessary and makes analytics much harder. ### Using filter ranges[​](#using-filter-ranges "Direct link to Using filter ranges") The best option is to use only a specific type of filter that can be used as a range. The most common one is **price range** but there may be others like the apartment size, etc. You can split the pagination pages to only contain listings within that range, e.g. products costing between $10 and $20. This has several benefits: 1. All listings can eventually be found in a range. 2. The ranges do not overlap, so we scrape the smallest possible number of pages and avoid duplicate listings. 3. Ranges can be controlled by a generic algorithm that can be reused for different sites. ## Splitting pages with range filters[​](#splitting-pages-with-range-filters "Direct link to Splitting pages with range filters") In the previous section, we analyzed different options to split the pages to overcome the pagination limit. We have chosen range filters as the most reliable way to do that. In this section, we will discuss a generic algorithm to work with ranges, look at a few special cases and then write an example crawler. ![An example of range filters on a website](/assets/images/pagination-filters-ad8028367191ccc8ad1c7835e3f21067.png) ### The algorithm[​](#the-algorithm "Direct link to The algorithm") The core algorithm can be used on any (even overlapping) range. This is a simplified presentation, we will discuss the details later. 1. We choose a few pivot ranges with a similar number of products and enqueue them. For example, **$0-$10**, **$100-$1000**, **$1000-$10000**, **$10000-**. 2. For each range, we open the page and check if the listings are below the limit. If yes, we continue to step 3. If not, we split the filter in half, e.g. **$0-$10** to **$0-$5** and **$5-$10** and enqueue those again. We recursively repeat step **2** for each range as long as needed. 3. 
We now have a pagination URL that is below the limit. We enqueue it under a pagination label and start enqueuing products.

Because the algorithm is recursive, we don't need to think about how big the final ranges should be; the algorithm will find them over time.

### Special cases to look for[​](#special-cases-to-look-for "Direct link to Special cases to look for")

We have the base algorithm, but before we start coding, let's answer a few questions to get more insight.

#### Can the ranges overlap?[​](#can-the-ranges-overlap "Direct link to Can the ranges overlap?")

Some sites will allow you to construct non-overlapping ranges. For example, you can set the ranges with cents, e.g. **$0-$4.99**, **$5-$9.99**, etc. If that is possible, create the pivot ranges this way, too. Non-overlapping ranges remove the possibility of duplicate products (unless a [listing has multiple values](#can-a-listing-have-more-values)) and give you the lowest number of pages to scrape.

If the website supports only overlapping ranges (e.g. **$0-$5**, **$5-$10**), it is not a big problem. Only a small portion of the listings will be duplicates, and they can be removed using a [request queue](https://docs.apify.com/platform/storage/request-queue.md).

#### Can a listing have more values?[​](#can-a-listing-have-more-values "Direct link to Can a listing have more values?")

In rare cases, a listing can have more than one value that you are filtering in a range. A typical example is Amazon, where each product has several offers and those offers have different prices. If any of those offers is within the range, the product is shown. There is no easy way to get around this, but the price range split works even with duplicate listings; use a [JS set](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set) or a request queue to deduplicate them (see the short sketch below).

#### How is the range passed to the URL?[​](#how-is-the-range-passed-to-the-url "Direct link to How is the range passed to the URL?")

In the easiest case, you can pass the range directly in the page's URL. For example, `https://example.com/products?price=0-10`. Sometimes, you will need to do some query composition because the price range might be encoded together with more information into a single parameter.

Some sites don't have page URLs with filters and instead load the filtered products via [XHRs](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest). Those can be GET or POST requests with varying **URL** and **payload** syntax. The nice thing here is that if you get to understand how their internal API works, you can have it return more products per page or extract full product details just from this single request. In addition, XHRs are smaller and faster than loading an HTML page. On the other hand, you should not overly abuse them (by setting overly large limits), as this can expose you.

#### Does the website show the number of products for each filtered page?[​](#does-the-website-show-the-number-of-products-for-each-filtered-page "Direct link to Does the website show the number of products for each filtered page?")

If it does, it's a nice bonus. It gives us a way to check if we are over or below the pagination limit and helps with analytics. If it doesn't, we have to find a different way to check if the number of listings is within a limit. One option is to go to the last allowed page of the pagination. If that page is still full of products, we can assume the filter is over the limit.
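The deduplication mentioned above takes only a few lines with a JS `Set`. In this sketch, `productId` is just an example of whatever unique attribute your target site exposes:

```
// products: an array of scraped items, each with a unique identifier
const deduplicate = (products) => {
    const seen = new Set();
    return products.filter((product) => {
        // Skip items we have already kept, e.g. from an overlapping range
        if (seen.has(product.productId)) return false;
        seen.add(product.productId);
        return true;
    });
};
```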
#### How to handle (open) ends of the range[​](#how-to-handle-open-ends-of-the-range "Direct link to How to handle (open) ends of the range")

Logically, every full (price) range starts at 0 and ends at infinity. But the way this is encoded will differ on each site. The end of the price range can be either closed (0) or open (infinity). Open ranges require special handling when you split them (we will get to that).

Most sites will let you start with 0 (there might be exceptions, where you will have to make the start open), so we can use just that. The high end is more complicated. Because you don't know the biggest price, it is best to leave it open and handle it specially. Internally, you can assign `null` to the value. Here are a few examples of a query parameter with an open and closed high-end range:

* Open: `p:100-` (higher than 100), Closed: `p:100-200` (between 100 and 200)
* Open: `min_price=100`, Closed: `min_price=100&max_price=200`

#### Can the range exceed the limit on a single value?[​](#can-the-range-exceed-the-limit-on-a-single-value "Direct link to Can the range exceed the limit on a single value?")

In very rare cases, a site will have so many listings that a single value (e.g. **$100** or **$4.99**) will include a number of listings over the limit. [The basic algorithm](#the-algorithm) will recurse until the **min** value equals the **max** value and then stop, because it cannot split that single value anymore. In this rare case, you will need to combine the range with other filters to get an even deeper split.

### Implementing a range filter[​](#implementing-a-range-filter "Direct link to Implementing a range filter")

This section shows a code example implementing our solution for an imaginary website. Writing a real solution will bring up more complex problems, but the previous section should prepare you for some of them.

First, let's define our imaginary site:

* It has a single `/products` path that contains all the products that we want to scrape.
* **Max** pagination limit is **1000**.
* The site contains over a million products.
* It allows for filtering over a price range with query parameters `min_price` and `max_price`.
* If `min_price` or `max_price` are not defined, it opens that end of the range (all products up to or all products over that).
* The site allows specifying the price in cents.
* Pagination is done via the `page` query parameter.

#### Define and enqueue pivot ranges[​](#define-and-enqueue-pivot-ranges "Direct link to Define and enqueue pivot ranges")

This step is not strictly necessary, but it is useful: it means the algorithm doesn't have to start by splitting ranges that are far too large or too small.

```
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const MAX_PRODUCTS_PAGINATION = 1000;

// Just an example, choose what makes sense for your site
const PIVOT_PRICE_RANGES = [
    { min: 0, max: 9.99 },
    { min: 10, max: 99.99 },
    { min: 100, max: 999.99 },
    { min: 1000, max: 9999.99 },
    { min: 10000, max: null }, // open-ended
];

// Let's create a helper function for creating the filter URLs, you can move those to a utils.js file
const createFilterUrl = ({ min, max }) => {
    const minString = `min_price=${min}`;
    // We don't want to pass the parameter at all if it is null (open-ended)
#### Define the logic for the `FILTER` page[​](#define-the-logic-for-the-filter-page "Direct link to define-the-logic-for-the-filter-page")

```
import { CheerioCrawler } from 'crawlee';

// Doesn't matter what Crawler class we choose
const crawler = new CheerioCrawler({
    // Crawler options here
    // ...
    async requestHandler({ request, $ }) {
        const { label } = request;
        if (label === 'FILTER') {
            // Of course, change the selectors and make it more robust
            const numberOfProducts = Number($('.product-count').text());

            // The filter is either good enough or we have to split it
            if (numberOfProducts <= MAX_PRODUCTS_PAGINATION) {
                // We are under the limit, enqueue the first pagination page for this filter
                await crawler.addRequests([{
                    url: `${request.url}&page=1`,
                    label: 'PAGINATION',
                }]);
            } else {
                // The filter is too wide, we have to split it
                // (the code from the "Enqueue the filters" example below goes here)
            }
        }
    },
});
```

The splitting itself is done by a generic `splitFilter` function that takes a filter range and returns two halves:

```
const splitFilter = (filter) => {
    const { min, max } = filter;
    if (max !== null && min > max) {
        throw new Error(`WRONG FILTER - min(${min}) is greater than max(${max})`);
    }

    // We create a middle value for the split. If max is null, we will use double min as the middle value
    const middle = max
        ? min + Math.floor((max - min) / 2)
        : min * 2;

    // We have to do the Math.max and Math.min to prevent having min > max
    const filterMin = {
        min,
        max: Math.max(middle, min),
    };
    const filterMax = {
        min: max ? Math.min(middle + 1, max) : middle + 1,
        max,
    };

    // We return 2 new filters
    return [filterMin, filterMax];
};
```

#### Enqueue the filters[​](#enqueue-the-filters "Direct link to Enqueue the filters")

Let's finish the crawler now. This code example will go inside the `else` block of the previous crawler example.

```
const { min, max } = getFiltersFromUrl(request.url);

// Our generic splitFilter function doesn't account for decimal values,
// so we will have to convert to cents and back to dollars.
// An open-ended (null) max has to stay open-ended.
const newFilters = splitFilter({
    min: min * 100,
    max: max === null ? null : max * 100,
});

// And we enqueue those 2 new filters so the process will recursively repeat
// until all pages get to the PAGINATION phase
const requestsToEnqueue = [];
for (const filter of newFilters) {
    requestsToEnqueue.push({
        // Remember that we have to convert back from cents to dollars
        url: createFilterUrl({
            min: filter.min / 100,
            max: filter.max === null ? null : filter.max / 100,
        }),
        label: 'FILTER',
    });
}
await crawler.addRequests(requestsToEnqueue);
```

## Summary[​](#summary "Direct link to Summary")

And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](https://docs.apify.com/academy/expert-scraping-with-apify/saving-useful-stats.md). This will let you know what filters you went through and how many products each of them had.

Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters).

---

# Sitemaps vs search

The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites.
We just open the home page or category pages and paginate to the end as we did in the [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md) course.

Unfortunately, *most modern websites restrict pagination* to somewhere between 1 and 10,000 products. Solving this problem might seem relatively straightforward at first, but there are multiple hurdles that we will explore in this lesson.

There are two main approaches to solving this problem:

* Extracting all page URLs from the website's *sitemap*.
* Using **categories, search and filters** to split the website so we get under the pagination limit.

Both of these approaches have their pros and cons, so the best solution is to *use both and combine the results*. Here we will learn why.

## Pros and cons of sitemaps[​](#pros-and-cons-of-sitemaps "Direct link to Pros and cons of sitemaps")

A sitemap is usually a simple XML file that contains a list of all pages on the website. Sitemaps are created and maintained mainly for search engines like Google to help ensure that the website gets fully indexed there. They are commonly located at URLs like `https://example.com/sitemap.xml` or `https://example.com/sitemap.xml.gz`. We will get to work with sitemaps in the next lesson.

### Pros[​](#pros "Direct link to Pros")

* *Quick to set up* - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code.
* *Fast to run* - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds.
* *Usually complete* - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website.

### Cons[​](#cons "Direct link to Cons")

* *Does not directly reflect the website* - There is no way you can ensure that all pages on the website are in the sitemap. The sitemap can also contain pages that were already removed and will return 404s. This is a major downside of sitemaps which prevents us from using them as the only source of URLs.
* *Updated in intervals* - Sitemaps are usually not updated in real time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
* *Hard to find or unavailable* - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
* *Streamed, compressed, and archived* - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code or use a scraping framework.

## Pros and cons of categories, search, and filters[​](#pros-and-cons-of-categories-search-and-filters "Direct link to Pros and cons of categories, search, and filters")

This approach means traversing the website like a normal user does: going through categories and setting up different filters, ranges, and sorting options. The goal is to ensure that we cover all categories or ranges where products can be located, and that for each of those we stay under the pagination limit.

The pros and cons of this approach are pretty much the opposite of relying on sitemaps.

### Pros[​](#pros-1 "Direct link to Pros")

* *Directly reflects the website* - With most scraping use-cases, we want to analyze the website as the regular users see it.
By going through the intended user flow, we ensure that we are getting the same pages as the users. * *Updated in real-time* - The website is updated in real-time so we can be sure that we are getting all pages. * *Often contain detailed data* - While sitemaps are usually just a list of URLs, categories, searches and filters often contain additional data like product names, prices, categories, etc, especially if available via JSON API. This means that we can sometimes get all the data we need without going to the detail pages. ### Cons[​](#cons-1 "Direct link to Cons") * *Complex to set up* - The logic to traverse the website is usually complex and can take a lot of time to get right. We will get to this in the next lessons. * *Slow to run* - The traversing can require a lot of requests. Some filters or categories will have products we already found. * *Not always complete* - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The tools we'll build in the following lessons will help us with this. ## Do we know how many products there are?[​](#do-we-know-how-many-products-there-are "Direct link to Do we know how many products there are?") Most websites list a total number of detail pages somewhere. It might be displayed on the home page, search results, or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell if our approach to scrape all succeeded or if we still need to refine it. Some sites, like Amazon, do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the following lessons as well. ## Next up[​](#next-up "Direct link to Next up") Next, we will look into [sitemap crawling](https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-sitemaps.md). After that we will go through all the intricacies of the category, search and filter crawling, and build up tools implementing a generic approach that we can use on any website. At last, we will combine the results of both and set up monitoring and persistence to ensure we can run this regularly without any manual controls. --- # Tips and tricks for robustness **Learn how to make your automated processes more effective. Avoid common web scraping and web automation pitfalls, future-proof your programs and improve your processes.** *** This collection of tips and tricks aims to help you make your scrapers work smoother and produce fewer errors. ## Proofs and verification[​](#proofs-and-verification "Direct link to Proofs and verification") **Absence of evidence ≠ evidence of absence**. Make sure output remains consistent regardless of any changes at the target host/website: * Always base all important checks on the **presence** of proof. * Never build any important checks on the **absence** of anything. The absence of an expected element or message does **not** prove an action has been (un)successful. The website might have been updated or expected content may no longer exist in the original form. The **action relying on the absence** of something might still be failing. Instead, it must rely on **proof of presence**. **Good**: Rely on the presence of an element or other content confirming a successful action. 
```
async function isPaymentSuccessful() {
    try {
        await page.waitForSelector('#PaymentAccepted');
    } catch (error) {
        return OUTPUT.paymentFailure;
    }
    return OUTPUT.paymentSuccess;
}
```

**Avoid**: Relying on the absence of an element that may have been updated or changed.

```
async function isPaymentSuccessful() {
    const $paymentAmount = await page.$('#PaymentAmount');
    if (!$paymentAmount) return OUTPUT.paymentSuccess;
}
```

## Presumption of failure[​](#presumption-of-failure "Direct link to Presumption of failure")

**Every action has failed until it has provably succeeded.**

Always assume an action has failed before having a proof of success. Always verify important steps to avoid false positives or false negatives.

* False positive = **false / failed** outcome reported as **true / successful** on output.
* False negative = **true / successful** outcome reported as **false / failed** on output.

Assuming any action has been successful without direct proof is dangerous. Disprove failure actively through proof of success instead. Only then consider output valid and verified.

**Good**: Verify outcome through proof. Clearly disprove failure of an important action.

```
async function submitPayment() {
    await Promise.all([
        page.click('submitPayment'),
        page.waitForNavigation(),
    ]);

    try {
        await page.waitForFunction(
            (selector) => document.querySelector(selector).innerText.includes('Payment Success'),
            { polling: 'mutation' },
            '#PaymentOutcome',
        );
    } catch (error) {
        return OUTPUT.paymentFailure;
    }

    return OUTPUT.paymentSuccess;
}
```

**Avoid**: Not verifying an outcome. It can fail despite output claiming otherwise.

```
async function submitPayment() {
    await Promise.all([
        page.click('submitPayment'),
        page.waitForNavigation(),
    ]);

    return OUTPUT.paymentSuccess;
}
```

## Targeting elements[​](#targeting-elements "Direct link to Targeting elements")

Be both as specific and as generic as possible at the same time.

### DOM element selectors[​](#dom-element-selectors "Direct link to DOM element selectors")

Make sure your [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors) have the best chance to remain valid after a website is updated.

* Prefer [**higher-specificity**](https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity) selectors over lower-specificity ones (**#id** over **.class**).
* Use [**attribute selectors**](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to search parts of attributes (prefix, suffix, etc.).
* Use element attributes with the **lowest probability of a future change**.
* Completely **avoid or strip** selectors of values that are clearly **random**.
* Completely **avoid or strip** selectors of values that are clearly **flexible**.
* **Extend low-specificity** selectors to reduce the probability of **collisions**.

Below is an example of stripping away too-specific parts of a selector that are likely random or subject to change.

`#P_L_v201w3_t3_ReceiptToolStripLabel` => `a[id*="ReceiptToolStripLabel"]`

If you are reasonably confident a page layout will remain without any dramatic future changes **and** need to increase the selector specificity to reduce the chance of a collision with other selectors, you can extend the selector as per the principle below.

`#ReceiptToolStripLabel_P_L_v201w3_t3` => `table li > a[id^="ReceiptToolStripLabel"]`

### Content pattern matching[​](#content-pattern-matching "Direct link to Content pattern matching")

Matching elements by content is already natively supported by [Playwright](https://playwright.dev/). Playwright is a [Node.js](https://nodejs.org/en/) library that allows you to automate Chromium, Firefox and WebKit with a single API. In [Puppeteer](https://pptr.dev/), you can use custom utility functions to [polyfill](https://developer.mozilla.org/en-US/docs/Glossary/Polyfill) this functionality.
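As a rough illustration (the selectors and button text below are hypothetical, and this is just one way to do it), a Puppeteer polyfill can match elements by their text inside `page.evaluateHandle`, whereas Playwright exposes the same idea natively through `page.getByText()`:

```
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// In Playwright, the equivalent would be: page.getByText('Continue to payment')
// In Puppeteer, a small helper evaluated in the page context can do the same job.
const findByText = (text) => page.evaluateHandle((t) => {
    const elements = [...document.querySelectorAll('a, button')];
    // Return the first element whose visible text contains the given string
    return elements.find((el) => el.textContent.includes(t)) ?? null;
}, text);

const paymentButton = await findByText('Continue to payment');
if (paymentButton.asElement()) {
    await paymentButton.asElement().click();
}

await browser.close();
```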
## Event-bound flows[​](#event-bound-flows "Direct link to Event-bound flows")

Always strive to make code as fluid as possible. Listen to events and react to them as needed by triggering consecutive actions immediately.

* **Avoid** any **fixed-duration** delays wherever possible.
* Prefer fluid flow based on the **occurrence of events**.

```
// Avoid:
await page.waitForTimeout(timeout);

// Good:
await page.waitForFunction(myFunction, options, args);

// Good:
await page.waitForFunction(() => {
    return window.location.href.includes('path');
});

// Good:
await page.waitForFunction(
    (selector) => document.querySelector(selector).innerText,
    { polling: 'mutation' },
    '[data-qa="btnAppleSignUp"]',
);
```

---

# Build and monetize AI agents on Apify

**This guide shows you how to create an AI agent using the CrewAI Python framework and the Apify platform. You will build an Instagram analysis agent that integrates with large language models (LLMs) and web scrapers.**

***

AI agents are goal-oriented systems that make independent decisions. They interact with environments using predefined tools and workflows to automate complex tasks. On Apify, AI agents are built as Actors—serverless cloud programs for web scraping, data processing, and AI deployment. Apify evolved from running scrapers in the cloud to supporting LLMs that follow predefined workflows with dynamically defined goals.

## Prerequisites[​](#prerequisites "Direct link to Prerequisites")

To build an effective AI agent, you need prompts to guide it, tools for external interactions, a large language model (LLM) to connect the components, an agentic framework to handle LLM behavior, and a platform to run, deploy, and scale the solution.

## Benefits of using Apify for AI agents[​](#benefits-of-using-apify-for-ai-agents "Direct link to Benefits of using Apify for AI agents")

Apify provides a complete platform for building and deploying AI agents with the following benefits:

* *Serverless execution* - without infrastructure management
* *Stateful execution* - with agent memory capabilities
* *Monetization options* - through usage-based charging
* *Extensive tool ecosystem* - with thousands of available Actors
* *Scalability and reliability* - for production environments
* *Pre-integrated tools* - for web scraping and automation

## Building an AI agent[​](#building-an-ai-agent "Direct link to Building an AI agent")

### Step 1: Define the use case[​](#step-1-define-the-use-case "Direct link to Step 1: Define the use case")

This tutorial creates a social media analysis agent that analyzes Instagram posts based on user queries using the [Instagram Scraper Actor](https://apify.com/apify/instagram-scraper).

*Example:*

* *Input:* "Analyze the last 10 posts from @openai and summarize AI trends."
* *Output:* Trend analysis based on post content.
### Step 2: Configure input and output[​](#step-2-configure-input-and-output "Direct link to Step 2: Configure input and output")

Define the input format (URL, JSON configuration, or text query) and output format (text response or structured data) for your agent.

*Example input:*

* User query: "Analyze @openai posts for AI trends"
* OpenAI model selection (e.g., `gpt-4`)

*Example output:*

* Text response with insights
* Data stored in Apify [Dataset](https://docs.apify.com/platform/storage/dataset.md)

> **Agent memory**: Agents can include memory for storing information between conversations. Single-task agents typically do not require memory.

### Step 3: Set up the development environment[​](#step-3-set-up-the-development-environment "Direct link to Step 3: Set up the development environment")

Install the Apify CLI, which allows you to create, run, and deploy Actors from your local machine.

```
npm install -g @apify/cli
```

Create a new Actor project from the CrewAI template and navigate into the new directory.

```
apify create agent-actor -t python-crewai
cd agent-actor
```

### Step 4: Understand the project structure[​](#step-4-understand-the-project-structure "Direct link to Step 4: Understand the project structure")

The template includes:

* `.actor/` – Actor configuration files.
  * `actor.json` – The Actor's definition.
  * `input_schema.json` – Defines the UI for the Actor's input.
  * `dataset_schema.json` – Defines the structure of the output data.
  * `pay_per_event.json` – Configuration for monetization.
* `src/` – Source code
  * `main.py` – The main script for Actor execution, agent, and task definition.
  * `tools.py` – Implementations of the tools the agent can use.
  * `models.py` – Pydantic models for structured tool output.
  * `ppe_utils.py` – Helper functions for pay-per-event monetization.

### Step 5: Define input and output schemas[​](#step-5-define-input-and-output-schemas "Direct link to Step 5: Define input and output schemas")

Update `.actor/input_schema.json` to define the Actor's inputs. This schema generates a user interface for running the Actor on the Apify platform.

```
{
    "title": "Instagram Analysis Agent Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "query": {
            "title": "Query",
            "type": "string",
            "description": "Task for the agent to perform",
            "example": "Analyze @openai posts for AI trends"
        },
        "modelName": {
            "title": "Model Name",
            "type": "string",
            "description": "OpenAI model to use",
            "default": "gpt-4"
        }
    },
    "required": ["query"]
}
```

Define the dataset schema in `.actor/dataset_schema.json`. This helps structure the data pushed to the dataset.

```
{
    "title": "Instagram Analysis Output",
    "type": "object",
    "properties": {
        "query": {
            "title": "Query",
            "type": "string"
        },
        "response": {
            "title": "Response",
            "type": "string"
        }
    }
}
```

### Step 6: Configure tools[​](#step-6-configure-tools "Direct link to Step 6: Configure tools")

The Instagram post scraper tool is implemented using the [Instagram Scraper Actor](https://apify.com/apify/instagram-scraper).
The tool returns structured output as Pydantic models defined in `src/models.py`:

```
class InstagramPost(BaseModel):
    id: str
    url: str
    caption: str
    timestamp: datetime
    likes_count: int
    comments_count: int
```

The tool is defined in `src/tools.py` and includes:

* Tool description and argument schema for the agent
* Integration with Instagram Scraper Actor
* Data retrieval and formatting

### Step 7: Implement the agent[​](#step-7-implement-the-agent "Direct link to Step 7: Implement the agent")

The agent implementation in `src/main.py` includes:

1. Handle Actor input: Read the user's query and any other parameters from the Actor input.

```
async def main():
    async with Actor:
        actor_input = await Actor.get_input()
        query = actor_input.get("query")
        model_name = actor_input.get("modelName", "gpt-4")
```

2. Define the agent: Instantiate the agent, giving it a role, a goal, and access to the tools you configured.

```
agent = Agent(
    role="Social Media Analyst",
    goal="Analyze Instagram posts and provide insights",
    backstory="Expert in social media analysis and trend identification",
    tools=[instagram_scraper_tool],
    llm=ChatOpenAI(model=model_name)
)
```

3. Create task and crew: Define the task for the agent to complete based on the user's query.

```
task = Task(
    description=query,
    agent=agent,
    expected_output="Detailed analysis with insights"
)

crew = Crew(
    agents=[agent],
    tasks=[task]
)
```

4. Execute and save results: Kick off the crew to run the task and save the final result to the Actor's default dataset.

```
result = crew.kickoff()

await Actor.push_data({
    "query": query,
    "response": str(result)
})
```

### Step 8: Test locally[​](#step-8-test-locally "Direct link to Step 8: Test locally")

Run the agent on your local machine using the Apify CLI. Ensure you have set any required environment variables (e.g., `OPENAI_API_KEY`).

```
apify run
```

### Step 9: Deploy to Apify[​](#step-9-deploy-to-apify "Direct link to Step 9: Deploy to Apify")

Push your Actor's code to the Apify platform.

```
apify push
```

After deployment:

1. Navigate to your Actor's settings.
2. Set `OPENAI_API_KEY` as a secret environment variable.
3. Rebuild the Actor version to apply the changes.

### Step 10: Test the deployed agent[​](#step-10-test-the-deployed-agent "Direct link to Step 10: Test the deployed agent")

Run the agent on the platform with a sample query and monitor the results in the output dataset.

```
Analyze the posts of @openai and @googledeepmind and summarize the current trends in AI.
```

**Troubleshooting**

Common issues and solutions:

* *Agent fails to call tools:* Check that the tool descriptions in `src/tools.py` are clear and the argument schemas are correct.
* *Instagram scraper fails:* Verify that the Instagram usernames exist and are public. Check the scraper Actor's run logs for specific errors.
* *Missing API key:* Ensure `OPENAI_API_KEY` is set as a secret environment variable in your Actor's Settings.

## Monetizing your AI agent[​](#monetizing-your-ai-agent "Direct link to Monetizing your AI agent")

Apify's pay-per-event (PPE) pricing model allows charging users based on specific triggered events through the API or SDKs.

### Step 1: Define chargeable events[​](#step-1-define-chargeable-events "Direct link to Step 1: Define chargeable events")

You can configure charges for events like the Actor starting, a task completing successfully, or custom events such as specific API calls.
Example event definition: ``` { "eventName": "task-completed", "description": "Charge for completed analysis task", "price": 0.10 } ``` ### Step 2: Implement charging in code[​](#step-2-implement-charging-in-code "Direct link to Step 2: Implement charging in code") Add charging logic to your code: ``` await Actor.charge({ "eventName": "task-completed", "amount": 1 }) ``` ### Step 3: Configure PPE settings[​](#step-3-configure-ppe-settings "Direct link to Step 3: Configure PPE settings") 1. Enable pay-per-event monetization in Actor settings. 2. Define events from `pay_per_event.json`. 3. Set pricing for each event. ### Step 4: Publish the agent[​](#step-4-publish-the-agent "Direct link to Step 4: Publish the agent") Before making your agent public on [Apify Store](https://apify.com/store), complete the following checklist: * Update README with usage instructions. * Validate `input_schema.json` and `dataset_schema.json`. * Verify `OPENAI_API_KEY` environment variable is handled correctly. * Check monetization settings on the Actor publication page. * Test the Actor thoroughly. * Set your Actor's visibility to public. ## Next steps[​](#next-steps "Direct link to Next steps") To continue developing AI agents: 1. *Use the CrewAI template:* Start with `apify create agent-actor -t python-crewai` 2. *Explore other templates:* Visit the Apify templates page for alternatives 3. *Review existing agents:* Check the AI agents collection on Apify Store 4. *Publish and monetize:* Deploy with `apify push` and enable monetization --- # Anti-scraping protections **Understand the various anti-scraping measures different sites use to prevent bots from accessing them, and how to appear more human to fix these issues.** *** If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions. This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more. In development, it is crucial to check and adjust the configurations related to our next lessons' topics, as doing this can fix blocking issues on the majority of websites. ## Quick start[​](#quick-start "Direct link to Quick start") If you don't have time to read about the theory behind anti-scraping protections to fine-tune your scraping project and instead you need to get unblocked ASAP, here are some quick tips: * Use high-quality proxies. [Residential proxies](https://docs.apify.com/platform/proxy/residential-proxy.md) are the least blocked. You can find many providers out there like Apify, BrightData, Oxylabs, NetNut, etc. * Set **real-user-like HTTP settings** and **browser fingerprints**. [Crawlee](https://crawlee.dev/) uses statistically generated realistic HTTP headers and browser fingerprints by default for all of its crawlers. * Use a browser to pass bot capture challenges. We recommend [Playwright with Firefox](https://crawlee.dev/docs/examples/playwright-crawler-firefox) because it is not that common for scraping. 
You can also play with [non-headless mode](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#headless) and adjust other [fingerprint settings](https://crawlee.dev/api/browser-pool/interface/FingerprintGeneratorOptions). * Consider extracting data from **[private APIs](https://docs.apify.com/academy/api-scraping.md)** or **mobile app APIs**. They are usually much less protected. * Increase the number of request retries significantly to at least 10 with [`maxRequestRetries: 10`](https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#maxRequestRetries). Rotate sessions after every error with [`maxErrorScore: 1`](https://crawlee.dev/api/core/interface/SessionOptions#maxErrorScore) * If you cannot afford to use browsers for performance reasons, you can try [Playwright.request](https://playwright.dev/docs/api/class-playwright#playwright-request) or [curl-impersonate](https://www.npmjs.com/package/node-libcurl) as the HTTP library for [Cheerio](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) or [Basic](https://crawlee.dev/api/basic-crawler/class/BasicCrawler) Crawlers, instead of its default [got-scraping](https://crawlee.dev/docs/guides/got-scraping) HTTP back end. These libraries have access to native code which offers much finer control over the HTTP traffic and mimics real browsers more than what can be achieved with plain Node.js implementation like `got-scraping`. These libraries should become part of Crawlee itself in the future. In the vast majority of cases, this configuration should lead to success. Success doesn't mean that all requests will go through unblocked, that is not realistic. Some IP addresses and fingerprint combinations will still be blocked but the automatic retry system takes care of that. If you can get at least 10% of your requests through, you can still scrape the whole website with enough retries. The default [SessionPool](https://crawlee.dev/api/core/class/SessionPool) configuration will preserve the working sessions and eventually the success rate will increase. If the above tips didn't help, you can try to fiddle with the following: * Try different browsers. Crawlee & Playwright support Chromium, Firefox and WebKit out of the box. You can also try the [Brave browser](https://brave.com) which [can be configured for Playwright](https://blog.apify.com/unlocking-the-potential-of-brave-and-playwright-for-browser-automation/). * Don't use browsers at all. Sometimes the anti-scraping protections are extremely sensitive to browser behavior but will allow plain HTTP requests (with the right headers) just fine. Don't forget to match the specific [HTTP headers](https://docs.apify.com/academy/concepts/http-headers.md) for each request. * Decrease concurrency. Slower scraping means you can blend in better with the rest of the traffic. * Add human-like behavior. Don't traverse the website like a bot (paginating quickly from 1 to 100). Instead, visit various types of pages, add time randomizations and you can even introduce some mouse movements and clicks. * Try Puppeteer with the [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth) plugin. Generally, Crawlee's default configuration should have stronger bypassing but some features might land first in the stealth plugin. * Find different sources of the data. The data might be rendered to the HTML but you could also find it in JavaScript (inlined in the HTML or in files) or in the API responses. 
Especially the APIs are often much less protected (if you use the right headers).
* Reverse engineer the JavaScript challenges that run on the page so you can figure out how to bypass them. This is a very advanced topic that you can read about online. We plan to introduce more content about this.

Keep in mind that there is no silver bullet solution. There are many anti-scraping systems, and each of them behaves differently depending on the website's configuration. That is why "trying a few things" usually leads to success. You will find more details about these tricks in the [mitigation](https://docs.apify.com/academy/anti-scraping/mitigation.md) section below.

## First of all, why do websites want to block bots?[​](#why-block-bots "Direct link to First of all, why do websites want to block bots?")

What's up with that?! A website might have a variety of reasons to block bots from accessing it. Here are a few of the main ones:

* To prevent malicious bots from crawling the site to steal sensitive data like passwords or personal data about users.
* To avoid server performance hits caused by bots making a large number of requests to the website at the same time.
* To prevent their competitors from gaining market insights about their business.
* To prevent bots from scraping their content and selling it to other websites or re-publishing it.
* To avoid skewing their analytics data with bot traffic.
* If it is a social media website, they might be attempting to keep away bots programmed to mass-create fake profiles (which are usually sold later).

> We recommend checking out [this article about legal and ethical ramifications of web scraping](https://blog.apify.com/is-web-scraping-legal/).

Unfortunately for these websites, they have to make compromises and tradeoffs. While super strong anti-bot protections will surely prevent the majority of bots from accessing their content, there is also a higher chance of regular users being flagged as bots and being blocked as well. Because of this, different sites have different scraping-difficulty levels based on the anti-scraping measures they take.

> Going into this topic, it's important to understand that there is no one silver bullet solution to bypassing protections against bots. Even if two websites are using Cloudflare (for example), one of them might be significantly more difficult to scrape due to harsher Cloudflare configurations. It is all about the configuration, not the anti-scraping tool itself.

## The principles of anti-scraping protections[​](#the-principles "Direct link to The principles of anti-scraping protections")

Anti-scraping protections can work on many different layers and use a large number of bot-identification techniques.

1. **Where you are coming from** - The IP address of the incoming traffic is always available to the website. Proxies are used to emulate different IP addresses, but their quality matters a lot.
2. **How you look** - With each request, the website can analyze its HTTP headers, TLS version, ciphers, and other information. Moreover, if you use a browser, the website can also analyze the whole browser fingerprint and run challenges to classify your hardware (like graphics hardware acceleration).
3. **What you are scraping** - The same data can be extracted in many ways from a website. You can get the initial HTML, you can use a browser to render the full page, or you can reverse engineer internal APIs. Each of those endpoints can be protected differently.
4. **How you behave** - The website can see patterns in how you order your requests, how fast you scrape, etc. It can also analyze browser behavior like mouse movement, clicks or key presses.

These are the 4 main principles that anti-scraping protections are based on. Not all websites use all of these principles, but they encompass the possibilities websites have to track and block bots. All techniques that help you mitigate anti-scraping protections are based on making yourself blend in with the crowd of regular users on each of these principles.

A bot can usually be detected in one of two ways, which follow two different types of web scraping:

1. Crawlers using **HTTP requests**
2. Crawlers using **browser automation** (usually with a headless browser)

Once a bot is detected, there are some countermeasures a website takes to prevent it from re-accessing it. The protection techniques are divided into two main categories:

1. Techniques that use only the **information provided within the HTTP request**, such as headers, IP addresses, TLS versions, ciphers, etc.
2. Techniques that use **JavaScript evaluation to collect a browser fingerprint**, or even track the user behavior on the website. These JavaScript evaluations can also track mouse movement or keys pressed. Based on the information gathered, they can decide if the user is a bot or a human. This method is often paired with the first one.

Once one of these methods detects that the user is a bot, it will take countermeasures depending on how advanced its techniques are. A common workflow of a website after it has detected a bot goes as follows:

1. The bot is added to the "greylist" (a list of suspicious IP addresses, fingerprints or any other value that can be used to uniquely identify the bot).
2. A [Turing test](https://en.wikipedia.org/wiki/Turing_test) is provided to the bot, typically a **captcha**. If the bot succeeds, it is added to the whitelist.
3. If the captcha is failed, the bot is added to the blacklist.

One thing to keep in mind while navigating through this course is that advanced anti-scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but also through more complex signals such as header combinations.

Watch a conference talk by [Ondra Urban](https://github.com/mnmkng), which provides an overview of various anti-scraping measures and tactics for circumventing them.

[YouTube video player](https://www.youtube-nocookie.com/embed/aXil0K-M-Vs)

> **Several years old?** Although the talk, given in 2021, features some outdated code examples, it still serves well as a general overview.

## Common anti-scraping measures[​](#common-measures "Direct link to Common anti-scraping measures")

Because we here at Apify scrape for a living, we have discovered many popular and niche anti-scraping techniques. We've compiled them into a short and comprehensible list here to help you understand the roadblocks before this course teaches you how to get around them.

> Not all issues you encounter are caused by anti-scraping systems. Sometimes, it's a configuration issue. Learn [how to effectively debug your programs here](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors.md).

### IP rate-limiting[​](#ip-rate-limiting "Direct link to IP rate-limiting")

This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow more than a defined number of requests from one IP address in a certain time span. If that limit is low, there is a high potential for false positives, because an IP address is often shared by many users - for example, in large companies hundreds of employees can share the same IP address.

> Learn more about rate limiting [here](https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md)
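One common way to stay under such limits (sketched below with made-up numbers - tune them to the target site) is to cap the crawler's concurrency and request rate, for example with Crawlee's built-in options:

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Both values are illustrative; pick them based on the site's actual limits.
    maxConcurrency: 5,
    maxRequestsPerMinute: 120,
    async requestHandler({ request, $ }) {
        // Scraping logic here
    },
});
```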
### Header checking[​](#header-checking "Direct link to Header checking")

This type of bot identification is based on the fact that humans access web pages through browsers, which send specific sets of [headers](https://docs.apify.com/academy/concepts/http-headers.md) along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running.

Though `User-Agent` is the most commonly used header for the **header checking** method, other headers are sometimes used as well. The evaluation is often also based on header consistency, i.e. whether the request contains a known combination of browser headers.

### URL analysis[​](#url-analysis "Direct link to URL analysis")

This method is based solely on the way bots operate. It compares visits to data-rich pages with visits to other pages. The ratio of data-rich to regular page visits has to be high to identify the bot and successfully reduce false positives.

### Regular structure changes[​](#regular-structure-changes "Direct link to Regular structure changes")

By definition, this is not an anti-scraping method, but it can heavily affect the reliability of a scraper. If your target website drastically changes its CSS selectors, and your scraper is heavily reliant on selectors, it could break. In principle, websites using this method change their HTML structure or CSS selectors randomly and frequently, making the parsing of the data harder and requiring more maintenance of the bot.

One of the best ways of avoiding the possible breaking of your scraper due to website structure changes is to limit your reliance on data from HTML elements as much as possible (see [API Scraping](https://docs.apify.com/academy/api-scraping.md) and [JavaScript objects within HTML](https://docs.apify.com/academy/node-js/js-in-html.md)).

### IP session consistency[​](#ip-session-consistency "Direct link to IP session consistency")

This technique is commonly used to block the bot from accessing the website altogether. It works on the principle that every entity that accesses the site gets a token. This token is then saved together with the IP address and HTTP request information such as User-Agent and other specific headers. If the entity makes another request, but without the session token, the IP address is added to the greylist.

### Interval analysis[​](#interval-analysis "Direct link to Interval analysis")

This technique is based on analyzing the time intervals between visits to a website. If the times are very similar, the entity is added to the greylist. The premise of this method is that the bot runs at regular intervals, for example started by a CRON job every Monday. It is a long-term strategy, so it should be used as an extension. This technique needs only the information from the HTTP request to identify the frequency of the visits.
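A simple countermeasure, sketched roughly below, is to avoid perfectly regular timing by adding random jitter before a scheduled run and between individual requests:

```
// A rough sketch: randomize timing so runs and requests don't happen at perfectly regular intervals.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wait up to 30 extra minutes before starting a scheduled run.
await sleep(Math.floor(Math.random() * 30 * 60 * 1000));

// Between individual requests, wait a randomized 1-5 seconds instead of a fixed delay.
await sleep(1000 + Math.floor(Math.random() * 4000));
```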
### Browser fingerprinting[​](#browser-fingerprinting "Direct link to Browser fingerprinting")

One of the most successful and advanced methods is collecting the browser's "fingerprint", which is a fancy name for information such as fonts, audio codecs, canvas fingerprint, graphics card, and more. Browser fingerprints are highly unique, so they are a reliable means of identifying a specific user (or bot). If the fingerprint provides different or inconsistent information, the user is added to the greylist.

> It's important to note that this method also blocks all users that cannot evaluate JavaScript (such as bots sending only static HTTP requests), and combines both of the fundamental methods mentioned earlier.

### Honeypots[​](#honeypots "Direct link to Honeypots")

The honeypot approach is based on providing links that only bots can see. A typical example is hidden pagination. Usually, the bot needs to go through all the pages in the pagination, so the website adds a fake last page with a link that is hidden from regular users but uses the same selector as the real pagination links. Once the bot visits the link, it is automatically blacklisted. This method needs only the HTTP information.

## First up[​](#first "Direct link to First up")

In our [first section](https://docs.apify.com/academy/anti-scraping/techniques.md), we'll discuss the various anti-scraping methods and techniques websites use in more depth, as well as how to mitigate these protections.

---

# Anti-scraping mitigation

**After learning about the various different anti-scraping techniques websites use, learn how to mitigate them with a few different techniques.**

***

In the [techniques](https://docs.apify.com/academy/anti-scraping/techniques.md) section of this course, you learned about multiple methods websites use to prevent bots from accessing their content. This **Mitigation** section will be all about how to circumvent these protections using various different techniques.

## Next up[​](#next "Direct link to Next up")

In the [first lesson](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) of this section, you'll be learning about what proxies are and how to use them in your own crawler.

---

# Bypassing Cloudflare browser check

**Learn how to bypass Cloudflare browser challenge with Crawlee.**

***

If you find yourself stuck, there are a few strategies that you can employ. One key strategy is to ensure that your browser fingerprint is consistent. In some cases, the default browser fingerprint may actually be more effective than an inconsistently generated fingerprint. Additionally, it may be beneficial to avoid masking a Linux browser to look like a Windows or macOS browser, although this will depend on the specific configuration of the website you are targeting.

For those using Crawlee, the library provides out-of-the-box support for generating consistent fingerprints that are able to pass the Cloudflare challenge. However, it's important to note that in some cases, the Cloudflare challenge screen may return a 403 status code even if it is evaluating the fingerprint and the request is not blocked. This can cause the default Crawlee browser crawlers to throw an error and not wait until the challenge is submitted and the page is redirected to the target webpage. To address this issue, it is necessary to alter the crawler configuration.
For example, you might use the following code to remove default blocked status code handling from the crawler: ``` const crawler = new PlaywrightCrawler({ ...otherOptions, sessionPoolOptions: { blockedStatusCodes: [], }, }); ``` It's important to note that by removing default blocked status code handling, you should also add custom session retire logic on blocked pages to reduce retries. Additionally, you should add waiting logic to start the automation logic only after the Cloudflare challenge is solved and the page is redirected. This can be accomplished by waiting for a common selector that is available on all pages, such as a header logo. In some cases, the browser may not pass the check and you may be presented with a captcha, indicating that your IP address has been graylisted. If you are working with a large pool of proxies you can retire the session and use another IP. However, if you have a small pool of proxies you might want to whitelist the IP. To do this, you'll need to solve the captcha to improve your IP address's reputation. You can find various captcha-solving services, such as [AntiCaptcha](https://anti-captcha.com/), that you can use for this purpose. For more info check the section about [Captchas](https://docs.apify.com/academy/anti-scraping/techniques/captchas.md). ![Cloudflare captcha](https://images.ctfassets.net/slt3lc6tev37/6sN2VXiUaJpjxqVfTbZEJd/9a4e13cbf08ce29797167c133c534e1f/image1.png) In summary, while Cloudflare's browser challenge is designed to protect websites from automated scraping, it can be bypassed by ensuring a consistent browser fingerprint and customizing your scraping strategy. Crawlee offers out-of-the-box support for generating consistent fingerprints, but you may need to adjust your crawler configuration to handle Cloudflare's response. By following these tips, you can successfully navigate Cloudflare's browser challenge and continue scraping the data you need. --- # Generating fingerprints **Learn how to use two super handy npm libraries to generate fingerprints and inject them into a Playwright or Puppeteer page.** *** In [**Crawlee**](https://crawlee.dev), you can use [**FingerprintOptions**](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions) on a crawler to automatically generate fingerprints. ``` import { PlaywrightCrawler } from 'crawlee'; const crawler = new PlaywrightCrawler({ browserPoolOptions: { fingerprintOptions: { fingerprintGeneratorOptions: { browsers: [{ name: 'firefox', minVersion: 80 }], devices: ['desktop'], operatingSystems: ['windows'], }, }, }, }); ``` > Note that Crawlee will automatically generate fingerprints for you with no configuration necessary, but the option to configure them yourself is still there within **browserPoolOptions**. ## Using the fingerprint-generator package[​](#using-fingerprint-generator "Direct link to Using the fingerprint-generator package") Crawlee uses the [Fingerprint generator](https://github.com/apify/fingerprint-suite) npm package to do its fingerprint generating magic. For maximum control outside of Crawlee, you can install it on its own. With this package, you can generate browser fingerprints. > It is crucial to generate fingerprints for the specific browser and operating system being used to trick the protections successfully. For example, if you are trying to overcome protection locally with Firefox on a macOS system, you should generate fingerprints for Firefox and macOS to achieve the best results. 
```
import { FingerprintGenerator } from 'fingerprint-generator';

// Instantiate the fingerprint generator with
// configuration options
const fingerprintGenerator = new FingerprintGenerator({
    browsers: [
        { name: 'firefox', minVersion: 80 },
    ],
    devices: [
        'desktop',
    ],
    operatingSystems: [
        'windows',
    ],
});

// Grab a fingerprint from the fingerprint generator
const generated = fingerprintGenerator.getFingerprint({
    locales: ['en-US', 'en'],
});
```

## Injecting fingerprints[​](#injecting-fingerprints "Direct link to Injecting fingerprints")

Once you've manually generated a fingerprint using the **Fingerprint generator** package, it can be injected into the browser using [**fingerprint-injector**](https://github.com/apify/fingerprint-injector). This tool allows you to inject fingerprints into browsers automated by Playwright or Puppeteer:

```
import FingerprintGenerator from 'fingerprint-generator';
import { FingerprintInjector } from 'fingerprint-injector';
import { chromium } from 'playwright';

// Instantiate a fingerprint injector
const fingerprintInjector = new FingerprintInjector();

// Launch a browser in Playwright
const browser = await chromium.launch();

// Instantiate the fingerprint generator with
// configuration options
const fingerprintGenerator = new FingerprintGenerator({
    browsers: [
        { name: 'firefox', minVersion: 80 },
    ],
    devices: [
        'desktop',
    ],
    operatingSystems: [
        'windows',
    ],
});

// Grab a fingerprint
const generated = fingerprintGenerator.getFingerprint({
    locales: ['en-US', 'en'],
});

// Create a new browser context, plugging in
// some values from the fingerprint
const context = await browser.newContext({
    userAgent: generated.fingerprint.userAgent,
    locale: generated.fingerprint.navigator.language,
});

// Attach the fingerprint to the newly created
// browser context
await fingerprintInjector.attachFingerprintToPlaywright(context, generated);

// Create a new page and go to Google
const page = await context.newPage();
await page.goto('https://google.com');
```

> Note that [Crawlee](https://crawlee.dev) automatically applies a wide variety of fingerprints by default, so it is not required to do this unless you aren't using Crawlee or you need a super specific custom fingerprint to scrape with.

## Generating headers[​](#generating-headers "Direct link to Generating headers")

Headers are also used by websites to fingerprint users (or bots), so it might sometimes be necessary to generate some user-like headers to mitigate anti-scraping protections. Similarly to fingerprints, **Crawlee** automatically generates headers for you, but you can have full control by using the [**browser-headers-generator**](https://github.com/apify/browser-headers-generator) package.

```
import BrowserHeadersGenerator from 'browser-headers-generator';

const browserHeadersGenerator = new BrowserHeadersGenerator({
    operatingSystems: ['windows'],
    browsers: ['chrome'],
});

await browserHeadersGenerator.initialize();

const randomBrowserHeaders = await browserHeadersGenerator.getRandomizedHeaders();
```

## Wrap up[​](#wrap-up "Direct link to Wrap up")

That's it for the **Mitigation** course for now, but be on the lookout for future lessons! We release lessons as we write them, and will be updating the Academy frequently, so be sure to check back every once in a while for new content!
--- # Proxies **Learn all about proxies, how they work, and how they can be leveraged in a scraper to avoid blocking and other anti-scraping tactics.** *** A proxy server provides a gateway between users and the internet, to be more specific in our case - between the crawler and the target website. Many websites have [rate-limiting](https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md) set up, which is when a website **limits** the **rate** at which requests can be sent from a single IP address. In cases when a higher number of requests is expected for the crawler - using a proxy is essential to let the crawler run as smoothly as possible and avoid being blocked. The following factors determine the quality of a proxy IP: * How many users share the same proxy IP address? * How did the previous user use (or overuse) the proxy? * How long was the proxy left to "heal" before it was resold? * What is the quality of the underlying server of the proxy? (latency) Although IP quality is still the most important factor when it comes to using proxies and avoiding anti-scraping measures, nowadays it's not just about avoiding rate-limiting, which brings new challenges for scrapers that can no longer rely on IP rotation. Anti-scraping software providers, such as CloudFlare, have global databases of "suspicious" IP addresses. If you are unlucky, your newly bought IP might be blocked even before you use it. If the previous owners overused it, it might have already been marked as suspicious in many databases, or even (very likely) was blocked altogether. If you care about the quality of your IPs, use them as a real user, and any website will have a hard time banning them completely. Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the [previous lesson](https://docs.apify.com/academy/anti-scraping.md). ## About proxy links[​](#understanding-proxy-links "Direct link to About proxy links") To use a proxy, you need a proxy link, which contains the connection details, sometimes including credentials. ``` http://proxy.example.com:8080 ``` The proxy link above has several parts: * `http://` tells us we're using HTTP protocol, * `proxy.example.com` is a hostname, i.e. an address to the proxy server, * `8080` is a port number. Sometimes the proxy server has no name, so the link contains an IP address instead: ``` http://123.456.789.10:8080 ``` If proxy requires authentication, the proxy link can contain username and password: ``` http://USERNAME:PASSWORD@proxy.example.com:8080 ``` ## Proxy rotation[​](#proxy-rotation "Direct link to Proxy rotation") Web scrapers can implement a method called "proxy rotation" to **rotate** the IP addresses they use to access websites. Each request can be assigned a different IP address, which makes it appear as if they are all coming from different users in different location. This greatly enhances performance, and is a major factor when it comes to making a web scraper appear more human. ## Next up[​](#next "Direct link to Next up") Proxies are one of the most important things to understand when it comes to mitigating anti-scraping techniques in a scraper. Now that you're familiar with what they are, the next lesson will be teaching you how to configure your crawler in Crawlee to use and automatically rotate proxies. 
[Let's get right into it!](https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md)

---

# Using proxies

**Learn how to use and automagically rotate proxies in your scrapers by using Crawlee, and a bit about how to obtain pools of proxies.**

***

In the [**Web scraping basics for JavaScript devs**](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md) course, we learned about the power of Crawlee, and how it can streamline the development process of web crawlers. You've already seen how powerful the `crawlee` package is; however, what you've been exposed to thus far is only the tip of the iceberg.

Because proxies are so widely used in the scraping world, Crawlee has built-in features for implementing them in an effective way. One of the main functionalities that comes baked into Crawlee is proxy rotation, which is when each request is sent through a different proxy from a proxy pool.

## Implementing proxies in a scraper[​](#implementing-proxies "Direct link to Implementing proxies in a scraper")

Let's borrow some scraper code from the end of the [pro-scraping](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md) lesson in our **Web scraping basics for JavaScript devs** course and paste it into a new file called **proxies.js**. This code enqueues all of the product links on [demo-webstore.apify.org](https://demo-webstore.apify.org)'s on-sale page, then makes a request to each product page and scrapes data about each one:

```
// proxies.js
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, enqueueLinks }) => {
        if (request.label === 'START') {
            await enqueueLinks({
                selector: 'a[href*="/product/"]',
            });

            // When on the START page, we don't want to
            // extract any data after we extract the links.
            return;
        }

        // We copied and pasted the extraction code
        // from the previous lesson
        const title = $('h3').text().trim();
        const price = $('h3 + div').text().trim();
        const description = $('div[class*="Text_body"]').text().trim();

        // Instead of saving the data to a variable,
        // we immediately save everything to a file.
        await Dataset.pushData({
            title,
            description,
            price,
        });
    },
});

await crawler.addRequests([{
    url: 'https://demo-webstore.apify.org/search/on-sale',
    // By labeling the Request, we can identify it
    // later in the requestHandler.
    label: 'START',
}]);

await crawler.run();
```

In order to implement a proxy pool, we will first need some proxies. We'll quickly use the free [proxy scraper](https://apify.com/mstephen190/proxy-scraper) on the Apify platform to get our hands on some quality proxies. Next, we'll need to set up a [`ProxyConfiguration`](https://crawlee.dev/api/core/class/ProxyConfiguration) and configure it with our custom proxies, like so:

```
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://45.42.177.37:3128', 'http://43.128.166.24:59394', 'http://51.79.49.178:3128'],
});
```

Awesome, so there's our proxy pool! Usually, a proxy pool is much larger than this; however, a pool of three proxies is totally fine for tutorial purposes.
Finally, we can pass the `proxyConfiguration` into our crawler's options: ``` const crawler = new CheerioCrawler({ proxyConfiguration, requestHandler: async ({ $, request, enqueueLinks }) => { if (request.label === 'START') { await enqueueLinks({ selector: 'a[href*="/product/"]', }); return; } const title = $('h3').text().trim(); const price = $('h3 + div').text().trim(); const description = $('div[class*="Text_body"]').text().trim(); await Dataset.pushData({ title, description, price, }); }, }); ``` > Note that if you run this code, it may not work, as the proxies could potentially be down or no longer operating at the time you are going through this course. That's it! The crawler will now automatically rotate through the proxies we provided in the `proxyUrls` option. ## A bit about debugging proxies[​](#debugging-proxies "Direct link to A bit about debugging proxies") At the time of writing, the scraper above utilizing our custom proxy pool is working just fine. But how can we check that the scraper is actually using the proxies we provided, and more importantly, how can we debug proxies within our scraper? Luckily, within the same `context` object we've been destructuring `$` and `request` out of, there is a `proxyInfo` key as well. `proxyInfo` is an object which includes useful data about the proxy which was used to make the request. ``` const crawler = new CheerioCrawler({ proxyConfiguration, // Destructure "proxyInfo" from the "context" object requestHandler: async ({ $, request, proxyInfo }) => { // Log its value console.log(proxyInfo); // ... // ... }, }); ``` After modifying your code to log `proxyInfo` to the console and running the scraper, you're going to see some logs which look like this: ![proxyInfo being logged by the scraper](/assets/images/proxy-info-logs-edb7e733aab82acb15258e9d44ba8a64.png) These logs confirm that our proxies are being used and rotated successfully by Crawlee, and can also be used to debug slow or broken proxies. ## Higher level proxy scraping[​](#higher-level-proxy-scraping "Direct link to Higher level proxy scraping") Though we will discuss it more in-depth in future courses, it is still important to mention that Crawlee has integrated support for the Apify SDK, which supports [Apify Proxy](https://apify.com/proxy) - a service that provides access to pools of both residential and datacenter IP addresses. A `proxyConfiguration` using Apify Proxy might look something like this: ``` import { Actor } from 'apify'; const proxyConfiguration = await Actor.createProxyConfiguration({ countryCode: 'US', }); ``` Notice that we didn't provide it a list of proxy URLs. This is because Apify Proxy itself serves as our proxy pool; we only narrow it down with options such as `countryCode`. ## Next up[​](#next "Direct link to Next up") [Next up](https://docs.apify.com/academy/anti-scraping/mitigation/generating-fingerprints.md), we'll be checking out how to use two npm packages to generate and inject [browser fingerprints](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md). --- # Anti-scraping techniques **Understand the various common (and obscure) anti-scraping techniques used by websites to prevent bots from accessing their content.** *** In this section, we'll be discussing some of the most common (as well as some obscure) anti-scraping techniques used by websites to detect and block/limit bots from accessing their content.
When a scraper is detected, a website can respond in a variety of ways: ## "Access denied" page[​](#access-denied "Direct link to \"Access denied\" page") This is a complete block which usually has a response status code of **403**. Usually, you'll hit an **Access denied** page if you have a bad IP address or the website is restricted in the country of the IP address. > For a better understanding of what all the HTTP status codes mean, we recommend checking out [HTTP Cat](https://http.cat/) which provides a highly professional description for each status code. ## Captcha page[​](#captcha "Direct link to Captcha page") Probably the most common blocking method. The website gives you a chance to prove that you are not a bot by presenting you with a captcha. We'll be covering captchas within this course. ## Redirect[​](#redirect "Direct link to Redirect") Another common method is redirecting to the home page of the site (or a different location). ## Request timeout/Socket hangup[​](#request-timeout "Direct link to Request timeout/Socket hangup") This is the cheapest defense mechanism, where the website won't even respond to the request. Dealing with timeouts in a scraper can be challenging, because you have to differentiate them from regular network problems. ## Custom status code or message[​](#custom-status-code "Direct link to Custom status code or message") Similar to getting an **Access denied** page, but some sites send along specific status codes (e.g. **503**) and messages explaining what was wrong with the request. ## Empty results[​](#empty-results "Direct link to Empty results") The website responds "normally," but pretends not to find any results. This requires manual testing to recognize the pattern. ## Fake results[​](#fake-results "Direct link to Fake results") The website responds with data, but the data is totally fake, which is very difficult to recognize and requires extensive manual testing. Luckily, this type of response is not all that common. ## Next up[​](#next "Direct link to Next up") In the [first lesson](https://docs.apify.com/academy/anti-scraping/techniques/rate-limiting.md) of this course, you'll be learning about **rate limiting**, which is a technique used to prevent a large number of requests from being sent from one user. --- # Browser challenges > Learn how to navigate browser challenges like Cloudflare's to effectively scrape data from protected websites. ## Browser challenges[​](#browser-challenges "Direct link to Browser challenges") Browser challenges are a type of security measure that relies on browser fingerprints. These challenges typically involve a JavaScript program that collects both static and dynamic browser fingerprints. Static fingerprints include attributes such as the User-Agent, the video card, and the number of CPU cores available. Dynamic fingerprints, on the other hand, might involve rendering fonts or objects in the canvas (known as a [canvas fingerprint](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md#with-canvases)), or playing audio in the [AudioContext](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md#from-audiocontext). We covered the details in the previous [fingerprinting](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md) lesson. While some browser challenges are relatively straightforward - for example, loading an image and checking if it renders correctly - others can be much more complex.
One well-known example of a complex browser challenge is Cloudflare's browser screen check. In this challenge, Cloudflare visually inspects the browser screen and blocks the first request if any inconsistencies are found. This approach provides an extra layer of protection against automated attacks. Many online protections incorporate browser challenges into their security measures, but the specific techniques used can vary. ## Cloudflare browser challenge[​](#cloudflare-browser-challenge "Direct link to Cloudflare browser challenge") One of the most well-known browser challenges is the one used by Cloudflare. Cloudflare has a massive dataset of legitimate canvas fingerprints and User-Agent pairs, which they use in conjunction with machine learning algorithms to detect any device property spoofing. This might include spoofed User-Agent headers, operating systems, or GPUs. ![Cloudflare browser check](https://images.ctfassets.net/slt3lc6tev37/55EYMR81XJCIG5uxLjQQOx/252a98adf90fa0ff2f70437cc5c0a3af/under-attack-mode_enabled.gif) When you encounter a Cloudflare browser challenge, the platform checks your canvas fingerprint against the expected value. If there is a mismatch, the request is blocked. However, if your canvas fingerprint matches the expected value, Cloudflare issues a cookie that allows you to continue scraping - even without the browser - until the cookie is invalidated. It's worth noting that Cloudflare's protection is highly customizable, and can be adjusted to be extremely strict or relatively loose. This makes it a powerful tool for website owners who want to protect against automated traffic, while still allowing legitimate traffic to flow through. If you want to learn how to bypass the Cloudflare challenge, visit the [Bypassing Cloudflare challenge](https://docs.apify.com/academy/anti-scraping/mitigation/cloudflare-challenge.md) article. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/anti-scraping/techniques/captchas.md), we'll be covering **captchas**, which were mentioned throughout this lesson. It's important to note that attempting to solve a captcha programmatically is the last resort - always try to avoid being presented with the captcha in the first place by using the techniques mentioned in this lesson. --- # Captchas **Learn about the reasons a bot might be presented a captcha, the best ways to avoid captchas in the first place, and how to programmatically solve them.** *** In general, a website will present a user (or scraper) a captcha for two main reasons: 1. The website always does captcha checks to access the desired content. 2. One of the website's anti-bot measures (or the [WAF](https://docs.apify.com/academy/anti-scraping/techniques/firewalls.md)) has flagged the user as suspicious. ## Dealing with captchas[​](#dealing-with-captchas "Direct link to Dealing with captchas") When you've hit a captcha, your first thought should not be how to programmatically solve it. Rather, you should consider the reasons why you received the captcha in the first place: your bot didn't appear enough like a real user to avoid being presented the challenge. Have you exhausted all of the possible options to make your scraper appear more human-like? Are you: * Using [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md)? * Making the request with the proper [headers](https://docs.apify.com/academy/concepts/http-headers.md) and [cookies](https://docs.apify.com/academy/concepts/http-cookies.md)?
* Generating and using a custom [browser fingerprint](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md)? * Trying different general scraping methods (HTTP scraping, browser scraping)? If you are using browser scraping, have you tried using a different browser? ## Solving captchas[​](#solving-captchas "Direct link to Solving captchas") If you've tried everything you can to avoid being presented the captcha and are still facing this roadblock, there are methods to programmatically solve captchas. Tons of different types of captchas exist, but one of the most popular is Google's [**reCAPTCHA**](https://www.google.com/recaptcha/about/). ![Google's reCAPTCHA](https://miro.medium.com/max/1400/1*4NhFKMxr-qXodjYpxtiE0w.gif) **reCAPTCHA**s can be solved using the [Anti Captcha Recaptcha](https://apify.com/petr_cermak/anti-captcha-recaptcha) Actor on the Apify platform (note that this method requires an account on [anti-captcha.com](https://anti-captcha.com)). Another popular captcha is the [Geetest slider captcha](https://www.geetest.com/en/adaptive-captcha-demo). You can find a guide for solving these types of captchas in Puppeteer [here](https://scraperbox.com/blog/solving-a-geetest-slider-captcha-with-puppeteer). Amazon's captcha can similarly also be solved programmatically. ## Wrap up[​](#wrap-up "Direct link to Wrap up") In this course, you've learned about some of the most common (and some of the most advanced) anti-scraping techniques. Keep in mind that as the web (and technology in general) evolves, this section of the **Anti scraping** course will evolve as well. In the [next section](https://docs.apify.com/academy/anti-scraping/mitigation.md), we'll be discussing how to mitigate the anti-scraping techniques you learned about in this section. --- # Fingerprinting **Understand browser fingerprinting, an advanced technique used by browsers to track user data and even block bots from accessing them.** *** Browser fingerprinting is a method that some websites use to collect information about a browser's type and version, as well as the operating system being used, any active plugins, the time zone and language of the machine, the screen resolution, and various other active settings. All of this information is called the **fingerprint** of the browser, and the act of collecting it is called **fingerprinting**. Yup! Surprisingly enough, browsers provide a lot of information about the user (and even their machine) that is accessible to websites! Browser fingerprinting wouldn't even be possible if it weren't for the sheer amount of information browsers provide, and the fact that each fingerprint is unique. Based on [research](https://www.eff.org/press/archives/2010/05/13) carried out by the Electronic Frontier Foundation, 84% of collected fingerprints are globally exclusive, and they found that the next 9% were in sets with a size of two. They also stated that even though fingerprints are dynamic, new ones can be matched up with old ones with 99.1% correctness. This makes fingerprinting a very viable option for websites that want to track the online behavior of their users in order to serve hyper-personalized advertisements to them. In some cases, it is also used to aid in preventing bots from accessing the websites (or certain sections of it). ## What makes up a fingerprint?[​](#what-makes-up-a-fingerprint "Direct link to What makes up a fingerprint?") To collect a good fingerprint, websites must collect them from various places. 
### From HTTP headers[​](#from-http-headers "Direct link to From HTTP headers") Several [HTTP headers](https://docs.apify.com/academy/concepts/http-headers.md) can be used to create a fingerprint about a user. Here are some of the main ones: 1. **User-Agent** provides information about the browser and its operating system (including its versions). 2. **Accept** tells the server what content types the browser can render and send, and **Content-Encoding** provides data about the content compression. 3. **Content-Language** and **Accept-Language** both indicate the user's (and browser's) preferred language. 4. **Referer** gives the server the address of the previous page from which the link was followed. A few other headers commonly used for fingerprinting can be seen below: ![Fingerprinted headers](/assets/images/fingerprinted-headers-ec689af0e137398a072e51fb876a7a33.png) ### From window properties[​](#from-window-properties "Direct link to From window properties") The `window` is defined as a global variable that is accessible from JavaScript running in the browser. It is home to a vast amount of functions, variables, and constructors, and most of the global configuration is stored there. Most of the attributes that are used for fingerprinting are stored under the `window.navigator` object, which holds methods and info about the user's state and identity starting with the **User-Agent** itself and ending with the device's battery status. All of these properties can be used to fingerprint a device; however, most fingerprinting solutions (such as [Valve](https://valve.github.io/fingerprintjs/)) only use the most crucial ones. Here is a list of some of the most crucial properties on the `window` object used for fingerprinting: | Property | Example | Description | | ------------------------------- | ------------------------------------------------------------------------ | ------------------------------------------------------------------------------------- | | `screen.width` | `1680` | Defines the width of the device screen. | | `screen.height` | `1050` | Defines the height of the device screen. | | `screen.availWidth` | `1680` | The portion of the screen width available to the browser window. | | `screen.availHeight` | `1050` | The portion of the screen height available to the browser window. | | `navigator.userAgent` | `'Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0'` | Same as the HTTP header. | | `navigator.platform` | `'MacIntel'` | The platform the browser is running on. | | `navigator.cookieEnabled` | `true` | Whether or not the browser accepts cookies. | | `navigator.doNotTrack` | `'1'` | Indicates the browser's Do Not Track settings. | | `navigator.buildID` | `20181001000000` | The build ID of the browser. | | `navigator.product` | `'Gecko'` | The layout engine used. | | `navigator.productSub` | `20030107` | The version of the layout engine used. | | `navigator.vendor` | `'Google Inc.'` | Vendor of the browser. | | `navigator.hardwareConcurrency` | `4` | The number of logical processors the user's computer has available to run threads on. | | `navigator.javaEnabled` | `false` | Whether or not the user has enabled Java. | | `navigator.deviceMemory` | `8` | Approximately the amount of user memory (in gigabytes). | | `navigator.language` | `'en-US'` | The user's primary language. | | `navigator.languages` | `['en-US', 'cs-CZ', 'es']` | Other user languages. 
| ### From function calls[​](#from-function-calls "Direct link to From function calls") Fingerprinting tools can also collect pieces of information that are retrieved by calling specific functions: ``` // Get the WebGL vendor information WebGLRenderingContext.getParameter(37445); // Get the WebGL renderer information WebGLRenderingContext.getParameter(37446); // Pass any codec into this function (e.g. "audio/aac"). It will return // either "maybe," "probably," or "" indicating whether // or not the browser can play that codec. An empty // string means that it can't be played. HTMLMediaElement.canPlayType('some/codec'); // Can ask for a permission if it is not already enabled. // Allows the script to know which permissions the user has // enabled, and which are disabled. navigator.permissions.query('some_permission'); ``` ### With canvases[​](#with-canvases "Direct link to With canvases") This technique is based on rendering [WebGL](https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API) scenes to a canvas element and observing the pixels rendered. WebGL rendering is tightly connected with the hardware, and therefore provides high entropy. Here's a quick breakdown of how it works: 1. A JavaScript snippet creates a [`<canvas>` element](https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API) and renders some font or a custom shape. 2. The script then gets the pixel-map from the `<canvas>` element. 3. The collected pixel-map is turned into a cryptographic hash specific to the device's hardware. Canvas fingerprinting takes advantage of the CSS3 feature for importing fonts into CSS (called [WebFonts](https://developer.mozilla.org/en-US/docs/Learn/CSS/Styling_text/Web_fonts)). This means it's not required to use just the machine's preinstalled fonts. Here's an example of multiple WebGL scenes visibly being rendered differently on different machines: ![Differences in canvas element renderings](/assets/images/canvas-differences-f6c668c93ead711787a67a7dac7ea62b.png) ### From AudioContext[​](#from-audiocontext "Direct link to From AudioContext") The [AudioContext](https://developer.mozilla.org/en-US/docs/Web/API/AudioContext) API represents an audio-processing graph built from audio modules linked together, each represented by an [AudioNode](https://developer.mozilla.org/en-US/docs/Web/API/AudioNode) ([OscillatorNode](https://developer.mozilla.org/en-US/docs/Web/API/OscillatorNode)). In the simplest cases, the fingerprint can be obtained by checking for the existence of AudioContext. However, this doesn't provide very much information. In advanced cases, the technique used to collect a fingerprint from AudioContext is quite similar to the `<canvas>` method: 1. Audio is passed through an OscillatorNode. 2. The signal is processed and collected. 3. The collected signal is cryptographically hashed to provide a short ID. > A downfall of this method is that two identical machines with the same browser will get the same ID. ### From BatteryManager[​](#from-batterymanager "Direct link to From BatteryManager") The `navigator.getBattery()` function returns a promise which resolves with a [BatteryManager](https://developer.mozilla.org/en-US/docs/Web/API/BatteryManager) interface. BatteryManager offers information about whether or not the battery is charging, and how much time is left until the battery has fully discharged/charged. On its own this method is quite weak, but it can be potent when combined with the `<canvas>` and AudioContext fingerprinting techniques mentioned above.
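To make the techniques above more tangible, here is a minimal sketch of our own (not any particular vendor's script) that you could paste into a browser console. It renders text to a `<canvas>` element, reads the rendering back, and combines it with a few of the other signals mentioned above. Note that `navigator.getBattery()` is only available in Chromium-based browsers, and a real fingerprinting script would hash the canvas data rather than just measure its length:

```
// Render some text to an off-screen <canvas> and read the result back.
// Different GPUs, drivers, and font stacks produce slightly different output.
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.font = '16px Arial';
ctx.fillText('fingerprint test', 10, 30);
const canvasData = canvas.toDataURL(); // long, device-specific string

// BatteryManager (Chromium-only) adds a weak signal that helps in combination
const battery = await navigator.getBattery();

const signals = {
    // a real fingerprinting script would hash canvasData instead of measuring it
    canvasDataLength: canvasData.length,
    batteryCharging: battery.charging,
    hardwareConcurrency: navigator.hardwareConcurrency,
    languages: navigator.languages,
};

console.log(signals);
```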
## Fingerprint example[​](#fingerprint-example "Direct link to Fingerprint example") When all is said and done, this is what a browser fingerprint might look like: ``` { "userAgent": "Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0", "cookiesEnabled": true, "timezone": "Europe/Prague", "timezoneOffset": -60, "audioCodecs": { "ogg": "probably", "mp3": "maybe", "wav": "probably", "m4a": "maybe", "aac": "maybe" }, "videoCodecs": { "ogg": "probably", "h264": "probably", "webm": "probably" }, "videoCard": [ "Intel Open Source Technology Center", "Mesa DRI Intel(R) HD Graphics 4600 (HSW GT2)" ], "productSub": "20100101", "hardwareConcurrency": 8, "multimediaDevices": { "speakers": 0, "micros": 0, "webcams": 0 }, "platform": "Linux x86_64", "pluginsSupport": true, "screenResolution": [ 1920, 1080 ], "availableScreenResolution": [ 1920, 1080 ], "colorDepth": 24, "touchSupport": { "maxTouchPoints": 0, "touchEvent": false, "touchStart": false }, "languages": [ "en-US", "en" ] } ``` ## How it works[​](#how-it-works "Direct link to How it works") Sites employ multiple levels and different approaches to collect browser fingerprints. However, they all have one thing in common: they are using a script written in JavaScript to evaluate the target browser's context and collect information about it (oftentimes also storing it in their database, or in a cookie). These scripts are often obfuscated and difficult to track down and understand, especially if they are anti-bot scripts. Multiple levels of script obfuscation are used to make fingerprinting scripts unreadable and hard to find: ### Randomization[​](#randomization "Direct link to Randomization") The script is modified with some random JavaScript elements. It also often incorporates a random number of whitespaces and other unusual formatting characters, as well as cryptic variable and function names devoid of readable meaning. ### Data obfuscation[​](#data-obfuscation "Direct link to Data obfuscation") Two main data obfuscation techniques are widely employed: 1. **String splitting** uses the concatenation of multiple substrings. It is mostly used alongside an `eval()` or `document.write()`. 2. **Keyword replacement** allows the script to mask the accessed properties. This allows the script to have a random order of the substrings and makes it harder to detect. Oftentimes, both of these data obfuscation techniques are used together. ### Encoding[​](#encoding "Direct link to Encoding") Built-in JavaScript encoding functions are used to transform the code into, for example, a hexadecimal string. Or, a custom encoding function is used and a custom decoding function decodes the code as it is evaluated in the browser. ## Detecting fingerprinting scripts[​](#detecting-fingerprinting-scripts "Direct link to Detecting fingerprinting scripts") As mentioned above, many sites obfuscate their fingerprinting scripts to make them harder to detect. Luckily for us, there are ways around this. ### Manual de-obfuscation[​](#manual-de-obfuscation "Direct link to Manual de-obfuscation") Almost all sites using fingerprinting and tracking scripts try to protect them as much as they can. However, it is impossible to make client-side JavaScript immune to reverse engineering. It is only possible to make reverse engineering difficult and unpleasant for the developer.
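To illustrate the obfuscation techniques described above, here is a contrived example of ours (not taken from a real anti-bot script) of how a single `navigator.userAgent` lookup can be hidden with string splitting, keyword replacement, and encoding:

```
// What the fingerprinting script actually does:
const ua = navigator.userAgent;

// The same lookup after string splitting and keyword replacement -
// the accessed property never appears in the source as a single literal:
const parts = ['user', 'Agent'];
const ua2 = window['nav' + 'igator'][parts[0] + parts[1]];

// Encoding: the property name is stored as a Base64 string and only
// decoded at runtime ('dXNlckFnZW50' decodes to 'userAgent'):
const ua3 = navigator[atob('dXNlckFnZW50')];
```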
The procedure used to make the code as unreadable as possible is called [obfuscation](https://www.techtarget.com/searchsecurity/definition/obfuscation#:~:text=Obfuscation%20means%20to%20make%20something,code%20is%20one%20obfuscation%20method.). When you want to dig inside the protection code to determine exactly which data is collected, you will probably have to deobfuscate it. Be aware that this can be a very time-consuming process. Getting the code into even a semi-readable state can take anywhere up to 1–2 days. We recommend watching some videos from [Jarrod Overson on YouTube](https://www.youtube.com/channel/UCJbZGfomrHtwpdjrARoMVaA/videos) to learn the tooling necessary to deobfuscate code. ### Using browser extensions[​](#using-browser-extensions "Direct link to Using browser extensions") Because obfuscated fingerprinting scripts have become so common, and because browser fingerprinting is such a big privacy concern, browser extensions such as [**Don't Fingerprint Me**](https://github.com/freethenation/DFPM) have been created to help detect them. In the extension's window, you can see a report on which functions commonly used for fingerprinting have been called, and which navigator properties have been accessed. ![Don't Fingerprint Me extension window](/assets/images/dont-fingerprint-me-51a71cc91aec391b54c341abe69c3cf6.png) This extension monitors only a few critical attributes, but in order to deceive anti-scraping protections, the full list is needed. However, the extension does reveal the scripts that collect the fingerprints. ## Anti-bot fingerprinting[​](#anti-bot-fingerprinting "Direct link to Anti-bot fingerprinting") Websites which implement advanced fingerprinting techniques will tie the fingerprint and certain headers (such as the **User-Agent** header) to the IP address of the user. These sites will block a user (or scraper) if it makes a request with one fingerprint and set of headers, then tries to make another request on the same proxy but with a different fingerprint. When dealing with these cases, it's important to sync the generation of headers and fingerprints with the rotation of proxies (this is known as session rotation). ## Next up[​](#next "Direct link to Next up") [Next up](https://docs.apify.com/academy/anti-scraping/techniques/geolocation.md), we'll be covering **geolocation** methods that websites use to grab the location from which a request has been made, and how they relate to anti-scraping. --- # Firewalls **Understand what a web-application firewall is, how they work, and the various common techniques for avoiding them altogether.** *** A web-application firewall (or **WAF**) is a tool for website admins which allows them to set various access rules for their visitors. The rules can vary on each website and are usually hard to detect; therefore, on sites using a WAF, you need to run a set of tests to probe the rules and find out their limits. One of the most common WAFs one can come across is the one from [Cloudflare](https://www.cloudflare.com). It allows setting a waiting screen that runs a few tests against the visitor to tell genuine visitors and bots apart. However, not all WAFs are that easy to detect. ![Cloudflare waiting screen](/assets/images/cloudflare-bd22fffac9bd5e98e327247500da14cb.png) ## How it works[​](#how-it-works "Direct link to How it works") WAFs work on a similar premise as regular firewalls.
Web admins define the rules, and the firewall executes them. As an example of how a WAF can work, we will take a look at Cloudflare's solution: 1. The visitor sends a request to the webpage. 2. The request is intercepted by the firewall. 3. The firewall decides if presenting a challenge (captcha) is necessary. If the user already solved a captcha in the past or nothing is suspicious, it will immediately forward the request to the application's server. 4. A captcha is presented which must be solved. Once it is solved, a [cookie](https://docs.apify.com/academy/concepts/http-cookies.md) is stored in the visitor's browser. 5. The request is forwarded to the application's server. ![Cloudflare WAF workflow](/assets/images/cloudflare-graphic-8f4223bc691752af247662e7778589ff.jpg) Since there are multiple providers, it is important to note that the challenges are not always graphical and can be entirely server-side (without any JavaScript evaluation in the visitor's browser). ## Bypassing web-application firewalls[​](#bypassing-firewalls "Direct link to Bypassing web-application firewalls") * Using [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md). * Mocking [headers](https://docs.apify.com/academy/concepts/http-headers.md). * Overriding the browser's [fingerprint](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md) (most effective). * Farming the [cookies](https://docs.apify.com/academy/concepts/http-cookies.md) from a website with a headless browser, then using the farmed cookies to do HTTP-based scraping (most performant). As you likely already know, there is no solution that fits all. If you are struggling to get past a WAF provider, you can try using Firefox with Playwright. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/anti-scraping/techniques/browser-challenges.md), we'll be covering **browser challenges**, and specifically the Cloudflare browser challenge, which is part of the Cloudflare WAF mentioned in this lesson. --- # Geolocation **Learn about the geolocation techniques websites use to determine where requests are coming from, and a bit about how to avoid being blocked based on geolocation.** *** Geolocation is yet another way websites can detect and block access or show limited data. Other than by using the [Geolocation API](https://developer.mozilla.org/en-US/docs/Web/API/Geolocation_API) (which requires user permission in order to receive location data), there are two main ways that websites geolocate a user (or bot) visiting it. ## Cookies & headers[​](#cookies-headers "Direct link to Cookies & headers") Some websites might use location-specific or language-specific [headers](https://docs.apify.com/academy/concepts/http-headers.md)/[cookies](https://docs.apify.com/academy/concepts/http-cookies.md) to geolocate a user. Some examples of these headers are `Accept-Language` and `CloudFront-Viewer-Country` (which is a custom HTTP header from [CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/adding-cloudfront-headers.html)). On targets which are utilizing just cookies and headers to identify the location a request is coming from, it is pretty straightforward to make requests which appear like they are coming from somewhere else. ## IP address[​](#ip-address "Direct link to IP address") The oldest (and still most common) way of geolocating is based on the IP address used to make the request.
Sometimes, country-specific sites block themselves from being accessed from any other country (some Chinese, Indian, Israeli, and Japanese websites do this). [Proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) can be used in a scraper to bypass these restrictions and to make requests from a different location. Oftentimes, proxies need to be used in combination with location-specific [cookies](https://docs.apify.com/academy/concepts/http-cookies.md)/[headers](https://docs.apify.com/academy/concepts/http-headers.md). ## Override/emulate geolocation when using a browser-based scraper[​](#override-emulate-geolocation "Direct link to Override/emulate geolocation when using a browser-based scraper") When using [Puppeteer](https://pptr.dev/#?product=Puppeteer&show=api-pagesetgeolocationoptions), you can emulate the geolocation with the `page.setGeolocation()` function. In [Playwright](https://playwright.dev/docs/api/class-browsercontext#browsercontextsetgeolocationgeolocation), geolocation can be emulated by using `browserContext.setGeolocation()`. Overriding browser geolocation should be used in tandem with a proper proxy corresponding to the emulated geolocation. You would still likely get blocked if you, for example, used a German proxy with the overridden location set to Japan. --- # Rate-limiting **Learn about rate-limiting, a common tactic used by websites to avoid a large and non-human rate of requests coming from a single IP address.** *** When crawling a website, a web scraping bot will typically send many more requests from a single IP address than a human user could generate over the same period. Websites can monitor how many requests they receive from a single IP address, and block it or require a [captcha](https://docs.apify.com/academy/anti-scraping/techniques/captchas.md) test to continue making requests. In the past, most websites had their own anti-scraping solutions, the most common of which was IP address rate-limiting. In recent years, the popularity of third-party specialized anti-scraping providers has dramatically increased, but a lot of websites still use rate-limiting to only allow a certain number of requests per second/minute/hour to be sent from a single IP; therefore, crawler requests have the potential of being blocked entirely quite quickly. In cases when a higher number of requests is expected for the crawler, using a [proxy](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked. ## Dealing with rate limiting by rotating proxy or session[​](#dealing-with-rate-limiting "Direct link to Dealing with rate limiting by rotating proxy or session") The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) after every **n** requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make a large number of requests to a website without getting restricted. In Crawlee, proxies are automatically rotated for you when you use `ProxyConfiguration` and a [**SessionPool**](https://crawlee.dev/api/core/class/SessionPool) within a crawler.
The SessionPool handles a lot of the nitty gritty of proxy rotating, especially with [browser based crawlers](https://docs.apify.com/academy/puppeteer-playwright.md) by retiring a browser instance after a certain number of requests have been sent from it in order to use a new proxy (a browser instance must be retired in order to use a new proxy). Here is an example of these features being used in a **PuppeteerCrawler** instance: ``` import { PuppeteerCrawler } from 'crawlee'; import { Actor } from 'apify'; const myCrawler = new PuppeteerCrawler({ proxyConfiguration: await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], }), sessionPoolOptions: { // Note that a proxy is tied to a session sessionOptions: { // Let's say the website starts blocking requests after // 20 requests have been sent in the span of 1 minute from // a single user. // We can stay on the safe side and retire the browser // and rotate proxies after 15 pages (requests) have been opened. maxUsageCount: 15, }, }, // ... }); ``` > Take a look at the [**Using proxies**](https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md) lesson to learn more about how to use proxies and rotate them in Crawlee. ### Configuring a session pool[​](#configuring-session-pool "Direct link to Configuring a session pool") To set up the SessionPool for different rate-limiting scenarios, you can use various configuration options in `sessionPoolOptions`. In the example above, we used `maxUsageCount` within `sessionOptions` to prevent more than 15 requests from being sent using a session before it was thrown away; however, a maximum age can also be set using `maxAgeSecs`. When dealing with frequent and unpredictable blockage, the `maxErrorScore` option can be set to trash a session after it's hit a certain number of errors. To learn more about all configurations available in `sessionPoolOptions`, refer to the [Crawlee documentation](https://crawlee.dev/api/core/interface/SessionPoolOptions). > Don't worry too much about these configurations. Crawlee's defaults are usually good enough for the majority of use cases. ## Next up[​](#next "Direct link to Next up") Though rate limiting is still common today, a lot of sites have improved over the years to use more complicated techniques such as **browser fingerprinting**, which is covered in the [next lesson](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md). --- # API Tutorials 💻📚 **A collection of various tutorials explaining how to interact with the Apify platform programmatically using its API.** *** This section explains how you can run [Apify Actors](https://docs.apify.com/platform/actors.md) using Apify's [API](https://docs.apify.com/api/v2.md), retrieve their results, and integrate them into your own product and workflows. You can do this using a raw HTTP client, or you can benefit from using one of our API clients for: * [JavaScript](https://docs.apify.com/api/client/js) * [Python](https://docs.apify.com/api/client/python) --- # API scraping **Learn all about how the professionals scrape various types of APIs with various configurations, parameters, and requirements.** *** API scraping is locating a website's API endpoints, and fetching the desired data directly from their API, as opposed to parsing the data from their rendered HTML pages. > **Note:** In the next few lessons, we'll be using [SoundCloud's website](https://soundcloud.com) as an example target, but the techniques described here can be applied to any site. 
In this module, we will discuss the benefits and drawbacks of API scraping, how to locate an API, how to utilize its potential features, and how to work around some common roadblocks. ## What's an API?[​](#what-is-api "Direct link to What's an API?") An API is a custom service that lives on the server of any given website. It provides an intuitive way for the website's client-side pages to send and receive data to and from the server, where it can be stored in a database, manipulated, or used to perform an operation. Though not **all** sites have APIs, many do, especially those built as complex web applications. Learn more about APIs [in this article](https://blog.apify.com/what-is-an-api/). ## Different types of APIs[​](#different-types-of-apis "Direct link to Different types of APIs") Websites use APIs which can be either REST or GraphQL. While REST is a vague architectural style based only on conventions, GraphQL is a specification. A REST API usually consists of many so-called endpoints, to which you can send your requests. In the responses, you are provided with information about various resources, such as users, products, etc. Examples of typical REST API requests: ``` GET https://api.example.com/users/123 GET https://api.example.com/comments/abc123?limit=100 POST https://api.example.com/orders ``` In a GraphQL API, all requests are `POST` and point to a single URL, typically something like `https://api.example.com/graphql`. To get data, you send along a query in the GraphQL query language, optionally with variables. Here's an example of such a query: ``` query($number_of_repos: Int!) { viewer { name repositories(last: $number_of_repos) { nodes { name } } } } ``` ## Advantages of API scraping[​](#advantages "Direct link to Advantages of API scraping") ### 1. More reliable[​](#1-more-reliable "Direct link to 1. More reliable") Since the data is coming directly from the site's API, as opposed to being parsed out of HTML content based on CSS selectors, it can be relied on more, as it is less likely to change. Typically, websites change their APIs much less frequently than they change the structure/selectors of their pages. ### 2. Configurable[​](#2-configurable "Direct link to 2. Configurable") Most APIs accept query parameters such as `maxPosts` or `fromCountry`. These parameters can be mapped to the configuration options of the scraper, which makes creating a scraper that supports various requirements and use-cases much easier. They can also be utilized to filter and/or limit data results. ### 3. Fast and efficient[​](#3-fast-and-efficient "Direct link to 3. Fast and efficient") Especially for [dynamic sites](https://blog.apify.com/what-is-a-dynamic-page/), in which a headless browser would otherwise be required (which can sometimes be slow and cumbersome), scraping their API can prove to be much quicker and more efficient. ### 4. Easy on the target website[​](#4-easy-on-the-target-website "Direct link to 4. Easy on the target website") Depending on the website, sending large numbers of requests to their pages could result in a slight performance decrease on their end. By using their API instead, not only does your scraper run better, but it is less demanding of the target website. ## Disadvantages of API Scraping[​](#disadvantages "Direct link to Disadvantages of API Scraping") ### 1. Sometimes requires special tokens[​](#1-sometimes-requires-special-tokens "Direct link to 1.
Sometimes requires special tokens") Many APIs will require the session cookie, an API key, or some other special value to be included within the header of the request in order to receive any data back. For certain projects, this can be a challenge. ### 2. Potential overhead[​](#2-potential-overhead "Direct link to 2. Potential overhead") For complex APIs that require certain headers and/or payloads in order to make a successful request, return encoded data, have rate limits, or that use GraphQL, there can be a slight overhead in figuring out how to utilize them in a scraper. ## Extra challenges[​](#extra-challenges "Direct link to Extra challenges") ### 1. Different data formats[​](#1-different-data-formats "Direct link to 1. Different data formats") APIs come in all different shapes and sizes. That means every API will vary in not only the quality of the data that it returns, but also the format that it is in. The two most common formats are JSON and HTML. JSON responses are ideal, as they can be manipulated in JavaScript code. In general, no serious parsing is necessary, and the data can be filtered and formatted to fit a scraper's dataset schema. APIs which output HTML generally return the raw HTML of a small component of the page which is already hydrated with data. In these cases, it is still worth using the API, as it is still more efficient than making a request to the entire page; even though the data does still need to be parsed from the HTML response. ### 2. Encoded data[​](#2-encoded-data "Direct link to 2. Encoded data") Sometimes, a response will look something like this: ``` { "title": "Scraping Academy Message", "message": "SGVsbG8hIFlvdSBoYXZlIHN1Y2Nlc3NmdWxseSBkZWNvZGVkIHRoaXMgYmFzZTY0IGVuY29kZWQgbWVzc2FnZSEgV2UgaG9wZSB5b3UncmUgbGVhcm5pbmcgYSBsb3QgZnJvbSB0aGUgQXBpZnkgU2NyYXBpbmcgQWNhZGVteSE=" } ``` Or some other encoding format. This example's `message` has some data encoded in [Base64](https://en.wikipedia.org/wiki/Base64), which is one of the most common encoding types. For testing out Base64 encoding and decoding, you can use [base64encode.org](https://www.base64encode.org/) and [base64decode.org](https://www.base64decode.org/). Within a project where base64 decoding/encoding is necessary, the [Node.js Buffer Class](https://nodejs.org/api/buffer.html) can be used like so: ``` const value = 'SGVsbG8hIFlvdSBoYXZlIHN1Y2Nlc3NmdWxseSBkZWNvZGVkIHRoaXMgYmFzZTY0IGVuY29kZWQgbWVzc2FnZSEgV2UgaG9wZSB5b3UncmUgbGVhcm5pbmcgYSBsb3QgZnJvbSB0aGUgQXBpZnkgU2NyYXBpbmcgQWNhZGVteSE='; const decoded = Buffer.from(value, 'base64').toString('utf-8'); console.log(decoded); ``` ## First up[​](#first "Direct link to First up") Get started with this course by learning some general knowledge about API scraping in the [General API Scraping](https://docs.apify.com/academy/api-scraping/general-api-scraping.md) section! This section will teach you everything you need to know about scraping APIs before moving into more complex sections. --- # General API scraping **Learn the benefits and drawbacks of API scraping, how to locate an API, how to utilize its features, and how to work around common roadblocks.** *** This section will teach you everything you should know about API scraping before moving into the next sections in the **API Scraping** module. Learn how to find APIs, how to use them, how to paginate them, and how to get past some common roadblocks when dealing with them. 
Each lesson will prepare you for real-world API scraping, and will help put yet another data extraction technique into your scraping toolbelt. ## Next up[​](#next "Direct link to Next up") In our [first lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md), we will take a look at how to locate a website's API endpoints with DevTools, and how to use them. This is your entrypoint into learning how to scrape APIs. --- # Dealing with headers, cookies, and tokens **Learn about how some APIs require certain cookies, headers, and/or tokens to be present in a request in order for data to be received.** *** Unfortunately, most APIs will require a valid cookie to be included in the `cookie` field within a request's headers in order to be authorized. Other APIs may require special tokens, or other data that validates the request. Luckily, there are ways to retrieve and set cookies for requests prior to sending them, which will be covered more in-depth within future Scraping Academy modules. The most important things to know at the moment are: ## Cookies[​](#cookies "Direct link to Cookies") 1. For sites that heavily rely on cookies for user-verification and request authorization, certain generic requests (such as to the website's main page, or to the target page) will return back a (or multiple) `set-cookie` header(s). 2. The `set-cookie` response header(s) can be parsed and used as the `cookie` header in the headers of a request. A great package for parsing these values from a response's headers is [`set-cookie-parser`](https://www.npmjs.com/package/set-cookie-parser). With this package, cookies can be parsed from headers like so: ``` import axios from 'axios'; // import the set-cookie-parser module import setCookieParser from 'set-cookie-parser'; const getCookie = async () => { // make a request to the target site const response = await axios.get('https://www.example.com/'); // parse the cookies from the response const cookies = setCookieParser.parse(response); // format the parsed data into a usable string const cookieString = cookies.map(({ name, value }) => `${name}=${value};`).join(' '); // log the final cookie string to be used in a 'cookie' header console.log(cookieString); }; getCookie(); ``` ## Headers[​](#headers "Direct link to Headers") Other APIs may not require a valid cookie header, but instead will require certain headers to be attached to the request which are typically attached when a user makes a "real" request from a browser. The most commonly required headers are: * `User-Agent` * `Referer` * `Origin` * `Host` Headers required by the target API can be configured manually in a manner such as this, and attached to every single request the scraper sends: ``` const HEADERS = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)' + 'Chrome/96.0.4664.110 YaBrowser/22.1.0.2500 Yowser/2.5 Safari/537.36', Referer: 'https://soundcloud.com', // ... }; ``` However, a much better option is to use either a custom implementation of generating random headers for each request, or to use a package such as [`got-scraping`](https://www.npmjs.com/package/got-scraping) to automatically do this. With `got-scraping`, generating request-specific headers can be done right within a request with `headerGeneratorOptions`. 
Specific headers can also be set with the `headers` option: ``` const response = await gotScraping({ url: 'https://example.com', headerGeneratorOptions: { browsers: [ { name: 'chrome', minVersion: 87, maxVersion: 89, }, ], devices: ['desktop'], locales: ['de-DE', 'en-US'], operatingSystems: ['windows', 'linux'], }, headers: { 'some-header': 'Hello, Academy!', }, }); ``` ## Tokens[​](#tokens "Direct link to Tokens") For our SoundCloud example, testing the endpoint from the previous section in a tool like [Postman](https://docs.apify.com/academy/tools/postman.md) works perfectly, and returns the data we want; however, when the `client_id` parameter is removed, we receive a **401 Unauthorized** error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big downfall is that the token being used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common, and is not only seen with SoundCloud. Ideally, this `client_id` should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the parameters of certain requests though, and luckily, [Puppeteer](https://github.com/puppeteer/puppeteer) offers a way to analyze each response when on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but programmatically instead. Here is a way you could dynamically scrape the `client_id` using Puppeteer: ``` // import the puppeteer module import puppeteer from 'puppeteer'; const scrapeClientId = async () => { const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); // initialize a variable that will eventually hold the client_id let clientId = null; // handle each response page.on('response', async (res) => { // try to grab the 'client_id' parameter from each URL const id = new URL(res.url()).searchParams.get('client_id') ?? null; // if the parameter exists, set our clientId variable to the newly parsed value if (id) clientId = id; }); // visit the page await page.goto('https://soundcloud.com/tiesto/tracks'); // wait for a selector that ensures the page has time to load and make requests to its API await page.waitForSelector('.profileHeader__link'); await browser.close(); console.log(clientId); // log the retrieved client_id }; scrapeClientId(); ``` ## Next up[​](#next "Direct link to Next up") Keep the code above in mind, because we'll be using it in the [next lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/handling-pagination.md) when paginating through results from SoundCloud's API. --- # Handling pagination **Learn about the three most popular API pagination techniques and how to handle each of them when scraping an API with pagination.** *** When scraping large APIs, you'll quickly realize that most APIs limit the number of results it responds back with. For some APIs, the max number of results is 5, while for others it's 2000. Either way, they all have something in common - pagination. If you've never dealt with it before, trying to scrape thousands to hundreds of thousands of items from an API with pagination can be a bit challenging. In this lesson, we'll be discussing a few of the different types of pagination, as well as how to work with them. 
## Page-number pagination[​](#page-number "Direct link to Page-number pagination") The most common and rudimentary forms of pagination have page numbers. Imagine paginating through a typical e-commerce website. ![Amazon pagination](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAhQAAAC7CAMAAAAKcffFAAABDlBMVEX////d3d3q7O7s7vDo6u319vjz9Pf29/n09fjy8/Xx8vXw8fTv8POjpqv8/Pzs7e/q7O+usbfn6eyNkJVVVVXt7/LbejXv8PLt7/ERERF1dnZcXF1nZ2iVlZjl5uhxcXJZWVng4OG+v8G0tbVkZGXS09SGhofKysu4ubqsra7Pz9DDxceoqKkyMzP4+Pny8vKUlJS5Wyn9+vXY2dnGyM2nqq6MjIxsbGzj4+SwsLI7PDy7vL+dnZ6ZmZmQkJF8fH3QkWT9/f3b3N2BgYFAQUG8Yy7Hx8eioqPprnRgYGDcfzjKzNCqqqt5eXkcHB20t7ulpaYmJicuLi/nqGxISUnoyrMcHBz58uzVnXPLhFPI+UfpAAAFzElEQVR42uzSQREAAAgDoNm/tH9tsIMMBAAAAAAAAAAAAAAAAAAAAAAAAAAAAKDUwCEFTwAAAAAAAGDZudPmpKEojONHFtFCQOtBepNAwtKwUyzKVhZZpKK0tnX//l/EmzrtmAS5MYkv0PPrTDPpPDM3A/8Xnc5QQgghhBBCCCGEEEIIcen07bvnO717ewpc8v3T3d4n3c5844cEiT8SsXj78VRQzce3wL182Uru0nr50u3MH+chYsJHIhbvTkHg9B1wr6Kx3aKv3M5844cEiT8SsXjuciLFRCS3M9+kWLAkIN6iiIpIbme+SdFgURQUBUURWBSPRCS3M7/4IcGiKCgKiiKwKB6KSG5nfvFDgkVRUBQURWBRPBaR3M784ocE6z+JItn/61GE/ERxtDyH0fLEvGwVu7tSFH+mqCjKepqDbQyce4zi6+s3L75ti+LgV4fGNVuODiwk56zQNE2cszy7AI3lzItTdKLj+goAKgrqY7ByHjJv6gl+UaYHNvn6gTv/UBQNnJx3dT0DW5S7MW9RfH7BvRZGscH2FbKCKIpONpsdsoZzBiczmD2+vTgVe6NqD1dQx2lxgiNRFMeMmXkO1QOb9uX/GEUeYIOVWKmyWOczU11fRAulGkCqtKmWYjAbK6jyW7kLMC+toKDpalEYxffPb7ZHEbcrs4nlXto6G7PzuGMmdoxjUBUAmGILLByHHLPhZT4eH6rxeA2vm9n45lMh3v/Und6wT59CcRf2MYrI2e+jaOMqiohqTsWugYMUGgBVPM7iCYxxbSD2QdHM17iYQfWqeyWKgnMbRZfVxVG0WcPTO5DFGuAUAM6xL4yicaPeRpG7bE4UNo80mymNdYwhk+WtUQzyex9FSPldFOOqgXo0inIGOjgCUGXQ9BloCphR4NqMoXsXxQM0i/AeRcLmmGHCQtoye8AYU/L2mdgce9AyA4cB1pxR2B5kMGCVxFBNZNkqcchGiQorsWwiIV8mtisOrY+0f1GsdKV878QSBVfqQBTbAEXUFgtF4dfaQxybUaRwAZBE7S4KWKN8HFwUh8vLjjiKVGXQYz3bTCy2xhzMcAEAG+wIo6gmsBnhUchM07RLOZFQ2TLsiKJfuNe8OdvvKLqoy/c6lig6rSTAzyiqqBmGMYIWdiuYM6Po4wQA9N59FNEu4rn3KDIWRyVWyVhJztmt5rVtJjSTsW4+uwYAWQyBheMQHkWmxoyhmukxmRtkMhdsGcpk5C+WXbl074YtMr/YvyigqGd+/zvFXRSru/d7oVyoYEaRxB5AH6egqD+jADhRde9RpC0WrJG2kZwzU7h5bZsJdbEInIa8eVWfgYXjkBqrptNTxtS0wcppU50pbJJOayy9Xf56Y7nfwyhg0BZHEdOVav6sAlBDrN5GARfYXvWwDhquKj0s5gedvqp4j+LoVyOGVc7yM8k5qxjFgsyytpnIBnsVDgp4UWtjGyych9QYf47+NVOPygyLtUY9MlyG1uz4aMKuKpGjLYyG9X4fo+CEUUBZR8QJACi8Bxjzb5EeIo4BCvxi8FdLQVTq3qMI/0pltyw/k5yzQZOP1EPbTOQCbwGMEHGaBAvnITU24N/PmRoOj/lxN1WD1cPzLxjuLNmXediFfygKm1kuDzapTgu41rz187bv58/c4pd22yz/PufjHUh2HoLN7meJ1CwV9MPh/zsKb4KPQjzzTQq7RVH83ShSIpLbmW9SKlgUhdconohIbmd+8UOCRVF4jSIkIrmd+SaFgkVRUBQUxT8aRSS4L4rCexQREcntzDcpEiyKwutnSXOR3XKv3M5844cEKUefJfX4qfP6h9zhLrkPdbczf5yHiAkfiXj7/xRn2We7ZM+Sbme+mYcEJ3tG/5+CEEIIIYQQQgghhBBCCCGEEELIj3bpmAYAAIQBGPg3zQ83yY5WQ3nTsEjBUQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANEGpIZeHRnBbfwAAAAASUVORK5CYII=) This implementation makes it fairly straightforward to programmatically paginate through an API, as it pretty much entails incrementing up or down in order to receive the next set of items. The page number is usually provided right in the parameters of the request URL; however, some APIs require it to be provided in the request body instead. ## Offset pagination[​](#offset-pagination "Direct link to Offset pagination") The second most popular pagination technique used is based on using a **limit** parameter along with an **offset** parameter. The **limit** says how many records should be returned in a single request, while the **offset** parameter says how many records should be skipped. 
For example, let's say that we have this dataset and an API route to retrieve its items:

```
const myAwesomeDataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];
```

If we were to make a request with the **limit** set to **5** and the **offset** parameter also set to **5**, the API would skip over the first five items and return `[6, 7, 8, 9, 10]`.

## Cursor pagination[​](#cursor-pagination "Direct link to Cursor pagination")

Sometimes, pagination uses a **cursor** instead of an **offset**. A cursor is a marker of an item in the dataset - it can be a date, a number, or a more or less random string of letters and numbers. A request with a **cursor** parameter will result in an API response containing the items which follow the item that the cursor points to. One of the most painful things about scraping APIs with cursor pagination is that you can't skip to, for example, the 5th page - you have to paginate through each page one by one.

> Note: SoundCloud [migrated](https://developers.soundcloud.com/blog/pagination-updates-on-our-api/) over to using cursor-based pagination; however, they did not change the parameter name from **offset** to **cursor**. Always be on the lookout for this type of stuff!

## Using "next page"[​](#using-next-page "Direct link to Using \"next page\"")

In a minute, we're going to create a mini-project which will scrape the first 100 of Tiësto's tracks by keeping a **limit** of 20 and paginating through until we've scraped 100 items. Luckily for us, SoundCloud's API (and many others) provides a **next\_href** property in each response, which means we don't have to directly deal with setting the **offset** (cursor) parameter:

```
//...
{
    "next_href": "https://api-v2.soundcloud.com/users/141707/tracks?offset=2020-03-13T00%3A00%3A00.000Z%2Ctracks%2C00774168919&limit=20&representation=https%3A%2F%2Fapi-v2.soundcloud.com%2Fusers%2F141707%2Ftracks%3Flimit%3D20",
    "query_urn": null
}
```

This URL can take various different forms and can be given different names; however, it generally does the same thing - it brings you to the next page of results.

## Mini project[​](#mini-project "Direct link to Mini project")

First, create a new folder called **pagination-tutorial** and run this command inside of it:

```
# initialize the project and install the puppeteer
# and got-scraping packages
npm init -y && npm i puppeteer got-scraping
```

Now, make a new file called **scrapeClientId.js**, copying the **client\_id** scraping code from the previous lesson and making a slight modification:

```
// scrapeClientId.js
import puppeteer from 'puppeteer';

// export the function to be used in a different file
export const scrapeClientId = async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    let clientId = null;

    page.on('response', async (res) => {
        const id = new URL(res.url()).searchParams.get('client_id') ?? null;
        if (id) clientId = id;
    });

    await page.goto('https://soundcloud.com/tiesto/tracks');
    await page.waitForSelector('.profileHeader__link');

    await browser.close();

    // return the client_id
    return clientId;
};
```

Now, in a new file called **index.js**, we'll write the skeleton for our pagination and item-scraping code:

```
// index.js
// we will need gotScraping to make HTTP requests
import { gotScraping } from 'got-scraping';

import { scrapeClientId } from './scrapeClientId';

const scrape100Items = async () => {
    // the initial request URL
    let nextHref = 'https://api-v2.soundcloud.com/users/141707/tracks?limit=20&offset=0';

    // create an array for all of our scraped items to live
    const items = [];

    // scrape the client ID with the script from the
    // previous lesson
    const clientId = await scrapeClientId();

    // More code will go here
};
```

Let's now take a step back and think about the conditions under which we should continue paginating:

1. If the API responds with a **next\_href** set to **null**, that means that there are no more pages, we have scraped all of the possible items, and we should stop paginating.
2. If our items list has 100 records or more, we should stop paginating. Otherwise, we should continue until 100+ items have been scraped.

With a full understanding of these conditions, we can translate them into code:

```
const scrape100Items = async () => {
    // ...previous code

    // continue making requests until we've either collected 100+ items,
    // or the API has no more pages for us (next_href is null)
    while (items.flat().length < 100 && nextHref !== null) {
        // the pagination logic will go here
    }
};
```

Within the loop, we'll make a request to the current **nextHref** (with the scraped **client\_id** attached), push the returned items to our **items** array, and update **nextHref** with the **next\_href** value from the response.

> Note that it's better to add requests to a requests queue rather than processing them in memory. The crawlers offered by [Crawlee](https://crawlee.dev/docs/) provide this functionality out of the box.

```
// index.js
import { gotScraping } from 'got-scraping';

import { scrapeClientId } from './scrapeClientId';

const scrape100Items = async () => {
    let nextHref = 'https://api-v2.soundcloud.com/users/141707/tracks?limit=20&offset=0';

    const items = [];

    const clientId = await scrapeClientId();

    while (items.flat().length < 100 && nextHref !== null) {
        // make the paginated request with the client_id appended to the URL
        const res = await gotScraping(`${nextHref}&client_id=${clientId}`, {
            responseType: 'json',
        });

        // push the current page of items to our items array
        items.push(res.body.collection);

        // set nextHref to the URL of the next page
        // (it will be null once there are no more pages)
        nextHref = res.body.next_href;
    }

    // return a single flat array of all scraped items
    return items.flat();
};

(async () => {
    // run the function
    const data = await scrape100Items();

    // log the length of the items array
    console.log(data.length);
})();
```

> We are using the [`.flat()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/flat) method when returning the **items** array to turn our array of arrays into a single array of items.

Here's what the output of this code looks like:

```
105
```

## Final note[​](#final-note "Direct link to Final note")

Sometimes, APIs have limited pagination. That means that they limit the total number of results that can appear for a set of pages, or that they limit the pages to a certain number. To learn how to handle these cases, take a look at the [Crawling with search](https://docs.apify.com/academy/advanced-web-scraping/crawling/crawling-with-search.md) article.

## Next up[​](#next "Direct link to Next up")

This is the last lesson in the API scraping tutorial for now, but be on the lookout for more lessons soon to come! Thus far, you've learned how to:

1. Locate API endpoints
2. Understand located API endpoints and their parameters
3. Parse and modify cookies
4. Modify/set headers
5. Farm API tokens using Puppeteer
6. Use paginated APIs

If you'd still like to read more about API scraping, check out the [**GraphQL scraping**](https://docs.apify.com/academy/api-scraping/graphql-scraping.md) course! GraphQL is the king of API scraping.

---

# Locating API endpoints

**Learn how to effectively locate a website's API endpoints, and learn how to use them to get the data you want faster and more reliably.**

***

In order to retrieve a website's API endpoints, as well as other data about them, the **Network** tab within Chrome's (or another browser's) DevTools can be used. This tab allows you to see all of the various network requests being made, and even lets you filter them by request type, response type, or a keyword.

On our target page, we'll open up the Network tab and filter by the request type `Fetch/XHR`, as opposed to the default of `All`. Next, we'll perform an action on the page which causes the request for the target data to be sent, which will enable us to view the request in DevTools. The action needed varies depending on the website, the type of page, and the type of data being returned. Sometimes, reloading the page is enough, while other times, a button must be clicked, or the page must be scrolled. For our example use case, reloading the page is sufficient.

*Here's what we can see in the Network tab after reloading the page:*

![Network tab results after completing an action on the page which results in the API being called](/assets/images/results-in-network-tab-be10d5fd17e35bf8aafca9b2899cdccd.png)

Let's say that our target data is a full list of Tiësto's uploaded songs on SoundCloud. We can use the **Filter** option to search for the keyword `tracks` and see if any endpoints have been hit that include that word. Multiple results may still be in the list when using this feature, so it is important to carefully examine the payloads and responses of each request in order to ensure that the correct one is found.

> Filtering requests: To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case the string `tracks`).

After a little bit of digging through the different response values of each request in our filtered list within the Network tab, we can discover this endpoint, which returns a JSON list including 20 of Tiësto's latest tracks:

![Endpoint found in the Network tab](/assets/images/endpoint-found-6c93a91aff4ad378bf5b5b1baceeba3e.png)

## Learning the API[​](#learning-the-api "Direct link to Learning the API")

The majority of APIs, especially for popular sites that serve up large amounts of data, are configurable through different parameters, query options, or payload values. Often, an endpoint discovered through the Network tab will reveal at least a few of these options. Here's what our target endpoint's URL looks like coming directly from the Network tab:

```
https://api-v2.soundcloud.com/users/141707/tracks?representation=&client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=20&offset=0&linked_partitioning=1&app_version=1646987254&app_locale=en
```

Since our request doesn't have any body/payload, we need to analyze the URL. We can break this URL down into chunks that help us understand what each value does.

![Breaking down the request url into understandable chunks](/assets/images/analyzing-the-url-d13462b4beaa20eb6bab7d8f95091507.png)

Understanding an API's various configurations helps with creating a game plan on how to best scrape it, as many of the parameters can be used for pagination or data filtering. Additionally, these values can be mapped to a scraper's configuration options, which overall makes the scraper more versatile.
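
If you prefer to see those parameters laid out programmatically rather than reading the raw URL by eye, Node.js can split them apart for you. Here's a small sketch using the built-in `URL` class and the endpoint copied above (output shortened):

```
// break the copied endpoint URL into its individual query parameters
const endpoint = new URL('https://api-v2.soundcloud.com/users/141707/tracks?representation=&client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=20&offset=0&linked_partitioning=1&app_version=1646987254&app_locale=en');

for (const [name, value] of endpoint.searchParams) {
    console.log(`${name} = ${value}`);
}

// representation =
// client_id = zdUqm51WRIAByd0lVLntcaWRKzuEIB4X
// limit = 20
// offset = 0
// ...
```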
Let's say we want to receive all of the user's tracks in one request. Based on our observations of the endpoint's different parameters, we can modify the URL and utilize the `limit` option to return more than twenty songs. The `limit` option is extremely common with most APIs, and allows the person making the request to cap the maximum number of results returned by the request:

```
https://api-v2.soundcloud.com/users/141707/tracks?client_id=zdUqm51WRIAByd0lVLntcaWRKzuEIB4X&limit=99999
```

By using the ridiculously large number of `99999`, we ensure that all of the user's tracks will be captured in this single request. Luckily, with SoundCloud's API, there is no cap to the `limit` parameter; however, most other APIs will have a limit to ensure that hundreds of thousands of results aren't retrieved at one time. For this use case, setting a massive results limit is not much of a risk, as most users don't have a track count over 500 anyway, but receiving too many results at once can result in overflow errors.

## Next up[​](#next "Direct link to Next up")

[Next lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md) will be all about cookies, headers, and tokens, and how they're relevant when scraping an API.

---

# GraphQL scraping

**Dig into the topic of scraping APIs which use the latest and greatest API technology - GraphQL. GraphQL APIs are very different from regular REST APIs.**

***

[GraphQL](https://graphql.org/) APIs are different from the regular [REST](https://www.redhat.com/en/topics/api/what-is-a-rest-api)ful APIs you're likely familiar with, which means that different methods and tooling are used to scrape them. This course will teach you everything you need to know about GraphQL to scrape an API built with it.

## How do I know if it's a GraphQL API?[​](#graphql-endpoints "Direct link to How do I know if it's a GraphQL API?")

In this section, we'll be scraping [cheddar.com](https://www.cheddar.com/)'s GraphQL API. When you visit the website and search for anything while your **Network Tab** is open, you'll see a request that has been sent to the endpoint **api.cheddar.com/graphql**.
![GraphQL endpoint](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAWYAAACCCAMAAABCWpIEAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAHWaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIj4KICAgICAgICAgPGV4aWY6UGl4ZWxZRGltZW5zaW9uPjEzMDwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4zNTg8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpVc2VyQ29tbWVudD5TY3JlZW5zaG90PC9leGlmOlVzZXJDb21tZW50PgogICAgICA8L3JkZjpEZXNjcmlwdGlvbj4KICAgPC9yZGY6UkRGPgo8L3g6eG1wbWV0YT4KFZoXxQAAAv1QTFRFDmOcKywpzc3NKi0pSUxQmqCmKSwoHyEjKiotAAAAvcbPvMXOKCgsHiAimZ+lLC0xKSwpJicrjpWbgYaMOj1ANjc6NDY5ISIlGhwdEhIURUhMMTM3lJqgHR4gkpifKywvlpyid3uBJCYnT1FOTlBNmJ6kPT9DzM7QREREusPMYWRpT1JOCwwNQEJGR0pOUFNYQEBAQ0NDIiMmWl5iT1FVNjg76+vrbXF27OzsZmtxTE5TOz5Bh42SlZuhyszNiY6UOTo+e4CFl52jQkJCLzAzcXZ6kZedsrrDRklNZmlvfoOJb3N3t8DIi5CWX2Jn5+nrCQsMQkVIU1dbHWygMDE1YmZrE2aesrvEtb7H6enpjJGXhYqPyMrMa290UFJPAb4A3t/ghouRTVBU2tzeucHJc3h+aW1yXGBlVlle5ObpxMTEjZSagIWKSkxQcnd8KSsvXFxci5KXKSoo4OLjZWhtSk1RGx0gg4iOWFtgen6CvLy9MDA1jKm9Im6hQ0ZJICAhMTMyfIGHpLbDVFRTc3RzkpOTGWmfu8PMr7e/Njo9hqa7oamxqLC4qbG5T1FPm7LAra+wNjY54uTnm5ydpKyz1dfZqbnEVVZYY5Ky5+fnb3R4lpeZnrPBioqKJyosrLS92NnapKWmsb7GUYiuO3yoRoKqwMHCwMbKrrzFnp+fg6S61dbVWVtgmZmcMnelaJWze31+xcjLysvKoaKj0dLTxcnLampqusPIw8fKZWVlQX6pLS8tj5CRJnGjxsfGc5u2Z2hoi9+Lio+VgYKBtre45OTkuMHJSElHLXSkdXl9Pj4+qaqsTU5NVtFWW46wHyElnaWsfaG5YWFhOHunG8oOsbO1la6/kqy+MDEvSoWsC3d2Aa0WBAQEd3h4ep+4NDYzl56mS0tLOtEkIMMfbm9uEIdhvO28DG2JBZk7DsEQubq6rrjAa5a0l6+/PbBaGLYuWdwzXNwzGBgYlq+/wsPHRctIddB8K8gtRMtFuuy6mLC/bpi1d523DizNK80pKj828AAAFZpJREFUeNrsmu9PGlsax6vbYw+JzrQWWJEBFAo2qyJVEfzBvYhcghrd1rAvBqsZJakuWMX6YxUT2wZ5a2yNMVPijb4xIdqkq8Ss0RhTSxqvvjJp0hdtmiZ91Zc3+wfsOQNoewWhtmmTvfMdODPPM5w5Mx8fn3MOhwuXRJe+mf7vL3Xua13gMfOYecw8Zh4zj5nHzGPmMfOYecypMV/k9R10ghkmkeBrBXnBC7E9j5nHzGPmMfOYecw8Zh4zj/lPgNm9g0t64zTJA3damPsAAOp682eXN7SfbpIEtYnuJFF1rJbhdJ4U1wZ13V+PrL4ufgRK4R3hJ0+AjPQxO44cqDh0fAHmmfQwX+3Sm8uows9uoDl9zImqY3Vr0qGDautL+woU3xBzhQOqu2HpMTp19xdgJr1eh2Pe+2tyzA9WWNehoGSHpsMSQYOLZsJu5KORb+npzmIDQ7MLiTFfRUW/HLZRwFKhp25BuN+PME80qS2FZNQLB/otRGESzNHqUqqN6q11Ern2GjkJoa2yrwJyJkldhBXUBHyyn7T2RSCFdwoshQOwnwBOsYIqtKjHSahxElQrnKAaCwgDlNYBon4AlsmJOnESzB37z4GtFlIaObAMD49DjQcQhRAb8I6JqNPDn1c9+wkxaz8JJu/RvFeZIGkwC0jshmBo+f0LtmSWPZpn995EMqdDtFsQ4Xw+emUqnKl7SkuSYS6lChXE+OR4Acz9BQ6o2xDmpoKyXvWTmPcfREWNPDlmVF0ELE+EVJO9UC0l2qAdWAeHFZxp9TyBTaASynsS1q4zmyfrbLBC3VZmuaMFbXb5vgIUVOKmqRZtNahxANuIAeg7TPY2Yl4IGrWDPyfBXAiqa6h6lDQqiQ4NeoLnTm0PEGKjB7SO2JyQIh5qk2DOPnY+2t0kE+VmmmFcDLvhZ18rleyU1yt4NBT0sr8LBGG3n36tvMdO+SICQRHjf7OZGDMgCCC/NTE8YTUQcJiCI2pFc7sC9ELYWBDzOlEU2JNgjlYXgVZ4FwjNZnXv1Yfwl3/BweGY2TEIKWezGIgT18bqhi37ZnOHPB9UizSTCjAGYbWpBogg7OpzgBqoANpqda/VLjR4zOYRkAzzIOpUnuPcjPIEwtyjUXSDfGwMjkN4CwipvsRJQ4uSRjYWom2d93q91rh1wmuZy8DsxjMWK/Auk42wwb0Il5v3aKygD33kt0WafvEmIWZnZWUp2ldTwERAKSjdH0c3acZQa8BA1GtqTZ6bo9VF6GwPx6y6TQ3lvQhzzCwFemKsZcyTJDeLRBqnE5rwRwloUAPbXQVAab0S9FhwJ+F0ANS9EhelqLOs0zdxl0yGuQk9RMsJ5rJcQEUx49uHoIZqTYQ5G2q12XFZ/V6lIs45O/uPmCMbR5GSkpL3eeE1mWAluEkjom73EY18hzLfC4FAJ8lboJ8lzc0QjoCygRECxY/BpEU3SYIR1JuoY95cA4S1Z+RmjFkKKwmr1WoXk8QYQSLMMROa9p+LiPoOmLT2KgE91Var2S7WDNgHTQoU3ajpMoD+eR+2xzAL9WRbwcP+LqvVUZMeZhIYrDCKGd++HpRSY6cw45DN1tqPMZN+ZXa2wv/oNOZlXA7tSiLbAn/kniskOGRnJeyS4JB2S+iAwE/fw5iHDgQSdu8MzBWElLyKAqVNbeJGGk6nQyTvi3mbLUK0S4XZQTTCEcIMmyzINzgcN8eBAVLgVnLMWkAaKL21a3xMbUZ9oQIMOqS2PinRD
/OJ1hjmcSc5UNc0ArQDd+QpMVtWyeZ2Pco2jUCLDYNFQ/aZFIkwY9nt2VfiUuKiPG6d8ProwyWzKdh0Ma4FwfbQELMWyduNsCzjFmywNBsQYMzoiF6RnIFZTAFQDwohSTzhMNfmAtAiinknugBwpsSM/kTA0ghhGWjDmE/MEdShwuSYhaB7oh01ICK7UKKvUKCmgEcPyywAjCvi0SwHhKkUGgCQJ43mh3HMV4EBPUE7IJ7Lbdiwoi6AskOqNxHmUYlSKUmihGPhEly8QSU+rYz67h+fVf6eYhZoJqEUvUB8wCt1nHjhhDSt4avoTPMsWbkGpGYF6vA00mhrIvLTa+lxqRCnf0mxA5KxiQYpTjhuxrEsKe5EW0ydKlUnt6lU06rpaWxNq4rRW3VNhVXM7a5FX18x2f6bre5Hz4EVQPN9GrqAKV+RFP/lRBmndTmqrD/qnuorMJc1Sn80Ztjj+F7fafwwzH+ur45w1yfpPCfmazzmtJPGlZTRfHSNx/y1mK8cYy42XkdlZOoUZneATxrfDrPf+BKVLMa8xeGNlxzmrSjmLe7FR/N5MHd+jnlvMfLqXcZm2LiymfHeXbXGBC43+Njwh6zt8GLmuxl2aJbH/IWY8bTvOJoPQqGQceqoamHXFcqYWfL61jJ2ijaDxsDl5bWjEKubpfcO31YdvV27EUsa5ZOT+Xa7Vqvlf8B15o+7tPb8ycnyeDTPrK+vG6dCxqUlozHj5fa60fiARjlkMfDAyKwfGD/MurOyjqp2tg/j0Vwik+Xk5FzileKnig0NOt1v9z/PzSHjwkJo7zWzFshEmD9kZMwhzMsLC0uHsz4E1x8oYjvjmEcR5kv3eZCpMOtOY96g9/wzIZ0x4F9zZayteRdQ0tjZ8U6tqTDmZ77OXeNuLGlwmPloTon5xifRfDygCw4ZV+YzluiqlcUMf9i4yGxffp9ppENZTz9mZb3MrKJntvikce6kcTI92XqHy9cPuAHdu+gs8MHr+Dju+IiLZhmPOY1o1iXCnPZkOxrNPOcvSRrnxsxzPF/SSPsbOg4zP9BII5oR5/Ni1kWjmceYLma8eILXTzqLOzunpzu5ZZJpvGzC7Y4XUuIrKNG1E9W1EgmfmtPHvHVfWV5SUnLz5k38xiUyJJtvp95+KI+aN2PnTs5z4igzf+c5poVZp5u8NZr3uWTLRf9Z9xXdLs9LqlFuNJczxGNOifn6jRt4sDE5OYoUx4e2vOa5ImWezFc0P7p7O/zv1dFnRbNFYZ8yb/N2+DayfgqG/wt9LobZy3nMY04dzddRPDfo8idlWIo+j9Pj8TTJRkfX55ZGy4VCM/lybs43Nyfenpv7aXGxzx61Aosu1/IBU7Q09M+/8tGcTjSjeG5oyM/P4TiTubk2W64QHb1wzco2XC7X+oxru+aVKxh0Lcueug6Q1f2KCQZccw7Z22Dt6uPHJB/N6SUNxNluz0FpVpYjW6Xk8jF8HGI+5tx9Mccs3R5i0PYqyIT+x875xbR1nQEcXfHpi84JtmxcW8KesS2Bdi0ZlGssRAzhjwDXxDEywcFa4IHEASwTUEUQZCGp6EOxi60hh64TrUKrZetWrUuqNao0tWoTTWhRtSbSFE3qU16aPUTqy/5om/awc+61Mf+y4M7sIZxP4h6fc8/3nXN/97vfPfd+xt+7PbPCakwevT3zXlWV+81cIpHQJwTm/XnzsWMMsyZ6p2NU/fBFbsZTZXk3997PZ66/5b381oe5t6tu5x79cOaXWu3DqqrzudaOXEJ4834wq5y/39DAl7/8r7urXvvwcSI3PJNLfHm7//ynbyZOMrB4O/Hodu78px+zWuJ6Fbv3fVHd398tvHlf3sxBN1QicrpsW883iK92/zWR6F/p/xJXEv25FbzezzDnfocr/bnEStX1xPVX8Tf9/blcf5Pw5v14syoNDfmGrU90rqZutTR4im1VaIhsVswRRSAsAfOxykrB4v/gzQLzwWLOlwKzwCwwCxGYBWaBWYjALDALEZgFZoFZiMD8gmGuUX9Zwujfj7mJ5mleXIRTbGsD04j64xVz03gKZrf3jHr+98mbwjtbdlmdUl/lWk+Wj1h6bGttbNFZNsx1U/eNYNuPuXYw8+K4irmNYzZOOR3gZJiPb+85dEEr11WFc8v/xagjy7dmem7njrXYzpaC1U0h6nEN0FJA7jHS1jMZ31qbXKVlw2xk/gDwAN023agea0d1dqc9MmGfwBr7JWww6owRdPXpwFjfCWBr3I65B3Ee2guYL7datwOxkHpehOVnT3GQDvDCntq154jvQDDvMdIzMWNj+TDbTfU98AP8FfS2MuRnwdgOcNENbmyCHh/UVet0+ijMNUNv1AGTnu2YF02RdmguYLZBfQFIVo63oE0mcnwMU0skHvdNy1kSq0Fs3aAhA+KoPK72TD7lWz31sotliYbGMRwbItlaxFQ8U5zluQClU3mrGA7R5WuYDpB1hnk+TlMUTTFKl4+wS6NriCxvV5qNkQ0vToZohq4WRmJK2SHE5swqjSvROJHnMD+96NI6XTXg8RjNhMqLWRU/GsHTYYQIdCmY3MQ8BeMdToiuwammxqY9ggaXtqMFzOMt+gLmIZMxsN2b9WTZl3mIadKEAwpzKtLCd01TtxqOAlzdp9yTsYUsejamEJWeLUEj9dCc9hasyn36tiwmAx12hjn1OL1O0TqudMidiMvkwuD8dqXVVbMjgMnsCPVRX36kFPiYEjqJPT1uOc7OFfHkpxclThMdZ0qmDCkvZkdwxA59qFOZhaEV8b6GuRF6RtXG+2Y7254sYK5kvo/YBR0jcCrYUwfePWLzHDbRnZgnMCijIgdG07zqv8F3zaksb9Ag2zYYUzHCMCvYtbw9NpuIqWh1ngQCMaIf6mPNlXrSgBcoWpzLAZpkmNdxh5LCoHmINQmNMsovayNxpSmOmbr4HXQuGyDe/PSi1ILrrVwpWGbMLDbXQxfzzmummsYOXq2Gi2FYwxbo6YSoydPo97/sHwMYbAf1R/5ZJDGgGWBajc1RdorymI9OW4pRtJLsxNyIbjZrfee6Bo2LSx7hRV9I/fx4tplhZgdqy2xiHhhgNiPEWrTqph6/32+JdaKVVHaTCJ6k2ClH64fsDLMddyi56AS7gtLJx+fiKFdqI+lJDfZwzNyzcTU14SHh/PR4bM7cM5ABDJcf8wMA5sLV3jaotUFzlMXmRnCMOaCnCYzedph1wiW/A462wiJf+Ck2sE3qmEurmGvUW6AtmUxObInNGmaU1fWRh5j5sbUpD1dR72Ee1VIIGp0b/MQoMrODtaTxxjL35kt6hrCAmRAeiIaMLkNl3qqytIbKANqzhj4WNALtLrYcsK/jLCli3qqUBWU0hnnM2kj4tMua3cQcaMU+jlmdnoYZ120KEHVOx8uImd28XjoyxxYSb+G1OtDZ4aLLCHCWUWTht64ZrawGQbaoYAGCCf+VKLhn1TBfAQfDzCW4G3M1pc3cYchSWk9ClK0qIlSmmUJsVuLquvRCXP2mgpHSFPdmdie04rhMiZzhxGr59bBEaXXB6hSlNIMR
mW4wzGOEhih6NiiNccy9ecxFJS+lpBlf0TDnR/Iu0dRSAXMLU6Xh/PTcHHM7ugkzzneOLtHaMj8FGq7wrWJVnHCRhSw1drKqWuhf4mFMuZGPCrUdR0pd+OuJ2aTa8BRVT8q8BUOd+S6D/KhDyh7PShb/YLGieHhF8auTuZHWIrHlWUpKvWuzpTASKqOrRXPpLdPLy7Snhhz0w7aKudyiJ+ZdbYr2fRvX1raW0AE+FudHeiWVXXI/Z3oHj/mKyVD+I1SCrn3183kP/iXEYLDF8Lzp1Y6JV0fiDZ3ALERgFpiFCMwCs8AsRGAWmIV8N8zfIbPNVNpZYQcovAawNjciTwTsEDvs8T9CLu1l82D6kGEuObPNz0wt+qCIOQJ9z8UcICRWjXiBkI16NK0TIiMQLtnDgbnkzDbH7MWgillV8jhA1+6Gy3aYc2EkqbNFH6DHqKtuA8Xd2lTAbEyfIr5a2ucKLKPjqdWwiHqfh4z7ug8H5pIz2zXQy6JGEuxg1pR8c9A24gZIAsxf0cGkA7yuOjjbC6BchmgBczta6GKQGNBJGXOtcX/ZiRcmNpeU2a6BuV7ogFEjmDUlvxY0JpmHN1+CMRZPjE1QjUd1oMy2FL4pFEiaq0lDp4w4TpROsr5oOFyYS85sM8wjzHPnGWZNqUnDHOQaakpAVxeGS4i922MzXXdiH8PsZR7d1CXL+kOFueTMNsNsYnCnGWZN6UrhFsj8fwTm8QbYB8CJD+pAcU27ikGDyQi18Ew08jyr87BhLi2zzTCjjUVnhjmv5AFHOI/ZA20Tc7CmZxeGc2ds5ks/Um1NZXDEhx7+jYnDhbm0zHYNi7trEOaY80oGIzjymDHMAsmkASd00NW2GzMuEhKPsKUcpQ8PU2zeIiVktncqoaGYSbNYXZrig71HUR9Paj2WQ/+wfSCZbfFOY6ccSGZbYBYiMAvMQgRmgfnFxVwh5ABFYBaYBWYhZcf8+s1f78/YmZuf7afbn7/9w9+//Vpg3i4LkiQN/6Ti93+5tdn0pzf26Hf6x6zj+7/d1nZHOrOr3zd3//nkyZO73+SrV7VT8/pprba1/+mrhwfzV9LwGz+SpKs/lX6x2SYN79Hxfemjd0LSidPPwfz1P+6q8jfNnz9hp+Zn7MRJ0mvscghJ0r83e74mSbEzn0l/rLh5YuGFx7wghU5XfPDOrWFJ+qRiYVg68a+KjyTpzgd3Pq/46s5CxcIJafhz3u+W9J/2y19VcSCKw4ffg80TpAsMFgO+iNukmMImXUiRKn9QUkRSiCEQbG1WkLsoIli4LFdBufeyxRZ7WfZMYkxx30A9lZlMRvjmzHfOKEY6Fj5FI94Zom2BUEFqRwhHkje7mud4+l7Hn139uF58m4UUI/KxpJWoArT7FKFciE/GnAn3/rM5YRXMUh33oNY+FINOOM2cAYZMNkiQM2hzuNe4wljA4N1rAYv3Q44xK/gVf9a83vVPPzlO/V37D3tBw5xcTmcr4DPRusmdEXlhhUh5j+DmydTIuZZGkibZCL6RxrDBzOA//YnB7GHQpv+Gs7OYICSbZYNpyTtFq0HVYm7jZhuHgl6ivB4V6XBWtN4fs6TWowpCyAfAnMUy5iqYGswVuxMd5jMC3TPJrmu6Jun24cTh008QqcGeY4s6OmPfMB9bBeeanF7uuxZNXbG4YQ55vQ1jDtXvB8BsgdV7gWswOwhePINZUMoiGCDIluVKmBmsUmTEpliuuYrZmP41NU0ggyrL5UTeuobX/o86+teWzlWVqYAejT2uh5/UmN7gZ2k4Frt5iej+MaeAexHwt1DpGMNIGMxYz6E2CsEZYWnxSN0ziFBgKnnivAe34hPAEpc5LkOMOjfT8d/Hx/v74VdLeSElbbFNxJn3Tc/R9t4+olKlptNwlH3/bnbNoR+QLqCW/KvAij2MuAAsBFzhuOLJrm+OiVZmyKY5ILjTKPO6m+4w0+vb4fDWXk9qpRAF9So2t3ddS87rWdpg1mr8ALdAGWc1R61JZ7oZ4vTKmhSzX+xu4tUMzSx5beGqr7eMr0XtKpVKP+j15BlPzE/Mz3hifmK+x/gP59AyW3AmPrcAAAAASUVORK5CYII=) As a rule of thumb, when the endpoint ends with **/graphql** and it's a **POST** request, it's a 99.99% bulletproof indicator that the target site is using GraphQL. If you want to be 100% certain though, taking a look at the request payload will most definitely give it away. 
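
For reference, a GraphQL request body generally has this shape - the values below are placeholders, not the actual payload sent by Cheddar:

```
// typical structure of a GraphQL POST body
const payload = {
    operationName: 'SearchQuery',
    // the query arrives as one long stringified piece of GraphQL
    query: 'query SearchQuery($query: String!) { ... }',
    // any parameters for the query live under "variables"
    variables: { query: 'test' },
};
```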
![GraphQL payload](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfkAAACRCAMAAAD3uDFVAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAHWaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIj4KICAgICAgICAgPGV4aWY6UGl4ZWxZRGltZW5zaW9uPjE0NTwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj41MDU8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpVc2VyQ29tbWVudD5TY3JlZW5zaG90PC9leGlmOlVzZXJDb21tZW50PgogICAgICA8L3JkZjpEZXNjcmlwdGlvbj4KICAgPC9yZGY6UkRGPgo8L3g6eG1wbWV0YT4KiH/5+gAAAv1QTFRFHyEjRERE6OrtmZ+lAAAAmqCmzc3NKiotR0dHSUxQNdPFNNTH5+nsNzg6ISMm5ObpMDE1NczBjJKXIyUn6urqXK/VzMzMh4ySx8fHIScqNs7D5efqKywv293fLC0wHyIllZqgICUnSElMPj9DKmFdM76yuLq8MjM3OTo+e4CFTE5SbG1tJicqRUdKMZiQNTY6IzY3k5OUlpyih4iKLGxoNMO5hIWGNMe73d/hMrWrOz1A3+HkkJacgIWKgYaLNMC1TVBTmJ6kxcfKZWhtxMXFLoF7M7qvgoeMU1ZaNMu+K2VhUFJWSkpKeXl5o6Slam5ygIGDLoaAb3N3YmVpL4+I4ePneXp8ISwtJktKXF9jL4yFVlld1NXXwMHCLHZxYGNmLjAzWVtg1dfaNdDDQkVH19nc5OTlW1tbJT8/bXF2QUJGRklNMTEyJ1JQJkBAMrCmW6zTXFxeT5CwZmtwdnZ3J09NtLS1Y2NkJURELHFtXbDXfoOIk5mfMa6jKFhVNDY5MqmfX2FkZmluIjIzJDs6n5+gKVtYSH+bq6usKThBVVVWVqDDWKXJp6eoztDTyMrMTU1NNs7BjZSaV1dYJT4+d3uBsbGxra6vK2hky8vLvb7A5ujrp6mrMaugi5CWc3d7f3+AZmZnSk1RcnN0KSsuhYqPUFBQUlJTkJGSMJuSLXt2Pml/QEFCNcW4Ii4wMrKpoqKiMaKauLi4uru/SktPLXhyi4yOJzQ8yczOJDk6ODk9UVRYX19fM1BgWajNeX6Cio+UMpyVsLK0JS41b3ByamprVqPGMUpXtri6ycnJNMK2m5+gvb29fH1+MJWOMJSLdnl8IyovJDw7WqrQnJydSEhI5+fnmpqajpWbKmNfKl1aMZ+WCAgIL4mCJkZFEBAQlpaWOFtsQG+FNVZmmJiajo6OlJeZSoOfzM3RVJu+MDA1Kz1HSUlJFxcXLUNPTIqoLkVQMZ6VAwMDOVxvUpe4RnuWOmB0Q3SNO2R34eHhHR0dPGZ6YmJiRHaPROqlU5EPPgAAGj5JREFUeNrsm19IG9kawN2cOeeMYK7D7JDMQxKGBDRGJXQEKblJcapgILaRlBZCFJPFjUR9EcV9MCAVBBHsFYvKlkVr1bYPfSjeZR9UMFwqVLDQB2t9KOjDuixy+9K3Ln24Z/5Y1zYxaW23vev5Hubk+yZnznfO75zv+zJiUdG7Yi76MnLWxv36JkzJU/KUPCVPyVPylDwlT8lT8pQ8JU/JU/KU/Fc/4WIqZ1P+RB5/LmFOJ5jK55Ciww+UPCVPyVPylDwlT8lT8pQ8JU/JU/KU/F8hXpfti5EPNKpXGHz/zmygIPINgiBsL1UdA/vQ8j7sF0J3NvIBCKFS4X5vgJn6Qqaq9obWutOvWYX18BPsxGEfTi0e6kTJI1CT3mMeF+R+UIYQvPtFj3Loxqchz61y5OLh3vvyhVOTR1VVc8n+Y5QvF07+gtXsXpEm3xugzlXIVElvc2dAtn1C8gkOK3W48+3SKXn3ldk8EzCbD455XIj7IVh/wA29u/QG+QT3icjHe1Y57mYPzk3ex4OdLhxvBDDD4fYd6LcGDFsk3DjR7ofylezkRXIZec2sJ4XtJ/9IPmCY6WlCfrd1e7v/lW4l97eF/hzk1fFn0+VSUIppw3nScYz9oUBUH90h/Yyj0gGOzGYlr/YuhuU4LMuTC3gWQH7QJk3KypoDu3ggteEDaUgGKcxZIahYwKE0sA7mIB+ezUC/D0uuNJTr69ewawKCSawq5OnAasaZxMTT7CturVDjT/RAIm7EAkD2EKXOugZkQjUiy7P+rL38ai88lDA8NRpC3tFodUguw1ltTqeJ9o6e1Zs932ZZO/8VIjBokwa6JpV4WAl5lOsOedEZhgHDVgH4RMbaHoZcLvK+ZP8rYembpSRj/4lhttcJ+dbkD4+2xwzrL8K1jte5yXdKk14oR3zacOUgiLtgN19vjD4RwQMwhNPR7Gfe7a62+nFCia3IYScMdqWf2qAciikRmzTj7IUeDvrvpKA5LHcFQdAHe518Jgf5SdjrkSpImA2BlCuyiDO8Mwp9qhKFbXf8PJbAgPME8nw9B3d+HoCNxTMZosTggHNRwm0gEVJA1l4gobeGp0bjUWyLkpm4oTurz+lUeX5wbs6Rbe2AnwgMFkPfsluJ1dVhlxyuI6cIZwKGrYJEU6v/Z9sKlyvPC68f7N7f3R0WmP0kM7r96rLllTDPMOeShvX7KYYZz0EeAgDT7V7Yho3hGgdwpJEsnaGGyZLzkUGYdd9f0FJsHZ55urwcTlfDXq+r2gbJivbKHuglSAIc9GAbdPYqsXiXLzWxvHwH5iLPY5zKqAmWBHhCPvrAVgerVYVfw7gd+qSc6e8t+WoyCS++4tfIx4mCeRKrrmQnT/azHt91Tw8bEFDMap7Xna3T5nQa8iTU96xyuaN9VC9TXFYgg/B1Wcvzhq2CzLiTJ6HPlpV8y+oqCfHMXlJICswfgntqieT5KpXzTcGwJmtz53k+FCLFjBf6sDFcTMHpGFk6Q+2EZpCYSUzkyPNer4vnsax+FeCUAv3FNuhSc2hUnUOE5yCpHkExR2pBq3lAe2Qu8gNkw8wckV/ZgZJOXm5Tl8ijRuQ85N3YSSi3aeTJ+NUQy9cJ2uzkFe150SHDU6PxQAU6VfK6s9XanE5T4ZFQ78iG/pB8CHTH412DvNWL+fAKIOHhQsCwVahbniu/AqM587xKeY4ZFRgGPUzeJeRfCKMMc23bsNqHGab7hDyvki/HxnAOkABxsnSGiu
WnGS+oSOGceT4B8ERvPL7cNeha6OJlG4kBOKqsQDKHikWDvM8cD8oDs9Z4nPMURt4BU3Gsk/eTwc2wU0p8BHm1b1t28pkZLds/NTw1Gg80W9M24oburFmb0ynId4+SHO8YNeckz4EhfAe4pQh2wjAHIgtdMGDYVPLpWcwp108gf014+UIk5Cu3k1pt39KyW/S6wbBe3u4mTT7yxnB4QG7U1tFQ12AKS7A9N3kndKQkc7d1LaG4STFmgzxX7ifOz+Jq0GaQX+PjC9aBO9C5EE7nJS8nHJFFM0kTQ+T0ESUluxzkB8RHkY/ID3wSyFHbRzhbCrYfeqo3pMJzgzBxQ3dWn9Npon2Vevn2hNo+qEB5CPdCIFnhIFGAP2DYVPJBBcAZ7gTyfyQFoUzoZ14JY4T8j0y3XRBaigzrLiJKXvLGcGRFYuo6HqkeshlwbvI+WHewCCHvdZACHkQJeQgnzHiFpIA1GweXtTOfhkDuxCkI0/nP/AWYItF+EYJM2q8q3Y0QSl1YiuUlv3xEPqiTP2iEIJf3CQVChWwn3VOjUX/VhaEbGs7qc/rM7/C8WlogV7WYMi8c2bQCdLk8zzu8qhfMyxck0Q8a+svdIyuz+7Kgd3jeE9WTJK5tS85tI/Wcq1z31fvnktarxTvbYOGPHOSww9jsjkH8kRIK2RaGcpaGbuM1keGp93gNrjurzun/4O1tjR196be3WoX31UgMTE6ClTPw3n5u7+UXf28f5b4i8rg6/Nf5Q/9iQ/9WR8lT8pQ8JU/JU/KUPCVPyf/tyFM5q/9dVUTlTMk3hlDyZ5c8S+UsCSVPyVPylDwVSp4KJU+FkqdCyVP5m5K/hFQRKwt5zP7Y6NsuGz81Hbt3Hz3M0kNETYU8+HELZfQFyFtGpkW0UchjWtGbwy5TG2jr2L1zpyE/30wZfQHyIsuuIvSMfb5h2exgb25axCmxb1/cZy+Jw+wT0SL2seNLFiQ+6icnveawSw3aZIc3kGWsT5wie0LcI+T7GiwbW8/YuVYLalhn50VL2UYe8vdKb7PsxccjF9l1e8k/n4wTdav0HvvvVvXmrRKTvUO9UXqZrbGzbOVVtuw/N86P3igpaX3G/nK15LsOivYDyd/b0f5PZ0fHWFk5ghpIuBZvEaSbSGxF6PZz9JytRSNzyFJmsdx8jjbHkLi1gZbmDfJ3+1F/M9og7OctlvFK1ErOfIcFLW2gc2wDmupHm+MWtCmifGfe3sqOl9SWtTSV/n7tYsnd82Xs76Ya9uowuXXbdP/J1damUvvtKVPtuVKW7TOx/yqx/3qjtGbYNDxnmr5t/56i/dAzf08j3/M2z6NKEpjn10XUh35sIugOyU+jvfUptDWCWmtrao+ivSZ9j55XvmlAzUuo+SG6T8gPk2M/R3bF3t7dfSTWojL2niUf+bFStqbkv2Utzaa5Nz+U/NZvZ0vtZR0m9TBfMk11VPbVqp+/u3FIvrSpyXSfZbf2H59/86bZRA/9B0f7CAEf0TFuDF8W0RJr0XD+im6x7LROvgaNbGrG6f+xd74hbWVZAD882Mfj3vDk7RIxcdyQxdgFh5BiYprthyylBuLYoqzThMzUD007yoLaitHoosg6aqszs4YRa4cRupTValHHAacj2xaks23ttpbalsEWpTttd2aZwra0uzOw/bDn/ckfk6jNzOwy6L0fkrz7zj3nnvO799zzHpa+wM3Lf5hwzu/e3bRHqL4mjykv5wd386NIfpDH1J2dLWyfl2vGh3zXS5zz1dxnu3cJBW92cXIbPMW90D3Me7hDrft03M7yrlz8VbAjSv49oZn7o3zzK2XAHsY2U/Kjf1JyvXZoF/InhHm+sXBP0V/kywL+5EME/RF/tIm/XXiqqLBwT+FHPF+9n2+ODRHkBXK8eZAvL8vGvC5XeO/yp4VRPqeXP7Gvmc8p4geFOxvueeFQU1Y5kv9ENzoycr1ayNp/qFr3hyvKoih8vC8v6xHXLwifvnkaF0ARkt8ljHLXBaHr/gcHRkZ6i/sZ24wrvFEl12sY7/A8cizAzdo7z7/7OZ7zRfz83nn+aDGfs30/f3KQ7yqc50c/4JsKE8h/zH/eeIIvFrDyK5LJn+LnH13jjzbz81g45LyPCWRww3Ne6MrNvYPke3XH75TrmoVd3BUhi5OrCeEhXjbl9mI1t0+39yTX2H9IIS/sPNB/Sldczp183PT6Y8Y289r+SEJtjxu+efSaUro3ZvPZOfzJfszvSFF4l8fyXTgiZ/suLAKxfouT34fJPoe/L1zns/uVp7r72PHxyJ1rcne28CibPzG/IflebjcmmTeF0zou97ggfMIVC1/lKnf6D3A63W3hk1yO21X2OI/TvaeSb8zidPh1heNeL2Zof5B3eCNKvVR2rkw5sF+MKp1l6gp5v1lOrGWjqzZZ/wu5swjLfK3j8Tkl/Va/rw49dyeTSR5Jke5tVtbNEUXpuZF4caB0l7Hy7ociH2sK+ZduOfw8Y7BJyFcXjmSg/NwLFuAfL3mI/d8WLBiMPGuMPGuMPGuMPGuMPGubgzz790ZbszHyjDxrjPyq9mrS9c9YzLYE+e26nUk9b+kesqhtAfI7jqd0Ne1gUdsC5LPKU7qK3mJR25Tkt/1U+VvM36lXuftSRpRnsahtzj1/UCG/Tf752l5uW8qIX3F//TWL26bM9r9B8H9Wfn3K7U8zZLduF4vbpiSP+V7L9fBb7o2U27/nfsnCtkkrvIM/ieX4rOvypw6P9oMHGt+7XVDAKrxNXdvHD/e8vfjxKocP9a9x23Uf5OUBHM1jUdsCT3V/1+HT+2fc6XjPTt12FrUtQB7gFcR/QIxf/5zFbIuQZ42RZ42RZ42RZ42RZ42RZ42RZ42RZ42RZ42RZ23TkfffhfbaDOQNdeJ6t2ft0R+GDRR1Sr71BXy138GddNNbXPoOisKL/y+qL2Ppf0F+egI89S8tbeoglKxHpFRSsToorVxf1Q20ukjDawus3MvYmej0ZpYTe1ucqZL1HdgfWWcRl9g3sOWxpXhEsb2dLGCs+r6WfhTkh0jYVLqyMfkFabKU1qynyUy68bNteE2BSkKD7Rk6E52etW1VHrClJy9GZtZMHfWUes3rLrJU93w2b6XNlCwgk/9ell6e/LeHDz9vgG96AN55Al/36fseAPz7eYP+lnjpGUDPlwAfZjUmk28xBl1tELACTISgzkIsmCFbXPVEEq/ikrQ+1aStbuVLE3ARYqkDmHvqoG1gsgapw1Qa9JB6AzgXACIlomSDGWS7HCRjkwDuGxItabEA2KQBCFF5G/ipnOq6Hcqg1acQWXpaWwMhJ5nygqbIN0y8JaApmsO52q/i4Eg8uajTm5UIkdDMYiRAMANYJBd2utqMkh1E1GCp0shDide0hvUSZ8BeKmo+wRghw2YtOGKkg7hXYCxCZX2wHMlPGOasiFmKCijkNUsBl5ckZ0LNkm2KkHvQjbOeLtEchKGJDMkf/uLrW33wZQOA/jI0NFx8phfhTGvP2fOmnn+AQY/roIArjknX+CGMKTdYsSK5YQwxkXaob+u0EzOMU+vsX
dOCBQwk6p6fOis7ISpgN/jkVdNGqzq7oZvki3ZTKR1fROngOE7fakauFRYIk26bBWNilPJttZOoP4DAqtTk4JqWE4U6aFWrQbb4NVW/6KKgKRp31vlJSFOUTwzgQRhLNH4qaNPT9nwd9fhXBkCskrO9Y2Jx4SpOMpRPK6LkDUH7GtZnpHtD8qJQfArT7lAwXwuOmXbYLNbYlrZSezJ51dKqPa9ZGpOtG9JauuHwG+xKxnRVag6CtJwZ+SetZvgySt6gv9B3qfUBnNHLO+xY67H/3MLvbb94JWlQiJphLkreQLzuq7QUxomchRZpnUZJ2U83nKQ7KpA/7Y7gLmvzyHc6OpRsT0zgCQCZSSB/g7jdEq5fY0CRawOpSt4Fir5aYosNSgqJlwwviuQuDMXIW4xuN5mLKpLGB+gABnVgElZPL06+Nn7OO5bRFszh9pdkPNPysoLABKS3brYGg1ZR82lZ9nE6Rj4EAU8M7OSAIYW8YilG3jsUszSG45KrJM2SHBKIk1emUy9lRv48Qu7pg54GEPWXb7b+7dixYwY4c0m597yn75u0g7qJKKdKjJhI2ifp0MAAejTuULel1b2qYKof1gT8dDx8D1d3m1Xpnk6Yuxen7h0X0flKC4y5UXoW/VlSF5mcLqCdqNrk/RN1OKnmHXZ1GGg75FPQFDnGUFFnVFGFY8GTxpX64Th5UwL5GcwIMDelkV9RSkufvNfTWy9ZRhiqT2hZXgNqcLRFmO6cj5JXLMUE2jtjlrSUms6SsUX+uRTBgFRGHRRbMiMv6h/cfKcPvr1gftZ6GZ7/0wCXxSj5B/rWi7A622uDSKlvwg1VXrMdY+2aNkA7Zvuryr0lQjtjKyQEkxMBTQDzrc0bIz9kDEPIpIVxzlljp35cyT6HBY/sMEz60R81pTrkbYuh8CtXd4kvbexX7obvdUyBZ0ycxjCqiiocNqhZjCqaJETeTonZXpseHgtmhTwkk79L/N1Ktp9Ro4pHWVrr+eGSFiwpVJ9CpM4WsWvBiZIHoz0525tEb4UYI68JmAPtMUvpyGuW5jydUAuzdHFRrgRUB0OZVnj/atXjOX/xgv4dPOef3NIfPnwzSh4ON8ifV1LI4xMJwXPe5iROnFpIIkGjL0oeIvHav4oQgnNUBUweQtwx8uIUNRrFJTWMNielSKCCEAQG96iRBGLkx9VHubkOVaUjAEtpYu93EOqqQa5EolFFnRYakZO5pmjYKCaR16Yn++GIks83Emp0gaNK5iHWU0mSi0SnWqp2YvJNZ73FS8i0WfMJpih1G7TgmPG5rULONTilltXklae6Wc1SVMBHF2KWxjpSyWuWapw0iPmonga9MfJSMNOnuotnv+jDr7Pq1dmz8Ts39efXGmSzlcgreVa9ens2fmcSz9r4sVRjSBCwrSpXfIlvUd72OH04Vn1bYwgnCC6oQbdpb3KGgms804TnVKUrchg1RZ01CSbcC6nPijUbvUASRZL4DuGp67/tnF9IW3cUxw8/8Kk/c2XhPrg0SiNNikvAEpDQvEjwZXuwYZKO0nSOLPGhsC6VbowMJtp1NkuaMXywmLxYhQrp9hKJ6YM+JGqFIBjs0x4GtVArgha6toNS2O/+i7nJjal/ytSe78NN7i/3nHPv75PfXy6n2ir7Tskz3ZDWhQ/2s3yuFalN3AloK134/7z79bxIvlLJYGQHI5G8RvHvp/f0pI13NIv/PH2ibB5dbTHbJg9zf5/Q/PmnP75u2/1dffztp6rJQXO16H99d8AbJ/uK9M7kw9pNe3NmpxZx9Zcq27uNB1kBp+5c3aXFF9p7XA8+u7GH8N98+QPu26OQPArJo5A8CsmjkDwKyaOOP/kPO5miXncMyD/Z3sV9HBY/Xqy+qWXUqZl8Z8e6st/vtzfsdMX1a+C+ezBc3kcqmPl8nrnV5/NCprGzxNFw6Ml31zCefVz8Gv1X/Fjt42vYWMj53d6jhbi7ya/si6+l4reB34SjMwU0/+4OJSOVGm4ZhFwgS5Q46g+8kt0pMsDIp4xe8XHo0qEnX/e5ZUfj6cZy8vCwFvkpMrFX8joyVfHbBW4P5C9UpvS7xfkcLAKXvRc3v4dqpgPCMSGSB0/68JOv+0hq9hsrBWiOzkxHeb6P0V5bWwj1wYuVlRlGXyqLRvjgK4V8oY+fXWVW6zzPPvSct8TnMBFyKKeox3sTYu1wz8NAeGiGgU2lHCRj4q4ABNIqI5m8mSMc1w4dARpgvabbQbl5CBhY2RVN8udilI4CmOPUZZUj6TkXNQ4rRqp3B23624x8N6l3E4+mo8QigIPxM2Y5kodEL/VMQL+XGrqYLTcSj3erblmOpBipyTufHgHydXUS+tkkPOFP5WZ008EkQF/oTY5B1wm9vVwWjYZfBxXyyejGJv8ctv7R5TZZLZDtrMnWHpdwMkznrpMsOM7CNQLA3bSkvEL2vbP1c2AcgXlqURnpB/u/GpxQ2vyzzBl73Aq9CWi1Kc33mgnGykeCrONSE+NN5gYMy3IkK+lsCSwW27z6bRKBfHvcwpnjmo7MLoD4J+xmuest3X7SA4N6yHs67OQraCXeS8Mm9XNKkRQjNflR45zlyJB/G4G+LWhO9i3wLxn59e1xXi6LvoRCaEMmHwlGIvwaJPktcSJgul/02GIkQoKlLjamOhXyc8RodBILZHzCFZOczuVSGykSyTdRj9FIluAZtzig3XHL6hUzOd1mkTKXi+QHxNqXjdTvDgrk0/HOUTfVdFQkf0ts0saEH8D7PTu3M/JjUPacciRt8vMG5/jh7+398oge2uAfQjL4JLwwxMgPbZOXy6JvoTH0XCa/MBQOh3PsZC24Ve4zbRCaUUDoNsHxFM4TNqtqNZlMOsgsinUWt9Me7dsRybOun13dxCb8LpLeiTxDwjTCIrF/khTJSmzQFahidFvMAuY5+YjTdGTOgJ4K5MVMwJa8j/kKjLDzSUa+YqEmR1KMQHokmbw3dfh7+x+LX9dnZwGG1mEzVEZeLotGCq+E3r4g4F9dyEE4DJs6eBlV9/bCDI+N82PUP8V6e5dLP0pAb2DD+iDI5MEsptRXGykMhN45dnkC5q3AGlGADeMdpLSXpyXUUr4zbHFykXbcZ2ZSJIW8bKTq7fUJpx4m4iMtjk5NR5Meq50I5JcF8B1wkrZDl8MyRuoZecmDqrcXIylGUl5BWOb6xXH+8M/wSlZ1r0JvAcIrPB8tkl8PhvjgQ7ksOsuHVoXS10G+kIuEgvwMbPHBlcflEE+Si6z1BgjlsjBuoEZWa5M0TmNF8lPkUTXyXTRuhx6OGrgG6DVwTpMw7BODf7u+S4ymnMTAvMYIMTbJkUQevqJR6buDU4QpAWlKPH5NRy1O6qQK+XuUi8f0cMVDCWu/1cj7ikaw5KFudkWGssHHSpeP4E7OdHPVslyhpCwX1olHDRcuIiyb7lpiWVYJ56Q6ay1ZQ4/TWtMfm4jaVrnnokrZyCYVNuF4Rhx95Ug1ZbVVdeQv3fBplfIC2mrdq7+iJE2d+iNI/iCkkxb0AnkNXabZPXtWpWzcjw7MkdZ/y3oMdm/3
pbFhzeLxbkAdb/IoJI9C8igkj0LyKCSPQvIoJI9C8qjdkkd9UMI2j20eySN5FJJHIXkUkkcheRSSRyF5FJJHIXkUkkchedT/rf8AlIKDWVYBzO8AAAAASUVORK5CYII=) Every GraphQL payload will be a JSON object with a **query** property, and a **variables** property if any variables were provided. If you take a closer look at the full **query** property of this request, you'll notice that it's stringified GraphQL language content. ![Taking a closer look at the payload](/assets/images/stringified-syntax-d8dab2e70acddc32bdf220d05917e527.png) ## Advantages & disadvantages[​](#advantages-disadvantages "Direct link to Advantages & disadvantages") We already discussed the advantages and disadvantages of API scraping in general in this course's introduction, but because GraphQL is such a different technology, scraping an API built with it comes with its own pros and cons. ### Advantages[​](#advantages "Direct link to Advantages") 1. GraphQL allows you as the developer to choose which fields you'd like to be returned back to you. Not only does this leave you with only the data you want and no extra unwanted fields, but it is also easier on the target. 2. Allows access to data that is not readily available natively through the website. 3. Queries are heavily customizable due to features like **fragments**. ### Disadvantages[​](#disadvantages "Direct link to Disadvantages") 1. Though it's a fantastic technology with lots of awesome features, it is also more complex to understand. 2. GraphQL [introspection](https://docs.apify.com/academy/api-scraping/graphql-scraping/introspection.md) is disabled on many sites, which makes it more difficult to reap the full benefits of GraphQL. ## Next up[​](#next "Direct link to Next up") This course section's [first lesson](https://docs.apify.com/academy/api-scraping/graphql-scraping/modifying-variables.md) will discuss how to customize GraphQL queries without ever having to write any GraphQL language. --- # Custom queries **Learn how to write custom GraphQL queries, how to pass input values into GraphQL requests as variables, and how to retrieve and output the data from a scraper.** *** Sometimes, the queries found in the **Network** tab aren't good enough for your use case. Or, perhaps they're even returning more data than what you're after (which can slow down the queries depending on how much data they're giving back). In these situations, it's a good idea to dig a bit deeper into the API and start writing your own custom use-case specific queries. In this lesson, we're building a scraper which expects a single number (in **hours**) and a **query** string as its input. As output, it should provide data about the first 1000 Cheddar posts published within the last **n** hours which match the provided query. Each **post** object should contain the **title**, the **publishDate** and the **videoUrl** of the post. ``` [ { "title": "FDA Authorizes 1st Breath Test for COVID-19 Infection", "publishDate": "2022-04-15T11:58:44-04:00", "videoUrl": "https://vod.chdrstatic.com/source%3Dbackend%2Cexpire%3D1651782479%2Cpath%3D%2Ftranscode%2Fb68f8133-3aa9-4c96-ac26-047452bbc9ce%2Ctoken%3D581fd52bb7f634834edca5c201619c014cd21eb20448cf89525bf101ca8a6f64/transcode/b68f8133-3aa9-4c96-ac26-047452bbc9ce/b68f8133-3aa9-4c96-ac26-047452bbc9ce.mp4" }, { "...": "..." } ] ``` ## Project setup[​](#project-setup "Direct link to Project setup") To make sure we're all on the same page, we're going to set up the project together by first creating a folder named **graphql-scraper**. 
Once navigated to the folder within your terminal, run the following command: ``` npm init -y && npm install graphql-tag puppeteer got-scraping ``` This command will first initialize the project with npm, then will install the `puppeteer`, `graphql-tag`, and `got-scraping` packages, which we will need in this lesson. Finally, create a file called **index.js**. This is the file we will be working in for the rest of the lesson. ## Preparations[​](#preparations "Direct link to Preparations") If we remember from the last lesson, we need to pass a valid "app token" within the **X-App-Token** header of every single request we make, or else we will be blocked. When testing queries, we copied this value straight from the **Network** tab; however, since this is a dynamic value, we should farm it. Since we know requests with this header are sent right when the front page is loaded, it can be farmed by visiting the page and intercepting requests in Puppeteer like so: ``` // scrapeAppToken.js import puppeteer from 'puppeteer'; const scrapeAppToken = async () => { const browser = await puppeteer.launch(); const page = await browser.newPage(); let appToken = null; page.on('response', async (res) => { // grab the token from the request headers const token = res.request().headers()?.['x-app-token']; // if there is a token, grab it and close the browser if (token) { appToken = token; await browser.close(); } }); await page.goto('https://www.cheddar.com/'); await page.waitForNetworkIdle(); // otherwise, close the browser after networkidle // has been fired await browser.close(); // return the apptoken (or null) return appToken; }; export default scrapeAppToken; ``` With this code, we're doing the same exact thing as we did in the previous lesson to grab this header value, except programmatically. > To learn more about this method of scraping headers and tokens, refer to the [Cookies, headers, and tokens](https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md) lesson of the **General API scraping** section. Now, we can import this function into our **index.js** and use it to create a `token` variable which will be passed as our **X-App-Token** header when scraping: ``` // index.js // import the function import scrapeAppToken from './scrapeAppToken.mjs'; const token = await scrapeAppToken(); ``` ## Building the query[​](#building-the-query "Direct link to Building the query") First, we'll write a skeleton query where we define which variables we're expecting (from the user of the scraper): ``` query SearchQuery($query: String!, $max_age: Int!) { # query will go here } ``` Also in the previous lesson, we learned that the **media** type is dependent on the **organization** type. This means to get any **media**, it must be wrapped in the **organization** query: ``` query SearchQuery($query: String!, $max_age: Int!) { organization { media(query: $query, max_age: $max_age , first: 1000) { } } } ``` Finally, since Cheddar is using [cursor-based relay pagination](https://relay.dev/graphql/connections.htm#relay-style-cursor-pagination) for their API, we must access the data through the **edges** property, where each **node** is a result item: ``` query SearchQuery($query: String!) { organization { media(query: $query, max_age: $max_age , first: 1000) { edges { node { # here we will define the fields we want } } } } } ``` The next step is to fill out the fields we'd like back, and we've got our final query! ``` query SearchQuery($query: String!) 
{ organization { media(query: $query, max_age: $max_age , first: 1000) { edges { node { title # title public_at # this will be publishDate hero_video { video_urls { url # the first URL from these results will be videoUrl } } } } } } } ``` ## Making the request[​](#making-the-request "Direct link to Making the request") Back in our code, we can import `gql` from `graphql-tag` and use it to store our query: ``` // index.js import { gql } from 'graphql-tag'; import scrapeAppToken from './scrapeAppToken.mjs'; const token = await scrapeAppToken(); const GET_LATEST = gql` query SearchQuery($query: String!, $max_age: Int!) { organization { media(query: $query, max_age: $max_age, first: 1000) { edges { node { title public_at hero_video { video_urls { url } } thumbnail_url } } } } } `; ``` Alternatively, if you don't want to write your GraphQL queries right within your JavaScript code, you can write them in files using the **.graphql** format, then read them from the filesystem or import them. > In order to receive nice GraphQL syntax highlighting in these template literals, download the [GraphQL VSCode extension](https://marketplace.visualstudio.com/items?itemName=GraphQL.vscode-graphql) Then, we'll take our input and use it to create a **variables** object which will be used for the request: ``` // find posts from the last 48 hours that include the keyword "stocks". // since we don't have any real input, we'll simulate some input const testInput = { hours: 48, query: 'stocks' }; // the API takes max_input in the format of minutes * 60 // to calculate this value, we do hours * 60^2 const variables = { query: testInput.query, max_age: Math.round(testInput.hours) * 60 ** 2 }; ``` The final step is to take the query and variable and marry them within a `gotScraping()` call, which will return the API response: ``` const data = await gotScraping('https://api.cheddar.com/graphql', { // we are expecting a JSON response back responseType: 'json', // we must use a post request method: 'POST', // this is where we pass in our token headers: { 'X-App-Token': token, 'Content-Type': 'application/json' }, // here is our query with our variables body: JSON.stringify({ query: GET_LATEST.loc.source.body, variables }), }); ``` The final step after making the query is to format the data to match the expected dataset schema. ## Final code[​](#final-code "Direct link to Final code") Here's what our final project looks like: ``` // index.js import { gql } from 'graphql-tag'; import { gotScraping } from 'got-scraping'; import scrapeAppToken from './scrapeAppToken.mjs'; // Scrape the token const token = await scrapeAppToken(); // Define our query const GET_LATEST = gql` query SearchQuery($query: String!, $max_age: Int!) 
{ organization { media(query: $query, max_age: $max_age, first: 1000) { edges { node { title public_at hero_video { video_urls { url } } thumbnail_url } } } } } `; // Grab our input const testInput = { hours: 48, query: 'stocks' }; // Calculate and prepare our variables const variables = { query: testInput.query, max_age: Math.round(testInput.hours) * 60 ** 2 }; // Make the request const { body: { data: { organization } } } = await gotScraping('https://api.cheddar.com/graphql', { responseType: 'json', method: 'POST', headers: { 'X-App-Token': token, 'Content-Type': 'application/json' }, body: JSON.stringify({ query: GET_LATEST.loc.source.body, variables }), }); // Format the data const result = organization.media.edges.map(({ node }) => ({ title: node?.title, publishDate: node?.public_at, videoUrl: node?.hero_video ? node.hero_video.video_urls[0].url : null, })); // Log the result console.log(result); ``` ``` // scrapeAppToken.js import puppeteer from 'puppeteer'; const scrapeAppToken = async () => { const browser = await puppeteer.launch(); const page = await browser.newPage(); let appToken = null; page.on('response', async (res) => { const token = res.request().headers()?.['x-app-token']; if (token) { appToken = token; await browser.close(); } }); await page.goto('https://www.cheddar.com/'); await page.waitForNetworkIdle(); await browser.close(); return appToken; }; export default scrapeAppToken; ``` ## Wrap up[​](#wrap-up "Direct link to Wrap up") If you've made it this far, that means that you've conquered the king of API scraping - GraphQL, and that you're ready to take on writing scrapers for the majority of websites out there. Nice work! Take a moment to review the skills you learned in this section: 1. Modifying the variables of copied GraphQL queries 2. Introspecting a GraphQL API 3. Visualizing and understanding a GraphQL API introspection 4. Writing custom queries 5. Dealing with cursor-based relay pagination 6. Writing a GraphQL scraper with custom queries --- # Introspection **Understand what introspection is, and how it can help you understand a GraphQL API to take advantage of the features it has to offer before writing any code.** *** [Introspection](https://graphql.org/learn/introspection/) is when you make a query to the target GraphQL API requesting information about its schema. When done properly, this can provide a whole lot of information about the API and the different **queries** and **mutations** it supports. Just like when working with regular RESTful APIs in the [**General API scraping**](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md) section, it's important to learn a bit about the different available features of the GraphQL API (or at least of the query/mutation) you are scraping before actually writing any code. Not only does becoming comfortable with and understanding the ins and outs of using the API make the development process easier, but it can also sometimes expose features which will return data you'd otherwise be scraping from a different location. ## Making the query[​](#making-the-query "Direct link to Making the query") ! Cheddar website was changed and the below example no longer works there. Nonetheless, the general approach is still viable on some websites even though introspection is disabled on most. 
In order to perform introspection on our [target website](https://www.cheddar.com), we need to make a request to their GraphQL API with this introspection query using [Insomnia](https://docs.apify.com/academy/tools/insomnia.md) or another HTTP client that supports GraphQL:

> To make a GraphQL query in Insomnia, make sure you've set the HTTP method to **POST** and the request body type to **GraphQL Query**.

```
query {
  __schema {
    queryType { name }
    mutationType { name }
    subscriptionType { name }
    types { ...FullType }
    directives { name description locations args { ...InputValue } }
  }
}

fragment FullType on __Type {
  kind
  name
  description
  fields(includeDeprecated: true) {
    name
    description
    args { ...InputValue }
    type { ...TypeRef }
    isDeprecated
    deprecationReason
  }
  inputFields { ...InputValue }
  interfaces { ...TypeRef }
  enumValues(includeDeprecated: true) { name description isDeprecated deprecationReason }
  possibleTypes { ...TypeRef }
}

fragment InputValue on __InputValue { name description type { ...TypeRef } defaultValue }

fragment TypeRef on __Type {
  kind
  name
  ofType { kind name ofType { kind name ofType { kind name ofType { kind name ofType { kind name ofType { kind name ofType { kind name } } } } } } }
}
```

Here's what we got back from the request:

![GraphQL introspection request response](/assets/images/introspection-2f8159c4f926e20040ee65bfc4e18eb0.jpg)

The response body of our introspection query contains a whole lot of useful information about the API, such as the data types defined within it, as well as the queries and mutations available for retrieving/changing the data.

## Understanding the response[​](#understanding-the-response "Direct link to Understanding the response")

An introspection query's response body size will vary depending on how big the target API is. In our case, what we got back is a 27 thousand line JSON response 🤯

If you thought to yourself, "Wow, that's a whole lot to sift through! I don't want to look through that!", you are absolutely right. Luckily for us, there is a fantastic online tool called [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) (no install required) which can take this massive JSON response and turn it into a digestible visualization of the API.

Let's copy the response to our clipboard by clicking inside of the response body and pressing **CMD** + **A**, then **CMD** + **C**. Now, we'll head over to [GraphQL Voyager](https://graphql-kit.com/graphql-voyager/) and click on **Change Schema**. In the modal, we'll click on the **Introspection** tab and paste our data into the text area.

![Pasting the introspection](/assets/images/pasting-introspection-78e8ac32a797fcfd7f17f7f1685bbceb.png)

Finally, we can click on **Display** and immediately be shown a visualization of the API:

![GraphQL Voyager API visualization](/assets/images/voyager-interface-b74eff607e4985d5228ec7d08563f909.jpg)

Now that we have this visualization to work off of, it will be much easier to build a query of our own.

## Building a query[​](#building-a-query "Direct link to Building a query")

In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a query.

Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://www.cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**.
After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public\_at** fields - seems to check out! ![The media field pointing to datatype slugable](/assets/images/media-field-066b5bbc4dccdef44b38495648478deb.jpg) Cool. Now we know we need to access **media** through the **organization** query. The **media** field also takes in some arguments, of which we will be using the **first** parameter set to **1000**. Let's start writing our query in Insomnia! ![Receiving a suggestion for a field titled edges](/assets/images/edges-suggested-65c22c50bf4e1682ec511f97e0790009.png) While writing our query, we've hit a slight roadblock - the **media** type doesn't seem to be accepting a **title** field; however, we are being suggested an **edges** field. This signifies that Cheddar is using [cursor-based relay pagination](https://relay.dev/graphql/connections.htm#relay-style-cursor-pagination), and that what is returned from media is actually a **Connection** type with multiple properties. The **edges** property contains the list of results we're after, and each result lies within a **Node** type accessible within **edges** as **node**. With this knowledge, we can finish writing our query: ``` query { organization { media(first: 1000) { edges { node { title public_at } } } } } ``` ## Sending the query[​](#sending-the-query "Direct link to Sending the query") Let's send it! ![Unauthorized](/assets/images/unauthorized-e5a911a6290b5515598de42cfb2f8b8a.png) Oh, okay. That didn't work. But **why**? Rest assured, nothing is wrong with our query. We are most likely missing an authorization token/parameter. Let's check back on the Cheddar website within our browser to see what types of headers are being sent with the requests there: ![Request headers back on the Cheddar website](/assets/images/cheddar-headers-37014534c6ca4250bc5c28b673373dda.jpg) The **Authorization** and **X-App-Token** headers seem to be our culprits. Of course these values are dynamic, but for testing purposes we can copy them right from the **Network** tab and use them for our request in Insomnia. ![Successful request](/assets/images/successful-request-81d1fa87c1e58b7456a02376d395e38f.png) Cool, it worked! Now we know that if we want to scrape this API, we'll likely have to scrape these authorization headers as well in order to not get blocked. > For more information about cookies, headers, and tokens, refer back to [this lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/cookies-headers-tokens.md) from the previous section of the **API scraping** course. ## Introspection disabled?[​](#introspection-disabled "Direct link to Introspection disabled?") If the target website is smart, they will have introspection disabled. One of the most widely used GraphQL development tools is [ApolloServer](https://www.apollographql.com/docs/apollo-server/), which automatically disables introspection, so these cases are actually quite common. ![Introspection disabled](/assets/images/introspection-disabled-0b524331e3d8505a3e4c2cc6cdc3e39e.png) In these cases, it is still possible to get some information about the API when using [Insomnia](https://docs.apify.com/academy/tools/insomnia.md) or [Postman](https://docs.apify.com/academy/tools/postman.md), due to the autocomplete that they provide. 
If we remember from the [Building a query](#building-a-query) section of this lesson, we were able to receive autocomplete suggestions when we entered a non-existent field into the query. Though this is not as great as seeing an entire visualization of the API in GraphQL Voyager, it can still be quite helpful. ## Next up[​](#next "Direct link to Next up") [Next lesson](https://docs.apify.com/academy/api-scraping/graphql-scraping/custom-queries.md)'s code-along project will walk you through how to construct a custom GraphQL query for scraping purposes, how to accept input into it, and how to retrieve and output the data. --- # Modifying variables **Learn how to modify the variables of a JSON format GraphQL query to use the API without needing to write any GraphQL language or create custom queries.** *** In the introduction of this course, we searched for the term **test** on the [Cheddar](https://www.cheddar.com/) website and discovered a request to their GraphQL API. The payload looked like this: ``` { "query": "query SearchQuery($query: String!, $count: Int!, $cursor: String) {\n organization {\n ...SearchList_organization\n id\n }\n }\n fragment SearchList_organization on Organization {\n media(\n first: $count\n after: $cursor\n query: $query\n recency_weight: 0.6\n recency_days: 30\n include_private: false\n include_unpublished: false\n ) {\n hitCount\n edges {\n node {\n _score\n id\n ...StandardListCard_video\n __typename\n }\n cursor\n }\n pageInfo {\n endCursor\n hasNextPage\n }\n }\n }\n fragment StandardListCard_video on Slugable {\n ...Thumbnail_video\n ...StandardTextCard_media\n slug\n id\n __typename\n }\n fragment Thumbnail_video on Slugable {\n original_thumbnails: thumbnails(aspect_ratio: ORIGINAL) {\n small\n medium\n large\n }\n sd_thumbnails: thumbnails(aspect_ratio: SD) {\n small\n medium\n large\n }\n hd_thumbnails: thumbnails(aspect_ratio: HD) {\n small\n medium\n large\n }\n film_thumbnails: thumbnails(aspect_ratio: FILM) {\n small\n medium\n large\n }\n square_thumbnails: thumbnails(aspect_ratio: SQUARE) {\n small\n medium\n large\n }\n }\n fragment StandardTextCard_media on Slugable {\n public_at\n updated_at\n title\n hero_video {\n duration\n }\n description\n }", "variables": { "query": "test","count": 10,"cursor": null }, "operationName": "SearchQuery" } ``` We also learned that every GraphQL request payload will have a **query** property, which contains a stringified version of the query, and a **variables** property, which contains any parameters for the query. 
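
Because the query travels as one long string with escaped `\n` characters, a quick way to make it readable again is to print it out, which expands those escapes into real newlines. A minimal Node.js sketch - the `payload.json` filename is just an assumption for illustration, standing in for wherever you saved the copied payload:

```
import { readFileSync, writeFileSync } from 'node:fs';

// read the payload copied from the Network tab and pull out the query string
const { query } = JSON.parse(readFileSync('payload.json', 'utf8'));

// printing the string (or writing it to a .graphql file) expands the "\n" escapes
console.log(query);
writeFileSync('search-query.graphql', query);
```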
If we convert the query field to a `.graphql` format, we can get it nicely formatted with syntax highlighting (install GraphQL extension for editor) ``` query SearchQuery($query: String!, $count: Int!, $cursor: String) { organization { ...SearchList_organization id } } fragment SearchList_organization on Organization { media( first: $count after: $cursor query: $query recency_weight: 0.6 recency_days: 30 include_private: false include_unpublished: false ) { hitCount edges { node { _score id ...StandardListCard_video __typename } cursor } pageInfo { endCursor hasNextPage } } } fragment StandardListCard_video on Slugable { ...Thumbnail_video ...StandardTextCard_media slug id __typename } fragment Thumbnail_video on Slugable { original_thumbnails: thumbnails(aspect_ratio: ORIGINAL) { small medium large } sd_thumbnails: thumbnails(aspect_ratio: SD) { small medium large } hd_thumbnails: thumbnails(aspect_ratio: HD) { small medium large } film_thumbnails: thumbnails(aspect_ratio: FILM) { small medium large } square_thumbnails: thumbnails(aspect_ratio: SQUARE) { small medium large } } fragment StandardTextCard_media on Slugable { public_at updated_at title hero_video { duration } description } ``` If the query provided in the payload you find in the **Network** tab is good enough for your scraper's needs, you don't actually have to go down the GraphQL rabbit hole. Rather, you can change the variables to receive the data you want. For example, right now, our example payload is set up to search for articles matching the keyword **test**. However, if we wanted to search for articles matching **cats** instead, we could do that by changing the **query** variable like so: ``` { "...": "...", "variables": { "query": "cats","count": 10,"cursor": null } } ``` Depending on the API, doing just this can be sufficient. However, sometimes we want to utilize complex GraphQL features in order to optimize our scrapers or to receive more data than is being provided in the response of the request found in the **Network** tab. This is what we will be discussing in the next lessons. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/api-scraping/graphql-scraping/introspection.md) we will be walking you through how to learn about a GraphQL API before scraping it by using **introspection**. --- # How to retry failed requests **Learn how to re-scrape only failed requests in your run.** *** Requests of a scraper can fail for many reasons. The most common causes are different page layouts or proxy blocking issues ([check here on how to effectively analyze errors](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors)). Both [Apify](https://apify.com) and [Crawlee](https://crawlee.dev/) allow you to restart your scraper run from the point where it ended, but there is no native functionality to re-scrape only failed requests. Usually, you also want to first analyze the problem, update the code, and build it before trying again. If you attempt to restart an already finished run, it will likely immediately finish because all the requests in the [request queue](https://crawlee.dev/docs/guides/request-storage) are marked as handled. You need to update the failed requests in the queue to be marked as pending again. The additional complication is that the [Request](https://crawlee.dev/api/core/class/Request) object doesn't have anything like the `isFailed` property. We have to approximate it using other fields. 
Fortunately, we can use the `errorMessages` and `retryCount` properties to identify failed requests. Unless the user has explicitly overridden these properties, a failed request will have more `errorMessages` than its `retryCount`. That happens because the last error, the one that no longer causes a retry, is still added to `errorMessages`. A simplified code example can look like this:

```
// The approach is similar for Crawlee-only projects but uses a different API
import { Actor } from 'apify';

const REQUEST_QUEUE_ID = 'pFCvCasdvsyvyZdfD'; // Replace with your valid request queue ID

const allRequests = [];
let exclusiveStartId = null;

// List all requests from the queue, we have to do it in a loop because the request queue list is paginated
for (; ;) {
    const { items: requests } = await Actor.apifyClient
        .requestQueue(REQUEST_QUEUE_ID)
        .listRequests({ exclusiveStartId, limit: 1000 });

    allRequests.push(...requests);

    // If we didn't get the full 1,000 requests, we have all and can finish the loop
    if (requests.length < 1000) {
        break;
    }

    // Otherwise, continue listing from the last request we received
    exclusiveStartId = requests[requests.length - 1].id;
}

// Failed requests are those with more error messages than retries
const failedRequests = allRequests.filter(
    (request) => (request.errorMessages?.length || 0) > (request.retryCount || 0),
);

// We need to update them 1 by 1 to the pristine state
for (const request of failedRequests) {
    request.retryCount = 0;
    request.errorMessages = [];
    // This tells the request queue to handle it again
    request.handledAt = null;

    await Actor.apifyClient.requestQueue(REQUEST_QUEUE_ID).updateRequest(request);
}

// And now we can resurrect our scraper again; it will only process the failed requests.
```

## Resurrect automatically with a free public Actor[​](#resurrect-automatically-with-a-free-public-actor "Direct link to Resurrect automatically with a free public Actor")

Fortunately, you don't need to implement this code into your workflow yourself. [Apify Store](https://apify.com/store) provides the [Rebirth Failed Requests](https://apify.com/lukaskrivka/rebirth-failed-requests) Actor (which is [open-source](https://github.com/metalwarrior665/rebirth-failed-requests)) that does this and more. The Actor can automatically scan multiple runs of your Actors based on filters like `date started`. It can also automatically resurrect the runs after renewing the failed requests. That means you can bring your scrape to its final successful state with a single click on the Run button.

---

# Run Actor and retrieve data via API

**Learn how to run an Actor/task via the Apify API, wait for the job to finish, and retrieve its output data. Your key to integrating Actors with your projects.**

***

The most popular way of [integrating](https://help.apify.com/en/collections/1669769-integrations) the Apify platform with an external project/application is by programmatically running an [Actor](https://docs.apify.com/platform/actors.md) or [task](https://docs.apify.com/platform/actors/running/tasks.md), waiting for it to complete its run, then collecting its data and using it within the project. Follow this tutorial to get an idea of how to approach this - it isn't as complicated as it sounds!

> Remember to check out our [API documentation](https://docs.apify.com/api/v2.md) with examples in different languages and a live API console. We also recommend testing the API with a desktop client like [Postman](https://www.postman.com/) or [Insomnia](https://insomnia.rest).

The Apify API offers two ways of interacting with it:

* [Synchronously](#synchronous-flow)
* [Asynchronously](#asynchronous-flow)

If the Actor being run via API takes 5 minutes or less to complete a typical run, it should be called **synchronously**.
Otherwise, (if a typical run takes longer than 5 minutes), it should be called **asynchronously**. ## Run an Actor or task[​](#run-an-actor-or-task "Direct link to Run an Actor or task") > If you are unsure about the differences between an Actor and a task, you can read about them in the [tasks](https://docs.apify.com/platform/actors/running/tasks.md) documentation. In brief, tasks are pre-configured inputs for Actors. The API endpoints and usage (for both sync and async) for [Actors](https://docs.apify.com/api/v2.md#tag/ActorsRun-collection/operation/act_runs_post) and [tasks](https://docs.apify.com/api/v2/actor-task-runs-post.md) are essentially the same. To run, or **call**, an Actor/task, you will need a few things: * The name or ID of the Actor/task. The name looks like `username~actorName` or `username~taskName`. The ID can be retrieved on the **Settings** page of the Actor/task. * Your [API token](https://docs.apify.com/platform/integrations.md), which you can find on the **Integrations** page in [Apify Console](https://console.apify.com/account?tab=integrations) (do not share it with anyone!). * Possibly an input, which is passed in JSON format as the request's **body**. * Some other optional settings if you'd like to change the default values (such as allocated memory or the build). The URL of [POST request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST) to run an Actor looks like this: ``` https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN ``` For tasks, we can switch the path from **acts** to **actor-tasks** and keep the rest the same: ``` https://api.apify.com/v2/actor-tasks/TASK_NAME_OR_ID/runs?token=YOUR_TOKEN ``` If we send a correct POST request to one of these endpoints, the actor/actor-task will start just as if we had pressed the **Start** button on the Actor's page in the [Apify Console](https://console.apify.com). ### Additional settings[​](#additional-settings "Direct link to Additional settings") We can also add settings for the Actor (which will override the default settings) as additional query parameters. For example, if we wanted to change how much memory the Actor's run should be allocated and which build to run, we could add the `memory` and `build` parameters separated by `&`. ``` https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=YOUR_TOKEN&memory=8192&build=beta ``` This works in almost exactly the same way for both Actors and tasks; however, for tasks, there is no reason to specify a [`build`](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md) parameter, as a task already has only one specific Actor build which cannot be changed with query parameters. ### Input JSON[​](#input-json "Direct link to Input JSON") Most Actors would not be much use if input could not be passed into them to change their behavior. Additionally, even though tasks already have specified input configurations, it is handy to have the ability to overwrite task inputs through the **body** of the POST request. > The input can technically be any [JSON object](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON), and will vary depending on the Actor being run. Ensure that you are familiar with the Actor's input schema while writing the body of the request. Good Actors have reasonable defaults for most input fields, so if you want to run one of the major Actors from [Apify Store](https://apify.com/store), you usually do not need to provide all possible fields. 
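To make this concrete, here is a minimal sketch of such a request using the [got](https://www.npmjs.com/package/got) HTTP client (the same client used later in this tutorial). The Actor name, token, and input field are placeholders, and the `memory` and `build` parameters are optional, shown here only for illustration:

```
import got from 'got';

// Placeholders - replace the token, Actor name, and input fields with your own values
const myToken = 'YOUR_TOKEN';

const response = await got({
    url: `https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs?token=${myToken}&memory=8192&build=beta`,
    method: 'POST',
    // The JSON body is the Actor input; provide only the fields you want to override
    json: {
        someInputField: 'some value',
    },
    responseType: 'json',
});

// The response contains the run info object described below
console.log(response.body);
```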
Via API, let's quickly try to run [Web Scraper](https://apify.com/apify/web-scraper), which is the most popular Actor on the Apify Store at the moment. The full input with all possible fields is [pretty long and ugly](https://apify.com/apify/web-scraper?section=example-run), so we will not show it here. Because it has default values for most fields, we can provide a JSON input containing only the fields we'd like to customize. We will send a POST request to the endpoint below and add the JSON as the **body** of the request: ``` https://api.apify.com/v2/acts/apify~web-scraper/runs?token=YOUR_TOKEN ``` Here is how it looks in [Postman](https://www.postman.com/): ![Run an Actor via API in Postman](/assets/images/run-actor-postman-b89097bdd92cf55096e73719086cb847.png) If we press **Send**, it will immediately return some info about the run. The `status` will be either `READY` (which means that it is waiting to be allocated on a server) or `RUNNING` (99% of cases). ![Actor run info in Postman](/assets/images/run-info-postman-0d11537cf5eeccf8a474cdeab4e8550d.png) We will later use this **run info** JSON to retrieve the run's output data. This info about the run can also be retrieved with another call to the [**Get run**](https://docs.apify.com/api/v2/act-run-get.md) endpoint. ## JavaScript and Python client[​](#javascript-and-python-client "Direct link to JavaScript and Python client") If you are using JavaScript or Python, we highly recommend using the Apify API client ([JavaScript](https://docs.apify.com/api/client/js/), [Python](https://docs.apify.com/api/client/python/)) instead of the raw HTTP API. The client implements smart polling and exponential backoff, which makes calling Actors and getting results efficient. You can skip most of this tutorial by following this code example that calls Google Search Results Scraper and logs its results: * Node.js * Python ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'YOUR_API_TOKEN' }); const input = { queries: 'Food in NYC' }; // Run the Actor and wait for it to finish // .call method waits infinitely long using smart polling // Get back the run API object const run = await client.actor('apify/google-search-scraper').call(input); // Fetch and print Actor results from the run's dataset (if any) const { items } = await client.dataset(run.defaultDatasetId).listItems(); items.forEach((item) => { console.dir(item); }); ``` ``` from apify_client import ApifyClient client = ApifyClient(token='YOUR_API_TOKEN') run_input = { "queries": "Food in NYC", } # Run the Actor and wait for it to finish # .call method waits infinitely long using smart polling # Get back the run API object run = client.actor("apify/google-search-scraper").call(run_input=run_input) # Fetch and print Actor results from the run's dataset (if there are any) for item in client.dataset(run["defaultDatasetId"]).iterate_items(): print(item) ``` By using our client, you don't need to worry about choosing between synchronous or asynchronous flow. But if you don't want your code to wait during `.call` (potentially for hours), continue reading below about how to implement webhooks. ## Synchronous flow[​](#synchronous-flow "Direct link to Synchronous flow") If each of your runs will last shorter than 5 minutes, you can use a single [synchronous endpoint](https://usergrid.apache.org/docs/introduction/async-vs-sync.html#synchronous). When running **synchronously**, the connection will be held for *up to* 5 minutes. 
If your synchronous run exceeds the 5-minute time limit, the response will be a run object containing information about the run and the status of `RUNNING`. If that happens, you need to restart the run [asynchronously](#asynchronous-flow) and [wait for the run to finish](#wait-for-the-run-to-finish). ### Synchronous runs with dataset output[​](#synchronous-runs-with-dataset-output "Direct link to Synchronous runs with dataset output") Most Actor runs will store their data in the default [dataset](https://docs.apify.com/platform/storage/dataset.md). The Apify API provides **run-sync-get-dataset-items** endpoints for [Actors](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md) and [tasks](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md), which allow you to run an Actor and receive the items from the default dataset once the run has finished. Here is a Node.js example of calling a task via the API and logging the dataset items to the console: ``` // Use your favorite HTTP client import got from 'got'; // Specify your API token // (find it at https://console.apify.com/account#/integrations) const myToken = ''; // Start apify/google-search-scraper Actor // and pass some queries into the JSON body const response = await got({ url: `https://api.apify.com/v2/acts/apify~google-search-scraper/run-sync-get-dataset-items?token=${myToken}`, method: 'POST', json: { queries: 'web scraping\nweb crawling', }, responseType: 'json', }); const items = response.body; // Log each non-promoted search result for both queries items.forEach((item) => { const { nonPromotedSearchResults } = item; nonPromotedSearchResults.forEach((result) => { const { title, url, description } = result; console.log(`${title}: ${url} --- ${description}`); }); }); ``` ### Synchronous runs with key-value store output[​](#synchronous-runs-with-key-value-store-output "Direct link to Synchronous runs with key-value store output") [Key-value stores](https://docs.apify.com/platform/storage/key-value-store.md) are useful for storing files like images, HTML snapshots, or JSON data. The Apify API provides **run-sync** endpoints for [Actors](https://docs.apify.com/api/v2/act-run-sync-post.md) and [tasks](https://docs.apify.com/api/v2/actor-task-run-sync-post.md), which allow you to run a specific task and receive the output. By default, they return the `OUTPUT` record from the default key-value store. ## Asynchronous flow[​](#asynchronous-flow "Direct link to Asynchronous flow") For runs longer than 5 minutes, the process consists of three steps: * [Run the Actor or task](#run-an-actor-or-task) * [Wait for the run to finish](#wait-for-the-run-to-finish) * [Collect the data](#collect-the-data) ### Wait for the run to finish[​](#wait-for-the-run-to-finish "Direct link to Wait for the run to finish") There may be cases where we need to run the Actor and go away. But in any kind of integration, we are usually interested in its output. We have three basic options for how to wait for the actor/task to finish. * [`waitForFinish` parameter](#waitforfinish-parameter) * [Webhooks](#webhooks) * [Polling](#polling) #### `waitForFinish` parameter[​](#waitforfinish-parameter "Direct link to waitforfinish-parameter") This solution is quite similar to the synchronous flow. To make the POST request wait, add the `waitForFinish` parameter. It can have a value from `0` to `60`, which is the maximum time in seconds to wait (the max value for `waitForFinish` is 1 minute). 
Knowing this, we can extend the example URL like this:

```
https://api.apify.com/v2/acts/apify~web-scraper/runs?token=YOUR_TOKEN&waitForFinish=60
```

You can also use the `waitForFinish` parameter with the [**Get run** endpoint](https://docs.apify.com/api/v2/actor-run-get.md) to implement a smarter [polling](#polling) system.

Once again, the final response will be the **run info object**; however, now its status should be `SUCCEEDED` or `FAILED`. If the run exceeds the `waitForFinish` duration, the status will still be `RUNNING`.

#### Webhooks[​](#webhooks "Direct link to Webhooks")

If you have a server, [webhooks](https://docs.apify.com/platform/integrations/webhooks.md) are the most elegant and flexible solution for integrations with Apify. You can set up a webhook for any Actor or task, and that webhook will send a POST request to your server after an [event](https://docs.apify.com/platform/integrations/webhooks/events.md) has occurred. Usually, this event is a successfully finished run, but you can also set a different webhook for failed runs, etc.

![Webhook example](/assets/images/webhook-8b2fcb569631f00cd1bcc8a6db263572.png)

The webhook will send you a pretty complicated [JSON object](https://docs.apify.com/platform/integrations/webhooks/actions.md), but usually, you would only be interested in the `resource` object within the response, which is essentially the **run info** JSON from the previous sections. We can leave the payload template as is for our example since it is all we need. Once your server receives this request from the webhook, you know that the event happened, and you can ask for the complete data.

> Don't forget to respond to the webhook with a **200** status code! Otherwise, it will ping you again.

#### Polling[​](#polling "Direct link to Polling")

What if you don't have a server, and the run you'd like to do is much too long to use a synchronous call? In cases like these, periodic **polling** of the run's status is the solution.

When we run the Actor with the [usual API call](#run-an-actor-or-task) shown above, we will get back a response with the **run info** object. From this JSON object, we can then extract the ID of the Actor run we just started from its `id` field. Then, we can set an interval that will poll the Apify API (let's say every 5 seconds) by calling the [**Get run**](https://docs.apify.com/api/v2/actor-run-get.md) endpoint to retrieve the run's status. Replace the `RUN_ID` in the following URL with the ID you extracted earlier:

```
https://api.apify.com/v2/acts/ACTOR_NAME_OR_ID/runs/RUN_ID
```

Once a status of `SUCCEEDED` or `FAILED` has been received, we know the run has finished, so we can cancel the interval and finally [collect the data](#collect-the-data).

### Collecting the data[​](#collect-the-data "Direct link to Collecting the data")

Unless you used the [synchronous call](#synchronous-flow) mentioned above, you will have to make one additional request to the API to retrieve the data. The **run info** JSON also contains the IDs of the default [dataset](https://docs.apify.com/platform/storage/dataset.md) and [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) that are allocated separately for each run, which is usually everything you need. The fields are called `defaultDatasetId` and `defaultKeyValueStoreId`.
#### Retrieving a dataset[​](#retrieve-a-dataset "Direct link to Retrieving a dataset") > If you are scraping products, or any list of items with similar fields, the [dataset](https://docs.apify.com/platform/storage/dataset.md) should be your storage of choice. Don't forget though, that dataset items are immutable. This means that you can only add to the dataset, and not change the content that is already inside it. To retrieve the data from a dataset, send a GET request to the [**Get items**](https://docs.apify.com/api/v2/dataset-items-get.md) endpoint and pass the `defaultDatasetId` into the URL. For a GET request to the default dataset, no token is needed. ``` https://api.apify.com/v2/datasets/DATASET_ID/items ``` By default, it will return the data in JSON format with some metadata. The actual data are in the `items` array. You can use plenty of additional parameters, to learn more about them, visit our API reference [documentation](https://docs.apify.com/api/v2/dataset-items-get.md). We will only mention that you can pass a `format` parameter that transforms the response into popular formats like CSV, XML, Excel, RSS, etc. The items are paginated, which means you can ask only for a subset of the data. Specify this using the `limit` and `offset` parameters. This endpoint has a limit of 250,000 items that it can return per request. To retrieve more, you will need to send more requests incrementing the `offset` parameter. ``` https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&offset=250000 ``` #### Retrieving a key-value store[​](#retrieve-a-key-value-store "Direct link to Retrieving a key-value store") > [Key-value stores](https://docs.apify.com/platform/storage/key-value-store.md) are mainly useful if you have a single output or any kind of files that cannot be [stringified](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify) (such as images or PDFs). When you want to retrieve something from a key-value store, the `defaultKeyValueStoreId` is *not* enough. You also need to know the name (or **key**) of the record you want to retrieve. If you have a single output JSON, the convention is to return it as a record named `OUTPUT` to the default key-value store. To retrieve the record's content, call the [**Get record**](https://docs.apify.com/api/v2/key-value-store-record-get.md) endpoint. ``` https://api.apify.com/v2/key-value-stores/STORE_ID/records/RECORD_KEY ``` If you don't know the keys (names) of the records in advance, you can retrieve just the keys with the [**List keys**](https://docs.apify.com/api/v2/key-value-store-keys-get.md) endpoint. Keep in mind that you can get a maximum of 1000 keys per request, so you will need to paginate over the keys using the `exclusiveStartKey` parameter if you have more than 1000 keys. To do this, after each call, take the last record key and provide it as the `exclusiveStartKey` parameter. You can do this until you get 0 keys back. ``` https://api.apify.com/v2/key-value-stores/STORE_ID/keys?exclusiveStartKey=myLastRecordKey ``` --- # Introduction to the Apify platform **Learn all about the Apify platform, all of the tools it offers, and how it can improve your overall development experience.** *** The [Apify platform](https://apify.com) was built to serve large-scale and high-performance web scraping and automation needs. 
It provides easy access to compute instances ([Actors](https://docs.apify.com/academy/getting-started/actors.md)), convenient request and result storages, proxies, scheduling, webhooks and more - all accessible through the **Console** web interface, [Apify's API](https://docs.apify.com/api/v2.md), or our [JavaScript](https://docs.apify.com/api/client/js) and [Python](https://docs.apify.com/api/client/python) API clients. ## Category outline[​](#this-category "Direct link to Category outline") In this category, you'll learn how to become an Apify platform developer from the ground up. From creating your first account, to developing Actors, this is your one-stop-shop for understanding how the platform works, and how to work with it. ## First up[​](#first "Direct link to First up") We'll start off this category light, by showing you how to create an Apify account and get everything ready for development with the platform. [Let's go!](https://docs.apify.com/academy/getting-started.md) --- # Scraping with Apify **Discover Apify's ready-made web scraping and automation tools. Compare Web Scraper, Cheerio Scraper and Puppeteer Scraper to decide which is right for you.** *** Scraping and crawling the web can be difficult and time-consuming without the right tools. That's why Apify provides ready-made solutions to crawl and scrape any website. They are based on our [Actors](https://apify.com/actors), the [Apify SDK](https://docs.apify.com/sdk/js) and [Crawlee](https://crawlee.dev/). Don't let the number of options confuse you. Unless you're really sure you need to use a specific tool, go ahead and use **Web Scraper** ([apify/web-scraper](https://docs.apify.com/academy/apify-scrapers/web-scraper.md)). It is the easiest to pick up and can handle almost anything. Look at **Puppeteer Scraper** ([apify/puppeteer-scraper](https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md)) or **Cheerio Scraper** ([apify/cheerio-scraper](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md)) only after you know your target websites well and need to optimize your scraper. [Visit the Scraper introduction tutorial to get started!](https://docs.apify.com/academy/apify-scrapers/getting-started.md) ## Web Scraper[​](#web-scraper "Direct link to Web Scraper") Web Scraper is a ready-made solution for scraping the web using the Chrome browser. It takes away all the work necessary to set up a browser for crawling, controls the browser automatically and produces machine-readable results in several common formats. Underneath, it uses the Puppeteer library to control the browser, but you don't need to worry about that. Using a web UI and a little of basic JavaScript, you can tweak it to serve almost any scraping need. [Visit the Web Scraper tutorial to get started!](https://docs.apify.com/academy/apify-scrapers/web-scraper.md) ## Cheerio Scraper[​](#cheerio-scraper "Direct link to Cheerio Scraper") Cheerio Scraper is a ready-made solution for crawling the web using plain HTTP requests to retrieve HTML pages and then parsing and inspecting the HTML using the [cheerio](https://www.npmjs.com/package/cheerio) library. It's blazing fast. Cheerio is a server-side version of the popular jQuery library that does not run in the browser but instead constructs a DOM out of an HTML string and then provides the user an API to work with that DOM. Cheerio Scraper is ideal for scraping websites that do not rely on client-side JavaScript to serve their content. 
It can be as much as 20 times faster than using a full-browser solution like Puppeteer. [Visit the Cheerio Scraper tutorial to get started!](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md)

## Puppeteer Scraper[​](#puppeteer-scraper "Direct link to Puppeteer Scraper")

Puppeteer Scraper is the most powerful scraper tool in our arsenal (aside from developing your own Actors). It uses the Puppeteer library to programmatically control a headless Chrome browser, and it can make it do almost anything. If using Web Scraper does not cut it, Puppeteer Scraper is what you need. Puppeteer is a Node.js library, so knowledge of Node.js and its paradigms is expected when working with Puppeteer Scraper. [Visit the Puppeteer Scraper tutorial to get started!](https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md)

---

# Scraping with Cheerio Scraper

This scraping tutorial will go into the nitty gritty details of extracting data from **Apify Store** using **Cheerio Scraper** ([apify/cheerio-scraper](https://apify.com/apify/cheerio-scraper)). If you arrived here from the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, great! You are ready to continue where we left off. If you haven't seen the Getting started tutorial yet, check it out; it will help you learn about Apify and scraping in general and set you up for this tutorial, because this one builds on topics and code examples discussed there.

## Getting to know our tools[​](#getting-to-know-our-tools "Direct link to Getting to know our tools")

In the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, we've confirmed that the scraper works as expected, so now it's time to add more data to the results. To do that, we'll be using the [Cheerio](https://github.com/cheeriojs/cheerio) library.

This may not sound familiar, so let's try again. Does [jQuery](https://jquery.com/) ring a bell? If it does, you're in luck, because Cheerio is essentially jQuery that doesn't need an actual browser to run. Everything else is the same. All the functions you already know are there and even the familiar `$` is used. If you still have no idea what either of those are, don't worry. We'll walk you through using them step by step.

> [Check out the Cheerio docs](https://github.com/cheeriojs/cheerio) to learn more about it.

Now that that's out of the way, let's open one of the Actor detail pages in the Store, for example the **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)) page, and use our DevTools-Fu to scrape some data.

> If you're wondering why we're using Web Scraper as an example instead of Cheerio Scraper, it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!

## Building our Page function[​](#building-our-page-function "Direct link to Building our Page function")

Before we start, let's do a quick recap of the data we chose to scrape:

1. **URL** - The URL that goes directly to the Actor's detail page.
2. **Unique identifier** - Such as **apify/web-scraper**.
3. **Title** - The title visible in the Actor's detail page.
4. **Description** - The Actor's description.
5. **Last modification date** - When the Actor was last modified.
6. **Number of runs** - How many times the Actor was run.
![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp)

We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, so let's get to the next one on the list: title.

### Title[​](#title "Direct link to Title")

![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)

By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! A `<header>` element that we can use to select only the heading we're interested in.

> Remember that you can press CTRL+F (CMD+F) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler code all the time.

To get the title we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one.

```
// Using Cheerio.
async function pageFunction(context) {
    const { $ } = context;

    // ... rest of your code can come here

    return {
        title: $('header h1').text(),
    };
}
```

### Description[​](#description "Direct link to Description")

Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<span>` tag, because there are a lot of them on the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with the class `actor-description`.

![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp)

```
async function pageFunction(context) {
    const { $ } = context;

    // ... rest of your code can come here

    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
    };
}
```

### Modified date[​](#modified-date "Direct link to Modified date")

The DevTools tell us that the `modifiedDate` can be found in a `<time>` element.

![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/modified-date.webp)

```
async function pageFunction(context) {
    const { $ } = context;

    // ... rest of your code can come here

    return {
        title: $('header h1').text(),
        description: $('header span.actor-description').text(),
        modifiedDate: new Date(
            Number(
                $('ul.ActorHeader-stats time').attr('datetime'),
            ),
        ),
    };
}
```

It might look a little too complex at first glance, but let us walk you through it. We find the `<time>` element and then read its `datetime` attribute, because that's where a unix timestamp is stored as a `string`. But we would much rather see a readable date in our results, not a unix timestamp, so we need to convert it. Unfortunately, the `new Date()` constructor will not accept a `string`, so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`. Phew!

### Run count[​](#run-count "Direct link to Run count")

And so we're finishing up with the `runCount`. There's no specific element like `<time>`, so we need to create a complex selector and then do a transformation on the result.
``` async function pageFunction(context) { const { $ } = context; // ... rest of your code can come here return { title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } ``` The `ul.ActorHeader-stats > li:nth-of-type(3)` looks complicated, but it only reads that we're looking for a `` element and within that element we're looking for the third `` element. We grab its text, but we're only interested in the number of runs. We parse the number out using a regular expression, but its type is still a `string`, so we finally convert the result to a `number` by wrapping it with a `Number()` call. > The numbers are formatted with commas as thousands separators (e.g. `'1,234,567'`), so to extract it, we first use regular expression `/[\d,]+/` - it will search for consecutive number or comma characters. Then we extract the match via `.match(/[\d,]+/)[0]` and finally remove all the commas by calling `.replace(/,/g, '')`. We need to use `/,/g` with the global modifier to support large numbers with multiple separators, without it we would replace only the very first occurrence. > > This will give us a string (e.g. `'1234567'`) that can be converted via `Number` function. ### Wrapping it up[​](#wrapping-it-up "Direct link to Wrapping it up") And there we have it! All the data we needed in a single object. For the sake of completeness, let's add the properties we parsed from the URL earlier and we're good to go. ``` async function pageFunction(context) { const { $ } = context; const { url } = request; // ... rest of your code can come here const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); return { url, uniqueIdentifier, title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } ``` All we need to do now is add this to our `pageFunction`: ``` async function pageFunction(context) { // $ is Cheerio const { request, log, skipLinks, $ } = context; if (request.userData.label === 'START') { log.info('Store opened!'); // Do some stuff later. } if (request.userData.label === 'DETAIL') { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); return { url, uniqueIdentifier, title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } } ``` ### Test run[​](#test-run "Direct link to Test run") As always, try hitting that **Save & Run** button and visit the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped. You nailed it! ## Pagination[​](#pagination "Direct link to Pagination") Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. 
That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination.

> This is typical JavaScript pagination, sometimes called infinite scroll. Other pages may use links that take you to the next page. If you encounter those, make a Pseudo URL for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing.

If you paid close attention, you may now see a problem. How do we click a button in the page when we're working with Cheerio? We don't have a browser to do it and we only have the HTML of the page to work with. The simple answer is that we can't click a button. Does that mean that we cannot get the data at all? Usually not, but it requires some clever DevTools-Fu.

### Analyzing the page[​](#analyzing-the-page "Direct link to Analyzing the page")

While with Web Scraper and **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)), we could get away with clicking a button, with Cheerio Scraper we need to dig a little deeper into the page's architecture. For this, we will use the Network tab of the Chrome DevTools.

> DevTools is a powerful tool with many features, so if you're not familiar with it, please [see Google's tutorial](https://developer.chrome.com/docs/devtools/), which explains everything much better than we ever could.

We want to know what happens when we click the **Show more** button, so we open the DevTools **Network** tab and clear it. Then we click the **Show more** button and wait for incoming requests to appear in the list.

![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/inspect-network.webp)

Now, this is interesting. It seems that we've only received two images after clicking the button and no additional data. This means that the data about Actors must already be available in the page and the **Show more** button only displays it. This is good news.

### Finding the Actors[​](#finding-the-actors "Direct link to Finding the Actors")

Now that we know the information we seek is already in the page, we just need to find it. The first Actor in the store is Web Scraper, so let's try using the search tool in the **Elements** tab to find some reference to it. The first few hits do not provide any interesting information, but in the end, we find our goldmine. A `<script>` tag with the ID `__NEXT_DATA__` seems to hold a lot of information about Web Scraper. In DevTools, you can right click an element and click **Store as global variable** to make this element available in the **Console**.

![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/find-data.webp)

A `temp1` variable is now added to your console. We're mostly interested in its contents and we can get that using the `temp1.textContent` property. You can see that it's a rather large JSON string. How do we know? The `type` attribute of the `<script>` element says `application/json`. But working with a string would be very cumbersome, so we need to parse it.

```
const data = JSON.parse(temp1.textContent);
```

After entering the above command into the console, we can inspect the `data` variable and see that all the information we need is there, in the `data.props.pageProps.items` array. Great!
![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/inspect-data.webp) > It's obvious that all the information we set to scrape is available in this one data object, so you might already be wondering, can I make one request to the store to get this JSON and then parse it out and be done with it in a single request? Yes you can! And that's the power of clever page analysis. ### Using the data to enqueue all Actor details[​](#using-the-data-to-enqueue-all-actor-details "Direct link to Using the data to enqueue all Actor details") We don't really need to go to all the Actor details now, but for the sake of practice, let's imagine we only found Actor names such as `cheerio-scraper` and their owners, such as `apify` in the data. We will use this information to construct URLs that will take us to the Actor detail pages and enqueue those URLs into the request queue. ``` // We're not in DevTools anymore, // so we use Cheerio to get the data. const dataJson = $('#__NEXT_DATA__').html(); // We requested HTML, but the data are actually JSON. const data = JSON.parse(dataJson); for (const item of data.props.pageProps.items) { const { name, username } = item; const actorDetailUrl = `https://apify.com/${username}/${name}`; await context.enqueueRequest({ url: actorDetailUrl, userData: { // Don't forget the label. label: 'DETAIL', }, }); } ``` We iterate through the items we found, build Actor detail URLs from the available properties and then enqueue those URLs into the request queue. We need to specify the label too, otherwise our page function wouldn't know how to route those requests. > If you're wondering how we know the structure of the URL, see the [Getting started with Apify Scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial again. ### Plugging it into the Page function[​](#plugging-it-into-the-page-function "Direct link to Plugging it into the Page function") We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`. Remember the `// Do some stuff later` comment? Let's replace it. ``` async function pageFunction(context) { const { request, log, skipLinks, $ } = context; if (request.userData.label === 'START') { log.info('Store opened!'); const dataJson = $('#__NEXT_DATA__').html(); // We requested HTML, but the data are actually JSON. const data = JSON.parse(dataJson); for (const item of data.props.pageProps.items) { const { name, username } = item; const actorDetailUrl = `https://apify.com/${username}/${name}`; await context.enqueueRequest({ url: actorDetailUrl, userData: { label: 'DETAIL', }, }); } } if (request.userData.label === 'DETAIL') { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); return { url, uniqueIdentifier, title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } } ``` That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper scrape all of the Actors' data. After it succeeds, open the **Dataset** tab again click on **Preview**. You should have a table of all the Actor's details in front of you. If you do, great job! You've successfully scraped Apify Store. 
And if not, no worries, go through the code examples again, it's probably just a typo. > There's an important caveat. The way we implemented pagination here is in no way a generic system that you can use with other websites. Cheerio is fast (and that means it's cheap), but it's not easy. Sometimes there's just no way to get all results with Cheerio only and other times it takes hours of research. Keep this in mind when choosing the right scraper for your job. But don't get discouraged. Often times, the only thing you will ever need is to define a correct Pseudo URL. Do your research first before giving up on Cheerio Scraper. ## Downloading the scraped data[​](#downloading-the-scraped-data "Direct link to Downloading the scraped data") You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it. > If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**. ### Clean items[​](#clean-items "Direct link to Clean items") You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields such as `#debug`, where you can find a variety of information that can help you with debugging your scrapers. Clean items, on the other hand, include only the data you returned from the `pageFunction`. If you're only interested in the data you scraped, this format is what you will be using most of the time. To control this, open the **Advanced options** view on the **Dataset** tab. ## Bonus: Making your code neater[​](#bonus-making-your-code-neater "Direct link to Bonus: Making your code neater") You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier time maintaining or extending your task, feel free to define other functions inside the `pageFunction` that encapsulate all the different logic. You can, for example, define a function for each of the different pages: ``` async function pageFunction(context) { switch (context.request.userData.label) { case 'START': return handleStart(context); case 'DETAIL': return handleDetail(context); default: throw new Error('Unknown request label.'); } async function handleStart({ log, waitFor, $ }) { log.info('Store opened!'); const dataJson = $('#__NEXT_DATA__').html(); // We requested HTML, but the data are actually JSON. const data = JSON.parse(dataJson); for (const item of data.props.pageProps.items) { const { name, username } = item; const actorDetailUrl = `https://apify.com/${username}/${name}`; await context.enqueueRequest({ url: actorDetailUrl, userData: { label: 'DETAIL', }, }); } } async function handleDetail({ request, log, skipLinks, $ }) { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. 
        const uniqueIdentifier = url
            .split('/')
            .slice(-2)
            .join('/');

        return {
            url,
            uniqueIdentifier,
            title: $('header h1').text(),
            description: $('header span.actor-description').text(),
            modifiedDate: new Date(
                Number(
                    $('ul.ActorHeader-stats time').attr('datetime'),
                ),
            ),
            runCount: Number(
                $('ul.ActorHeader-stats > li:nth-of-type(3)')
                    .text()
                    .match(/[\d,]+/)[0]
                    .replace(/,/g, ''),
            ),
        };
    }
}
```

> If you're confused by the functions being declared below the point where they're called, it's called hoisting and it's a feature of JavaScript. It helps you put what matters on top, if you so desire.

## Final word[​](#final-word "Direct link to Final word")

Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify easily and effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)!

## What's next[​](#whats-next "Direct link to What's next")

* Check out the [Apify SDK](https://docs.apify.com/sdk) and its [Getting started](https://docs.apify.com/sdk/js/docs/guides/apify-platform) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
* [Take a deep dive into Actors](https://docs.apify.com/platform/actors.md), from how they work to [publishing](https://docs.apify.com/platform/actors/publishing.md) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.
* Found out you're not into the coding part but would still like to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/contact-sales) from an Apify-certified developer.

**Learn how to scrape a website using Apify's Cheerio Scraper. Build an Actor's page function, extract information from a web page and download your data.**

***

---

# Getting started with Apify scrapers

Welcome to the getting started tutorial! It will walk you through creating your first scraping task step by step. You will learn how to set up all the different configuration options, code a **Page function** (`pageFunction`), and finally download the scraped data either as an Excel sheet or in another format, such as JSON or CSV. But first, let's give you a brief introduction to web scraping with Apify.

## What is an Apify scraper[​](#what-is-an-apify-scraper "Direct link to What is an Apify scraper")

It doesn't matter whether you arrived here from **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)), **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)) or **Cheerio Scraper** ([apify/cheerio-scraper](https://apify.com/apify/cheerio-scraper)). All of them are **Actors** and for now, let's think of an **Actor** as an application that you can use with your own configuration. **apify/web-scraper** is therefore an application called **web-scraper**, built by **apify**, that you can configure to scrape any webpage. We call these configurations **tasks**.

> If you need help choosing the right scraper, see this [great article](https://help.apify.com/en/articles/3024655-choosing-the-right-solution).
If you want to learn more about Actors in general, you can read our [Actors page](https://apify.com/actors) or [browse the documentation](https://docs.apify.com/platform/actors.md). You can create 10 different **tasks** for 10 different websites, with very different options, but there will always be just one **Actor**, the `apify/*-scraper` you chose. This is the essence of tasks. They are nothing but **saved configurations** of the Actor that you can run repeatedly. ## Trying it out[​](#trying-it-out "Direct link to Trying it out") Depending on how you arrived at this tutorial, you may already have your first task created for the scraper of your choice. If not, the easiest way is to go to [Apify Store](https://console.apify.com/actors#/store/) and select the Actor you want to base your task on. Then, click the **Create a new task** button in the top-right corner. > This tutorial covers the use of **Web**, **Cheerio**, and **Puppeteer** scrapers, but a lot of the information here can be used with all Actors. For this tutorial, we will select **Web Scraper**. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/actor-selection.webp) ### Running a task[​](#running-a-task "Direct link to Running a task") This takes you to the **Input and options** tab of the task configuration. Before we delve into the details, let's see how the example works. You can see that there are already some pre-configured input values. It says that the task should visit **** and all its subpages, such as **** and scrape some data using the provided `pageFunction`, specifically the `` of the page and its URL. Scroll down to the **Performance and limits** section and set the **Max pages per run** option to **10**. This tells your task to finish after 10 pages have been visited. We don't need to crawl the whole domain to see that the Actor works. > This also helps with keeping your [compute unit](https://docs.apify.com/platform/actors/running/usage-and-resources.md) (CU) consumption low. To get an idea, our free plan includes 10 CUs and this run will consume about 0.04 CU, so you can run it 250 times a month for free. If you accidentally go over the limit, no worries, we won't charge you for it. You just won't be able to run more tasks that month. Now click **Save & Run**! *(in the bottom-left part of your screen)* ### The run detail[​](#the-run-detail "Direct link to The run detail") After clicking **Save & Run**, the window will change to the run detail. Here, you will see the run's log. If it seems that nothing is happening, don't worry, it takes a few seconds for the run to fully boot up. In under a minute, you should have the 10 pages scraped. You will know that the run successfully completed when the `RUNNING` card in top-left corner changes to `SUCCEEDED`. > Feel free to browse through the various new tabs: **Log**, **Info**, **Input** and other, but for the sake of brevity, we will not explain all their features in this tutorial. Now that the run has `SUCCEEDED`, click on the glowing **Results** card to see the scrape's results. This takes you to the **Dataset** tab, where you can display or download the results in various formats. For now, click the **Preview** button. Voila, the scraped data! ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/the-run-detail.webp) Good job! We've run our first task and got some results. Let's learn how to change the default configuration to scrape something more interesting than the page's ``. 
## Creating your own task[​](#creating-your-own-task "Direct link to Creating your own task") Before we jump into the scraping itself, let's have a quick look at the user interface that's available to us. Click on the task's name in the top-left corner to visit the task's configuration. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/task-name.webp) ### Input and options[​](#input-and-options "Direct link to Input and options") The **Input** tab is where we started and it's the place where you create your scraping configuration. The Actor's creator prepares the **Input** form so that you can tell the Actor what to do. Feel free to check the tooltips of the various options to get a better idea of what they do. To display the tooltip, click the question mark next to each input field's name. > We will not go through all the available input options in this tutorial. See the Actor's README for detailed information. Below the input fields are the Build, Timeout and Memory options. Let's keep them at default settings for now. Remember that if you see a yellow `TIMED-OUT` status after running your task, you might want to come back here and increase the timeout. > Timeouts are there to prevent tasks from running forever. Always set a reasonable timeout to prevent a rogue task from eating up all your compute units. ### Settings[​](#settings "Direct link to Settings") In the settings tab, you can set options that are common to all tasks and not directly related to the Actor's purpose. Unless you've already changed the task's name, it's set to **my-task**, so why not try changing it to **my-first-scraper** and clicking **Save**. ### Runs[​](#runs "Direct link to Runs") You can find all the task runs and their detail pages here. Every time you start a task, it will appear here in the list. Apify securely stores your ten most recent runs indefinitely, ensuring your records are always accessible. All of your task's runs and their outcomes, beyond the latest ten, will be stored here for the data retention period, [which you can find under your plan](https://apify.com/pricing). ### Webhooks[​](#webhooks "Direct link to Webhooks") Webhooks are a feature that help keep you aware of what's happening with your tasks. You can set them up to inform you when a task starts, finishes, fails etc., or you can even use them to run more tasks, depending on the outcome of the original one. [See webhooks documentation](https://docs.apify.com/platform/integrations/webhooks.md). ### Information[​](#information "Direct link to Information") Since tasks are configurations for Actors, this tab shows you all the information about the underlying Actor, the Apify scraper of your choice. You can see the available versions and their READMEs - it's always a good idea to read an Actor's README first before creating a task for it. ### API[​](#api "Direct link to API") The API tab gives you a quick overview of all the available API calls in case you would like to use your task programmatically. It also includes links to detailed API documentation. You can even try it out immediately using the **Test endpoint** button. > Never share a URL containing the authentication token (`?token=...` parameter in the URLs), as this will compromise your account's security. ## Scraping theory[​](#scraping-theory "Direct link to Scraping theory") Since this is a tutorial, we'll be scraping our own website. [Apify Store](https://apify.com/store) is a great candidate for some scraping practice. 
It's a page built on popular technologies, which displays a lot of different items in various categories, just like an online store, a typical scraping target, would. ### The goal[​](#the-goal "Direct link to The goal") We want to create a scraper that scrapes all the Actors in the store and collects the following attributes for each Actor: 1. **URL** - The URL that goes directly to the Actor's detail page. 2. **Unique identifier** - Such as **apify/web-scraper**. 3. **Title** - The title visible in the Actor's detail page. 4. **Description** - The Actor's description. 5. **Last modification date** - When the Actor was last modified. 6. **Number of runs** - How many times the Actor was run. Some of this information may be scraped directly from the listing pages, but for the rest, we will need to visit the detail pages of all the Actors. ### The start URL[​](#the-start-url "Direct link to The start URL") In the **Input** tab of the task we have, we'll change the **Start URL** from ****. This will tell the scraper to start by opening a different URL. You can add more **Start URL**s or even [use a file with a list of thousands of them](#crawling-the-website-with-pseudo-urls), but in this case, we'll be good with just one. How do we choose the new **Start URL**? The goal is to scrape all Actors in the store, which is available at [apify.com/store](https://apify.com/store), so we choose this URL as our **Start URL**. ``` https://apify.com/store ``` We also need to somehow distinguish the **Start URL** from all the other URLs that the scraper will add later. To do this, click the **Details** button in the **Start URL** form and see the **User data** input. Here you can add any information you'll need during the scrape in a JSON format. For now, add a label to the **Start URL**. ``` { "label": "START" } ``` ### Filtering with a Link selector[​](#filtering-with-a-link-selector "Direct link to Filtering with a Link selector") The **Link selector**, together with **Pseudo URL**s, are your URL matching arsenal. The Link selector is a CSS selector and its purpose is to select the HTML elements where the scraper should look for URLs. And by looking for URLs, we mean finding the elements' `href` attributes. For example, to enqueue URLs from `` tags, we would enter `'div.my-class'`. What's the connection to **Pseudo URL**s? Well, first, all the URLs found in the elements that match the Link selector are collected. Then, **Pseudo URL**s are used to filter through those URLs and enqueue only the ones that match the **Pseudo URL** structure. To scrape all the Actors in Apify Store, we should use the Link selector to tell the scraper where to find the URLs we need. For now, let us tell you that the Link selector you're looking for is: ``` div.item > a ``` Save it as your **Link selector**. If you're wondering how we figured this out, follow along with the tutorial. By the time we finish, you'll know why we used this selector, too. ### Crawling the website with pseudo URLs[​](#crawling-the-website-with-pseudo-urls "Direct link to Crawling the website with pseudo URLs") What is a **Pseudo URL**? Let us explain. Before we can start scraping the Actor details, we need to find all the links to the details. If the links follow a set structure, we can use a certain pattern to describe this structure. And that's what a **Pseudo URL** is. A pattern that describes a URL structure. By setting a **Pseudo URL**, all links that follow the given structure will automatically be added to the crawling queue. 
Let's see an example. To find the pattern, open some of the Actor details in the store. You'll find that the URLs are always structured the same: ``` https://apify.com/{OWNER}/{NAME} ``` In the structures, only the `OWNER` and `NAME` change. We can leverage this in a **Pseudo URL**. #### Making a pseudo URL[​](#making-a-pseudo-url "Direct link to Making a pseudo URL") **Pseudo URL**s are URLs with some variable parts in them. Those variable parts are represented by [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) enclosed in brackets `[]`. Working with our Actor details example, we could produce a **Pseudo URL** like this: ``` https://apify.com/[.+]/[.+] ``` This **Pseudo URL** will match all Actor detail pages, such as: ``` https://apify.com/apify/web-scraper ``` But it will not match pages we're not interested in, such as: ``` https://apify.com/contact ``` In addition, together with the filter we set up using the **Link selector**, the scraper will now avoid URLs such as: ``` https://apify.com/industries/manufacturing ``` This is because even though it matches our **Pseudo URL**'s format, the HTML element that contains it does not match the `div.item > a` element we specified in the **Link selector**. Let's use the above **Pseudo URL** in our task. We should also add a label as we did with our **Start URL**. This label will be added to all pages that were enqueued into the request queue using the given **Pseudo URL**. ``` { "label": "DETAIL" } ``` ### Test run[​](#test-run "Direct link to Test run") Now that we've added some configuration, it's time to test it. Run the task, keeping the **Max pages per run** set to `10` and the `pageFunction` as it is. You should see in the log that the scraper first visits the **Start URL** and then several of the Actor details matching the **Pseudo URL**. ## The page function[​](#the-page-function "Direct link to The page function") The `pageFunction` is a JavaScript function that gets executed for each page the scraper visits. To figure out how to create it, you must first inspect the page's structure to get an idea of its inner workings. The best tools for that are a browser's inbuilt developer tools - DevTools. ### Using DevTools[​](#using-devtools "Direct link to Using DevTools") Open [Apify Store](https://apify.com/store) in the Chrome browser (or use any other browser, just note that the DevTools may differ slightly) and open the DevTools, either by right-clicking on the page and selecting **Inspect** or by pressing **F12**. The DevTools window will pop up and display a lot of, perhaps unfamiliar, information. Don't worry about that too much - open the Elements tab (the one with the page's HTML). The Elements tab allows you to browse the page's structure and search within it using the search tool. You can open the search tool by pressing **CTRL+F** or **CMD+F**. Try typing **title** into the search bar. You'll see that the Element tab jumps to the first `` element of the current page and that the title is **Store · Apify**. It's always good practice to do your research using the DevTools before writing the `pageFunction` and running your task. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/using-devtools.webp) > For the sake of brevity, we won't go into the details of using the DevTools in this tutorial. If you're just starting out with DevTools, this [Google tutorial](https://developer.chrome.com/docs/devtools/) is a good place to begin. 
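If you want to double-check this research, you can paste a couple of illustrative commands into the DevTools **Console** while on the Apify Store page; the selectors come from the earlier sections of this tutorial:

```
// The page title we found with the Elements search tool
document.querySelector('title').textContent; // "Store · Apify"

// A quick preview of how many links our Link selector from earlier matches
document.querySelectorAll('div.item > a').length;
```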
### Understanding `context`[​](#understanding-context "Direct link to understanding-context") The `pageFunction` has access to global variables such as `window` and `document`, which are provided by the browser, as well as to `context`, which is the `pageFunction`'s single argument. `context` carries a lot of useful information and helpful functions, which are described in the Actor's README. ### New page function boilerplate[​](#new-page-function-boilerplate "Direct link to New page function boilerplate") We know that we'll visit two kinds of pages, the list page (**Start URL**) and the detail pages (enqueued using the **Pseudo URL**). We want to enqueue links on the list page and scrape data on the detail page. Since we're not covering jQuery in this tutorial for the sake of brevity, replace the default boilerplate with the code below. ``` async function pageFunction(context) { const { request, log, skipLinks } = context; if (request.userData.label === 'START') { log.info('Store opened!'); // Do some stuff later. } if (request.userData.label === 'DETAIL') { log.info(`Scraping ${request.url}`); await skipLinks(); // Do some scraping. return { // Scraped data. }; } } ``` This may seem like a lot of new information, but it's all connected to our earlier configuration. ### `context.request`[​](#contextrequest "Direct link to contextrequest") The `request` is an instance of the [`Request`](https://sdk.apify.com/docs/api/request) class and holds information about the currently processed page, such as its `url`. Each `request` also has the `request.userData` property of type `Object`. While configuring the **Start URL** and the **Pseudo URL**, we gave them a `label`. We're now using them in the `pageFunction` to distinguish between the store page and the detail pages. ### `context.skipLinks()`[​](#contextskiplinks "Direct link to contextskiplinks") When a **Pseudo URL** is set, the scraper attempts to enqueue matching links on each page it visits. `skipLinks()` is used to tell the scraper that we don't want this to happen on the current page. ### `context.log`[​](#contextlog "Direct link to contextlog") `log` is used for printing messages to the console. You may be tempted to use `console.log()`, but this will not work unless you turn on the **Browser log** option. `log.info()` should be used for general messages, but you can also use `log.debug()` for messages that will only be shown when you turn on the **Debug log** option. [See the docs for more info](https://sdk.apify.com/docs/api/log). ### The page function's return value[​](#the-page-functions-return-value "Direct link to The page function's return value") The `pageFunction` may only return nothing, `null`, `Object` or `Object[]`. If an `Object` is returned, it will be saved as a single result. Returning an `Array` of `Objects` will save each item in the array as a result. The scraping results are saved in a [dataset](https://docs.apify.com/platform/storage/dataset.md) (one of the tabs in the run console, as you may remember). It behaves like a table. Each item is a row in the table and its properties are its columns. Returning the following `Object`: ``` async function pageFunction(context) { // ... 
rest of your code return { url: 'https://apify.com', title: 'Web Scraping, Data Extraction and Automation - Apify', }; } ``` will produce the following table: | title | url | | ---------------------------------------------------- | ------------------- | | Web Scraping, Data Extraction and Automation - Apify | https://apify.com | ## Scraper lifecycle[​](#scraper-lifecycle "Direct link to Scraper lifecycle") Now that we're familiar with all the pieces of the puzzle, we'll quickly take a look at the scraper lifecycle, or in other words, what the scraper actually does when it scrapes. It's quite straightforward. The scraper: 1. Visits the first **Start URL** and waits for the page to load. 2. Executes the `pageFunction`. 3. Finds all the elements matching the **Link selector** and extracts their `href` attributes (URLs). 4. Uses the **pseudo URLs** to filter the extracted URLs and throws away those that don't match. 5. Enqueues the matching URLs to the end of the crawling queue. 6. Closes the page and selects a new URL to visit, either from the **Start URL**s if there are any left, or from the beginning of the crawling queue. > When you're not using the request queue, the scraper repeats steps 1 and 2. You would not use the request queue when you already know all the URLs you want to visit. For example, when you have a pre-existing list of a thousand URLs that you uploaded as a text file. Or when scraping a single URL. ## Scraping practice[​](#scraping-practice "Direct link to Scraping practice") We've covered all the concepts that we need to understand to successfully scrape the data defined in our goal, so let's get to it. We will only output data that are already available to us in the page's URL. Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique identifier** in our results. To get those, we need the `request.url`, because it is the URL and includes the Unique identifier. ``` const { url } = request; const uniqueIdentifier = url.split('/').slice(-2).join('/'); ``` ### Test run 2[​](#test-run-2 "Direct link to Test run 2") We'll add our first data to the `pageFunction` and carry out a test run to see that everything works as expected. ``` async function pageFunction(context) { const { request, log, skipLinks } = context; if (request.userData.label === 'START') { log.info('Store opened!'); // Do some stuff later. } if (request.userData.label === 'DETAIL') { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); return { url, uniqueIdentifier, }; } } ``` Now **Save & Run** the task and once it finishes, check the dataset by clicking on the **Results** card. Click **Preview** and you should see the URLs and unique identifiers scraped. Great job! ## Choosing sides[​](#choosing-sides "Direct link to Choosing sides") Up until now, everything has been the same for all the Apify scrapers. Whether you're using Web Scraper, Puppeteer Scraper or Cheerio Scraper, what you've learned now will always be the same. This is great if you ever need to switch scrapers, because there's no need to learn everything from scratch. Differences can be found in the code we use in the `pageFunction`. Often subtle, sometimes large. In the next part of the tutorial, we'll focus on the individual scrapers' specific implementation details. It's time to choose sides. But don't worry, at Apify, no side is the dark side.
* [Web Scraper tutorial](https://docs.apify.com/academy/apify-scrapers/web-scraper.md) * [Cheerio Scraper tutorial](https://docs.apify.com/academy/apify-scrapers/cheerio-scraper.md) * [Puppeteer Scraper tutorial](https://docs.apify.com/academy/apify-scrapers/puppeteer-scraper.md) **Step-by-step tutorial that will help you get started with all Apify Scrapers. Learn the foundations of scraping the web with Apify and creating your own Actors.** --- # This scraping tutorial will go into the nitty-gritty details of extracting data from **Apify Store** using **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)). If you arrived here from the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, great! You are ready to continue where we left off. If you haven't seen the Getting started yet, check it out; it will help you learn about Apify and scraping in general and set you up for this tutorial, because this one builds on topics and code examples discussed there. ## Getting to know our tools[​](#getting-to-know-our-tools "Direct link to Getting to know our tools") In the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started) tutorial, we've confirmed that the scraper works as expected, so now it's time to add more data to the results. To do that, we'll be using the [Puppeteer library](https://github.com/puppeteer/puppeteer). Puppeteer is a browser automation library that allows you to control a browser using JavaScript. That is, simulate a real human sitting in front of a computer, using a mouse and a keyboard. It gives you almost unlimited possibilities, but you need to learn quite a lot before you'll be able to use all of its features. We'll walk you through some of the basics of Puppeteer, so that you can start using it for some of the most typical scraping tasks, but if you really want to master it, you'll need to visit its [documentation](https://pptr.dev/) and really dive deep into its intricacies. > The purpose of Puppeteer Scraper is to remove some of the difficulty faced when using Puppeteer by wrapping it in a nice, manageable UI. It provides almost all of its features in a format that is much easier to grasp when first trying to scrape using Puppeteer. ### Web Scraper differences[​](#web-scraper-differences "Direct link to Web Scraper differences") At first glance, it may seem like **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)) and Puppeteer Scraper are almost the same. Well, they are. In fact, Web Scraper uses Puppeteer underneath. The difference is the amount of control they give you. Where Web Scraper only gives you access to in-browser JavaScript and the `pageFunction` is executed in the browser context, Puppeteer Scraper's `pageFunction` is executed in the Node.js context, giving you much more freedom to bend the browser to your will. You're the puppeteer and the browser is your puppet. It's also much easier to work with external APIs, databases or the [Apify SDK](https://sdk.apify.com) in the Node.js context. The tradeoff is simplicity vs power. Web Scraper is simple, Puppeteer Scraper is powerful (and the [Apify SDK](https://sdk.apify.com) is super-powerful). > In other words, Web Scraper's `pageFunction` is like a single [page.evaluate()](https://pptr.dev/#?product=Puppeteer\&show=api-pageevaluatepagefunction-args) call.
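To make the difference concrete, here is a minimal sketch (not part of the tutorial's task) of how the two contexts meet in Puppeteer Scraper: the `pageFunction` itself runs in Node.js, and anything that needs the page's DOM is wrapped in `page.evaluate()`.

```
async function pageFunction(context) {
    const { page } = context; // Puppeteer's Page object, available in Node.js

    // Node.js context: full access to the Puppeteer API, external libraries, etc.
    const pageTitle = await page.title();

    // Browser context: document and window are only available inside evaluate().
    const headingCount = await page.evaluate(() => {
        return document.querySelectorAll('h1').length;
    });

    return { pageTitle, headingCount };
}
```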
Now that's out of the way, let's open one of the Actor detail pages in the Store, for example the Web Scraper page and use our DevTools-Fu to scrape some data. > If you're wondering why we're using Web Scraper as an example instead of Puppeteer Scraper, it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers! ## Building our Page function[​](#building-our-page-function "Direct link to Building our Page function") Before we start, let's do a quick recap of the data we chose to scrape: 1. **URL** - The URL that goes directly to the Actor's detail page. 2. **Unique identifier** - Such as **apify/web-scraper**. 3. **Title** - The title visible in the Actor's detail page. 4. **Description** - The Actor's description. 5. **Last modification date** - When the Actor was last modified. 6. **Number of runs** - How many times the Actor was run. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp) We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, so let's get to the next one on the list: title. ### Title[​](#title "Direct link to Title") ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp) By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! A `<header>` element that we can use to select only the heading we're interested in. > Remember that you can press CTRL+F (CMD+F) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler code all the time. To get the title we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor. And as we already know, there's only one. ``` // Using Puppeteer async function pageFunction(context) { const { page } = context; const title = await page.$eval( 'header h1', (el) => el.textContent, ); return { title, }; } ``` The [`page.$eval`](https://pptr.dev/#?product=Puppeteer\&show=api-elementhandleevalselector-pagefunction-args-1) function allows you to run a function in the browser, with the selected element as the first argument. Here we use it to extract the text content of an `h1` element that's in the page. The return value of the function is automatically passed back to the Node.js context, so we receive an actual `string` with the element's text. ### Description[​](#description "Direct link to Description") Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<span>` tag, because there are a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.
![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp) ``` async function pageFunction(context) { const { page } = context; const title = await page.$eval( 'header h1', (el) => el.textContent, ); const description = await page.$eval( 'header span.actor-description', (el) => el.textContent, ); return { title, description, }; } ``` ### Modified date[​](#modified-date "Direct link to Modified date") The DevTools tell us that the `modifiedDate` can be found in a `<time>` element. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/modified-date.webp) ``` async function pageFunction(context) { const { page } = context; const title = await page.$eval( 'header h1', (el) => el.textContent, ); const description = await page.$eval( 'header span.actor-description', (el) => el.textContent, ); const modifiedTimestamp = await page.$eval( 'ul.ActorHeader-stats time', (el) => el.getAttribute('datetime'), ); const modifiedDate = new Date(Number(modifiedTimestamp)); return { title, description, modifiedDate, }; } ``` Similarly to `page.$eval`, the [`page.$$eval`](https://pptr.dev/#?product=Puppeteer\&show=api-elementhandleevalselector-pagefunction-args) function runs a function in the browser, only this time it does not provide you with a single `Element` as the function's argument, but rather with an `Array` of all matching `Elements`. Once again, the return value of the function is passed back to the Node.js context. We don't need that here, so we stick with `page.$eval`. It might look a little too complex at first glance, but let us walk you through it. We find the `<time>` element inside `ul.ActorHeader-stats` and read its `datetime` attribute, because that's where a unix timestamp is stored as a `string`. But we would much rather see a readable date in our results, not a unix timestamp, so we need to convert it. Unfortunately, the `new Date()` constructor will not accept a numeric timestamp passed as a `string`, so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`. Phew! ### Run count[​](#run-count "Direct link to Run count") And so we're finishing up with the `runCount`. There's no specific element like `<time>` for the run count, so we need to create a complex selector and then do a transformation on the result. ``` async function pageFunction(context) { const { page } = context; const title = await page.$eval( 'header h1', (el) => el.textContent, ); const description = await page.$eval( 'header span.actor-description', (el) => el.textContent, ); const modifiedTimestamp = await page.$eval( 'ul.ActorHeader-stats time', (el) => el.getAttribute('datetime'), ); const modifiedDate = new Date(Number(modifiedTimestamp)); const runCountText = await page.$eval( 'ul.ActorHeader-stats > li:nth-of-type(3)', (el) => el.textContent, ); const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, '')); return { title, description, modifiedDate, runCount, }; } ``` The `ul.ActorHeader-stats > li:nth-of-type(3)` looks complicated, but it only reads that we're looking for a `<ul>` element with the class `ActorHeader-stats` and within that element we're looking for the third `<li>` element. We grab its text, but we're only interested in the number of runs. We parse the number out using a regular expression, but its type is still a `string`, so we finally convert the result to a `number` by wrapping it with a `Number()` call. > The numbers are formatted with commas as thousands separators (e.g. `'1,234,567'`), so to extract it, we first use the regular expression `/[\d,]+/` - it will search for consecutive digit or comma characters.
Then we extract the match via `.match(/[\d,]+/)[0]` and finally remove all the commas by calling `.replace(/,/g, '')`. We need to use `/,/g` with the global modifier to support large numbers with multiple separators; without it, we would replace only the very first occurrence. > > This will give us a string (e.g. `'1234567'`) that can be converted via the `Number` function. ### Wrapping it up[​](#wrapping-it-up "Direct link to Wrapping it up") And there we have it! All the data we needed in a single object. For the sake of completeness, let's add the properties we parsed from the URL earlier and we're good to go. ``` async function pageFunction(context) { const { page, request } = context; const { url } = request; // ... const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); const title = await page.$eval( 'header h1', (el) => el.textContent, ); const description = await page.$eval( 'header span.actor-description', (el) => el.textContent, ); const modifiedTimestamp = await page.$eval( 'ul.ActorHeader-stats time', (el) => el.getAttribute('datetime'), ); const modifiedDate = new Date(Number(modifiedTimestamp)); const runCountText = await page.$eval( 'ul.ActorHeader-stats > li:nth-of-type(3)', (el) => el.textContent, ); const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, '')); return { url, uniqueIdentifier, title, description, modifiedDate, runCount, }; } ``` All we need to do now is add this to our `pageFunction`: ``` async function pageFunction(context) { // page is Puppeteer's page const { request, log, skipLinks, page } = context; if (request.userData.label === 'START') { log.info('Store opened!'); // Do some stuff later. } if (request.userData.label === 'DETAIL') { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); // Get attributes in parallel to speed up the process. const titleP = page.$eval( 'header h1', (el) => el.textContent, ); const descriptionP = page.$eval( 'header span.actor-description', (el) => el.textContent, ); const modifiedTimestampP = page.$eval( 'ul.ActorHeader-stats time', (el) => el.getAttribute('datetime'), ); const runCountTextP = page.$eval( 'ul.ActorHeader-stats > li:nth-of-type(3)', (el) => el.textContent, ); const [ title, description, modifiedTimestamp, runCountText, ] = await Promise.all([ titleP, descriptionP, modifiedTimestampP, runCountTextP, ]); const modifiedDate = new Date(Number(modifiedTimestamp)); const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, '')); return { url, uniqueIdentifier, title, description, modifiedDate, runCount, }; } } ``` > You have definitely noticed that we changed up the code a little bit. This is because the back-and-forth communication between Node.js and the browser takes some time and it slows down the scraper. To limit the effect of this, we changed all the functions to start at the same time and only wait for all of them to finish at the end. This is called concurrency or parallelism. Unless the functions need to be executed in a specific order, it's often a good idea to run them concurrently to speed things up. ### Test run[​](#test-run "Direct link to Test run") As always, try hitting that **Save & Run** button and visit the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped. You nailed it! ## Pagination[​](#pagination "Direct link to Pagination") Pagination is a term that represents "going to the next page of results".
You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination. > This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may use links that take you to the next page. If you encounter those, make a **Pseudo URL** for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing. ### Waiting for dynamic content[​](#waiting-for-dynamic-content "Direct link to Waiting for dynamic content") Before we talk about paginating, we need to have a quick look at dynamic content. Since Apify Store is a JavaScript application (a popular approach), the button might not exist in the page when the scraper runs the `pageFunction`. How is this possible? Because the scraper only waits for the page to load its HTML before executing the `pageFunction`. If there's additional JavaScript that modifies the DOM afterwards, the `pageFunction` may execute before this JavaScript has had time to run. At first, you may think that the scraper is broken, but it just cannot wait for all the JavaScript in the page to finish executing. For a lot of pages, there's always some JavaScript executing or some network requests being made. It would never stop waiting. It is therefore up to you, the programmer, to wait for the elements you need. #### The `context.page.waitFor()` function[​](#the-contextpagewaitfor-function "Direct link to the-contextpagewaitfor-function") `waitFor()` is a function that's available on the Puppeteer `page` object that's in turn available on the `context` argument of the `pageFunction` (as you already know from previous chapters). It helps you with, well, waiting for stuff. It accepts either a number of milliseconds to wait, a selector to await in the page, or a function to execute. It will stop waiting once the time elapses, the selector appears or the provided function returns `true`. > See [`page.waitFor()`](https://pptr.dev/#?product=Puppeteer\&show=api-pagewaitforselectororfunctionortimeout-options-args) in the Puppeteer documentation. ``` // Waits for 2 seconds. await page.waitFor(2000); // Waits until an element with id "my-id" appears in the page. await page.waitFor('#my-id'); // Waits until a "myObject" variable appears // on the window object. await page.waitFor(() => !!window.myObject); ``` The selector may never be found and the function might never return `true`, so the `page.waitFor()` function also has a timeout. The default is `30` seconds. You can override it by providing an options object as the second parameter, with a `timeout` property. ``` await page.waitFor('.bad-class', { timeout: 5000 }); ``` With those tools, you should be able to handle any dynamic content the website throws at you. ### How to paginate[​](#how-to-paginate "Direct link to How to paginate") After going through the theory, let's design the algorithm: 1. Wait for the **Show more** button. 2. Click it. 3. Is there another **Show more** button? * Yes? Repeat from 1. (loop) * No? We're done. We have all the Actors. #### Waiting for the button[​](#waiting-for-the-button "Direct link to Waiting for the button") Before we can wait for the button, we need to know its unique selector.
A quick look in the DevTools tells us that the button's class is some weird randomly generated string, but fortunately, there's an enclosing `<div>` with a class of `show-more`. Great! Our unique selector: ``` div.show-more > button ``` > Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F). ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/waiting-for-the-button.webp) Now that we know what to wait for, we plug it into the `waitFor()` function. ``` await page.waitFor('div.show-more > button'); ``` #### Clicking the button[​](#clicking-the-button "Direct link to Clicking the button") We have a unique selector for the button and we know that it's already rendered in the page. Clicking it is a piece of cake. We'll use the Puppeteer `page` again to issue the click. Puppeteer will actually simulate moving the mouse and making a left mouse click on the element. ``` await page.click('div.show-more > button'); ``` This will show the next page of Actors. #### Repeating the process[​](#repeating-the-process "Direct link to Repeating the process") We've shown two function calls, but how do we make this work together in the `pageFunction`? ``` async function pageFunction(context) { // ... let timeout; // undefined const buttonSelector = 'div.show-more > button'; for (;;) { log.info('Waiting for the "Show more" button.'); try { // Default timeout first time. await page.waitFor(buttonSelector, { timeout }); // 2 sec timeout after the first. timeout = 2000; } catch (err) { // Ignore the timeout error. log.info('Could not find the "Show more" button, ' + 'we\'ve reached the end.'); break; } log.info('Clicking the "Show more" button.'); await page.click(buttonSelector); } // ... } ``` We want to run this until the `waitFor()` function throws, so that's why we use an infinite `for (;;)` loop. We're also not interested in the error, because we're expecting it, so we ignore it and print a log message instead. You might be wondering what's up with the `timeout`. Well, for the first page load, we want to wait longer, so that all the page's JavaScript has had a chance to execute, but for the other iterations, the JavaScript is already loaded and we're waiting for the page to re-render, so waiting for `2` seconds is enough to confirm that the button is not there. We don't want to stall the scraper for `30` seconds just to make sure that there's no button. ### Plugging it into the Page function[​](#plugging-it-into-the-page-function "Direct link to Plugging it into the Page function") We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`. Remember the `// Do some stuff later` comment? Let's replace it. ``` async function pageFunction(context) { const { request, log, skipLinks, page } = context; if (request.userData.label === 'START') { log.info('Store opened!'); let timeout; // undefined const buttonSelector = 'div.show-more > button'; for (;;) { log.info('Waiting for the "Show more" button.'); try { // Default timeout first time. await page.waitFor(buttonSelector, { timeout }); // 2 sec timeout after the first. timeout = 2000; } catch (err) { // Ignore the timeout error. log.info('Could not find the "Show more" button, ' + 'we\'ve reached the end.'); break; } log.info('Clicking the "Show more" button.'); await page.click(buttonSelector); } } if (request.userData.label === 'DETAIL') { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping.
const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); // Get attributes in parallel to speed up the process. const titleP = page.$eval( 'header h1', (el) => el.textContent, ); const descriptionP = page.$eval( 'header span.actor-description', (el) => el.textContent, ); const modifiedTimestampP = page.$eval( 'ul.ActorHeader-stats time', (el) => el.getAttribute('datetime'), ); const runCountTextP = page.$eval( 'ul.ActorHeader-stats > li:nth-of-type(3)', (el) => el.textContent, ); const [ title, description, modifiedTimestamp, runCountText, ] = await Promise.all([ titleP, descriptionP, modifiedTimestampP, runCountTextP, ]); const modifiedDate = new Date(Number(modifiedTimestamp)); const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, '')); return { url, uniqueIdentifier, title, description, modifiedDate, runCount, }; } } ``` That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the Actors and then scrape all of their data. After it succeeds, open the **Dataset** tab again and click on **Preview**. You should have a table of all the Actors' details in front of you. If you do, great job! You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp) ## Downloading the scraped data[​](#downloading-the-scraped-data "Direct link to Downloading the scraped data") You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it. > If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**. ### Clean items[​](#clean-items "Direct link to Clean items") You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields such as `#debug`, where you can find a variety of information that can help you with debugging your scrapers. Clean items, on the other hand, include only the data you returned from the `pageFunction`. If you're only interested in the data you scraped, this format is what you will be using most of the time. To control this, open the **Advanced options** view on the **Dataset** tab. ## Bonus: Making your code neater[​](#bonus-making-your-code-neater "Direct link to Bonus: Making your code neater") You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier time maintaining or extending your task, feel free to define other functions inside the `pageFunction` that encapsulate all the different logic. You can, for example, define a function for each of the different pages: ``` async function pageFunction(context) { switch (context.request.userData.label) { case 'START': return handleStart(context); case 'DETAIL': return handleDetail(context); default: throw new Error('Unknown request label.'); } async function handleStart({ log, page }) { log.info('Store opened!'); let timeout; // undefined const buttonSelector = 'div.show-more > button'; for (;;) { log.info('Waiting for the "Show more" button.'); try { // Default timeout first time.
await page.waitFor(buttonSelector, { timeout }); // 2 sec timeout after the first. timeout = 2000; } catch (err) { // Ignore the timeout error. log.info('Could not find the "Show more" button, ' + 'we\'ve reached the end.'); break; } log.info('Clicking the "Show more" button.'); await page.click(buttonSelector); } } async function handleDetail({ request, log, skipLinks, page, }) { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); // Get attributes in parallel to speed up the process. const titleP = page.$eval( 'header h1', (el) => el.textContent, ); const descriptionP = page.$eval( 'header span.actor-description', (el) => el.textContent, ); const modifiedTimestampP = page.$eval( 'ul.ActorHeader-stats time', (el) => el.getAttribute('datetime'), ); const runCountTextP = page.$eval( 'ul.ActorHeader-stats > li:nth-of-type(3)', (el) => el.textContent, ); const [ title, description, modifiedTimestamp, runCountText, ] = await Promise.all([ titleP, descriptionP, modifiedTimestampP, runCountTextP, ]); const modifiedDate = new Date(Number(modifiedTimestamp)); const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, '')); return { url, uniqueIdentifier, title, description, modifiedDate, runCount, }; } } ``` > If you're confused by the functions being declared below the code that calls them, it's called hoisting and it's a feature of JavaScript. It helps you put what matters on top, if you so desire. ## Bonus 2: Using jQuery with Puppeteer Scraper[​](#bonus-2-using-jquery-with-puppeteer-scraper "Direct link to Bonus 2: Using jQuery with Puppeteer Scraper") If you're familiar with the [jQuery library](https://jquery.com/), you may have looked at the scraping code and thought that it's unnecessarily complicated. That's probably up to everyone to decide on their own, but the good news is, you can use jQuery with Puppeteer Scraper too. ### Injecting jQuery[​](#injecting-jquery "Direct link to Injecting jQuery") To be able to use jQuery, we first need to introduce it to the browser. The [`Apify.utils.puppeteer.injectJQuery`](https://sdk.apify.com/docs/api/puppeteer#puppeteerinjectjquerypage) function will help us with the task. > Friendly warning: Injecting jQuery into a page may break the page itself, if it expects a specific version of jQuery to be available and you override it with an incompatible one. Be careful. You can either call this function directly in your `pageFunction`, or you can set up jQuery injection in the **Pre goto function** in the **Input and options** section. ``` async function pageFunction(context) { const { Apify, page } = context; await Apify.utils.puppeteer.injectJQuery(page); // your code ... } ``` ``` async function preGotoFunction({ page, Apify }) { await Apify.utils.puppeteer.injectJQuery(page); } ``` The implementations are almost equal in effect. That means that in some cases, you may see performance differences, or one might work while the other does not, depending on the target website. Let's try refactoring the Bonus 1 version of the `pageFunction` to use jQuery.
``` async function pageFunction(context) { switch (context.request.userData.label) { case 'START': return handleStart(context); case 'DETAIL': return handleDetail(context); default: throw new Error(`Unknown label: ${context.request.userData.label}`); } async function handleStart({ log, page }) { log.info('Store opened!'); let timeout; // undefined const buttonSelector = 'div.show-more > button'; for (;;) { log.info('Waiting for the "Show more" button.'); try { await page.waitFor(buttonSelector, { timeout }); timeout = 2000; } catch (err) { log.info('Could not find the "Show more" button, ' + 'we\'ve reached the end.'); break; } log.info('Clicking the "Show more" button.'); await page.click(buttonSelector); } } async function handleDetail(contextInner) { const { request, log, skipLinks, page, Apify, } = contextInner; // Inject jQuery await Apify.utils.puppeteer.injectJQuery(page); const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); // Use jQuery only inside page.evaluate (inside browser) const results = await page.evaluate(() => { return { title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ).toISOString(), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; }); return { url, uniqueIdentifier, // Add results from browser to output ...results, }; } } ``` > There's an important takeaway from the example code. You can only use jQuery in the browser scope, even though you're injecting it outside of the browser. We're using the [`page.evaluate()`](https://pptr.dev/#?product=Puppeteer\&show=api-pageevaluatepagefunction-args) function to run the script in the context of the browser and the return value is passed back to Node.js. Keep this in mind. ## Final word[​](#final-word "Direct link to Final word") Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! ## What's next[​](#whats-next "Direct link to What's next") * Check out the [Apify SDK](https://docs.apify.com/sdk) and its [Getting started](https://docs.apify.com/sdk/js/docs/guides/apify-platform) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. * [Take a deep dive into Actors](https://docs.apify.com/platform/actors.md), from how they work to [publishing](https://docs.apify.com/platform/actors/publishing.md) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. * Found out you're not into the coding part but would still like to use Apify Actors? Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/contact-sales) from an Apify-certified developer. **Learn how to scrape a website using Apify's Puppeteer Scraper.
Build an Actor's page function, extract information from a web page and download your data.** *** --- # This scraping tutorial will go into the nitty-gritty details of extracting data from **Apify Store** using **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)). If you arrived here from the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, great! You are ready to continue where we left off. If you haven't seen the Getting started yet, check it out; it will help you learn about Apify and scraping in general and set you up for this tutorial, because this one builds on topics and code examples discussed there. ## Getting to know our tools[​](#getting-to-know-our-tools "Direct link to Getting to know our tools") In the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started) tutorial, we've confirmed that the scraper works as expected, so now it's time to add more data to the results. To do that, we'll be using the [jQuery library](https://jquery.com/), because it provides some nice tools and a lot of people familiar with JavaScript already know how to use it. > [Check out the jQuery docs](https://api.jquery.com/) if you're not familiar with it. And if you don't want to use it, that's okay. Everything can be done using pure JavaScript, too. To add jQuery, all we need to do is turn on **Inject jQuery** under the **Input and options** tab. This will add a `context.jQuery` function that you can use. Now that's out of the way, let's open one of the Actor detail pages in the Store, for example the [Web Scraper](https://apify.com/apify/web-scraper) page and use our DevTools-Fu to scrape some data. ## Building our Page function[​](#building-our-page-function "Direct link to Building our Page function") Before we start, let's do a quick recap of the data we chose to scrape: 1. **URL** - The URL that goes directly to the Actor's detail page. 2. **Unique identifier** - Such as **apify/web-scraper**. 3. **Title** - The title visible in the Actor's detail page. 4. **Description** - The Actor's description. 5. **Last modification date** - When the Actor was last modified. 6. **Number of runs** - How many times the Actor was run. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/scraping-practice.webp) We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](https://docs.apify.com/academy/apify-scrapers/getting-started.md) tutorial, so let's get to the next one on the list: title. ### Title[​](#title "Direct link to Title") ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp) By using the element selector tool, we find out that the title is there under an `<h1>` tag, as titles should be. Maybe surprisingly, we find that there are actually two `<h1>` tags on the detail page. This should get us thinking. Is there any parent element that includes our `<h1>` tag, but not the other ones? Yes, there is! A `<header>` element that we can use to select only the heading we're interested in. > Remember that you can press CTRL+F (CMD+F) in the Elements tab of DevTools to open the search bar where you can quickly search for elements using their selectors. And always make sure to use the DevTools to verify your scraping process and assumptions. It's faster than changing the crawler code all the time. To get the title we need to find it using a `header h1` selector, which selects all `<h1>` elements that have a `<header>` ancestor.
And as we already know, there's only one. ``` // Using jQuery. async function pageFunction(context) { const { jQuery: $ } = context; // ... rest of the code return { title: $('header h1').text(), }; } ``` ### Description[​](#description "Direct link to Description") Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<span>` tag, because there are a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/description.webp) ``` async function pageFunction(context) { const { jQuery: $ } = context; // ... rest of the code return { title: $('header h1').text(), description: $('header span.actor-description').text(), }; } ``` ### Modified date[​](#modified-date "Direct link to Modified date") The DevTools tell us that the `modifiedDate` can be found in a `<time>` element. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/modified-date.webp) ``` async function pageFunction(context) { const { jQuery: $ } = context; // ... rest of the code return { title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), }; } ``` It might look a little too complex at first glance, but let us walk you through it. We find the `<time>` element inside `ul.ActorHeader-stats` and read its `datetime` attribute, because that's where a unix timestamp is stored as a `string`. But we would much rather see a readable date in our results, not a unix timestamp, so we need to convert it. Unfortunately, the `new Date()` constructor will not accept a numeric timestamp passed as a `string`, so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`. Phew! ### Run count[​](#run-count "Direct link to Run count") And so we're finishing up with the `runCount`. There's no specific element like `<time>` for the run count, so we need to create a complex selector and then do a transformation on the result. ``` async function pageFunction(context) { const { jQuery: $ } = context; // ... rest of the code return { title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } ``` The `ul.ActorHeader-stats > li:nth-of-type(3)` looks complicated, but it only reads that we're looking for a `<ul>` element with the class `ActorHeader-stats` and within that element we're looking for the third `<li>` element. We grab its text, but we're only interested in the number of runs. We parse the number out using a regular expression, but its type is still a `string`, so we finally convert the result to a `number` by wrapping it with a `Number()` call. > The numbers are formatted with commas as thousands separators (e.g. `'1,234,567'`), so to extract it, we first use the regular expression `/[\d,]+/` - it will search for consecutive digit or comma characters. Then we extract the match via `.match(/[\d,]+/)[0]` and finally remove all the commas by calling `.replace(/,/g, '')`. We need to use `/,/g` with the global modifier to support large numbers with multiple separators; without it, we would replace only the very first occurrence.
> > This will give us a string (e.g. `'1234567'`) that can be converted via the `Number` function. ### Wrapping it up[​](#wrapping-it-up "Direct link to Wrapping it up") And there we have it! All the data we needed in a single object. For the sake of completeness, let's add the properties we parsed from the URL earlier and we're good to go. ``` async function pageFunction(context) { const { request, jQuery: $ } = context; const { url } = request; // ... rest of the code const uniqueIdentifier = url.split('/').slice(-2).join('/'); return { url, uniqueIdentifier, title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } ``` All we need to do now is add this to our `pageFunction`: ``` async function pageFunction(context) { // use jQuery as $ const { request, log, skipLinks, jQuery: $ } = context; if (request.userData.label === 'START') { log.info('Store opened!'); // Do some stuff later. } if (request.userData.label === 'DETAIL') { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); return { url, uniqueIdentifier, title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } } ``` ### Test run[​](#test-run "Direct link to Test run") As always, try hitting that **Save & Run** button and visit the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped. You nailed it! ## Pagination[​](#pagination "Direct link to Pagination") Pagination is a term that represents "going to the next page of results". You may have noticed that we did not actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors, one needs to click the **Show more** button at the very bottom of the list. This is pagination. > This is a typical form of JavaScript pagination, sometimes called infinite scroll. Other pages may use links that take you to the next page. If you encounter those, make a **Pseudo URL** for those links and they will be automatically enqueued to the request queue. Use a label to let the scraper know what kind of URL it's processing. ### Waiting for dynamic content[​](#waiting-for-dynamic-content "Direct link to Waiting for dynamic content") Before we talk about paginating, we need to have a quick look at dynamic content. Since Apify Store is a JavaScript application (a popular approach), the button might not exist in the page when the scraper runs the `pageFunction`. How is this possible? Because the scraper only waits for the page to load its HTML before executing the `pageFunction`. If there's additional JavaScript that modifies the DOM afterwards, the `pageFunction` may execute before this JavaScript has had time to run. At first, you may think that the scraper is broken, but it just cannot wait for all the JavaScript in the page to finish executing. For a lot of pages, there's always some JavaScript executing or some network requests being made. It would never stop waiting. It is therefore up to you, the programmer, to wait for the elements you need.
#### The `context.waitFor()` function[​](#the-contextwaitfor-function "Direct link to the-contextwaitfor-function") `waitFor()` is a function that's available on the `context` object passed to the `pageFunction` and helps you with, well, waiting for stuff. It accepts either a number of milliseconds to wait, a selector to await in the page, or a function to execute. It will stop waiting once the time elapses, the selector appears or the provided function returns `true`. ``` // Waits for 2 seconds. await waitFor(2000); // Waits until an element with id "my-id" appears // in the page. await waitFor('#my-id'); // Waits until a "myObject" variable appears // on the window object. await waitFor(() => !!window.myObject); ``` The selector may never be found and the function might never return `true`, so the `waitFor()` function also has a timeout. The default is `20` seconds. You can override it by providing an options object as the second parameter, with a `timeoutMillis` property. ``` await waitFor('.bad-class', { timeoutMillis: 5000 }); ``` With those tools, you should be able to handle any dynamic content the website throws at you. ### How to paginate[​](#how-to-paginate "Direct link to How to paginate") After going through the theory, let's design the algorithm: 1. Wait for the **Show more** button. 2. Click it. 3. Is there another **Show more** button? * Yes? Repeat from 1. (loop) * No? We're done. We have all the Actors. #### Waiting for the button[​](#waiting-for-the-button "Direct link to Waiting for the button") Before we can wait for the button, we need to know its unique selector. A quick look in the DevTools tells us that the button's class is some weird randomly generated string, but fortunately, there's an enclosing `<div>` with a class of `show-more`. Great! Our unique selector: ``` div.show-more > button ``` > Don't forget to confirm our assumption in the DevTools finder tool (CTRL/CMD + F). ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/waiting-for-the-button.webp) Now that we know what to wait for, we plug it into the `waitFor()` function. ``` await waitFor('div.show-more > button'); ``` #### Clicking the button[​](#clicking-the-button "Direct link to Clicking the button") We have a unique selector for the button and we know that it's already rendered in the page. Clicking it is a piece of cake. We'll use jQuery again, but feel free to use plain JavaScript; it works the same. ``` $('div.show-more > button').click(); ``` This will show the next page of Actors. #### Repeating the process[​](#repeating-the-process "Direct link to Repeating the process") We've shown two function calls, but how do we make this work together in the `pageFunction`? ``` async function pageFunction(context) { // ... let timeoutMillis; // undefined const buttonSelector = 'div.show-more > button'; for (;;) { log.info('Waiting for the "Show more" button.'); try { // Default timeout first time. await waitFor(buttonSelector, { timeoutMillis }); // 2 sec timeout after the first. timeoutMillis = 2000; } catch (err) { // Ignore the timeout error. log.info('Could not find the "Show more" button, ' + 'we\'ve reached the end.'); break; } log.info('Clicking the "Show more" button.'); $(buttonSelector).click(); } // ... } ``` We want to run this until the `waitFor()` function throws, so that's why we use an infinite `for (;;)` loop. We're also not interested in the error, because we're expecting it, so we ignore it and print a log message instead.
You might be wondering what's up with the `timeoutMillis`. Well, for the first page load, we want to wait longer, so that all the page's JavaScript has had a chance to execute, but for the other iterations, the JavaScript is already loaded and we're waiting for the page to re-render, so waiting for `2` seconds is enough to confirm that the button is not there. We don't want to stall the scraper for `20` seconds just to make sure that there's no button. ### Plugging it into the pageFunction[​](#plugging-it-into-the-pagefunction "Direct link to Plugging it into the pageFunction") We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`. Remember the `// Do some stuff later` comment? Let's replace it. And don't forget to destructure the `waitFor()` function on the first line. ``` async function pageFunction(context) { const { request, log, skipLinks, jQuery: $, waitFor, } = context; if (request.userData.label === 'START') { log.info('Store opened!'); let timeoutMillis; // undefined const buttonSelector = 'div.show-more > button'; for (;;) { log.info('Waiting for the "Show more" button.'); try { // Default timeout first time. await waitFor(buttonSelector, { timeoutMillis }); // 2 sec timeout after the first. timeoutMillis = 2000; } catch (err) { // Ignore the timeout error. log.info('Could not find the "Show more" button, ' + 'we\'ve reached the end.'); break; } log.info('Clicking the "Show more" button.'); $(buttonSelector).click(); } } if (request.userData.label === 'DETAIL') { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); return { url, uniqueIdentifier, title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } } ``` That's it! You can now remove the **Max pages per run** limit, **Save & Run** your task and watch the scraper paginate through all the Actors and then scrape all of their data. After it succeeds, open the **Dataset** tab again and click on **Preview**. You should have a table of all the Actors' details in front of you. If you do, great job! You've successfully scraped Apify Store. And if not, no worries, go through the code examples again, it's probably just a typo. ![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/plugging-it-into-the-pagefunction.webp) ## Downloading the scraped data[​](#downloading-the-scraped-data "Direct link to Downloading the scraped data") You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it. > If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**. ### Clean items[​](#clean-items "Direct link to Clean items") You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields such as `#debug`, where you can find a variety of information that can help you with debugging your scrapers.
Clean items, on the other hand, include only the data you returned from the `pageFunction`. If you're only interested in the data you scraped, this format is what you will be using most of the time. To control this, open the **Advanced options** view on the **Dataset** tab. ## Bonus: Making your code neater[​](#bonus-making-your-code-neater "Direct link to Bonus: Making your code neater") You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier time maintaining or extending your task, feel free to define other functions inside the `pageFunction` that encapsulate all the different logic. You can, for example, define a function for each of the different pages: ``` async function pageFunction(context) { switch (context.request.userData.label) { case 'START': return handleStart(context); case 'DETAIL': return handleDetail(context); default: throw new Error('Unknown request label.'); } async function handleStart({ log, waitFor, jQuery: $ }) { log.info('Store opened!'); let timeoutMillis; // undefined const buttonSelector = 'div.show-more > button'; for (;;) { log.info('Waiting for the "Show more" button.'); try { // Default timeout first time. await waitFor(buttonSelector, { timeoutMillis }); // 2 sec timeout after the first. timeoutMillis = 2000; } catch (err) { // Ignore the timeout error. log.info('Could not find the "Show more" button, ' + 'we\'ve reached the end.'); break; } log.info('Clicking the "Show more" button.'); $(buttonSelector).click(); } } async function handleDetail({ request, log, skipLinks, jQuery: $, }) { const { url } = request; log.info(`Scraping ${url}`); await skipLinks(); // Do some scraping. const uniqueIdentifier = url .split('/') .slice(-2) .join('/'); return { url, uniqueIdentifier, title: $('header h1').text(), description: $('header span.actor-description').text(), modifiedDate: new Date( Number( $('ul.ActorHeader-stats time').attr('datetime'), ), ), runCount: Number( $('ul.ActorHeader-stats > li:nth-of-type(3)') .text() .match(/[\d,]+/)[0] .replace(/,/g, ''), ), }; } } ``` > If you're confused by the functions being declared below the code that calls them, it's called hoisting and it's a feature of JavaScript. It helps you put what matters on top, if you so desire. ## Final word[​](#final-word "Direct link to Final word") Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)! ## What's next[​](#whats-next "Direct link to What's next") * Check out the [Apify SDK](https://docs.apify.com/sdk) and its [Getting started](https://docs.apify.com/sdk/js/docs/guides/apify-platform) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking. * [Take a deep dive into Actors](https://docs.apify.com/platform/actors.md), from how they work to [publishing](https://docs.apify.com/platform/actors/publishing.md) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors. * Found out you're not into the coding part but would still like to use Apify Actors?
Check out our [ready-made solutions](https://apify.com/store) or [order a custom Actor](https://apify.com/contact-sales) from an Apify-certified developer. **Learn how to scrape a website using Apify's Web Scraper. Build an Actor's page function, extract information from a web page and download your data.** *** --- # Concepts 🤔 **Learn about some common yet tricky concepts and terms that are used frequently within the academy, as well as in the world of scraper development.** *** You'll see some terms and concepts frequently repeated throughout various courses in the academy. Many of these concepts are common, and even fundamental in the scraping world, which makes it necessary to explain them to our course-takers; however, it would be inconvenient for our readers if we explained these terms each time they appear in a lesson. Because of this slight dilemma, and because there are no outside resources which compile all of these concepts into an educational and digestible form, we've decided to do just that. Welcome to the **Concepts** section of the Apify Academy's **Glossary**! > It's important to note that there is no specific order to these concepts. All of them range in their relevance and importance to your everyday scraping endeavors. --- # CSS selectors CSS selectors are patterns used to select [HTML elements](https://docs.apify.com/academy/concepts/html-elements.md) on a web page. They are used in combination with CSS styles to change the appearance of web pages, and also in JavaScript to access and manipulate the elements on a web page. > Querying of CSS selectors with JavaScript is done using [query selector functions](https://docs.apify.com/academy/concepts/querying-css-selectors.md). ## Common types of CSS selectors[​](#common-types-of-css-selectors "Direct link to Common types of CSS selectors") Some of the most common types of CSS selectors are: ### Element selector[​](#element-selector "Direct link to Element selector") This is used to select elements by their tag name. For example, to select all `<p>` elements, you would use the `p` selector. ``` const paragraphs = document.querySelectorAll('p'); ``` ### Class selector[​](#class-selector "Direct link to Class selector") This is used to select elements by their class attribute. For example, to select all elements with the class of `highlight`, you would use the `.highlight` selector. ``` const highlightedElements = document.querySelectorAll('.highlight'); ``` ### ID selector[​](#id-selector "Direct link to ID selector") This is used to select an element by its `id` attribute. For example, to select an element with the id of `header`, you would use the `#header` selector. ``` const header = document.querySelector('#header'); ``` ### Attribute selector[​](#attribute-selector "Direct link to Attribute selector") This is used to select elements based on the value of an attribute. For example, to select all elements with the attribute `data-custom` whose value is `yes`, you would use the `[data-custom="yes"]` selector. ``` const customElements = document.querySelectorAll('[data-custom="yes"]'); ``` ### Chaining selectors[​](#chaining-selectors "Direct link to Chaining selectors") You can also chain multiple selectors together to select elements more precisely. For example, to select a `<p>` element that also has the class `highlight`, you would use the `p.highlight` selector.
``` const highlightedParagraph = document.querySelector('p.highlight'); ``` ## CSS selectors in web scraping[​](#css-selectors-in-web-scraping "Direct link to CSS selectors in web scraping") CSS selectors are important for web scraping because they allow you to target specific elements on a web page and extract their data. When scraping a web page, you typically want to extract specific pieces of information from the page, such as text, images, or links. CSS selectors allow you to locate these elements on the page, so you can extract the data that you need. For example, if you wanted to scrape a list of all the titles of blog posts on a website, you could use a CSS selector to select all the elements that contain the title text. Once you have selected these elements, you can extract the text from them and use it for your scraping project. Additionally, when web scraping, it is important to understand the structure of the website, and CSS selectors can help you to navigate it. With them, you can select specific elements and their children, siblings, or parent elements. This allows you to extract data that is nested within other elements, or to navigate through the page structure to find the data you need. ## Resources[​](#resources "Direct link to Resources") * Find all the available CSS selectors and their syntax on the [MDN CSS Selectors page](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors). --- # Dynamic pages and single-page applications (SPAs) **Understand what makes a page dynamic, and how a page being dynamic might change your approach when writing a scraper for it.** *** Oftentimes, web pages load additional information dynamically, long after their main body is loaded in the browser. A subset of dynamic pages takes this approach further and loads all of its content dynamically. Websites built this way are called single-page applications (SPAs), and the approach is widespread thanks to some popular JavaScript libraries, such as [React](https://react.dev/) or [Vue](https://vuejs.org/). As you progress in your scraping journey, you'll quickly realize that different websites load their content and populate their pages with data in different ways. Some pages are rendered entirely on the server, some retrieve the data dynamically, and some use a combination of both those methods. ## How page loading works[​](#about-page-loading "Direct link to How page loading works") The process of loading a page involves three main events, each with a designated corresponding name: 1. `DOMContentLoaded` - The initial HTML document is loaded, which contains the HTML as it was rendered on the website's server. It also includes all of the JavaScript which will be run in the next step. 2. `load` - The page's JavaScript is executed. 3. `networkidle` - Network [XHR/Fetch requests](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest) are sent and loaded, and data from these requests is populated onto the page. Many websites load essential data this way. These requests might be sent upon certain page events as well (not just the first load), such as scrolling or clicking. Now that we have a solid understanding of the different stages of page loading, and the order they happen in, we can fully understand what a dynamic page is.
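When you control a headless browser, you can wait for these stages explicitly before extracting anything. Here's a minimal sketch using Playwright (purely illustrative — this glossary entry isn't tied to any particular tool, and the URL is a placeholder); note that Playwright's `waitUntil` values map directly onto the events listed above:

```
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Resolves as soon as the initial HTML document has been parsed.
await page.goto('https://example.com', { waitUntil: 'domcontentloaded' });

// Wait until the network has been (nearly) idle, i.e. the XHR/Fetch requests
// triggered by the page's JavaScript have finished loading data.
await page.waitForLoadState('networkidle');

const html = await page.content();
await browser.close();
```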
## What is dynamic content[​](#what-is-dynamic-content "Direct link to What is dynamic content") Dynamic content is any content that is rendered **after** the `DOMContentLoaded` event, which means any content loaded by JavaScript during the `load` event, or after any network XHR/Fetch requests have been made. Sometimes, it can be quite obvious when content is dynamically being rendered. For example, take a look at this gif: ![Image](https://blog.apify.com/content/images/2022/02/dynamicLoading-1--1--2.gif) Here, it's very clear that new content is being generated. As we scroll down the Twitter feed, we can see the scroll bar jumping back up, signifying that more elements have been created using JavaScript. Other times, it's less obvious though. Content can appear to be static (non-dynamic) when it is not, or even sometimes the other way around. --- # HTML elements An HTML element is a building block of an HTML document. It is used to represent a piece of content on a web page, such as text, images, or videos. Each element is defined by a tag, which is a set of characters enclosed in angle brackets, such as `<p>`, `<img>`, or `<a>`. For example, this is a paragraph element: ``` <p>This is a paragraph of text.</p> ``` You can also add **attributes** to an element to provide additional information or to control how the element behaves. For example, the `src` attribute is used to specify the source of an image, like this: ``` <img src="image.png"> ``` In JavaScript, you can use the **DOM** (Document Object Model) to interact with elements on a web page. For example, you can use the [`querySelector()` method](https://docs.apify.com/academy/concepts/querying-css-selectors.md) to select an element by its [CSS selector](https://docs.apify.com/academy/concepts/css-selectors.md), like this: ``` const myElement = document.querySelector('#myId'); ``` You can also use the `getElementById()` method to select an element by its `id`, like this: ``` const myElement = document.getElementById('myId'); ``` You can also use the `getElementsByTagName()` method to select all elements of a certain type, like this: ``` const myElements = document.getElementsByTagName('p'); ``` Once you have selected an element, you can use JavaScript to change its content, style, or behavior. In summary, an HTML element is a building block of a web page. It is defined by a **tag** with **attributes**, which provide additional information or control how the element behaves. You can use the **DOM** (Document Object Model) to interact with elements on a web page. --- # HTTP cookies **Learn a bit about what cookies are, and how they are utilized in scrapers to appear logged-in, view specific data, or even avoid blocking.** *** HTTP cookies are small pieces of data sent by the server to the user's web browser, which are typically stored by the browser and used to send later requests to the same server. Cookies are usually represented as a string (if used together with a plain HTTP request) and sent with the request under the **Cookie** [header](https://docs.apify.com/academy/concepts/http-headers.md). ## Most common uses of cookies in crawlers[​](#uses-in-crawlers "Direct link to Most common uses of cookies in crawlers") 1. To make the website show data to you as if you were a logged-in user. 2. To make the website show location-specific data (works for websites where you could set a zip code or country directly on the page, but unfortunately doesn't work for some location-based ads). 3.
To make the website less suspicious of the crawler and let the crawler's traffic blend in with regular user traffic. For local testing, we recommend using the [**EditThisCookie**](https://chrome.google.com/webstore/detail/fngmhnnpilhplaeedifhccceomclgfbg) Chrome extension. --- # HTTP headers **Understand what HTTP headers are, what they're used for, and three of the biggest differences between HTTP/1.1 and HTTP/2 headers.** *** [HTTP headers](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers) let the client and the server pass additional information with an HTTP request or response. Headers are represented by an object where the keys are header names. Headers can also contain certain authentication tokens. In general, there are 4 different paths you'll find yourself on when scraping a website and dealing with headers: ## No headers[​](#no-headers "Direct link to No headers") For some websites, you won't need to worry about modifying headers at all, as there are no checks or verifications in place. ## Some default headers required[​](#needs-default-headers "Direct link to Some default headers required") Some websites will require certain default browser headers to work properly, such as **User-Agent** (though, this header is becoming more obsolete, as there are more sophisticated ways to detect and block a suspicious user). Another example of such a "default" header is **Referer**. Some e-commerce websites might share the same platform, and data is loaded through XMLHttpRequests to that platform, which would not know which data to return without knowing which exact website is requesting it. ## Custom headers required[​](#needs-custom-headers "Direct link to Custom headers required") A custom header is a non-standard HTTP header used for a specific website. For example, an imaginary website of **cool-stuff.com** might have a header with the name **X\_Cool\_Stuff\_Token** which is required for every single request to a product page. Dealing with cases like these usually isn't difficult, but can sometimes be tedious. ## Very specific headers required[​](#needs-specific-headers "Direct link to Very specific headers required") The most challenging websites to scrape are the ones that require a full set of site-specific headers to be included with the request. For example, not only would they potentially require proper **User-Agent** and **Referer** headers mentioned above, but also **Accept**, **Accept-Language**, **Accept-Encoding**, etc. with specific values. Another big one to mention is the **Cookie** header. We cover this in more detail within the [cookies](https://docs.apify.com/academy/concepts/http-cookies.md) lesson. You could use Chrome DevTools to inspect request headers, and [Insomnia](https://docs.apify.com/academy/tools/insomnia.md) or [Postman](https://docs.apify.com/academy/tools/postman.md) to test how the website behaves with or without specific headers. ## HTTP/1.1 vs HTTP/2 headers[​](#http1-vs-http2 "Direct link to HTTP/1.1 vs HTTP/2 headers") HTTP/1.1 and HTTP/2 headers have several differences. Here are the three key differences that you should be aware of: 1. HTTP/2 headers do not include status messages. They only contain status codes. 2. Certain headers are no longer used in HTTP/2 (such as **Connection** along with a few others related to it like **Keep-Alive**). In HTTP/2, connection-specific headers are prohibited. While some browsers will ignore them, Safari and other Webkit-based browsers will outright reject any response that contains them. 
Easy to do by accident, and a big problem. 3. While HTTP/1.1 headers are case-insensitive and could be sent by the browsers with capitalized letters (e.g. **Accept-Encoding**, **Cache-Control**, **User-Agent**), HTTP/2 headers must be lower-cased (e.g. **accept-encoding**, **cache-control**, **user-agent**). > To learn more about the difference between HTTP/1.1 and HTTP/2 headers, check out [this](https://httptoolkit.com/blog/translating-http-2-into-http-1/) article. --- # Querying elements `document.querySelector()` and `document.querySelectorAll()` are JavaScript functions that allow you to select elements on a web page using [CSS selectors](https://docs.apify.com/academy/concepts/css-selectors.md). `document.querySelector()` is used to select the first element that matches the provided [CSS selector](https://docs.apify.com/academy/concepts/css-selectors.md). It returns the first matching element or null if no matching element is found. Here's an example of how you can use it: ``` const firstButton = document.querySelector('button'); ``` This will select the first button element on the page and store it in the variable **firstButton**. `document.querySelectorAll()` is used to select all elements that match the provided CSS selector. It returns a `NodeList` (a collection of elements) that can be accessed and manipulated like an array. Here's an example of how you can use it: ``` const buttons = document.querySelectorAll('button'); ``` This will select all button elements on the page and store them in the variable **buttons**. Both functions can be used to access and manipulate the elements in the web page. Here's an example of how you can use them to extract the text of all buttons: ``` const buttons = document.querySelectorAll('button'); // Spread the NodeList into an array so we can map over it. const buttonTexts = [...buttons].map((button) => button.textContent); ``` It's important to note that `querySelectorAll()` returns a static `NodeList`, which means it will not update when the DOM changes. If you need a live collection that reflects DOM changes, use a method such as `getElementsByTagName()`, which returns a live `HTMLCollection`. --- # What is robotic process automation (RPA)? **Learn the basics of robotic process automation. Make your processes on the web and other software more efficient by automating repetitive tasks.** *** RPA allows you to create software (also known as **bots**), which can imitate your digital actions. You can program bots to perform repetitive tasks faster, more reliably and more accurately than humans. Plus, they can do these tasks all day, every day. ## What can I use RPA for?[​](#what-can-i-use-rpa-for "Direct link to What can I use RPA for?") You can [use](https://apify.com/use-cases/rpa) RPA to automate any repetitive task you perform using software. The tasks can range from [analyzing content](https://apify.com/jakubbalada/content-checker) to monitoring web pages for changes (such as changes in your competitors' pricing). Other use cases for RPA include filling forms or [uploading files](https://apify.com/lukaskrivka/google-sheets) while you get on with more important tasks. And it's not just simple tasks you can automate. How about [processing your invoices](https://apify.com/katerinahronik/toggl-invoice-download) or posting content across several marketing channels at once? ## How does RPA work?[​](#how-does-rpa-work "Direct link to How does RPA work?") In a traditional automation workflow, you 1. Break a repetitive process down into [manageable chunks](https://kissflow.com/workflow/workflow-automation/an-8-step-checklist-to-get-your-workflow-ready-for-automation/), e.g.
open website => log into website => click button "X" => download section "Y", etc. 2. Program a bot that does each of those chunks. 3. Execute the chunks of code in the right order (or in parallel). With the advance of [machine learning](https://en.wikipedia.org/wiki/Machine_learning), it is becoming possible to [record](https://www.nice.com/info/rpa-guide/process-recorder-function-in-rpa/) your workflows and analyze which can be automated. However, this technology is still not perfected and at times can even be less practical than the manual process. ## Is RPA the same as web scraping?[​](#is-rpa-the-same-as-web-scraping "Direct link to Is RPA the same as web scraping?") While [web scraping](https://docs.apify.com/academy/web-scraping-for-beginners.md) is a kind of RPA, it focuses on extracting structured data. RPA focuses on the other tasks in browsers - everything except for extracting information. ## Additional resources[​](#additional-resources "Direct link to Additional resources") An easy-to-follow [video](https://www.youtube.com/watch?v=9URSbTOE4YI) on what RPA is. To learn about RPA in plain English, check out [this](https://enterprisersproject.com/article/2019/5/rpa-robotic-process-automation-how-explain) article. [This](https://www.cio.com/article/227908/what-is-rpa-robotic-process-automation-explained.html) article explains what RPA is and discusses both its advantages and disadvantages. You might also like to check out this article on [12 Steps to Automate Workflows](https://quandarycg.com/automating-workflows/). --- # Deploying your code to Apify **In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor.** *** This section will discuss how to use your newfound knowledge of the Apify platform and Actors from the [**Getting started**](https://docs.apify.com/academy/getting-started.md) section to deploy your existing project's code to the Apify platform as an Actor. Any program running in a Docker container can become an Apify Actor. ![The deployment workflow](/assets/images/deployment-workflow-72f8b289e512701951e27c687a932dfa.png) Apify provides detailed guidance on how to deploy Node.js and Python programs as Actors, but apart from that you're not limited in what programming language you choose for your scraper. ![Supported languages](/assets/images/supported-languages-2b3aced02908c1def900dbace072201a.jpg) Here are a few examples of Actors in other languages: * [Rust Actor](https://apify.com/lukaskrivka/rust-actor-example) * [Go Actor](https://apify.com/jirimoravcik/go-actor-example) * [Julia Actor](https://apify.com/jirimoravcik/julia-actor-example) ## The "actorification" workflow[​](#workflow "Direct link to The \"actorification\" workflow") Follow these four main steps to turn a piece of code into an Actor: 1. Handle [accepting inputs and writing outputs](https://docs.apify.com/academy/deploying-your-code/inputs-outputs.md). 2. Create an [input schema](https://docs.apify.com/academy/deploying-your-code/input-schema.md) **(optional)**. 3. Add a [Dockerfile](https://docs.apify.com/academy/deploying-your-code/docker-file.md). 4. [Deploy](https://docs.apify.com/academy/deploying-your-code/deploying.md) to the Apify platform! 
## Our example project[​](#our-example-project "Direct link to Our example project") For this section, we'll be turning this example project into an Actor: * JavaScript * Python ``` // index.js const addAllNumbers = (...nums) => nums.reduce((total, curr) => total + curr, 0); console.log(addAllNumbers(1, 2, 3, 4)); // -> 10 ``` ``` # index.py def add_all_numbers (nums): total = 0 for num in nums: total += num return total print(add_all_numbers([1, 2, 3, 4])) # -> 10 ``` > For all lessons in this section, we'll have examples for both Node.js and Python so that you can follow along in either language. ## Next up[​](#next "Direct link to Next up") [Next lesson](https://docs.apify.com/academy/deploying-your-code/inputs-outputs.md), we'll be learning how to accept input into our Actor as well as deliver output. --- # Creating dataset schema **Learn how to generate an appealing Overview table interface to preview your Actor results in real time on the Apify platform.** *** The dataset schema generates an interface that enables users to instantly preview their Actor results in real time. ![Dataset Schema](/assets/images/output-schema-example-42bf91c1c1f39834fad5bbedf209acaa.png) In this quick tutorial, you will learn how to set up an output tab for your own Actor. ## Implementation[​](#implementation "Direct link to Implementation") Firstly, create a `.actor` folder in the root of your Actor's source code. Then, create a `actor.json` file in this folder, after which you'll have .actor/actor.json. ![.actor/actor.json](/assets/images/actor-json-example-7f3c312c187b9f6f86879594a769f35f.webp) Next, copy-paste the following template code into your `actor.json` file. ``` { "actorSpecification": 1, "name": "___ENTER_ACTOR_NAME____", "title": "___ENTER_ACTOR_TITLE____", "version": "1.0.0", "storages": { "dataset": { "actorSpecification": 1, "views": { "overview": { "title": "Overview", "transformation": { "fields": [ "___EXAMPLE_NUMERIC_FIELD___", "___EXAMPLE_PICTURE_URL_FIELD___", "___EXAMPLE_LINK_URL_FIELD___", "___EXAMPLE_TEXT_FIELD___", "___EXAMPLE_BOOLEAN_FIELD___" ] }, "display": { "component": "table", "properties": { "___EXAMPLE_NUMERIC_FIELD___": { "label": "ID", "format": "number" }, "___EXAMPLE_PICTURE_URL_FIELD___": { "format": "image" }, "___EXAMPLE_LINK_URL_FIELD___": { "label": "Clickable link", "format": "link" } } } } } } } } ``` To configure the dataset schema, replace the fields in the template with the relevant fields to your Actor. For reference, you can use the [Zappos Scraper source code](https://github.com/PerVillalva/zappos-scraper-actor/blob/main/.actor/actor.json) as an example of how the final implementation of the output tab should look in a live Actor. 
``` { "actorSpecification": 1, "name": "zappos-scraper", "title": "Zappos Scraper", "description": "", "version": "1.0.0", "storages": { "dataset": { "actorSpecification": 1, "title": "Zappos.com Dataset", "description": "", "views": { "products": { "title": "Overview", "description": "It can take about one minute until the first results are available.", "transformation": { "fields": [ "imgUrl", "brand", "name", "SKU", "inStock", "onSale", "price", "url" ] }, "display": { "component": "table", "properties": { "imgUrl": { "label": "Product image", "format": "image" }, "url": { "label": "Link", "format": "link" }, "brand": { "format": "text" }, "name": { "format": "text" }, "SKU": { "format": "text" }, "inStock": { "format": "boolean" }, "onSale": { "format": "boolean" }, "price": { "format": "text" } } } } } } } } ``` Note that the fields specified in the dataset schema should match the object keys of your resulting dataset. Also, if your desired label has the same name as the defined object key, then you don't need to specify a label name. The schema will, by default, show a capitalized version of the key and even split camel case into separate words and capitalize all of them. The matching object for the Zappos Scraper shown in the example above will look something like this: ``` const results = { url: request.loadedUrl, imgUrl: $('#stage button[data-media="image"] img[itemprop="image"]').attr('src'), brand: $('span[itemprop="brand"]').text().trim(), name: $('meta[itemprop="name"]').attr('content'), SKU: $('*[itemprop~="sku"]').text().trim(), inStock: !request.url.includes('oosRedirected=true'), onSale: !$('div[itemprop="offers"]').text().includes('OFF'), price: $('span[itemprop="price"]').text(), }; ``` ## Final result[​](#final-result "Direct link to Final result") Great! Now that everything is set up, it's time to run the Actor and admire your Actor's brand new output tab. > Need some extra guidance? Visit the [dataset schema documentation](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md) for more detailed information about how to implement this feature. A few seconds after running the Actor, you should see its results displayed in the `Overview` table. ![Output table overview](/assets/images/output-schema-final-example-0beffd41c710a5438a8fe1c4a72f0f07.webp) ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/deploying-your-code/docker-file.md), we'll learn about a very important file that is required for our project to run on the Apify platform - the Dockerfile. --- # Publishing your Actor **Push local code to the platform, or create a new Actor on the console and integrate it with a Git repository to optionally automatically rebuild any new changes.** *** Once you've **actorified** your code, there are two ways to deploy it to the Apify platform. You can either push the code directly from your local machine onto the platform, or you can create a blank Actor in the web interface, and then integrate its source code with a GitHub repository. ## With a Git repository[​](#with-git-repository "Direct link to With a Git repository") Before we deploy our project onto the Apify platform, let's ensure that we've pushed the changes we made in the last 3 lessons into our remote GitHub repository. > The benefit of using this method is that any time you push to the Git repository, the code on the platform is also updated and the Actor is automatically rebuilt. 
Also, you don't have to use a GitHub repository - you can use GitLab or any other service you'd like. ### Creating the Actor[​](#creating-the-actor "Direct link to Creating the Actor") Before anything can be integrated, we've gotta create a new Actor. Let's head over to our [Apify Console](https://console.apify.com?asrc=developers_portal), navigate to the **Development** subsection and click on the **Develop new** button, then select the **Empty** template. ![Create new button](/assets/images/develop-new-actor-a499c8a2618fec73c828ddb4dcbb75b4.png) ### Changing source code location[​](#change-source-code "Direct link to Changing source code location") In the **Source** tab on the new Actor's page, we'll click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**. ![Select source code location](/assets/images/select-source-location-8b84116417145746c275463c49e24baa.png) Now we'll paste the link to our GitHub repository into the **Git URL** text field and click **Save**. ### Adding the webhook to the repository[​](#adding-repository-webhook "Direct link to Adding the webhook to the repository") The final step is to click on **API** in the top right corner of our Actor's page: ![API button](/assets/images/api-button-4384acadb7883bbad6c7f363c0c1a37c.jpg) And scroll through all of the links until we find the **Build Actor** API endpoint. Now we'll copy this endpoint's URL, head back over to our GitHub repository and navigate to **Settings > Webhooks > Add webhook**. The final thing to do is to paste the URL and save the webhook. ![Adding a webhook to your GitHub repository](/assets/images/ci-github-integration-2ee82ac772eb3280155b7027a4259528.png) That's it! The Actor should now pull its source code from the repository and automatically build. ## Without a GitHub repository (using the Apify CLI)[​](#with-apify-cli "Direct link to Without a GitHub repository (using the Apify CLI)") > If you don't yet have the Apify CLI, learn how to install it and log in by following along with [this brief lesson](https://docs.apify.com/academy/tools/apify-cli.md) about it. If you're logged in to the Apify CLI, the `apify push` command can be used to push the code straight onto the Apify platform from your local machine (no GitHub repository required), where it will automatically be built for you. Prior to running this command, make sure that you have an **.actor/actor.json** file at the root of the project. If you don't already have one, you can use `apify init .` to automatically generate one for you. One important thing to note is that you can use a `.gitignore` file to exclude files from being pushed. When you use `apify push` without a `.gitignore`, the full folder contents will be pushed, meaning that even the **storage** and **node\_modules** will be pushed. These files are unnecessary to push, as they are both generated on the platform. > The `apify push` command should only really be used for quickly pushing and testing Actors on the platform during development. If you are ready to make your Actor public, use a Git repository instead, as you will reap the benefits of using Git and others will be able to contribute to the project. ## Deployed\![​](#deployed "Direct link to Deployed!") Great! Once you've pushed your Actor to the platform, you will find it listed under the **Actors** tab. When using the `apify push` command, you will have access to the multifile editor. 
For details about using the multifile editor, refer to [Creating Actors](https://docs.apify.com/academy/getting-started/creating-actors.md#web-ide). ![Deployed Actor on the Apify platform](/assets/images/actor-page-e3c2002c5e585e896614af6e3e38838e.jpg) The next step is to test your Actor and experiment with the vast amount of features the platform has to offer. ## Wrap up[​](#next "Direct link to Wrap up") That's it! In this short section, you've learned how to take your code written in any programming language and turn it into a usable Actor that can run on the Apify platform! The next step is to start looking into the [paid Actors](https://docs.apify.com/platform/actors/publishing.md) program, which allows you to monetize your work. --- # Creating Actor Dockerfile **Understand how to write a Dockerfile (Docker image blueprint) for your project so that it can be run within a Docker container on the Apify platform.** *** The **Dockerfile** is a file which gives the Apify platform (or Docker, more specifically) instructions on how to create an environment for your code to run in. Every Actor must have a Dockerfile, as Actors run in Docker containers. > Actors on the platform are always run in Docker containers; however, they can also be run in local Docker containers. This is not common practice though, as it requires more setup and a deeper understanding of Docker. For testing, it's best to run the Actor on the local OS (this requires you to have the underlying runtime installed, such as Node.js, Python, Rust, GO, etc). ## Base images[​](#base-images "Direct link to Base images") If your project doesn’t already contain a Dockerfile, don’t worry! Apify offers [many base images](https://docs.apify.com/sdk/js/docs/guides/docker-images) that are optimized for building and running Actors on the platform, which can be found on [Docker Hub](https://hub.docker.com/u/apify). When using a language for which Apify doesn't provide a base image, [Docker Hub](https://hub.docker.com/) provides a ton of free Docker images for most use-cases, upon which you can create your own images. > Tip: You can see all of Apify's Docker images [on DockerHub](https://hub.docker.com/u/apify). At the base level, each Docker image contains a base operating system and usually also a programming language runtime (such as Node.js or Python). You can also find images with preinstalled libraries or install them yourself during the build step. Once you find the base image you need, you can add it as the initial `FROM` statement: ``` FROM apify/actor-node:16 ``` > For syntax highlighting in your Dockerfiles, download the [**Docker** VSCode extension](https://code.visualstudio.com/docs/containers/overview#_installation). ## Writing the file[​](#writing-the-file "Direct link to Writing the file") The rest of the Dockerfile is about copying the source code from the local filesystem into the container's filesystem, installing libraries, and setting the `RUN` command (which falls back to the parent image). > If you are not using a base image from Apify, then you should specify how to launch the source code of your Actor with the `CMD` instruction. Here's the Dockerfile for our Node.js example project's Actor: * Node.js Dockerfile * Python Dockerfile ``` FROM apify/actor-node:16 # Second, copy just package.json and package-lock.json since they are the only files # that affect npm install in the next step COPY package*.json ./ # Install npm packages, skip optional and development dependencies to keep the # image small. 
Avoid logging too much and print the dependency tree for debugging RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed npm packages:" \ && (npm list --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "npm version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after npm install, quick build will be really fast # for simple source file changes. COPY . ./ ``` ``` # First, specify the base Docker image. # You can also use any other image from Docker Hub. FROM apify/actor-python:3.9 # Second, copy just requirements.txt into the Actor image, # since it should be the only file that affects "pip install" in the next step, # in order to speed up the build COPY requirements.txt ./ # Install the packages specified in requirements.txt, # Print the installed Python version, pip version # and all installed packages with their versions for debugging RUN echo "Python version:" \ && python --version \ && echo "Pip version:" \ && pip --version \ && echo "Installing dependencies from requirements.txt:" \ && pip install -r requirements.txt \ && echo "All installed Python packages:" \ && pip freeze # Next, copy the remaining files and directories with the source code. # Since we do this after installing the dependencies, quick build will be really fast # for most source file changes. COPY . ./ # Specify how to launch the source code of your Actor. # By default, the main.py file is run CMD python3 main.py ``` ## Examples[​](#examples "Direct link to Examples") The examples above show how to deploy Actors written in Node.js or Python, but you can use any language. As an inspiration, here are a few examples for other languages: Go, Rust, Julia. * GO Actor Dockerfile * Rust Actor Dockerfile * Julia Actor Dockerfile ``` FROM golang:1.17.1-alpine WORKDIR /app COPY . . RUN go mod download RUN go build -o /example-actor CMD ["/example-actor"] ``` ``` # Image with prebuilt Rust. We use the newest 1.* version # https://hub.docker.com/_/rust FROM rust:1 # We copy only package setup so we cache building all dependencies COPY Cargo* ./ # We need to have dummy main.rs file to be able to build RUN mkdir src && echo "fn main() {}" > src/main.rs # Build dependencies only # Since we do this before copying the rest of the files, # the dependencies will be cached by Docker, allowing fast # build times for new code changes RUN cargo build --release # Delete dummy main.rs RUN rm -rf src # Copy rest of the files COPY . ./ # Build the source files RUN cargo build --release CMD ["./target/release/actor-example"] ``` ``` FROM julia:1.7.1-alpine WORKDIR /app COPY . . RUN julia install.jl CMD ["julia", "main.jl"] ``` ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/deploying-your-code/deploying.md), we'll push our code directly to the Apify platform, or create and integrate a new Actor on the Apify platform with our project's GitHub repository. --- # How to write Actor input schema **Learn how to generate a user interface on the platform for your Actor's input with a single file - the INPUT\_SCHEMA.json file.** *** Though writing an [input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md) for an Actor is not a required step, it is most definitely an ideal one. 
The Apify platform will read the **INPUT\_SCHEMA.json** file within the root of your project and generate a user interface for entering input into your Actor, which makes it significantly easier for non-developers (and even developers) to configure and understand the inputs your Actor can receive. Because of this, we'll be writing an input schema for our example Actor. > Without an input schema, the users of our Actor will have to provide the input in JSON format, which can be problematic for those who are not familiar with JSON. ## Schema title & description[​](#title-and-description "Direct link to Schema title & description") In the root of our project, we'll create a file named **INPUT\_SCHEMA.json** and start writing the first part of the schema. ``` { "title": "Adding Actor input", "description": "Add all values in list of numbers with an arbitrary length.", "type": "object", "schemaVersion": 1 } ``` The **title** and **description** describe what the input schema is for, and a bit about what the Actor itself does. ## Properties[​](#properties "Direct link to Properties") In order to define all of the properties our Actor is expecting, we must include them within an object with a key of **properties**. ``` { "title": "Adding Actor input", "description": "Add all values in list of numbers with an arbitrary length.", "type": "object", "schemaVersion": 1, "properties": { "numbers": { "title": "Number list", "description": "The list of numbers to add up." } } } ``` Each property's key corresponds to the name we're expecting within our code, while the **title** and **description** are what the user will see when configuring input on the platform. ## Property types & editor types[​](#property-types "Direct link to Property types & editor types") Within our new **numbers** property, there are two more fields we must specify. Firstly, we must let the platform know that we're expecting an array of numbers with the **type** field. Then, we should also instruct Apify on which UI component to render for this input property. In our case, we have an array of numbers, which means we should use the **json** editor type that we discovered in the ["array" section](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1.md#array) of the input schema documentation. We could also use **stringList**, but then we'd have to parse out the numbers from the strings. ``` { "title": "Adding Actor input", "description": "Add all values in list of numbers with an arbitrary length.", "type": "object", "schemaVersion": 1, "properties": { "numbers": { "title": "Number list", "description": "The list of numbers to add up.", "type": "array", "editor": "json" } } } ``` ## Required fields[​](#required-fields "Direct link to Required fields") The great thing about building an input schema is that it will automatically validate your inputs based on their type, maximum value, minimum value, etc. Sometimes, you want to ensure that the user will always provide input for certain fields, as they are crucial to the Actor's run. This can be done by using the **required** field and passing in the names of the fields you'd like to require. 
``` { "title": "Adding Actor input", "description": "Add all values in list of numbers with an arbitrary length.", "type": "object", "schemaVersion": 1, "properties": { "numbers": { "title": "Number list", "description": "The list of numbers to add up.", "type": "array", "editor": "json" } }, "required": ["numbers"] } ``` For our case, we've made the **numbers** field required, as it is crucial to our Actor's run. ## Final thoughts[​](#final-thoughts "Direct link to Final thoughts") Here is what the input schema we wrote will render on the platform: ![Rendered UI from input schema](/assets/images/rendered-ui-74b1f9f74dce9ba83249f733716a0745.png) Later on, we'll be building more complex input schemas, as well as discussing how to write quality input schemas that allow the user to understand the Actor and not become overwhelmed. It's not expected to memorize all of the fields that properties can take or the different editor types available, which is why it's always good to reference the [input schema documentation](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md) when writing a schema. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md), we'll learn how to generate an appealing Overview table to display our Actor's results in real time, so users can get immediate feedback about the data being extracted. --- # Managing Actor inputs and outputs **Learn to accept input into your Actor, do something with it, and then return output. Actors can be written in any language, so this concept is language agnostic.** *** Most of the time when you're creating a project, you are expecting some sort of input from which your software will run off. Oftentimes as well, you want to provide some sort of output once your software has completed running. Apify provides a convenient way to handle inputs and deliver outputs. An important thing to understand regarding inputs and outputs is that they are read/written differently depending on where the Actor is running: * If your Actor is running locally, the inputs/outputs are usually provided in the filesystem, and environment variables are injected either by you, the developer, or by the Apify CLI by running the project with the `apify run` command. * While running in a Docker container on the platform, environment variables are automatically injected, and inputs & outputs are provided and modified using Apify's REST API. ## A bit about storage[​](#about-storage "Direct link to A bit about storage") You can read/write your inputs/outputs: to the [key-value store](https://docs.apify.com/platform/storage/key-value-store.md), or to the [dataset](https://docs.apify.com/platform/storage/dataset.md). The key-value store can be used to store any sort of unorganized/unrelated data in any format, while the data pushed to a dataset typically resembles a table with columns (fields) and rows (items). Each Actor's run is allocated both a default dataset and a default key-value store. When running locally, these storages are accessible through the **storage** folder within your project's root directory, while on the platform they are accessible via Apify's API. ## Accepting input[​](#accepting-input "Direct link to Accepting input") You can utilize multiple ways to accept input into your project. The option you go with depends on the language you have written your project in. 
If you are using Node.js for your repo's code, you can use the [`apify`](https://www.npmjs.com/package/apify) package. Otherwise, you can use the useful environment variables automatically set up for you by Apify to write utility functions which read the Actor's input and return it. ### Accepting input with the Apify SDK[​](#accepting-input-with-the-apify-sdk "Direct link to Accepting input with the Apify SDK") Since we're using Node.js, let's install the `apify` package by running the following command: ``` npm install apify ``` Now, let's import `Actor` from `apify` and use the `Actor.getInput()` function to grab our input. ``` // index.js import { Actor } from 'apify'; // We must initialize and exit the Actor. The rest of our code // goes in between these two. await Actor.init(); const input = await Actor.getInput(); console.log(input); await Actor.exit(); ``` If we run this right now, we'll see **null** in our terminal - this is because we never provided any sort of test input, which should be provided in the default key-value store. The `Actor.getInput()` function has detected that there is no **storage** folder and generated one for us. ![Default key-value store filepath](/assets/images/filepath-6c643f3e6fc1e05a2c8e477557a9dd4e.jpg) We'll now add an **INPUT.json** file within **storage/key\_value\_stores/default** to match what we're expecting in our code. ``` { "numbers": [5, 5, 5, 5] } ``` Then we can add our example project code from earlier. It will grab the input and use it to generate a solution which is logged into the console. ``` // index.js import { Actor } from 'apify'; await Actor.init(); const { numbers } = await Actor.getInput(); const addAllNumbers = (...nums) => nums.reduce((total, curr) => total + curr, 0); const solution = addAllNumbers(...numbers); console.log(solution); await Actor.exit(); ``` Cool! When we run `node index.js`, we see **20**. ### Accepting input without the Apify SDK[​](#accepting-input-without-the-apify-sdk "Direct link to Accepting input without the Apify SDK") Alternatively, when writing in a language other than JavaScript, we can create our own `get_input()` function which utilizes the Apify API when the Actor is running on the platform. For this example, we are using the [Apify Client](https://docs.apify.com/academy/getting-started/apify-client.md) for Python to access the API. ``` # index.py from apify_client import ApifyClient from os import environ import json client = ApifyClient(token='YOUR_TOKEN') # If being run on the platform, the "APIFY_IS_AT_HOME" environment variable # will be "1". Otherwise, it will be undefined/None def is_on_apify (): return 'APIFY_IS_AT_HOME' in environ # Get the input def get_input (): if not is_on_apify(): with open('./apify_storage/key_value_stores/default/INPUT.json') as actor_input: return json.load(actor_input) kv_store = client.key_value_store(environ.get('APIFY_DEFAULT_KEY_VALUE_STORE_ID')) return kv_store.get_record('INPUT')['value'] def add_all_numbers (nums): total = 0 for num in nums: total += num return total actor_input = get_input()['numbers'] solution = add_all_numbers(actor_input) print(solution) ``` > For a better understanding of the API endpoints for reading and modifying key-value stores, check the [official API reference](https://docs.apify.com/api/v2/storage-key-value-stores.md). 
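To make that endpoint concrete, here is a rough JavaScript sketch of the same `get_input()` idea using only the raw API (an illustration, not part of the original lesson; it assumes Node.js 18+ for the global `fetch` and relies on the `APIFY_DEFAULT_KEY_VALUE_STORE_ID` and `APIFY_TOKEN` environment variables that the platform injects):

```
// Sketch: read the Actor's INPUT record straight from the default
// key-value store via the "Get record" API endpoint, without any SDK.
const storeId = process.env.APIFY_DEFAULT_KEY_VALUE_STORE_ID;
const token = process.env.APIFY_TOKEN;

const response = await fetch(
    `https://api.apify.com/v2/key-value-stores/${storeId}/records/INPUT?token=${token}`,
);
const input = await response.json();

console.log(input); // e.g. { numbers: [5, 5, 5, 5] }
```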
## Writing output[​](#writing-output "Direct link to Writing output") Similarly to reading input, you can write the Actor's output either by using the Apify SDK in Node.js or by manually writing a utility function to do so. ### Writing output with the Apify SDK[​](#writing-output-with-the-apify-sdk "Direct link to Writing output with the Apify SDK") In the SDK, we can write to the dataset with the `Actor.pushData()` function. Let's go ahead and write the solution of the `addAllNumbers()` function to the default dataset using this function: ``` // index.js // This is our example project code from earlier. // We will use the Apify input as its input. import { Actor } from 'apify'; await Actor.init(); const { numbers } = await Actor.getInput(); const addAllNumbers = (...nums) => nums.reduce((total, curr) => total + curr, 0); const solution = addAllNumbers(...numbers); // And save its output to the default dataset await Actor.pushData({ solution }); await Actor.exit(); ``` ### Writing output without the Apify SDK[​](#writing-output-without-the-apify-sdk "Direct link to Writing output without the Apify SDK") Just as with the custom `get_input()` utility function, you can write a custom `set_output()` function as well if you cannot use the Apify SDK. > You can read and write your output anywhere; however, it is standard practice to use a folder named **storage**. ``` # index.py from apify_client import ApifyClient from os import environ import json client = ApifyClient(token='YOUR_TOKEN') def is_on_apify (): return 'APIFY_IS_AT_HOME' in environ def get_input (): if not is_on_apify(): with open('./apify_storage/key_value_stores/default/INPUT.json') as actor_input: return json.load(actor_input) kv_store = client.key_value_store(environ.get('APIFY_DEFAULT_KEY_VALUE_STORE_ID')) return kv_store.get_record('INPUT')['value'] # Push the solution to the dataset def set_output (data): if not is_on_apify(): with open('./apify_storage/datasets/default/solution.json', 'w') as output: return output.write(json.dumps(data, indent=2)) dataset = client.dataset(environ.get('APIFY_DEFAULT_DATASET_ID')) dataset.push_items([data]) def add_all_numbers (nums): total = 0 for num in nums: total += num return total actor_input = get_input()['numbers'] solution = add_all_numbers(actor_input) set_output({ 'solution': solution }) ``` ## Testing locally[​](#testing-locally "Direct link to Testing locally") Since we've changed our code a lot from the way it originally was by wrapping it in the Apify SDK to accept inputs and return outputs, we most definitely should test it locally before worrying about pushing it to the Apify platform. After running our script, there should be a single item in the default dataset that looks like this: ``` { "solution": 20 } ``` ## Next up[​](#next "Direct link to Next up") That's it! We've now added all of the files and code necessary to convert our software into an Actor. In the [next lesson](https://docs.apify.com/academy/deploying-your-code/input-schema.md), we'll be learning how to generate a user interface for our Actor's input so that users don't have to provide the input in raw JSON format. --- # Expert scraping with Apify **After learning the basics of Actors and Apify, learn to develop pro-level scrapers on the Apify platform with this advanced course.** *** This course will teach you the nitty gritty of what it takes to build pro-level scrapers with Apify.
We recommend that you've at least looked through all of the other courses in the academy prior to taking this one. ## Preparations[​](#preparations "Direct link to Preparations") Before developing a pro-level Apify scraper, there are some important things you should have at least a bit of knowledge about (knowing the basics of each is enough to continue through this section), as well as some things that you should have installed on your system. > If you've already gone through the [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md) and the first courses of the [Apify platform category](https://docs.apify.com/academy/apify-platform.md), you will be more than well equipped to continue on with the lessons in this course. ### Crawlee, Apify SDK, and the Apify CLI[​](#crawlee-apify-sdk-and-cli "Direct link to Crawlee, Apify SDK, and the Apify CLI") If you're feeling ambitious, you don't need to have any prior experience with Crawlee to get started with this course; however, at least 5–10 minutes of exposure is recommended. If you haven't yet tried out Crawlee, you can refer to [this lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md) in the **Web scraping basics for JavaScript devs** course (and ideally follow along). To familiarize yourself with the Apify SDK, you can refer to the [Apify Platform](https://docs.apify.com/academy/apify-platform.md) category. The Apify CLI will play a core role in the running and testing of the Actor you will build, so if you haven't gotten it installed already, please refer to [this short lesson](https://docs.apify.com/academy/tools/apify-cli.md). ### Git[​](#git "Direct link to Git") In one of the later lessons, we'll be learning how to integrate our Actor on the Apify platform with a GitHub repository. For this, you'll need to understand at least the basics of [Git](https://git-scm.com/docs). Here's a [great tutorial](https://product.hubspot.com/blog/git-and-github-tutorial-for-beginners) to help you get started with Git. ### Docker[​](#docker "Direct link to Docker") Docker is a massive topic on its own, but don't be worried! We only expect you to know and understand the very basics of it, which can be learned about in [this short article](https://docs.docker.com/guides/docker-overview/) (10 minute read). ### The basics of Actors[​](#actor-basics "Direct link to The basics of Actors") Part of this course will be learning more in-depth about Actors; however, some basic knowledge is already assumed. If you haven't yet gone through the [Actors](https://docs.apify.com/academy/getting-started/actors.md) lesson of the **Apify platform** course, it's highly recommended to at least give it a glance before moving forward. ## First up[​](#first "Direct link to First up") [First up](https://docs.apify.com/academy/expert-scraping-with-apify/actors-webhooks.md), we'll be learning in-depth about integrating Actors with each other using webhooks. > Each lesson will have a short *(and optional)* quiz that you can take at home to test your skills and knowledge related to the lesson's content. Some questions have straight factual answers, but some others can have varying opinionated answers. --- # Webhooks & advanced Actor overview **Learn more advanced details about Actors, how they work, and the default configurations they can take. 
Also, learn how to integrate your Actor with webhooks.** *** Thus far, you've run Actors on the platform and written an Actor of your own, which you published to the platform yourself using the Apify CLI; therefore, it's fair to say that you are becoming more familiar and comfortable with the concept of **Actors**. Within this lesson, we'll take a more in-depth look at Actors and what they can do. ## Advanced Actor overview[​](#advanced-actors "Direct link to Advanced Actor overview") In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in three short lessons [here](https://docs.apify.com/academy/web-scraping-for-beginners/challenge.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same. Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](https://docs.apify.com/sdk/js/docs/guides/docker-images#example-dockerfile). ## Webhooks[​](#webhooks "Direct link to Webhooks") Webhooks are a powerful tool that can be used for just about anything. You can set up actions to be taken when an Actor reaches a certain state (started, failed, succeeded, etc). These actions usually take the form of an API call (generally a POST request). ## Learning 🧠[​](#learning "Direct link to Learning 🧠") Prior to moving forward, please read over these resources: * Read about [running Actors, handling Actor inputs, memory and CPU](https://docs.apify.com/platform/actors/running.md). * Learn about [Actor webhooks](https://docs.apify.com/platform/integrations/webhooks.md), which we will implement in the next lesson. * Learn [how to run Actors](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md) using Apify's REST API. ## Knowledge check 📝[​](#quiz "Direct link to Knowledge check 📝") 1. How do you allocate more CPU for an Actor's run? 2. Within itself, can you get the exact time that an Actor was started? 3. What are the types of default storages connected to an Actor's run? 4. Can you change the allocated memory of an Actor while it's running? 5. How can you run an Actor with Puppeteer on the Apify platform with headless mode set to `false`? ## Our task[​](#our-task "Direct link to Our task") In this task, we'll be building on top of what we already created in the [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners/challenge.md) course's final challenge, so keep those files safe! Once our Amazon Actor has completed its run, we will, rather than sending an email to ourselves, call an Actor through a webhook. The Actor called will be a new Actor that we will create together, which will take the dataset ID as input, then subsequently filter through all of the results and return only the cheapest one for each product. All of the results of the Actor will be pushed to its default dataset. 
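Before you peek at the full solution linked below, here is a very rough sketch of the kind of webhook the task calls for, created with the JavaScript client (the Actor IDs and the payload template are illustrative assumptions, not the official solution):

```
// Illustrative sketch only: when the Amazon Actor's run succeeds, call the
// filtering Actor and pass it the run's default dataset ID.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

await client.webhooks().create({
    eventTypes: ['ACTOR.RUN.SUCCEEDED'],
    // Fire only for runs of the Amazon Actor (placeholder ID).
    condition: { actorId: 'AMAZON_ACTOR_ID' },
    // The "Run Actor" endpoint of the new filtering Actor (placeholder ID).
    requestUrl: `https://api.apify.com/v2/acts/FILTERING_ACTOR_ID/runs?token=${process.env.APIFY_TOKEN}`,
    // The platform fills in template variables when the webhook fires.
    payloadTemplate: '{"datasetId": {{resource.defaultDatasetId}}}',
});
```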
[**Solution**](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/integrating-webhooks.md) ## Next up[​](#next "Direct link to Next up") This course's [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/managing-source-code.md) is brief, but discusses a very important topic: managing your code and storing it in a safe place. --- # Apify API & client **Gain an in-depth understanding of the two main ways of programmatically interacting with the Apify platform - through the API, and through a client.** *** You can use one of the two main ways to programmatically interact with the Apify platform: by directly using [Apify's RESTful API](https://docs.apify.com/api/v2.md), or by using the [JavaScript](https://docs.apify.com/api/client/js) and [Python](https://docs.apify.com/api/client/python) API clients. In the next two lessons, we'll be focusing on the API and the JavaScript client. > Apify's API and JavaScript API client allow us to do anything a regular user can do when interacting with the platform's web interface, only programmatically. ## Learning 🧠[​](#learning "Direct link to Learning 🧠") * Scroll through the [Apify API docs](https://docs.apify.com/api/v2.md) (there's a whole lot there, so you're not expected to memorize everything). * Read about the Apify client in [Apify's docs](https://docs.apify.com/api/client/js). It can also be seen on [GitHub](https://github.com/apify/apify-client-js) and [npm](https://www.npmjs.com/package/apify-client). * Learn about the [`Actor.newClient()`](https://docs.apify.com/sdk/js/reference/class/Actor#newClient) function in the Apify SDK. * Skim through [this article](https://help.apify.com/en/articles/2868670-how-to-pass-data-from-web-scraper-to-another-actor) about API integration (this article is old; however, it is still relevant). ## Knowledge check 📝[​](#quiz "Direct link to Knowledge check 📝") 1. What is the relationship between the Apify API and the Apify client? Are there any significant differences? 2. How do you pass input when running an Actor or task via API? 3. Do you need to install the `apify-client` npm package when already using the `apify` package? ## Our task[​](#our-task "Direct link to Our task") We'll be creating another new Actor, which will have two jobs: 1. Programmatically call the task for the Amazon Actor. 2. Export its results into CSV format under a new key called **OUTPUT.csv** in the default key-value store. Though it's a bit unintuitive, this is a perfect activity for learning how to use both the Apify API and the Apify JavaScript client. The new Actor should take the following input values, which will be mapped to parameters in the API calls: ``` { // How much memory to allocate to the Amazon Actor // Must be a power of 2 "memory": 4096, // Whether to use the JavaScript client to make the // call, or to use the API "useClient": false, // The fields in each item to return. All other // fields should be omitted "fields": ["title", "itemUrl", "offer"], // The maximum number of items to return "maxItems": 10 } ``` [**Solution**](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-api-and-client.md) ## Next up[​](#next "Direct link to Next up") [Lesson VI](https://docs.apify.com/academy/expert-scraping-with-apify/migrations-maintaining-state.md) will teach us everything we need to know about migrations and how to handle them properly to avoid losing any state, thereby increasing the reliability of our `demo-actor` Amazon scraper.
--- # Bypassing anti-scraping methods **Learn about bypassing anti-scraping methods using proxies and proxy/session rotation together with Crawlee and the Apify SDK.** *** Effectively bypassing anti-scraping software is one of the most crucial, but also one of the most difficult skills to master. The different types of [anti-scraping protections](https://docs.apify.com/academy/anti-scraping.md) can vary a lot on the web. Some websites aren't even protected at all, some require only moderate IP rotation, and some cannot even be scraped without using advanced techniques and workarounds. Additionally, because the web is evolving, anti-scraping techniques are also evolving and becoming more advanced. It is generally quite difficult to recognize the anti-scraping protections a page may have when first inspecting it, so it is important to thoroughly investigate a site prior to writing any lines of code, as anti-scraping measures can significantly change your approach as well as complicate the development process of an Actor. As your skills expand, you will be able to spot anti-scraping measures quicker, and better evaluate the complexity of a new project. You might have already noticed that we've been using the **RESIDENTIAL** proxy group in the `proxyConfiguration` within our Amazon scraping Actor. But what does that mean? This is a proxy group from [Apify Proxy](https://apify.com/proxy) which has been preventing us from being blocked by Amazon this entire time. We'll be learning more about proxies and Apify Proxy in this lesson. ## Learning 🧠[​](#learning "Direct link to Learning 🧠") * Skim [this page](https://apify.com/proxy) for a general idea of Apify Proxy. * Give the [proxy documentation](https://docs.apify.com/platform/proxy.md) a solid readover (feel free to skip most of the examples). * Check out the [anti-scraping guide](https://docs.apify.com/academy/anti-scraping.md). * Gain a solid understanding of the [SessionPool](https://crawlee.dev/api/core/class/SessionPool). * Look at a few Actors on the [Apify store](https://apify.com/store). How are they utilizing proxies? ## Knowledge check 📝[​](#quiz "Direct link to Knowledge check 📝") 1. What are the different types of proxies that Apify proxy offers? What are the main differences between them? 2. Which proxy groups do users get on the free plan? Can they access the proxy from their computer? 3. How can you prevent an error from occurring if one of the proxy groups that a user has is removed? What are the best practices for these scenarios? 4. Does it make sense to rotate proxies when you are logged into a website? 5. Construct a proxy URL that will select proxies **only from the US**. 6. What do you need to do to rotate a proxy (one proxy usually has one IP)? How does this differ for CheerioCrawler and PuppeteerCrawler? 7. Name a few different ways how a website can prevent you from scraping it. ## Our task[​](#our-task "Direct link to Our task") This time, we're going to build a trivial proxy-session manager for our Amazon scraping Actor. A session should be used a maximum of 5 times before being rotated; however, if a request fails, the IP should be rotated immediately. Additionally, the proxies used by our scraper should now only be from the US. 
[**Solution**](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/rotating-proxies.md) ## Next up[​](#next "Direct link to Next up") Up [next](https://docs.apify.com/academy/expert-scraping-with-apify/saving-useful-stats.md), we'll be learning about how to save useful stats about our run, which becomes more and more useful as a project scales. --- # Managing source code **Learn how to manage your Actor's source code more efficiently by integrating it with a GitHub repository. This is standard on the Apify platform.** *** In this brief lesson, we'll discuss how to better manage an Actor's source code. Up 'til now, you've been developing your scripts locally, and then pushing the code directly to the Actor on the Apify platform; however, there is a much more optimal (and standard) way. ## Learning 🧠[​](#learning "Direct link to Learning 🧠") Thus far, every time we've updated our code on the Apify platform, we've used the `apify push` CLI command; however, this can be problematic for a few reasons - mainly because, if someone else wants to make a change to/maintain your code, they don't have access to it, as it is on your local machine. If you're not yet familiar with Git, please get familiar with it through the [Git documentation](https://git-scm.com/docs), then take a quick moment to read about [GitHub integration](https://docs.apify.com/platform/integrations/github.md) in the Apify docs. Also, try to explore the **Multifile editor** in one of the Actors you developed in the previous lessons before moving forward. ## Knowledge check 📝[​](#quiz "Direct link to Knowledge check 📝") 1. Do you have to rebuild an Actor each time the source code is changed? 2. In Git, what is the difference between **pushing** changes and making a **pull request**? 3. Based on your knowledge and experience, is the `apify push` command worth using (in your opinion)? [**Answers**](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/managing-source.md) ## Our task[​](#our-task "Direct link to Our task") First, we must initialize a GitHub repository (you can use Gitlab if you like, but this lesson's examples will be using GitHub). Then, after pushing our main Amazon Actor's code to the repo, we must switch its source code to use the content of the GitHub repository instead. ## Integrating GitHub source code[​](#integrating-github "Direct link to Integrating GitHub source code") First, let's create a repository. This can be done [in a number of ways](https://kbroman.org/github_tutorial/pages/init.html), but in this lesson, we'll do it by creating the remote repository on GitHub's website: ![Create a new GitHub repo](/assets/images/github-new-repo-1e45ed3d75fdb3672b6253b016e1186d.png) Then, we'll run the commands it tells us in our terminal (while within the **demo-actor** directory) to initialize the repository locally, and then push all of the files to the remote one. After you've created your repo, navigate on the Apify platform to the Actor we called **demo-actor**. In the **Source** tab, click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**, which is what we've been using so far. ![Select source code location](/assets/images/select-source-location-8b84116417145746c275463c49e24baa.png) Then, go ahead and paste the link to your repository into the **Git URL** text field and click **Save**. 
The final step is to click on **API** in the top right corner of your Actor's page: ![API button](/assets/images/api-button-4384acadb7883bbad6c7f363c0c1a37c.jpg) And scroll through all of the links until you find the **Build Actor** API endpoint. Copy this endpoint's URL, then head back over to your GitHub repository and navigate to **Settings > Webhooks > Add webhook**. The final thing to do is to paste the URL and save the webhook. ![Adding a webhook to your GitHub repo](/assets/images/ci-github-integration-2ee82ac772eb3280155b7027a4259528.png) And you're done! 🎉 ## Quick chat about code management[​](#code-management "Direct link to Quick chat about code management") This was a bit of overhead, but the good news is that you don't ever have to configure this stuff again for this Actor. Now, every time the content of your **main**/**master** branch changes, the Actor on the Apify platform will rebuild based on the newest code. Think of it as combining two steps into one! Normally, you'd have to do a `git push` from your terminal in order to get the newest code onto GitHub, then run `apify push` to push it to the platform. It's also important to know that GitHub/Gitlab repository integration is standard practice. As projects grow and the number of contributors and maintainers increases, it only makes sense to have a GitHub repository integrated with the project's Actor. For the remainder of this course, all Actors created will be integrated with a GitHub repository. ## Next up[​](#next "Direct link to Next up") [Next up](https://docs.apify.com/academy/expert-scraping-with-apify/tasks-and-storage.md), you'll learn about the different ways to store scraped data, as well as how to utilize a cool feature to run pre-configured Actors. --- # Migrations & maintaining state **Learn about what Actor migrations are and how to handle them properly so that the state is not lost and runs can safely be resurrected.** *** We already know that Actors are Docker containers that can be run on any server. This means that they can be allocated anywhere there is space available, making them very efficient. Unfortunately, there is one big caveat: Actors move - a lot. When an Actor moves, it is called a **migration**. On migration, the process inside of an Actor is completely restarted and everything in its memory is lost, meaning that any values stored within variables or classes are lost. When a migration happens, you want to do a so-called "state transition", which means saving any data you care about so the Actor can continue right where it left off before the migration. ## Learning 🧠[​](#learning "Direct link to Learning 🧠") Read this [article](https://docs.apify.com/platform/actors/development/builds-and-runs/state-persistence.md) on migrations and dealing with state transitions. Before moving forward, read about Actor [events](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v3#events) and how to listen for them. ## Knowledge check 📝[​](#quiz "Direct link to Knowledge check 📝") 1. Actors have an option in the **Settings** tab to **Restart on error**. Would you use this feature for regular Actors? When would you use this feature? 2. Migrations happen randomly, but by [aborting **gracefully**](https://docs.apify.com/platform/actors/running/runs-and-builds.md#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run? 3. 
Why don't you (usually) need to add any special migration handling code for a standard crawling/scraping Actor? Are there any features in the Crawlee/Apify SDK that handle this under the hood? 4. How can you intercept the migration event? How much time do you have after this event happens and before the Actor migrates? 5. When would you persist data to the default key-value store instead of to a named key-value store? ## Our task[​](#our-task "Direct link to Our task") Once again returning to our Amazon **demo-actor**, let's say that we need to store an object in memory (as a variable) containing all of the scraped ASINs as keys and the number of offers scraped from each ASIN as values. The object should follow this format: ``` { "B079ZJ1BPR": 3, "B07D4R4258": 21 } ``` Every 10 seconds, we should log the most up-to-date version of this object to the console. Additionally, the object should be able to survive Actor migrations, which means that even if the Actor were to migrate, its data would not be lost upon resurrection. [**Solution**](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/handling-migrations.md) ## Next up[​](#next "Direct link to Next up") You might have already noticed that we've been using the **RESIDENTIAL** proxy group in the `proxyConfiguration` within our Amazon scraping Actor. But what does that mean? Learn why we've used this group, about proxies, and about avoiding anti-scraping measures in the [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/bypassing-anti-scraping.md). --- # Saving useful run statistics **Understand how to save statistics about an Actor's run, what types of statistics you can save, and why you might want to save them for a large-scale scraper.** *** Using Crawlee and the Apify SDK, we are now able to collect and format data coming directly from websites and save it into a key-value store or dataset. This is great, but sometimes, we want to store some extra data about the run itself, or about each request. We might want to store some extra general run information separately from our results or potentially include statistics about each request within its corresponding dataset item. The types of values that are saved are totally up to you, but the most common are error scores, number of total saved items, number of request retries, number of captchas hit, etc. Storing these values is not always necessary, but can be valuable when debugging and maintaining an Actor. As your projects scale, this will become more and more useful and important. ## Learning 🧠[​](#learning "Direct link to Learning 🧠") Before moving on, give these valuable resources a quick look-over: * Refamiliarize yourself with the various data available on the [Request object](https://crawlee.dev/api/core/class/Request). * Learn about the [`failedRequestHandler` function](https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#failedRequestHandler). * Understand how to use the [`errorHandler`](https://crawlee.dev/api/browser-crawler/interface/BrowserCrawlerOptions#errorHandler) function to handle request failures. * Ensure you are comfortable using [key-value stores](https://docs.apify.com/sdk/js/docs/guides/result-storage#key-value-store) and [datasets](https://docs.apify.com/sdk/js/docs/guides/result-storage#dataset), and understand the differences between the two storage types. ## Knowledge check 📝[​](#quiz "Direct link to Knowledge check 📝") 1. Why might you want to store statistics about an Actor's run (or a specific request)? 2.
In our Amazon scraper, we are trying to store the number of retries of a request once its data is pushed to the dataset. Where would you get this information? Where would you store it? 3. What is the difference between the `failedRequestHandler` and `errorHandler`? ## Our task[​](#our-task "Direct link to Our task") In our Amazon Actor, each dataset result must now have the following extra keys: ``` { "dateHandled": "date-here", // the date + time at which the request was handled "numberOfRetries": 4, // the number of retries of the request before running successfully "currentPendingRequests": 24 // the current number of requests left pending in the request queue } ``` Also, an object including these values should be persisted during the run in the key-value store and logged to the console every 10 seconds: ``` { "errors": { // all of the errors for every request path "some-site.com/products/123": [ "error1", "error2" ] }, "totalSaved": 43 // total number of saved items throughout the entire run } ``` [**Solution**](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/saving-stats.md) ## Wrap up[​](#wrap-up "Direct link to Wrap up") Wow, you've learned a whole lot in this course, so give yourself the pat on the back that you deserve! If you were able to follow along with this course, that means that you're officially an **Apify pro**, and that you're equipped with all of the knowledge and tools you need to build awesome scalable web scrapers either for your own personal projects or for the Apify platform. Congratulations! 🎉 --- # Solutions **View all of the solutions for all of the activities and tasks of this course. Please try to complete each task on your own before reading the solution!** *** The final section of each lesson in this course will be a task which you as the course-taker are expected to complete before moving on to the next lesson. Completing and understanding each task plays an important role in your ability to continue through the course. If you ever get stuck, or if you feel like your solution could be more optimal, you can always refer to the **Solutions** section of the course. Each solution will have all of the code and explanations needed to understand it. **Please** try to do each task **on your own** prior to checking out the solution! --- # Handling migrations **Get real-world experience of maintaining a stateful object stored in memory, which will be persisted through migrations and even graceful aborts.** *** Let's first head into our **demo-actor** and create a new file named **asinTracker.js** in the **src** folder. Within this file, we are going to build a utility class which will allow us to store, modify, persist, and log our tracked ASIN data. Here's the skeleton of our class: ``` // asinTracker.js class ASINTracker { constructor() { this.state = {}; // Log the state to the console every ten // seconds setInterval(() => console.log(this.state), 10000); } // Add an offer to the ASIN's offer count // If ASIN doesn't exist yet, set it to 0 incrementASIN(asin) { if (this.state[asin] === undefined) { this.state[asin] = 0; return; } this.state[asin] += 1; } } // It is only a utility class, so we will immediately // create an instance of it and export that. We only // need one instance for our use case. export default new ASINTracker(); ``` Multiple techniques exist for storing data in memory; however, this is the most modular way, as all state-persistence and modification logic will be held in this file.
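To see how the exported instance behaves, here's a tiny, hedged sketch (the file name is hypothetical); because the module exports a single instance, every file that imports it shares the same `state` object:

```
// some-other-file.js (hypothetical, for illustration only)
import tracker from './asinTracker.js';

tracker.incrementASIN('B079ZJ1BPR'); // first call initializes the count to 0
tracker.incrementASIN('B079ZJ1BPR'); // subsequent calls add 1 per offer

console.log(tracker.state); // { B079ZJ1BPR: 1 }
```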
Here is our updated **routes.js** file which is now utilizing this utility class to track the number of offers for each product ASIN: ``` // routes.js import { createCheerioRouter } from '@crawlee/cheerio'; import { BASE_URL, OFFERS_URL, labels } from './constants.js'; import tracker from './asinTracker.js'; import { dataset } from './main.js'; export const router = createCheerioRouter(); router.addHandler(labels.START, async ({ $, crawler, request }) => { const { keyword } = request.userData; const products = $('div > div[data-asin]:not([data-asin=""])'); for (const product of products) { const element = $(product); const titleElement = $(element.find('.a-text-normal[href]')); const url = `${BASE_URL}${titleElement.attr('href')}`; // For each product, add it to the ASIN tracker // and initialize its collected offers count to 0 tracker.incrementASIN(element.attr('data-asin')); await crawler.addRequests([{ url, label: labels.PRODUCT, userData: { data: { title: titleElement.first().text().trim(), asin: element.attr('data-asin'), itemUrl: url, keyword, }, }, }]); } }); router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => { const { data } = request.userData; const element = $('div#productDescription'); await crawler.addRequests([{ url: OFFERS_URL(data.asin), label: labels.OFFERS, userData: { data: { ...data, description: element.text().trim(), }, }, }]); }); router.addHandler(labels.OFFERS, async ({ $, request }) => { const { data } = request.userData; const { asin } = data; for (const offer of $('#aod-offer')) { // For each offer, add 1 to the ASIN's // offer count tracker.incrementASIN(asin); const element = $(offer); await dataset.pushData({ ...data, sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(), offer: element.find('.a-price .a-offscreen').text().trim(), }); } }); ``` ## Persisting state[​](#persisting-state "Direct link to Persisting state") The **persistState** event is automatically fired (by default) every 60 seconds by the Apify SDK while the Actor is running and is also fired when the **migrating** event occurs. In order to persist our ASIN tracker object, let's use the `Actor.on` function to listen for the **persistState** event and store it in the key-value store each time it is emitted. ``` // asinTracker.js import { Actor } from 'apify'; // We've updated our constants.js file to include the name // of this new key in the key-value store import { ASIN_TRACKER } from './constants.js'; class ASINTracker { constructor() { this.state = {}; Actor.on('persistState', async () => { await Actor.setValue(ASIN_TRACKER, this.state); }); setInterval(() => console.log(this.state), 10000); } incrementASIN(asin) { if (this.state[asin] === undefined) { this.state[asin] = 0; return; } this.state[asin] += 1; } } export default new ASINTracker(); ``` ## Handling resurrections[​](#handling-resurrections "Direct link to Handling resurrections") Great! Now our state will be persisted every 60 seconds in the key-value store. However, we're not done. Let's say that the Actor migrates and is resurrected. We never actually update the `state` variable of our `ASINTracker` class with the state stored in the key-value store, so as our code currently stands, we still don't support state-persistence on migrations. In order to fix this, let's create a method called `initialize` which will be called at the very beginning of the Actor's run, and will check the key-value store for a previous state under the key **ASIN-TRACKER**.
If a previous state does live there, then it will update the class' `state` variable with the value read from the key-value store: ``` // asinTracker.js import { Actor } from 'apify'; import { ASIN_TRACKER } from './constants.js'; class ASINTracker { constructor() { this.state = {}; Actor.on('persistState', async () => { await Actor.setValue(ASIN_TRACKER, this.state); }); setInterval(() => console.log(this.state), 10000); } async initialize() { // Read the data from the key-value store. If it // doesn't exist, it will be undefined const data = await Actor.getValue(ASIN_TRACKER); // If the data does exist, replace the current state // (initialized as an empty object) with the data if (data) this.state = data; } incrementASIN(asin) { if (this.state[asin] === undefined) { this.state[asin] = 0; return; } this.state[asin] += 1; } } export default new ASINTracker(); ``` We'll now call this function at the top level of the **main.js** file to ensure it is the first thing that gets called when the Actor starts up: ``` // main.js // ... import tracker from './asinTracker.js'; // The Actor.init() function should be executed before // the tracker's initialization await Actor.init(); await tracker.initialize(); // ... ``` That's everything! Now, even if the Actor migrates (or is gracefully aborted and then resurrected), this `state` object will always be persisted. ## Quiz answers 📝[​](#quiz-answers "Direct link to Quiz answers 📝") **Q: Actors have an option in the Settings tab to Restart on error. Would you use this feature for regular Actors? When would you use this feature?** **A:** It's best not to use this option by default. If an Actor fails, there must be a reason, which should be thought through first - the failing edge case should be handled before the Actor is resurrected, and the state should be persisted beforehand. **Q: Migrations happen randomly, but by [aborting gracefully](https://docs.apify.com/platform/actors/running/runs-and-builds.md#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run?** **A:** After aborting or throwing an error mid-process, the run manages to pick back up from where it was upon resurrection. **Q: Why don't you (usually) need to add any special migration handling code for a standard crawling/scraping Actor? Are there any features in Crawlee or Apify SDK that handle this under the hood?** **A:** Because Crawlee and the Apify SDK handle all of the migration handling code for us. If you want to add custom migration-handling code, you can use `Actor.events` to listen for the `migrating` or `persistState` events to save the current state in the key-value store (or elsewhere). **Q: How can you intercept the migration event? How much time do you have after this event happens and before the Actor migrates?** **A:** By using the `Actor.on` function. You have a maximum of a few seconds before shutdown after the `migrating` event has been fired. **Q: When would you persist data to the default key-value store instead of to a named key-value store?** **A:** Persisting data to the default key-value store would help when handling an Actor's run state or with storing metadata about the run (such as results, miscellaneous files, or logs). Using a named key-value store allows you to persist data at the account level to handle data across multiple Actor runs.
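To make the `migrating` answer above more concrete, here's a minimal, hedged sketch of intercepting the event directly (reusing the tracker instance and the **ASIN-TRACKER** key from this solution); it complements the periodic `persistState` handler by flushing the state one last time right before the platform moves the run:

```
import { Actor } from 'apify';
import { ASIN_TRACKER } from './constants.js';
import tracker from './asinTracker.js';

// 'migrating' fires right before the run is moved to another server,
// leaving only a few seconds to persist anything still held in memory.
Actor.on('migrating', async () => {
    await Actor.setValue(ASIN_TRACKER, tracker.state);
});
```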
## Wrap up[​](#wrap-up "Direct link to Wrap up") In this activity, we learned how to persist custom values on an interval as well as after Actor migrations by using the `persistState` event and the key-value store. With this knowledge, you can safely increase your Actor's performance by storing data in variables and then pushing them to the dataset periodically/at the end of the Actor's run as opposed to pushing data immediately after it's been collected. One important thing to note is that this workflow can be used to replace the usage of `userData` to pass data between requests, as it allows for the creation of a "global store" which all requests have access to at any time. --- # Integrating webhooks **Learn how to integrate webhooks into your Actors. Webhooks are a super powerful tool, and can be used to do almost anything!** *** In this lesson, we'll be writing a new Actor and integrating it with our beloved Amazon scraping Actor. First, we'll navigate to the same directory where our **demo-actor** folder lives, and run `apify create filter-actor` *(once again, you can name the Actor whatever you want, but for this lesson, we'll be calling the new Actor **filter-actor**)*. When prompted for which type of boilerplate to start out with, select **Empty**. ![Selecting an empty template to start with](/assets/images/select-empty-e3ce0a2c3ae3594722f0d98bc85fe40c.jpg) Cool! Now, we're ready to get started. ## Building the new Actor[​](#building-the-new-actor "Direct link to Building the new Actor") First of all, we should clear out any of the boilerplate code within **main.js** to get a clean slate: ``` // main.js import { Actor } from 'apify'; await Actor.init(); // ... await Actor.exit(); ``` We'll be passing the ID of the Amazon Actor's default dataset along to the new Actor, so we can expect that as an input: ``` const { datasetId } = await Actor.getInput(); const dataset = await Actor.openDataset(datasetId); // ... ``` > Tip: You will need to use the `forceCloud` option - `Actor.openDataset(datasetId, { forceCloud: true });` - to open the dataset from platform storage while running the Actor locally. Next, we'll grab hold of the dataset's items with the `dataset.getData()` function: ``` const { items } = await dataset.getData(); ``` While several methods can achieve the goal output of this Actor, using [`Array.reduce()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/reduce) is the most concise approach: ``` const filtered = items.reduce((acc, curr) => { // Grab the price of the item matching our current // item's ASIN in the map. If it doesn't exist, set // "prevPrice" to null const prevPrice = acc?.[curr.asin] ? +acc[curr.asin].offer.slice(1) : null; // Grab the price of our current offer const price = +curr.offer.slice(1); // If the item doesn't yet exist in the map, add it. // Or, if the current offer's price is less than the // saved one, replace the saved one if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr; // Return the map return acc; }, {}); ``` The results should be an array, so we can take the map we just created and push an array of its values to the Actor's default dataset: ``` await Actor.pushData(Object.values(filtered)); ``` Our final code looks like this: ``` import { Actor } from 'apify'; await Actor.init(); const { datasetId } = await Actor.getInput(); const dataset = await Actor.openDataset(datasetId); const { items } = await dataset.getData(); const filtered = items.reduce((acc, curr) => { const prevPrice = acc?.[curr.asin] ?
+acc[curr.asin].offer.slice(1) : null; const price = +curr.offer.slice(1); if (!acc[curr.asin] || prevPrice > price) acc[curr.asin] = curr; return acc; }, {}); await Actor.pushData(Object.values(filtered)); await Actor.exit(); ``` Cool! But **wait**, don't forget to configure the **INPUT\_SCHEMA.json** file as well! It's not strictly necessary to do this step, as we'll be calling the Actor through Apify's API within a webhook, but it's still good to get into the habit of writing quality input schemas that describe the input values your Actors are expecting. ``` { "title": "Amazon Filter Actor", "type": "object", "schemaVersion": 1, "properties": { "datasetId": { "title": "Dataset ID", "type": "string", "description": "Enter the ID of the dataset.", "editor": "textfield" } }, "required": ["datasetId"] } ``` Now we're done, and we can push it up to the Apify platform with the `apify push` command. ## Setting up the webhook[​](#setting-up-the-webhook "Direct link to Setting up the webhook") Since we'll be calling the Actor via the [Apify API](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md), we'll need to grab hold of the ID of the Actor we just created and pushed to the platform. The ID is always accessible through the **Settings** page of the Actor. ![Actor ID in Actor settings](/assets/images/actor-settings-589ee339d8a9630025e7364c5c7e97fc.jpg) With this `actorId`, and our `token`, which is retrievable through **Settings > Integrations** in Apify Console, we can construct a link which will call the Actor: ``` https://api.apify.com/v2/acts/Yk1bieximsduYDydP/runs?token=YOUR_TOKEN_HERE ``` We can also use our username and the name of the Actor like this: ``` https://api.apify.com/v2/acts/USERNAME~filter-actor/runs?token=YOUR_TOKEN_HERE ``` Whichever one you choose is totally up to your preference. Next, within the Amazon scraping Actor, we will click the **Integrations** tab and choose **Webhook**, then fill out the details to look like this: ![Configuring a webhook](/assets/images/adding-webhook-c76d2f73bb0cadcf48620b59db1a1a9c.jpg) We have chosen to run the webhook once the Actor has succeeded, which means that its default dataset will surely be populated. Since the filtering Actor is expecting the default dataset ID of the Amazon Actor, we use the `resource` variable to grab hold of the `defaultDatasetId`. Click **Save**, then run the Amazon **demo-actor** again. ## Making sure it worked[​](#checking-the-webhook "Direct link to Making sure it worked") If everything worked, then at the end of the **demo-actor**'s run, we should see this within the **Integrations** tab: ![Webhook succeeded](/assets/images/webhook-succeeded-f95ddb172f63747d28dc72e5cdbb9c21.png) Additionally, we should be able to see that our **filter-actor** was run, and have access to its dataset: ![Dataset preview](/assets/images/dataset-preview-711de106446452a93cc8c15675d77a4d.png) ## Quiz answers 📝[​](#quiz-answers "Direct link to Quiz answers 📝") **Q: How do you allocate more CPU for an Actor's run?** **A:** CPU can't be allocated directly - it scales with the amount of memory allocated to the run. On the platform, you can increase the memory in the run options (and change the default in the Actor's **Settings** tab). When running locally, you can use the **APIFY\_MEMORY\_MBYTES** environment variable to set the allocated memory. 4 GB of memory is equal to 1 CPU core on the Apify platform. **Q: Within itself, can you get the exact time that an Actor was started?** **A:** Yes.
The time the Actor was started can be retrieved through the `startedAt` property from the `Actor.getEnv()` function, or directly from `process.env.APIFY_STARTED_AT`. **Q: What are the types of default storages connected to an Actor's run?** **A:** Every Actor's run is given a default key-value store and a default dataset. The default key-value store contains the Actor's `INPUT` and, if the Actor stores one, its `OUTPUT`. A default request queue is also created for the run. **Q: Can you change the allocated memory of an Actor while it's running?** **A:** Not while it's running. You'd need to stop it and run a new one. However, you can abort an Actor gracefully and then resurrect the run with a different memory configuration. **Q: How can you run an Actor with Puppeteer on the Apify platform with headless mode set to `false`?** **A:** This can be done by using the `actor-node-puppeteer-chrome` Docker image and making sure that `launchContext.launchOptions.headless` in `PuppeteerCrawlerOptions` is set to `false`. ## Wrap up[​](#wrap-up "Direct link to Wrap up") See that?! Integrating webhooks is a piece of cake on the Apify platform! You'll soon discover that the platform abstracts away a lot of complexity and allows you to focus on what's most important - developing and releasing Actors. --- # Managing source **View in-depth answers for all three of the quiz questions that were provided in the corresponding lesson about managing source code.** *** In the lesson corresponding to this solution, we discussed an extremely important topic: source code management. Though we solved the task right in the lesson, we've still included the quiz answers here. ## Quiz answers[​](#quiz-answers "Direct link to Quiz answers") **Q: Do you have to rebuild an Actor each time the source code is changed?** **A:** Yes. It needs to be built into an image, saved in a registry, and later run in a container. **Q: In Git, what is the difference between pushing changes and making a pull request?** **A:** Pushing updates the remote branch with the content of your local branch. Code changes are usually pushed to a branch parallel to the one you eventually want to merge them into. When creating a pull request, the code is meant to be reviewed, or at least pass all the test suites, before being merged into the target branch. **Q: Based on your knowledge and experience, is the `apify push` command worth using (in your opinion)?** **A:** The `apify push` command can sometimes be useful when testing ideas; however, it is generally better to use the GitHub integration rather than pushing directly to the platform. --- # Rotating proxies/sessions **Learn firsthand how to rotate proxies and sessions in order to avoid the majority of the most common anti-scraping protections.** *** If you take a look at our current code for the Amazon scraping Actor, you might notice this snippet: ``` const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], }); ``` We didn't provide much explanation for this initially, as it was not directly relevant to the lesson at hand. When you [create a **ProxyConfiguration**](https://docs.apify.com/academy/anti-scraping/mitigation/using-proxies.md) and pass it to a crawler, Crawlee will make the crawler automatically rotate through the proxies. This entire time, we've been using the **RESIDENTIAL** proxy group to avoid being blocked by Amazon. > Go ahead and try commenting out the proxy configuration code and then running the scraper. What happens?
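To see that rotation in action, here's a small, hedged sketch you could run on the platform (or locally, assuming you have Apify Proxy access); `newUrl()` is the same method Crawlee uses under the hood to pick a proxy for each request or session:

```
import { Actor } from 'apify';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

// Without a session ID, each call can hand back a different proxy URL -
// this is the rotation the crawler performs for us automatically.
console.log(await proxyConfiguration.newUrl());
console.log(await proxyConfiguration.newUrl());

// With a session ID, the same session keeps getting the same proxy,
// which is how proxies get tied to sessions in the SessionPool.
console.log(await proxyConfiguration.newUrl('session-1'));
console.log(await proxyConfiguration.newUrl('session-1'));

await Actor.exit();
```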
In order to rotate sessions, we must utilize the [**SessionPool**](https://crawlee.dev/api/core/class/SessionPool), which we've actually already been using by setting the **useSessionPool** option in our crawler's configuration to **true**. The SessionPool advances the concept of proxy rotation by tying proxies to user-like sessions and rotating those instead. In addition to a proxy, each user-like session has cookies attached to it (and potentially a browser fingerprint as well). ## Configuring SessionPool[​](#configuring-session-pool "Direct link to Configuring SessionPool") Let's go ahead and add a **sessionPoolOptions** key to our crawler's configuration so that we can modify the default settings: ``` const crawler = new CheerioCrawler({ requestList, requestQueue, proxyConfiguration, useSessionPool: true, // This is where our session pool // configuration lives sessionPoolOptions: { // We can add options for each // session created by the session // pool here sessionOptions: { }, }, maxConcurrency: 50, // ... }); ``` Now, we'll use the **maxUsageCount** key to force each session to be thrown away after 5 uses and **maxErrorScore** to retire a session as soon as it receives an error. ``` const crawler = new CheerioCrawler({ requestList, requestQueue, proxyConfiguration, useSessionPool: true, sessionPoolOptions: { sessionOptions: { maxUsageCount: 5, maxErrorScore: 1, }, }, maxConcurrency: 50, // ... }); ``` And that's it! We've successfully configured the session pool to match the task's requirements. ## Limiting proxy location[​](#limiting-proxy-location "Direct link to Limiting proxy location") The final requirement was to use proxies only from the US. Back in our **ProxyConfiguration**, we need to add the **countryCode** key and set it to **US**: ``` const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], countryCode: 'US', }); ``` ## Quiz answers[​](#quiz-answers "Direct link to Quiz answers") **Q: What are the different types of proxies that Apify proxy offers? What are the main differences between them?** **A:** Datacenter, residential, and Google SERP proxies with sub-groups. Datacenter proxies are fast and cheap but have a higher chance of being blocked on certain sites in comparison to residential proxies, which are IP addresses located in homes and offices around the world. Google SERP proxies are specifically for Google. **Q: Which proxy groups do users get on the free plan? Can they access the proxy from their computer?** **A:** All users have access to the **BUYPROXIES94952**, **GOOGLE\_SERP**, and **RESIDENTIAL** groups. Free users cannot access the proxy from outside the Apify platform (paying users can). **Q: How can you prevent an error from occurring if one of the proxy groups that a user has is removed? What are the best practices for these scenarios?** **A:** By making the proxy groups the scraper uses configurable by the user through the Actor's input. That way, they can switch proxies if the Actor stops working due to proxy-related issues. It can also be done by using the **AUTO** proxy option instead of specific groups. **Q: Does it make sense to rotate proxies when you are logged into a website?** **A:** No, because most websites tie an IP address to a session. If you start making requests with cookies used with a different IP address, the website might see it as unusual activity and either block the scraper or automatically log the session out.
**Q: Construct a proxy URL that will select proxies only from the US.** **A:** `http://country-US:<YOUR_PROXY_PASSWORD>@proxy.apify.com:8000` **Q: What do you need to do to rotate a proxy (one proxy usually has one IP)? How does this differ for CheerioCrawler and PuppeteerCrawler?** **A:** Making a new request with the proxy endpoint above will automatically rotate it. Sessions can also be used to automatically do this. While proxy rotation is fairly straightforward for Cheerio, it's more complex in Puppeteer, as you have to retire the browser each time a new proxy is rotated in. The SessionPool will automatically retire a browser when a session is retired. Sessions can be manually retired with `session.retire()`. **Q: Name a few different ways that a website can prevent you from scraping it.** **A:** IP detection and rate-limiting, browser/fingerprint detection, user behavior tracking, etc. ## Wrap up[​](#wrap-up "Direct link to Wrap up") In this solution, you learned one of the most important concepts in web scraping - proxy/session rotation. With your newfound knowledge of the SessionPool, you'll be (practically) unstoppable! --- # Saving run stats **Implement the saving of general statistics about an Actor's run, as well as adding request-specific statistics to dataset items.** *** The code in this solution will be similar to what we already did in the **Handling migrations** solution; however, we'll be storing and logging different data. First, let's create a new file called **Stats.js** and write a utility class for storing our run stats: ``` import { Actor } from 'apify'; class Stats { constructor() { this.state = { errors: {}, totalSaved: 0, }; } async initialize() { const data = await Actor.getValue('STATS'); if (data) this.state = data; Actor.on('persistState', async () => { await Actor.setValue('STATS', this.state); }); setInterval(() => console.log(this.state), 10000); } addError(url, errorMessage) { if (!this.state.errors?.[url]) this.state.errors[url] = []; this.state.errors[url].push(errorMessage); } success() { this.state.totalSaved += 1; } } export default new Stats(); ``` Cool, very similar to the **ASINTracker** class we wrote earlier. We'll now import **Stats** into our **main.js** file and initialize it along with the ASIN tracker: ``` // ... import Stats from './Stats.js'; await Actor.init(); await asinTracker.initialize(); await Stats.initialize(); // ... ``` ## Tracking errors[​](#tracking-errors "Direct link to Tracking errors") In order to keep track of errors, we must write a new function within the crawler's configuration called **errorHandler**. Passed into this function is an object containing an **Error** object for the error which occurred and the **Request** object, as well as information about the session and proxy which were used for the request. ``` const crawler = new CheerioCrawler({ proxyConfiguration, useSessionPool: true, sessionPoolOptions: { persistStateKey: 'AMAZON-SESSIONS', sessionOptions: { maxUsageCount: 5, maxErrorScore: 1, }, }, maxConcurrency: 50, requestHandler: router, // Runs on every request failure, before the request is retried errorHandler: async ({ error, request }) => { // Add an error for this URL to our error tracker Stats.addError(request.url, error?.message); }, }); ``` ## Tracking total saved[​](#tracking-total-saved "Direct link to Tracking total saved") Now, we'll increment our **totalSaved** count for every offer added to the dataset.
``` router.addHandler(labels.OFFERS, async ({ $, request }) => { const { data } = request.userData; const { asin } = data; for (const offer of $('#aod-offer')) { tracker.incrementASIN(asin); // Add 1 to totalSaved for every offer Stats.success(); const element = $(offer); await dataset.pushData({ ...data, sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(), offer: element.find('.a-price .a-offscreen').text().trim(), }); } }); ``` ## Saving stats with dataset items[​](#saving-stats-with-dataset-items "Direct link to Saving stats with dataset items") Still in the **OFFERS** handler, we need to add a few extra keys to the items which are pushed to the dataset. Luckily, all of the data required by the task is accessible in the context object. ``` router.addHandler(labels.OFFERS, async ({ $, request, crawler }) => { const { data } = request.userData; const { asin } = data; for (const offer of $('#aod-offer')) { tracker.incrementASIN(asin); // Add 1 to totalSaved for every offer Stats.success(); const element = $(offer); await dataset.pushData({ ...data, sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(), offer: element.find('.a-price .a-offscreen').text().trim(), // Store the handledAt date or current date if that is undefined dateHandled: request.handledAt || new Date().toISOString(), // Access the number of retries on the request object numberOfRetries: request.retryCount, // Grab the number of pending requests from the crawler's request queue currentPendingRequests: (await crawler.requestQueue.getInfo()).pendingRequestCount, }); } }); ``` ## Quiz answers[​](#quiz-answers "Direct link to Quiz answers") **Q: Why might you want to store statistics about an Actor's run (or a specific request)?** **A:** If certain types of requests are error-prone, you might want to save stats about the run to look at them later to either eliminate or better handle the errors. Things like **dateHandled** can be generally useful information. **Q: In our Amazon scraper, we are trying to store the number of retries of a request once its data is pushed to the dataset. Where would you get this information? Where would you store it?** **A:** This information is available directly on the request object under the property **retryCount**. It can be stored directly on each dataset item as it is pushed. **Q: What is the difference between the `failedRequestHandler` and `errorHandler`?** **A:** `failedRequestHandler` runs after a request has failed and reached its `maxRequestRetries` limit. `errorHandler` runs on every failure and retry. --- # Using the Apify API & JavaScript client **Learn how to interact with the Apify API directly through the well-documented RESTful routes, or by using the proprietary Apify JavaScript client.** *** Since we need to create another Actor, we'll once again use the `apify create` command and start from an empty template. ![Selecting an empty template to start with](/assets/images/select-empty-e3ce0a2c3ae3594722f0d98bc85fe40c.jpg) This time, let's call our project **actor-caller**.
Let's also set up some boilerplate, grabbing our inputs and creating a constant variable for the task: ``` import { Actor } from 'apify'; import axios from 'axios'; await Actor.init(); const { useClient, memory, fields, maxItems } = await Actor.getInput(); const TASK = 'YOUR_USERNAME~demo-actor-task'; // our future code will go here await Actor.exit(); ``` ## Calling a task via JavaScript client[​](#calling-a-task-via-client "Direct link to Calling a task via JavaScript client") When using the `apify-client` package, you can create a new client instance by using `new ApifyClient()`. Within the Apify SDK, however, it is not necessary to even install the `apify-client` package, as the `Actor.newClient()` function is available for use. We'll start by creating a function called `withClient()` and creating a new client, then calling the task: ``` const withClient = async () => { const client = Actor.newClient(); const task = client.task(TASK); // memory is a run option, so it belongs in the second (options) argument const { id } = await task.call(undefined, { memory }); }; ``` After the task has run, we'll grab hold of its dataset, then attempt to download the items, plugging in our `maxItems` and `fields` inputs. Then, once the data has been downloaded, we'll push it to the default key-value store under a key named **OUTPUT.csv**. ``` const withClient = async () => { const client = Actor.newClient(); const task = client.task(TASK); const { id } = await task.call(undefined, { memory }); const dataset = client.run(id).dataset(); const items = await dataset.downloadItems('csv', { limit: maxItems, fields, }); // If the content type is anything other than JSON, it must // be specified within the third options parameter return Actor.setValue('OUTPUT', items, { contentType: 'text/csv' }); }; ``` ## Calling a task via API[​](#calling-a-task-via-api "Direct link to Calling a task via API") First, we'll create a function (right under `withClient()`) named `withAPI` and create a new variable which represents the API endpoint to run our task: ``` const withAPI = async () => { const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`; }; ``` To add the query parameters to the URL, we could create a super long string literal, plugging in all of our input values; however, there is a much better way: [`URLSearchParams`](https://nodejs.org/api/url.html#new-urlsearchparams). By using `URLSearchParams`, we can provide the query parameters as an object: ``` const withAPI = async () => { const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`; const url = new URL(uri); url.search = new URLSearchParams({ memory, format: 'csv', limit: maxItems, fields: fields.join(','), token: process.env.APIFY_TOKEN, }); }; ``` Finally, let's make a `POST` request to our endpoint. You can use any library you want, but in this example, we'll use [`axios`](https://www.npmjs.com/package/axios). Don't forget to run `npm install axios` if you're going to use this package too!
``` const withAPI = async () => { const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`; const url = new URL(uri); url.search = new URLSearchParams({ memory, format: 'csv', limit: maxItems, fields: fields.join(','), token: process.env.APIFY_TOKEN, }); const { data } = await axios.post(url.toString()); return Actor.setValue('OUTPUT', data, { contentType: 'text/csv' }); }; ``` ## Finalizing the Actor[​](#finalizing-the-actor "Direct link to Finalizing the Actor") Now, since we've written both of these functions, all we have to do is write a conditional statement based on the boolean value from `useClient`: ``` if (useClient) await withClient(); else await withAPI(); ``` And before we push to the platform, let's not forget to write an input schema in the **INPUT\_SCHEMA.json** file: ``` { "title": "Actor Caller", "type": "object", "schemaVersion": 1, "properties": { "memory": { "title": "Memory", "type": "integer", "description": "Select memory in megabytes.", "default": 4096, "maximum": 32768, "unit": "MB" }, "useClient": { "title": "Use client?", "type": "boolean", "description": "Specifies whether the Apify JS client, or the pure Apify API should be used.", "default": true }, "fields": { "title": "Fields", "type": "array", "description": "Enter the dataset fields to export to CSV", "prefill": ["title", "url", "price"], "editor": "stringList" }, "maxItems": { "title": "Max items", "type": "integer", "description": "Fill the maximum number of items to export.", "default": 10 } }, "required": ["useClient", "memory", "fields", "maxItems"] } ``` ## Final code[​](#final-code "Direct link to Final code") To ensure we're on the same page, here is what the final code looks like: ``` import { Actor } from 'apify'; import axios from 'axios'; await Actor.init(); const { useClient, memory, fields, maxItems } = await Actor.getInput(); const TASK = 'YOUR_USERNAME~demo-actor-task'; const withClient = async () => { const client = Actor.newClient(); const task = client.task(TASK); const { id } = await task.call(undefined, { memory }); const dataset = client.run(id).dataset(); const items = await dataset.downloadItems('csv', { limit: maxItems, fields, }); return Actor.setValue('OUTPUT', items, { contentType: 'text/csv' }); }; const withAPI = async () => { const uri = `https://api.apify.com/v2/actor-tasks/${TASK}/run-sync-get-dataset-items?`; const url = new URL(uri); url.search = new URLSearchParams({ memory, format: 'csv', limit: maxItems, fields: fields.join(','), token: process.env.APIFY_TOKEN, }); const { data } = await axios.post(url.toString()); return Actor.setValue('OUTPUT', data, { contentType: 'text/csv' }); }; if (useClient) { await withClient(); } else { await withAPI(); } await Actor.exit(); ``` ## Quiz answers 📝[​](#quiz-answers "Direct link to Quiz answers 📝") **Q: What is the relationship between the Apify API and Apify client? Are there any significant differences?** **A:** The Apify client mimics the Apify API, so there aren't any super significant differences. It's super handy as it helps with managing the API calls (parsing, error handling, retries, etc.) and even adds convenience functions. The one main difference is that the Apify client automatically uses [**exponential backoff**](https://docs.apify.com/api/client/js/docs#retries-with-exponential-backoff) to deal with errors. **Q: How do you pass input when running an Actor or task via API?** **A:** The input should be passed into the **body** of the request when running an Actor/task via API.
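To make that answer concrete, here's a hedged sketch (using `axios`, as elsewhere in this lesson) of starting an Actor run with input in the request body; the Actor name and input fields are placeholders, not part of this solution:

```
import axios from 'axios';

// The JSON body of the POST request becomes the Actor's (or task's) input.
const input = { keyword: 'wireless headphones' }; // hypothetical input

const { data } = await axios.post(
    // 'USERNAME~demo-actor' is a placeholder for your own Actor or task
    `https://api.apify.com/v2/acts/USERNAME~demo-actor/runs?token=${process.env.APIFY_TOKEN}`,
    input,
    { headers: { 'Content-Type': 'application/json' } },
);

console.log(data.data.id); // ID of the newly started run
```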
**Q: Do you need to install the `apify-client` npm package when already using the `apify` package?** **A:** No. The Apify client is available right in the SDK with the `Actor.newClient()` function. ## Wrap up[​](#wrap-up "Direct link to Wrap up") That's it! Now, if you want to go above and beyond, you should create a GitHub repository for this Actor, integrate it with a new Actor on the Apify platform, and test whether it works there as well (with multiple input configurations). --- # Using storage & creating tasks ## Quiz answers 📝[​](#quiz-answers "Direct link to Quiz answers 📝") **Q: What is the relationship between Actors and tasks?** **A:** Tasks are pre-configured runs of Actors. The configurations of an Actor can be saved as a task so that it doesn't have to be manually configured every single time. **Q: What are the differences between default (unnamed) and named storage? Which one would you use for everyday usage?** **A:** Unnamed storage is persisted for only 7 days, while named storage is persisted indefinitely. For everyday usage, it is best to use default unnamed storages unless the data should explicitly be persisted for more than 7 days. > With named storages, it's easier to verify that you're using the correct store, as they can be referred to by name rather than by an ID. **Q: What is data retention, and how does it work for all types of storages (default and named)?** **A:** Default/unnamed storages expire after 7 days unless otherwise specified. Named storages are retained indefinitely. ## Wrap up[​](#wrap-up "Direct link to Wrap up") You've learned how to use the different storage options available on Apify, the two different types of storage, as well as how to create tasks for Actors. --- # Tasks & storage **Understand how to save the configurations for Actors with Actor tasks. Also, learn about storage and the different types Apify offers.** *** Tasks and storage are two very different things; however, they are also tied together in many ways. **Tasks** run Actors, Actors return data, and data is stored in different types of **Storages**. ## Tasks[​](#tasks "Direct link to Tasks") Tasks are a very useful feature that allows us to save pre-configured inputs for Actors. This means that rather than configuring the Actor every time, or having to save screenshots of various Actor configurations, you can store the configurations right in your Apify account instead and run the Actor with them at will. ## Storage[​](#storage "Direct link to Storage") Storage allows us to save persistent data for further processing. As you'll learn, there are two main storage options on the Apify platform, as well as two main storage types (**named** and **unnamed**) with one big difference between them. ## Learning 🧠[​](#learning "Direct link to Learning 🧠") * Check out [the docs about Actor tasks](https://docs.apify.com/platform/actors/running/tasks.md). * Read about the [two main storage options](https://docs.apify.com/platform/storage/dataset.md) on the Apify platform. * Understand the [crucial differences between named and unnamed storages](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages). * Learn about the [`Dataset`](https://docs.apify.com/sdk/js/reference/class/Dataset) and [`KeyValueStore`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore) objects in the Apify SDK. ## Knowledge check 📝[​](#quiz "Direct link to Knowledge check 📝") 1. What is the relationship between Actors and tasks? 2.
What are the differences between default (unnamed) and named storage? Which one would you use for everyday usage? 3. What is data retention, and how does it work for all types of storages (default and named)? [**Solution**](https://docs.apify.com/academy/expert-scraping-with-apify/solutions/using-storage-creating-tasks.md) ## Next up[​](#next "Direct link to Next up") The [next lesson](https://docs.apify.com/academy/expert-scraping-with-apify/apify-api-and-client.md) is very exciting, as it will unlock the ability to seamlessly integrate your Apify Actors into your own external projects and applications with the Apify API. --- # Monetizing your Actor **Learn how you can monetize your web scraping and automation projects by publishing Actors to users in Apify Store.** *** When you publish your Actor on the Apify platform, you have the option to make it a *Paid Actor* and earn revenue from users who benefit from your tool. You can choose between two pricing models: * Rental * Pay-per-result ## Rental pricing model[​](#rental-pricing-model "Direct link to Rental pricing model") With the rental model, you can specify a free trial period and a monthly rental price. After the trial, users with an [Apify paid plan](https://apify.com/pricing) can continue using your Actor by paying the monthly fee. You can receive 80% of the total rental fees collected each month. Example - rental pricing model You make your Actor rental with a 7-day free trial and then $30/month. During the first calendar month, three users start to use your Actor: 1. First user, on Apify paid plan, starts the free trial on 15th 2. Second user, on Apify paid plan, starts the free trial on 25th 3. Third user, on Apify free plan, starts the free trial on 20th The first user pays their first rent 7 days after starting the free trial, i.e., on 22nd. The second user only starts paying the rent next month. The third user is on Apify free plan, so after the free trial ends on 27th, they are not charged and cannot use the Actor further until they get a paid plan. Your profit is computed only from the first user. They were charged $30, so 80% of this goes to you, i.e., *0.8 \* 30 = $24*. ## Pay-per-result pricing model[​](#pay-per-result-pricing-model "Direct link to Pay-per-result pricing model") In this model, you set a price per 1,000 results. Users are charged based on the number of results your Actor produces. Your profit is calculated as 80% of the revenue minus platform usage costs. 
The formula is: `(0.8 * revenue) - costs = profit` ### Pay-per-result unit pricing for cost computation[​](#pay-per-result-unit-pricing-for-cost-computation "Direct link to Pay-per-result unit pricing for cost computation") | Service | Unit price | | ------------------------------- | -------------------------- | | Compute unit | **$0.4** / CU | | Residential proxies | **$13** / GB | | SERPs proxy | **$3** / 1,000 SERPs | | Data transfer - external | **$0.20** / GB | | Data transfer - internal | **$0.05** / GB | | Dataset - timed storage | **$1.00** / 1,000 GB-hours | | Dataset - reads | **$0.0004** / 1,000 reads | | Dataset - writes | **$0.005** / 1,000 writes | | Key-value store - timed storage | **$1.00** / 1,000 GB-hours | | Key-value store - reads | **$0.005** / 1,000 reads | | Key-value store - writes | **$0.05** / 1,000 writes | | Key-value store - lists | **$0.05** / 1,000 lists | | Request queue - timed storage | **$4.00** / 1,000 GB-hours | | Request queue - reads | **$0.004** / 1,000 reads | | Request queue - writes | **$0.02** / 1,000 writes | Only revenue & costs for Apify customers on paid plans are taken into consideration when computing your profit. Users on free plans are not reflected there, although you can see statistics about the potential revenue of users that are currently on free plans in Actor Insights in the Apify Console. What are Gigabyte-hours? Gigabyte-hours (GB-hours) are a unit of measurement used to quantify data storage and processing capacity over time. To calculate GB-hours, multiply the amount of data in gigabytes by the number of hours it's stored or processed. For example, if you host 50 GB of data for 30 days: * Convert days to hours: *30 \* 24 = 720* * Multiply data size by hours: *50 \* 720 = 36,000* This means that storing 50 GB of data for 30 days results in 36,000 GB-hours. Read more about Actors in the Store and different pricing models from the perspective of your users in the [Store documentation](https://docs.apify.com/platform/actors/running/actors-in-store). Example - pay-per-result pricing model You make your Actor pay-per-result and set the price to $1/1,000 results. During the first month, two users on Apify paid plans use your Actor to get 50,000 and 20,000 results, costing them $50 and $20 respectively. Let's say the underlying platform usage for the first user is $5 and for the second $2. A third user, this time on the Apify free plan, uses the Actor to get 5,000 results, with underlying platform usage of $0.5. Your profit is computed only from the first two users, since they are on Apify paid plans. The revenue for the first user is $50 and for the second $20, i.e., total revenue is $70. The total underlying cost is *$5 + $2 = $7*. Since your profit is 80% of the revenue minus the cost, it would be *0.8 \* 70 - 7 = $49*. 
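To make the arithmetic concrete, here is a small illustrative JavaScript snippet (not an official Apify utility) applying the profit formula and the GB-hours calculation from the examples above:

```
// Pay-per-result profit: 80% of revenue from paying users, minus platform usage costs.
const payPerResultProfit = (revenueUsd, platformCostsUsd) => 0.8 * revenueUsd - platformCostsUsd;

// The example above: $50 + $20 of revenue, $5 + $2 of underlying platform usage.
console.log(payPerResultProfit(50 + 20, 5 + 2)); // 49

// GB-hours: gigabytes stored multiplied by the number of hours they are stored.
const gbHours = (gigabytes, days) => gigabytes * days * 24;
console.log(gbHours(50, 30)); // 36000
```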
### Best practices for Pay-per-results Actors[​](#best-practices-for-pay-per-results-actors "Direct link to Best practices for Pay-per-results Actors") To ensure profitable operation: * Set memory limits in your [`actor.json`](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) file to control platform usage costs * Implement the `ACTOR_MAX_PAID_DATASET_ITEMS` check to prevent excess result generation (a minimal sketch is shown at the end of this page) * Test your Actor with various result volumes to determine optimal pricing ## Setting up monetization[​](#setting-up-monetization "Direct link to Setting up monetization") Navigate to your [Actor page](https://console.apify.com/actors?tab=my) in the Apify Console, choose the Actor that you want to monetize, and select the **Publication** tab. ![Monetization section](/assets/images/monetization-section-5ea234343a91208580100eb37c1b9e7f.png) Open the **Monetization** section and complete your billing and payment details. ![Set up monetization](/assets/images/monetize_actor_set_up_monetization-7612e44589223f7e92b8adcd006bc1bb.png) Follow the monetization wizard to configure your pricing model. ![Monetization wizard](/assets/images/monetization_wizard-39bd82ef5ffa7a6f5a9143d2892178a4.png) ### Changing monetization[​](#changing-monetization "Direct link to Changing monetization") You can change the monetization settings of your Actor by using the same wizard as for the setup in the **Monetization** section of your Actor's **Publication** tab. Any changes made to an already published Actor will take *14 days* to come into effect, so that the users of your Actor have time to prepare. Frequency of monetization adjustments Be aware that you can change the monetization settings of each Actor only once per month. For further information & guidelines, please refer to our [Terms & Conditions](https://apify.com/store-terms-and-conditions). ## Payouts & analytics[​](#payouts--analytics "Direct link to Payouts & analytics") Payout invoices are generated automatically on the 14th of each month. Review your invoice in the Settings > Payout section within one week. If not approved by the 20th, the system will auto-approve it on the 21st. Track your Actor's performance through: * The payout section for financial records * Actor Analytics for usage statistics ![Actor analytics](/assets/images/actor_analytics-72d29767ca18eb8c642d199bb488627f.png) * Individual Actor Insights for detailed performance metrics ![Actor insights](/assets/images/actor-insights-5178afe3392983f919cf0f8755be182a.png) ## Promoting your Actor[​](#promoting-your-actor "Direct link to Promoting your Actor") Create SEO-optimized descriptions and README files to improve search engine visibility. Share your Actor on multiple channels: * Post on Reddit, Quora, and social media platforms * Create tutorial videos demonstrating key features * Publish articles about your Actor on relevant websites * Consider creating a product showcase on platforms like Product Hunt Remember to tag Apify in your social media posts for additional exposure. Effective promotion can significantly impact your Actor's success, often making the difference between Actors with many paid users and those with few to none. Learn more about promoting your Actor from [Apify's Marketing Playbook](https://apify.notion.site/3fdc9fd4c8164649a2024c9ca7a2d0da?v=6d262c0b026d49bfa45771cd71f8c9ab). 
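As referenced in the best practices above, here is a minimal sketch of an `ACTOR_MAX_PAID_DATASET_ITEMS` check. It assumes the platform exposes the limit to the run as an environment variable of that name; treat it as an illustration rather than a drop-in implementation:

```
import { Actor } from 'apify';

await Actor.init();

// Assumed: the paid-results cap is available as an environment variable; no value means no cap.
const maxPaidItems = Number(process.env.ACTOR_MAX_PAID_DATASET_ITEMS) || Infinity;
let pushedItems = 0;

const pushIfAllowed = async (item) => {
    // Stop pushing once the cap is reached, so the Actor does not produce results it won't be paid for.
    if (pushedItems >= maxPaidItems) return;
    await Actor.pushData(item);
    pushedItems += 1;
};

await pushIfAllowed({ example: 'item' });

await Actor.exit();
```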
--- # Getting started **Get started with the Apify platform by creating an account and learning about the Apify Console, which is where all Apify Actors are born!** *** Your gateway to the Apify platform is your Apify account. The great thing about creating an account is that we support integration with both Google and GitHub, which takes only about 30 seconds! 1. Create your account on the [sign up](https://console.apify.com/sign-up?asrc=developers_portal) page. 2. Check your email. You should have a verification email with a link. Click it! 3. Done! 👍 ## Getting to know the platform[​](#getting-to-know-the-platform "Direct link to Getting to know the platform") Now that you have an account, you have access to the [Apify Console](https://console.apify.com?asrc=developers_portal), which is a wonderful place where you can utilize all of the features the platform has to offer, as well as manage and test your own projects. ## Next up[​](#next "Direct link to Next up") In our next lesson, we'll learn about something super exciting - **Actors**. Actors are the living and breathing core of the Apify platform and are an extremely powerful concept. What are you waiting for? Let's jump [right into the next lesson](https://docs.apify.com/academy/getting-started/actors.md)! --- # Actors **What is an Actor? How do we create them? Learn the basics of what Actors are, how they work, and try out an Actor yourself right on the Apify platform!** *** After you've followed the **Getting started** lesson, you're almost ready to start creating some Actors! But before we get into that, let's discuss what an Actor is, and a bit about how they work. ## What's an Actor?[​](#what-is-an-actor "Direct link to What's an Actor?") When you deploy your script to the Apify platform, it is then called an **Actor**, which is a [serverless microservice](https://www.datadoghq.com/knowledge-center/serverless-architecture/serverless-microservices/#:~:text=Serverless%20microservices%20are%20cloud-based,suited%20for%20microservice-based%20architectures.) that accepts an input and produces an output. Actors can run for a few seconds, hours, or even indefinitely. An Actor can perform anything from a basic action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. Once an Actor has been pushed to the Apify platform, it can be shared with the world through the [Apify Store](https://apify.com/store), and even monetized after going public. > Though the majority of Actors that are currently on the Apify platform are scrapers, crawlers, or automation software, Actors are not limited to scraping. They can be any program running in a Docker container. ## Actors on the Apify platform[​](#actors-on-platform "Direct link to Actors on the Apify platform") For a super quick and dirty understanding of what a published Actor looks like, and how it works, let's run an SEO audit of **apify.com** using the [SEO audit Actor](https://apify.com/misceres/seo-audit-tool). On the front page of the Actor, click the green **Try for free** button. If you're logged into your Apify account, which you created during the [**Getting started**](https://docs.apify.com/academy/getting-started.md) lesson, you'll be taken to the Apify Console and greeted with a page that looks like this: ![Actor configuration](/assets/images/seo-actor-config-6cde16dcb2bc752723bf7c6ed8364075.png) This is where we can provide input to the Actor. 
The defaults here are just fine, so we'll leave them as they are and click the green **Start** button to run it. While the Actor is running, you'll see it log some information about itself. ![Actor logs](/assets/images/actor-logs-a100ea07b38cdbe0ff6bc9cf3d808472.jpg) After the Actor has completed its run (you'll know this when you see **SEO audit for apify.com finished.** in the logs), the results of the run can be viewed by clicking the **Results** tab, then subsequently the **View in another tab** option under **Export**. ## The "Actors" tab[​](#actors-tab "Direct link to The \"Actors\" tab") While still on the platform, click on the tab which says **Actors**. This tab is your one-stop-shop for seeing which Actors you've used recently, and which ones you've developed yourself. You will be frequently using this tab when developing and testing on the Apify platform. ![The \"Actors\" tab on the Apify platform](/assets/images/actors-tab-393a0bbdf5d8664406d2a146d8486a7e.jpg) Now that you know the basics of what Actors are and how to use them, it's time to develop **an Actor of your own**! ## Next up[​](#next "Direct link to Next up") Get ready, because in the [next lesson](https://docs.apify.com/academy/getting-started/creating-actors.md), you'll be writing your very own Actor! --- # The Apify API **Learn how to use the Apify API to programmatically call your Actors, retrieve data stored on the platform, view Actor logs, and more!** *** [Apify's API](https://docs.apify.com/api/v2.md) is your ticket to the Apify platform without even needing to access the [Apify Console](https://console.apify.com?asrc=developers_portal) web interface. The API is organized around RESTful HTTP endpoints. In this lesson, we'll be learning how to use the Apify API to call an Actor and view its results. We'll be using the Actor we created in the previous lesson, so if you haven't already gotten that one set up, go ahead and do that before moving forward if you'd like to follow along. ## Finding your endpoint[​](#finding-your-endpoint "Direct link to Finding your endpoint") Within one of your Actors on the [Apify Console](https://console.apify.com?asrc=developers_portal) (we'll use the **adding-actor** from the previous lesson), click on the **API** button in the top right-hand corner: ![The \"API\" button on an Actor\'s page on the Apify Console](/assets/images/api-tab-1fb75598685ed64e58605cd51734d19c.jpg) You should see a long list of API endpoints that you can copy and paste elsewhere, or even test right within the **API** modal. Go ahead and copy the endpoint labeled **Run Actor synchronously and get dataset items**. It should look something like this: ``` https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN ``` > In this lesson, we'll only be focusing on this one endpoint, as it is the most popularly used one; however, don't let this limit your curiosity! Take a look at the other endpoints in the **API** window to learn about everything you can do to your Actor programmatically. Now, let's move over to our favorite HTTP client (in this lesson we'll use [Insomnia](https://docs.apify.com/academy/tools/insomnia.md) in order to prepare and send the request). ## Providing input[​](#providing-input "Direct link to Providing input") Our **adding-actor** takes in two input values (`num1` and `num2`). When using the Actor on the platform, provide these fields either through the UI generated by the **INPUT\_SCHEMA.json**, or directly in JSON format. 
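For our **adding-actor**, that JSON input could look something like this (the numbers are arbitrary and match the request we'll send below):

```
{
    "num1": 1,
    "num2": 8
}
```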
When making an API call to run an Actor, the input must be provided in the **body** of the POST request as a JSON object. ![Providing input](/assets/images/provide-input-16fe316e976462f5e2d9ede9158b6b8b.jpg) ## Parameters[​](#parameters "Direct link to Parameters") Let's say we want to run our **adding-actor** via API and view its results in CSV format at the end. We'll achieve this by passing the **format** parameter with a value of **csv** to change the output format: ``` https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN_HERE&format=csv ``` Additional parameters can be passed to this endpoint. You can learn about them in our [API documentation](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md). > Network components can record visited URLs, so it's more secure to send the token as an HTTP header, not as a parameter. The header should look like `Authorization: Bearer YOUR_TOKEN`. Popular HTTP clients, such as [Postman](https://docs.apify.com/academy/tools/postman.md) or [Insomnia](https://docs.apify.com/academy/tools/insomnia.md), provide a convenient way to configure the Authorization header for all your API requests. ## Sending the request[​](#sending-the-request "Direct link to Sending the request") If you're not using an HTTP client, you can send the request through your terminal with this command: ``` curl -d '{"num1":1, "num2":8}' -H "Content-Type: application/json" -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items?token=YOUR_TOKEN_HERE&format=csv" ``` Here's the response we got: ![API response](/assets/images/api-csv-response-486ba68d3939c6f5c9328f8fefa5c7a2.png) And there it is! The Actor was run with our inputs of **num1** and **num2**, then the dataset results were returned back to us in CSV format. ## Apify API's many features[​](#api-many-features "Direct link to Apify API's many features") What we've done in this lesson only scratches the surface of what the Apify API can do. Right from Insomnia, or from any HTTP client, you can [manage datasets](https://docs.apify.com/api/v2/storage-datasets.md) and [key-value stores](https://docs.apify.com/api/v2/storage-key-value-stores.md), [add to request queues](https://docs.apify.com/api/v2/storage-request-queues.md), [update Actors](https://docs.apify.com/api/v2/act-put.md), and much more! Basically, whatever you can do on the platform's web interface, you can also do through the API. ## Next up[​](#next "Direct link to Next up") [Next up](https://docs.apify.com/academy/getting-started/apify-client.md), we'll be learning about how to use Apify's JavaScript and Python clients to interact with the API right within our code. --- # Apify client **Interact with the Apify API in your code by using the apify-client package, which is available for both JavaScript and Python.** *** Now that you've gotten your toes wet with interacting with the Apify API through raw HTTP requests, you're ready to become familiar with the **Apify client**, which is a package available for both JavaScript and Python that allows you to interact with the API in your code without explicitly needing to make any GET or POST requests. This lesson will provide code examples for both Node.js and Python, so regardless of the language you are using, you can follow along! ## Examples[​](#examples "Direct link to Examples") You can access `apify-client` examples on the Actor detail page in Apify Console. 
Click the **API** button and then the **API Client** dropdown button. ![API button](/assets/images/api-button-16287c6b358ebf6ad02c35f2ece5c333.png) ## Installing and importing[​](#installing-and-importing "Direct link to Installing and importing") If you are going to use the client in Node.js, use this command within one of your projects to install the package through npm: ``` npm install apify-client ``` In Python, you can install it from PyPI with this command: ``` pip install apify-client ``` After installing the package, let's make a file named **client** and import the Apify client like so: * Node.js * Python ``` // client.js import { ApifyClient } from 'apify-client'; ``` ``` # client.py from apify_client import ApifyClient ``` ## Running an Actor[​](#running-an-actor "Direct link to Running an Actor") In the last lesson, we ran the **adding-actor** and retrieved its dataset items. That's exactly what we're going to do now, but by using the Apify client instead. Before we can use the client though, we must create a new instance of the `ApifyClient` class and pass it our API token from the [**Integrations** page](https://console.apify.com/account?tab=integrations\&asrc=developers_portal) on the Apify Console: * Node.js * Python ``` const client = new ApifyClient({ token: 'YOUR_TOKEN', }); ``` ``` client = ApifyClient(token='YOUR_TOKEN') ``` > If you are planning on publishing your code to a public GitHub/GitLab repository or anywhere else online, be sure to set your API token as an environment variable, and never hardcode it directly into your script. Now that we've got our instance, we can point to an Actor using the [`client.actor()`](https://docs.apify.com/api/client/js/reference/class/ApifyClient#actor) function, then call the Actor with some input using the [`.call()`](https://docs.apify.com/api/client/js/reference/class/ApifyClient#actor) function - the first parameter of which is the input for the Actor. * Node.js * Python ``` const run = await client.actor('YOUR_USERNAME/adding-actor').call({ num1: 4, num2: 2, }); ``` ``` run = client.actor('YOUR_USERNAME/adding-actor').call(run_input={ 'num1': 4, 'num2': 2 }) ``` > Learn more about the `.call()` function in our [API documentation](https://docs.apify.com/api/client/js/reference/class/ApifyClient#actor). ## Downloading dataset items[​](#downloading-dataset-items "Direct link to Downloading dataset items") Once an Actor's run has completed, it will return a **run info** object that looks something like this: ![Run info object](/assets/images/run-info-5744283cdcb67851aa05d10ef782d69d.jpg) The `run` variable we created in the last section points to the **run info** object of the run we created with the `.call()` function, which means that through this variable, we can access the run's `defaultDatasetId`. This ID can then be passed into the `client.dataset()` function. * Node.js * Python ``` const dataset = client.dataset(run.defaultDatasetId); ``` ``` dataset = client.dataset(run['defaultDatasetId']) ``` Finally, we can download the items in the dataset by using the **list items** function, then log them to the console. 
* Node.js * Python ``` const { items } = await dataset.listItems(); console.log(items); ``` ``` items = dataset.list_items().items print(items) ``` The final code for running the Actor and fetching its dataset items looks like this: * Node.js * Python ``` // client.js import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'YOUR_TOKEN', }); const run = await client.actor('YOUR_USERNAME/adding-actor').call({ num1: 4, num2: 2, }); const dataset = client.dataset(run.defaultDatasetId); const { items } = await dataset.listItems(); console.log(items); ``` ``` # client.py from apify_client import ApifyClient client = ApifyClient(token='YOUR_TOKEN') run = client.actor('YOUR_USERNAME/adding-actor').call(run_input={ 'num1': 4, 'num2': 2 }) dataset = client.dataset(run['defaultDatasetId']) items = dataset.list_items().items print(items) ``` ## Updating an Actor[​](#updating-actor "Direct link to Updating an Actor") If you check the **Settings** tab within your **adding-actor**, you'll notice that the default memory being allocated to the Actor is **2048 MB**. This is a bit overkill considering the fact that the Actor is only adding two numbers together - **256 MB** would be much more reasonable. Also, we can safely say that the run should never take more than 20 seconds (even this is a generous number) and that the default of 3600 seconds is also overkill. Let's change these two Actor settings via the Apify client using the [`actor.update()`](https://docs.apify.com/api/client/js/reference/class/ActorClient#update) function. This function will call the **update Actor** endpoint, which can take `defaultRunOptions` as an input property. You can find the shape of the `defaultRunOptions` in the [API documentation](https://docs.apify.com/api/v2/act-put.md). Perfect! First, we'll create a pointer to our Actor, similar to before (except this time, we won't be using `.call()` at the end): * Node.js * Python ``` const actor = client.actor('YOUR_USERNAME/adding-actor'); ``` ``` actor = client.actor('YOUR_USERNAME/adding-actor') ``` Then, we'll call the `.update()` method on the `actor` variable we created and pass in our new **default run options**: * Node.js * Python ``` await actor.update({ defaultRunOptions: { build: 'latest', memoryMbytes: 256, timeoutSecs: 20, }, }); ``` ``` actor.update(default_run_build='latest', default_run_memory_mbytes=256, default_run_timeout_secs=20) ``` After running the code, go back to the **Settings** page of **adding-actor**. If your default options now look like this, then it worked! ![New run defaults](/assets/images/new-defaults-3782123fb2eca500d91a6db38bf82f13.jpg) ## Overview[​](#overview "Direct link to Overview") You can do so much more with the Apify client than running Actors, updating Actors, and downloading dataset items. The purpose of this lesson was to get you comfortable using the client in your own projects, as it's the absolute best developer tool for integrating the Apify platform with an external system. For a more in-depth understanding of the Apify API client, give these a quick look: * [API client for JavaScript](https://docs.apify.com/api/client/js) * [API client for Python](https://docs.apify.com/api/client/python) ## Next up[​](#next "Direct link to Next up") Now that you're familiar and a bit more comfortable with the Apify platform, you're ready to start deploying your code to Apify! 
In the [next section](https://docs.apify.com/academy/deploying-your-code.md), you'll learn how to take any project written in any programming language and turn it into an Actor. --- # Creating Actors **This lesson offers hands-on experience in building and running Actors in Apify Console using a template. By the end of it, you will be able to build and run your first Actor using an Actor template.** *** You can create an Actor in several ways. You can create one from your own source code hosted in a Git repository or on your local machine, for example. But in this tutorial, we'll focus on the easiest method: selecting an Actor code template. We don't need to install any special software, and everything can be done directly in Apify Console using an Apify account. ## Choose the source[​](#choose-the-source "Direct link to Choose the source") Once you're in Apify Console, go to [Development](https://console.apify.com/actors/development/my-actors), and click on the **Develop new** button in the top right-hand corner. ![Develop an Actor button](/assets/images/develop-new-actor-a499c8a2618fec73c828ddb4dcbb75b4.png) You'll be presented with a page featuring two ways to get started with a new Actor. 1. Creating an Actor from existing source code (using Git providers or pushing the code from your local machine using Apify CLI) 2. Creating an Actor from a code template | Existing source code | Code templates | | ----------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | | ![Create an Actor from source code](/assets/images/create-actor-from-source-code-3b8f6761162e4c51daea94589b9e2407.png) | ![Create an Actor from code templates](/assets/images/create-actor-from-templates-80f2545ea6bf5071f073ab66af3d9973.png) | ## Creating Actor from existing source code[​](#existing-source-code "Direct link to Creating Actor from existing source code") If you already have your code hosted by a Git provider, you can use it to create an Actor by linking the repository. If you use GitHub, you can use our [GitHub integration](https://docs.apify.com/platform/integrations/github.md) to create an Actor from your public or private repository. You can also use GitLab, Bitbucket, or other Git providers or external repositories. ![Create an Actor from Git repository](/assets/images/create-actor-git-0f6cdca6e156997d67fc7078944c97c9.png) You can also push your existing code from your local machine using [Apify CLI](https://docs.apify.com/cli). This is useful when you develop your code locally and then want to push it to the Apify Console to run the code as an Actor in the cloud. For this option, you'll need the [Apify CLI installed](https://docs.apify.com/cli/docs/installation) on your machine. By clicking on the **Push your code using the Apify command-line interface (CLI)** button, you will be presented with instructions on how to push your code to the Apify Console. ![Push your code using the Apify CLI](/assets/images/create-actor-cli-4a172ba02eb3aeda5fc286317274f201.png) ## Creating Actor from code template[​](#code-template "Direct link to Creating Actor from code template") Python, JavaScript, and TypeScript have several template options that you can use. > You can select one from the list on this page or you can browse all the templates in the template library by clicking on the **View all templates** button in the right corner. 
For example, let's choose the **Start with JavaScript** template and click on the template card. ![JavaScript template card](/assets/images/create-actor-template-javascript-card-c532263658eb98fa3d68a1b522c4af94.png) You will end up on a template detail page where you can see all the important information about the template - description, included features, used technologies, and the template's use case. More importantly, there is a code preview and also instructions for how the code works. ![JavaScript template detail page](/assets/images/create-actor-template-detail-page-8ff37bb2c50a5756663f61ffca76a010.png) ### Using the template in the Web IDE[​](#web-ide "Direct link to Using the template in the Web IDE") By clicking the **Use this template** button, you will create the Actor in Apify Console and be moved to the **Code** tab with the [Web IDE](https://docs.apify.com/platform/actors/development/quick-start/web-ide.md), where you can see the code of the template and start editing it. > The Web IDE is a great tool for developing your Actor directly in Apify Console without the need to install or use any other software. ![Web IDE](/assets/images/create-actor-web-ide-53857177e9d96389456c6d0e5feff72a.png) ### Using the template locally[​](#local "Direct link to Using the template locally") If you want to use the template locally, you can again use our [Apify CLI](https://docs.apify.com/cli) to download the template to your local machine. > Creating an Actor from a template locally is a great option if you want to develop your code using your local environment and IDE and then push the final solution back to the Apify Console. When you click on the **Use locally** button, you'll be presented with instructions on how to create an Actor from this template in your local environment. With the Apify CLI installed, you can run the following commands in your terminal: ``` apify create my-actor -t getting_started_node ``` ``` cd my-actor apify run ``` ![Use the template locally](/assets/images/create-actor-template-locally-b4d9caaebe286c60cbc29017f02ab3d4.png) ## Start with scraping single page[​](#scraping-single-page "Direct link to Start with scraping single page") This template is a great starting point for web scraping as it extracts data from a single website. It uses [Axios](https://axios-http.com/docs/intro) for downloading the page content and [Cheerio](https://cheerio.js.org/) for parsing the HTML from the content. Let's see what's inside the **Start with JavaScript** template. The main logic of the template lives in the `src/main.js` file. ``` // Axios - Promise based HTTP client for the browser and node.js (Read more at https://axios-http.com/docs/intro). import { Actor } from 'apify'; import axios from 'axios'; // Cheerio - The fast, flexible & elegant library for parsing and manipulating HTML and XML (Read more at https://cheerio.js.org/). import * as cheerio from 'cheerio'; // Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/). // The init() call configures the Actor for its environment. It's recommended to start every Actor with an init(). await Actor.init(); // Structure of input is defined in input_schema.json const input = await Actor.getInput(); const { url } = input; // Fetch the HTML content of the page. const response = await axios.get(url); // Parse the downloaded HTML with Cheerio to enable data extraction. const $ = cheerio.load(response.data); // Extract all headings from the page (tag name and text). 
const headings = []; $('h1, h2, h3, h4, h5, h6').each((i, element) => { const headingObject = { level: $(element).prop('tagName').toLowerCase(), text: $(element).text(), }; console.log('Extracted heading', headingObject); headings.push(headingObject); }); // Save headings to Dataset - a table-like storage. await Actor.pushData(headings); // Gracefully exit the Actor process. It's recommended to quit all Actors with an exit(). await Actor.exit(); ``` The Actor takes the `url` from the input and then: 1. Sends a request to the URL. 2. Downloads the page's HTML content. 3. Extracts headings (H1 - H6) from the page. 4. Stores the extracted data. The extracted data is stored in the [Dataset](https://docs.apify.com/platform/storage/dataset.md) where you can preview it and download it. We'll show how to do that later in the [Run the Actor](#run-the-actor) section. > Feel free to play around with the code and add some more features to it. For example, you can extract all the links from the page or extract all the images or completely change the logic of this template. Keep in mind that this template uses an [input schema](https://docs.apify.com/academy/deploying-your-code/input-schema.md) defined in the `.actor/input_schema.json` file and linked from `.actor/actor.json`. If you want to change the input schema, you need to change it in those files as well. Learn more about the Actor input and output [on the next page](https://docs.apify.com/academy/getting-started/inputs-outputs.md). ## Build the Actor 🧱[​](#build-an-actor "Direct link to Build the Actor 🧱") In order to run the Actor, you need to [build](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md) it first. Click on the **Build** button at the bottom of the page or the **Build now** button right under the code editor. ![Build the Actor](/assets/images/build-actor-5aaefc12ec3684c08bd92818b88e3576.png) After you've clicked the **Build** button, it'll take around 5–10 seconds to complete the build. You'll know it's finished when you see a green **Start** button. 
![Start button](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAARgAAABsCAMAAACGlF3dAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAHWaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIj4KICAgICAgICAgPGV4aWY6UGl4ZWxZRGltZW5zaW9uPjEwODwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4yODA8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpVc2VyQ29tbWVudD5TY3JlZW5zaG90PC9leGlmOlVzZXJDb21tZW50PgogICAgICA8L3JkZjpEZXNjcmlwdGlvbj4KICAgPC9yZGY6UkRGPgo8L3g6eG1wbWV0YT4KG+KORAAAAAlwSFlzAAAWJQAAFiUBSVIk8AAAAeBQTFRFAIon8/j0hr2JCIsqWqhhyuHKhb2I////0NXrlMWXEYws/f79YKtn0ufUOZlGZ69us9a1v93B/P383e3eYKtm/v7+8ffyjcGRSJ9S+/371enXRp9QJZI3MJY/K5Q8i8CP3O3dzuTOE40vcLN1tte46/TricCOQ55Oebh+rtOvIZE16fTpP5xL3Ozd0ebTmcicOJlFsta0+fv59fn10ObRYaxor9Sxc7V4S6BUq9KuW6ljf7qEX6plX6pm2OrZLZQ8ir+OzeTOzuXPx+DJ7/bw+vz5cbR2y+TNm8idKZM7fbmBoMuj5PHk1Nnta7Bx5+n05uj07/D43uDwNphE9/j7+vz6PptJd7V8xN/FZK1pudi75fLmFY4v2eva5PHljsGRvNq+8Pfxos2l1urYM5dAXqtm6/Tsir+Nl8eaWadhs9a0kMOUmMebkcSVx+DILJU9wN3CLZQ9zuTPjcGQjMGQcrR34/Djmcid9vr3T6NZbbJzqdCssdWzIZE2LpU9p8+pG48xocykvdu/fLiBTaJWhr6KHJAyqNCr7PXtM5dCa7Fx6vPqiL+Nncqhaa9vxuDHSqFV4O/i8vjxgbuFO5pHqtGt3+7gmMicVKZenMmgrtOwu9m95/Lny+LLe7d/pM2m1hvhngAABDpJREFUeNrt3Od301YYBvAnKO4TIQF2HCdx9t6TEDLYe4V0pGxK997QQQu0ZbTs1ULLavuv9oMkK461aGwn5r7307XOPbrHv3N1l3RfUJJnghAIjMAIjMAIjMAIjMAIjMAIjCSBERiBKS7M4CdbNzZh2SdysXdo2rg1Phgd5huURlo8DACgJiLMtnqoBYPa1igwnwGqwQDD4TDboCIMWkNh6tWEqQ2DOQM1YbJ74JdzYFqgKgxaXIXXK3Jg4urCxF2FV3NhutSFaXYVXsmF2aQuTJWr8FouTJO6MF+7CgdzYaAuzDyLCoERGIERGIERGIERGIERGIERmP8Do1MvFxife+gCk5PKrcK6wHjDRJAZ+PVSx+VG5WDCaC5WWqXWTigHEyTTNuQW6wUA9BiGEd58+g3jbgnDxMpjITTvkmRn66rqTKkVJFeH3r2MXFXKMIAeKNNNctc4AHSlyH6VYIJpPiT5uZXtIHkYA2YvyX9Msw8A0DB7oLfHfrAumOY0MPDXw0dalbmWnDHNeyUN48jEPKZ775CssrI/TU1NzeENp5INAPbVWfmxKwDwEsnGapLr651C1YuH8fjspWgwAY3mGMk6bd6F+TDdaeeHcceGuc98w+gL/51eTJhM9Qtr1VIkP/i0L3NhReJPkn8nEpNAjGT/08TvJE/YMGTliVNdDYmEQY4lErfz8Cjpz+OSdxjfRtNrXd65o8dpN5nOV0uR6wAgSaY0G2Zv/jtf/TlcCgDj1L+wo1nv3DW943A2DNpOngQATJCctmCOF2JU0qO7FALGntLk9MAff2FPfblr0mu47hs4QnKNBWMWZLjWI7sUACZgpqd9O/pdiiTTv2XDPBh95qDZMCsLM4/RIy948w1T7j9i20uDvQbJI1kwPZVunYWFsWR0FBsmdGEAAO0k35oPM0KSp44+efxv4WGgR9wgKdYE76vNmz9y8vYglIEZJjkBAONFgIGuo8gwgc2lkzR+cVvMqA1zAwDqyEMAgOPeMP0vxCLSp3fpINn5AwDsTpMcB7Cb5B4N1vdu7QBuefUxnSS7X+Bth4Y0SZYNJsdIcgYA5kjSuGZan3XdXFft2fkOkWTZnhLfqPIfjNC90y021AZ7omutlewlJOs8YKbztVZaSpjAlygrjzpbmwfsK/vOnyW5H0DHIZKVV0dIHgP+IOmuN9uTJN8uZZhYaMmR/ed+/rEv6wm73mBlVs9Oar6kje9ppQsTW/J3bssTRl64eU4mY/KKdhkngREYgREYgREYgREYgREYgRGY4sAofFj0y8DDonK82Od4sRxI9zmQLiEMfEIYtKgLkwwMesEaVWHeDA6TIoF1fCMOtaoJsz08eNewijDfRwn3drpWNZja7REDBNaoBRM1QCDJlnhz1RYVYLa83xxPShBSic4qMAIjMAIjMAIjMAIjMJIERmAERmAEZsnTf45EnbI+9eB+AAAAAElFTkSuQmCC) ## Fill the input[​](#fill-input "Direct link to Fill the input") And now we are ready to run the Actor. But before we do that, let's give the Actor some input by going to the `Input` tab. The input tab is where you can provide the Actor with some meaningful input. In this case, we'll be providing the Actor with a URL to scrape. For now, we'll use the prefilled value of [Apify website](https://apify.com/) (`https://apify.com/`). You can change the website you want to extract the data from by changing the URL in the input field. 
![Input tab](/assets/images/actor-input-tab-93256e980a452661e0a608910bddecb1.png) ## Run the Actor[​](#run-the-actor "Direct link to Run the Actor") Once you have provided the Actor with some URL you want to extract the data from, click the **Start** button and wait a few seconds. You should see the Actor run logs in the **Last run** tab. ![Actor run logs](/assets/images/actor-run-1c928e9040dac9112be91f2bfbfde02f.png) After the Actor finishes, you can preview or download the extracted data by clicking on the **Export X results** button. ![Export results](/assets/images/actor-run-dataset-a27223a2b496df661e18f8e311c9bfc4.png) And that's it! You've just created your first Actor and extracted data from a website 🎉. ## Getting stuck? Check out the tips 💡[​](#get-help-with-tips "Direct link to Getting stuck? Check out the tips 💡") If you ever get stuck, you can always click on the **Tips** button in the top right corner of the page. It will show you a list of tips that are relevant to Actor development. ![Tips](/assets/images/actor-tips-a80adc9d100c09064235edd7143300c9.png) ## Next up[​](#next "Direct link to Next up") We've created an Actor, but how can we give it more complex inputs and make it do stuff based on these inputs? This is exactly what we'll be discussing in the [next lesson](https://docs.apify.com/academy/getting-started/inputs-outputs.md)'s activity. --- # Inputs & outputs **Create an Actor from scratch which takes an input, processes that input, and then outputs a result that can be used elsewhere.** *** Actors, like any other programs, take inputs and generate outputs. The Apify platform has a way to specify what inputs the Actor expects, and a way to temporarily or permanently store its results. In this lesson, we'll be demonstrating inputs and outputs by building an Actor which takes two numbers as input, adds them up, and then outputs the result. ## Accept input into an Actor[​](#accept-input "Direct link to Accept input into an Actor") Let's first create another new Actor using the same template as before. Feel free to refer to the [previous lesson](https://docs.apify.com/academy/getting-started/creating-actors.md) for a refresher on how to do this. Replace all of the code in **main.js** with this code snippet: ``` import { Actor } from 'apify'; await Actor.init(); // Grab our numbers which were inputted const { num1, num2 } = await Actor.getInput(); // Calculate the solution const solution = num1 + num2; // Push the solution to the dataset await Actor.pushData({ solution }); await Actor.exit(); ``` Then, replace everything in **INPUT\_SCHEMA.json** with this: > This step isn't necessary, as the Actor will still be able to take input in JSON format without it; however, we are providing the content for this Actor's input schema in this lesson, as it will give the Apify platform a blueprint off of which it can generate a nice UI for your inputs, as well as validate their values. 
``` { "title": "Number adder", "type": "object", "schemaVersion": 1, "properties": { "num1": { "title": "1st Number", "type": "integer", "description": "First number.", "editor": "number" }, "num2": { "title": "2nd Number", "type": "integer", "description": "Second number.", "editor": "number" } }, "required": ["num1", "num2"] } ``` > If you're interested in learning more about how the code works, and what the **INPUT\_SCHEMA.json** means, read about [inputs](https://docs.apify.com/sdk/js/docs/examples/accept-user-input) and [adding data to a dataset](https://docs.apify.com/sdk/js/docs/examples/add-data-to-dataset) in the Apify SDK documentation, and refer to the [input schema docs](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1.md#integer). Finally, **Save** and **Build** the Actor just as you did in the previous lesson. ## Configuring an Actor with inputs[​](#configuring "Direct link to Configuring an Actor with inputs") If you scroll down a bit, you'll find the **Developer console** located under the multifile editor. By default, after running a build, the **Last build** tab will be selected, where you can see all of the logs related to building the Actor. Inputs can be configured within the **Input** tab. ![Configuring inputs](/assets/images/configure-inputs-0efc6f6ade028079e5da7b87e966bdcf.jpg) Enter any two numbers you'd like, then press **Start**. The Actor's run should be completed almost immediately. ## View Actor results[​](#view-results "Direct link to View Actor results") Since we've pushed the result into the default dataset, it, and some info about it can be viewed by clicking this box, which will take you to the results tab: ![Result box](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAWwAAAFECAMAAAA0vvTQAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAHWaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIj4KICAgICAgICAgPGV4aWY6UGl4ZWxZRGltZW5zaW9uPjMyNDwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4zNjQ8L2V4aWY6UGl4ZWxYRGltZW5zaW9uPgogICAgICAgICA8ZXhpZjpVc2VyQ29tbWVudD5TY3JlZW5zaG90PC9leGlmOlVzZXJDb21tZW50PgogICAgICA8L3JkZjpEZXNjcmlwdGlvbj4KICAgPC9yZGY6UkRGPgo8L3g6eG1wbWV0YT4KrRR+yQAAAAlwSFlzAAAWJQAAFiUBSVIk8AAAAaRQTFRF////+Pj5Uq7/5eXm/f39Jig3+vr6U4/4/Pz8/v7/8vPz+fn57u7v9fb2VJD58/P08PHx7O3u3N3e6erqU4/64OP14eP1q9b/kpOb/P39+/v71+r/gsL/9PX1/Pz9+Pj4XF5p9/f49fX2srK3MTNByMnNm5yjOTtI8vLz/v//dXaAz9DT7Ozt8vL07/DwRkhV6+vs6OjqUVJf5Ob2SktY9PT88/T7vLzBVpD5XJb32eX+Zmdyl5igf4GK5ufor7C2+vv74uLk6enr7e7vbnB68PDxpqetxMXJKy074eP26uz45+n37e/5qqux8fb/4+Pl3+DhtM79h6vsi6/wg6775e3/m778j7b8eKb61NTXU1Vhtba6ubq/o6SsbZ7zb5/0xdHl+fn7z9ruWVpmg4SNzs7Rn6CnjI2WrK20lZaey8zPv8DEQkRRhoeQeHmD9/f9XmBra9mFsMXtbJ7y2dnce3yFc3R+bG137e7wYmNup73k4+fvrMHq1tjg293l4ePrw87jztntj7LsjrDqvMrklbfx+vr7xdPtlLbwh4iR5OTmVY/6ECaeuAAAC1hJREFUeNrt3PlbGtcaB3CIzJwhOAHTUoYxrBEh1aqoiWAiKm5ZqnWLSxqrTWKWZul9Ytp6u6eL7b33n77nzAz7oDODpafm+/2FYWB4fD68z3sWEJcLQRAEQRAEQRAEQRAEQXiIFzk5pyztQcxyat7aa0iSenR0JIo+xCyiSHVUSWrVW6M+8iU/mjuc6UKaZeZw7qOk70hqRZt1Dukokfnu9dpuPkeQZsnld9def5dJHEkep9zMWk1kHqzpzhca8j6i5YIuvvYgk1AdarOy/lzu+p2+zvrK+JIXk7LmVEvjK+vU6fcu+XNW3M6s/Q92Cdm4C04LubtByO4DvxNtNtE78v+RJ1vjcLSY8S2S/8N/5LGrzaynkm/zZBPtw4baJsm/TU7Z1GbWUqJ/l2xC0FY2yW5/QrKnTbGjvrk/yRbq2mZtb5E/53xRO9jMWk2+JQT92nbfJuRtUrWhrTUR
98Aa2QCe7WyQtQG3jUaiFbb8U45gzudgBkhyP8k2SptNscXAa7IOOgdZJ68DouXJNitsKda5RlYg5yArZK0zJlktbYodVuWhzzA8OhwiPxuS1bBVbDo8qnIxT5Yg5yBLJF+UVTZEWsOOSmIymyOYZDuaapNcNimyPmK1ZYuRArkAOEchpBARLTZtrWX7gN0Sts9i09ax/R3Adozd4beBLakxPyq7hcr2x1TJOrYb2C1hu21gi260kZbaiNviGpJNs0X3eWC3gH2eYXssYUvAPgVsCdjABjawgQ1sBNjABjawgQ1sBNjABjawgQ1sBNjARoANbGADG9jARoANbGADG9jARoANbGADG9jAfjewr94euRUMBj84KbdGbl8FdmvUI0Er0jTseSNXge08HweDd65NXbHy1CtT1+4Egx8DuwXra3aef82WNrBrekgw+KW9K74MBq8C21FG7NW1XtsjwHZW2HfsX3THemkDuyq37Rc2K+3b3GJ3Xzp3LsRtF5myf9GU9T7SZmxGzcIn963gFfsXXQne4hK7RM0rd/ADR1cFOcSupuaT+8xgl6mpcYhTbutszt6iNmFXU7PwyX0msOupeeVuypZKKUoq9U/ANqPmk9ucLRUXjCgK59jNqHnkNmNThJooPGOf4MkXd2PPNqo6HleUeLwJNz/YJ1oa3Hxia2UdLzdrJW6mzR126KQuwyW2Zp1qOKPwit0dOo67RH2pm0dsVsdxsx6ucDsbCTXl5oy6HlsxHw+V+reAr3m2OTd31HXY1da9wzdvDvc2eRN4W0E2cnNIXYtdRTo9rM9DhqdNtblbrtdxc0ldi10FOlyaZA9XNfMUx9jV3JxS12ArlcbcS2nnk8l5etNb0Y7zOBu5FGrgrqfuvsTdbESpql1a2PPslmrfK611qkubr3m2GXc1NYfz7HhVV74pCEl2GxGEmyaFz/EKUueupwb26bSRc43cjdTc7Y2kTNrITm0bUbhc1DT15Iy6ZoCM1w2QkchO7QCp8LqoMTXljroGO2U29btXNXwet1f4Ny9qGlw5pD5xUXOvyaKGuw98a7m5pHa8XOfwqwwVbk6p69nizTeiFBeX2N1Vu0+hZt8b6Q5xucVquu3XuO/KD3bdtM+MOsT3hwc13CmzDw+4wjbhbjjBJbbxsVgZN2X+sRg32CGTQr4UsjAJ5wK79Nl61Qe+8ZSLW+xjPUPn+Nr+M2Oz8FUGrmYjoeO/pMPPTqs5m1L2jqesX8XTooZD6mPYjv/6GeeLGi6pHX+xkutFDafUZ+rL8PWLGu4+FTtT2DXc/FGfNewyN4/UZ+wfmErcfFKfuX/N4ztn7p9OeQ7+nbq9fQQ/FNDG0sZPYLQv+HGXNmvjZ4va10nwg1xt5b49csuKNH5qjusAG9jABjawgY0AG9jABjawgY0AG9jABjawgQ1sYAMbATawgQ1sYAMbATawgQ1sYAMbATawEWADG9jABjawEWADG9jABjawEWADGwE2sIENbGADGwE2sIENbGADGwE2sIENbGADG9jABjYCbGADG9jABjYCbGADG9jABjYCbGAjwAY2sIENbGAjwAY2sIENbGAjwAY2AmxgAxvY7yp2QBCEr4Btku3hdDo9dqp/difF/hDYDYn2CiyX/xpseWJiYh7YRrL/Ev5K7CF6cB3Yeu73CMBuE/Z95jz6NbDbgf2UWjzx9dZjZ8bGxpZdiZQy8alu91R59Z+dN5UnRPoWJ14+7Uvq9xbGxlLGA1+MjS1UsPvGFtmbSV+tCGyKnaY2Ddh99MSNBVb1s/Re4pHRah4uG49PGid6nmt304IwaDwySk9WsI3xgOZTYLueftPpaoI90WNgD9wrk/Xss0elifIJ4QmwLb+8pL2sKTZL+gUVm2VHgxP/1k5kSnX91eKzOLv973HYz2bZVT2zs7N9wDbSBPtxgR0/Z4esJQ8MGpV8nd7eZw89owfKcdgYIK1iG2vKb8pNYJo2lrTH5WLoPm09dHlxcQHYp4Bt8MmseRinX9HjbZ3/iVT1dGC3iv1IP7yorXj0DOr1/lgbKx/d7wD2aWHf0A9TQm0W6VSwNMl4cWMa2KeJ/bwOe4dNvF+WFvnCY2CfIvY+I53Us7gzOamva9z3JwZ17fla7EFgt4C9TA/jpld1Tb6gj93TsXuMP0EAdgvYEusYbvPrCqy0Ey4XW2L6y29NA/ZDYFvEdrG1+axHP0yn0xdd26Ojo4+0Pyea1mfcE0Yvd3ln67HZgZABtkXsN9rivE/1LLOd2FGvS2TESoaNk/RgmD5lUWveQ/L2daEem/UV4etUUQW2FWxtUV5Oqnxi+KE2JXluNI9KarBdD7ERZQc7qlQgtSV8ecuVwi5oT1ko3X9WP/VzbQO7No/rNZhQb+XuwnVd7GXpI4AFvV5fvCqdGNN2ANOXvcP64p59leGVsQR90lPauAK2pWTefLFfrP4TvIX9Pn/1M8Tt/Y6o+cWeTCaBARIBNrCBDWxgAxsBNrCBDWxgAxsBNrARYAMb2MAGNrARYAMb2MAGNrARYAMbcYhN4NY+7PfhBmxgI8AGNrCBDWxgI8AGNrCBDWxgI+3A/h/c2oeN/WxgAxsBNrCBDWxgAxsB9j8ZO0e8gHMQL8nZwfZQbH9HnixBzkGWSL7DT7E9lrC9DDtSOCDjkHOQcXJQiDBsr0Vs1R0prpIVyDnIClktRtyqDeyY3P89WYecg6yT7/vlmHXssOqTAz/myF3Q2c5dkvsxIPvUsA3sWCa7SjZgZzsbZDWbidnAjkpiwj+zRzBEOhgeyd6MPyFKUUvYLg3bLWcKv5AtTLVtTrK3yC+FjOzWsC1d4GEjZOJ81/QB2YSfrWySg+mu8wk2PlrE1pp2IhIY2suTTdS2jbreJPm9oUAkYbVll5q2OxGZK/yWJ1vo25b79RbJ/1aYiyTcllu2MdOmpZ05zO4dELKBGaClOd8GIQd72cMMK2zJqnWltAMz2eVfCSHrK+NLaCfHgC2Nr6xTp1+XszMBe4XNsLXSliOZzuInP6/mCHJycqs/f1LszERkrbAtY+ulrWsPdGWnf/h29SD/johfsJOyc/5g9dsfprNdA4a19cIuNxJNO3DYVeyYXr6IHJfl6Y5i12FAsxbtWWuNJKxr+yn3TP/QUPE9pFmKQ0P9M5Tar1uHPbawK9qJCOMOzB0eDnQi5hk4PJwLMOpIwol1SVv1uSm33+/P0AQQ8zAcakSp3bRf27fWtekoKTJumZZ3hJkjZtF0ZEYtsrHRvnVJm3HHqDeNjJhH03HHGLVDa6ZtcKui6KPiSLPEfD5RVEvUzpZ/mrYnGpYYOCNHTMNwqFGYUju21rk172iYiSPNEg5HDelWdjW8BrcWoJqmxONt0brsXYOO1KQCdDq7WsjJweYngiAIgiAIgvxt+T8EaudaPLWiUwAAAABJRU5ErkJggg==) On the results tab, there are a whole lot of options for which format to view/download the data in. 
Keep the default of **JSON** selected, and click on **Preview**. ![Dataset preview](/assets/images/dataset-preview-da23f5956de7eccb38a691f09fd3dd1c.png) There's our solution! Did it work for you as well? Now, we can download the data right from the results tab to be used elsewhere, or even programmatically retrieve it by using [Apify's API](https://docs.apify.com/api/v2.md) (we'll be discussing how to do this in the next lesson). It's important to note that the default dataset of the Actor, which we pushed our solution to, will be retained for 7 days. If we wanted the data to be retained for an indefinite period of time, we'd have to use a named dataset. For more information about named storages vs unnamed storages, read a bit about [data retention on the Apify platform](https://docs.apify.com/platform/storage/usage.md#data-retention). ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/getting-started/apify-api.md)'s fun activity, you'll learn how to call the Actor we created in this lesson programmatically using one of Apify's most powerful tools - the Apify API. --- # Why a glossary? **Browse important web scraping concepts, tools and topics in succinct articles explaining common web development terms in a web scraping and automation context.** *** Web scraping comes with a lot of terms that are specific to the area. Some of them are tools and libraries, like [Playwright](https://docs.apify.com/academy/puppeteer-playwright.md) or Insomnia. Others are general topics that have a special place in web scraping, like headless browsers or browser fingerprints. And some topics are related to all web development, but play a special role in web scraping, such as HTTP headers and cookies. When writing the Academy, we realized very early on that we needed a place to reference these terms, but quickly found out that the usual tutorials and guides available all over the web weren't ideal. The explanations were too broad and generic and did not fit the web scraping context. With the **Apify Academy** glossary, we aim to provide you with short articles and lessons that give the necessary web scraping context for specific terms, then link to other parts of the web for further in-depth reading. --- # Node.js Tutorials 💻📚 **A collection of various Node.js tutorials on scraping sitemaps, optimizing your scrapers, using popular Node.js web scraping libraries, and more.** *** This section contains various web scraping tutorials and related guides for Node.js. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow Puppeteer scraper, or need some general tips for scraping in Node.js, this section is right for you. --- # How to add external libraries to Web Scraper Sometimes you need to use some extra JavaScript in your [Web Scraper](https://apify.com/apify/web-scraper) page functions. Whether it is to work with dates and times using [Moment.js](https://momentjs.com/), or to manipulate the DOM using [jQuery](https://jquery.com/), libraries save precious time and make your code more concise and readable. Web Scraper already provides a way to add jQuery to your page functions. All you need to do is check the Inject jQuery input option. There's also the option to Inject Underscore, a popular helper function library. 
In this tutorial, we'll learn how to inject any JavaScript library into your page functions, with the only limitation being that the library needs to be available somewhere on the internet as a downloadable file (typically a CDN). ## Injecting Moment.js[​](#injecting-momentjs "Direct link to Injecting Moment.js") Moment.js is a very popular library for working with date and time. It helps you with the parsing, manipulation, and formatting of datetime values in multiple locales and has become the de facto standard for this kind of work in JavaScript. To inject Moment.js into our page function (or any other library using the same method), we first need to find a link to download it from. We can find it in [Moment.js' documentation](https://momentjs.com/docs/#/use-it/browser/) under the CDN links. > Now we have two options. Inject the library using plain JavaScript, or if you prefer working with jQuery, use a jQuery helper. ## Injecting a library with plain JavaScript[​](#injecting-a-library-with-plain-javascript "Direct link to Injecting a library with plain JavaScript") ``` async function pageFunction(context) { const libraryUrl = 'https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js'; // Inject Moment.js await new Promise((resolve) => { const script = document.createElement('script'); script.src = libraryUrl; script.addEventListener('load', resolve); document.body.append(script); }); // Confirm that it works. const now = moment().format('ddd, hA'); context.log.info(`NOW: ${now}`); } ``` We're creating a script element in the page's DOM and waiting for the script to load. Afterwards, we confirm that the library has been successfully loaded by using one of its functions. ## Injecting a library using jQuery[​](#injecting-a-library-using-jquery "Direct link to Injecting a library using jQuery") After you select the Inject jQuery input option, jQuery will become available in your page function as `context.jQuery`. ``` async function pageFunction(context) { const libraryUrl = 'https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.24.0/moment.min.js'; const $ = context.jQuery; // Inject Moment.js await $.getScript(libraryUrl); // Confirm that it works. const now = moment().format('ddd, hA'); context.log.info(`NOW: ${now}`); } ``` With jQuery, we're using the `$.getScript()` helper to fetch the script for us and wait for it to load. ## Dealing with errors[​](#dealing-with-errors "Direct link to Dealing with errors") Some websites employ security measures that disallow loading external scripts within their pages. Luckily, those measures can be overridden with Web Scraper. If you are encountering errors saying that your library cannot be loaded due to a security policy, select the Ignore CORS and CSP input option at the very bottom of the Web Scraper input and the errors should go away. Happy scraping! --- # How to analyze and fix errors when scraping a website **Learn how to deal with random crashes in your web-scraping and automation jobs. Find out the essentials of debugging and fixing problems in your crawlers.** *** Debugging is absolutely essential in programming. Even if you don't call yourself a programmer, having basic debugging skills will make building crawlers easier. It will also help you save money by allowing you to avoid hiring an expensive developer to solve your issue for you. This quick lesson covers the absolute basics by discussing some of the most common problems and the simplest tools for analyzing and fixing them. 
## Possible causes[​](#possible-causes "Direct link to Possible causes") It is often tricky to see the full scope of what can go wrong. We assume that once the code is set up correctly, it will keep working. Unfortunately, that is rarely true in the realm of web scraping and automation. Websites change, they introduce new [anti-scraping technologies](https://docs.apify.com/academy/anti-scraping.md), programming tools change and, in addition, people make mistakes. Here are the most common reasons your working solution may break. * The website changes its layout or [data feed](https://www.datafeedwatch.com/academy/data-feed). * A site's layout changes depending on location or uses [A/B testing](https://www.youtube.com/watch?v=XDoKXaGrUxE\&feature=youtu.be). * A page starts to block you (recognizes you as a bot). * The website [loads its data later dynamically](https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md), so the code works only sometimes, if you are slow or lucky enough. * You made a mistake when updating your code. * Your [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) aren't working. * You have upgraded your [dependencies](https://www.quora.com/What-is-a-dependency-in-coding) (other software that your software relies upon), and the new versions no longer work (this is harder to debug). ## Diagnosing/analyzing the issue[​](#issue-analysis "Direct link to Diagnosing/analyzing the issue") Web scraping and automation are very specific types of programming. It is not possible to rely on specialized debugging tools, since the code does not output the same results every time. However, there are still many ways to diagnose issues in a crawler. > Many issues are edge cases, which occur in one of a thousand pages or are time-dependent. Because of this, you cannot rely only on [determinism](https://en.wikipedia.org/wiki/Deterministic_algorithm). ### Logging[​](#logging "Direct link to Logging") Logging is an essential tool for any programmer. When used correctly, it helps you capture a surprising amount of information. Here are some general rules for logging: * Usually, **many logs** are better than **no logs** at all. * Putting more information into one line, rather than logging multiple short lines, helps reduce the overall log size. * Focus on numbers. Log how many items you extract from a page, etc. * Structure your logs and use the same structure in all your logs. * Append the current page's URL to each log. This lets you immediately open that page and review it. Here's an example of what a structured log message might look like: ``` [CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store ``` The log begins with the **page type**. Usually, we use labels such as **\[CATEGORY]** and **\[DETAIL]**. Then, we log important numbers and other information. Finally, we add the page's URL, so we can check if the log is correct. #### Logging errors[​](#logging-errors "Direct link to Logging errors") Errors require a different approach because, if your code crashes, your usual logs will not be printed. Instead, exception handlers will print the error, but these are usually ugly messages with a [stack trace](https://en.wikipedia.org/wiki/Stack_trace) that only experts will understand. You can overcome this by adding [try/catch blocks](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/try...catch) into your code. 
In the catch block, explain what happened and re-throw the error (so the request is automatically retried).

```
try {
    // Sensitive code block
    // ...
} catch (error) {
    // You know where the code crashed so you can explain here
    throw new Error('Request failed during login with an error', { cause: error });
}
```

Read more information about logging and error handling in our developer [best practices](https://docs.apify.com/academy/web-scraping-for-beginners/best-practices.md) section.

### Saving snapshots[​](#saving-snapshots "Direct link to Saving snapshots")

By snapshots, we mean **screenshots** if you use a [browser with Puppeteer/Playwright](https://docs.apify.com/academy/puppeteer-playwright.md) and HTML saved into a [key-value store](https://crawlee.dev/api/core/class/KeyValueStore) that you can display in your own browser. Snapshots are useful throughout your code but especially important in error handling.

Note that an error can happen in only a few pages out of a thousand and look completely random. There is not much you can do other than save and analyze a snapshot.

Snapshots can tell you if:

* A website has changed its layout. This can also mean A/B testing or different content for different locations.
* You have been blocked—you open a [CAPTCHA](https://en.wikipedia.org/wiki/CAPTCHA) or an **Access Denied** page.
* Data loads later dynamically—the page is empty.
* The page was redirected—the content is different.

You can learn how to take snapshots in Puppeteer or Playwright in [this short lesson](https://docs.apify.com/academy/puppeteer-playwright/page/page-methods.md).

#### When to save snapshots[​](#when-to-save-snapshots "Direct link to When to save snapshots")

The most common approach is to save on error. We can enhance our previous try/catch block like this:

```
import { Actor } from 'apify';
import { puppeteerUtils } from 'crawlee';

// ...

// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Actor.getEnv().defaultKeyValueStoreId;

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await puppeteerUtils.saveSnapshot(page, { key });
    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg`;

    // You know where the code crashed so you can explain here
    throw new Error(`Request failed during login with an error. Screenshot: ${screenshotLink}`, { cause: error });
}
// ...
```

To make the error snapshot descriptive, we name it **ERROR-LOGIN**. We add a random number so the next **ERROR-LOGIN**s will not overwrite this one and we can see all the snapshots. If you can use an ID of some sort, it is even better.

**Beware:**

* The snapshot's **name** (key) can only contain letter, number, dot, and dash characters. Other characters will cause an error, which makes the random number a safe pick.
* Do not overdo the snapshots. Once you get out of the testing phase, limit them to critical places. Saving snapshots uses resources.

### Error reporting[​](#error-reporting "Direct link to Error reporting")

Logging and snapshotting are great tools, but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system.
## With the Apify SDK[​](#with-the-apify-sdk "Direct link to With the Apify SDK") This example extends our snapshot solution above by creating a [named dataset](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages) (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check. ``` import { Actor } from 'apify'; import { puppeteerUtils } from 'crawlee'; await Actor.init(); // ... // Let's create reporting dataset // If you already have one, this will continue adding to it const reportingDataset = await Actor.openDataset('REPORTING'); try { // Sensitive code block // ... } catch (error) { // Change the way you save it depending on what tool you use const randomNumber = Math.random(); const key = `ERROR-LOGIN-${randomNumber}`; // The store gets removed with the run after data retention period so the links will stop working eventually // You can store the snapshots infinitely in a named KV store by adding `keyValueStoreName` option await puppeteerUtils.saveSnapshot(page, { key }); // To create the reporting URLs, we need to know the Key-Value store and run IDs const { actorRunId, defaultKeyValueStoreId } = Actor.getEnv(); // We create a report object const report = { errorType: 'login', errorMessage: error.toString(), // .html and .jpg file extensions are added automatically by the saveSnapshot function htmlSnapshotUrl: `https://api.apify.com/v2/key-value-stores/${defaultKeyValueStoreId}/records/${key}.html`, screenshotUrl: `https://api.apify.com/v2/key-value-stores/${defaultKeyValueStoreId}/records/${key}.jpg`, runUrl: `https://console.apify.com/actors/runs/${actorRunId}`, }; // And we push the report to our reporting dataset await reportingDataset.pushData(report); // You know where the code crashed so you can explain here throw new Error('Request failed during login with an error', { cause: error }); } // ... await Actor.exit(); ``` --- # Apify's free Google SERP API You need to regularly grab SERP data about your target keywords? Apify provides a free SERP API that includes organic search, ads, people also ask, etc. Free Apify accounts come with unlimited proxy access and $5 of credit. To get started, head over to the [Google Search Results Scraper](https://apify.com/apify/google-search-scraper) page and click the `Try me` button. You'll be taken to a page where you can enter the search query, region, language and other settings. ![Apify Google SERP API](/assets/images/gserp-api-2621c8ee29f74544ef0ec986a4a8989a.png) Hit `Save & Run` and you'll have the downloaded data as soon as the query finishes. To have it run at a regular frequency, you can set up the task to run on an [automatic schedule](https://docs.apify.com/platform/schedules.md#setting-up-a-new-schedule). To run from the API, send a [synchronous POST request](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md) to an endpoint such as `https://api.apify.com/v2/acts/TASK_NAME_OR_ID/runs?token=YOUR_TOKEN`. Include any required input in a JSON object in the request's body. Keep in mind that, as Google search uses a non-deterministic algorithm, output results may vary even if the input settings are exactly the same. --- # Avoid EACCES error in Actor builds with a custom Dockerfile Sometimes when building an Actor using a custom Dockerfile, you might receive errors like: ``` Missing write access to ... 
```

or

```
EACCES: permission denied
```

This problem is usually caused by the fact that, by default, the `COPY` Dockerfile instruction copies files as the root user (with UID and GID of 0), while your Dockerfile probably uses another user to copy files and run commands.

To fix this problem, make sure the `COPY` instruction in your Dockerfile uses the `--chown` flag. For example, instead of

```
COPY . ./
```

use

```
COPY --chown=myuser:myuser . ./
```

where `myuser` is the user and group defined by the `USER` instruction in the base Docker image.

To learn more, see the [Dockerfile documentation](https://docs.docker.com/reference/dockerfile/#copy).

Hope this helps!

---

# Block requests in Puppeteer

Improve Performance: Use `blockRequests`

Unfortunately, in recent versions of Puppeteer, request interception disables the native cache and slows down the Actor significantly. Therefore, it's not recommended to follow the examples shown in this article. Instead, use the [`blockRequests`](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#BlockRequestsOptions) utility function from [Crawlee](https://crawlee.dev). It works through a different mechanism and doesn't slow down your process.

When using Puppeteer, a webpage will often load many resources that are not actually necessary for your use case. For example, a page could load many tracking libraries that are completely unnecessary for most crawlers but will cause the page to use more traffic and load more slowly.

For example, for this web page, if we run an Actor that measures the data downloaded from each response until the page is fully loaded, we get these results:

![Actor loading](/assets/images/actor-load-e6fc832092a1c94156fd96b3522c2c3b.png)

Now, if we want to optimize this to keep the webpage looking the same but ignore unnecessary requests, then after

```
const page = await browser.newPage();
```

we can use this piece of code

```
await page.setRequestInterception(true);

page.on('request', (request) => {
    if (someCondition) request.abort();
    else request.continue();
});
```

where `someCondition` is a custom condition (not actually implemented in the code above) that checks whether a request should be aborted. For our example, we will only disable some tracking scripts and then check if everything looks the same. Here is the code used:

```
await page.setRequestInterception(true);

page.on('request', (request) => {
    const url = request.url();
    const filters = [
        'livefyre',
        'moatad',
        'analytics',
        'controltag',
        'chartbeat',
    ];
    const shouldAbort = filters.some((urlPart) => url.includes(urlPart));
    if (shouldAbort) request.abort();
    else request.continue();
});
```

With this code set up, this is the output:

![Improved Actor loading](/assets/images/improved-actor-loading-a1e7b6b855bb90ba1780f19f3653a34c.png)

And except for different ads, the page should look the same. From this, we can see that just by blocking a few analytics and tracking scripts, the page loaded nearly 25 seconds faster and downloaded 35% less data (approximately, since the data is measured after it's decompressed).

Hopefully this helps you make your solutions faster and use fewer resources.

---

# How to optimize Puppeteer by caching responses

**Learn why it is important for performance to cache responses in memory when intercepting requests in Puppeteer and how to implement it in your code.**

***

> In the latest version of Puppeteer, the request-interception function inconveniently disables the native cache and significantly slows down the crawler.
Therefore, it's not recommended to follow the examples shown in this article unless you have a very specific use case where the default browser cache is not enough (e.g. caching over multiple scraper runs).

When running crawlers that go through a single website, each open page has to load all resources again. The problem is that each resource needs to be downloaded through the network, which can be slow and/or unstable (especially when proxies are used). For this reason, in this article, we will take a look at how to use memory to cache responses in Puppeteer (only those that contain the **cache-control** header with **max-age** above **0**).

In this example, we will use a scraper which goes through top stories on the CNN website and takes a screenshot of each opened page. The scraper is very slow right now because it waits until all network requests are finished and because the posts contain videos. If the scraper runs with disabled caching, these statistics will show at the end of the run:

![Bad run stats](/assets/images/bad-scraper-stats-b38622928fa3b188cae38d285750451e.png)

As you can see, we used 177MB of traffic for 10 posts (that is how many posts are in the top-stories column) and 1 main page. From the screenshot above, it's clear that most of the traffic is coming from script files (124MB) and documents (22.8MB). For this kind of situation, it's always good to check if the content of the page is cacheable. You can do that using Chrome's Developer Tools.

## Understanding and reproducing the issue[​](#understanding-and-reproducing-the-issue "Direct link to Understanding and reproducing the issue")

If we go to the CNN website, open up the tools and go to the **Network** tab, we will find an option to disable caching.

![Disabling cache in the Network tab](/assets/images/cnn-network-tab-0ca18e39872e758ab7f60f2cd601e0f1.png)

Once caching is disabled, we can take a look at how much data is transferred when we open the page. This is visible at the bottom of the developer tools.

![5.3MB of data transferred](/assets/images/slow-no-cache-0681379c53774a230ff67f2ec4704f7c.png)

If we uncheck the disable-cache checkbox and refresh the page, we will see how much data we can save by caching responses.

![642KB of data transferred](/assets/images/fast-with-cache-1a683d4e3a74468186b8d004c5fba276.png)

By comparison, the data transfer appears to be reduced by 88%!

## Solving the problem by creating an in-memory cache[​](#solving-the-problem-by-creating-an-in-memory-cache "Direct link to Solving the problem by creating an in-memory cache")

We can now emulate this and cache responses in Puppeteer. All we have to do is to check, when the response is received, whether it contains the **cache-control** header, and whether it's set with a **max-age** higher than **0**. If so, then we'll save the headers, URL, and body of the response to memory, and on the next request check if the requested URL is already stored in the cache.
The code will look like this:

```
// On top of your code
const cache = {};

// The code below should go between the newPage function and the goto function

await page.setRequestInterception(true);

page.on('request', async (request) => {
    const url = request.url();
    if (cache[url] && cache[url].expires > Date.now()) {
        await request.respond(cache[url]);
        return;
    }
    request.continue();
});

page.on('response', async (response) => {
    const url = response.url();
    const headers = response.headers();
    const cacheControl = headers['cache-control'] || '';
    const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
    const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;

    if (maxAge) {
        if (cache[url] && cache[url].expires > Date.now()) return;

        let buffer;
        try {
            buffer = await response.buffer();
        } catch (error) {
            // some responses do not contain a buffer and do not need to be cached
            return;
        }

        cache[url] = {
            status: response.status(),
            headers: response.headers(),
            body: buffer,
            expires: Date.now() + (maxAge * 1000),
        };
    }
});
```

> If the code above looks completely foreign to you, we recommend going through our free [Puppeteer/Playwright course](https://docs.apify.com/academy/puppeteer-playwright.md).

After implementing this code, we can run the scraper again.

![Good run results](/assets/images/good-run-results-38dc359a0a3b4cdf6b7611255218d234.png)

Looking at the statistics, caching responses in Puppeteer brought the traffic down from 177MB to 13.4MB, which is a reduction of data transfer by 92%. The related screenshots can be found [here](https://my.apify.com/storage/key-value/iWQ3mQE2XsLA2eErL).

It did not speed up the crawler, but that is only because the crawler is set to wait until the network is nearly idle, and CNN has a lot of tracking and analytics scripts that keep the network busy.
## Implementation in Crawlee[​](#implementation-in-crawlee "Direct link to Implementation in Crawlee") Since most of you are likely using [Crawlee](https://crawlee.dev), here is what response caching would look like using `PuppeteerCrawler`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IERhdGFzZXQsIFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBjYWNoZSA9IHt9O1xcblxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUHVwcGV0ZWVyQ3Jhd2xlcih7XFxuICAgIHByZU5hdmlnYXRpb25Ib29rczogW2FzeW5jICh7IHBhZ2UgfSkgPT4ge1xcbiAgICAgICAgYXdhaXQgcGFnZS5zZXRSZXF1ZXN0SW50ZXJjZXB0aW9uKHRydWUpO1xcblxcbiAgICAgICAgcGFnZS5vbigncmVxdWVzdCcsIGFzeW5jIChyZXF1ZXN0KSA9PiB7XFxuICAgICAgICAgICAgY29uc3QgdXJsID0gcmVxdWVzdC51cmwoKTtcXG4gICAgICAgICAgICBpZiAoY2FjaGVbdXJsXSAmJiBjYWNoZVt1cmxdLmV4cGlyZXMgPiBEYXRlLm5vdygpKSB7XFxuICAgICAgICAgICAgICAgIGF3YWl0IHJlcXVlc3QucmVzcG9uZChjYWNoZVt1cmxdKTtcXG4gICAgICAgICAgICAgICAgcmV0dXJuO1xcbiAgICAgICAgICAgIH1cXG4gICAgICAgICAgICByZXF1ZXN0LmNvbnRpbnVlKCk7XFxuICAgICAgICB9KTtcXG5cXG4gICAgICAgIHBhZ2Uub24oJ3Jlc3BvbnNlJywgYXN5bmMgKHJlc3BvbnNlKSA9PiB7XFxuICAgICAgICAgICAgY29uc3QgdXJsID0gcmVzcG9uc2UudXJsKCk7XFxuICAgICAgICAgICAgY29uc3QgaGVhZGVycyA9IHJlc3BvbnNlLmhlYWRlcnMoKTtcXG4gICAgICAgICAgICBjb25zdCBjYWNoZUNvbnRyb2wgPSBoZWFkZXJzWydjYWNoZS1jb250cm9sJ10gfHwgJyc7XFxuICAgICAgICAgICAgY29uc3QgbWF4QWdlTWF0Y2ggPSBjYWNoZUNvbnRyb2wubWF0Y2goL21heC1hZ2U9KFxcXFxkKykvKTtcXG4gICAgICAgICAgICBjb25zdCBtYXhBZ2UgPSBtYXhBZ2VNYXRjaCAmJiBtYXhBZ2VNYXRjaC5sZW5ndGggPiAxID8gcGFyc2VJbnQobWF4QWdlTWF0Y2hbMV0sIDEwKSA6IDA7XFxuXFxuICAgICAgICAgICAgaWYgKG1heEFnZSkge1xcbiAgICAgICAgICAgICAgICBpZiAoIWNhY2hlW3VybF0gfHwgY2FjaGVbdXJsXS5leHBpcmVzID4gRGF0ZS5ub3coKSkgcmV0dXJuO1xcblxcbiAgICAgICAgICAgICAgICBsZXQgYnVmZmVyO1xcbiAgICAgICAgICAgICAgICB0cnkge1xcbiAgICAgICAgICAgICAgICAgICAgYnVmZmVyID0gYXdhaXQgcmVzcG9uc2UuYnVmZmVyKCk7XFxuICAgICAgICAgICAgICAgIH0gY2F0Y2gge1xcbiAgICAgICAgICAgICAgICAgICAgLy8gc29tZSByZXNwb25zZXMgZG8gbm90IGNvbnRhaW4gYnVmZmVyIGFuZCBkbyBub3QgbmVlZCB0byBiZSBjYWNoZWRcXG4gICAgICAgICAgICAgICAgICAgIHJldHVybjtcXG4gICAgICAgICAgICAgICAgfVxcblxcbiAgICAgICAgICAgICAgICBjYWNoZVt1cmxdID0ge1xcbiAgICAgICAgICAgICAgICAgICAgc3RhdHVzOiByZXNwb25zZS5zdGF0dXMoKSxcXG4gICAgICAgICAgICAgICAgICAgIGhlYWRlcnM6IHJlc3BvbnNlLmhlYWRlcnMoKSxcXG4gICAgICAgICAgICAgICAgICAgIGJvZHk6IGJ1ZmZlcixcXG4gICAgICAgICAgICAgICAgICAgIGV4cGlyZXM6IERhdGUubm93KCkgKyBtYXhBZ2UgKiAxMDAwLFxcbiAgICAgICAgICAgICAgICB9O1xcbiAgICAgICAgICAgIH1cXG4gICAgICAgIH0pO1xcbiAgICB9XSxcXG4gICAgcmVxdWVzdEhhbmRsZXI6IGFzeW5jICh7IHBhZ2UsIHJlcXVlc3QgfSkgPT4ge1xcbiAgICAgICAgYXdhaXQgRGF0YXNldC5wdXNoRGF0YSh7XFxuICAgICAgICAgICAgdGl0bGU6IGF3YWl0IHBhZ2UudGl0bGUoKSxcXG4gICAgICAgICAgICB1cmw6IHJlcXVlc3QudXJsLFxcbiAgICAgICAgICAgIHN1Y2NlZWRlZDogdHJ1ZSxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9hcGlmeS5jb20vc3RvcmUnLCAnaHR0cHM6Ly9hcGlmeS5jb20nXSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.JN2lYfrYhuU1Kj6T5Ya9YEuVQboRB4s5BbGj-WHjpVw\&asrc=run_on_apify) ``` import { Dataset, PuppeteerCrawler } from 'crawlee'; const cache = {}; const crawler = new PuppeteerCrawler({ preNavigationHooks: [async ({ page }) => { await page.setRequestInterception(true); page.on('request', async (request) => { const url = request.url(); if (cache[url] && cache[url].expires > Date.now()) { await request.respond(cache[url]); return; } request.continue(); }); page.on('response', async (response) => { 
            const url = response.url();
            const headers = response.headers();
            const cacheControl = headers['cache-control'] || '';
            const maxAgeMatch = cacheControl.match(/max-age=(\d+)/);
            const maxAge = maxAgeMatch && maxAgeMatch.length > 1 ? parseInt(maxAgeMatch[1], 10) : 0;

            if (maxAge) {
                if (cache[url] && cache[url].expires > Date.now()) return;

                let buffer;
                try {
                    buffer = await response.buffer();
                } catch {
                    // some responses do not contain a buffer and do not need to be cached
                    return;
                }

                cache[url] = {
                    status: response.status(),
                    headers: response.headers(),
                    body: buffer,
                    expires: Date.now() + maxAge * 1000,
                };
            }
        });
    }],
    requestHandler: async ({ page, request }) => {
        await Dataset.pushData({
            title: await page.title(),
            url: request.url,
            succeeded: true,
        });
    },
});

await crawler.run(['https://apify.com/store', 'https://apify.com']);
```

---

# How to choose the right scraper for the job

**Learn basic web scraping concepts to help you analyze a website and choose the best scraper for your particular use case.**

***

You can use one of two main ways to proceed with building your crawler:

1. Using plain HTTP requests.
2. Using an automated browser.

We will briefly go through the pros and cons of both, and also cover the basic steps to determine which one you should go with.

## Performance[​](#performance "Direct link to Performance")

First, let's discuss performance. Plain HTTP request-based scraping will **always** be faster than browser-based scraping. When using plain requests, the page's HTML is not rendered, no JavaScript is executed, no images are loaded, etc. Also, there's no memory used by the browser, and there are no CPU-hungry operations.

If it were only a question of performance, you'd of course use request-based scraping every time; however, it's unfortunately not that simple.

## Dynamic pages & blocking[​](#dynamic-pages "Direct link to Dynamic pages & blocking")

Some websites do not load any data without a browser, as they need to execute some scripts to show it (these are known as [dynamic pages](https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md)). Another problem is blocking. If the website collects a [browser fingerprint](https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.md), it can distinguish between a real user and a bot (crawler) and block access.

## Making the choice[​](#making-the-choice "Direct link to Making the choice")

When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick JavaScript Switcher](https://docs.apify.com/academy/tools/quick-javascript-switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using [Postman](https://docs.apify.com/academy/tools/postman.md) or [Insomnia](https://docs.apify.com/academy/tools/insomnia.md), or try to send a few requests programmatically (see the short sketch below).

If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go. It also depends, of course, on whether you need to fill in some data (like a username and password) or select a location (such as entering a zip code manually). Tasks where interacting with the page is absolutely necessary cannot be done using plain HTTP scraping, and require headless browsers.
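To sanity-check the request-based route before committing to it, you can probe the endpoint programmatically. The following is a minimal sketch using Node's built-in `fetch` (Node 18+); the URL and headers are hypothetical placeholders for whatever you actually spot in the **Network** tab.

```
// Minimal sketch: probe a hypothetical JSON endpoint found in DevTools.
const response = await fetch('https://example.com/api/products?page=1', {
    headers: {
        // Mirror the headers the browser sends if the endpoint is picky about them
        accept: 'application/json',
    },
});

if (!response.ok) {
    // A 403 or an HTML CAPTCHA page here suggests you may need a browser or better proxies
    throw new Error(`Endpoint responded with status ${response.status}`);
}

const data = await response.json();
console.log('Sample of the returned data:', JSON.stringify(data).slice(0, 200));
```

If the JSON contains the data you saw on the page, a request-based scraper (for example, Crawlee's `CheerioCrawler` or `HttpCrawler`) is likely sufficient.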
In some cases, you might also decide to use a browser-based solution in order to better blend in with the rest of the "regular" traffic coming from real users. --- # How to scrape from dynamic pages **Learn about dynamic pages and dynamic content. How can we find out if a page is dynamic? How do we programmatically scrape dynamic content?** *** ## A quick experiment[​](#quick-experiment "Direct link to A quick experiment") From our adored and beloved [Fakestore](https://demo-webstore.apify.org/), we have been tasked to scrape each product's title, price, and image from the [new arrivals](https://demo-webstore.apify.org/search/new-arrivals) page. ![New arrival products in Fakestore](/assets/images/new-arrivals-a6b6da0fc639633520351f429b66bf4f.jpg) First, create a file called **dynamic.js** and copy-paste the following boiler plate code into it: ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: async ({ $, request }) => { // We'll put our logic here in a minute }, }); await crawler.addRequests([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]); await crawler.run(); ``` If you're in a brand new project, don't forget to initialize your project, then install the necessary dependencies: ``` # this command will initialize your project # and install the "crawlee" and "cheerio" packages npm init -y && npm i crawlee ``` Now, let's write some data extraction code to extract each product's data. This should look familiar if you went through the [Data Extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) lessons: ``` import { CheerioCrawler } from 'crawlee'; const BASE_URL = 'https://demo-webstore.apify.org'; const crawler = new CheerioCrawler({ requestHandler: async ({ $, request }) => { const products = $('a[href*="/product/"]'); const results = [...products].map((product) => { const elem = $(product); const title = elem.find('h3').text(); const price = elem.find('div[class*="price"]').text(); const image = elem.find('img[src]').attr('src'); return { title, price, image: new URL(image, BASE_URL).href, }; }); console.log(results); }, }); await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]); ``` > Here, we are using the [`Array.prototype.map()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/map) function to loop through all of the product elements and save them into an array we call `results` all at the same time. After running it, you might say, "Great! It works!" **But wait...** What are those results being logged to console? ![Bad results in console](/assets/images/bad-results-f0ad878dbe1965962328c43da45fb920.png) Every single image seems to have the same exact "URL," but they are most definitely not the image URLs we are looking for. This is strange, because in the browser, we were getting URLs that looked like this: ``` https://demo-webstore.apify.org/_next/image?url=https%3A%2F%2Fm.media-amazon.com%2Fimages%2FI%2F81ywGFOb0eL._AC_UL1500_.jpg&w=3840&q=85 ``` The reason this is happening is because CheerioCrawler makes static HTTP requests, so it only manages to capture the content from the `DOMContentLoaded` event. Any elements or attributes generated dynamically thereafter using JavaScript (and usually XHR/Fetch requests) are not part of the downloaded HTML, and therefore are not accessible through the `$` object. What's the solution? 
We need to use something that is able to allow the page to follow through with the entire load process - a headless browser. ## Scraping dynamic content[​](#scraping-dynamic-content "Direct link to Scraping dynamic content") Let's change a few lines of our code to switch the crawler type from CheerioCrawler to PuppeteerCrawler, which will run a headless browser, allowing the `load` and `networkidle` events to fire: > Also, don't forget to run `npm i puppeteer` in order to install the `puppeteer` package! ``` import { PuppeteerCrawler } from 'crawlee'; const BASE_URL = 'https://demo-webstore.apify.org'; // Switch CheerioCrawler to PuppeteerCrawler const crawler = new PuppeteerCrawler({ // Replace "$" with "page" requestHandler: async ({ parseWithCheerio, request }) => { // Create the $ Cheerio object based on the page's content const $ = await parseWithCheerio(); const products = $('a[href*="/product/"]'); const results = [...products].map((product) => { const elem = $(product); const title = elem.find('h3').text(); const price = elem.find('div[class*="price"]').text(); const image = elem.find('img[src]').attr('src'); return { title, price, image: new URL(image, BASE_URL).href, }; }); console.log(results); }, }); await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]); ``` After running this one, we can see that our results look different from before. We're getting the image links! ![Not perfect results](/assets/images/almost-there-689821c3a9b7953bbffa2ef30e67beab.png) Well... Not quite. It seems that the only images which we got the full links to were the ones that were being displayed within the view of the browser. This means that the images are lazy-loaded. **Lazy-loading** is a common technique used across the web to improve performance. Lazy-loaded items allow the user to load content incrementally, as they perform some action. In most cases, including our current one, this action is scrolling. We've gotta scroll down the page to load these images. Luckily, because we're using Crawlee, we don't have to write the logic that will achieve that, because a utility function specifically for Puppeteer called [`infiniteScroll`](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#infiniteScroll) already exists right in the library, and can be accessed through `utils.puppeteer`. 
Let's add it to our code now: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IERhdGFzZXQsIFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5jb25zdCBCQVNFX1VSTCA9ICdodHRwczovL2RlbW8td2Vic3RvcmUuYXBpZnkub3JnJztcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICByZXF1ZXN0SGFuZGxlcjogYXN5bmMgKHsgcGFyc2VXaXRoQ2hlZXJpbywgaW5maW5pdGVTY3JvbGwgfSkgPT4ge1xcbiAgICAgICAgLy8gQWRkIHRoZSB1dGlsaXR5IGZ1bmN0aW9uXFxuICAgICAgICBhd2FpdCBpbmZpbml0ZVNjcm9sbCgpO1xcblxcbiAgICAgICAgY29uc3QgJCA9IGF3YWl0IHBhcnNlV2l0aENoZWVyaW8oKTtcXG5cXG4gICAgICAgIGNvbnN0IHByb2R1Y3RzID0gJCgnYVtocmVmKj1cXFwiL3Byb2R1Y3QvXFxcIl0nKTtcXG5cXG4gICAgICAgIGNvbnN0IHJlc3VsdHMgPSBbLi4ucHJvZHVjdHNdLm1hcCgocHJvZHVjdCkgPT4ge1xcbiAgICAgICAgICAgIGNvbnN0IGVsZW0gPSAkKHByb2R1Y3QpO1xcblxcbiAgICAgICAgICAgIGNvbnN0IHRpdGxlID0gZWxlbS5maW5kKCdoMycpLnRleHQoKTtcXG4gICAgICAgICAgICBjb25zdCBwcmljZSA9IGVsZW0uZmluZCgnZGl2W2NsYXNzKj1cXFwicHJpY2VcXFwiXScpLnRleHQoKTtcXG4gICAgICAgICAgICBjb25zdCBpbWFnZSA9IGVsZW0uZmluZCgnaW1nW3NyY10nKS5hdHRyKCdzcmMnKTtcXG5cXG4gICAgICAgICAgICByZXR1cm4ge1xcbiAgICAgICAgICAgICAgICB0aXRsZSxcXG4gICAgICAgICAgICAgICAgcHJpY2UsXFxuICAgICAgICAgICAgICAgIGltYWdlOiBuZXcgVVJMKGltYWdlLCBCQVNFX1VSTCkuaHJlZixcXG4gICAgICAgICAgICB9O1xcbiAgICAgICAgfSk7XFxuXFxuICAgICAgICAvLyBQdXNoIG91ciByZXN1bHRzIHRvIHRoZSBkYXRhc2V0XFxuICAgICAgICBhd2FpdCBEYXRhc2V0LnB1c2hEYXRhKHJlc3VsdHMpO1xcbiAgICB9LFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIucnVuKFt7IHVybDogJ2h0dHBzOi8vZGVtby13ZWJzdG9yZS5hcGlmeS5vcmcvc2VhcmNoL25ldy1hcnJpdmFscycgfV0pO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.PGZUSPbQL3ooxDjeGftoPaHw-O18NzHba1zVXzq0E6k\&asrc=run_on_apify) ``` import { Dataset, PuppeteerCrawler } from 'crawlee'; const BASE_URL = 'https://demo-webstore.apify.org'; const crawler = new PuppeteerCrawler({ requestHandler: async ({ parseWithCheerio, infiniteScroll }) => { // Add the utility function await infiniteScroll(); const $ = await parseWithCheerio(); const products = $('a[href*="/product/"]'); const results = [...products].map((product) => { const elem = $(product); const title = elem.find('h3').text(); const price = elem.find('div[class*="price"]').text(); const image = elem.find('img[src]').attr('src'); return { title, price, image: new URL(image, BASE_URL).href, }; }); // Push our results to the dataset await Dataset.pushData(results); }, }); await crawler.run([{ url: 'https://demo-webstore.apify.org/search/new-arrivals' }]); ``` Let's run this and check our dataset results... ``` { "title": "women's shoes", "price": "$40.00 USD", "image": "https://demo-webstore.apify.org/_next/image?url=https%3A%2F%2Fdummyjson.com%2Fimage%2Fi%2Fproducts%2F46%2F1.jpg&w=3840&q=85" } ``` Each product looks like this, and each image is a valid link that can be visited. These are the results we were after. ## Small Recap[​](#small-recap "Direct link to Small Recap") Making static HTTP requests only downloads the HTML content from the `DOMContentLoaded` event. We must use a browser to allow dynamic code to load, or find different means altogether of scraping the data (see [API Scraping](https://docs.apify.com/academy/api-scraping.md)) --- A lot of beginners struggle through trial and error while scraping a simple site. They write some code that might work, press the run button, see that error happened and they continue writing more code that might work but probably won't. 
This is extremely inefficient and gets tedious really fast. What beginners are missing are basic tools and tricks to get things done quickly. One of these wow tricks is the option to run the JavaScript code directly in your browser. Pressing F12 while browsing with Chrome, Firefox, or other popular browsers opens up the browser console, the magic toolbox of any web developer. The console allows you to run a code in the context of the website you are in. Don't worry, you cannot mess the site up (well, unless you start doing really nasty tricks) as the page content is downloaded on your computer and any change is only local to your PC. # Running code in a browser console > Test your Page Function's code directly in your browser's console. First, you need to inject jQuery. You can try to paste and run this snippet. ``` const jq = document.createElement('script'); jq.src = 'https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js'; document.getElementsByTagName('head')[0].appendChild(jq); ``` If that doesn't work because of a CORS violation, you can install [this extension](https://chrome.google.com/webstore/detail/ekkjohcjbjcjjifokpingdbdlfekjcgi) that injects jQuery on a button click. You can test a `pageFunction` code in two ways in your console: ## Pasting and running a small code snippet[​](#pasting-and-running-a-small-code-snippet "Direct link to Pasting and running a small code snippet") Usually, you don't need to paste in the whole pageFunction as you can isolate the critical part of the code you are trying to debug. You will need to remove any references to the `context` object and its properties like `request` and the final return statement but otherwise, the code should work 1:1. I will also usually remove `const` declarations on the top level variables. This helps you to run the same code many times over without needing to restart the console (you cannot declare constants more than once). My declaration will change from: ``` const results = []; // Scraping something to fill the results ``` into ``` results = []; ``` You can get all the information you need by running a snippet of your `pageFunction` like this: ``` results = []; $('.my-list-item').each((i, el) => { results.push({ title: $(el).find('.title').text().trim(), // other fields }); }); ``` Now the `results` variable stays on the page and you can do whatever you wish with it. Log it to analyze if your scraping code is correct. Writing a single expression will also log it in a browser console. ``` results; // Will log a nicely formatted [{ title: 'my-article-1'}, { title: 'my-article-2'}] etc. ``` ## Pasting and running a full pageFunction[​](#pasting-and-running-a-full-pagefunction "Direct link to Pasting and running a full pageFunction") If you don't want to deal with copy/pasting a proper snippet, you can always paste the whole pageFunction. You will have to mock the context object when calling it. If you use some advanced tricks, this might not work but in most cases copy pasting this code should do it. This code is only for debugging your Page Function for a particular page. It does not crawl the website and the output is not saved anywhere. 
``` async function pageFunction(context) { // this is your pageFunction } // Now you will call it with mocked context pageFunction({ request: { url: window.location.href, userData: { label: 'paste-a-label-if-you-use-one' }, }, async waitFor(ms) { console.log('(waitFor)'); await new Promise((res) => setTimeout(res, ms)); }, enqueueRequest() { console.log('(enqueuePage)', arguments); }, skipLinks() { console.log('(skipLinks)', arguments); }, jQuery: $, }); ``` Happy debugging! --- # Filter out blocked proxies using sessions *This article explains how the problem was solved before the [SessionPool](https://docs.apify.com/sdk/js/docs/api/session-pool) class was added into [Apify SDK](https://docs.apify.com/sdk/js). We are keeping the article here as it might be interesting for people who want to see how to work with sessions on a lower level. For any practical usage of sessions, follow the documentation and examples of SessionPool.* ### Overview of the problem[​](#overview-of-the-problem "Direct link to Overview of the problem") You want to crawl a website with a proxy pool, but most of your proxies are blocked. It's a very common situation. Proxies can be blocked for many reasons: 1. You overused them in your current Actor run and they got banned. 2. You overused them in some of your previous runs and they are still banned (and may never be unbanned). 3. Some other user with whom you share part of your proxy pool overused them when crawling the same website before you even touched it. 4. The proxies were actually banned before anyone used them to crawl the website because they share a subnetwork in some datacenter and all proxies of that subnet got banned. 5. The proxies actually got banned before anyone used them to crawl the website because they use anti-bot protection that bans proxies across websites (e.g. Cloudflare). Nobody can make sure that a proxy will work infinitely. The only real solution to this problem is to use [residential proxies](https://docs.apify.com/platform/proxy/residential-proxy.md), but they can sometimes be too costly. However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our [SDK](https://docs.apify.com/sdk/js) handles this for you). Check out [this article](https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer) as inspiration for how to handle this situation with `PuppeteerCrawler` class. ### Solution[​](#solution "Direct link to Solution") Now we are able to retry bad requests and eventually unless all of our proxies get banned, we should be able to successfully crawl what we want. The problem is that it takes too long and our log is full of errors. Fortunately, we can overcome this with [proxy sessions](https://docs.apify.com/platform/proxy/datacenter-proxy.md#username-parameters) (look at the proxy and SDK documentation for how to use them in your Actors.) First we define `sessions`  object at the top of our code (in global scope) to hold the state of our working sessions. `let sessions;` Then we need to define an interval that will ensure our sessions are periodically saved to the key-value store, so if the Actor restarts, we can load them. 
``` setInterval(async () => { await Apify.setValue('SESSIONS', sessions); }, 30 * 1000); ``` And inside our main function, we load the sessions the same way we load an input. If they were not saved yet (the Actor was not restarted), we instantiate them as an empty object. ``` Apify.main(async () => { sessions = (await Apify.getValue('SESSIONS')) || {}; // ...the rest of your code }); ``` ### Algorithm[​](#algorithm "Direct link to Algorithm") You don't necessarily need to understand the solution below - it should be fine to copy/paste it to your Actor. `sessions`  will be an object whose keys will be the names of the sessions and values will be objects with the name of the session (we choose a random number as a name here) and user agent (you can add any other useful properties that you want to match with each session.) This will be created automatically, for example: ``` { "0.7870849452667994": { "name": "0.7870849452667994", "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36" }, "0.4787584713044999": { "name": "0.4787584713044999", "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299" } // ... } ``` Now let's get to the algorithm that will define which sessions to pick for a request. It can be done in many ways and this is by no means the ideal way, so I encourage you to find a more intelligent algorithm and paste it into the comments of this article. This function takes `sessions`  as an argument and returns a `session`  object which will either be a random object from `sessions`  or a new one with random user agent. ``` const pickSession = (sessions, maxSessions = 100) => { // sessions is our sessions object, at the beginning instantiated as {} // maxSessions is a constant which should be the number of working proxies we aspire to have. // The lower the number, the faster you will use the working proxies // but the faster the new one will not be picked // 100 is reasonable default // Since sessions is an object, we prepare an array of the session names const sessionsKeys = Object.keys(sessions); console.log(`Currently we have ${sessionsKeys.length} working sessions`); // We define a random floating number from 0 to 1 that will serve // both as a chance to pick the session and its possible name const randomNumber = Math.random(); // The chance to pick a session will be higher when we have more working sessions const chanceToPickSession = sessionsKeys.length / maxSessions; console.log(`Chance to pick a working session is ${Math.round(chanceToPickSession * 100)}%`); // If the chance is higher than the random number, we pick one from the working sessions const willPickSession = chanceToPickSession > randomNumber; if (willPickSession) { // We randomly pick one of the working sessions and return it const indexToPick = Math.floor(sessionsKeys.length * Math.random()); const nameToPick = sessionsKeys[indexToPick]; console.log(`We picked a working session: ${nameToPick} on index ${indexToPick}`); return sessions[nameToPick]; } // We create a new session object, assign a random userAgent to it and return it console.log(`Creating new session: ${randomNumber}`); return { name: randomNumber.toString(), userAgent: Apify.utils.getRandomUserAgent(), }; }; ``` ### Puppeteer example[​](#puppeteer-example "Direct link to Puppeteer example") We then use this function whenever we want to get the session for our request. 
Here is an example of how we would use it for bare bones Puppeteer (for example as a part of `BasicCrawler` class). ``` const session = pickSession(sessions); const browser = await Apify.launchPuppeteer({ useApifyProxy: true, apifyProxySession: session.name, userAgent: session.userAgent, }); ``` Then we only need to add the session if the request was successful or remove it if it was not. It doesn't matter if we add the same session twice or delete a non-existent session (because of how JavaScript objects work). After success: `sessions[session.name] = session;` After failure (captcha, blocked request, etc.): `delete sessions[session.name]` ### PuppeteerCrawler example[​](#puppeteercrawler-example "Direct link to PuppeteerCrawler example") Now you might start to wonder, "I have already prepared an Actor using PuppeteerCrawler, can I make it work there?". The problem is that with PuppeteerCrawler we don't have everything nicely inside one function scope like when using pure Puppeteer or BasicCrawler. Fortunately, there is a little hack that enables passing the session name to where we need it. First we define `lauchPuppeteerFunction` which tells the crawler how to create new browser instances and we pass the picked session there. ``` const crawler = new Apify.PuppeteerCrawler({ launchPuppeteerFunction: async () => { const session = pickSession(sessions); return Apify.launchPuppeteer({ useApifyProxy: true, userAgent: `${session.userAgent} s=${session.name}`, apifyProxySession: session.name, }); }, // handlePageFunction etc. }); ``` We picked the session and added it to the browser as `apifyProxySession` but for userAgent, we didn't pass the User-Agent as it is but added the session name into it. That is the hack because we can retrieve the user agent from the Puppeteer browser itself. Now we need to retrieve the session name back in the `gotoFunction`, pass it into userData and fix the hacked userAgent back to normal so it is not suspicious for the website. ``` const gotoFunction = async ({ request, page }) => { const userAgentWithSession = await page.browser().userAgent(); const match = userAgentWithSession.match(/(.+) s=(.+)/); const session = { name: match[2], userAgent: match[1], }; request.userData.session = session; await page.setUserAgent(session.userAgent); return page.goto(request.url, { timeout: 60000 }); }; ``` Now we have access to the session in the `handlePageFunction` and the rest of the logic is the same as in the first example. We extract the session from the userData, try/catch the whole code and on success we add the session and on error we delete it. Also it is useful to retire the browser completely (check [here](https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer) for reference) since the other requests will probably have similar problem. ``` const handlePageFunction = async ({ request, page, puppeteerPool }) => { const { session } = request.userData; console.log(`URL: ${request.url}, session: ${session.name}, userAgent: ${session.userAgent}`); try { // your main logic that is executed on each page sessions[session.name] = session; } catch (e) { delete sessions[session.name]; await puppeteerPool.retire(page.browser()); throw e; } }; ``` Things to consider 1. Since the good and bad proxies are getting filtered over time, this solution only makes sense for crawlers with at least hundreds of requests. 2. This solution will not help you if you don't have enough proxies for your job. 
It can even get your proxies banned faster (since the good ones will be used more often), so you should be cautious about the speed of your crawl. 3. If you are more concerned about the speed of your crawler and less about banning proxies, set the `maxSessions` parameter of `pickSession` function to a number relatively lower than your total number of proxies. If on the other hand, keeping your proxies alive is more important, set `maxSessions`  relatively higher so you will always pick new proxies. 4. Since sessions only last 24 hours, if you have bigger intervals between your crawler runs, they will start fresh each time. --- One of the main defense mechanisms websites use to ensure they are not scraped by bots is allowing only a limited number of requests from a specific IP address. That's why Apify provides a [proxy](https://docs.apify.com/platform/proxy) component with intelligent rotation. With a large enough pool of proxies, you can multiply the number of allowed requests per day to cover your crawling needs. Let's look at how we can rotate proxies when using our [JavaScript SDK](https://github.com/apify/apify-sdk-js). # BasicCrawler > Getting around website defense mechanisms when crawling. You can use `handleRequestFunction` to set up proxy rotation for a [BasicCrawler](https://crawlee.dev/api/basic-crawler/class/BasicCrawler). The following example shows how to use a fresh proxy on each request if you make requests through the popular [request-promise](https://www.npmjs.com/package/request-promise) npm package: ``` const Apify = require('apify'); const requestPromise = require('request-promise'); const PROXY_PASSWORD = process.env.APIFY_PROXY_PASSWORD; const proxyUrl = `http://auto:${PROXY_PASSWORD}@proxy.apify.com`; const crawler = new Apify.BasicCrawler({ requestList: someInitializedRequestList, handleRequestFunction: async ({ request }) => { const response = await requestPromise({ url: request.url, proxy: proxyUrl, }); }, }); ``` Each time `handleRequestFunction` is executed in this example, requestPromise will send a request through the least used proxy for that target domain. This way you will not burn through your proxies. # Puppeteer Crawler With [PuppeteerCrawler](https://docs.apify.com/sdk/js/docs/api/puppeteer-crawler) the situation is a little more complicated. That's because you have to restart the browser to change the proxy the browser is using. By default, PuppeteerCrawler restarts the browser every 100 requests, which can lead to a number of requests being wasted because the IP address the browser is using is already blocked by the website. The straightforward solution would be to set the 'retireInstanceAfterRequestCount' option to 1. PuppeteerCrawler would then rotate the proxies in the same way as BasicCrawler. While this approach could sometimes be useful for the toughest websites, the price you pay is in performance. Restarting the browser is an expensive operation. That's why PuppeteerCrawler offers a utility retire() function through a PuppeteerPool class. You can access PuppeteerPool by passing it into the object parameter of gotoFunction or handlePageFunction. ``` const crawler = new PuppeteerCrawler({ requestList: someInitializedRequestList, launchPuppeteerOptions: { useApifyProxy: true, }, handlePageFunction: async ({ request, page, puppeteerPool }) => { // you are on the page now }, }); ``` It is really up to a developer to spot if something is wrong with his request. 
A website can interfere with your crawling in [many ways](https://docs.apify.com/academy/anti-scraping). Page loading can be cancelled right away, it can time out, the page can display a captcha, some error or warning message, or the data may be missing or corrupted. The developer can then choose whether to handle these problems in the code or focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error.

Now that we know when the request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code a Google search crawler. The two main blocking mechanisms Google uses are to display its (in)famous 'sorry' captcha or to not load the page at all, so we will focus on covering these.

For example, let's assume we have already initialized a requestList of Google search pages. Let's show how you can use the retire() function in both gotoFunction and handlePageFunction.

```
const crawler = new Apify.PuppeteerCrawler({
    requestList: someInitializedRequestList,
    launchPuppeteerOptions: {
        useApifyProxy: true,
    },
    gotoFunction: async ({ request, page, puppeteerPool }) => {
        const response = await page.goto(request.url).catch(() => null);
        if (!response) {
            await puppeteerPool.retire(page.browser());
            throw new Error(`Page didn't load for ${request.url}`);
        }
        return response;
    },
    handlePageFunction: async ({ request, page, puppeteerPool }) => {
        if (page.url().includes('sorry')) {
            await puppeteerPool.retire(page.browser());
            throw new Error(`We got captcha for ${request.url}`);
        }
    },
    retireInstanceAfterRequestCount: 50,
});

Apify.main(async () => {
    await crawler.run();
});
```

Now we have a crawler that catches the most common blocking issues on Google. In `gotoFunction` we catch the case when the page doesn't load, and in the `handlePageFunction` we check whether we were redirected to the 'sorry' page. In both cases we throw an error afterwards so the request is added back to the crawling queue (otherwise the crawler would think everything was okay and would treat that request as handled).

---

# How to fix 'Target closed' error in Puppeteer and Playwright

**Learn about common causes for the 'Target closed' error in browser automation and what you can do to fix it.**

***

The `Target closed` error happens when you try to access the `page` object (or some of its parent objects like the `browser`), but the underlying browser tab has already been closed. The exact error message can appear in several variants, such as `Target page, context or browser has been closed`, but none of them are very helpful for debugging. To debug it, attach logs in multiple places or use the headful mode.

## Out of memory[​](#out-of-memory "Direct link to Out of memory")

![Chrome crashed tab](/assets/images/chrome-crashed-tab-b7f5310d7661df3872ca9c294b3b28a5.png)

Browsers create a separate process for each tab. That means each tab lives in a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot close your old tabs to free extra memory, so it will usually kill your current memory-hungry tab.

### Memory solution[​](#memory-solution "Direct link to Memory solution")

If you use [Crawlee](https://crawlee.dev/), your concurrency automatically scales up and down to fit in the allocated memory. You can change the allocated memory using the environment variable or the [Configuration](https://crawlee.dev/docs/guides/configuration) class.
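As a rough illustration of that second option, here is a minimal sketch of raising the memory limit in Crawlee. It assumes a recent Crawlee version, and the `4096` value is only an example; on the Apify platform, the limit normally comes from the run's memory setting.

```
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Alternatively, set the limit through an environment variable before starting the process:
// CRAWLEE_MEMORY_MBYTES=4096 node main.js

const crawler = new PlaywrightCrawler(
    {
        requestHandler: async ({ page, request }) => {
            console.log(`Processing ${request.url}: ${await page.title()}`);
        },
    },
    // Pass a custom Configuration so autoscaling targets 4 GB of memory
    new Configuration({ memoryMbytes: 4096 }),
);

await crawler.run(['https://example.com']);
```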
But very hungry pages can still occasionally cause sudden memory spikes, and you might have to limit the [maxConcurrency](https://crawlee.dev/docs/guides/scaling-crawlers#minconcurrency-and-maxconcurrency) of the crawler. This problem is very rare, though. Without Crawlee, you will need to predict the maximum concurrency the particular use case can handle or increase the allocated memory. ## Page closed prematurely[​](#page-closed-prematurely "Direct link to Page closed prematurely") If you close the page before executing all code that tries to access the page, you will get the 'Target closed' error. The most common cause is that your crawler doesn't properly wait for all actions and instead closes the page earlier than it should. Usually, this is caused by forgotten `await` keyword (floating promise), using event handlers like `page.on` or having wrongly ordered crawling loop. ### Page closed solution[​](#page-closed-solution "Direct link to Page closed solution") [Add logs to your code](https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors) to see exactly at which point the crash occurs. See if you can spot one of the above mentioned problems. Adding missing `await` is simple but if your code runs in an event handler, you will need to wrap it in try/catch block and ensure that you give it enough time to execute before you close the main crawling handler. If you use Crawlee and utilize [preNavigationHooks](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks) to execute event handlers like `page.on` asynchronously be aware that this can cause the above mentioned problem that the [requestHandler](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#requestHandler) already finishes before we access the `page` in the event handler. You can solve this issue by making sure the `requestHandler` waits for all promises from the `preNavigationHooks`. This can be achieved by passing the promises to the `context` which is accessible to both functions and awaiting them before the scraping code starts. ``` const crawler = new PlaywrightCrawler({ // ...other options preNavigationHooks: [ async ({ page, context }) => { // Some action that takes time, we don't await here // Try/catch all non awaited code because it can cause unhandled rejection which crashes the whole process const responsePromise = page.waitForResponse('https://example.com/resource').catch((e) => e); // Attach the promise to the context which is accessible to requestHandler context.responsePromise = responsePromise; }, ], requestHandler: async ({ request, page, context }) => { // We first wait for the response before doing anything else const response = await context.responsePromise; // Check if it errored out, otherwise proceed with parsing it if (typeof response === 'string' || response instanceof Error) { throw new Error(`Failed to load resource from response`, { cause: response }); } // Now process the response and continue with the code synchronously }, }); ``` If you are still unsure what causes your particular error, check with the community and Apify team on [Discord](https://discord.com/invite/jyEM2PRvMU). --- # How to save screenshots from puppeteer A good way to debug your puppeteer crawler in Apify Actors is to save a screenshot of a browser window to the Apify key-value store. 
You can do that using this function:

```
/**
 * Store a screenshot from a Puppeteer page to the Apify key-value store
 * @param page - Instance of puppeteer Page class https://pptr.dev/api/puppeteer.page
 * @param [key] - Function stores your screenshot in the Apify key-value store under this key
 * @return {Promise}
 */
const saveScreen = async (page, key = 'debug-screen') => {
    const screenshotBuffer = await page.screenshot({ fullPage: true });
    await Apify.setValue(key, screenshotBuffer, { contentType: 'image/png' });
};
```

This function takes the parameters page (an instance of a puppeteer page) and key (your screenshot is stored under this key in the Apify key-value store).

Because this is such a common use case, the Apify SDK has a utility function called [saveSnapshot](https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteersavesnapshot) that does exactly this and a little bit more:

* You can choose the quality of your screenshots (high-quality images take up more space)
* You can also save the HTML of the page

An example of such an Apify Actor:

```
import { Actor } from 'apify';
import { puppeteerUtils, launchPuppeteer } from 'crawlee';

Actor.main(async () => {
    const input = await Actor.getValue('INPUT');

    console.log('Launching Puppeteer...');
    const browser = await launchPuppeteer();
    const page = await browser.newPage();
    await page.goto(input.url);
    await puppeteerUtils.saveSnapshot(page, { key: 'test-screen' });

    console.log('Closing Puppeteer...');
    await browser.close();

    console.log('Done.');
});
```

After you call the function, your screenshot appears in the KEY-VALUE STORE tab in the Actor console. You can click on the row with your saved screenshot and it will open in a new window.

![Puppeteer Key-Value store](/assets/images/kv-store-puppeteer-35b752a254c5d7f34d23bea8d97bb3dc.png)

If you have any questions, feel free to contact us in chat.

Happy coding!

---

# How to scrape hidden JavaScript objects in HTML

**Learn about "hidden" data found within the JavaScript of certain pages, which can increase the scraper reliability and improve your development experience.**

***

Depending on the technology the target website is using, the data to be collected can be found not only within HTML elements, but also in a JSON format within `<script>` tags in the DOM. The advantage of using these objects instead of parsing the HTML is that parsing JSON is much simpler and more reliable than parsing HTML elements. They are much less likely to change, while CSS selectors are prone to updates and re-namings every time the website is updated.

> **Note:** In this tutorial, we'll be using [SoundCloud's website](https://soundcloud.com) as an example target, but the techniques described here can be applied to any site.

## Locating JSON objects within script tags[​](#locating-json-in-html "Direct link to Locating JSON objects within script tags")

Using our DevTools, we can inspect our [target page](https://soundcloud.com/tiesto/tracks), or right click the page and click **View Page Source** to see the DOM. Next, we'll find a value on the page that we can predict would be in a potential API response. For our page, we'll use the **Tracks** count of `845`.

On the **View Page Source** page, we'll do **⌘** + **F** and type in this value, which will show all matches for it within the DOM. This method can expose `<script>` tag objects which hold the target data.
![Find the value within the DOM using CMD + F](/assets/images/view-845-77582d897496190ac1b44e2eb4364273.png) These data objects will usually be attached to the window object (often prefixed with two underscores - `__`). When scrolling to the beginning of the script tag on our **View Page Source** page, we see that the name of our target object is `__sc_hydration`. Heading back to DevTools and typing this into the console, the object is displayed. ![View the target data in the window object using the console in DevTools](/assets/images/view-object-in-window-b9e1031f84b636d9038ecf8a4f6b394d.png) ## Parsing[​](#parsing-objects "Direct link to Parsing") You can obtain these objects to be used and manipulated in JavaScript in two ways: ### 1. Parsing them directly from the HTML[​](#1-parsing-them-directly-from-the-html "Direct link to 1. Parsing them directly from the HTML") ``` // same as "document.querySelector('html').innerHTML" const html = $.html(); const string = html.split('window.__sc_hydration = ')[1].split(';')[0]; const data = JSON.parse(string); console.log(data); ``` ### 2. Retrieving them within the context of the browser[​](#2-retrieving-them-within-the-context-of-the-browser "Direct link to 2. Retrieving them within the context of the browser") Tools like [Puppeteer](https://github.com/puppeteer/puppeteer) allow us to run code within the context in the browser, as well as return things out of these functions and use the data back in the Node.js context. ``` const data = await page.evaluate(() => window.__sc_hydration); console.log(data); ``` Which of these methods you use totally depends on the type of crawler you are using. Grabbing the data directly from the `window` object within the context of the browser using Puppeteer is of course the most reliable solution; however, it is less efficient than making a static HTTP request and parsing the object directly from the downloaded HTML. --- # Scrape website in parallel with multiple Actor runs **Learn how to run multiple instances of an Actor to scrape a website faster. This tutorial will guide you through the process of setting up your scraper.** *** Imagine a large website that you need to scrape. You have a scraper that works well, but scraping the whole website is slow. You can speed up the scraping process by running multiple instances of the scraper in parallel. This tutorial will guide you through setting up your scraper to run multiple instances in parallel. In a rush? You can check [full code example](https://github.com/apify/apify-docs/tree/master/examples/ts-parallel-scraping) right away. ## Managing Multiple Scraper Runs[​](#managing-multiple-scraper-runs "Direct link to Managing Multiple Scraper Runs") To manage multiple instances of the scraper, we need to build an Orchestrator Actor to oversee the process. This Orchestrator Actor will initiate several scraper runs and manage their operations. It will set up a request queue and a dataset that the other Actor runs will utilize to crawl the website and store results. In this tutorial, we set up the Orchestrator Actor and the scraper Actor. ## Orchestrator Actor Configuration[​](#orchestrator-actor-configuration "Direct link to Orchestrator Actor Configuration") The Orchestrator Actor orchestrates the parallel execution of scraper Actor runs. It runs multiple instances of the scraper Actor and passes the request queue and dataset to them. 
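At a high level, the orchestration boils down to a few Apify client calls: open the shared storages, start the configured number of scraper runs pointing at those storages, and wait for all of them to finish. The following condensed sketch is only illustrative (the Actor ID and the run count are placeholders); the full implementation with input handling, state persistence, and abort handling is built step by step in the sections below.

```
import { Actor } from 'apify';

await Actor.init();

const { apifyClient } = Actor;

// Default storages of the Orchestrator Actor, shared by every scraper run
const requestQueue = await Actor.openRequestQueue();
const dataset = await Actor.openDataset();

// Start a few parallel runs of the scraper Actor ('my-scraper-actor-id' is a placeholder)
const runIds = [];
for (let i = 0; i < 3; i++) {
    const run = await apifyClient.actor('my-scraper-actor-id').start({
        requestQueueId: requestQueue.id,
        datasetId: dataset.id,
    });
    runIds.push(run.id);
}

// Block until every parallel run finishes
await Promise.all(runIds.map((runId) => apifyClient.run(runId).waitForFinish()));

await Actor.exit();
```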
For the Actor's base structure, we use the Apify CLI and create a new Actor with the following command, using the [Empty TypeScript Actor template](https://apify.com/templates/ts-empty).

```
apify create orchestrator-actor
```

If you don't have the Apify CLI installed, check out our installation [instructions](https://docs.apify.com/cli/docs/installation). ### Input Configuration[​](#input-configuration "Direct link to Input Configuration") Let's start by defining the Input Schema for the Orchestrator Actor. The input for the Actor will specify configurations needed to initiate and manage multiple scraper Actors in parallel. Here’s the breakdown of the necessary input:

* input\_schema.json
* main.ts

```
{
    "title": "Orchestrator Actor Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "parallelRunsCount": {
            "title": "Parallel Actor runs count",
            "type": "integer",
            "description": "Number of parallel runs of the Actor.",
            "default": 1
        },
        "targetActorId": {
            "title": "Actor ID",
            "type": "string",
            "editor": "textfield",
            "description": "ID of the Actor to run."
        },
        "targetActorInput": {
            "title": "Actor Input",
            "type": "object",
            "description": "Input of the Actor to run",
            "editor": "json",
            "prefill": {}
        },
        "targetActorRunOptions": {
            "title": "Actor Run Options",
            "type": "object",
            "description": "Options for the Actor run",
            "editor": "json",
            "prefill": {}
        }
    },
    "required": ["parallelRunsCount", "targetActorId"]
}
```

```
import { Actor, log } from 'apify';

interface Input {
    parallelRunsCount: number;
    targetActorId: string;
    targetActorInput: Record<string, unknown>;
    targetActorRunOptions: Record<string, unknown>;
}

await Actor.init();

const {
    parallelRunsCount = 1,
    targetActorId,
    targetActorInput = {},
    targetActorRunOptions = {},
} = await Actor.getInput<Input>() ?? {} as Input;

const { apifyClient } = Actor;

if (!targetActorId) throw new Error('Missing the "targetActorId" input!');
```

### Reusing dataset and request queue[​](#reusing-dataset-and-request-queue "Direct link to Reusing dataset and request queue") The Orchestrator Actor will reuse its default dataset and request queue. The dataset stores the results of the scraping process, and the request queue is used as shared storage for processing requests.

```
import { Actor } from 'apify';

const requestQueue = await Actor.openRequestQueue();
const dataset = await Actor.openDataset();
```

### State[​](#state "Direct link to State") The Orchestrator Actor will maintain the state of the scraping runs to track progress and manage continuity. It will record the state of Actor runs, initializing this tracking with the first run. This persistent state ensures that, in migration or restart (resurrection) cases, the Actor can resume the same runs without losing progress.

```
import { Actor, log } from 'apify';

const { apifyClient } = Actor;

const state = await Actor.useState('actor-state', { parallelRunIds: [], isInitialized: false });

if (state.isInitialized) {
    for (const runId of state.parallelRunIds) {
        const runClient = apifyClient.run(runId);
        const run = await runClient.get();

        // This should happen if the run was deleted or the state was incorrectly saved.
        if (!run) throw new Error(`The run ${runId} from state does not exist.`);

        if (run.status === 'RUNNING') {
            log.info('Parallel run is already running.', { runId });
        } else {
            log.info(`Parallel run was in state ${run.status}, resurrecting.`, { runId });
            await runClient.resurrect(targetActorRunOptions);
        }
    }
} else {
    // Start the parallel runs and pass them the shared request queue and dataset.
    for (let i = 0; i < parallelRunsCount; i++) {
        const run = await apifyClient.actor(targetActorId).start({
            ...targetActorInput,
            requestQueueId: requestQueue.id,
            datasetId: dataset.id,
        }, targetActorRunOptions);
        log.info('Started parallel run.', { runId: run.id });
        state.parallelRunIds.push(run.id);
    }
    state.isInitialized = true;
}

// Get a promise that resolves when each parallel run finishes
const parallelRunPromises = state.parallelRunIds.map((runId) => {
    const runClient = apifyClient.run(runId);
    return runClient.waitForFinish();
});

// Abort the parallel runs if the main run is aborted
Actor.on('aborting', async () => {
    for (const runId of state.parallelRunIds) {
        log.info('Aborting run', { runId });
        await apifyClient.run(runId).abort();
    }
});

// Wait for all parallel runs to finish
await Promise.all(parallelRunPromises);

// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit()
await Actor.exit();
```

### Pushing to Apify[​](#pushing-to-apify "Direct link to Pushing to Apify") Once you have the Orchestrator Actor ready, you can push it to Apify using the following command from the root directory of the Actor project:

```
apify push
```

> **First log in:** If you are pushing the Actor for the first time, you will need to [log in to your Apify account](https://docs.apify.com/cli/docs/reference#apify-login).

By running this command, you will be prompted to provide the Actor ID, which you can find in the Apify Console under the Actors tab. ![orchestrator-actor.png](/assets/images/orchestrator-actor-7a722f44faddf4f5e3a8439acb4baea0.png) ## Scraper Actor Configuration[​](#scraper-actor-configuration "Direct link to Scraper Actor Configuration") The Scraper Actor performs website scraping. It operates using the request queue and dataset provided by the Orchestrator Actor. You will need to integrate your chosen scraper logic into this framework. The only thing you need to do is utilize the request queue and dataset initialized by the Orchestrator Actor.

```
import { Actor } from 'apify';

interface Input {
    requestQueueId: string;
    datasetId: string;
}

const {
    requestQueueId,
    datasetId,
} = await Actor.getInput<Input>() ?? {} as Input;

const requestQueue = await Actor.openRequestQueue(requestQueueId);
const dataset = await Actor.openDataset(datasetId);
```

Once you have initialized the request queue and dataset, you can start scraping the website. In this example, we will use the CheerioCrawler to scrape [an example e-commerce website](https://warehouse-theme-metal.myshopify.com/). You can create your scraper from the [Crawlee + Cheerio TypeScript Actor template](https://apify.com/templates/ts-crawlee-cheerio).

* input\_schema.json
* main.ts

```
{
    "title": "Scraper Actor Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "requestQueueId": {
            "title": "Request Queue ID",
            "type": "string",
            "editor": "textfield",
            "description": "Request queue to use in scraper."
        },
        "datasetId": {
            "title": "Dataset ID",
            "type": "string",
            "editor": "textfield",
            "description": "Dataset to use in scraper."
} }, "required": ["requestQueueId", "datasetId"] } ``` ``` import{Actor}from'apify';import{CheerioCrawler}from'crawlee';await Actor.init();const{requestQueueId,datasetId}=(await Actor.getInput())??{};const requestQueue=await Actor.openRequestQueue(requestQueueId);const dataset=await Actor.openDataset(datasetId);const proxyConfiguration=await Actor.createProxyConfiguration();const crawler=new CheerioCrawler({proxyConfiguration,requestQueue,requestHandler:async({enqueueLinks,request,$,log})=>{log.info('Processing page',{url:request.url});const newPages=await enqueueLinks({selector:'a[href]'});log.info(`Enqueued ${newPages.processedRequests.length} new pages.`);// If the product page is loaded, save the title and URL to the Dataset. if(request?.loadedUrl?.includes('/products/')){const title=$('title').text();await dataset.pushData({url:request.loadedUrl,title});}}});await crawler.run(['https://warehouse-theme-metal.myshopify.com/']);// Gracefully exit the Actor process. It's recommended to quit all Actors with an exit() await Actor.exit(); ``` You can check [full code example](https://github.com/apify/apify-docs/tree/master/examples/ts-parallel-scraping/scraper). You need to push the Scraper Actor to Apify using the following command from the root directory of the Actor project: ``` apify push ``` After pushing the Scraper Actor to Apify, you must get the Actor ID from the Apify Console. ![scraper-actor.png](/assets/images/scraper-actor-1a5be4b501a30d12e2b13ab56d8f6a05.png) ## Run orchestration in Apify Console[​](#run-orchestration-in-apify-console "Direct link to Run orchestration in Apify Console") Once you have the Orchestrator Actor and Scraper Actor pushed to Apify, you can run the Orchestrator Actor in the Apify Console. You can set the input for the Orchestrator Actor to specify the number of parallel runs and the target Actor ID, input, and run options. After you hit the **Start** button, the Orchestrator Actor will start the parallel runs of the Scraper Actor. ![orchestrator-actor-input.png](/assets/images/orchestrator-actor-input-37f6e29cbeb76c3db86773b4587e24ce.png) After starting the Orchestrator Actor, you will see the parallel runs initiated in the Apify Console. ![scraper-actor-runs.png](/assets/images/scraper-actor-runs-e07ddb88c801539c276c62a4a110f2e2.png) ## Summary[​](#summary "Direct link to Summary") In this tutorial, you learned how to run multiple instances of an Actor to scrape a website faster. You created an Orchestrator Actor to manage the parallel execution of the Scraper Actor runs. The Orchestrator Actor initialized the Scraper Actor runs and managed their state. The Scraper Actor utilized the request queue and dataset provided by the Orchestrator Actor to scrape the website. You could speed up the scraping process by running multiple instances of the Scraper Actor in parallel. The code in this tutorial is for learning purposes and does not cover all specific edge cases. You can modify it to suit your exact requirements and use cases. --- # How to optimize and speed up your web scraper **We all want our scrapers to run as cost-effective as possible. Learn how to think about performance in the context of web scraping and automation.** *** Especially if you are running your scrapers on [Apify](https://apify.com), performance is directly related to your wallet (or rather bank account). 
The slower and heavier your program is, the more proxy bandwidth, storage, [compute units](https://help.apify.com/en/articles/3490384-what-is-a-compute-unit) and higher [subscription plan](https://apify.com/pricing) you'll need. The goal of optimization is to make the code run as fast as possible while using the least resources possible. On Apify, the resources are memory and CPU usage (don't forget that the more memory you allocate to a run, the bigger share of CPU you get - proportionally). The memory alone should never be a bottleneck though. If it is, that means either a bug (memory leak) or bad architecture of the program (you need to split the computation into smaller parts). The rest of this article will focus only on optimizing CPU usage. You allocate more memory only to get more power from the CPU. One more thing to remember. Optimization has its own cost: development time. You should always think about how much time you're able to spend on it and if it's worth it. Before we dive into the practical side of things, let us diverge with an analogy to help us think about the performance of scrapers. ## Game development analogy[​](#analogy "Direct link to Game development analogy") Games are extremely complicated beasts. Every frame (usually 60 times a second), the game has to calculate the physics of the world, run AI, user input, and render everything into a beautiful scene. You can imagine that running all of that every 16 ms in a complicated game is a developer's nightmare. That's why a significant portion of game development is spent on optimizations. Every little waste matters. This is mainly true in the programming heart of the game - the engine. The engine is responsible for the heavy lifting of performance critical parts like physics, animation, AI, and rendering. Once the engine is built, you can design the game on top of it. You can add different spells, conversation chains, items, animations etc. to make your game cool. Those extra things may not run every frame and don't need to be optimized as heavily as the engine itself. Now, if you want to build your own game and you are not a C/C++ veteran with a team, you will likely use an existing engine (like Unreal or Unity) and focus on the design of the game environment itself. Unless you go crazy, the game will likely run just fine since those engines have already been optimized for you. Your job is to choose an appropriate engine and use it well. ## Back to scrapers[​](#back-to-scrapers "Direct link to Back to scrapers") What are the engines of the scraping world? A [browser](https://github.com/puppeteer/puppeteer?tab=readme-ov-file#puppeteer), an [HTTP library](https://www.npmjs.com/package/@apify/http-request), an [HTML parser](https://github.com/cheeriojs/cheerio), and a [JSON parser](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse). The CPU spends more than 99% of its workload in these libraries. As with engines, you are not likely gonna write these from scratch - instead you'll use something like [Crawlee](https://crawlee.dev) that handles a lot of the overheads for you. It is about how you use these tools. The small amount of code you write in your [`requestHandler`](https://crawlee.dev/api/http-crawler/interface/HttpCrawlerOptions#requestHandler) is absolutely insignificant compared to what is running inside these tools. In other words, it doesn't matter how many functions you call or how many variables you extract. 
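To make this concrete, here is a minimal Crawlee sketch (the URL is only an example): the few lines inside the `requestHandler` are all you write, while the HTTP requests, HTML parsing, concurrency scaling and retries (the parts that actually consume the CPU) all happen inside the library.

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Everything heavy (HTTP, parsing, autoscaling, retries) runs inside the crawler.
    async requestHandler({ request, $ }) {
        // Your code: a negligible fraction of the total CPU work.
        const title = $('title').text();
        console.log(`${request.url}: ${title}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```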
If you want to optimize your scrapers, you need to choose the most lightweight option among these tools and use it as little as possible. A crawler scraping only a JSON API can be as much as 200 times faster/cheaper than a browser-based solution. **Ranking of the tools from the most efficient to the least:**

1. **JSON API** (HTTP call + JSON parse) - Scraping an API (public or internal) is the best option. The response is usually smaller than the HTML page and the data are already structured and cheap to parse. Usable for about 30% of websites.
2. **Pure HTML** (HTTP call + HTML parse) - All data is on the main single HTML page. Often the HTML contains script and JSON data that are rich and nicely structured. Some pages can be quite big and the parsing is slower than for JSON. But it is still 10–20 times faster than a browser. Usable for about 90% of websites.
3. **Browser** (hundreds of HTTP calls, script execution, rendering) - Browsers are huge beasts. They do so much work to allow for smooth human interaction which makes them really inefficient for scraping. Use a browser only if it helps you bypass anti-scraping protection or if you need to interact with the page.

--- Sometimes you need to process the same URL several times, but each time with a different setup. For example, you may want to submit the same form with different data each time. Let's illustrate a solution to this problem by creating a scraper which starts with an array of keywords and inputs each of them to Google, one by one. Then it retrieves the results. > This isn't an efficient solution for searching keywords on Google. You could directly enqueue search URLs like `https://www.google.cz/search?q=KEYWORD`. # Enqueuing start pages for all keywords > Solving a common problem with the scraper automatically deduplicating the same URLs. First, we need to start the scraper on the page from which we're going to do our enqueuing. To do that, we create one start URL with the label "enqueue" and URL "". Now we can proceed to enqueue all the pages. The first part of our `pageFunction` will look like this:

```
async function pageFunction(context) {
    const $ = context.jQuery;
    if (context.request.userData.label === 'enqueue') {
        // parse input keywords
        const keywords = context.customData;
        // process all the keywords
        for (const keyword of keywords) {
            // enqueue the page and pass the keyword in the userData attribute
            await context.enqueueRequest({
                url: 'https://google.com',
                uniqueKey: `${Math.random()}`,
                userData: {
                    label: 'fill-form',
                    keyword,
                },
            });
        }
        // No return here because we don't extract any data yet
    }
}
```

To set the keywords, we're using the customData scraper parameter. This is useful for smaller data sets, but may not be perfect for bigger ones. For such cases you may want to use something like [Importing a list of URLs from an external source](https://docs.apify.com/academy/node-js/scraping-urls-list-from-google-sheets). Since we're enqueuing the same page more than once, we need to set our own uniqueKey so the page will be added to the queue (by default, uniqueKey is set to be the same as the URL). The label for the next page will be "fill-form". We're passing the keyword to the next page in the userData field (this can contain any data). # Inputting the keyword into Google Now we come to the next page (Google). We need to retrieve the keyword and input it into the Google search bar.
This will be the next part of the pageFunction:

```
async function pageFunction(context) {
    const $ = context.jQuery;
    if (context.request.userData.label === 'enqueue') {
        // copy from the previous part
    } else if (context.request.userData.label === 'fill-form') {
        // retrieve the keyword
        const { keyword } = context.request.userData;
        // input the keyword into the search bar
        $('#lst-ib').val(keyword);
        // submit the form
        $('#tsf').submit();
    }
}
```

For the next page to correctly enqueue, we're going to need a new pseudoURL. Create a pseudoURL with the label "result" and the URL `https://www.google.com/search?[.+]`. Now we're on the last page and can finally extract the results.

```
async function pageFunction(context) {
    const $ = context.jQuery;
    if (context.request.userData.label === 'enqueue') {
        // copy from the previous part
    } else if (context.request.userData.label === 'result') {
        // create result array
        const result = [];
        // process all the results
        $('.rc').each((index, elem) => {
            // wrap element in jQuery
            const gResult = $(elem);
            // lookup link and text
            const link = gResult.find('.r a');
            const text = gResult.find('.s .st');
            // extract data and add it to result array
            result.push({
                name: link.text(),
                link: link.attr('href'),
                text: text.text(),
            });
        });
        // Now we finally return
        return result;
    }
}
```

To test the scraper, set the customData to something like this `["apple", "orange", "banana"]` and push the Run button to start. --- # Request labels and how to pass data to other requests Are you trying to use Actors for the first time and don't know how to deal with the request label or how to pass data to the request? Here's how to do it. If you are using the requestQueue, you can do it this way. When you add a request to the queue, use the userData attribute.

```
// Open a request queue.
const requestQueue = await Apify.openRequestQueue();

// Add the request to the queue
await requestQueue.addRequest({
    url: 'https://www.example.com/',
    userData: {
        label: 'START',
    },
});
```

Right now, we have one request in the queue that has the label "START". Now we can specify which code should be executed for this request in the handlePageFunction.

```
if (request.userData.label === 'START') {
    // your code for the first request,
    // for example enqueue the items of a shop
} else if (request.userData.label === 'ITEM') {
    // other code for the item of a shop
}
```

And in the same way you can keep adding requests in the handlePageFunction. You can also handle the passing of data to the request like this. For example, when we have extracted the item from the shop above, we want to extract some information about the seller. We need to pass the item object to the seller page, where we save, for example, the seller's rating.

```
await requestQueue.addRequest({
    url: sellerDetailUrl,
    userData: {
        label: 'SELLERDETAIL',
        data: itemObject,
    },
});
```

Now, in the "SELLERDETAIL" request, we can evaluate the page and merge the extracted data with the object from the item detail, for example like this:

```
const result = { ...request.userData.data, ...sellerDetail };
```

Save the results, and we're done!

```
await Apify.pushData(result);
```

--- # How to scrape from sitemaps

> **Processing sitemaps automatically with Crawlee:** Crawlee allows you to scrape sitemaps with ease. If you are using Crawlee, you can skip the following steps and just gather all the URLs from the sitemap in a few lines of code.
```
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://www.mysite.com');

const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
```

**The sitemap.xml file is a jackpot for every web scraper developer. Take advantage of this and learn an easier way to extract data from websites using Crawlee.** *** Let's say we want to scrape a database of craft beers ([brewbound.com](https://www.brewbound.com/)) before summer starts. If we are lucky, the website will contain a sitemap at [brewbound.com/sitemap.xml](https://www.brewbound.com/sitemap.xml). > Check out [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), which can discover sitemaps in hidden locations! ## Analyzing the sitemap[​](#analyzing-the-sitemap "Direct link to Analyzing the sitemap") The sitemap is usually located at the path **/sitemap.xml**. It is always worth trying that URL, as it is rarely linked anywhere on the site. It usually contains a list of all pages in [XML format](https://en.wikipedia.org/wiki/XML).

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.brewbound.com/advertise</loc>
        <lastmod>2015-03-19</lastmod>
        <changefreq>daily</changefreq>
    </url>
    ...
</urlset>
```

The URLs of breweries take this form:

```
http://www.brewbound.com/breweries/[BREWERY_NAME]
```

And the URLs of craft beers look like this:

```
http://www.brewbound.com/breweries/[BREWERY_NAME]/[BEER_NAME]
```

They can be matched using the following regular expression:

```
http(s)?:\/\/www\.brewbound\.com\/breweries\/[^\/<]+\/[^\/<]+
```

The `[^\/<]` parts make sure the match stops at a slash or at the `<` character, which starts the closing `</loc>` tag, which closes each URL. ## Scraping the sitemap in Crawlee[​](#scraping-the-sitemap "Direct link to Scraping the sitemap in Crawlee") If you're scraping sitemaps (or anything else, really), [Crawlee](https://crawlee.dev) is perfect for the job. First, let's add the beer URLs from the sitemap to the [`RequestList`](https://crawlee.dev/api/core/class/RequestList) using our regular expression to match only the (craft!!) beer URLs and not pages of breweries, contact page, etc.

```
const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm,
}]);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page }) {
        const beerPage = await page.evaluate(() => {
            return document.getElementsByClassName('productreviews').length;
        });
        if (!beerPage) return;

        const data = await page.evaluate(() => {
            const title = document.getElementsByTagName('h1')[0].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Dataset.pushData(data);
    },
});
```

## Full code[​](#full-code "Direct link to Full code") If we create a new Actor using the code below on the [Apify platform](https://docs.apify.com/academy/apify-platform.md), it returns a nicely formatted spreadsheet containing a list of breweries with their beers and descriptions. Make sure to use the **apify/actor-node-puppeteer-chrome** image for your Dockerfile, otherwise the run will fail.
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IERhdGFzZXQsIFB1cHBldGVlckNyYXdsZXIsIFJlcXVlc3RMaXN0IH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuY29uc3QgcmVxdWVzdExpc3QgPSBhd2FpdCBSZXF1ZXN0TGlzdC5vcGVuKG51bGwsIFt7XFxuICAgIHJlcXVlc3RzRnJvbVVybDogJ2h0dHBzOi8vd3d3LmJyZXdib3VuZC5jb20vc2l0ZW1hcC54bWwnLFxcbiAgICByZWdleDogL2h0dHAocyk_OlxcXFwvXFxcXC93d3dcXFxcLmJyZXdib3VuZFxcXFwuY29tXFxcXC9icmV3ZXJpZXNcXFxcL1teLzxdK1xcXFwvW14vPF0rL2dtLFxcbn1dKTtcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICByZXF1ZXN0TGlzdCxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyBwYWdlIH0pIHtcXG4gICAgICAgIGNvbnN0IGJlZXJQYWdlID0gYXdhaXQgcGFnZS5ldmFsdWF0ZSgoKSA9PiB7XFxuICAgICAgICAgICAgcmV0dXJuIGRvY3VtZW50LmdldEVsZW1lbnRzQnlDbGFzc05hbWUoJ3Byb2R1Y3RyZXZpZXdzJykubGVuZ3RoO1xcbiAgICAgICAgfSk7XFxuICAgICAgICBpZiAoIWJlZXJQYWdlKSByZXR1cm47XFxuXFxuICAgICAgICBjb25zdCBkYXRhID0gYXdhaXQgcGFnZS5ldmFsdWF0ZSgoKSA9PiB7XFxuICAgICAgICAgICAgY29uc3QgdGl0bGUgPSBkb2N1bWVudC5nZXRFbGVtZW50c0J5VGFnTmFtZSgnaDEnKVswXS5pbm5lclRleHQ7XFxuICAgICAgICAgICAgY29uc3QgW2JyZXdlcnksIGJlZXJdID0gdGl0bGUuc3BsaXQoJzonKTtcXG4gICAgICAgICAgICBjb25zdCBkZXNjcmlwdGlvbiA9IGRvY3VtZW50LmdldEVsZW1lbnRzQnlDbGFzc05hbWUoJ3Byb2R1Y3RyZXZpZXdzJylbMF0uaW5uZXJUZXh0O1xcblxcbiAgICAgICAgICAgIHJldHVybiB7IGJyZXdlcnksIGJlZXIsIGRlc2NyaXB0aW9uIH07XFxuICAgICAgICB9KTtcXG5cXG4gICAgICAgIGF3YWl0IERhdGFzZXQucHVzaERhdGEoZGF0YSk7XFxuICAgIH0sXFxufSk7XFxuXFxuYXdhaXQgY3Jhd2xlci5ydW4oKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.KFqjQiNxNkx_HPnvJ4H_W0e58W3L7D_Ga9pq_ZQ7tqI\&asrc=run_on_apify)

```
import { Dataset, PuppeteerCrawler, RequestList } from 'crawlee';

const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://www.brewbound.com/sitemap.xml',
    regex: /http(s)?:\/\/www\.brewbound\.com\/breweries\/[^/<]+\/[^/<]+/gm,
}]);

const crawler = new PuppeteerCrawler({
    requestList,
    async requestHandler({ page }) {
        const beerPage = await page.evaluate(() => {
            return document.getElementsByClassName('productreviews').length;
        });
        if (!beerPage) return;

        const data = await page.evaluate(() => {
            const title = document.getElementsByTagName('h1')[0].innerText;
            const [brewery, beer] = title.split(':');
            const description = document.getElementsByClassName('productreviews')[0].innerText;

            return { brewery, beer, description };
        });

        await Dataset.pushData(data);
    },
});

await crawler.run();
```

--- # How to scrape sites with a shadow DOM **The shadow DOM enables isolation of web components, but causes problems for those building web scrapers. Here's a workaround.** *** Each website is represented by an HTML DOM, a tree-like structure consisting of HTML elements (e.g. paragraphs, images, videos) and text. [Shadow DOM](https://developer.mozilla.org/en-US/docs/Web/API/Web_components/Using_shadow_DOM) allows separate DOM trees to be attached to the main DOM while remaining isolated in terms of CSS inheritance and JavaScript DOM manipulation. The CSS and JavaScript code of separate shadow DOM components does not clash, but the downside is that you can't access the content from outside. Let's take a look at this page [alodokter.com](https://www.alodokter.com/). If you click on the menu and open a Chrome debugger, you will see that the menu tree is attached to the main DOM as a shadow DOM under the element `<top-navbar-view>`. ![Shadow root of the top-navbar-view custom element](/assets/images/shadow-023c6b4266de5874b37593ca6e0a0ad6.png) The rest of the content is rendered the same way.
This makes it hard to scrape because `document.body.innerText`, `document.getElementsByTagName('a')`, and all others return an empty result. The content of the menu can be accessed only via the [`shadowRoot`](https://developer.mozilla.org/en-US/docs/Web/API/ShadowRoot) property. If you use jQuery you can do the following:

```
// Find the element that is the shadow root of the menu DOM tree.
const { shadowRoot } = document.getElementById('top-navbar-view');

// Create a copy of its HTML and use jQuery to find the links.
const links = $(shadowRoot.innerHTML).find('a');

// Get URLs from the link elements.
const urls = links.map((obj, el) => el.href);
```

However, this isn't very convenient, because you have to find the root element of each component you want to work with, and you can't take advantage of all the scripts and tools you already have. Instead of that, we can replace the content of each element containing a shadow DOM with the HTML of that shadow DOM.

```
// Iterate over all elements in the main DOM.
for (const el of document.getElementsByTagName('*')) {
    // If the element contains a shadow root, then replace its
    // content with the HTML of the shadow DOM.
    if (el.shadowRoot) el.innerHTML = el.shadowRoot.innerHTML;
}
```

After you run this, you can access all the elements and content using jQuery or plain JavaScript. The downside is that it breaks all the interactive components, because you create a new copy of the shadow DOM HTML content without the JavaScript code and CSS attached, so this must be done after all the content has been rendered. Some websites may contain shadow DOMs recursively inside of shadow DOMs. In these cases, we must replace them with HTML recursively:

```
// Returns the HTML of a given shadow DOM.
const getShadowDomHtml = (shadowRoot) => {
    let shadowHTML = '';
    for (const el of shadowRoot.childNodes) {
        shadowHTML += el.nodeValue || el.outerHTML;
    }
    return shadowHTML;
};

// Recursively replaces shadow DOMs with their HTML.
const replaceShadowDomsWithHtml = (rootElement) => {
    for (const el of rootElement.querySelectorAll('*')) {
        if (el.shadowRoot) {
            replaceShadowDomsWithHtml(el.shadowRoot);
            el.innerHTML += getShadowDomHtml(el.shadowRoot);
        }
    }
};

replaceShadowDomsWithHtml(document.body);
```

--- # Scraping a list of URLs from a Google Sheets document You can export URLs from [Google Sheets](https://workspace.google.com/products/sheets/) such as [this one](https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w) directly into an [Actor](https://docs.apify.com/platform/actors.md)'s Start URLs field.

1. Make sure the spreadsheet has one sheet and a simple structure to help the Actor find the URLs.
2. Add the `/gviz/tq?tqx=out:csv` path and query parameter to the Google Sheet URL base, right after the long document identifier part, for example `https://docs.google.com/spreadsheets/d/1-2mUcRAiBbCTVA5KcpFdEYWflLMLp9DDU3iJutvES4w/gviz/tq?tqx=out:csv`. This automatically exports the spreadsheet to CSV format.
3. In the Actor's input, click Link remote text file and paste the URL there:

![List of URLs](/assets/images/gsheets-url-27adbc7f89057db71fc4d2f03a65cedf.png) IMPORTANT: Make sure anyone with the link can view the document. Otherwise, the Actor will not be able to access it. ![Link sharing](/assets/images/anyone-with-link-38a1b714c55ca2b0f1ee21c9adaed0a3.png) --- When doing web automation with Apify, it can sometimes be necessary to submit an HTML form with a file attachment. This article will cover a situation where the file is publicly accessible (e.g. hosted somewhere) and will use an Apify Actor.
If it's impossible to use request-promise, it might be necessary to use [Puppeteer](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment). # Downloading the file to memory **How to submit a form with attachment using request-promise.** *** After creating a new Actor, the first thing to do is download the file. We can do that using the request-promise module, so make sure it is included. ``` const request = require('request-promise'); ``` The actual downloading is going to be slightly different for text and binary files. For a text file, do it like this: ``` const fileData = await request('https://example.com/file.txt'); ``` For a binary file, we need to provide additional parameters so as not to interpret it as text: ``` const fileData = await request({ uri: 'https://example.com/file.pdf', encoding: null, }); ``` In this case, fileData will be a Buffer instead of a String. # Submitting the form When the file is ready, we can submit the form as follows: ``` await request({ uri: 'https://example.com/submit-form.php', method: 'POST', formData: { // set any form values name: 'John', surname: 'Doe', email: 'john.doe@example.com', // add the attachment attachment: { value: fileData, options: { filename: 'file.pdf', contentType: 'application/pdf', }, }, }, }); ``` The header Content-Type: multipart/form-data will be set automatically. --- # Submitting forms on .ASPX pages Apify users sometimes need to submit a form on pages created with ASP.NET (URL typically ends with .aspx). These pages have a different approach for how they submit forms and navigate through pages. This tutorial shows you how to handle these kinds of pages. This approach is based on a [blog post](https://web.archive.org/web/20230530120937/https://toddhayton.com/2015/05/04/scraping-aspnet-pages-with-ajax-pagination/) from Todd Hayton, where he explains how crawlers for ASP.NET pages should work. First of all, you need to copy\&paste this function to your [Web Scraper](https://apify.com/apify/web-scraper) *Page function*: ``` const enqueueAspxForm = async function (request, formSelector, submitButtonSelector, async) { request.payload = $(formSelector).serialize(); if ($(submitButtonSelector).length) { request.payload += decodeURIComponent(`&${$(submitButtonSelector).attr('name')}=${$(submitButtonSelector).attr('value')}`); } request.payload += decodeURIComponent(`&__ASYNCPOST=${async.toString()}`); request.method = 'POST'; request.uniqueKey = Math.random(); await context.enqueueRequest(request); return request; }; ``` The function has these parameters: `request` - the object that describes the next request `formSelector` - selector for a form to be submitted e.g 'form\[name="test"]' `submitButtonSelector` - selector for a button for submit form e.g. '#nextPageButton' `async` - if true, request returns only params, not HTML content Then you can use it in your Page function as follows: ``` await enqueueAspxForm({ url: 'http://architectfinder.aia.org/frmSearch.aspx', userData: { label: 'SEARCH-RESULT' }, }, 'form[name="aspnetForm"]', '#ctl00_ContentPlaceHolder1_btnSearch', false); ``` --- # Using man-in-the-middle proxy to intercept requests in Puppeteer Sometimes you may need to intercept (or maybe block) requests in headless Chrome / Puppeteer, but `page.setRequestInterception()`  is not 100% reliable when the request is started in a new window. One possible way to intercept these requests is to use a man-in-the-middle (MITM) proxy, i.e. 
a proxy server that can intercept and modify HTTP requests, even those over HTTPS. In this example, we're going to use the [http-mitm-proxy](https://www.npmjs.com/package/http-mitm-proxy) package, since it has all the tools that we need. First we set up the MITM proxy:

```
const { promisify } = require('util');
const { exec } = require('child_process');
const Proxy = require('http-mitm-proxy');
const Promise = require('bluebird');

const execPromise = promisify(exec);
const wait = (timeout) => new Promise((resolve) => setTimeout(resolve, timeout));

const setupProxy = async (port) => {
    // Setup chromium certs directory
    // WARNING: this only works in debian docker images
    // modify it for any other use cases or local usage.
    await execPromise('mkdir -p $HOME/.pki/nssdb');
    await execPromise('certutil -d sql:$HOME/.pki/nssdb -N');

    const proxy = Proxy();
    proxy.use(Proxy.wildcard);
    proxy.use(Proxy.gunzip);

    return new Promise((resolve, reject) => {
        proxy.listen({ port, silent: true }, (err) => {
            if (err) return reject(err);
            // Add the CA certificate to chromium and return the initialized proxy object
            execPromise('certutil -d sql:$HOME/.pki/nssdb -A -t "C,," -n mitm-ca -i ./.http-mitm-proxy/certs/ca.pem')
                .then(() => resolve(proxy))
                .catch(reject);
        });
    });
};
```

Then we'll need a Docker image that has the `certutil` utility. Here is an [example of a Dockerfile](https://github.com/apify/actor-example-proxy-intercept-request/blob/master/Dockerfile) that can create such an image and is based on the [apify/actor-node-chrome](https://hub.docker.com/r/apify/actor-node-chrome/) image that contains Puppeteer. Now we need to specify how the proxy shall handle the intercepted requests:

```
// Setup blocking of requests in the proxy
const proxyPort = 8000;
const proxy = await setupProxy(proxyPort);

proxy.onRequest((context, callback) => {
    if (blockRequests) {
        const request = context.clientToProxyRequest;
        // Log out blocked requests
        console.log('Blocked request:', request.headers.host, request.url);

        // Close the connection with custom content
        context.proxyToClientResponse.end('Blocked');
        return;
    }
    return callback();
});
```

The final step is to let Puppeteer use the local proxy:

```
// Launch puppeteer with the local proxy
const browser = await puppeteer.launch({
    args: ['--no-sandbox', `--proxy-server=localhost:${proxyPort}`],
});
```

And we're done! By adjusting the `blockRequests` variable, you can allow or block any request initiated through Puppeteer. Here is a GitHub repository with a full example and all necessary files: [apify/actor-example-proxy-intercept-request](https://github.com/apify/actor-example-proxy-intercept-request). If you have any questions, feel free to contact us in the chat. Happy intercepting! --- # Waiting for dynamic content Use these helper functions to wait for data:

* `page.waitFor` in [Puppeteer](https://pptr.dev/) (or Puppeteer Scraper ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper))).
* `context.waitFor` in Web Scraper ([apify/web-scraper](https://apify.com/apify/web-scraper)).

Pass in time in milliseconds or a selector to wait for. Examples:

* `await page.waitFor(10000)` - waits for 10 seconds.
* `await context.waitFor('my-selector')` - waits for `my-selector` to appear on the page.

For details, code examples, and advanced use cases, visit our [documentation](https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md). --- # When to use Puppeteer Scraper You may have read in the [Web Scraper](https://apify.com/apify/web-scraper) readme or somewhere else at Apify that [Puppeteer Scraper](https://apify.com/apify/puppeteer-scraper) is more powerful and gives you more control over the browser, enabling you to do almost anything.
But what does that really mean? In this article, we will talk about the differences in more detail and show you some minimal examples to strengthen that understanding. ## What exactly is Puppeteer?[​](#what-exactly-is-puppeteer "Direct link to What exactly is Puppeteer?") Both the Web Scraper and Puppeteer Scraper use Puppeteer to control the Chrome browser, so, what's the difference? Consider Puppeteer and Chrome as two separate programs. Puppeteer is a JavaScript program that's used to control the browser and by controlling we mean opening tabs, closing tabs, moving the mouse, clicking buttons, typing on the keyboard, managing network activity, etc. If a website is watching for any of these events, there is no way for it to know that those actions were performed by a robot and not a human user. Chrome is just Chrome as you know it. *Robot browsers can be detected in numerous ways.. But there are no ways to tell if a specific mouse click was made by a user or a robot.* Ok, so both Web Scraper and Puppeteer Scraper use Puppeteer to give commands to Chrome. Where's the difference? It's called the execution environment. ## Execution environment[​](#execution-environment "Direct link to Execution environment") It may sound fancy, but it's just a technical term for "where does my code run". When you open the DevTools and start typing JavaScript in the browser Console, it gets executed in the browser. Browser is the code's execution environment. But you can't control the browser from the inside. For that, you need a different environment. Puppeteer's environment is Node.js. If you don't know what Node.js is, don't worry about it too much. Remember that it's the environment where Puppeteer runs. By now you probably figured this out on your own, so this will not come as a surprise. The difference between Web Scraper and Puppeteer Scraper is where your page function gets executed. When using the Web Scraper, it's executed in the browser environment. It means that it gets access to all the browser specific features such as the `window` or `document` objects, but it cannot control the browser with Puppeteer directly. This is done automatically in the background by the scraper. Whereas in Puppeteer Scraper, the page function is executed in the Node.js environment, giving you full access to Puppeteer and all its features. ![Puppeteer Scraper Diagram](/assets/images/puppeteer-scraper-diagram-5eb36bbee183cfd0066ee3807e8f9073.jpeg) *This does not mean that you can't execute in-browser code with Puppeteer Scraper. Keep reading to learn how.* ## Practical differences[​](#practical-differences "Direct link to Practical differences") Ok, cool, different environments, but how does that help you scrape stuff? Actually, quite a lot. Some things you just can't do from within the browser, but you can do them with Puppeteer. We will not attempt to create an exhaustive list, but rather show you some very useful features that we use every day in our scraping. ## Evaluating in-browser code[​](#evaluating-in-browser-code "Direct link to Evaluating in-browser code") In Web Scraper, everything runs in the browser, so there's really not much to talk about there. With Puppeteer Scraper, it's a single function call away. 
```
const bodyHTML = await context.page.evaluate(() => {
    console.log('This will be printed in browser console.');
    return document.body.innerHTML;
});
```

The `context.page.evaluate()` call executes the provided function in the browser environment and passes the return value back to the Node.js environment. One very important caveat though! Since we're in different environments, we cannot use our existing variables, such as `context`, inside of the evaluated function, because they are not available there. Different environments, different variables. *See the* `page.evaluate()` *[documentation](https://pptr.dev/#?product=Puppeteer\&show=api-pageevaluatepagefunction-args) for info on how to pass variables from Node.js to browser.* With the help of the Apify SDK, we can even inject jQuery into the browser. You can use the `Pre goto function` input option to manipulate the page's environment before it loads.

```
async function preGotoFunction({ request, page, Apify }) {
    await Apify.utils.puppeteer.injectJQuery(page);
}
```

This will make jQuery available in all pages. You can then use it in `context.page.evaluate()` calls:

```
const bodyText = await context.page.evaluate(() => {
    return $('body').text();
});
```

You can do a lot of DOM manipulation directly from Node.js / Puppeteer, but when you're planning to do a lot of sequential operations, it's often better and faster to do it with jQuery in a single `context.page.evaluate()` call than using multiple `context.page.$`, `context.page.$eval()` and other Puppeteer methods. ## Navigation to other pages (URLs)[​](#navigation-to-other-pages-urls "Direct link to Navigation to other pages (URLs)") In Web Scraper, your page function literally runs within a page, so it makes sense that when this page gets destroyed, the page function throws an error. Sadly, navigation (going to a different URL) destroys pages, so whenever you click a button in Web Scraper that forces the browser to navigate somewhere else, you end up with an error. In Puppeteer Scraper, this is not an issue, because the `page` object gets updated with new data seamlessly. Imagine that you currently have `https://example.com/page-1` open and there's a button on the page that will take you to `https://example.com/page-2`. Or that you're on `https://google.com` and you fill in the search bar and click on the search button. Consider the following code inside a Web Scraper page function:

```
await context.waitFor('button');
$('button').click();
```

With a `button` that takes you to the next page or launches a Google search (which takes you to the results page), the page function will fail with a nasty error. However, when using Puppeteer Scraper, this code:

```
await context.page.waitFor('button');
await Promise.all([
    context.page.waitForNavigation(),
    context.page.click('button'),
]);
```

Will work as expected and, after the `Promise.all()` call resolves, you will have the next page loaded and ready for scraping. Pay special attention to the `page.waitForNavigation()` ([see docs](https://pptr.dev/#?product=Puppeteer\&show=api-pagewaitfornavigationoptions)) call, which is very important. It pauses your script until the navigation completes. Without it, the execution would continue immediately after the mouse click. It's also important that you place it before the click itself, otherwise it creates a race condition and your script will behave unpredictably.
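For contrast, here is a sketch of the ordering that causes the race condition described above (the `button` selector is just illustrative): if the click comes first, the navigation may already be under way, or even finished, by the time `waitForNavigation()` starts listening, so it can miss the event entirely and hang until it times out.

```
// Anti-pattern: don't do this.
await context.page.click('button'); // navigation is triggered here
await context.page.waitForNavigation(); // may attach its listener too late and time out
```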
You can go even further and navigate programmatically by calling: ``` await context.page.goto('https://some-new-page.com'); ``` ## Intercepting network activity[​](#intercepting-network-activity "Direct link to Intercepting network activity") Some very useful scraping techniques revolve around listening to network requests and responses and even modifying them on the fly. Web Scraper's page function doesn't have access to the network, besides calling JavaScript APIs such as `fetch()`. Puppeteer Scraper, on the other hand, has full control over the browser's network activity. You can listen to all the network requests that are being dispatched from the browser. For example, the following code will print all their URLs to the console. ``` context.page.on('request', (req) => console.log(req.url())); ``` This can be useful in many ways, such as blocking unwanted assets or scripts from being downloaded, modifying request methods or faking responses, etc. *Explaining how to do interception properly is out of scope of this article. See [Puppeteer docs](https://pptr.dev/#?product=Puppeteer\&show=api-pagesetrequestinterceptionvalue) and the [Apify SDK helper](https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteeraddinterceptrequesthandler-promise) for request interception.* ## Enqueueing JavaScript links[​](#enqueueing-javascript-links "Direct link to Enqueueing JavaScript links") A large number of websites use either form submissions or JavaScript redirects for navigation and displaying of data. With Web Scraper, you cannot crawl those websites, because there are no links to find and enqueue on those pages. Puppeteer Scraper enables you to automatically click all those elements that cause navigation, intercept the navigation requests and enqueue them to the request queue. If it seems complicated, don't worry. We've abstracted all the complexity away to a `Clickable elements selector` input option. When left empty, none of the said clicking and intercepting happens, but once you choose a selector, Puppeteer Scraper will automatically click all the selected elements, watch for page navigations and enqueue them into the `RequestQueue`. *The* `Clickable elements selector` *will also work on regular non-JavaScript links, however, it is significantly slower than using the plain* `Link selector`*. Unless you know you need it, use the* `Link selector` *for best performance.* ## Word of caution[​](#word-of-caution "Direct link to Word of caution") Since we're actually clicking in the page, which may or may not trigger some nasty JavaScript, anything can happen really, including the page completely breaking. Three common scenarios exist though. ## Plain form submit navigations[​](#plain-form-submit-navigations "Direct link to Plain form submit navigations") This works out of the box. It's typically used on older websites such as [Turkish Remax](https://www.remax.com.tr/ofis-office-franchise-girisimci-agent-arama). For a site like this you can set the `Clickable elements selector` and you're good to go: ``` 'a[onclick ^= getPage]'; ``` ## Form submit navigations with side-effects[​](#form-submit-navigations-with-side-effects "Direct link to Form submit navigations with side-effects") Those are similar to the ones above with an important caveat. Once you click the first thing, it usually modifies the page in a way that causes more clicking to become impossible. We deal with those by scraping the pages one by one, using the pagination "next" button. 
See [Maxwell Materials](http://www.maxwellrender.com/materials/) and use the following selector:

```
'li.page-item.next a';
```

## Frontend navigations[​](#frontend-navigations "Direct link to Frontend navigations") Websites often won't navigate away just to fetch the next set of results. They will do it in the background and update the displayed data. You can paginate such websites with either Web Scraper or Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Click the next button to load the next set of courses.

```
// Web Scraper
$('li a span.pagination-next').click();

// Puppeteer Scraper
await page.click('li a span.pagination-next');
```

## Using Apify SDK[​](#using-apify-sdk "Direct link to Using Apify SDK") [Apify SDK](https://docs.apify.com/sdk/js) is the library we used to build all of our scrapers. For power users, it is the best tool out there to scrape using JavaScript. If you're not yet ready to start writing your own Actors using the SDK, Puppeteer Scraper enables you to use its features without having to worry about building your own Actors. The possibilities are endless, but to show you some examples:

* Check out the [Apify.utils.puppeteer.infiniteScroll()](https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteer.infiniteScroll) function that enables scraping pages with infinite scroll in one line of code.
* [Apify.utils.puppeteer.blockRequests()](https://docs.apify.com/sdk/js/docs/api/puppeteer#puppeteer.blockRequests) allows you to block network requests based on URL patterns.
* [Apify.openDataset()](https://docs.apify.com/sdk/js/docs/api/apify#module_Apify.openDataset) lets you work with any dataset under your account.
* Make HTTP requests with `Apify.utils.requestAsBrowser()` to fetch external resources.

And we're only scratching the surface here. ## Wrapping it up[​](#wrapping-it-up "Direct link to Wrapping it up") Many more techniques are available to Puppeteer Scraper that are either too complicated to replicate in Web Scraper or downright impossible to do. Web Scraper is a great tool for basic scraping, because it goes right to the point and uses in-browser JavaScript, which is well-known to millions of people, even non-developers. Once you start hitting some roadblocks, you may find that Puppeteer Scraper is just what you need to overcome them. And if Puppeteer Scraper still doesn't cut it, there's still the Apify SDK to rule them all. We hope you found this tutorial helpful and happy scraping. --- # PHP Tutorials 👨‍💻📚 **A collection of PHP tutorials to aid you in your journey to becoming a master web scraping and automation developer.** *** This section contains web-scraping or web-scraping related tutorials for PHP. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or need some general tips for scraping in Apify with PHP, this section is right for you. --- # How to use Apify from PHP Apify's [RESTful API](https://docs.apify.com/api/v2#) allows you to use the platform from basically anywhere. Many projects are and will continue to be built using [PHP](https://www.php.net/). This tutorial enables you to use Apify in these projects in PHP and frameworks built on it. Apify does not have an official PHP client (yet), so we are going to use [guzzle](https://github.com/guzzle/guzzle), a great library for HTTP requests. By covering a few fundamental endpoints, this tutorial will show you the principles you can use for all Apify API endpoints.
## Before you start[​](#before-you-start "Direct link to Before you start") Make sure you have an Apify account and API token. You will find the token in the [Integrations](https://console.apify.com/account#/integrations) section in Apify Console. If you don't already have guzzle installed in your project (or just want to try out the code examples), run `composer require guzzlehttp/guzzle` to install it in the current directory. ## Preparing the client[​](#preparing-the-client "Direct link to Preparing the client") To get a guzzle instance ready to be used with the Apify API, we first need to set up the base endpoint and authentication.

```
require 'vendor/autoload.php';

$client = new \GuzzleHttp\Client([
    'base_uri' => 'https://api.apify.com/v2/',
    'headers' => [
        // Replace <YOUR_API_TOKEN> with your actual token
        'Authorization' => 'Bearer <YOUR_API_TOKEN>',
    ]
]);
```

Note that we pass the API token in the header. It can also be passed as a query string `token` parameter, but passing it in the header is preferred and more secure. To check whether everything works well, we'll try to get information about the [current user](https://docs.apify.com/api/v2/users-me-get.md).

```
// Call the endpoint using our client
// Note that the path does not have a leading slash
$response = $client->get('users/me');

// Parse the response (most Apify API endpoints return JSON)
$parsedResponse = \json_decode($response->getBody(), true);

// The actual data are usually present under the `data` key
$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```

If, instead of data, you see an error saying `Authentication token is not valid`, check if the API token you used to instantiate the client is valid. ## Running an Actor[​](#running-an-actor "Direct link to Running an Actor") Now that we have our guzzle client ready to go, we can run some Actors. Let's try the **Contact Details Scraper** ([vdrmota/contact-info-scraper](https://apify.com/vdrmota/contact-info-scraper)). The [API reference](https://docs.apify.com/api/v2/act-runs-post.md) states that an Actor's input should be passed as JSON in the request body. Other options are passed as query parameters.

```
// To run the Actor, we make a POST request to its runs endpoint
// To identify the Actor, you can use its ID, but you can also pass
// the full Actor name [username]~[actorName] or just ~[actorName] for
// your own Actors
$response = $client->post('acts/vdrmota~contact-info-scraper/runs', [
    // Actors usually accept JSON as input. When using the `json` key in
    // a POST request's options, guzzle sets proper request headers
    // and serializes the array we pass in
    'json' => [
        'startUrls' => [
            ['url' => 'https://www.apify.com/contact']
        ],
        'maxDepth' => 0,
    ],
    // Other run options are passed in as query parameters
    // This is optional since Actors usually have reasonable defaults
    'query' => [
        'timeout' => 30
    ],
]);

$parsedResponse = \json_decode($response->getBody(), true);

$data = $parsedResponse['data'];

echo \json_encode($data, JSON_PRETTY_PRINT);
```

You should see information about the run, including its ID and the ID of its default [dataset](https://docs.apify.com/platform/storage/dataset.md). Take note of these, we will need them later. ## Getting the results from dataset[​](#getting-the-results-from-dataset "Direct link to Getting the results from dataset") Actors usually store their output in a default dataset. The [Actor runs endpoint](https://docs.apify.com/api/v2/actor-runs.md) lets you get overall info about an Actor run's default dataset.
``` // Replace <RUN_ID> with the run ID from earlier $response = $client->get('actor-runs/<RUN_ID>/dataset'); $parsedResponse = \json_decode($response->getBody(), true); $data = $parsedResponse['data']; echo \json_encode($data, JSON_PRETTY_PRINT); ``` As you can see, the response contains overall stats about the dataset, like its number of items, but not the actual data. To get those, we have to call the **items** endpoint. ``` // Replace <RUN_ID> with the run ID from earlier $response = $client->get('actor-runs/<RUN_ID>/dataset/items'); // The dataset items endpoint returns an array of dataset items; // they are not under the `data` key like in other endpoints $data = \json_decode($response->getBody(), true); echo \json_encode($data, JSON_PRETTY_PRINT); ``` Some Actors write to datasets other than the default. In these cases, you need to have the dataset ID and call the `datasets/<DATASET_ID>` and `datasets/<DATASET_ID>/items` endpoints instead. For larger datasets, you can paginate through the results by passing query parameters. ``` $response = $client->get('datasets/<DATASET_ID>/items', [ 'query' => [ 'offset' => 20, 'limit' => 10, ] ]); $parsedResponse = \json_decode($response->getBody(), true); echo \json_encode($parsedResponse, JSON_PRETTY_PRINT); ``` All the available parameters are described in [our API reference](https://docs.apify.com/api/v2/dataset-items-get.md) and work for all datasets. ## Getting the results from key-value stores[​](#getting-the-results-from-key-value-stores "Direct link to Getting the results from key-value stores") Datasets are great for structured data, but are not suited for binary files like images or PDFs. In these cases, Actors store their output in [key-value stores](https://docs.apify.com/platform/storage/key-value-store.md). One such Actor is the **HTML String To PDF** ([mhamas/html-string-to-pdf](https://apify.com/mhamas/html-string-to-pdf)) converter. Let's run it. ``` $response = $client->post('acts/mhamas~html-string-to-pdf/runs', [ 'json' => [ 'htmlString' => 'Hello World' ], ]); $parsedResponse = \json_decode($response->getBody(), true); $data = $parsedResponse['data']; echo \json_encode($data, JSON_PRETTY_PRINT); ``` Keep track of the returned run ID. Similarly to datasets, we can get overall info about the default key-value store. ``` // Replace <RUN_ID> with the run ID returned by the code above $response = $client->get('actor-runs/<RUN_ID>/key-value-store'); $parsedResponse = \json_decode($response->getBody(), true); $data = $parsedResponse['data']; echo \json_encode($data, JSON_PRETTY_PRINT); ``` The items in key-value stores are not structured, so we cannot use the same approach as we did with dataset items. We can obtain some information about a store's content using its **keys** endpoint. ``` // Don't forget to replace <RUN_ID> with the run ID you got earlier $response = $client->get('actor-runs/<RUN_ID>/key-value-store/keys'); $parsedResponse = \json_decode($response->getBody(), true); $data = $parsedResponse['data']; echo \json_encode($data, JSON_PRETTY_PRINT); ``` We can see that there are two record keys: `INPUT` and `OUTPUT`. The HTML String to PDF Actor's README states that the PDF is stored under the `OUTPUT` key. Let's download it: ``` // Don't forget to replace <RUN_ID> with the run ID from earlier $response = $client->get('actor-runs/<RUN_ID>/key-value-store/records/OUTPUT'); // Make sure that the destination (filename) is writable file_put_contents(__DIR__ . '/hello-world.pdf', $response->getBody()); ``` If you open the generated `hello-world.pdf` file, you should see... well, "Hello World".
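If you want to double-check that you actually received a PDF (and not, for example, an error response), you can inspect the response headers before writing the file. Here is a minimal optional sketch building on the example above; it assumes the same `<RUN_ID>` placeholder and that the Actor stored the `OUTPUT` record with an `application/pdf` content type:

```
// Fetch the OUTPUT record again, this time checking the response metadata first
$response = $client->get('actor-runs/<RUN_ID>/key-value-store/records/OUTPUT');

// Guzzle exposes the response headers through the PSR-7 interface
$contentType = $response->getHeaderLine('Content-Type');

// We assume the record was stored as application/pdf
if (\strpos($contentType, 'application/pdf') === 0) {
    $body = (string) $response->getBody();
    file_put_contents(__DIR__ . '/hello-world.pdf', $body);
    echo 'Saved ' . \strlen($body) . " bytes\n";
} else {
    echo 'Unexpected content type: ' . $contentType . "\n";
}
```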
If the Actor stored the data in a key-value store other than the default, we can use the standalone endpoints: `key-value-stores/<STORE_ID>`, `key-value-stores/<STORE_ID>/keys`, and `key-value-stores/<STORE_ID>/records/<RECORD_KEY>`. They behave the same way as the default endpoints. [See the full docs](https://docs.apify.com/api/v2/storage-key-value-stores.md). ## When are the data ready[​](#when-are-the-data-ready "Direct link to When are the data ready") It takes some time for an Actor to generate its output. Some Actors even run for days! In the previous examples, we chose Actors whose runs only take a few seconds. This meant the runs had enough time to finish before we ran the code to retrieve their dataset or key-value store (so the Actor had time to produce some output). If we ran the code immediately after starting a longer-running Actor, the dataset would probably still be empty. For Actors that are expected to be quick, we can use the `waitForFinish` parameter. The Run Actor endpoint then does not respond immediately, but waits until the run finishes (up to the given time limit). Let's try this with the HTML String to PDF Actor. ``` $response = $client->post('acts/mhamas~html-string-to-pdf/runs', [ 'json' => [ 'htmlString' => 'Hi World' ], // Pass in how long we want to wait, in seconds 'query' => [ 'waitForFinish' => 60 ] ]); $parsedResponse = \json_decode($response->getBody(), true); $data = $parsedResponse['data']; echo \json_encode($data, JSON_PRETTY_PRINT); $runId = $data['id']; $response = $client->get(sprintf('actor-runs/%s/key-value-store/records/OUTPUT', $runId)); file_put_contents(__DIR__ . '/hi-world.pdf', $response->getBody()); ``` ## Webhooks[​](#webhooks "Direct link to Webhooks") For Actors that take longer to run, we can use [webhooks](https://docs.apify.com/platform/integrations/webhooks.md). A webhook is an HTTP POST request that is sent to a specified URL when an Actor's status changes. We can use them as a kind of notification that is sent when your run finishes. You can set them up using query parameters. If we used webhooks in the example above, it would look like this: ``` // Webhooks need to be passed as a base64-encoded JSON string $webhooks = \base64_encode(\json_encode([ [ // The webhook can be sent on multiple events // this one fires when the run succeeds 'eventTypes' => ['ACTOR.RUN.SUCCEEDED'], // Set this to some URL that you can react to // To see what is sent to the URL, // you can set up a temporary request bin at https://requestbin.com/r 'requestUrl' => '', ], ])); $response = $client->post('acts/mhamas~html-string-to-pdf/runs', [ 'json' => [ 'htmlString' => 'Hello World' ], 'query' => [ 'webhooks' => $webhooks ] ]); ``` ## How to use Apify Proxy[​](#how-to-use-apify-proxy "Direct link to How to use Apify Proxy") Let's use another important feature: [proxy](https://docs.apify.com/platform/proxy.md). If you want to make sure that your server's IP address won't get blocked somewhere when making requests, you can use the automatic proxy selection mode. ``` $client = new \GuzzleHttp\Client([ // Replace <YOUR_PROXY_PASSWORD> below with your password // found at https://console.apify.com/proxy 'proxy' => 'http://auto:<YOUR_PROXY_PASSWORD>@proxy.apify.com:8000' ]); // This request will be made through an automatically chosen proxy $response = $client->get("http://proxy.apify.com/?format=json"); echo $response->getBody(); ``` If you want to maintain the same IP between requests, you can use the session mode.
``` $client = new \GuzzleHttp\Client([ // Replace <YOUR_PROXY_PASSWORD> below with your password // found at https://console.apify.com/proxy 'proxy' => 'http://session-my_session:<YOUR_PROXY_PASSWORD>@proxy.apify.com:8000' ]); // Both responses should contain the same clientIp $response = $client->get("https://api.apify.com/v2/browser-info"); echo $response->getBody(); $response = $client->get("https://api.apify.com/v2/browser-info"); echo $response->getBody(); ``` [See the proxy docs](https://docs.apify.com/platform/proxy/usage.md) for more details on using specific proxies. ## Feedback[​](#feedback "Direct link to Feedback") Are you interested in an Apify PHP client or other PHP-related content? Do you have some feedback on this tutorial? [Let us know](https://apify.typeform.com/to/KqhmiJge#source=tutorial_use_apify_from_php)! --- # Puppeteer & Playwright course **Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.** *** [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/) are libraries that allow you to automate browsing. Based on your instructions, they can open a browser window, load a website, click on links, etc. They can also do this *headlessly*, i.e., in a way that the browser window isn't visible, which is faster. Both packages were developed by the same team and are very similar, which is why we have combined the Puppeteer course and the Playwright course into one super-course that shows code examples for both technologies. The two differ in only small ways, and those will always be highlighted in the examples. > Each lesson's activity will contain examples for both libraries, but we recommend using Playwright, as it is newer and has more features and better [documentation](https://playwright.dev/docs/intro). ## Advantages of using a headless browser[​](#advantages-of-headless-browsers "Direct link to Advantages of using a headless browser") When automating a headless browser, you can do a whole lot more in comparison to making HTTP requests for static content. In fact, you can programmatically do pretty much anything a human could do with a browser, such as clicking elements, taking screenshots, typing into text areas, etc. Additionally, since the requests aren't static, [dynamic content](https://docs.apify.com/academy/concepts/dynamic-pages.md) can be rendered and interacted with (or, data from the dynamic content can be scraped). Turn on the [headful mode](https://playwright.dev/docs/api/class-testoptions#test-options-headless) (`headless: false`) to see exactly what the browser is doing. Browsers can also be effective for [overcoming anti-scraping measures](https://docs.apify.com/academy/anti-scraping.md), especially if the website is running [JavaScript browser challenges](https://docs.apify.com/academy/anti-scraping/techniques/browser-challenges.md). ## Disadvantages of headless browsers[​](#disadvantages-of-headless-browsers "Direct link to Disadvantages of headless browsers") Browsers are slow and expensive to run. In the follow-up courses, the Apify Academy will show you how to scrape websites without a browser. Every website can potentially be reverse-engineered into a series of quick and cheap HTTP calls, but it might require significant effort and specialized knowledge. ## Setup[​](#setup "Direct link to Setup") For this course, we'll be jumping right into the features of these awesome libraries and expecting you to already have an environment set up. Here's how we set up our environment: 1.
Make sure you've installed [Node.js](https://nodejs.org/en/) 2. Create a new folder called **puppeteer-playwright** (or whatever you want to call it) 3. Run the command `npm init -y` within your new folder to automatically initialize the project 4. Add `"type": "module"` to the **package.json** file 5. Create a new file named **index.js** 6. Install the library you're going to be using during this course: * Install Playwright * Install Puppeteer ``` npm install playwright ``` ``` npm install puppeteer ``` > For a more in-depth guide on how to set up the basic environment we'll be using in this tutorial, check out the [**Computer preparation**](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/computer-preparation.md) lesson in the **Web scraping basics for JavaScript devs** course ## Course overview[​](#course-overview "Direct link to Course overview") 1. [Launching a browser](https://docs.apify.com/academy/puppeteer-playwright/browser.md) 2. [Opening a page](https://docs.apify.com/academy/puppeteer-playwright/page.md) * [Interacting with a page](https://docs.apify.com/academy/puppeteer-playwright/page/interacting-with-a-page.md) * [Waiting for content & events](https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md) * [Page methods](https://docs.apify.com/academy/puppeteer-playwright/page/page-methods.md) 3. [Executing scripts](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts.md) * [Injecting code](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code.md) * [Extracting data](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/collecting-data.md) 4. [Reading & intercepting requests](https://docs.apify.com/academy/puppeteer-playwright/reading-intercepting-requests.md) 5. [Using proxies](https://docs.apify.com/academy/puppeteer-playwright/proxies.md) 6. [Creating multiple browser contexts](https://docs.apify.com/academy/puppeteer-playwright/browser-contexts.md) 7. [Common use cases](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases.md) ## First up[​](#next "Direct link to First up") In the [first lesson](https://docs.apify.com/academy/puppeteer-playwright/browser.md) of this course, we'll be learning a bit about how to create and use the **Browser** object. --- # Browser **Understand what the Browser object is in Puppeteer/Playwright, how to create one, and a bit about how to interact with one.** *** In order to automate a browser in Playwright or Puppeteer, we need to open one up programmatically. Playwright supports Chromium, Firefox, and Webkit (Safari), while Puppeteer only supports Chromium based browsers. For ease of understanding, we've chosen to use Chromium in the Playwright examples to keep things working on the same plane. Let's start by using the `launch()` function in the **index.js** file we created in the intro to this course: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; await chromium.launch(); console.log('launched!'); ``` ``` import puppeteer from 'puppeteer'; await puppeteer.launch(); console.log('launched!'); ``` When we run this code with the command `node index.js`, a browser will open up; however, we won't actually see anything. This is because the default mode of a browser after `launch()`ing it is **headless**, meaning that it has no visible UI. > If you run this code right now, it will hang. Use **control^** + **C** to force quit the program. 
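If you'd rather not have to force-quit while experimenting, one simple option is to close the browser at the end of the script so that the Node.js process can exit on its own. This is just a small sketch (Playwright shown; with Puppeteer you would call `puppeteer.launch()` instead), and the `close()` method itself is covered properly in the **Browser methods** section below.

```
import { chromium } from 'playwright';

// Launch the (headless) browser, log a message, and close the browser again
// so that the script can finish instead of hanging
const browser = await chromium.launch();
console.log('launched!');
await browser.close();
```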
## Launch options[​](#launch-options "Direct link to Launch options") In order to see what's actually happening, we can pass an **options** object ([Puppeteer](https://pptr.dev/#?product=Puppeteer\&version=v13.7.0\&show=api-puppeteerlaunchoptions), [Playwright](https://playwright.dev/docs/api/class-browsertype#browser-type-launch)) with **headless** set to **false**. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); await browser.newPage(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); await browser.newPage(); ``` Now we'll actually see a browser open up. ![Chromium browser opened by Puppeteer/Playwright](/assets/images/chromium-844298b27f771e8c1bb0441bf5572180.jpg) You can pass a whole lot more options to the `launch()` function. We'll be getting into those a little bit later on. ## Browser methods[​](#browser-methods "Direct link to Browser methods") The `launch()` function also returns a **Browser** object ([Puppeteer](https://pptr.dev/#?product=Puppeteer\&version=v13.7.0\&show=api-class-browser), [Playwright](https://playwright.dev/docs/api/class-browser)), which is a representation of the browser. This object has many methods, which allow us to interact with the browser from our code. One of them is `close()`. Until now, we've been using **control^** + **C** to force quit the process, but with this function, we'll no longer have to do that. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); await browser.newPage(); // code will be here in the future await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); await browser.newPage(); // code will be here in the future await browser.close(); ``` ## Next up[​](#next "Direct link to Next up") Now that we can open a browser, let's move onto the [next lesson](https://docs.apify.com/academy/puppeteer-playwright/page.md) where we will learn how to create pages and visit websites programmatically. --- # Creating multiple browser contexts **Learn what a browser context is, how to create one, how to emulate devices, and how to use browser contexts to automate multiple sessions at one time.** *** A [**BrowserContext**](https://playwright.dev/docs/api/class-browsercontext) is an isolated incognito session within a **Browser** instance. This means that contexts can have different device/screen size configurations, different language and color scheme settings, etc. It is useful to use multiple browser instances when dealing with automating logging into multiple accounts simultaneously (therefore requiring multiple sessions), or in any cases where multiple sessions are required. When we create a **Browser** object by using the `launch()` function, a single [browser context](https://playwright.dev/docs/browser-contexts) is automatically created. In order to create more, we use the [`browser.newContext()`](https://playwright.dev/docs/api/class-browser#browser-new-context) function in Playwright, and [`browser.createIncognitoBrowserContext`](https://pptr.dev/#?product=Puppeteer\&version=v14.1.0\&show=api-browsercreateincognitobrowsercontextoptions) in Puppeteer. 
* Playwright * Puppeteer ``` const myNewContext = await browser.newContext(); ``` ``` const myNewContext = await browser.createIncognitoBrowserContext(); ``` ## Persistent vs non-persistent browser contexts[​](#persistent-vs-non-persistent "Direct link to Persistent vs non-persistent browser contexts") In both examples above, we are creating a new **non-persistent** browser context, which means that once it closes, all of its cookies, cache, etc. will be lost. For some cases, that's okay, but in most situations, the performance hit from this is too large. This is why we have **persistent** browser contexts. Persistent browser contexts open up a bit slower and they store all their cache, cookies, session storage, and local storage in a file on disk. In Puppeteer, the **default** browser context is the persistent one, while in Playwright we have to use [`BrowserType.launchPersistentContext()`](https://playwright.dev/docs/api/class-browsertype#browser-type-launch-persistent-context) instead of `BrowserType.launch()` in order for the default context to be persistent. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; // Here, we launch a persistent browser context. The first // argument is the location to store the data. const browser = await chromium.launchPersistentContext('./persistent-context', { headless: false }); const page = await browser.newPage(); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); // This page will be under the default context, which is persistent. // Cache, cookies, etc. will be stored on disk and persisted const page = await browser.newPage(); await browser.close(); ``` ## Using browser contexts[​](#using-browser-contexts "Direct link to Using browser contexts") In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer\&version=v14.1.0\&show=api-puppeteerdevices). 
We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android device: * Playwright * Puppeteer ``` import { chromium, devices } from 'playwright'; // Launch the browser const browser = await chromium.launch({ headless: false }); const iPhone = devices['iPhone 11 Pro']; // Create a new context for our iPhone emulation const iPhoneContext = await browser.newContext({ ...iPhone }); // Open a page on the newly created iPhone context const iPhonePage = await iPhoneContext.newPage(); const android = devices['Galaxy Note 3']; // Create a new context for our Android emulation const androidContext = await browser.newContext({ ...android }); // Open a page on the newly created Android context const androidPage = await androidContext.newPage(); // The code in the next step will go here await browser.close(); ``` ``` import puppeteer from 'puppeteer'; // Launch the browser const browser = await puppeteer.launch({ headless: false }); const iPhone = puppeteer.devices['iPhone 11 Pro']; // Create a new context for our iPhone emulation const iPhoneContext = await browser.createIncognitoBrowserContext(); // Open a page on the newly created iPhone context const iPhonePage = await iPhoneContext.newPage(); // Emulate the device await iPhonePage.emulate(iPhone); const android = puppeteer.devices['Galaxy Note 3']; // Create a new context for our Android emulation const androidContext = await browser.createIncognitoBrowserContext(); // Open a page on the newly created Android context const androidPage = await androidContext.newPage(); // Emulate the device await androidPage.emulate(android); // The code in the next step will go here await browser.close(); ``` Then, we'll make both `iPhonePage` and `androidPage` visit [deviceinfo.me](https://www.deviceinfo.me/), which is a website that displays the type of device you have, the operating system you're using, and more device and location-specific information. ``` // Go to deviceinfo.me on both at the same time await Promise.all([iPhonePage.goto('https://www.deviceinfo.me/'), androidPage.goto('https://www.deviceinfo.me/')]); // Wait for 10 seconds on both before shutting down await Promise.all([iPhonePage.waitForTimeout(10000), androidPage.waitForTimeout(10000)]); ``` Let's go ahead and run our code and analyze the data on each **deviceinfo.me** page. Here's what we see: ![deviceinfo.me results for both browser contexts](/assets/images/dual-contexts-1cf77aac6062264d0ba205af600f5c5a.jpg) We see that **deviceinfo.me** detects both contexts as using different devices, despite the fact they're visiting the same page at the same time. This shows firsthand that different browser contexts can have totally different configurations, as they all have separate sessions. ## Accessing browser contexts[​](#accessing-browser-contexts "Direct link to Accessing browser contexts") When working with multiple browser contexts, it can be difficult to keep track of all of them and making changes becomes a repetitive job. This is why the **Browser** instance returned from the `launch()` function also has a `contexts()` function (`browserContexts()` in Puppeteer). This function returns an array of all the contexts that are currently attached to the browser. 
Let's go ahead and use this function to loop through all of our browser contexts and make them log **Site visited** to the console whenever the website is visited: * Playwright * Puppeteer ``` for (const context of browser.contexts()) { // In Playwright, lots of events are supported in the "on" function of // a BrowserContext instance context.on('request', (req) => req.url() === 'https://www.deviceinfo.me/' && console.log('Site visited')); } ``` ``` for (const context of browser.browserContexts()) { // In Puppeteer, only three events are supported in the "on" function // of a BrowserContext instance context.on('targetchanged', () => console.log('Site visited')); } ``` After adding this above our `page.goto`s and running the code once again, we see this logged to the console: ``` Site visited Site visited ``` Cool! We've modified both our `iPhoneContext` and `androidContext`, as well as our default context, to log the message. > Note that the Puppeteer code and Playwright code are slightly different in the examples above. The Playwright code will log **Site visited** any time the specific URL is visited, while the Puppeteer code will log any time the target URL is changed to anything. Finally, in Puppeteer, you can use the `browser.defaultBrowserContext()` function to grab hold of the default context at any point. ## Wrap up[​](#wrap-up "Direct link to Wrap up") Thus far in this course, you've learned how to launch a browser, open a page, run scripts on a page, extract data from a page, intercept requests made on the page, use proxies, and use multiple browser contexts. Stay tuned for new lessons! --- # Common use cases **Learn about some of the most common use cases of Playwright and Puppeteer, and how to handle these use cases when you run into them.** *** You can do about anything with a headless browser, but, there are some extremely common use cases that are important to understand and be prepared for when you might run into them. This short section will be all about solving these common situations. Here's what we'll be covering: 1. Login flow (logging into an account) 2. Paginating through results on a website 3. Solving browser challenges (ex. captchas) 4. More! # Next up The [first lesson](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/logging-into-a-website.md) of this section is all about logging into a website and running multiple concurrent operations within a user's account. --- # Downloading files **Learn how to automatically download and save files to the disk using two of the most popular web automation libraries, Puppeteer and Playwright.** *** Downloading a file using Puppeteer can be tricky. On some systems, there can be issues with the usual file saving process that prevent you from doing it in a straightforward way. However, there are different techniques that work (most of the time). These techniques are only necessary when we don't have a direct file link, which is usually the case when the file being downloaded is based on more complicated data export. ## Setting up a download path[​](#setting-up-a-download-path "Direct link to Setting up a download path") Let's start with the easiest technique. This method tells the browser in what folder we want to download a file from Puppeteer after clicking on it. 
``` const client = await page.target().createCDPSession(); await client.send('Page.setDownloadBehavior', { behavior: 'allow', downloadPath: './my-downloads' }); ``` We use the mysterious `client` API, which gives us access to all the functions of the underlying [Chrome DevTools Protocol](https://pptr.dev/api/puppeteer.cdpsession) (Puppeteer & Playwright are built on top of it). Basically, it extends Puppeteer's functionality. Then we can download the file by clicking on the button. ``` await page.click('.export-button'); ``` Let's wait for one minute. In a real use case, you want to check the state of the file in the file system. ``` await page.waitFor(60000); ``` To extract the file from the file system into memory, we have to first find its name, and then we can read it. ``` import fs from 'fs'; const fileNames = fs.readdirSync('./my-downloads'); // Let's pick the first one const fileData = fs.readFileSync(`./my-downloads/${fileNames[0]}`); // ...Now we can do whatever we want with the data ``` ## Intercepting and replicating a file download request[​](#intercepting-a-file-download-request "Direct link to Intercepting and replicating a file download request") For this second option, we can trigger the file download, intercept the request going out, and then replicate it to get the actual data. First, we need to enable request interception. This is done using the following line of code: ``` await page.setRequestInterception(true); ``` Next, we need to trigger the actual file export. We might need to fill in some form, select an exported file type, etc. In the end, it will look something like this: ``` page.click('.export-button'); ``` We don't need to await this promise since we'll be waiting for the result of this action anyway (the triggered request). The crucial part is intercepting the request that would result in downloading the file. Since the interception is already enabled, we just need to wait for the request to be sent. ``` const xRequest = await new Promise((resolve) => { page.on('request', (interceptedRequest) => { interceptedRequest.abort(); // stop intercepting requests resolve(interceptedRequest); }); }); ``` The last thing is to convert the intercepted Puppeteer request into a request-promise options object. We need to have the `request-promise` package installed. ``` import request from 'request-promise'; ``` Since the request interception does not include cookies, we need to add them subsequently. ``` const options = { encoding: null, method: xRequest._method, uri: xRequest._url, body: xRequest._postData, headers: xRequest._headers, }; // Add the cookies const cookies = await page.cookies(); options.headers.Cookie = cookies.map((ck) => `${ck.name}=${ck.value}`).join(';'); // Resend the request const response = await request(options); ``` Now, the response contains the binary data of the downloaded file. It can be saved to the disk, uploaded somewhere, or [submitted with another form](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/submitting-a-form-with-a-file-attachment.md). --- # Logging into a website **Understand the "login flow" - logging into a website, then maintaining a logged in status within different browser contexts for an efficient automation process.** *** Whether it's auto-renewing a service, automatically sending a message on an interval, or automatically cancelling a Netflix subscription, one of the most popular things headless browsers are used for is automating things within a user's account on a certain website.
Of course, automating anything on a user's account requires the automation of the login process as well. In this lesson, we'll be covering how to build a login flow from start to finish with Playwright or Puppeteer. > In this lesson, we'll be using [yahoo.com](https://www.yahoo.com/) as an example. Feel free to follow along using the academy Yahoo account credentials, or even deviate from the lesson a bit and try building a login flow for a different website of your choosing! ## Inputting credentials[​](#inputting-credentials "Direct link to Inputting credentials") The full logging in process on Yahoo goes like this: 1. Accept their cookies policy, then load the main page. 2. Click on the **Sign in** button and load the sign-in page. 3. Enter the username and click the button. 4. Enter the password and click the button, then load the main page again (but now logged in). When we lay out the steps like this in [pseudocode](https://en.wikipedia.org/wiki/Pseudocode), it makes it significantly easier to translate over into code. Here's the four steps above loop in JavaScript: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; // Launch a browser and open a page const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.yahoo.com/'); // Agree to the cookies terms, then click on the "Sign in" button await page.click('button[name="agree"]'); await page.waitForSelector('a:has-text("Sign in")'); await page.click('a:has-text("Sign in")'); await page.waitForLoadState('load'); // Type in the username and continue forward await page.type('input[name="username"]', 'YOUR-LOGIN-HERE'); await page.click('input[name="signin"]'); // Type in the password and continue forward await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE'); await page.click('button[name="verifyPassword"]'); await page.waitForLoadState('load'); // Wait for 10 seconds so we can see that we have in fact // successfully logged in await page.waitForTimeout(10000); ``` ``` import puppeteer from 'puppeteer'; // Launch a browser and open a page const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.yahoo.com/'); // Agree to the cookies terms, then click on the "Sign in" button await Promise.all([page.waitForSelector('a[data-ylk*="sign-in"]'), page.click('button[name="agree"]')]); await Promise.all([page.waitForNavigation(), page.click('a[data-ylk*="sign-in"]')]); // Type in the username and continue forward await page.type('input[name="username"]', 'YOUR-LOGIN-HERE'); await Promise.all([page.waitForNavigation(), page.click('input[name="signin"]')]); // Type in the password and continue forward await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE'); await Promise.all([page.waitForNavigation(), page.click('button[name="verifyPassword"]')]); // Wait for 10 seconds so we can see that we have in fact // successfully logged in await page.waitForTimeout(10000); ``` Great! If you're following along and you've replaced the placeholder credentials with your own, you should see that on the final navigated page, you're logged into your Yahoo account. ![Successfully logged into Yahoo](/assets/images/logged-in-f2dbd0d55a1fb21609322a97684c600b.jpg) ## Passing around cookies[​](#passing-around-cookies "Direct link to Passing around cookies") Now that we all know how to log into a website let's try and solve a more complex problem. 
Let's say that we want to send 3 different emails at the same exact time, all from the **Academy** Yahoo account. Here is an object we'll create which represents the three different emails we want to send: ``` const emailsToSend = [ { to: 'alice@example.com', subject: 'Hello', body: 'This is a message.', }, { to: 'bob@example.com', subject: 'Testing', body: 'I love the academy!', }, { to: 'carol@example.com', subject: 'Apify is awesome!', body: 'Some content.', }, ]; ``` What we could do is log in 3 different times, then automate the sending of each email; however, this is extremely inefficient. When you log into a website, one of the main things that allows you to stay logged in and perform actions on your account is the [cookies](https://docs.apify.com/academy/concepts/http-cookies.md) stored in your browser. These cookies tell the website that you have been authenticated, and that you have the permissions required to modify your account. With this knowledge of cookies, it can be concluded that we can pass the cookies generated by the code above right into each new browser context that we use to send each email. That way, we won't have to run the login flow each time. ### Retrieving cookies[​](#retrieving-cookies "Direct link to Retrieving cookies") First, we'll grab the cookies we generated: * Playwright * Puppeteer ``` // Grab the cookies from the default browser context, // which was used to log in const cookies = await browser.contexts()[0].cookies(); ``` ``` // Grab the cookies from the page used to log in const cookies = await page.cookies(); ``` Notice that in Playwright, cookies are tied to a **BrowserContext**, while in Puppeteer they are tied to a **Page**. ### Passing cookies to a new browser context[​](#passing-cookies-to-new-contexts "Direct link to Passing cookies to a new browser context") Remembering from the section above, we stored our cookies in a variable named **cookies**. These can now be directly passed into a new browser context like so: * Playwright * Puppeteer ``` // Create a fresh non-persistent browser context const sendEmailContext = await browser.newContext(); // Add the cookies from the previous one to this one so that // we'll be logged into Yahoo without having to re-do the // logging in automation await sendEmailContext.addCookies(cookies); const page2 = await sendEmailContext.newPage(); // Notice that we are logged in, even though we didn't // go through the logging in process again! await page2.goto('https://mail.yahoo.com/'); await page2.waitForTimeout(10000); ``` ``` // Create a fresh non-persistent browser context const sendEmailContext = await browser.createIncognitoBrowserContext(); // Create a new page on the new browser context and set its cookies // to be the same ones from the page we used to log into the website. const page2 = await sendEmailContext.newPage(); await page2.setCookie(...cookies); // Notice that we are logged in, even though we didn't // go through the logging in process again! await page2.goto('https://mail.yahoo.com/'); await page2.waitForTimeout(10000); ``` ### Completing the flow[​](#completing-the-flow "Direct link to Completing the flow") Now that passing cookies around is out of the way, we can finally complete the goal at hand and send all three of these emails at once. This can be done by mapping through **emailsToSend**, creating an array of promises where each function creates a new browser context, adds the initial cookies, and sends the email. 
* Playwright * Puppeteer ``` // Grab the cookies from the default browser context, // which was used to log in const cookies = await browser.contexts()[0].cookies(); await page.close(); // Create an array of promises, running the cookie passing // and email sending logic each time const promises = emailsToSend.map(({ to, subject, body }) => (async () => { // Create a fresh non-persistent browser context const sendEmailContext = await browser.newContext(); // Add the cookies from the previous one to this one so that // we'll be logged into Yahoo without having to re-do the // logging in automation await sendEmailContext.addCookies(cookies); const page2 = await sendEmailContext.newPage(); await page2.goto('https://mail.yahoo.com/'); // Compose an email await page2.click('a[aria-label="Compose"]'); // Populate the fields with the details from the object await page2.type('input#message-to-field', to); await page2.type('input[data-test-id="compose-subject"]', subject); await page2.type('div[data-test-id="compose-editor-container"] div[contenteditable="true"]', body); // Send the email await page2.click('button[title="Send this email"]'); await sendEmailContext.close(); })(), ); // Wait for all emails to be sent await Promise.all(promises); ``` ``` // Create an array of promises, running the cookie passing // and email sending logic each time const promises = emailsToSend.map(({ to, subject, body }) => (async () => { // Create a fresh non-persistent browser context const sendEmailContext = await browser.createIncognitoBrowserContext(); // Create a new page on the new browser context and set its cookies // to be the same ones from the page we used to log into the website. const page2 = await sendEmailContext.newPage(); await page2.setCookie(...cookies); await page2.goto('https://mail.yahoo.com/'); // Compose an email await page2.click('a[aria-label="Compose"]'); // Populate the fields with the details from the object await page2.type('input#message-to-field', to); await page2.type('input[data-test-id="compose-subject"]', subject); await page2.type('div[data-test-id="compose-editor-container"] div[contenteditable="true"]', body); // Send the email await page2.click('button[title="Send this email"]'); await sendEmailContext.close(); })(), ); // Wait for all emails to be sent await Promise.all(promises); ``` ## Final code overview[​](#final-code "Direct link to Final code overview") To sum up what we've built during this lesson: 1. Log into Yahoo. 2. Store the login cookies in a variable. 3. Concurrently create 3 new browser contexts and inject the cookies into each one. 4. Concurrently send 3 emails from the same account logged into in the first step. 
Here's what the final code looks like: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const emailsToSend = [ { to: 'alice@example.com', subject: 'Hello', body: 'This is a message.', }, { to: 'bob@example.com', subject: 'Testing', body: 'I love the academy!', }, { to: 'carol@example.com', subject: 'Apify is awesome!', body: 'Some content.', }, ]; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); // Login logic await page.goto('https://www.yahoo.com/'); await page.click('button[name="agree"]'); await page.waitForSelector('a:has-text("Sign in")'); await page.click('a:has-text("Sign in")'); await page.waitForLoadState('load'); await page.type('input[name="username"]', 'YOUR-LOGIN-HERE'); await page.click('input[name="signin"]'); await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE'); await page.click('button[name="verifyPassword"]'); await page.waitForLoadState('load'); const cookies = await browser.contexts()[0].cookies(); await page.close(); // Email sending logic const promises = emailsToSend.map(({ to, subject, body }) => (async () => { const sendEmailContext = await browser.newContext(); await sendEmailContext.addCookies(cookies); const page2 = await sendEmailContext.newPage(); await page2.goto('https://mail.yahoo.com/'); await page2.click('a[aria-label="Compose"]'); await page2.type('input#message-to-field', to); await page2.type('input[data-test-id="compose-subject"]', subject); await page2.type('div[data-test-id="compose-editor-container"] div[contenteditable="true"]', body); await page2.click('button[title="Send this email"]'); await sendEmailContext.close(); })(), ); await Promise.all(promises); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const emailsToSend = [ { to: 'alice@example.com', subject: 'Hello', body: 'This is a message.', }, { to: 'bob@example.com', subject: 'Testing', body: 'I love the academy!', }, { to: 'carol@example.com', subject: 'Apify is awesome!', body: 'Some content.', }, ]; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); // Login logic await page.goto('https://www.yahoo.com/'); await Promise.all([page.waitForSelector('a[data-ylk*="sign-in"]'), page.click('button[name="agree"]')]); await Promise.all([page.waitForNavigation(), page.click('a[data-ylk*="sign-in"]')]); await page.type('input[name="username"]', 'YOUR-LOGIN-HERE'); await Promise.all([page.waitForNavigation(), page.click('input[name="signin"]')]); await page.type('input[name="password"]', 'YOUR-PASSWORD-HERE'); await Promise.all([page.waitForNavigation(), page.click('button[name="verifyPassword"]')]); const cookies = await page.cookies(); await page.close(); // Email sending logic const promises = emailsToSend.map(({ to, subject, body }) => (async () => { const sendEmailContext = await browser.createIncognitoBrowserContext(); const page2 = await sendEmailContext.newPage(); await page2.setCookie(...cookies); await page2.goto('https://mail.yahoo.com/'); await page2.click('a[aria-label="Compose"]'); await page2.type('input#message-to-field', to); await page2.type('input[data-test-id="compose-subject"]', subject); await page2.type('div[data-test-id="compose-editor-container"] div[contenteditable="true"]', body); await page2.click('button[title="Send this email"]'); await sendEmailContext.close(); })(), ); await Promise.all(promises); await browser.close(); ``` ## Next up[​](#next "Direct link to Next up") In the [next 
lesson](https://docs.apify.com/academy/puppeteer-playwright/common-use-cases/paginating-through-results.md), you'll learn how to paginate through results on a website. --- # Paginating through results **Learn how to paginate through results on websites that use either pagination based on page numbers or dynamic lazy loading.** *** If you're trying to [collect data](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/collecting-data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content. ![Amazon pagination](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAhQAAAC7CAMAAAAKcffFAAABDlBMVEX////d3d3q7O7s7vDo6u319vjz9Pf29/n09fjy8/Xx8vXw8fTv8POjpqv8/Pzs7e/q7O+usbfn6eyNkJVVVVXt7/LbejXv8PLt7/ERERF1dnZcXF1nZ2iVlZjl5uhxcXJZWVng4OG+v8G0tbVkZGXS09SGhofKysu4ubqsra7Pz9DDxceoqKkyMzP4+Pny8vKUlJS5Wyn9+vXY2dnGyM2nqq6MjIxsbGzj4+SwsLI7PDy7vL+dnZ6ZmZmQkJF8fH3QkWT9/f3b3N2BgYFAQUG8Yy7Hx8eioqPprnRgYGDcfzjKzNCqqqt5eXkcHB20t7ulpaYmJicuLi/nqGxISUnoyrMcHBz58uzVnXPLhFPI+UfpAAAFzElEQVR42uzSQREAAAgDoNm/tH9tsIMMBAAAAAAAAAAAAAAAAAAAAAAAAAAAAKDUwCEFTwAAAAAAAGDZudPmpKEojONHFtFCQOtBepNAwtKwUyzKVhZZpKK0tnX//l/EmzrtmAS5MYkv0PPrTDPpPDM3A/8Xnc5QQgghhBBCCCGEEEIIcen07bvnO717ewpc8v3T3d4n3c5844cEiT8SsXj78VRQzce3wL182Uru0nr50u3MH+chYsJHIhbvTkHg9B1wr6Kx3aKv3M5844cEiT8SsXjuciLFRCS3M9+kWLAkIN6iiIpIbme+SdFgURQUBUURWBSPRCS3M7/4IcGiKCgKiiKwKB6KSG5nfvFDgkVRUBQURWBRPBaR3M784ocE6z+JItn/61GE/ERxtDyH0fLEvGwVu7tSFH+mqCjKepqDbQyce4zi6+s3L75ti+LgV4fGNVuODiwk56zQNE2cszy7AI3lzItTdKLj+goAKgrqY7ByHjJv6gl+UaYHNvn6gTv/UBQNnJx3dT0DW5S7MW9RfH7BvRZGscH2FbKCKIpONpsdsoZzBiczmD2+vTgVe6NqD1dQx2lxgiNRFMeMmXkO1QOb9uX/GEUeYIOVWKmyWOczU11fRAulGkCqtKmWYjAbK6jyW7kLMC+toKDpalEYxffPb7ZHEbcrs4nlXto6G7PzuGMmdoxjUBUAmGILLByHHLPhZT4eH6rxeA2vm9n45lMh3v/Und6wT59CcRf2MYrI2e+jaOMqiohqTsWugYMUGgBVPM7iCYxxbSD2QdHM17iYQfWqeyWKgnMbRZfVxVG0WcPTO5DFGuAUAM6xL4yicaPeRpG7bE4UNo80mymNdYwhk+WtUQzyex9FSPldFOOqgXo0inIGOjgCUGXQ9BloCphR4NqMoXsXxQM0i/AeRcLmmGHCQtoye8AYU/L2mdgce9AyA4cB1pxR2B5kMGCVxFBNZNkqcchGiQorsWwiIV8mtisOrY+0f1GsdKV878QSBVfqQBTbAEXUFgtF4dfaQxybUaRwAZBE7S4KWKN8HFwUh8vLjjiKVGXQYz3bTCy2xhzMcAEAG+wIo6gmsBnhUchM07RLOZFQ2TLsiKJfuNe8OdvvKLqoy/c6lig6rSTAzyiqqBmGMYIWdiuYM6Po4wQA9N59FNEu4rn3KDIWRyVWyVhJztmt5rVtJjSTsW4+uwYAWQyBheMQHkWmxoyhmukxmRtkMhdsGcpk5C+WXbl074YtMr/YvyigqGd+/zvFXRSru/d7oVyoYEaRxB5AH6egqD+jADhRde9RpC0WrJG2kZwzU7h5bZsJdbEInIa8eVWfgYXjkBqrptNTxtS0wcppU50pbJJOayy9Xf56Y7nfwyhg0BZHEdOVav6sAlBDrN5GARfYXvWwDhquKj0s5gedvqp4j+LoVyOGVc7yM8k5qxjFgsyytpnIBnsVDgp4UWtjGyych9QYf47+NVOPygyLtUY9MlyG1uz4aMKuKpGjLYyG9X4fo+CEUUBZR8QJACi8Bxjzb5EeIo4BCvxi8FdLQVTq3qMI/0pltyw/k5yzQZOP1EPbTOQCbwGMEHGaBAvnITU24N/PmRoOj/lxN1WD1cPzLxjuLNmXediFfygKm1kuDzapTgu41rz187bv58/c4pd22yz/PufjHUh2HoLN7meJ1CwV9MPh/zsKb4KPQjzzTQq7RVH83ShSIpLbmW9SKlgUhdconohIbmd+8UOCRVF4jSIkIrmd+SaFgkVRUBQUxT8aRSS4L4rCexQREcntzDcpEiyKwutnSXOR3XKv3M5844cEKUefJfX4qfP6h9zhLrkPdbczf5yHiAkfiXj7/xRn2We7ZM+Sbme+mYcEJ3tG/5+CEEIIIYQQQgghhBBCCCGEEELIj3bpmAYAAIQBGPg3zQ83yY5WQ3nTsEjBUQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANEGpIZeHRnBbfwAAAAASUVORK5CYII=) ## Page number-based pagination[​](#page-number-based-pagination "Direct link to Page number-based pagination") At the time of writing this lesson, Facebook has [over a hundred repositories on GitHub](https://github.com/orgs/facebook/repositories). 
By default, GitHub lists repositories in descending order based on when they were last updated (the most recently updated ones are at the top of the list). We want to scrape the titles, links, and descriptions of all of Facebook's repositories; however, GitHub only displays 30 repositories per page. This means we need to paginate through the results. Let's start by defining some variables: ``` // This is where we'll store scraped data const repositories = []; // This will come handy when resolving relative links const BASE_URL = 'https://github.com'; // We'll use this URL a couple of times within our code const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`; ``` ### Finding the last page[​](#finding-the-last-page "Direct link to Finding the last page") Going through each page is easier if we know in advance when to stop. The good news is that GitHub's pagination is upfront about the number of the last page, so the total number of pages is available to us: ![Last page number](/assets/images/github-last-page-008b2aa5e56b82a67ad682bc7398ff65.jpg) As Facebook adds repositories over time, the number you see in your browser might be different. Let's read the number now with the following code: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const repositories = []; const BASE_URL = 'https://github.com'; const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`; const browser = await chromium.launch({ headless: false }); const firstPage = await browser.newPage(); await firstPage.goto(REPOSITORIES_URL); const lastPageElement = firstPage.locator('a[aria-label*="Page "]:nth-last-child(2)'); const lastPageLabel = await lastPageElement.getAttribute('aria-label'); const lastPageNumber = Number(lastPageLabel.replace(/\D/g, '')); console.log(lastPageNumber); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const repositories = []; const BASE_URL = 'https://github.com'; const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`; const browser = await puppeteer.launch({ headless: false }); const firstPage = await browser.newPage(); await firstPage.goto(REPOSITORIES_URL); const lastPageLabel = await firstPage.$eval( 'a[aria-label*="Page "]:nth-last-child(2)', (element) => element.getAttribute('aria-label'), ); const lastPageNumber = Number(lastPageLabel.replace(/\D/g, '')); console.log(lastPageNumber); await browser.close(); ``` :nth-last-child [Learn more](https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-last-child) about the `:nth-last-child` pseudo-class. It works like `:nth-child`, but starts from the bottom of the parent element's children instead of from the top. When we run the code, it prints the total number of pages, which is `4` at the time of writing this lesson. Now let's scrape repositories from all the pages. First, we'll add a function that can handle the data extraction for a single page and return an array of results. 
Then, to start, we'll run this function just for the first page: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; import * as cheerio from 'cheerio'; const repositories = []; const BASE_URL = 'https://github.com'; const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`; // Scrapes all repositories from a single page const scrapeRepos = async (page) => { const $ = cheerio.load(await page.content()); return [...$('.list-view-item')].map((item) => { const repoElement = $(item); return { title: repoElement.find('h4').text().trim(), description: repoElement.find('.repos-list-description').text().trim(), link: new URL(repoElement.find('h4 a').attr('href'), BASE_URL).href, }; }); }; const browser = await chromium.launch({ headless: false }); const firstPage = await browser.newPage(); await firstPage.goto(REPOSITORIES_URL); const lastPageElement = firstPage.locator('a[aria-label*="Page "]:nth-last-child(2)'); const lastPageLabel = await lastPageElement.getAttribute('aria-label'); const lastPageNumber = Number(lastPageLabel.replace(/\D/g, '')); // Push all results from the first page to the repositories array repositories.push(...(await scrapeRepos(firstPage))); // Log the 30 repositories scraped from the first page console.log(repositories); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; import * as cheerio from 'cheerio'; const repositories = []; const BASE_URL = 'https://github.com'; const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`; // Scrapes all repositories from a single page const scrapeRepos = async (page) => { const $ = cheerio.load(await page.content()); return [...$('.list-view-item')].map((item) => { const repoElement = $(item); return { title: repoElement.find('h4').text().trim(), description: repoElement.find('.repos-list-description').text().trim(), link: new URL(repoElement.find('h4 a').attr('href'), BASE_URL).href, }; }); }; const browser = await puppeteer.launch({ headless: false }); const firstPage = await browser.newPage(); await firstPage.goto(REPOSITORIES_URL); const lastPageLabel = await firstPage.$eval( 'a[aria-label*="Page "]:nth-last-child(2)', (element) => element.getAttribute('aria-label'), ); const lastPageNumber = Number(lastPageLabel.replace(/\D/g, '')); // Push all results from the first page to the repositories array repositories.push(...(await scrapeRepos(firstPage))); // Log the 30 repositories scraped from the first page console.log(repositories); await browser.close(); ``` If we run the code above, it outputs data about the first 30 repositories listed: ``` $ node index.js [ { title: 'react-native', description: 'A framework for building native applications using React', link: 'https://github.com/facebook/react-native' }, { title: 'fboss', description: 'Facebook Open Switching System Software for controlling network switches.', link: 'https://github.com/facebook/fboss' }, ... ] ``` ### Making a request for each results page[​](#making-a-request-for-each-results-page "Direct link to Making a request for each results page") If we click around the pagination links, we can observe that all the URLs follow certain format. For example, we can find page number 2 at `https://github.com/orgs/facebook/repositories?page=2`. That means we could construct URL for each page if we had an array of numbers with the same range as the pages. 
If `lastPageNumber` is `4`, the following code creates `[0, 1, 2, 3, 4]`: ``` const array = Array(lastPageNumber + 1); // getting an array of a certain size const numbers = [...array.keys()]; // getting the keys (the actual numbers) as another array ``` Page `0` doesn't exist though, and we've already scraped page `1`, so we need one more step to remove those: ``` const pageNumbers = numbers.slice(2); // removes the first two numbers ``` To keep our code examples shorter, we'll squash the above into a single line of code: ``` const pageNumbers = [...Array(lastPageNumber + 1).keys()].slice(2); ``` Now let's scrape repositories for each of these numbers. We'll create promises for each request and collect the results into a single `repositories` array: ``` const pageNumbers = [...Array(lastPageNumber + 1).keys()].slice(2); const promises = pageNumbers.map((pageNumber) => (async () => { const paginatedPage = await browser.newPage(); // Construct the URL by setting the ?page=... parameter to the value of pageNumber const url = new URL(REPOSITORIES_URL); url.searchParams.set('page', pageNumber); // Scrape the page await paginatedPage.goto(url.href); const results = await scrapeRepos(paginatedPage); // Push results to the repositories array repositories.push(...results); await paginatedPage.close(); })(), ); await Promise.all(promises); // For brevity, logging just the count of repositories scraped console.log(repositories.length); ``` Scaling to hundreds of requests: Using `Promise.all()` is okay for up to ten or maybe tens of requests, but won't work well for large numbers. When scraping hundreds or even thousands of pages, it's necessary to have more robust infrastructure in place, such as a request queue. ### Final code[​](#final-pagination-code "Direct link to Final code") The code below puts all the bits together: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; import * as cheerio from 'cheerio'; const repositories = []; const BASE_URL = 'https://github.com'; const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`; // Scrapes all repositories from a single page const scrapeRepos = async (page) => { const $ = cheerio.load(await page.content()); return [...$('.list-view-item')].map((item) => { const repoElement = $(item); return { title: repoElement.find('h4').text().trim(), description: repoElement.find('.repos-list-description').text().trim(), link: new URL(repoElement.find('h4 a').attr('href'), BASE_URL).href, }; }); }; const browser = await chromium.launch({ headless: false }); const firstPage = await browser.newPage(); await firstPage.goto(REPOSITORIES_URL); const lastPageElement = firstPage.locator('a[aria-label*="Page "]:nth-last-child(2)'); const lastPageLabel = await lastPageElement.getAttribute('aria-label'); const lastPageNumber = Number(lastPageLabel.replace(/\D/g, '')); // Push all results from the first page to the repositories array repositories.push(...(await scrapeRepos(firstPage))); await firstPage.close(); const pageNumbers = [...Array(lastPageNumber + 1).keys()].slice(2); const promises = pageNumbers.map((pageNumber) => (async () => { const paginatedPage = await browser.newPage(); // Construct the URL by setting the ?page=...
parameter to the value of pageNumber const url = new URL(REPOSITORIES_URL); url.searchParams.set('page', pageNumber); // Scrape the page await paginatedPage.goto(url.href); const results = await scrapeRepos(paginatedPage); // Push results to the repositories array repositories.push(...results); await paginatedPage.close(); })(), ); await Promise.all(promises); // For brevity, logging just the count of repositories scraped console.log(repositories.length); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; import * as cheerio from 'cheerio'; const repositories = []; const BASE_URL = 'https://github.com'; const REPOSITORIES_URL = `${BASE_URL}/orgs/facebook/repositories`; // Scrapes all repositories from a single page const scrapeRepos = async (page) => { const $ = cheerio.load(await page.content()); return [...$('.list-view-item')].map((item) => { const repoElement = $(item); return { title: repoElement.find('h4').text().trim(), description: repoElement.find('.repos-list-description').text().trim(), link: new URL(repoElement.find('h4 a').attr('href'), BASE_URL).href, }; }); }; const browser = await puppeteer.launch({ headless: false }); const firstPage = await browser.newPage(); await firstPage.goto(REPOSITORIES_URL); const lastPageLabel = await firstPage.$eval( 'a[aria-label*="Page "]:nth-last-child(2)', (element) => element.getAttribute('aria-label'), ); const lastPageNumber = Number(lastPageLabel.replace(/\D/g, '')); // Push all results from the first page to the repositories array repositories.push(...(await scrapeRepos(firstPage))); await firstPage.close(); const pageNumbers = [...Array(lastPageNumber + 1).keys()].slice(2); const promises = pageNumbers.map((pageNumber) => (async () => { const paginatedPage = await browser.newPage(); // Construct the URL by setting the ?page=... parameter to the value of pageNumber const url = new URL(REPOSITORIES_URL); url.searchParams.set('page', pageNumber); // Scrape the page await paginatedPage.goto(url.href); const results = await scrapeRepos(paginatedPage); // Push results to the repositories array repositories.push(...results); await paginatedPage.close(); })(), ); await Promise.all(promises); // For brevity, logging just the count of repositories scraped console.log(repositories.length); await browser.close(); ``` At the time of writing this lesson, a summary at the top of the [listing page](https://github.com/orgs/facebook/repositories) claims that Facebook has 115 repositories. Whatever number you see, it should be equal to the number you get when you run the program: ``` $ node index.js 115 ``` ## Lazy-loading pagination[​](#lazy-loading-pagination "Direct link to Lazy-loading pagination") Pagination based on page numbers is straightforward to automate, but many websites use [lazy-loading](https://en.wikipedia.org/wiki/Lazy_loading) instead. > On websites with lazy-loading pagination, if [API scraping](https://docs.apify.com/academy/api-scraping.md) is a viable option, it is a much better approach due to reliability and performance. Take a moment to look at and scroll through the women's clothing section [on About You's website](https://www.aboutyou.com/c/women/clothing-20204). Notice that the items are loaded as you scroll, and that there are no page numbers. Because of how drastically different this pagination implementation is from the previous one, it also requires a different workflow to scrape. We're going to scrape the brand and price from the first 75 results on the **About You** page linked above.
Here's our basic setup: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; // Create an array where all scraped products will // be pushed to const products = []; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.aboutyou.com/c/women/clothing-20204'); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; // Create an array where all scraped products will // be pushed to const products = []; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.aboutyou.com/c/women/clothing-20204'); await browser.close(); ``` ### Auto scrolling[​](#auto-scrolling "Direct link to Auto scrolling") Now we'll grab the height (in pixels) of a result item as a rough reference for how much to scroll each time, and create a variable to keep track of how many pixels have been scrolled. ``` // Grab the height of result item in pixels, which will be used to scroll down const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight); // Keep track of how many pixels have been scrolled down let totalScrolled = 0; ``` Then, within a `while` loop that ends once the length of the **products** array has reached 75, we'll run some logic that scrolls down the page and waits 1 second before running again. * Playwright * Puppeteer ``` while (products.length < 75) { const scrollHeight = await page.evaluate(() => document.body.scrollHeight); await page.mouse.wheel(0, itemHeight * 3); totalScrolled += itemHeight * 3; // Allow the products 1 second to load await page.waitForTimeout(1000); // Data extraction logic will go here const innerHeight = await page.evaluate(() => window.innerHeight); // if the total pixels scrolled has reached the true available scroll // height of the page, we've reached the end and should stop scraping, // even if we haven't reached our goal of 75 products. if (totalScrolled >= scrollHeight - innerHeight) { break; } } ``` ``` while (products.length < 75) { const scrollHeight = await page.evaluate(() => document.body.scrollHeight); await page.mouse.wheel({ deltaY: itemHeight * 3 }); totalScrolled += itemHeight * 3; // Allow the products 1 second to load await page.waitForTimeout(1000); // Data extraction logic will go here const innerHeight = await page.evaluate(() => window.innerHeight); // if the total pixels scrolled has reached the true available scroll // height of the page, we've reached the end and should stop scraping, // even if we haven't reached our goal of 75 products. if (totalScrolled >= scrollHeight - innerHeight) { break; } } ``` Now the `while` loop will exit if we've reached the bottom of the page. > Generally, you'd want to create a utility function that handles this scrolling logic instead of putting all of the code directly into the while loop. ### Extracting data[​](#extracting-data "Direct link to Extracting data") Within the loop, we can grab hold of the total number of items on the page. To avoid extracting and pushing duplicate items to the **products** array, we can use the `.slice()` method to cut out the items we've already scraped.
``` import * as cheerio from 'cheerio'; const $ = cheerio.load(await page.content()); // Grab the newly loaded items const items = [...$('a[data-testid*="productTile"]')].slice(products.length); const newItems = items.map((item) => { const elem = $(item); return { brand: elem.find('p[data-testid="brandName"]').text().trim(), price: elem.find('span[data-testid="finalPrice"]').text().trim(), }; }); products.push(...newItems); ``` ### Final code[​](#final-lazy-loading-code "Direct link to Final code") With everything completed, this is what we're left with: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; import * as cheerio from 'cheerio'; const products = []; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.aboutyou.com/c/women/clothing-20204'); // Grab the height of result item in pixels, which will be used to scroll down const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight); // Keep track of how many pixels have been scrolled down let totalScrolled = 0; while (products.length < 75) { const scrollHeight = await page.evaluate(() => document.body.scrollHeight); await page.mouse.wheel(0, itemHeight * 3); totalScrolled += itemHeight * 3; // Allow the products 1 second to load await page.waitForTimeout(1000); const $ = cheerio.load(await page.content()); // Grab the newly loaded items const items = [...$('a[data-testid*="productTile"]')].slice(products.length); const newItems = items.map((item) => { const elem = $(item); return { brand: elem.find('p[data-testid="brandName"]').text().trim(), price: elem.find('span[data-testid="finalPrice"]').text().trim(), }; }); products.push(...newItems); const innerHeight = await page.evaluate(() => window.innerHeight); // if the total pixels scrolled has reached the true available scroll // height of the page, we've reached the end and should stop scraping, // even if we haven't reached our goal of 75 products. if (totalScrolled >= scrollHeight - innerHeight) { break; } } console.log(products.slice(0, 75)); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; import * as cheerio from 'cheerio'; const products = []; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.aboutyou.com/c/women/clothing-20204'); // Grab the height of result item in pixels, which will be used to scroll down const itemHeight = await page.$eval('a[data-testid*="productTile"]', (elem) => elem.clientHeight); // Keep track of how many pixels have been scrolled down let totalScrolled = 0; while (products.length < 75) { const scrollHeight = await page.evaluate(() => document.body.scrollHeight); await page.mouse.wheel({ deltaY: itemHeight * 3 }); totalScrolled += itemHeight * 3; // Allow the products 1 second to load await page.waitForTimeout(1000); const $ = cheerio.load(await page.content()); // Grab the newly loaded items const items = [...$('a[data-testid*="productTile"]')].slice(products.length); const newItems = items.map((item) => { const elem = $(item); return { brand: elem.find('p[data-testid="brandName"]').text().trim(), price: elem.find('span[data-testid="finalPrice"]').text().trim(), }; }); products.push(...newItems); const innerHeight = await page.evaluate(() => window.innerHeight); // if the total pixels scrolled has reached the true available scroll // height of the page, we've reached the end and should stop scraping, // even if we haven't reached our goal of 75 products.
if (totalScrolled >= scrollHeight - innerHeight) { break; } } console.log(products.slice(0, 75)); await browser.close(); ``` ## Quick note[​](#quick-note "Direct link to Quick note") The examples shown in this lesson are not the only ways to paginate through websites. They are here to serve as solid examples, but don't view them as the end-all be-all of scraping paginated websites. The methods you use and the algorithms you write might differ to various degrees based on what pages you're scraping and how your specific target website implemented pagination. ## Next up[​](#next "Direct link to Next up") We're actively working on expanding this section of the course, so stay tuned! --- # Scraping iFrames **Extracting data from iFrames can be frustrating. In this tutorial, we will learn how to scrape information from iFrames using Puppeteer or Playwright.** *** Getting information from inside iFrames is a known pain, especially for new developers. After spending some time on Stack Overflow, you usually find answers like jQuery's `contents()` method or the native `contentDocument` property, which can guide you to the insides of an iframe. But still, getting the right identifiers and holding that new context is a little annoying. Fortunately, you can make everything simpler and more straightforward by scraping iFrames with Puppeteer. ## Finding the right `<iframe>`[​](#finding-the-right-iframe "Direct link to finding-the-right-iframe") If you are using basic methods of page objects like `page.evaluate()`, you are actually already working with frames. Behind the scenes, Puppeteer will call `page.mainFrame().evaluate()`, so most of the methods you use on the page object can be used the same way on a frame object. To access frames, you need to loop over the main frame's child frames and identify the one you want to use. As a demonstration, we'll scrape the Twitter widget iFrame from [IMDB](https://www.imdb.com/). ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch(); const page = await browser.newPage(); await page.goto('https://www.imdb.com'); await page.waitForTimeout(5000); // we need to wait for Twitter widget to load let twitterFrame; // this will be populated later by our identified frame for (const frame of page.mainFrame().childFrames()) { // Here you can use a few identifying methods like url(), name(), title() if (frame.url().includes('twitter')) { console.log('we found the Twitter iframe'); twitterFrame = frame; // we assign this frame to twitterFrame to use it later } } await browser.close(); ``` If it is hard to identify the iframe you want to access, don't worry. You can use any Puppeteer method on the frame object to help you identify it, scrape it, or manipulate it. You can also go through any nested frames. ``` let twitterFrame; for (const frame of page.mainFrame().childFrames()) { if (frame.url().includes('twitter')) { for (const nestedFrame of frame.childFrames()) { const tweetList = await nestedFrame.$('.timeline-TweetList'); if (tweetList) { console.log('We found the frame with tweet list'); twitterFrame = nestedFrame; } } } } ``` Here we used some more advanced techniques to find a nested `<iframe>`. Now that we have it assigned to our `twitterFrame` object, the hard work is over and we can start working with it (almost) like with a regular page object.
``` const textFeed = await twitterFrame.$$eval('.timeline-Tweet-text', (pElements) => pElements.map((elem) => elem.textContent)); for (const text of textFeed) { console.log(text); console.log('**********'); } ``` With a little more effort, we could also follow different links from the feed or even play a video, but that is not within the scope of this article. For all references about page and frame objects (and Puppeteer generally), you should study [the documentation](https://pub.dev/documentation/puppeteer/latest/puppeteer/Frame-class.html). New versions are released quite often, so checking the docs regularly can help you to stay on top of web scraping and automation. --- # Submitting a form with a file attachment **Understand how to download a file, attach it to a form using a headless browser in Playwright or Puppeteer, then submit the form.** *** We can use Puppeteer or Playwright to simulate submitting the same way a human-operated browser would. ## Downloading the file[​](#downloading-the-file "Direct link to Downloading the file") The first thing necessary is to download the file, which can be done using the `request-promise` module. We will also be using the `fs/promises` module to save it to the disk, so make sure they are included. ``` import * as fs from 'fs/promises'; import request from 'request-promise'; ``` The actual downloading is slightly different for text and binary files. For a text file, it can be done like this: ``` const fileData = await request('https://some-site.com/file.txt'); ``` For a binary data file, we need to provide an additional parameter so as not to interpret it as text: ``` const fileData = await request({ uri: 'https://some-site.com/file.pdf', encoding: null, }); ``` In this case, `fileData` will be a `Buffer` instead of a string. To use the file in Puppeteer/Playwright, we need to save it to the disk. This can be done using the `fs/promises` module. ``` await fs.writeFile('./file.pdf', fileData); ``` ## Submitting the form[​](#submitting-the-form "Direct link to Submitting the form") The first step necessary is to open the form page in Puppeteer. This can be done as follows: ``` const browser = await puppeteer.launch(); const page = await browser.newPage(); await page.goto('https://some-site.com/file-upload.php'); ``` To fill in any necessary form inputs, we can use the `page.type()` function. This works even in cases when `elem.value = 'value'` is not usable. ``` await page.type('input[name=firstName]', 'John'); await page.type('input[name=surname]', 'Doe'); await page.type('input[name=email]', 'john.doe@example.com'); ``` To add the file to the appropriate input, we first need to find it and then use the [`uploadFile()`](https://pptr.dev/api/puppeteer.elementhandle.uploadfile) function. ``` const fileInput = await page.$('input[type=file]'); await fileInput.uploadFile('./file.pdf'); ``` Now we can finally submit the form. ``` await page.click('input[type=submit]'); ``` --- # Executing scripts **Understand the two different contexts which your code can be run in, and how to run custom scripts in the context of the browser.** *** An important concept to understand when dealing with headless browsers is the **context** in which your code is being run. For example, if you try to use the native `fs` Node.js module (used in the previous lesson) while running code in the context of the browser, errors will be thrown saying that it is undefined. 
Similarly, if you are trying to use `document.querySelector()` or other browser-specific functions in the server-side Node.js context, errors will also be thrown. ![Diagram explaining the two different contexts your code can be run in](/assets/images/context-diagram-f4475f84c3ebf68da73881f283fbc174.jpg) Here is an example of a common mistake made by beginners to Puppeteer/Playwright: ``` // This code is incorrect! import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); // visit google await page.goto('https://www.google.com/'); // change background to green document.body.style.background = 'green'; await page.waitForTimeout(10000); await browser.close(); ``` When we try and run this, we get this error: ``` ReferenceError: document is not defined ``` The reason this is happening is because we're trying to run browser-side code on the server-side where it is not supported. [`document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) is a property of the browser [**Window**](https://developer.mozilla.org/en-US/docs/Web/API/Window) instance that holds the rendered website; therefore, this API is not available in Node.js. How are we supposed to run code within the context of the browser? ## Running code in the context of the browser[​](#running-in-browser-context "Direct link to Running code in the context of the browser") We will use `page.evaluate()` to run our code in the browser. This method takes a callback as its first parameter, which will be executed within the browser. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); await page.evaluate(() => { document.body.style.background = 'green'; }); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); await page.evaluate(() => { document.body.style.background = 'green'; }); await page.waitForTimeout(10000); await browser.close(); ``` Here's what we see in the automated browser when we run this code: ![Google with the background color changed to green](/assets/images/green-google-c009bd62b8a1b2ec669f6e5ccef214fc.png) ## Using variables in `page.evaluate()`[​](#using-variables-in-page-evaluate "Direct link to using-variables-in-page-evaluate") Within our code, we generate a `randomString` in the Node.js context: ``` const randomString = Math.random().toString(36).slice(2); ``` Now, let's say we want to change the title of the document to be this random string. To have the random string available in the callback of our `page.evaluate()`, we'll pass it in a second parameter. It's best practice to have this second parameter as an object, because in real world situations you often need to pass more than one value. 
* Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); const params = { randomString: Math.random().toString(36).slice(2) }; await page.evaluate(({ randomString }) => { document.querySelector('title').textContent = randomString; }, params); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); const params = { randomString: Math.random().toString(36).slice(2) }; await page.evaluate(({ randomString }) => { document.querySelector('title').textContent = randomString; }, params); await page.waitForTimeout(10000); await browser.close(); ``` Now, when we run this code, we can see the title change on the page's tab. ## Next up[​](#next "Direct link to Next up") The [next lesson](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/injecting-code.md) will be a short one discussing two different ways of executing scripts on a page. --- # Extracting data **Learn how to extract data from a page with evaluate functions, then how to parse it by using a second library called Cheerio.** *** Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. Playwright & Puppeteer offer two main methods for data extraction: 1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`. 2. In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio). > **Crawlee and parsing with Cheerio:** If you are using Crawlee, we highly recommend the [parseWithCheerio](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#parseWithCheerio) function for unified data extraction syntax. This way, switching between browser and plain HTTP scraping is a breeze. ## Setup[​](#setup "Direct link to Setup") Here is the base setup for our code, which we'll be building on in this lesson: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://demo-webstore.apify.org/search/on-sale'); // code will go here await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://demo-webstore.apify.org/search/on-sale'); // code will go here await page.waitForTimeout(10000); await browser.close(); ``` ## Extracting from the browser context[​](#extracting-in-page-evaluate "Direct link to Extracting from the browser context") Whatever is returned from the callback function in `page.evaluate()` will be returned by the evaluate function, which means that we can set it to a variable like so: ``` const products = await page.evaluate(() => ({ foo: 'bar' })); console.log(products); // -> { foo: 'bar' } ``` We'll be returning a bunch of product objects from this function, which will be accessible back in our Node.js context after the promise has resolved.
Let's now go ahead and write some data extraction code to collect each product: ``` const products = await page.evaluate(() => { const productCards = Array.from(document.querySelectorAll('a[class*="ProductCard_root"]')); return productCards.map((element) => { const name = element.querySelector('h3[class*="ProductCard_name"]').textContent; const price = element.querySelector('div[class*="ProductCard_price"]').textContent; return { name, price, }; }); }); console.log(products); ``` When we run this code, we see this logged to our console: ![Products logged to the console](/assets/images/log-products-f59a9aaf95e34ba0915ff44098f8fef4.png) ## Using jQuery[​](#using-jquery "Direct link to Using jQuery") Working with `document.querySelector` is cumbersome and quite verbose, but with the `page.addScriptTag()` function and the latest [jQuery CDN link](https://releases.jquery.com/), we can inject jQuery into the current page to gain access to its syntactical sweetness: ``` await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' }); ``` This function will literally append a `` tag to the `` element of the current page, allowing access to jQuery's API when using `page.evaluate()` to run code in the browser context. Now, since we're able to use jQuery, let's translate our vanilla JavaScript code within the `page.evaluate()` function to jQuery: ``` await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.6.0.min.js' }); const products = await page.evaluate(() => { const productCards = Array.from($('a[class*="ProductCard_root"]')); return productCards.map((element) => { const card = $(element); const name = card.find('h3[class*="ProductCard_name"]').text(); const price = card.find('div[class*="ProductCard_price"]').text(); return { name, price, }; }); }); console.log(products); ``` This will output the same exact result as the code in the previous section. ## Parsing in the Node.js context[​](#parsing-in-node-context "Direct link to Parsing in the Node.js context") One of the most popular parsing libraries for Node.js is [Cheerio](https://www.npmjs.com/package/cheerio), which can be used in tandem with Playwright and Puppeteer. It is extremely beneficial to parse the page's HTML in the Node.js context for a number of reasons: * You can port the code between headless browser data extraction and plain HTTP data extraction * You don't have to worry in which context you're working (which can sometimes be confusing) * Errors are easier to handle when running in the base Node.js context To install it, we can run the following command within your project's directory: ``` npm install cheerio ``` Then, we'll import the `load` function like so: ``` import { load } from 'cheerio'; ``` Finally, we can create a `Cheerio` object based on our page's current content like so: ``` const $ = load(await page.content()); ``` > It's important to note that this `$` object is static. If any content on the page changes, the `$` variable will not automatically be updated. It will need to be re-declared or re-defined. 
Here's our full code so far: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; import { load } from 'cheerio'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://demo-webstore.apify.org/search/on-sale'); const $ = load(await page.content()); // code will go here await browser.close(); ``` ``` import puppeteer from 'puppeteer'; import { load } from 'cheerio'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://demo-webstore.apify.org/search/on-sale'); const $ = load(await page.content()); // code will go here await browser.close(); ``` Now, to loop through all of the products, we'll make use of the `$` object and loop through them while safely in the server-side context rather than running the code in the browser. Notice that this code is nearly exactly the same as the jQuery code above - it is just not running inside of a `page.evaluate()` in the browser context. ``` const $ = load(await page.content()); const productCards = Array.from($('a[class*="ProductCard_root"]')); const products = productCards.map((element) => { const card = $(element); const name = card.find('h3[class*="ProductCard_name"]').text(); const price = card.find('div[class*="ProductCard_price"]').text(); return { name, price, }; }); console.log(products); ``` ## Final code[​](#final-code "Direct link to Final code") Here's what our final optimized code looks like: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; import { load } from 'cheerio'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://demo-webstore.apify.org/search/on-sale'); const $ = load(await page.content()); const productCards = Array.from($('a[class*="ProductCard_root"]')); const products = productCards.map((element) => { const card = $(element); const name = card.find('h3[class*="ProductCard_name"]').text(); const price = card.find('div[class*="ProductCard_price"]').text(); return { name, price, }; }); console.log(products); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; import { load } from 'cheerio'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://demo-webstore.apify.org/search/on-sale'); const $ = load(await page.content()); const productCards = Array.from($('a[class*="ProductCard_root"]')); const products = productCards.map((element) => { const card = $(element); const name = card.find('h3[class*="ProductCard_name"]').text(); const price = card.find('div[class*="ProductCard_price"]').text(); return { name, price, }; }); console.log(products); await browser.close(); ``` ## Next up[​](#next "Direct link to Next up") Our [next lesson](https://docs.apify.com/academy/puppeteer-playwright/reading-intercepting-requests.md) will be discussing something super cool - request interception and reading data from requests and responses. It's like using DevTools, except programmatically! --- # Injecting code **Learn how to inject scripts prior to a page's load (pre-injecting), as well as how to expose functions to be run at a later time on the page.** *** In the previous lesson, we learned how to execute code on the page using `page.evaluate()`, and though this fits the majority of use cases, there are still some more unusual cases. For example, what if we want to execute our custom script prior to the page's load? 
Or, what if we want to define a function in the page's context to be run at a later time? We'll be covering both of these cases in this brief lesson. ## Pre-injecting scripts[​](#pre-injecting "Direct link to Pre-injecting scripts") Sometimes, you need your custom code to run before any other code is run on the page. Perhaps you need to modify an object's prototype, or even re-define certain global variables before they are used by the page's native scripts. Luckily, Puppeteer and Playwright both have functions for this. In Puppeteer, we use the [`page.evaluateOnNewDocument()`](https://pptr.dev/api/puppeteer.page.evaluateonnewdocument) function, while in Playwright we use [`page.addInitScript()`](https://playwright.dev/docs/api/class-page#page-add-init-script). We'll use these functions to override the native `addEventListener` function, setting it to a function that does nothing. This will prevent event listeners from being added to elements. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.addInitScript(() => { // Override the prototype Node.prototype.addEventListener = () => { /* do nothing */ }; }); await page.goto('https://google.com'); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.evaluateOnNewDocument(() => { // Override the prototype Node.prototype.addEventListener = null; }); await page.goto('https://google.com'); await page.waitForTimeout(10000); await browser.close(); ``` > Go ahead and run this code. Can you click the **I accept** button to accept Google's cookies policy? ## Exposing functions[​](#exposing-functions "Direct link to Exposing functions") Here's a super awesome function we've created called `returnMessage()`, which returns the string **Apify Academy!**: ``` const returnMessage = () => 'Apify academy!'; ``` We want to **expose** this function to our loaded page so that it can be later executed there, which can be done with [`page.exposeFunction()`](https://playwright.dev/docs/api/class-page#page-expose-function). This will make `returnMessage()` available when running scripts not only inside of `page.evaluate()`, but also directly from DevTools. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://google.com'); const returnMessage = () => 'Apify academy!'; await page.exposeFunction(returnMessage.name, returnMessage); const msg = await page.evaluate(() => returnMessage()); console.log(msg); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://google.com'); const returnMessage = () => 'Apify academy!'; await page.exposeFunction(returnMessage.name, returnMessage); const msg = await page.evaluate(() => returnMessage()); console.log(msg); await page.waitForTimeout(10000); await browser.close(); ``` ## Next up[​](#next "Direct link to Next up") Next, we'll be learning a bit about how to extract data using Playwright/Puppeteer. 
You can use one of the two main ways to do this, so [the next exciting lesson](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts/collecting-data.md) will be about both of them! --- # Opening a page **Learn how to create and open a Page with a Browser, and how to use it to visit and programmatically interact with a website.** *** When you open up your regular browser and visit a website, you open up a new page (or tab) before entering the URL in the search bar and hitting the **Enter** key. In Playwright and Puppeteer, you also have to open up a new page before visiting a URL. This can be done with the `browser.newPage()` function, which will return a **Page** object ([Puppeteer](https://pptr.dev/#?product=Puppeteer\&version=v13.7.0\&show=api-class-page), [Playwright](https://playwright.dev/docs/api/class-page)). * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); // Open a new page const page = await browser.newPage(); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); // Open a new page const page = await browser.newPage(); await browser.close(); ``` Then, we can visit a website with the `page.goto()` method. Let's go to [Google](https://google.com) for now. We'll also use the `page.waitForTimeout()` function, which will force the program to wait for a number of seconds before quitting (otherwise, everything will flash before our eyes and we won't really be able to tell what's going on): * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); // Open a new page const page = await browser.newPage(); // Visit Google await page.goto('https://google.com'); // wait for 10 seconds before shutting down await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); // Open a new page const page = await browser.newPage(); // Visit Google await page.goto('https://google.com'); // wait for 10 seconds before shutting down await page.waitForTimeout(10000); await browser.close(); ``` > If you haven't already, go ahead and run this code to see what happens. ## Next up[​](#next "Direct link to Next up") Now that we know how to open up a page, [let's learn](https://docs.apify.com/academy/puppeteer-playwright/page/interacting-with-a-page.md) how to automate page interaction, such as clicking, typing, and pressing keys. --- # Interacting with a page **Learn how to programmatically do actions on a page such as clicking, typing, and pressing keys. Also, discover a common roadblock that comes up when automating.** *** The **Page** object has a whole boat-load of functions which can be used to interact with the loaded page. We're not going to go over every single one of them right now, but we *will* use a few of the most common ones to add some functionality to our current project. Let's say that we want to automate searching for **hello world** on Google, then click on the first result and log the title of the page to the console, then take a screenshot and write it it to the filesystem. In order to understand how we're going to automate this, let's break down how we would do it manually: 1. Click on the button which accepts Google's cookies policy (To see how it looks, open Google in an anonymous window.) 2. Type **hello world** into the search bar 3. Press **Enter** 4. 
Wait for the results page to load 5. Click on the first result 6. Read the title of the clicked result's loaded page 7. Screenshot the page Though it seems complex, the wonderful **Page** API can help us with all the steps. ## Clicking & pressing keys[​](#clicking-and-pressing-keys "Direct link to Clicking & pressing keys") Let's first focus on the first 3 steps listed above. By using `page.click()` and the CSS selector of the element to click, we can click an element: * Playwright * Puppeteer ``` // Click the "Accept all" button await page.click('button:has-text("Accept all")'); ``` ``` // Click the "Accept all" button await page.click('button + button'); ``` With `page.click()`, Puppeteer and Playwright actually drag the mouse and click, allowing the bot to act more human-like. This is different from programmatically clicking with `Element.click()` in vanilla client-side JavaScript. Notice that in the Playwright example, we are using a different selector than in the Puppeteer example. This is because Playwright supports [many custom CSS selectors](https://playwright.dev/docs/other-locators#css-elements-matching-one-of-the-conditions), such as the **has-text** pseudo class. As a rule of thumb, using text selectors is much more preferable to using regular selectors, as they are much less likely to break. If Google makes the sibling above the **Accept all** button a `` element instead of a `` element, our `button + button` selector will break. However, the button will always have the text **Accept all**; therefore, `button:has-text("Accept all")` is more reliable. > If you're not already familiar with CSS selectors and how to find them, we recommend referring to [this lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools.md) in the **Web scraping basics for JavaScript devs** course. Then, we can type some text into an input field `` with `page.type()`; passing a CSS selector as the first, and the string to input as the second parameter: ``` // Type the query into the search box await page.type('textarea[title]', 'hello world'); ``` Finally, we can press a single key by accessing the `keyboard` property of `page` and calling the `press()` function on it: ``` // Press enter await page.keyboard.press('Enter'); ``` This is what we've got so far: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); // Click the "Accept all" button await page.click('button:has-text("Accept all")'); // Type the query into the search box await page.type('textarea[title]', 'hello world'); // Press enter await page.keyboard.press('Enter'); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); // Click the "Accept all" button await page.click('button + button'); // Type the query into the search box await page.type('textarea[title]', 'hello world'); // Press enter await page.keyboard.press('Enter'); await page.waitForTimeout(10000); await browser.close(); ``` When we run it, we leave off on the results page: ![Google results page reached by headless browser](/assets/images/google-results-7c52a69dcd7170b0a8d1a8b93b321811.png) Great! 
Now all we have to do is click the first result which matches the CSS selector `.g a`: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); await page.click('button:has-text("Accept all")'); await page.type('textarea[title]', 'hello world'); await page.keyboard.press('Enter'); // Click the first result await page.click('.g a'); await page.waitForTimeout(10000); await browser.close(); ``` ``` // This code will throw an error! import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); await page.click('button + button'); await page.type('textarea[title]', 'hello world'); await page.keyboard.press('Enter'); // Click the first result await page.click('.g a'); await page.waitForTimeout(10000); await browser.close(); ``` But wait, when we try to run the Puppeteer code, we run into this nasty error: > The following error won't be present if you're following the Playwright examples. You'll learn why in the next lesson. ``` /Users/me/Desktop/playwright-puppeteer/node_modules/puppeteer/lib/cjs/puppeteer/common/assert.js:26 throw new Error(message); ^ Error: No node found for selector: .g a at assert (/Users/me/Desktop/playwright-puppeteer/node_modules/puppeteer/lib/cjs/puppeteer/common/assert.js:26:15) ... ``` We hit this error because we attempted to click an element that wasn't yet present on the page. The results page hadn't even loaded yet! ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md), we'll be taking a look at how to **wait for** navigation, events, and content before resuming interactions. --- # Page methods **Understand that the Page object has many different methods to offer, and learn how to use two of them to capture a page's title and take a screenshot.** *** Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/api/puppeteer.page.reload), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/api/puppeteer.page.content). Last lesson, we left off at a point where we were waiting for the page to navigate so that we can extract the page's title and take a screenshot of it. In this lesson, we'll be learning about the two methods we can use to achieve both of those things. ## Grabbing the title[​](#grabbing-the-title "Direct link to Grabbing the title") Two main page functions exist that will return general data: 1. `page.content()` will return the entire HTML content of the page. 2. `page.title()` will return the title of the current page found in the `` tag. 
For our case, we'll utilize the `page.title()` function to grab the title and log it to the console: ``` // Grab the title and set it to a variable const title = await page.title(); // Log the title to the console console.log(title); ``` ## Screenshotting[​](#screenshotting "Direct link to Screenshotting") The `page.screenshot()` function will return a buffer which can be written to the filesystem as an image: ``` // Take the screenshot and write it to the filesystem await page.screenshot({ path: 'screenshot.png' }); ``` > The image will by default be **.png**. To change the image to **.jpeg** type, set the (optional) `type` option to **jpeg**. ## Final code[​](#final-code "Direct link to Final code") Here's our final code which extracts the page's title, takes a screenshot and saves it to our project's folder as `screenshot.png`: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); // Create a page and visit Google const page = await browser.newPage(); await page.goto('https://google.com'); // Agree to the cookies policy await page.click('button:has-text("Accept all")'); // Type the query and visit the results page await page.type('textarea[title]', 'hello world'); await page.keyboard.press('Enter'); // Click on the first result await page.click('.g a'); await page.waitForLoadState('load'); // Grab the page's title and log it to the console const title = await page.title(); console.log(title); // Take a screenshot and write it to the filesystem await page.screenshot({ path: 'screenshot.png' }); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); // Create a page and visit Google const page = await browser.newPage(); await page.goto('https://google.com'); // Agree to the cookies policy await page.click('button + button'); // Type the query and visit the results page await page.type('textarea[title]', 'hello world'); await page.keyboard.press('Enter'); // Wait for the first result to appear on the page, // then click on it await page.waitForSelector('.g a'); await Promise.all([page.waitForNavigation(), page.click('.g a')]); // Grab the page's title and log it to the console const title = await page.title(); console.log(title); // Take a screenshot and write it to the filesystem await page.screenshot({ path: 'screenshot.png' }); await browser.close(); ``` When you run this code, you should see this logged to the console: ``` "Hello, World!" program - Wikipedia ``` Additionally, you should see a new image named **screenshot.png** in your project's folder that looks something like this: ![Screenshot from Playwright/Puppeteer](/assets/images/wikipedia-screenshot-e418e43eabee246c354755dd29f091c5.png) ## Next up[​](#next "Direct link to Next up") In the [next exciting lesson](https://docs.apify.com/academy/puppeteer-playwright/executing-scripts.md), we'll gain a solid understanding of the two different contexts we can run our code in when using Puppeteer and Playwright, as well as how to run code in the context of the browser. --- # Waiting for elements and events **Learn the importance of waiting for content and events before running interaction or extraction code, as well as the best practices for doing so.** *** In a perfect world, every piece of content served on a website would be loaded instantaneously. 
We don't live in a perfect world though, and often times it can take anywhere between 1/10th of a second to a few seconds to load some content onto a page. Certain elements are also [generated dynamically](https://docs.apify.com/academy/concepts/dynamic-pages.md), which means that they are not present in the initial HTML and that they are created by scripts or data from API calls. Puppeteer and Playwright don't sit around waiting for a page (or specific elements) to load though - if we tell it to do something with an element that hasn't been rendered yet, it'll start trying to do it (which will result in nasty errors). We've got to tell it to wait. > For a thorough explanation on how dynamic rendering works, give [**Dynamic pages**](https://docs.apify.com/academy/concepts/dynamic-pages.md) a quick readover, and check out the examples. Different events and elements can be waited for using the various `waitFor...` methods offered. ## Elements[​](#waiting-for-elements "Direct link to Elements") In the previous lesson, we ran into an error with Puppeteer due to the fact that we weren't waiting for the `.g a` selector to be present on the page before clicking it. The same error didn't occur in Playwright, because `page.click()` [automatically waits](https://playwright.dev/docs/actionability) for the element to be visible on the page before clicking it. Elements with specific selectors can be waited for by using the `page.waitForSelector()` function. Let's use this knowledge to wait for the first result to be present on the page prior to clicking on it: ``` // This example is relevant for Puppeteer only! import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://www.google.com/'); await page.click('button + button'); await page.type('textarea[title]', 'hello world'); await page.keyboard.press('Enter'); // Wait for the element to be present on the page prior to clicking it await page.waitForSelector('.g a'); await page.click('.g a'); await page.waitForTimeout(10000); await browser.close(); ``` Now, we won't see the error message anymore, and the first result will be successfully clicked by Puppeteer. > Playwright also has a `page.waitForSelector()` function and it's useful in other scenarios than clicking, or for more granular control over the waiting process. ## Navigation[​](#waiting-for-navigation "Direct link to Navigation") If we remember properly, after clicking the first result, we want to console log the title of the result's page and save a screenshot into the filesystem. In order to grab a solid screenshot of the loaded page though, we should **wait for navigation** before snapping the image. This can be done with [`page.waitForNavigation()`](https://pptr.dev/#?product=Puppeteer\&version=v14.1.0\&show=api-pagewaitfornavigationoptions). > A navigation is when a new [page load](https://docs.apify.com/academy/concepts/dynamic-pages.md) happens. First, the `domcontentloaded` event is fired, then the `load` event. `page.waitForNavigation()` will wait for the `load` event to fire. 
Naively, you might immediately think that this is the way we should wait for navigation after clicking the first result: ``` await page.click('.g a'); await page.waitForNavigation(); ``` Though in theory this is correct, it can result in a race condition in which the page navigates quickly before the `page.waitForNavigation()` function is ever run, which means that once it is finally called, it will hang and wait forever for the [`load` event](https://developer.mozilla.org/en-US/docs/Web/API/Window/load_event) event to fire even though it already fired. To solve this, we can stick the waiting logic and the clicking logic into a `Promise.all()` call (placing `page.waitForNavigation()` first). ``` await Promise.all([page.waitForNavigation(), page.click('.g a')]); ``` Though the line of code above is also valid in Playwright, it is recommended to use [`page.waitForLoadState('load')`](https://playwright.dev/docs/api/class-page#page-wait-for-load-state) instead of `page.waitForNavigation()`, as it automatically handles the issues being solved by using `Promise.all()`. ``` await page.click('.g a'); await page.waitForLoadState('load'); ``` This implementation will do the following: 1. Begin waiting for the page to navigate without blocking the `page.click()` function 2. Click the element, firing off a navigating event 3. Resolve once the page has navigated, allowing further code to run ## Our code so far[​](#current-code "Direct link to Our code so far") Here's what our project's code looks like so far: * Playwright * Puppeteer ``` import * as fs from 'fs/promises'; import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); // Create a page and visit Google const page = await browser.newPage(); await page.goto('https://google.com'); // Agree to the cookies policy await page.click('button:has-text("Accept all")'); // Type the query and visit the results page await page.type('textarea[title]', 'hello world'); await page.keyboard.press('Enter'); // Click on the first result await page.click('.g a'); await page.waitForLoadState('load'); // Our title extraction and screenshotting logic // will go here await page.waitForTimeout(10000); await browser.close(); ``` ``` import * as fs from 'fs/promises'; import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); // Create a page and visit Google const page = await browser.newPage(); await page.goto('https://google.com'); // Agree to the cookies policy await page.click('button + button'); // Type the query and visit the results page await page.type('textarea[title]', 'hello world'); await page.keyboard.press('Enter'); // Wait for the first result to appear on the page, // then click on it await page.waitForSelector('.g a'); await Promise.all([page.waitForNavigation(), page.click('.g a')]); // Our title extraction and screenshotting logic // will go here await page.waitForTimeout(10000); await browser.close(); ``` ## Next up[​](#next "Direct link to Next up") In the [final lesson](https://docs.apify.com/academy/puppeteer-playwright/page/page-methods.md) of the **Opening & controlling a page** section of this course, we'll be learning about various methods on **Page** which aren't related to directly interacting with a page or waiting for stuff, as well as finally adding the final touches to our mini-project (page title grabbing and screenshotting). 
--- # Using proxies **Understand how to use proxies in your Puppeteer and Playwright requests, as well as a couple of the most common use cases for proxies.** *** [Proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) are a great way of appearing as if you are making requests from a different location. A common use case for proxies is to avoid [geolocation](https://docs.apify.com/academy/anti-scraping/techniques/geolocation.md) restrictions. For example your favorite TV show might not be available on Netflix in your country, but it might be available for Vietnamese Netflix watchers. In this lesson, we'll be learning how to use proxies with Playwright and Puppeteer. This will be demonstrated with a Vietnamese proxy that we got by running [this](https://apify.com/mstephen190/proxy-scraper) proxy-scraping Actor on the Apify platform. ## Adding a proxy[​](#adding-a-proxy "Direct link to Adding a proxy") First, let's add our familiar boilerplate code for visiting Google and also create a variable called `proxy` which will point to our proxy server: > Note that this proxy may no longer be working at the time of reading. If you don't have a proxy to use during this lesson, we recommend using Proxy Scraper for a list of free ones, or checking out [Apify proxy](https://apify.com/proxy) * Playwright * Puppeteer ``` import { chromium } from 'playwright'; // our proxy server const proxy = '103.214.9.13:3128'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://google.com'); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; // our proxy server const proxy = '103.214.9.13:3128'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); await page.goto('https://google.com'); await page.waitForTimeout(10000); await browser.close(); ``` For both Puppeteer and Playwright, the proxy server's URL should be passed into the options of the `launch()` function; however, it's done a bit differently depending on which library you're using. In Puppeteer, the server must be passed within the **--proxy-server** [Chromium command line argument](https://peter.sh/experiments/chromium-command-line-switches/), while in Playwright, it can be passed into the **proxy** option. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const proxy = '103.214.9.13:3128'; const browser = await chromium.launch({ headless: false, // Using the "proxy" option proxy: { // Pass in the server URL server: proxy, }, }); const page = await browser.newPage(); await page.goto('https://google.com'); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const proxy = '103.214.9.13:3128'; // Using the "args" option, which is an array of Chromium command // line switches, we pass the server URL in with "--proxy-server" const browser = await puppeteer.launch({ headless: false, args: [`--proxy-server=${proxy}`], }); const page = await browser.newPage(); await page.goto('https://google.com'); await page.waitForTimeout(10000); await browser.close(); ``` And that's it! Now, when we visit Google, it's in Vietnamese. Depending on the country of your proxy, the language will vary. 
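Before relying on the proxy for real scraping, it can be worth a quick sanity check that traffic is actually being routed through it. One rough sketch (the IP-echo service below is just an example; any similar endpoint works, and the same snippet runs in both libraries):

```
// Visit an IP-echo service and print what it reports.
// If the proxy is applied, this should be the proxy's IP address, not yours.
await page.goto('https://api.ipify.org?format=json');
const reportedIp = await page.evaluate(() => document.body.innerText);
console.log(reportedIp);
```

That said, the quickest visual confirmation is Google itself switching to the proxy's local language: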
![Vietnamese Google](/assets/images/vietnamese-google-a742c6f89651d9c47a6d3701140a11cd.png) > Note that in order to rotate through multiple proxies, you must retire a browser instance then create a new one to continue automating with a new proxy. ## Authenticating a proxy[​](#authenticating-a-proxy "Direct link to Authenticating a proxy") The proxy in the last activity didn't require a username and password, but let's say that this one does: ``` proxy.example.com:3001 ``` One might automatically assume that this would be the solution: * Playwright * Puppeteer ``` // This code is wrong! import { chromium } from 'playwright'; const proxy = 'proxy.example.com:3001'; const username = 'someUsername'; const password = 'password123'; const browser = await chromium.launch({ headless: false, proxy: { server: `http://${username}:${password}@${proxy}`, }, }); ``` ``` // This code is wrong! import puppeteer from 'puppeteer'; const proxy = 'proxy.example.com:3001'; const username = 'someUsername'; const password = 'password123'; const browser = await puppeteer.launch({ headless: false, args: [`--proxy-server=http://${username}:${password}@${proxy}`], }); ``` However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed to the `page.authenticate()` prior to any navigations being made, while in Playwright they can be passed to the **proxy** option object. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const proxy = 'proxy.example.com:3001'; const username = 'someUsername'; const password = 'password123'; const browser = await chromium.launch({ headless: false, proxy: { server: proxy, username, password, }, }); // Proxy will now be authenticated ``` ``` import puppeteer from 'puppeteer'; const proxy = 'proxy.example.com:3001'; const username = 'someUsername'; const password = 'password123'; const browser = await puppeteer.launch({ headless: false, args: [`--proxy-server=${proxy}`], }); const page = await browser.newPage(); await page.authenticate({ username, password }); // Proxy will now be authenticated ``` ## Next up[​](#next "Direct link to Next up") You already know how to launch a browser with various configurations, which means you're ready to [learn about browser contexts](https://docs.apify.com/academy/puppeteer-playwright/browser-contexts.md). Browser contexts can be used to automate multiple sessions at once with completely different configurations. You'll also learn how to emulate different devices, such as iPhones, iPads, and Androids. --- # Reading & intercepting requests **You can use DevTools, but did you know that you can do all the same stuff (plus more) programmatically? Read and intercept requests in Puppeteer/Playwright.** *** On any website that serves up images, makes [XMLHttpRequests](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest), or fetches content in some other way, you can see those requests (and their responses) in the [**Network** tab](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md) of your browser's DevTools. Lots of data about the request can be found there, such as the headers, payload, and response body. In Playwright and Puppeteer, it is also possible to read (and even intercept) requests being made on the page - programmatically. This is very useful for things like reading dynamic headers, saving API responses, blocking certain resources, and much more. 
During this lesson, we'll be using [Tiësto's following list](https://soundcloud.com/tiesto/following) on SoundCloud to demonstrate request/response reading and interception. Here's our basic setup for opening the page: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); // Our code will go here await page.goto('https://soundcloud.com/tiesto/following'); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); // Our code will go here await page.goto('https://soundcloud.com/tiesto/following'); await page.waitForTimeout(10000); await browser.close(); ``` ## Reading requests[​](#reading-requests "Direct link to Reading requests") We can use the [`page.on()`](https://pptr.dev/#?product=Puppeteer\&version=v14.0.0\&show=api-event-close) function to listen for the **request** event, passing in a callback function. The first parameter of the passed in callback function is an object representing the request. Upon visiting Tiësto's following page, we can see in the **Network** tab that a request is made to fetch all of the users which he is following. ![Request to grab Tiësto\'s following list](/assets/images/tiesto-request-e84745c5eb8edc0ec84bfeea4472ae7a.png) Let's go ahead and listen for this request in our code: * Playwright * Puppeteer ``` // Listen for all requests page.on('request', (req) => { // If the URL doesn't include our keyword, ignore it if (!req.url().includes('followings')) return; console.log('Request for followers was made!'); }); ``` ``` // Listen for all requests page.on('request', (req) => { // If the URL doesn't include our keyword, ignore it if (!req.url().includes('followings')) return; console.log('Request for followers was made!'); }); ``` > Note that you should always define any request reading/interception code prior to calling the `page.goto()` function. Cool! Now when we run our code, we'll see this logged to the console: ``` Request for followers was made! ``` This request includes some useful query parameters, namely the `client_id`. 
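The request object passed to the listener exposes more than just the URL. As a quick sketch (these accessors are available in both libraries, though the exact header contents will vary):

```
page.on('request', (req) => {
    if (!req.url().includes('followings')) return;
    // Some other useful things we can read from the request
    console.log(req.method()); // e.g. "GET"
    console.log(req.resourceType()); // e.g. "xhr"
    console.log(req.headers()); // an object with the request headers
});
```

The query parameters we're interested in, however, live right in the request URL.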
Let's go ahead and grab these values from the request URL and print them to the console: * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); // Listen for all requests page.on('request', (req) => { // If the URL doesn't include our keyword, ignore it if (!req.url().includes('followings')) return; // Convert the request URL into a URL object const url = new URL(req.url()); // Print the search parameters in object form console.log(Object.fromEntries(url.searchParams)); }); await page.goto('https://soundcloud.com/tiesto/following'); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); // Listen for all requests page.on('request', (req) => { // If the URL doesn't include our keyword, ignore it if (!req.url().includes('followings')) return; // Convert the request URL into a URL object const url = new URL(req.url()); // Print the search parameters in object form console.log(Object.fromEntries(url.searchParams)); }); await page.goto('https://soundcloud.com/tiesto/following'); await page.waitForTimeout(10000); await browser.close(); ``` After running this code, we can see this logged to the console: ``` { client_id: 'llCGDUjKpxUslgO1yEce7Zh95PXE78Bo', limit: '12', offset: '0', linked_partitioning: '1', app_version: '1652347025', app_locale: 'en' } ``` ## Reading responses[​](#reading-responses "Direct link to Reading responses") Listening for and reading responses is very similar to reading requests. The only difference is that we need to listen for the **response** event instead of **request**. Additionally, the object passed into the callback function represents the response instead of the request. This time, instead of grabbing the query parameters of the request URL, let's grab hold of the response body and print it to the console in JSON format: * Playwright * Puppeteer ``` // Notice that the callback function is now async page.on('response', async (res) => { if (!res.request().url().includes('followings')) return; // Grab the response body in JSON format try { const json = await res.json(); console.log(json); } catch (err) { console.error('Response wasn\'t JSON or failed to parse response.'); } }); ``` ``` // Notice that the callback function is now async page.on('response', async (res) => { if (!res.request().url().includes('followings')) return; // Grab the response body in JSON format try { const json = await res.json(); console.log(json); } catch (err) { console.error('Response wasn\'t JSON or failed to parse response.'); } }); ``` > Take notice of our usage of a `try...catch` block. This is because if the response is not JSON, the `res.json()` function will fail and throw an error, which we must handle to prevent any unexpected crashes. Upon running this code, we'll see the API response logged into the console: ![API response in console](/assets/images/api-response-tiesto-dafcb582f617c9ceae890c19b8faa6c4.png) ## Intercepting requests[​](#intercepting-requests "Direct link to Intercepting requests") One of the most popular ways of speeding up website loading in Puppeteer and Playwright is by blocking certain resources from loading. These resources are usually CSS files, images, and other miscellaneous resources that aren't super necessary (mainly because the computer doesn't have eyes - it doesn't care how the website looks!). 
In Puppeteer, we must first enable request interception with the `page.setRequestInterception()` function. Then, we can check whether or not the request's resource ends with one of our blocked file extensions. If so, we'll abort the request. Otherwise, we'll let it continue. All of this logic will still be within the `page.on()` method. With Playwright, request interception is a bit different. We use the [`page.route()`](https://playwright.dev/docs/api/class-page#page-route) function instead of `page.on()`, passing in a string, regular expression, or a function that will match the URL of the request we'd like to read from. The second parameter is also a callback function, but with the [**Route**](https://playwright.dev/docs/api/class-route) object passed into it instead. ### Blocking resources[​](#blocking-resources "Direct link to Blocking resources") We'll first create an array of some file extensions that we'd like to block: ``` const blockedExtensions = ['.png', '.css', '.jpg', '.jpeg', '.pdf', '.svg']; ``` Then, we'll `abort()` all requests that end with any of these extensions. * Playwright * Puppeteer ``` import { chromium } from 'playwright'; const browser = await chromium.launch({ headless: false }); const page = await browser.newPage(); const blockedExtensions = ['.png', '.css', '.jpg', '.jpeg', '.pdf', '.svg']; // Only listen for requests with one of our blocked extensions // Abort all matching requests page.route(`**/*{${blockedExtensions.join(',')}}`, async (route) => route.abort()); await page.goto('https://soundcloud.com/tiesto/following'); await page.waitForTimeout(10000); await browser.close(); ``` ``` import puppeteer from 'puppeteer'; const browser = await puppeteer.launch({ headless: false }); const page = await browser.newPage(); const blockedExtensions = ['.png', '.css', '.jpg', '.jpeg', '.pdf', '.svg']; // Enable request interception (skipping this step will result in an error) await page.setRequestInterception(true); // Listen for all requests page.on('request', async (req) => { // If the request ends in a blocked extension, abort the request if (blockedExtensions.some((str) => req.url().endsWith(str))) return req.abort(); // Otherwise, continue await req.continue(); }); await page.goto('https://soundcloud.com/tiesto/following'); await page.waitForTimeout(10000); await browser.close(); ``` > You can also use `request.resourceType()` to grab the resource type. Here's what we see when we run this logic: ![SoundCloud with no CSS or image resources loaded](/assets/images/ugly-soundcloud-28eb2e994a6aca46ad03a97b7102f066.png) This confirms that we've successfully blocked the CSS and image resources from loading. #### Quick note about resource blocking[​](#quick-note "Direct link to Quick note about resource blocking") Something **very** important to note is that by using request interception, the browser's cache is turned **off**. This means that resources on websites that would normally be cached (and pulled from the cache instead on the next request for those resources) will not be cached, which can have varying negative effects on performance, especially when making many requests to the same domain, which is very common in web scraping. You can learn how to solve this problem in [this short tutorial](https://docs.apify.com/academy/node-js/caching-responses-in-puppeteer.md). 
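Whether blocking is worth the cache trade-off depends on the target site, so it can help to measure. A rough sketch of a quick timing comparison (run it once with the interception code enabled and once without):

```
// Crude timing check around the navigation
console.time('page load');
await page.goto('https://soundcloud.com/tiesto/following');
console.timeEnd('page load');
```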
To block resources, it is better to use a CDP (Chrome DevTools Protocol) session ([Playwright](https://playwright.dev/docs/api/class-cdpsession)/[Puppeteer](https://pptr.dev/#?product=Puppeteer\&version=v14.1.0\&show=api-class-cdpsession)) to set the blocked URLs. Here is an implementation that achieves the same goal as the example above; however, the browser's cache remains enabled.

* Playwright
* Puppeteer

```
// Note, you can't use CDP session in other browsers!
// Only in Chromium.
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

// Define our blocked extensions
const blockedExtensions = ['.png', '.css', '.jpg', '.jpeg', '.pdf', '.svg'];

// Use CDP session to block resources
const client = await page.context().newCDPSession(page);
await client.send('Network.setBlockedURLs', { urls: blockedExtensions });

await page.goto('https://soundcloud.com/tiesto/following');

await page.waitForTimeout(10000);
await browser.close();
```

```
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

// Define our blocked extensions
const blockedExtensions = ['.png', '.css', '.jpg', '.jpeg', '.pdf', '.svg'];

// Create a CDP session for the page and use it to block resources
const client = await page.target().createCDPSession();
await client.send('Network.setBlockedURLs', { urls: blockedExtensions });

await page.goto('https://soundcloud.com/tiesto/following');

await page.waitForTimeout(10000);
await browser.close();
```

### Modifying the request[​](#modifyng-the-request "Direct link to Modifying the request")

There's much more to intercepting requests than just aborting them though. We can change the payload, headers, query parameters, and even the base URL. Let's go ahead and intercept and modify the initial request we fire off with `page.goto()` by making it go to [Mesto's following page](https://soundcloud.com/mestomusic) instead.

* Playwright
* Puppeteer

```
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();

// Only listen for requests matching this regular expression
page.route(/soundcloud.com\/tiesto/, async (route) => {
    // Continue the route, but replace "tiesto" in the URL with "mestomusic"
    return route.continue({ url: route.request().url().replace('tiesto', 'mestomusic') });
});

await page.goto('https://soundcloud.com/tiesto/following');

await page.waitForTimeout(10000);
await browser.close();
```

```
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

await page.setRequestInterception(true);

// Listen for all requests
page.on('request', async (req) => {
    // If it doesn't match, continue the route normally
    if (!/soundcloud.com\/tiesto/.test(req.url())) return req.continue();
    // Otherwise, continue the route, but replace "tiesto"
    // in the URL with "mestomusic"
    await req.continue({ url: req.url().replace('tiesto', 'mestomusic') });
});

await page.goto('https://soundcloud.com/tiesto/following');

await page.waitForTimeout(10000);
await browser.close();
```

> Note that this **is not** a redirect, because Tiësto's page was never even visited. The request was changed before it was even fulfilled.
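The same `continue()` override can modify other parts of the request too. For example, here's a rough sketch of adding an extra header to matching requests in Playwright (the header name and value are made up for illustration):

```
page.route(/soundcloud.com/, async (route) => {
    // Keep the original headers and add a hypothetical custom one
    const headers = { ...route.request().headers(), 'x-example-header': 'hello' };
    await route.continue({ headers });
});
```

With that aside out of the way, let's get back to the URL-swapping example.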
Here's what we see when we run `node index.js`: ![Request intercepted and sent to Mesto\'s page instead](/assets/images/mesto-following-bb43b980a4fe0f6afdaefd7e548530f3.jpg) ## Next up[​](#next "Direct link to Next up") The [next lesson](https://docs.apify.com/academy/puppeteer-playwright/proxies.md) will teach you how to use proxies in Playwright and Puppeteer in order to avoid blocking or to appear as if you are requesting from a different location. --- # Python Tutorials 🐍📚 **A collection of various Python tutorials to aid you in your journey to becoming a master web scraping and automation developer.** *** This section contains various web-scraping or web-scraping related tutorials for Python. Whether you're trying to scrape from a website with sitemaps, struggling with a dynamic page, want to optimize your slow scraper, or need some general tips for scraping in Python, this section is right for you. --- # How to process data in Python using Pandas **Learn how to process the resulting data of a web scraper in Python using the Pandas library, and how to visualize the processed data using Matplotlib.** *** In the [previous tutorial](https://docs.apify.com/academy/python/scrape-data-python.md), we learned how to scrape data from the web in Python using the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) library. The Python ecosystem's strengths lie mainly in data processing, though, so in this tutorial we will learn how to process the data stored in an Apify dataset using the [Pandas](https://pandas.pydata.org/) library, and how to visualize it using [Matplotlib](https://matplotlib.org/). In this tutorial, we will use the Actor we created in the [previous tutorial](https://docs.apify.com/academy/python/scrape-data-python.md), so if you haven't completed that tutorial yet, please do so now. > In a rush? Skip this tutorial and [get the full code example](https://github.com/apify/apify-docs/tree/master/examples/python-data-parser/). ## Processing previously scraped data[​](#processing-previously-scraped-data "Direct link to Processing previously scraped data") In the previous tutorial, we set out to select our next holiday destination based on the forecast of the upcoming weather there. We have written an Actor that scrapes the BBC Weather forecast for the upcoming two weeks for three destinations: Prague, New York, and Honolulu. It then saves the scraped data to a [dataset](https://docs.apify.com/platform/storage/dataset.md) on the Apify platform. Now, we need to process the scraped data and make a visualization that will help us decide which location has the best weather, and will therefore become our next holiday destination. ### Setting up the Actor[​](#setting-up-the-actor "Direct link to Setting up the Actor") First, we need to create another Actor. You can do it the same way as before - go to the [Apify Console](https://console.apify.com/), open the [Actors section](https://console.apify.com/actors), click on the **Create new** button in the top right, and select the **Example: Hello world in Python** Actor template. In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-parser`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`. 
First, we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `pandas` package for parsing the downloaded weather data, and the `matplotlib` package for visualizing it. We don't care about versions of these packages, so we list just their names: ``` # Add your dependencies here. # See https://pip.pypa.io/en/latest/cli/pip_install/#requirements-file-format # for how to format them matplotlib pandas ``` The Actor's main logic will live in the `main.py` file. Let's delete everything currently in it and start from an empty file. Next, we'll import all the packages we will use in the code: ``` from io import BytesIO import os from apify_client import ApifyClient from apify_client.consts import ActorJobStatus import pandas ``` ### Scraping the data[​](#scraping-the-data "Direct link to Scraping the data") Next, we need to run the weather scraping Actor and access its results. We do that through the [Apify API Client for Python](https://docs.apify.com/api/client/python), which greatly simplifies working with the Apify platform and allows you to use its functions without having to call the Apify API directly. First, we initialize an `ApifyClient` instance. All the necessary arguments are automatically provided to the Actor process as environment variables accessible in Python through the `os.environ` mapping. We need to run the Actor from the previous tutorial, which we have named `bbc-weather-scraper`, and wait for it to finish. We create a sub-client for working with that Actor and run the Actor through it. We then check whether the Actor run has succeeded. If so, we create a client for working with its default dataset. ``` # Initialize the main ApifyClient instance client = ApifyClient(os.environ['APIFY_TOKEN'], api_url=os.environ['APIFY_API_BASE_URL']) # Run the weather scraper and wait for it to finish print('Downloading the weather data...') scraper_run = client.actor('~bbc-weather-scraper').call() # Check if the scraper finished successfully, otherwise raise an error if scraper_run['status'] != ActorJobStatus.SUCCEEDED: raise RuntimeError('The weather scraper run has failed') # Get the resource sub-client for working with the dataset with the source data dataset_client = client.dataset(scraper_run['defaultDatasetId']) ``` ### Processing the data[​](#processing-the-data "Direct link to Processing the data") Now, we need to load the data from the dataset to a Pandas dataframe. Pandas supports reading data from a CSV file stream, so we create a stream with the dataset items in the right format and supply it to `pandas.read_csv()`. ``` # Load the dataset items into a pandas dataframe print('Parsing weather data...') dataset_items_stream = dataset_client.stream_items(item_format='csv') weather_data = pandas.read_csv(dataset_items_stream, parse_dates=['datetime'], date_parser=lambda val: pandas.to_datetime(val, utc=True)) ``` Once we have the data loaded, we can process it. Each data row comes as three fields: `datetime`, `location` and `temperature`. We would like to transform the data so that we have the datetimes in one column, and the temperatures for each location at that datetime in separate columns, one for each location. To achieve this, we use the `.pivot()` method on the dataframe. 
Since the temperature varies considerably between day and night, and we would like to get an overview of the temperature trends over a longer period of time, we calculate a rolling average of the temperatures with a 24-hour window. ``` # Transform data to a pivot table for easier plotting pivot = weather_data.pivot(index='datetime', columns='location', values='temperature') mean_daily_temperatures = pivot.rolling(window='24h', min_periods=24, center=True).mean() ``` ### Visualizing the data[​](#visualizing-the-data "Direct link to Visualizing the data") With the data processed, we can then make a plot of the results. For that, we use the `.plot()` method of the dataframe, which creates a figure with the plot, using the Matplotlib library internally. We set the right titles and labels to the plot, and apply some additional formatting to achieve a nicer result. ``` # Create a plot of the data print('Plotting the data...') axes = mean_daily_temperatures.plot(figsize=(10, 5)) axes.set_title('Weather prediction for holiday destinations') axes.set_xlabel(None) axes.yaxis.set_major_formatter(lambda val, _: f'{int(val)} °C') axes.grid(which='both', linestyle='dotted') axes.legend(loc='best') axes.figure.tight_layout() ``` As the last step, we need to save the plot to a record in a [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) on the Apify platform, so that we can access it later. We save the rendered figure with the plot to an in-memory buffer, and then save the contents of that buffer to the default key-value store of the Actor run through its resource subclient. ``` # Get the resource sub-client for working with the default key-value store of the run key_value_store_client = client.key_value_store(os.environ['APIFY_DEFAULT_KEY_VALUE_STORE_ID']) # Save the resulting plot to the key-value store through an in-memory buffer print('Saving plot to key-value store...') with BytesIO() as buf: axes.figure.savefig(buf, format='png', dpi=200, facecolor='w') buf.seek(0) key_value_store_client.set_record('prediction.png', buf, 'image/png') print(f'Result is available at {os.environ["APIFY_API_PUBLIC_BASE_URL"]}' + f'/v2/key-value-stores/{os.environ["APIFY_DEFAULT_KEY_VALUE_STORE_ID"]}/records/prediction.png') ``` And that's it! Now you can save the changes in the editor, and then click **Build and run** at the bottom of the page. The Actor will get built, the built Actor image will get saved for future re-use, and then it will be executed. You can follow the progress of the Actor build and the Actor run in the **Last build** and **Last run** tabs, respectively, in the developer console in the Actor source view. Once the Actor finishes running, it will output the URL where you can access the plot we created in its log. ![Building and running the BBC Weather Parser Actor](/assets/images/bbc-weather-parser-source-0517e35ad3095a9673b91e66a69b4374.png) Looking at the results, Honolulu seems like the right choice now, don't you think? 🙂 ![Weather prediction plot created by the BBC Weather Parser Actor](/assets/images/bbc-weather-prediction-c1835f07c2e0dbbd50f85b66fa21ccd9.png) --- # How to scrape data in Python using Beautiful Soup **Learn how to create a Python Actor and use Python libraries to scrape, process and visualize data extracted from the web.** *** Web scraping is not limited to the JavaScript world. The Python ecosystem contains some pretty powerful scraping tools as well. 
One of those is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/), a library for parsing HTML and navigating or modifying its DOM tree. This tutorial shows you how to write a Python [Actor](https://docs.apify.com/academy/getting-started/actors.md) for scraping the weather forecast from [BBC Weather](https://www.bbc.com/weather) and how to process the scraped data using [Pandas](https://pandas.pydata.org/).

> In a rush? Skip this tutorial and get the [full code example](https://github.com/apify/apify-docs/tree/master/examples/python-data-scraper/).

## Exploring the BBC Weather page[​](#exploring-the-bbc-weather-page "Direct link to Exploring the BBC Weather page")

BBC Weather offers you the weather forecast for the upcoming 14 days for a large selection of places around the world. Let's say we want to decide on our next holiday destination. We're choosing between Prague, New York, and Honolulu, and we will pick the destination based on which one has the best weather. To do that, we will scrape the weather forecast for each of our options, and then compare the results.

### Understanding the URL format[​](#understanding-the-url-format "Direct link to Understanding the URL format")

First, we need to look around the BBC Weather page and understand how the weather data is being retrieved and presented. If we open the [BBC Weather](https://www.bbc.com/weather) page and search for Prague, we can see that it opens a page with a URL ending in a seven-digit number, which we can assume is the ID that BBC Weather uses internally for the displayed location. Opening a different location changes only that number in the URL, confirming our assumption.

The page shows the weather forecast for the upcoming 14 days. If we hover over the days in the displayed carousel, we can see that the link for each day leads to a URL ending with `/day{X}`, with `{X}` representing how many days in the future the specific day is.

Combining this information gives us the full format for the URL of a page for a given location and day: `https://www.bbc.com/weather/{LOCATION_ID}/day{DAY_OFFSET}`.

![BBC Weather URL format](/assets/images/bbc-weather-url-format-e8135406724ef738c71b26d310cddf7e.png)

### Determining the forecast's starting date[​](#determining-the-forecasts-starting-date "Direct link to Determining the forecast's starting date")

Looking more closely at the BBC Weather page, we can see that it shows the forecast for each day from 6:00 AM to 5:00 AM the next day. But what happens when we view a location where the current time is between midnight and 5 AM? Trying that, we can see that the day represented by **Tonight** displays only a few slots, covering the hours between midnight and 5 AM. This means that the first displayed day can either represent the current date at the location, or the day before the current date. To find out which of these two it is, we will first have to determine the current date and time at the location, and then possibly adjust it by one day based on whether the date matches the first displayed day.

![BBC Weather displaying a location with current time between midnight and 5 AM](/assets/images/bbc-weather-after-midnight-7348c6c0be8c3441be74734d16b90451.png)

To determine the current date and time at the displayed location, we will need to know the location's timezone. Fortunately, the timezone and its offset from GMT are displayed near the bottom of the page.
![The timezone offset on the BBC Weather page](/assets/images/bbc-time-offset-a7657ef9dc2721e46f2c2adf0f71aac0.png) ### Understanding the element structure[​](#understanding-the-element-structure "Direct link to Understanding the element structure") To extract data from the page, we need to figure out where exactly in the internal page structure it is stored. If we right-click on the day title in the top carousel (**Today** or **Tonight**) and select **Inspect** in the popup menu, we can open the Chrome DevTools Inspector with the clicked element highlighted. We can see that the element with the currently displayed day in the top carousel has the class `wr-day--active`, and that the element with the day's title has the class `wr-day__title` and the accessibility label attribute `aria-label` contains the actual date of that day, not just **Today** or **Tonight**. Additionally, the timezone information is in an element with the class `wr-c-footer-timezone__item`. You can see two elements with the same class, so we will need to pick the second one when parsing the page. Exploring the document tree further, we can see that the element containing all the displayed hours has the class `wr-time-slot-container__slots`. The elements with the forecast for a given hour have the class `wr-time-slot`. In each time slot, the element containing the slot's hour has the class `wr-time-slot-primary__hours` and the element containing the slot's predicted temperature in degrees Celsius has the class `wr-value--temperature--c`. ![BBC Weather with the DevTools Inspector open](/assets/images/bbc-weather-devtools-ede1403dc7aa869f51e9b01efae06def.png) ## Scraping the data from the page[​](#scraping-the-data-from-the-page "Direct link to Scraping the data from the page") Now that we understand the element structure of the page and know where to find all the data we need, we can start writing the scraper. ### Setting up the Actor[​](#setting-up-the-actor "Direct link to Setting up the Actor") First, we need to create a new Actor. To do this, go to [Apify Console](https://console.apify.com/), open the [Development section](https://console.apify.com/actors/development/my-actors), click on the **Develop new** button in the top right, and select the **Example: Hello world in Python** Actor template. In the page that opens, you can see your newly created Actor. In the **Settings** tab, you can give it a name (e.g. `bbc-weather-scraper`) and further customize its settings. We'll skip customizing the settings for now, the defaults should be fine. In the **Source** tab, you can see the files that are at the heart of the Actor. Although there are several of them, just two are important for us now, `main.py` and `requirements.txt`. First we'll start with the `requirements.txt` file. Its purpose is to list all the third-party packages that your Actor will use. We will be using the `requests` package for downloading the BBC Weather pages, and the `beautifulsoup4` package for parsing and processing the downloaded pages. We don't care about versions of these packages, so we list just their names: ``` # Add your dependencies here. # See https://pip.pypa.io/en/latest/cli/pip_install/#requirements-file-format # for how to format them beautifulsoup4 requests ``` ### Writing the code[​](#writing-the-code "Direct link to Writing the code") Finally, we can get to writing the main logic for the Actor, which will live in the `main.py` file. Let's delete everything currently in it and start from an empty file. 
First, we need to import all the packages we will use in the code: ``` from datetime import datetime, time, timedelta, timezone import os import re from apify_client import ApifyClient from bs4 import BeautifulSoup import requests ``` Next, let's set up the locations we want to scrape in a constant for easier reference and, optionally, modification. ``` # Locations which to scrape and their BBC Weather IDs LOCATIONS = [ ('Prague', '3067696'), ('Honolulu', '5856195'), ('New York', '5128581'), ] ``` #### Extracting the data[​](#extracting-the-data "Direct link to Extracting the data") We'll be scraping each location separately. For each location, we need to know in which timezone it resides and what is the first displayed date in the weather forecast for that location. We will scrape each of the 14 forecast days one by one. For each day, we will first download its forecast page using the `requests` library, and then parse the downloaded HTML using the `BeautifulSoup` parser: ``` # List with scraped results weather_data = [] # Scrape each location separately for (location_name, location_id) in LOCATIONS: print(f'Scraping weather from {location_name}...') location_timezone = None first_displayed_date = None for day_offset in range(14): # Get the BBC Weather page for the given location and day and parse it with BeautifulSoup response = requests.get(f'https://www.bbc.com/weather/{location_id}/day{day_offset}') soup = BeautifulSoup(response.content, 'html.parser') ``` When scraping a location, we need to know in which timezone it lies, and what date the first displayed day of the forecast represents. We can find that out at the beginning, when scraping the first day of the forecast for that location. To get the necessary data, we will need to find the elements in which it is contained. Let's use the `soup.find(...)` and `soup.findAll(...)` methods, which find elements matching some specified conditions in the parsed HTML. First, we extract the timezone from the second element with class `wr-c-footer-timezone__item`. The timezone information is described there with a full sentence, but we're only interested in the numerical representation of the timezone offset, so we parse it out using a regular expression. With the timezone offset parsed, we can construct a `timezone` object and from that get the current datetime at the location. Afterwards, we can figure out which date is represented by the first displayed day. We find the element with the class `wr-day--active` containing the header for the currently displayed day. Inside it, we find the element with the title of that day, which has the class `wr-day__title`. This element has the accessibility label containing the actual date of the day in its `aria-label` attribute, but it contains only the day and month and not the year, so we can't use it directly. Instead, to get the full date of the first displayed day, we compare the day from the accessibility label and the day from the current datetime at the location. If they match, we know the first displayed date is the current date at the location. If they don't, we know the first displayed date is the day before the current date at the location. 
```
        # When parsing the first day, find out what day it represents,
        # to know when the results start
        if day_offset == 0:
            # Get the timezone offset written in the page footer and parse it
            tz_description = soup.find_all(class_='wr-c-footer-timezone__item')[1].text
            tz_offset_match = re.search(r'([+-]\d\d)(\d\d)', tz_description)
            tz_offset_hours = int(tz_offset_match.group(1))
            tz_offset_minutes = int(tz_offset_match.group(2))

            # Get the current date and time at the scraped location
            timezone_offset = timedelta(hours=tz_offset_hours, minutes=tz_offset_minutes)
            location_timezone = timezone(timezone_offset)
            location_current_datetime = datetime.now(tz=location_timezone)

            # The times displayed for each day are from 6:00 AM that day to 5:00 AM the next day,
            # so "today" on BBC Weather might actually mean "yesterday" in actual datetime.
            # We have to parse the accessibility label containing the actual date on the header for the first day
            # and compare it with the current date at the location, then adjust the date accordingly
            day_carousel_item = soup.find(class_='wr-day--active')
            day_carousel_title = day_carousel_item.find(class_='wr-day__title')['aria-label']
            website_first_displayed_item_day = int(re.search(r'\d{1,2}', day_carousel_title).group(0))
            if location_current_datetime.day == website_first_displayed_item_day:
                first_displayed_date = location_current_datetime.date()
            else:
                first_displayed_date = location_current_datetime.date() - timedelta(days=1)
```

Now that we've figured out the date of the first displayed day, we can extract the predicted weather from each hour of each forecast day. The forecast for the displayed day is in the element with class `wr-time-slot-container__slots`, and that element contains time slots for each predicted hour represented by elements with the class `wr-time-slot`. In each time slot, the element with the class `wr-time-slot-primary__hours` contains the hour of the time slot. The element with the class `wr-value--temperature--c` contains the temperature in degrees Celsius.

To get the datetime of each slot, we need to combine the date of the first displayed day, the hour displayed in the slot, and the timezone of the currently processed location. Since the page shows the forecast for each day from 6 AM to 5 AM the next day, we need to add one day to the slots from midnight to 5 AM to get the correct datetime. Finally, we can put all the extracted information together and push it to the array holding the resulting data.

```
        # Go through the elements for each displayed time slot of the displayed day
        slot_container = soup.find(class_='wr-time-slot-container__slots')
        for slot in slot_container.find_all(class_='wr-time-slot'):
            # Find out the date and time of the displayed element from the day offset and the displayed hour.
            # The times displayed for each day are from 6:00 AM that day to 5:00 AM the next day,
            # so anything between midnight and 6 AM actually represents the next day
            slot_hour = int(slot.find(class_='wr-time-slot-primary__hours').text)
            slot_datetime = datetime.combine(first_displayed_date, time(hour=slot_hour), tzinfo=location_timezone)
            slot_datetime += timedelta(days=day_offset)
            if slot_hour < 6:
                slot_datetime += timedelta(days=1)

            # Extract the predicted temperature in degrees Celsius
            # (the element text ends with a degree sign, which we strip off)
            slot_temperature = int(slot.find(class_='wr-value--temperature--c').text[:-1])

            # Put the extracted information together and add it to the results
            weather_data.append({
                'datetime': slot_datetime,
                'location': location_name,
                'temperature': slot_temperature,
            })
```

From here, the Actor pushes the collected `weather_data` to a [dataset](https://docs.apify.com/platform/storage/dataset.md) on the Apify platform, which is where the tutorial on processing the data with Pandas picks it up.
---

# Running a web server on the Apify platform

**A web server running in an Actor can act as a communication channel with the outside world. Learn how to set one up with Node.js.**

***

Sometimes, an Actor needs a channel for communication with other systems (or humans). This channel might be used to receive commands, to provide info about progress, or both.

To implement this, we will run a HTTP web server inside the Actor that will provide:

* An API to receive commands.
* An HTML page displaying output data.

Running a web server in an Actor is a piece of cake! Each Actor run is available at a unique URL (container URL) which always takes the form `https://CONTAINER-KEY.runs.apify.net`. This URL is available in the [**Actor run** object](https://docs.apify.com/api/v2/actor-run-get.md) returned by the Apify API, as well as in the Apify console.

If you start a web server on the port defined by the **APIFY\_CONTAINER\_PORT** environment variable (the default value is **4321**), the container URL becomes available and gets displayed in the **Live View** tab in the Actor run console. For more details, see [the documentation](https://docs.apify.com/platform/actors/development/programming-interface/container-web-server.md).

## Building the Actor[​](#building-the-actor "Direct link to Building the Actor")

Let's try to build the following Actor:

* The Actor will provide an API to receive URLs to be processed.
* For each URL, the Actor will create a screenshot.
* The screenshot will be stored in the key-value store.
* The Actor will provide a web page displaying thumbnails linked to screenshots and a HTML form to submit new URLs.

To achieve this we will use the following technologies:

* [Express.js](https://expressjs.com) framework to create the server
* [Puppeteer](https://pptr.dev) to grab screenshots.
* The [Apify SDK](https://docs.apify.com/sdk/js) to access Apify storages to store the screenshots.
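If you want to follow along locally rather than in the Console editor, the dependencies can be installed up front with `npm install apify crawlee express puppeteer` (package names assumed; the Apify Actor templates already list `apify` for you).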
Our server needs two paths:

* `/` - Index path will display a page with a form to submit a new URL and the thumbnails of processed URLs.
* `/add-url` - Will provide an API to add new URLs using an HTTP POST request.

First, we'll import `express` and create an Express.js app. Then, we'll add some middleware that will allow us to receive form submissions.

```
import { Actor } from 'apify';
import express from 'express';

await Actor.init();

const app = express();
app.use(express.json());
app.use(express.urlencoded({ extended: true }));
```

Now we need to read the following environment variables:

* **APIFY\_CONTAINER\_PORT** contains a port number where we must start the server.
* **APIFY\_CONTAINER\_URL** contains a URL under which we can access the container.
* **APIFY\_DEFAULT\_KEY\_VALUE\_STORE\_ID** is the ID of the default key-value store of this Actor where we can store screenshots.

```
const {
    APIFY_CONTAINER_PORT,
    APIFY_CONTAINER_URL,
    APIFY_DEFAULT_KEY_VALUE_STORE_ID,
} = process.env;
```

Next, we'll create an array of the processed URLs where the **n**th URL has its screenshot stored under the key **n**.jpg in the key-value store.

```
const processedUrls = [];
```

After that, the index route is ready to be defined. The markup below is a minimal example - a plain form posting to `/add-url`, followed by a list of screenshot thumbnails.

```
app.get('/', (req, res) => {
    let listItems = '';

    // For each of the processed URLs, render a thumbnail linking to its screenshot
    processedUrls.forEach((url, index) => {
        const imageUrl = `https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/${index}.jpg`;

        // Display the screenshots below the form
        listItems += `<li>
            <a href="${imageUrl}" target="_blank">
                <img src="${imageUrl}" width="300" alt="${url}" />
                <br />
                ${url}
            </a>
        </li>`;
    });

    const pageHtml = `<html>
        <body>
            <form method="POST" action="/add-url">
                URL: <input type="text" name="url" placeholder="https://example.com" />
                <input type="submit" value="Add URL" />
            </form>
            <hr />
            <ul>${listItems}</ul>
        </body>
    </html>`;

    res.send(pageHtml);
});
```

And then a second path that receives the new URL submitted using the HTML form; after the URL is processed, it redirects the user back to the root path.

```
import { launchPuppeteer } from 'crawlee';

app.post('/add-url', async (req, res) => {
    const { url } = req.body;
    console.log(`Got new URL: ${url}`);

    // Start chrome browser and open new page ...
    const browser = await launchPuppeteer();
    const page = await browser.newPage();

    // ... go to our URL and grab a screenshot ...
    await page.goto(url);
    const screenshot = await page.screenshot({ type: 'jpeg' });

    // ... close browser ...
    await page.close();
    await browser.close();

    // ... save screenshot to key-value store and add URL to processedUrls.
    await Actor.setValue(`${processedUrls.length}.jpg`, screenshot, { contentType: 'image/jpeg' });
    processedUrls.push(url);

    res.redirect('/');
});
```

And finally, we need to start the web server.

```
// Start the web server!
app.listen(APIFY_CONTAINER_PORT, () => {
    console.log(`Application is listening at URL ${APIFY_CONTAINER_URL}.`);
});
```

### Final code[​](#final-code "Direct link to Final code")

```
import { Actor } from 'apify';
import { launchPuppeteer } from 'crawlee';
import express from 'express';

await Actor.init();

const app = express();
app.use(express.json());
app.use(express.urlencoded({ extended: true }));

const {
    APIFY_CONTAINER_PORT,
    APIFY_CONTAINER_URL,
    APIFY_DEFAULT_KEY_VALUE_STORE_ID,
} = process.env;

const processedUrls = [];

app.get('/', (req, res) => {
    let listItems = '';

    // For each of the processed URLs, render a thumbnail linking to its screenshot
    processedUrls.forEach((url, index) => {
        const imageUrl = `https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/${index}.jpg`;

        // Display the screenshots below the form
        listItems += `<li>
            <a href="${imageUrl}" target="_blank">
                <img src="${imageUrl}" width="300" alt="${url}" />
                <br />
                ${url}
            </a>
        </li>`;
    });

    const pageHtml = `<html>
        <body>
            <form method="POST" action="/add-url">
                URL: <input type="text" name="url" placeholder="https://example.com" />
                <input type="submit" value="Add URL" />
            </form>
            <hr />
            <ul>${listItems}</ul>
        </body>
    </html>`;

    res.send(pageHtml);
});

app.post('/add-url', async (req, res) => {
    const { url } = req.body;
    console.log(`Got new URL: ${url}`);

    // Start chrome browser and open new page ...
    const browser = await launchPuppeteer();
    const page = await browser.newPage();

    // ... go to our URL and grab a screenshot ...
    await page.goto(url);
    const screenshot = await page.screenshot({ type: 'jpeg' });

    // ... close browser ...
    await page.close();
    await browser.close();

    // ... save screenshot to key-value store and add URL to processedUrls.
    await Actor.setValue(`${processedUrls.length}.jpg`, screenshot, { contentType: 'image/jpeg' });
    processedUrls.push(url);

    res.redirect('/');
});

app.listen(APIFY_CONTAINER_PORT, () => {
    console.log(`Application is listening at URL ${APIFY_CONTAINER_URL}.`);
});
```

When we deploy and run this Actor on the Apify platform, we can open the **Live View** tab in the Actor run console and submit a URL to the Actor through the form. After the URL is successfully submitted, it appears in the Actor log.

With that, we're done! And our application works like a charm :) The complete code of this Actor is available on its Store [page](https://apify.com/apify/example-web-server/source-code). You can run it there or copy it to your account.

---

# Web scraping basics for JavaScript devs

**Learn how to use JavaScript to extract information from websites in this practical course, starting from the absolute basics.**

***

In this course we'll use JavaScript to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.

![E-commerce listing on the left, JSON with data on the right](/assets/images/scraping-7add65f10b823af8c68c3f30a39dd679.webp)

## What we'll do[​](#what-well-do "Direct link to What we'll do")

* Inspect pages using browser DevTools.
* Download web pages using the Fetch API.
* Extract data from web pages using the Cheerio library.
* Save extracted data in various formats (e.g. CSV which MS Excel or Google Sheets can open) using the json2csv library.
* Follow links programmatically (crawling).
* Save time and effort with frameworks, such as Crawlee, and scraping platforms, such as Apify.

## Who this course is for[​](#who-this-course-is-for "Direct link to Who this course is for")

Anyone with basic knowledge of developing programs in JavaScript who wants to start with web scraping can take this course. The course does not expect you to have any prior knowledge of other web technologies or scraping.
## Requirements[​](#requirements "Direct link to Requirements") * A macOS, Linux, or Windows machine with a web browser and Node.js installed. * Familiarity with JavaScript basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions. * Comfort with building a Node.js package and installing dependencies with `npm`. * Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows). ## You may want to know[​](#you-may-want-to-know "Direct link to You may want to know") Let's explore the key reasons to take this course. What is web scraping good for, and what career opportunities does it enable for you? ### Why learn scraping[​](#why-learn-scraping "Direct link to Why learn scraping") The internet is full of useful data, but most of it isn't offered in a structured way that's easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them. Scraper development is also a fun and challenging way to learn web development, web technologies, and understand the internet. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more. ### Why build your own scrapers[​](#why-build-your-own-scrapers "Direct link to Why build your own scrapers") Scrapers are programs specifically designed to mine data from the internet. Point-and-click or no-code scraping solutions do exist, but they only take you so far. While simple to use, they lack the flexibility and optimization needed to handle advanced cases. Only custom-built scrapers can tackle more difficult challenges. And unlike ready-made solutions, they can be fine-tuned to perform tasks more efficiently, at a lower cost, or with greater precision. ### Why become a scraper dev[​](#why-become-a-scraper-dev "Direct link to Why become a scraper dev") As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API! Here are some things you can do if you understand scraping: * Improve your productivity by building personal tools, such as your own real estate or rare sneakers watchdog. * Companies can hire you to build custom scrapers mining data important for their business. * Become an invaluable asset to data journalism, data science, or nonprofit teams working to make the world a better place. * You can publish your scrapers on platforms like the [Apify Store](https://apify.com/store) and earn money by renting them out to others. ### Why learn with Apify[​](#why-learn-with-apify "Direct link to Why learn with Apify") We are [Apify](https://apify.com), a web scraping and automation platform. We do our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how a scraping platform can simplify your life, but that lesson is optional and designed to fit within our [free tier](https://apify.com/pricing). 
## Course content[​](#course-content "Direct link to Course content") --- # Crawling websites with Node.js **In this lesson, we'll follow links to individual product pages. We'll use the Fetch API to download them and Cheerio to process them.** *** In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products. Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. This is what you should see in your editor now: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json import csv from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() html_code = response.text return BeautifulSoup(html_code, "html.parser") def parse_product(product, base_url): title_element = product.select_one(".product-item__title") title = title_element.text.strip() url = urljoin(base_url, title_element["href"]) price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price return {"title": title, "min_price": min_price, "price": price, "url": url} def export_csv(file, data): fieldnames = list(data[0].keys()) writer = csv.DictWriter(file, fieldnames=fieldnames) writer.writeheader() for row in data: writer.writerow(row) def export_json(file, data): def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") json.dump(data, file, default=serialize, indent=2) listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) data.append(item) with open("products.csv", "w") as file: export_csv(file, data) with open("products.json", "w") as file: export_json(file, data) ``` ## Extracting vendor name[​](#extracting-vendor-name "Direct link to Extracting vendor name") Each product URL points to a so-called *product detail page*, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more. ![Product detail page](/assets/images/pdp-5399b26afde645f8ab2426f24cef87b9.png) Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure: ``` Sony XBR-950G BRAVIA 4K HDR Ultra HD TV ... Sony SKU: SON-985594-XBR-65 ... 3 reviews ... ``` It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string: ``` vendor = product_soup.select_one(".product-meta__vendor").text.strip() ``` But where do we put this line in our program? 
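We'll answer that in the next section. As a quick aside for JavaScript readers: the listing above is written in Python with BeautifulSoup, but in the JavaScript and Cheerio stack this course otherwise uses, the same vendor extraction would look roughly like the sketch below. It assumes Node.js 18+ (with its built-in `fetch()`) and the `cheerio` package installed:

```
import * as cheerio from 'cheerio';

// Product detail page mentioned above, used only as an example
const url = 'https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv';

const response = await fetch(url);
if (!response.ok) {
  throw new Error(`HTTP ${response.status}`);
}

// Load the downloaded HTML and read the text of the element
// with the product-meta__vendor class
const $ = cheerio.load(await response.text());
const vendor = $('.product-meta__vendor').text().trim();
console.log(vendor); // prints "Sony" at the time of writing
```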
## Crawling product detail pages[​](#crawling-product-detail-pages "Direct link to Crawling product detail pages") In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary: ``` ... listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) product_soup = download(item["url"]) item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip() data.append(item) ... ``` If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name: ``` [ { "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95", "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", "vendor": "JBL" }, { "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null, "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv", "vendor": "Sony" }, ... ] ``` ## Extracting price[​](#extracting-price "Direct link to Extracting price") Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we’re building a Python app to track prices! Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs… ![Morpheus revealing the existence of product variants](/assets/images/variants-32d39ca999dadb22954e83b0dedd782a.png) In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape calling codes of African countries[​](#scrape-calling-codes-of-african-countries "Direct link to Scrape calling codes of African countries") This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the *calling code* from the info table. Print the URL and the calling code for each country. 
Start with this URL: ``` https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa ``` Your program should print the following: ``` https://en.wikipedia.org/wiki/Algeria +213 https://en.wikipedia.org/wiki/Angola +244 https://en.wikipedia.org/wiki/Benin +229 https://en.wikipedia.org/wiki/Botswana +267 https://en.wikipedia.org/wiki/Burkina_Faso +226 https://en.wikipedia.org/wiki/Burundi None https://en.wikipedia.org/wiki/Cameroon +237 ... ``` Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup. Solution ``` import httpx from bs4 import BeautifulSoup from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() return BeautifulSoup(response.text, "html.parser") def parse_calling_code(soup): for label in soup.select("th.infobox-label"): if label.text.strip() == "Calling code": data = label.parent.select_one("td.infobox-data") return data.text.strip() return None listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" listing_soup = download(listing_url) for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"): link = name_cell.select_one("a") country_url = urljoin(listing_url, link["href"]) country_soup = download(country_url) calling_code = parse_calling_code(country_soup) print(country_url, calling_code) ``` ### Scrape authors of F1 news articles[​](#scrape-authors-of-f1-news-articles "Direct link to Scrape authors of F1 news articles") This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like this: ``` Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire? Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1 Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying ... ``` Hints: * You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on their attribute values. * Sometimes a person authors the article, but other times it's contributed by a news agency. 
Solution ``` import httpx from bs4 import BeautifulSoup from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() return BeautifulSoup(response.text, "html.parser") def parse_author(article_soup): link = article_soup.select_one('aside a[rel="author"]') if link: return link.text.strip() address = article_soup.select_one('aside address') if address: return address.text.strip() return None listing_url = "https://www.theguardian.com/sport/formulaone" listing_soup = download(listing_url) for item in listing_soup.select("#maincontent ul li"): link = item.select_one("a") article_url = urljoin(listing_url, link["href"]) article_soup = download(article_url) title = article_soup.select_one("h1").text.strip() author = parse_author(article_soup) print(f"{author}: {title}") ``` --- # Extracting data from a web page with browser DevTools **In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.** *** In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data. Now how do we extract the data? ## Finding product details[​](#finding-product-details "Direct link to Finding product details") Previously, we've figured out how to save the subwoofer product card to a variable in the **Console**: ``` products = document.querySelectorAll('.product-item'); subwoofer = products[2]; ``` The product details are within the element as text, so maybe if we extract the text, we could work out the individual values? ``` subwoofer.textContent; ``` That indeed outputs all the text, but in a form which would be hard to break down to relevant pieces. ![Printing text content of the parent element](/assets/images/devtools-extracting-text-67926f7b912635ebd5518af391019c91.png) We'll need to first locate relevant child elements and extract the data from each of them individually. ## Extracting title[​](#extracting-title "Direct link to Extracting title") We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. From those the `product-item__title` seems like a great choice to locate the element. ![Finding child elements](/assets/images/devtools-product-details-83a62a55f65911b057ec16c4c7cde7b6.png) Browser JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Among properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element: ``` title = subwoofer.querySelector('.product-item__title'); title.textContent; ``` Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title: ![Extracting product title](/assets/images/devtools-extracting-title-a40e8cdc7afe941acf105ada55a1ef1f.png) ## Extracting price[​](#extracting-price "Direct link to Extracting price") To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. 
Both are `span` elements with the `price` class.

![Finding child elements](/assets/images/devtools-product-details-83a62a55f65911b057ec16c4c7cde7b6.png)

We could either rely on the fact that the sale price is likely always the highlighted one, or that it's always the first price. For now, we'll rely on the latter and let `querySelector()` simply return the first result:

```
price = subwoofer.querySelector('.price');
price.textContent;
```

It works, but the price isn't alone in the result. Before we'd use such data, we'd need to do some **data cleaning**:

![Extracting product price](/assets/images/devtools-extracting-price-5886f33b1f72fe38581cfe8521154987.png)

But for now that's okay. We're just testing the waters, so that we have an idea of what our scraper will need to do. Once we get to extracting prices in Node.js, we'll figure out how to get the values as numbers.

In the next lesson, we'll start our Node.js project. First, we'll figure out how to download the Sales page without a browser and make it accessible to a Node.js program.

***

## Exercises[​](#exercises "Direct link to Exercises")

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

Real world

You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues).

### Extract the price of IKEA's most expensive artificial plant[​](#extract-the-price-of-ikeas-most-expensive-artificial-plant "Direct link to Extract the price of IKEA's most expensive artificial plant")

At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML element manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use the [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number.

Solution

1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/).
2. Sort the products by price, from high to low, so the most expensive plant appears first in the listing.
3. Activate the element selection tool in your DevTools.
4. Click on the price of the first and most expensive plant.
5. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
6. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
7. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
8. Convert the price text into a number by executing `parseInt(price.textContent)`.
9. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
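Put together, the Console commands from the solution above come down to just two lines. This assumes the listing is already sorted from the most expensive product, as described in step 2:

```
price = document.querySelector('.plp-price__integer');
parseInt(price.textContent); // 699 at the time of writing
```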
### Extract the name of the top wiki on Fandom Movies[​](#extract-the-name-of-the-top-wiki-on-fandom-movies "Direct link to Extract the name of the top wiki on Fandom Movies") On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use the [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name. ![Fandom\'s Movies page](/assets/images/devtools-exercise-fandom-2ca7a8be3f2ec2f4cb54aeb459a20591.png) Solution 1. Open the [Movies page](https://www.fandom.com/topics/movies). 2. Activate the element selection tool in your DevTools. 3. Click on the list item for the top Fandom wiki in the category. 4. Notice that it has a class `topic_explore-wikis__link`. 5. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. They use the selector only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done. 6. Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`. 7. Get the element's text without extra white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`. ### Extract details about the first post on Guardian's F1 news[​](#extract-details-about-the-first-post-on-guardians-f1-news "Direct link to Extract details about the first post on Guardian's F1 news") On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo. ![F1 news page](/assets/images/devtools-exercise-guardian2-387ac6e6789398edf8ec27dcf214d0a0.png) Solution 1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone). 2. Activate the element selection tool in your DevTools. 3. Click on the first post. 4. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead. 5. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post. 6. Extract the post's title by executing `post.querySelector('h3').textContent`. 7. Extract the lead paragraph by executing `post.querySelector('span div').textContent`. 8. Extract the photo URL by executing `post.querySelector('img').src`. --- # Inspecting web pages with browser DevTools **In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.** *** A browser is the most complete tool for navigating websites. Scrapers are like automated browsers—and sometimes, they actually are automated browsers. The key difference? There's no user to decide where to go or eyes to see what's displayed. Everything has to be pre-programmed. All modern browsers provide developer tools, or *DevTools*, for website developers to debug their work. We'll use them to understand how websites are structured and identify the behavior our scraper needs to mimic. Here's the typical workflow for creating a scraper: 1. 
Inspect the target website in DevTools to understand its structure and determine how to extract the required data. 2. Translate those findings into code. 3. If the scraper fails due to overlooked edge cases or, over time, due to website changes, go back to step 1. Now let's spend some time figuring out what the detective work in step 1 is about. ## Opening DevTools[​](#opening-devtools "Direct link to Opening DevTools") Google Chrome is currently the most popular browser, and many others use the same core. That's why we'll focus on [Chrome DevTools](https://developer.chrome.com/docs/devtools) here. However, the steps are similar in other browsers, as Safari has its [Web Inspector](https://developer.apple.com/documentation/safari-developer-tools/web-inspector) and Firefox also has [DevTools](https://firefox-source-docs.mozilla.org/devtools-user/). Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**. ![Wikipedia with Chrome DevTools open](/assets/images/devtools-wikipedia-912f0473b3c31f441ab1659205bd1e08.png) Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page: ![Elements tab in Chrome DevTools](/assets/images/devtools-elements-tab-19ef2bf359464d39570f06f182dbc92e.png) Screen adaptations DevTools may appear differently depending on your screen size. For instance, on smaller screens, the CSS panel might move below the HTML elements panel instead of appearing in the right pane. Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) elements as the frame that defines a page's structure. A basic HTML element includes an opening tag, a closing tag, and attributes. Here's an `article` element with an `id` attribute. It wraps `h1` and `p` elements, both containing text. Some text is emphasized using `em`. ``` First Level Heading Paragraph with emphasized text. ``` HTML, a markup language, describes how everything on a page is organized, how elements relate to each other, and what they mean. It doesn't define how elements should look—that's where [CSS](https://developer.mozilla.org/en-US/docs/Learn/CSS) comes in. CSS is like the velvet covering the frame. Using styles, we can select elements and assign rules that tell the browser how they should appear. For instance, we can style all elements with `heading` in their `class` attribute to make the text blue and uppercase. ``` .heading { color: blue; text-transform: uppercase; } ``` While HTML and CSS describe what the browser should display, JavaScript adds interaction to the page. In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press `ESC` to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we’ll try this shortly. ![Console in Chrome DevTools](/assets/images/devtools-console-4ef1db79ae90e6fa619d2dc5ccc86b57.png) ## Selecting an element[​](#selecting-an-element "Direct link to Selecting an element") In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square. ![Chrome DevTools element selection tool](/assets/images/devtools-element-selection-58f754a14f1c856aae8960432f7ebe73.png) We'll click the icon and hover your cursor over Wikipedia's subtitle, **The Free Encyclopedia**. 
As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle. ![Chrome DevTools element hover](/assets/images/devtools-hover-c780c2944cc8718fc2131d83e1b0b1e3.png) The highlighted section should look something like this: ``` The Free Encyclopedia ``` If we were experienced creators of scrapers, our eyes would immediately spot what's needed to make a program that fetches Wikipedia's subtitle. The program would need to download the page's source code, find a `strong` element with `localized-slogan` in its `class` attribute, and extract its text. HTML and whitespace In HTML, whitespace isn't significant, i.e., it only makes the code readable. The following code snippets are equivalent: ``` The Free Encyclopedia ``` ``` The Free Encyclopedia ``` ## Interacting with an element[​](#interacting-with-an-element "Direct link to Interacting with an element") We won't be creating Node.js scrapers just yet. Let's first get familiar with what we can do in the DevTools console and how we can further interact with HTML elements on the page. In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready. ![Global variable in Chrome DevTools Console](/assets/images/devtools-console-variable-e7bc489b7be25174922e4a1880217dba.png) The Console allows us to run code in the context of the loaded page. We can use it to play around with elements. For a start, let's access some of the subtitle's properties. One such property is `textContent`, which contains the text inside the HTML element. The last line in the Console is where your cursor is. We'll type the following and hit **Enter**: ``` temp1.textContent; ``` The result should be `'The Free Encyclopedia'`. Now let's try this: ``` temp1.outerHTML; ``` This should return the element's HTML tag as a string. Finally, we'll run the next line to change the text of the element: ``` temp1.textContent = 'Hello World!'; ``` When we change elements in the Console, those changes reflect immediately on the page! ![Changing textContent in Chrome DevTools Console](/assets/images/devtools-console-textcontent-24a921cedf18c995feac29a2120ad52e.png) But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence. We're not here for playing around with elements, though—we want to create a scraper for an e-commerce website to watch prices. In the next lesson, we'll examine the website and use CSS selectors to locate HTML elements containing the data we need. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). 
### Find FIFA logo[​](#find-fifa-logo "Direct link to Find FIFA logo") Open the [FIFA website](https://www.fifa.com/) and use the DevTools to figure out the URL of FIFA's logo image file. Hint: You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img) element with a `src` attribute. Solution 1. Go to [fifa.com](https://www.fifa.com/). 2. Activate the element selection tool. 3. Click on the logo. 4. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu. 5. In the console, type `temp1.src` and hit **Enter**. ![DevTools exercise result](/assets/images/devtools-exercise-fifa-b92e335eb6684698a954601ef59dcb0f.png) ### Make your own news[​](#make-your-own-news "Direct link to Make your own news") Open a news website, such as [CNN](https://cnn.com). Use the Console to change the headings of some articles. Solution 1. Go to [cnn.com](https://cnn.com). 2. Activate the element selection tool. 3. Click on a heading. 4. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu. 5. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**. ![DevTools exercise result](/assets/images/devtools-exercise-cnn-0eb495bb32a2a11eb795e83096b65949.png) --- # Locating HTML elements on a web page with browser DevTools **In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.** *** Inspecting Wikipedia and tweaking its subtitle is fun, but let's shift gears and focus on building an app to track prices on an e-commerce site. As part of the groundwork, let's check out the site we'll be working with. ## Meeting the Warehouse store[​](#meeting-the-warehouse-store "Direct link to Meeting the Warehouse store") Instead of artificial scraping playgrounds or sandboxes, we'll scrape a real e-commerce site. Shopify, a major e-commerce platform, has a demo store at [warehouse-theme-metal.myshopify.com](https://warehouse-theme-metal.myshopify.com/). It strikes a good balance between being realistic and stable enough for a tutorial. Our scraper will track prices for all products listed on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). Balancing authenticity and stability Live sites like Amazon are complex, loaded with promotions, frequently changing, and equipped with anti-scraping measures. While those challenges are manageable, they're advanced topics. For this beginner course, we're sticking to a lightweight, stable environment. That said, we designed all the additional exercises to work with live websites. This means occasional updates might be needed, but we think it's worth it for a more authentic learning experience. ## Finding a product card[​](#finding-a-product-card "Direct link to Finding a product card") As mentioned in the previous lesson, before building a scraper, we need to understand structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). ![Warehouse store with DevTools open](/assets/images/devtools-warehouse-193f0152a0cd14df5068bc13512c31ee.png) The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it. 
![Selecting an element with DevTools](/assets/images/devtools-product-title-29537d86966f0c3ae781b4cf8d53ef7e.png) Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more. In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**. ![Selecting an element with hover](/assets/images/devtools-hover-product-72db0f66037c498f4b84e2405cc5e80c.png) At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do. Scrapers typically rely on [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors) to locate elements on a page, and these selectors often target elements based on their `class` attributes. The product card we highlighted has markup like this: ``` ... ``` The `class` attribute can hold multiple values separated by whitespace. This particular element has four classes. Let's move to the **Console** and experiment with CSS selectors to locate this element. ## Programmatically locating a product card[​](#programmatically-locating-a-product-card "Direct link to Programmatically locating a product card") Let's jump into the **Console** and write some code. In browsers, JavaScript represents the current page as the [`Document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) object, accessible via `document`. This object offers many useful methods, including [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector). This method takes a CSS selector as a string and returns the first HTML element that matches. We'll try typing this into the **Console**: ``` document.querySelector('.product-item'); ``` It will return the HTML element for the first product card in the listing: ![Using querySelector() in DevTools Console](/assets/images/devtools-queryselector-803d0a68e972691c71ad6551e8ab27d6.webp) CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine. The [type selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors) matches elements by tag name. For example, `h1` would match the highlighted element: ``` Title Paragraph. ``` The [class selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Class_selectors) matches elements based on their class attribute. For instance, `.heading` (note the dot) would match the following: ``` Title Subtitle Paragraph Heading ``` You can combine selectors to narrow results. For example, `p.lead` matches `p` elements with the `lead` class, but not `p` elements without the class or elements with the class but a different tag name: ``` Lead paragraph. Paragraph Paragraph ``` How did we know `.product-item` selects a product card? By inspecting the markup of the product card element. After checking its classes, we chose the one that best fit our purpose. Testing in the **Console** confirmed it—selecting by the most descriptive class worked. ## Choosing good selectors[​](#choosing-good-selectors "Direct link to Choosing good selectors") Multiple approaches often exist for creating a CSS selector that targets the element we want. 
We should pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. We better avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning. The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`. Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules. This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after. ![Overview of all the product cards in DevTools](/assets/images/devtools-product-list-9ebeb190d65fc7f7ae765caaa6eb128b.png) ## Locating all product cards[​](#locating-all-product-cards "Direct link to Locating all product cards") In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list. ![Highlighting a querySelector() result](/assets/images/devtools-hover-queryselector-747bedb6133e1d9919c10bdb0e6cc599.png) But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**: ``` document.querySelectorAll('.product-item'); ``` The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/Web/API/NodeList), a collection of nodes. Browsers understand an HTML document as a tree of nodes. Most nodes are HTML elements, but there are also text nodes for plain text, and others. We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer! ![Highlighting a querySelectorAll() result](/assets/images/devtools-hover-queryselectorall-b2c49b9d01f9bf3a25ee797b0e652896.png) To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with regular JavaScript arrays: ``` products = document.querySelectorAll('.product-item'); subwoofer = products[2]; ``` Even though we're just playing in the browser's **Console**, we're inching closer to figuring out what our Node.js program will need to do. In the next lesson, we'll dive into accessing child elements and extracting product details. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. 
If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Locate headings on Wikipedia's Main Page[​](#locate-headings-on-wikipedias-main-page "Direct link to Locate headings on Wikipedia's Main Page") On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones). ![Wikipedia\'s Main Page headings](/assets/images/devtools-exercise-wikipedia-5d47de5c50985ec7cc87b3a220f9d14c.png) Solution 1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page). 2. Activate the element selection tool in your DevTools. 3. Click on several headings to examine the markup. 4. Notice that all headings are `h2` elements with the `mp-h2` class. 5. In the **Console**, execute `document.querySelectorAll('h2')`. 6. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is. ### Locate products on Shein[​](#locate-products-on-shein "Direct link to Locate products on Shein") Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products. ![Products in Shein\'s Jewelry \& Accessories category](/assets/images/devtools-exercise-shein-e289fb63ac18c6aa8ea8ed48e6ffd805.png) Solution 1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions. 2. Activate the element selection tool in your DevTools. 3. Click on the first product to inspect its markup. Repeat with a few others. 4. Observe that all products are `section` elements with multiple classes, including `product-card`. 5. Since `section` is a generic wrapper, focus on the `product-card` class. 6. In the **Console**, execute `document.querySelectorAll('.product-card')`. 7. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary. ### Locate articles on Guardian[​](#locate-articles-on-guardian "Direct link to Locate articles on Guardian") Go to Guardian's [page about F1](https://www.theguardian.com/sport/formulaone). Use the **Console** to find all HTML elements representing the articles. Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator). ![Articles on Guardian\'s page about F1](/assets/images/devtools-exercise-guardian1-f3587204f6314c9a25a1955f94420ebc.png) Solution 1. Open the [page about F1](https://www.theguardian.com/sport/formulaone). 2. Activate the element selection tool in your DevTools. 3. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards. 4. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable. 5. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links. 6. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`. 7. In the **Console**, execute `document.querySelectorAll('main li')`. 8. At the time of writing, this selector returns 21 results. 
All appear to represent articles, so the solution works! --- # Downloading HTML with Node.js **In this lesson we'll start building a Node.js application for watching prices. As a first step, we'll use the Fetch API to download HTML code of a product listing page.** *** Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a JavaScript program which downloads HTML code of the product listing. ## Starting a Node.js project[​](#starting-a-nodejs-project "Direct link to Starting a Node.js project") Before we start coding, we need to set up a Node.js project. Let's create new directory and let's name it `product-scraper`. Inside the directory, we'll initialize new project: ``` $ npm init This utility will walk you through creating a package.json file. ... Press ^C at any time to quit. package name: (product-scraper) version: (1.0.0) description: Product scraper entry point: (index.js) test command: git repository: keywords: author: license: (ISC) type: (commonjs) module About to write to /Users/.../product-scraper/package.json: { "name": "product-scraper", "version": "1.0.0", "description": "Product scraper", "main": "index.js", "scripts": { "test": "echo \"Error: no test specified\" && exit 1" }, "author": "", "license": "ISC", "type": "module" } ``` The above creates a `package.json` file with configuration of our project. While most of the values are arbitrary, it's important that the project's type is set to `module`. Now let's test that all works. Inside the project directory we'll create a new file called `index.js` with the following code: ``` import process from 'node:process'; console.log(`All is OK, ${process.argv[2]}`); ``` Running it as a Node.js program will verify that our setup is okay and we've correctly set the type to `module`. The program takes a single word as an argument and will address us with it, so let's pass it "mate", for example: ``` $ node index.js mate All is OK, mate ``` Troubleshooting If you see errors or are otherwise unable to run the code above, it likely means your environment isn't set up correctly. Unfortunately, diagnosing the issue is out of scope for this course. Make sure that in your `package.json` the type property is set to `module`, otherwise you'll get the following warning: ``` [MODULE_TYPELESS_PACKAGE_JSON] Warning: Module type of file:///Users/.../product-scraper/index.js is not specified and it doesn't parse as CommonJS. Reparsing as ES module because module syntax was detected. This incurs a performance overhead. To eliminate this warning, add "type": "module" to /Users/.../product-scraper/package.json. ``` In older versions of Node.js, you may even encounter this error: ``` SyntaxError: Cannot use import statement outside a module ``` ## Downloading product listing[​](#downloading-product-listing "Direct link to Downloading product listing") Now onto coding! Let's change our code so it downloads HTML of the product listing instead of printing `All is OK`. The [documentation of the Fetch API](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch) provides us with examples how to use it. Inspired by those, our code will look like this: ``` const url = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const response = await fetch(url); console.log(await response.text()); ``` Asynchronous flow First time you see `await`? It's a modern syntax for working with promises. 
See the [JavaScript Asynchronous Programming and Callbacks](https://nodejs.org/en/learn/asynchronous-work/javascript-asynchronous-programming-and-callbacks) and [Discover Promises in Node.js](https://nodejs.org/en/learn/asynchronous-work/discover-promises-in-nodejs) tutorials in the official Node.js documentation for more.

If we run the program now, it should print the downloaded HTML:

```
$ node index.js
Sales
...
```

Running `await fetch(url)`, we made an HTTP request and received a response. It's not particularly useful yet, but it's a good start for our scraper.

Client and server, request and response

HTTP is a network protocol powering the internet. Understanding it well is an important foundation for successful scraping, but for this course, it's enough to know just the basic flow and terminology:

* HTTP is an exchange between two participants.
* The *client* sends a *request* to the *server*, which replies with a *response*.
* In our case, `index.js` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.

## Handling errors[​](#handling-errors "Direct link to Handling errors")

Websites can return various errors, such as when the server is temporarily down, is applying anti-scraping protections, or is simply buggy. In HTTP, each response has a three-digit *status code* that indicates whether it is an error or a success.

All status codes

If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter. For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource.

A robust scraper skips or retries requests on errors. Given the complexity of this task, it's best to use libraries or frameworks. For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error.

First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available:

```
https://warehouse-theme-metal.myshopify.com/does/not/exist
```

We could check the value of `response.status` against a list of allowed numbers, but the Fetch API already provides `response.ok`, a property which returns `false` if our request wasn't successful:

```
const url = "https://warehouse-theme-metal.myshopify.com/does/not/exist";
const response = await fetch(url);

if (response.ok) {
  console.log(await response.text());
} else {
  throw new Error(`HTTP ${response.status}`);
}
```

If you run the code above, the program should crash:

```
$ node index.js
file:///Users/.../index.js:7
    throw new Error(`HTTP ${response.status}`);
          ^

Error: HTTP 404
    at file:///Users/.../index.js:7:9
    at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
```

Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML.

***

## Exercises[​](#exercises "Direct link to Exercises")

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away.
Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape AliExpress[​](#scrape-aliexpress "Direct link to Scrape AliExpress") Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results: ``` https://www.aliexpress.com/w/wholesale-darth-vader.html ``` Solution ``` const url = "https://www.aliexpress.com/w/wholesale-darth-vader.html"; const response = await fetch(url); if (response.ok) { console.log(await response.text()); } else { throw new Error(`HTTP ${response.status}`); } ``` ### Save downloaded HTML as a file[​](#save-downloaded-html-as-a-file "Direct link to Save downloaded HTML as a file") Download HTML, then save it on your disk as a `products.html` file. You can use the URL we've been already playing with: ``` https://warehouse-theme-metal.myshopify.com/collections/sales ``` Solution Right in your Terminal or Command Prompt, you can create files by *redirecting output* of command line programs: ``` node index.js > products.html ``` If you want to use Node.js instead, it offers several ways how to create files. The solution below uses the [Promises API](https://nodejs.org/api/fs.html#promises-api): ``` import { writeFile } from 'node:fs/promises'; const url = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const response = await fetch(url); if (response.ok) { const html = await response.text(); await writeFile('products.html', html); } else { throw new Error(`HTTP ${response.status}`); } ``` ### Download an image as a file[​](#download-an-image-as-a-file "Direct link to Download an image as a file") Download a product image, then save it on your disk as a file. While HTML is *textual* content, images are *binary*. You may want to scan through the [Fetch API documentation](https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch#reading_the_response_body) and the [Writing files with Node.js](https://nodejs.org/en/learn/manipulating-files/writing-files-with-nodejs) tutorial for guidance. Especially check `Response.arrayBuffer()`. You can use this URL pointing to an image of a TV: ``` https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg ``` Solution Node.js offers several ways how to create files. The solution below uses [Promises API](https://nodejs.org/api/fs.html#promises-api): ``` import { writeFile } from 'node:fs/promises'; const url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg"; const response = await fetch(url); if (response.ok) { const buffer = Buffer.from(await response.arrayBuffer()); await writeFile('tv.jpg', buffer); } else { throw new Error(`HTTP ${response.status}`); } ``` --- # Extracting data from HTML with Node.js **In this lesson we'll finish extracting product data from the downloaded HTML. With help of basic string manipulation we'll focus on cleaning and correctly representing the product price.** *** Locating the right HTML elements is the first step of a successful data extraction, so it's no surprise that we're already close to having the data in the correct form. 
The last bit that still requires our attention is the price:

```
$ node index.js
JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00
...
```

Let's summarize what stands in our way if we want to have it in our JavaScript program as a number:

* A dollar sign precedes the number,
* the number contains decimal commas for better human readability, and
* some prices start with `From`, which reveals there is a certain complexity in how the shop deals with prices.

## Representing price[​](#representing-price "Direct link to Representing price")

The last bullet point is the most important to figure out before we start coding. We thought we'd be scraping numbers, but in the middle of our effort, we discovered that the price is actually a range. It's because some products have variants with different prices. Later in the course we'll get to crawling, i.e. following links and scraping data from more than just one page. That will allow us to get exact prices for all the products, but for now let's extract just what's in the listing.

Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix?

```
const priceText = price.text().replace("From ", "");
```

In other cases, they'd tell us the data must include the range. And in cases where we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. If we don't know the exact price, we leave it empty:

```
const priceRange = { minPrice: null, price: null };
const priceText = price.text();

if (priceText.startsWith("From ")) {
  priceRange.minPrice = priceText.replace("From ", "");
} else {
  priceRange.minPrice = priceText;
  priceRange.price = priceRange.minPrice;
}
```

Built-in string methods

If you're not proficient in JavaScript's string methods, [.startsWith()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/startsWith) checks the beginning of a given string, and [.replace()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) changes part of a given string.
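If these methods are new to you, here's a tiny, self-contained example of how they behave on one of the prices from the listing above (illustrative only; the value is taken from the sample output):

```
const example = "From $1,398.00";
console.log(example.startsWith("From "));   // true
console.log(example.replace("From ", ""));  // "$1,398.00"
```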
The whole program would look like this: ``` import * as cheerio from 'cheerio'; const url = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const response = await fetch(url); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); $(".product-item").each((i, element) => { const productItem = $(element); const title = productItem.find(".product-item__title"); const titleText = title.text(); const price = productItem.find(".price").contents().last(); const priceRange = { minPrice: null, price: null }; const priceText = price.text(); if (priceText.startsWith("From ")) { priceRange.minPrice = priceText.replace("From ", ""); } else { priceRange.minPrice = priceText; priceRange.price = priceRange.minPrice; } console.log(`${titleText} | ${priceRange.minPrice} | ${priceRange.price}`); }); } else { throw new Error(`HTTP ${response.status}`); } ``` ## Removing white space[​](#removing-white-space "Direct link to Removing white space") Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_\(typesetting\)#Indentation_in_programming) of the HTML tags. We call the operation of removing whitespace *trimming* or *stripping*, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add JavaScript's built-in [.trim()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim): ``` const titleText = title.text().trim(); const priceText = price.text().trim(); ``` ## Removing dollar sign and commas[​](#removing-dollar-sign-and-commas "Direct link to Removing dollar sign and commas") We got rid of the `From` and possible whitespace, but we still can't save the price as a number in our JavaScript program: ``` > const priceText = "$1,998.00" > parseFloat(priceText) NaN ``` Interactive JavaScript The demonstration above is inside the Node.js [interactive REPL](https://nodejs.org/en/learn/command-line/how-to-use-the-nodejs-repl). It's similar to running arbitrary code in your browser's DevTools Console, and it's a useful playground where you can try out how code behaves before you use it in your program. We need to remove the dollar sign and the commas. For this type of cleaning, [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) are often the best tool for the job, but in this case [`.replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) is also sufficient: ``` const priceText = price .text() .trim() .replace("$", "") .replace(",", ""); ``` ## Representing money in programs[​](#representing-money-in-programs "Direct link to Representing money in programs") Now we should be able to add `parseFloat()`, so that we have the prices not as text, but as numbers: ``` const priceRange = { minPrice: null, price: null }; const priceText = price.text(); if (priceText.startsWith("From ")) { priceRange.minPrice = parseFloat(priceText.replace("From ", "")); } else { priceRange.minPrice = parseFloat(priceText); priceRange.price = priceRange.minPrice; } ``` Great! If only we hadn't overlooked an important pitfall called [floating-point error](https://en.wikipedia.org/wiki/Floating-point_error_mitigation).
In short, computers save floating point numbers in a way which isn't always reliable: ``` > 0.1 + 0.2 0.30000000000000004 ``` These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. We won't store dollars, but cents: ``` const priceText = price .text() .trim() .replace("$", "") .replace(".", "") .replace(",", ""); ``` In this case, removing the dot from the price text is the same as if we multiplied all the numbers by 100, effectively converting dollars to cents. This is what the whole program looks like now: ``` import * as cheerio from 'cheerio'; const url = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const response = await fetch(url); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); $(".product-item").each((i, element) => { const productItem = $(element); const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const price = productItem.find(".price").contents().last(); const priceRange = { minPrice: null, price: null }; const priceText = price .text() .trim() .replace("$", "") .replace(".", "") .replace(",", ""); if (priceText.startsWith("From ")) { priceRange.minPrice = parseInt(priceText.replace("From ", "")); } else { priceRange.minPrice = parseInt(priceText); priceRange.price = priceRange.minPrice; } console.log(`${titleText} | ${priceRange.minPrice} | ${priceRange.price}`); }); } else { throw new Error(`HTTP ${response.status}`); } ``` If we run the code above, we have nice, clean data about all the products! ``` $ node index.js JBL Flip 4 Waterproof Portable Bluetooth Speaker | 7495 | 7495 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 139800 | null ... ``` Well, not to spoil the excitement, but in its current form, the data isn't very useful. In the next lesson we'll save the product details to a file which data analysts can use or other programs can read. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape units on stock[​](#scrape-units-on-stock "Direct link to Scrape units on stock") Change our scraper so that it extracts how many units of each product are in stock. Your program should print the following. Note the unit amounts at the end of each line: ``` JBL Flip 4 Waterproof Portable Bluetooth Speaker | 672 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 77 Sony SACS9 10" Active Subwoofer | 7 Sony PS-HX500 Hi-Res USB Turntable | 15 Klipsch R-120SW Powerful Detailed Home Speaker - Unit | 0 Denon AH-C720 In-Ear Headphones | 236 ... ``` Solution ``` import * as cheerio from 'cheerio'; function parseUnitsText(text) { const count = text .replace("In stock,", "") .replace("Only", "") .replace(" left", "") .replace("units", "") .trim(); return count === "Sold out" ?
0 : parseInt(count); } const url = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const response = await fetch(url); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); $(".product-item").each((i, element) => { const productItem = $(element); const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const unitsText = productItem.find(".product-item__inventory").text(); const unitsCount = parseUnitsText(unitsText); console.log(`${titleText} | ${unitsCount}`); }); } else { throw new Error(`HTTP ${response.status}`); } ``` Conditional (ternary) operator For brevity, the solution uses the [conditional (ternary) operator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Conditional_operator). You can achieve the same with a plain `if` and `else` block. ### Use regular expressions[​](#use-regular-expressions "Direct link to Use regular expressions") Simplify the code from the previous exercise. Use [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions) to parse the number of units. You can match digits using a range like `[0-9]` or the special sequence `\d`. To match one or more characters of the same type, you can use `+`. Solution ``` import * as cheerio from 'cheerio'; function parseUnitsText(text) { const match = text.match(/\d+/); if (match) { return parseInt(match[0]); } return 0; } const url = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const response = await fetch(url); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); $(".product-item").each((i, element) => { const productItem = $(element); const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const unitsText = productItem.find(".product-item__inventory").text(); const unitsCount = parseUnitsText(unitsText); console.log(`${titleText} | ${unitsCount}`); }); } else { throw new Error(`HTTP ${response.status}`); } ``` ### Scrape publish dates of F1 news[​](#scrape-publish-dates-of-f1-news "Direct link to Scrape publish dates of F1 news") Download Guardian's page with the latest F1 news and use Cheerio to parse it. Print titles and publish dates of all the listed articles. This is the URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like the following. Note the dates at the end of each line: ``` Brad Pitt in the paddock: how F1 the Movie went deep to keep fans coming | Fri Jun 20 2025 Wolff hits out at Red Bull protest after Russell’s Canadian GP win | Tue Jun 17 2025 F1 the Movie review – spectacular macho melodrama handles Brad Pitt with panache | Tue Jun 17 2025 Hamilton reveals distress over ‘devastating’ groundhog accident at Canadian F1 GP | Mon Jun 16 2025 ... ``` Hints: * HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as ISO 8601. * Cheerio gives you [.attr()](https://cheerio.js.org/docs/api/classes/Cheerio#attr) to access attributes.
* In JavaScript you can use an ISO 8601 string to create a [`Date`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date) object. * To get the date, you can call `.toDateString()` on `Date` objects. Solution ``` import * as cheerio from 'cheerio'; const url = "https://www.theguardian.com/sport/formulaone"; const response = await fetch(url); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); $("#maincontent ul li").each((i, element) => { const article = $(element); const titleText = article .find("h3") .text() .trim(); const dateText = article .find("time") .attr("datetime") .trim(); const date = new Date(dateText); console.log(`${titleText} | ${date.toDateString()}`); }); } else { throw new Error(`HTTP ${response.status}`); } ``` --- # Using a scraping framework with Node.js **In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.** *** Before rewriting our code, let's point out several caveats in our current solution: * *Hard to maintain:* All the data we need from the listing page is also available on the product page. By scraping both, we have to maintain selectors for two HTML documents. Instead, we could scrape links from the listing page and process all data on the product pages. * *Slow:* The program runs sequentially, which is generously considerate toward the target website, but extremely inefficient. * *No logging:* The scraper gives no sense of progress, making it tedious to use. Debugging issues becomes even more frustrating without proper logs. * *Boilerplate code:* We implement downloading and parsing HTML, or exporting data to CSV, although we're not the first people to meet and solve these problems. * *Prone to anti-scraping:* If the target website implemented anti-scraping measures, a bare-bones program like ours would stop working. * *Browser means rewrite:* We got lucky extracting variants. If the website didn't include a fallback, we might have had no choice but to spin up a browser instance and automate clicking on buttons. Such a change in the underlying technology would require a complete rewrite of our program. * *No error handling:* The scraper stops if it encounters issues. It should allow for skipping problematic products with warnings or retrying downloads when the website returns temporary errors. In this lesson, we'll tackle all the above issues while keeping the code concise thanks to a scraping framework. Why Crawlee and not Scrapy From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development. We genuinely believe beginners to scraping will like it more, since it allows to create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints. ## Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") When starting with the Crawlee framework, we first need to decide which approach to downloading and parsing we prefer. We want the one based on Beautiful Soup, so let's install the `crawlee` package with the `beautifulsoup` extra specified in brackets. 
The framework has a lot of dependencies, so expect the installation to take a while. ``` $ pip install crawlee[beautifulsoup] ... Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ... ``` ## Running Crawlee[​](#running-crawlee "Direct link to Running Crawlee") Now let's use the framework to create a new version of our scraper. First, let's rename the `main.py` file to `oldmain.py`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, we'll create a new, empty `main.py`. The initial content will look like this: ``` import asyncio from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): if title := context.soup.title: print(title.text.strip()) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) if __name__ == '__main__': asyncio.run(main()) ``` In the code, we do the following: 1. We import the necessary modules and define an asynchronous `main()` function. 2. Inside `main()`, we first create a crawler object, which manages the scraping process. In this case, it's a crawler based on Beautiful Soup. 3. Next, we define a nested asynchronous function called `handle_listing()`. It receives a `context` parameter, and Python type hints show it's of type `BeautifulSoupCrawlingContext`. Type hints help editors suggest what we can do with the object. 4. We use a Python decorator (the line starting with `@`) to register `handle_listing()` as the *default handler* for processing HTTP responses. 5. Inside the handler, we extract the page title from the `soup` object and print its text without whitespace. 6. At the end of the function, we run the crawler on a product listing URL and await its completion. 7. The last two lines ensure that if the file is executed directly, Python will properly run the `main()` function using its asynchronous event loop. Don't worry if some of this is new. We don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html), decorators, or type hints work. 
Let's stick to the practical side and observe what the program does when executed: ``` $ python main.py [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 0 │ │ requests_failed │ 0 │ │ retry_histogram │ [0] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ None │ │ requests_finished_per_minute │ 0 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.0 │ │ requests_total │ 0 │ │ crawler_runtime │ 0.010014 │ └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 Sales [crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish [BeautifulSoupCrawler] INFO Final request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 1 │ │ requests_failed │ 0 │ │ retry_histogram │ [1] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ 0.308998 │ │ requests_finished_per_minute │ 185 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.308998 │ │ requests_total │ 1 │ │ crawler_runtime │ 0.323721 │ └───────────────────────────────┴──────────┘ ``` If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the logging, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with Beautiful Soup, extracts the title, and prints it. Advanced Python features You don't need to be an expert in asynchronous programming, decorators, or type hints to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/), [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/), and [Python Type Checking](https://realpython.com/python-type-checking/). ## Crawling product detail pages[​](#crawling-product-detail-pages "Direct link to Crawling product detail pages") The code now features advanced Python concepts, so it's less accessible to beginners, and the size of the program is about the same as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive. As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent. For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages: ``` import asyncio from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title") @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): print(context.request.url) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) if __name__ == '__main__': asyncio.run(main()) ``` First, it's necessary to inspect the page in browser DevTools to figure out the CSS selector that allows us to locate links to all the product detail pages. 
Then we can use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all the requests as `DETAIL`. Below that, we give the crawler another asynchronous function, `handle_detail()`. We again inform the crawler that this function is a handler using a decorator, but this time it's not a default one. This handler will only take care of HTTP requests labeled as `DETAIL`. For now, all it does is print the request URL. If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way: ``` $ python main.py [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ ... └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable ... [crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish [BeautifulSoupCrawler] INFO Final request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 25 │ │ requests_failed │ 0 │ │ retry_histogram │ [25] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ 0.349434 │ │ requests_finished_per_minute │ 318 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 8.735843 │ │ requests_total │ 25 │ │ crawler_runtime │ 4.713262 │ └───────────────────────────────┴──────────┘ ``` In the final stats, we can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers might differ, but regardless, it should be much faster than making the requests sequentially. ## Extracting data[​](#extracting-data "Direct link to Extracting data") The Beautiful Soup crawler provides handlers with the `context.soup` attribute, which contains the parsed HTML of the handled page. This is the same `soup` object we used in our previous program. Let's locate and extract the same data as before: ``` async def main(): ... @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), } print(item) ``` Fragile code The code above assumes the `.select_one()` call doesn't return `None`. If your editor checks types, it might even warn that `text` is not a known attribute of `None`. This isn't robust and could break, but in our program, that's fine. We expect the elements to be there, and if they're not, we'd rather the scraper break quickly—it's a sign something's wrong and needs fixing. Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper. The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. 
It's better to be more specific so we don't accidentally match another price on the same page: ``` async def main(): ... @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), } print(item) ``` Finally, the variants. We can reuse the `parse_variant()` function as-is, and in the handler we'll again take inspiration from what we had in `main.py`. The full program will look like this: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), "variant_name": None, } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: print(item | parse_variant(variant)) else: print(item) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} if __name__ == '__main__': asyncio.run(main()) ``` If we run this scraper, we should get the same data for the 24 products as before. Crawlee has saved us a lot of effort by managing downloading, parsing, and parallelization. The code is also cleaner, with two separate and labeled handlers. Crawlee doesn't do much to help with locating and extracting the data—that part of the code remains almost the same, framework or not. This is because the detective work of finding and extracting the right data is the core value of custom scrapers. With Crawlee, we can focus on just that while letting the framework take care of everything else. ## Saving data[​](#saving-data "Direct link to Saving data") When we're at *letting the framework take care of everything else*, let's take a look at what it can do about saving data. As of now the product detail page handler prints each item as soon as the item is ready. Instead, we can push the item to Crawlee's default dataset: ``` async def main(): ... @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): price_text = ( ... ) item = { ... } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: await context.push_data(item | parse_variant(variant)) else: await context.push_data(item) ``` That's it! If we run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. 
If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item. ![Single dataset item](/assets/images/dataset-item-de374828159ee4b4a61e17ea6cee3c07.png) We can also export all the items to a single file of our choice. We'll do it at the end of the `main()` function, after the crawler has finished scraping: ``` async def main(): ... await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) await crawler.export_data_csv(path='dataset.csv') ``` After running the scraper again, there should be two new files in your directory, `dataset.json` and `dataset.csv`, containing all the data. If we peek into the JSON file, it should have indentation. ## Logging[​](#logging "Direct link to Logging") Crawlee gives us stats about HTTP requests and concurrency, but we don't get much visibility into the pages we're crawling or the items we're saving. Let's add some custom logging: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): context.log.info("Looking for product detail pages") await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): context.log.info(f"Product detail page: {context.request.url}") price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), "variant_name": None, } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: context.log.info("Saving a product variant") await context.push_data(item | parse_variant(variant)) else: context.log.info("Saving a product") await context.push_data(item) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) crawler.log.info("Exporting data") await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) await crawler.export_data_csv(path='dataset.csv') def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} if __name__ == '__main__': asyncio.run(main()) ``` Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are [standard Python loggers](https://docs.python.org/3/library/logging.html). If we compare `main.py` and `oldmain.py` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but at each phase, the code remained clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about. In the next lesson, we'll use a scraping platform to set up our application to run automatically every day. 
*** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Build a Crawlee scraper of F1 Academy drivers[​](#build-a-crawlee-scraper-of-f1-academy-drivers "Direct link to Build a Crawlee scraper of F1 Academy drivers") Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Academy) drivers listed on the official [Drivers](https://www.f1academy.com/Racing-Series/Drivers) page. Each item you push to Crawlee's default dataset should include the following data: * URL of the driver's f1academy.com page * Name * Team * Nationality * Date of birth (as a `date()` object) * Instagram URL If you export the dataset as JSON, it should look something like this: ``` [ { "url": "https://www.f1academy.com/Racing-Series/Drivers/29/Emely-De-Heus", "name": "Emely De Heus", "team": "MP Motorsport", "nationality": "Dutch", "dob": "2003-02-10", "instagram_url": "https://www.instagram.com/emely.de.heus/", }, { "url": "https://www.f1academy.com/Racing-Series/Drivers/28/Hamda-Al-Qubaisi", "name": "Hamda Al Qubaisi", "team": "MP Motorsport", "nationality": "Emirati", "dob": "2002-08-08", "instagram_url": "https://www.instagram.com/hamdaalqubaisi_official/", }, ... ] ``` Hints: * Use Python's `datetime.strptime(text, "%d/%m/%Y").date()` to parse dates in the `DD/MM/YYYY` format. Check out the [docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) for more details. * To locate the Instagram URL, use the attribute selector `a[href*='instagram']`. Learn more about attribute selectors in the [MDN docs](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors). 
Solution ``` import asyncio from datetime import datetime from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER") @crawler.router.handler("DRIVER") async def handle_driver(context: BeautifulSoupCrawlingContext): info = {} for row in context.soup.select(".common-driver-info li"): name = row.select_one("span").text.strip() value = row.select_one("h4").text.strip() info[name] = value detail = {} for row in context.soup.select(".driver-detail--cta-group a"): name = row.select_one("p").text.strip() value = row.select_one("h2").text.strip() detail[name] = value await context.push_data({ "url": context.request.url, "name": context.soup.select_one("h1").text.strip(), "team": detail["Team"], "nationality": info["Nationality"], "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(), "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"), }) await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"]) await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) if __name__ == '__main__': asyncio.run(main()) ``` ### Use Crawlee to find the ratings of the most popular Netflix films[​](#use-crawlee-to-find-the-ratings-of-the-most-popular-netflix-films "Direct link to Use Crawlee to find the ratings of the most popular Netflix films") The [Global Top 10](https://www.netflix.com/tudum/top10) page has a table listing the most popular Netflix films worldwide. Scrape the movie names from this page, then search for each movie on [IMDb](https://www.imdb.com/). Assume the first search result is correct and retrieve the film's rating. Each item you push to Crawlee's default dataset should include the following data: * URL of the film's IMDb page * Title * Rating If you export the dataset as JSON, it should look something like this: ``` [ { "url": "https://www.imdb.com/title/tt32368345/?ref_=fn_tt_tt_1", "title": "The Merry Gentlemen", "rating": "5.0/10" }, { "url": "https://www.imdb.com/title/tt32359447/?ref_=fn_tt_tt_1", "title": "Hot Frosty", "rating": "5.4/10" }, ... ] ``` To scrape IMDb data, you'll need to construct a `Request` object with the appropriate search URL for each movie title. The following code snippet gives you an idea of how to do this: ``` ... from urllib.parse import quote_plus async def main(): ... @crawler.router.default_handler async def handle_netflix_table(context: BeautifulSoupCrawlingContext): requests = [] for name_cell in context.soup.select(...): name = name_cell.text.strip() imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft" requests.append(Request.from_url(imdb_search_url, label="...")) await context.add_requests(requests) ... ... ``` When navigating to the first search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, letting you specify the max number of HTTP requests to enqueue. 
Solution ``` import asyncio from urllib.parse import quote_plus from crawlee import Request from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_netflix_table(context: BeautifulSoupCrawlingContext): requests = [] for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"): name = name_cell.text.strip() imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft" requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH")) await context.add_requests(requests) @crawler.router.handler("IMDB_SEARCH") async def handle_imdb_search(context: BeautifulSoupCrawlingContext): await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1) @crawler.router.handler("IMDB") async def handle_imdb(context: BeautifulSoupCrawlingContext): rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']" rating_text = context.soup.select_one(rating_selector).text.strip() await context.push_data({ "url": context.request.url, "title": context.soup.select_one("h1").text.strip(), "rating": rating_text, }) await crawler.run(["https://www.netflix.com/tudum/top10"]) await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) if __name__ == '__main__': asyncio.run(main()) ``` --- # Getting links from HTML with Node.js **In this lesson, we'll locate and extract links to individual product pages. We'll use Cheerio to find the relevant bits of HTML.** *** The previous lesson concludes our effort to create a scraper. Our program now downloads HTML, locates and extracts data from the markup, and saves the data in a structured and reusable way. For some use cases, this is already enough! In other cases, though, scraping just one page is hardly useful. The data is spread across the website, over several pages. ## Crawling websites[​](#crawling-websites "Direct link to Crawling websites") We'll use a technique called crawling, i.e. following links to scrape multiple pages. The algorithm goes like this: 1. Visit the start URL. 2. Extract new URLs (and data), and save them. 3. Visit one of the newly found URLs and save data and/or more URLs from it. 4. Repeat steps 2 and 3 until you have everything you need. This will help us figure out the actual prices of products, as right now, for some, we're only getting the min price. Implementing the algorithm will require quite a few changes to our code, though. 
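To make these four steps concrete before we restructure anything, here's a minimal, naive sketch of such a crawling loop in plain JavaScript. It reuses the listing URL and selectors from this course, but the page cap and the lack of any data extraction are simplifications for illustration only, not part of our actual program:

```
// A naive sketch of the crawling loop described above, not our final scraper.
// The page cap is an arbitrary safety limit so the sketch stays polite to the website.
import * as cheerio from 'cheerio';

const toVisit = ["https://warehouse-theme-metal.myshopify.com/collections/sales"];
const visited = new Set();

while (toVisit.length > 0 && visited.size < 5) {
  const url = toVisit.shift();                  // 1. visit the next URL in the queue
  if (visited.has(url)) continue;
  visited.add(url);

  const response = await fetch(url);
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const $ = cheerio.load(await response.text());
  console.log(`Visited ${url}, found ${$(".product-item").length} product cards`);

  // 2. extract new URLs (here: the links hidden in product titles) and queue them
  $("a.product-item__title").each((i, element) => {
    const link = new URL($(element).attr("href"), url).href;
    if (!visited.has(link)) toVisit.push(link); // 3. visit them later
  });
}                                               // 4. repeat until the queue is empty
```

The pattern to notice is the queue of URLs to visit and the set of URLs already visited; everything we build from here, including the framework introduced in a later lesson, is a more robust version of this loop.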
## Restructuring code[​](#restructuring-code "Direct link to Restructuring code") Over the course of the previous lessons, the code of our program grew to almost 50 lines containing downloading, parsing, and exporting: ``` import * as cheerio from 'cheerio'; import { writeFile } from 'fs/promises'; import { AsyncParser } from '@json2csv/node'; const url = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const response = await fetch(url); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); const data = []; $(".product-item").each((i, element) => { const productItem = $(element); const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const price = productItem.find(".price").contents().last(); const priceRange = { minPrice: null, price: null }; const priceText = price .text() .trim() .replace("$", "") .replace(".", "") .replace(",", ""); if (priceText.startsWith("From ")) { priceRange.minPrice = parseInt(priceText.replace("From ", "")); } else { priceRange.minPrice = parseInt(priceText); priceRange.price = priceRange.minPrice; } data.push({ title: titleText, ...priceRange }); }); const jsonData = JSON.stringify(data); await writeFile('products.json', jsonData); const parser = new AsyncParser(); const csvData = await parser.parse(data).promise(); await writeFile('products.csv', csvData); } else { throw new Error(`HTTP ${response.status}`); } ``` Let's introduce several functions to make the whole thing easier to digest. First, we can turn the beginning of our program into this `download()` function, which takes a URL and returns a Cheerio object: ``` async function download(url) { const response = await fetch(url); if (response.ok) { const html = await response.text(); return cheerio.load(html); } else { throw new Error(`HTTP ${response.status}`); } } ``` Next, we can put parsing into a `parseProduct()` function, which takes the product item element and returns the object with data: ``` function parseProduct(productItem) { const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const price = productItem.find(".price").contents().last(); const priceRange = { minPrice: null, price: null }; const priceText = price .text() .trim() .replace("$", "") .replace(".", "") .replace(",", ""); if (priceText.startsWith("From ")) { priceRange.minPrice = parseInt(priceText.replace("From ", "")); } else { priceRange.minPrice = parseInt(priceText); priceRange.price = priceRange.minPrice; } return { title: titleText, ...priceRange }; } ``` Now the JSON export. For better readability, let's make a small change here and set the indentation level to two spaces: ``` async function exportJSON(data) { return JSON.stringify(data, null, 2); } ``` Why asynchronous? The `exportJSON()` function doesn't need to be `async` now, but keeping it makes future changes easier — like switching to an async JSON parser. It also stays consistent with the upcoming `exportCSV()` function, which must be asynchronous. 
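Because `exportJSON()` is `async`, it returns a promise, so any caller has to `await` it. Here's a tiny usage sketch with made-up sample data, assuming the `writeFile` import and the `exportJSON()` function from this lesson; the full program below does the same with the real scraped items:

```
// Hypothetical sample data, just to show how the async export function is called
const data = [{ title: "JBL Flip 4 Waterproof Portable Bluetooth Speaker", minPrice: 7495, price: 7495 }];

// exportJSON() returns a promise, so we await it before writing the file
await writeFile('products.json', await exportJSON(data));
```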
The last function we'll add will take care of the CSV export: ``` async function exportCSV(data) { const parser = new AsyncParser(); return await parser.parse(data).promise(); } ``` Now let's put it all together: ``` import * as cheerio from 'cheerio'; import { writeFile } from 'fs/promises'; import { AsyncParser } from '@json2csv/node'; async function download(url) { const response = await fetch(url); if (response.ok) { const html = await response.text(); return cheerio.load(html); } else { throw new Error(`HTTP ${response.status}`); } } function parseProduct(productItem) { const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const price = productItem.find(".price").contents().last(); const priceRange = { minPrice: null, price: null }; const priceText = price .text() .trim() .replace("$", "") .replace(".", "") .replace(",", ""); if (priceText.startsWith("From ")) { priceRange.minPrice = parseInt(priceText.replace("From ", "")); } else { priceRange.minPrice = parseInt(priceText); priceRange.price = priceRange.minPrice; } return { title: titleText, ...priceRange }; } async function exportJSON(data) { return JSON.stringify(data, null, 2); } async function exportCSV(data) { const parser = new AsyncParser(); return await parser.parse(data).promise(); } const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales"; const $ = await download(listingURL); const data = []; $(".product-item").each((i, element) => { const productItem = $(element); const item = parseProduct(productItem); data.push(item); }); await writeFile('products.json', await exportJSON(data)); await writeFile('products.csv', await exportCSV(data)); ``` The program is much easier to read now. With the `parseProduct()` function handy, we could also replace the convoluted loop with one that only takes up five lines of code. Refactoring We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior. ![Refactoring](/assets/images/refactoring-269525c424ecc082de290a3ecfd4d9ce.gif) ## Extracting links[​](#extracting-links "Direct link to Extracting links") With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item: ![Product card's child elements](/assets/images/child-elements-83a62a55f65911b057ec16c4c7cde7b6.png) Several methods exist for transitioning from one page to another, but the most common is a link element, which looks like this: ``` <a href="https://example.com">Text of the link</a> ``` In DevTools, we can see that each product title is, in fact, also a link element. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. Cheerio selections support accessing attributes using the `.attr()` method: ``` function parseProduct(productItem) { const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const url = title.attr("href"); ... return { url, title: titleText, ...priceRange }; } ``` In the previous code example, we've also added the URL to the object returned by the function.
If we run the scraper now, it should produce exports where each product contains a link to its product page: ``` [ { "url": "/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "minPrice": 7495, "price": 7495 }, { "url": "/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv", "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "minPrice": 139800, "price": null }, ... ] ``` Hmm, but that isn't what we wanted! Where is the beginning of each URL? It turns out the HTML contains so-called *relative links*. ## Turning relative links into absolute[​](#turning-relative-links-into-absolute "Direct link to Turning relative links into absolute") Browsers reading the HTML know the base address and automatically resolve such links, but we'll have to do this manually. The built-in [`URL`](https://developer.mozilla.org/en-US/docs/Web/API/URL) object will help us. We'll change the `parseProduct()` function so that it also takes the base URL as an argument and then joins it with the relative URL to the product page: ``` function parseProduct(productItem, baseURL) { const title = productItem.find(".product-item__title"); const titleText = title.text().trim(); const url = new URL(title.attr("href"), baseURL).href; ... return { url, title: titleText, ...priceRange }; } ``` Now we'll pass the base URL to the function in the main body of our program: ``` const listingURL = "https://warehouse-theme-metal.myshopify.com/collections/sales" const $ = await download(listingURL); const data = [] $(".product-item").each((i, element) => { const productItem = $(element); const item = parseProduct(productItem, listingURL); data.push(item); }); ``` When we run the scraper now, we should see full URLs in our exports: ``` [ { "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "minPrice": 7495, "price": 7495 }, { "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv", "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "minPrice": 139800, "price": null }, ... ] ``` Ta-da! We've managed to get links leading to the product pages. In the next lesson, we'll crawl these URLs so that we can gather more details about the products in our dataset. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape links to countries in Africa[​](#scrape-links-to-countries-in-africa "Direct link to Scrape links to countries in Africa") Download Wikipedia's page with the list of African countries, use Cheerio to parse it, and print links to Wikipedia pages of all the states and territories mentioned in all tables. 
Start with this URL: ``` https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa ``` Your program should print the following: ``` https://en.wikipedia.org/wiki/Algeria https://en.wikipedia.org/wiki/Angola https://en.wikipedia.org/wiki/Benin https://en.wikipedia.org/wiki/Botswana ... ``` Solution ``` import * as cheerio from 'cheerio'; const listingURL = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa"; const response = await fetch(listingURL); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); $(".wikitable tr td:nth-child(3)").each((i, element) => { const nameCell = $(element); const link = nameCell.find("a").first(); const url = new URL(link.attr("href"), listingURL).href; console.log(url); }); } else { throw new Error(`HTTP ${response.status}`); } ``` ### Scrape links to F1 news[​](#scrape-links-to-f1-news "Direct link to Scrape links to F1 news") Download Guardian's page with the latest F1 news, use Cheerio to parse it, and print links to all the listed articles. Start with this URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like the following: ``` https://www.theguardian.com/world/2024/sep/13/africa-f1-formula-one-fans-lewis-hamilton-grand-prix https://www.theguardian.com/sport/2024/sep/12/mclaren-lando-norris-oscar-piastri-team-orders-f1-title-race-max-verstappen https://www.theguardian.com/sport/article/2024/sep/10/f1-designer-adrian-newey-signs-aston-martin-deal-after-quitting-red-bull https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-undriveable-monster-how-bad-really-is-it-and-why ... ``` Solution ``` import * as cheerio from 'cheerio'; const listingURL = "https://www.theguardian.com/sport/formulaone"; const response = await fetch(listingURL); if (response.ok) { const html = await response.text(); const $ = cheerio.load(html); $("#maincontent ul li").each((i, element) => { const link = $(element).find("a").first(); const url = new URL(link.attr("href"), listingURL).href; console.log(url); }); } else { throw new Error(`HTTP ${response.status}`); } ``` Note that some cards contain two links. One leads to the article, and one to the comments. If we selected all the links in the list by `#maincontent ul li a`, we would get incorrect output like this: ``` https://www.theguardian.com/sport/article/2024/sep/02/example https://www.theguardian.com/sport/article/2024/sep/02/example#comments ``` --- # Locating HTML elements with Node.js **In this lesson we'll locate product data in the downloaded HTML. We'll use Cheerio to find those HTML elements which contain details about each product, such as title or price.** *** In the previous lesson we've managed to print text of the page's main heading or count how many products are in the listing. Let's combine those two. What happens if we print `.text` for each product card? ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): print(product.text) ``` Well, it definitely prints *something*… ``` $ python main.py Save $25.00 JBL JBL Flip 4 Waterproof Portable Bluetooth Speaker Black +7 Blue +6 Grey ... ``` To get details about each product in a structured way, we'll need a different approach. 
## Locating child elements[​](#locating-child-elements "Direct link to Locating child elements") As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card. ![Product card's child elements](/assets/images/child-elements-83a62a55f65911b057ec16c4c7cde7b6.png) We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): titles = product.select(".product-item__title") first_title = titles[0].text prices = product.select(".price") first_price = prices[0].text print(first_title, first_price) ``` Let's run the program now: ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker Sale price$74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV Sale priceFrom $1,398.00 ... ``` There's still some room for improvement, but it's already much better! ## Locating a single element[​](#locating-a-single-element "Direct link to Locating a single element") Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or `None`. Let's simplify our code! ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text price = product.select_one(".price").text print(title, price) ``` This program does the same as the one we already had, but its code is more concise. Fragile code We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above will trigger warnings about this. Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it. ## Precisely locating price[​](#precisely-locating-price "Direct link to Precisely locating price") In the output we can see that the price isn't located precisely: ``` JBL Flip 4 Waterproof Portable Bluetooth Speaker Sale price$74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV Sale priceFrom $1,398.00 ... ``` For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks like this: ``` <span class="price"> <span class="visually-hidden">Sale price</span> $74.95 </span> ``` When translated to a tree of Python objects, the element with class `price` will contain several *nodes*: * Textual node with white space, * a `span` HTML element, * a textual node representing the actual amount and possibly also white space.
We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this: ``` ["\n", <span class="visually-hidden">Sale price</span>, "$74.95"] ``` It seems we can read the last element from a list like the one above to get the actual amount. Let's fix our program: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text price = product.select_one(".price").contents[-1] print(title, price) ``` If we run the scraper now, it should print prices as only amounts: ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00 ... ``` ## Formatting output[​](#formatting-output "Direct link to Formatting output") The results seem to be correct, but they're hard to verify because the prices visually blend with the titles. Let's set a different separator for the `print()` function: ``` print(title, price, sep=" | ") ``` The output is much nicer this way: ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00 ... ``` Great! We have managed to use CSS selectors and walk the HTML tree to get a list of product titles and prices. But wait a second—what's `From $1,398.00`? One does not simply scrape a price! We'll need to clean that. But that's a job for the next lesson, which is about extracting data. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape Wikipedia[​](#scrape-wikipedia "Direct link to Scrape Wikipedia") Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print short English names of all the states and territories mentioned in all tables. This is the URL: ``` https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa ``` Your program should print the following: ``` Algeria Angola Benin Botswana ... ``` Solution ``` import httpx from bs4 import BeautifulSoup url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for table in soup.select(".wikitable"): for row in table.select("tr"): cells = row.select("td") if cells: third_column = cells[2] title_link = third_column.select_one("a") print(title_link.text) ``` Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells. ### Use CSS selectors to their max[​](#use-css-selectors-to-their-max "Direct link to Use CSS selectors to their max") Simplify the code from the previous exercise.
Use a single for loop and a single CSS selector. You may want to check out the following pages: * [Descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) * [`:nth-child()` pseudo-class](https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child) Solution ``` import httpx from bs4 import BeautifulSoup url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for name_cell in soup.select(".wikitable tr td:nth-child(3)"): print(name_cell.select_one("a").text) ``` ### Scrape F1 news[​](#scrape-f1-news "Direct link to Scrape F1 news") Download Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print titles of all the listed articles. This is the URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like the following: ``` Wolff confident Mercedes are heading to front of grid after Canada improvement Frustrated Lando Norris blames McLaren team for missed chance Max Verstappen wins Canadian Grand Prix: F1 – as it happened ... ``` Solution ``` import httpx from bs4 import BeautifulSoup url = "https://www.theguardian.com/sport/formulaone" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for title in soup.select("#maincontent ul li h3"): print(title.text) ``` --- # Parsing HTML with Python **In this lesson we'll look for products in the downloaded HTML. We'll use Beautiful Soup to turn the HTML into objects which we can work with in our Python program.** *** From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`. ![Products have the ‘product-item’ class](/assets/images/product-item-b3160bf23207c2794e7c37a2e8517815.png) As a first step, let's try counting how many products are on the listing page. ## Processing HTML[​](#processing-html "Direct link to Processing HTML") After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products? While somewhat possible, such an approach is tedious, fragile, and unreliable. To work with HTML, we need a robust tool dedicated to the task: an *HTML parser*. It takes a text with HTML markup and turns it into a tree of Python objects. Why regex can't parse HTML While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't explain much. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty. We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup.
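To get a feel for what that robustness means in practice, here is a tiny, self-contained sketch; the broken markup below is made up purely for illustration. Even with a stray closing tag and an element that is never closed, Beautiful Soup still builds a tree we can query:

```
from bs4 import BeautifulSoup

# Deliberately malformed markup: a stray closing tag and an unclosed <b> element
broken_html = 'Some text</div> with an <b>unclosed element'
soup = BeautifulSoup(broken_html, "html.parser")

# Parsing doesn't crash, and the <b> element is still reachable
print(soup.select_one("b").text)  # unclosed element
```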
This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. ``` $ pip install beautifulsoup4 ... Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0 ``` Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `h1` element, which represents the main heading of the page. ![Element of the main heading](/assets/images/h1-c1c17bca77070ce481691298487e3952.png) We'll update our code to the following: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") print(soup.select("h1")) ``` Then let's run the program: ``` $ python main.py [<h1 class="collection__title heading h1">Sales</h1>] ``` Our code lists all `h1` elements it can find on the page. There's just one on this page, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following: ``` headings = soup.select("h1") first_heading = headings[0] print(first_heading.text) ``` If we run our scraper again, it prints the text of the first `h1` element: ``` $ python main.py Sales ``` Dynamic websites The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](https://docs.apify.com/academy/api-scraping.md) and [Puppeteer & Playwright](https://docs.apify.com/academy/puppeteer-playwright.md) courses. ## Using CSS selectors[​](#using-css-selectors "Direct link to Using CSS selectors") Beautiful Soup's `.select()` method runs a *CSS selector* against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools. Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us figure out the code for counting the product cards: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") products = soup.select(".product-item") print(len(products)) ``` In CSS, `.product-item` selects all elements whose `class` attribute contains the value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there are in the list. ``` $ python main.py 24 ``` That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break.
If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape F1 teams[​](#scrape-f1-teams "Direct link to Scrape F1 teams") Print a total count of F1 teams listed on this page: ``` https://www.formula1.com/en/teams ``` Solution ``` import httpx from bs4 import BeautifulSoup url = "https://www.formula1.com/en/teams" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") print(len(soup.select(".group"))) ``` ### Scrape F1 drivers[​](#scrape-f1-drivers "Direct link to Scrape F1 drivers") Use the same URL as in the previous exercise, but this time print a total count of F1 drivers. Solution ``` import httpx from bs4 import BeautifulSoup url = "https://www.formula1.com/en/teams" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") print(len(soup.select(".f1-team-driver-name"))) ``` --- # Using a scraping platform with Python **In this lesson, we'll deploy our application to a scraping platform that automatically runs it daily. We'll also use the platform's API to retrieve and work with the results.** *** Before starting with a scraping platform, let's highlight a few caveats in our current setup: * *User-operated:* We have to run the scraper ourselves. If we're tracking price trends, we'd need to remember to run it daily. And if we want alerts for big discounts, manually running the program isn't much better than just checking the site in a browser every day. * *No monitoring:* If we have a spare server or a Raspberry Pi lying around, we could use [cron](https://en.wikipedia.org/wiki/Cron) to schedule it. But even then, we'd have little insight into whether it ran successfully, what errors or warnings occurred, how long it took, or what resources it used. * *Manual data management:* Tracking prices over time means figuring out how to organize the exported data ourselves. Processing the data could also be tricky since different analysis tools often require different formats. * *Anti-scraping risks:* If the target website detects our scraper, they can rate-limit or block us. Sure, we could run it from a coffee shop's Wi-Fi, but eventually, they'd block that too—risking seriously annoying our barista. In this lesson, we'll use a platform to address all of these issues. Generic cloud platforms like [GitHub Actions](https://github.com/features/actions) can work for simple scenarios. But platforms dedicated to scraping, like [Apify](https://apify.com/), offer extra features such as monitoring scrapers, managing retrieved data, and overcoming anti-scraping measures. Why Apify Scraping platforms come in many varieties, offering a wide range of tools and approaches. As the course authors, we're obviously biased toward Apify—we think it's both powerful and complete. That said, the main goal of this lesson is to show how deploying to *any platform* can make life easier. Plus, everything we cover here fits within [Apify's free tier](https://apify.com/pricing). ## Registering[​](#registering "Direct link to Registering") First, let's [create a new Apify account](https://console.apify.com/sign-up). We'll go through a few checks to confirm we're human and our email is valid—annoying but necessary to prevent abuse of the platform.
Apify serves both as an infrastructure where you can privately deploy and run your own scrapers, and as a marketplace where anyone can offer their ready-made scrapers to others for rent. But let's hold off on exploring the Apify Store for now. ## Getting access from the command line[​](#getting-access-from-the-command-line "Direct link to Getting access from the command line") To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. On macOS, we can install the CLI using [Homebrew](https://brew.sh), otherwise we'll first need [Node.js](https://nodejs.org/en/download). After following the [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation), we'll verify that we installed the tool by printing its version: ``` $ apify --version apify-cli/0.0.0 system-arch00 node-v0.0.0 ``` Now let's connect the CLI with the cloud platform using our account from the previous step: ``` $ apify login ... Success: You are logged in to Apify as user1234! ``` ## Starting a real-world project[​](#starting-a-real-world-project "Direct link to Starting a real-world project") Until now, we've kept our scrapers simple, each with just a single Python module like `main.py`, and we've added dependencies only by installing them with `pip` inside a virtual environment. If we sent our code to a friend, they wouldn't know what to install to avoid import errors. The same goes for deploying to a cloud platform. To share our project, we need to package it. The best way is to follow the official [Python Packaging User Guide](https://packaging.python.org/), but for this course, we'll take a shortcut with the Apify CLI. In our terminal, let's change to a directory where we usually start new projects. Then, we'll run the following command: ``` apify create warehouse-watchdog --template=python-crawlee-beautifulsoup ``` It will create a new subdirectory called `warehouse-watchdog` for the new project, containing all the necessary files: ``` Info: Python version 0.0.0 detected. Info: Creating a virtual environment in ... ... Success: Actor 'warehouse-watchdog' was created. To run it, run "cd warehouse-watchdog" and "apify run". Info: To run your code in the cloud, run "apify push" and deploy your code to Apify Console. Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory. ``` ## Adjusting the template[​](#adjusting-the-template "Direct link to Adjusting the template") Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, including `main.py`. This is a sample Beautiful Soup scraper provided by the template. The file contains a single asynchronous function, `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then passes that input to a small crawler built on top of the Crawlee framework. Every program that runs on the Apify platform first needs to be packaged as a so-called [Actor](https://apify.com/actors)—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code. ![The expected file structure](/assets/images/actor-file-structure-8ca91fac30aeb65a0b69ece59f2609d0.webp) We'll now adjust the template so that it runs our program for watching prices.
As the first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with final, unchanged code from the previous lesson: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context): context.log.info("Looking for product detail pages") await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") @crawler.router.handler("DETAIL") async def handle_detail(context): context.log.info(f"Product detail page: {context.request.url}") price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), "variant_name": None, } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: context.log.info("Saving a product variant") await context.push_data(item | parse_variant(variant)) else: context.log.info("Saving a product") await context.push_data(item) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) crawler.log.info("Exporting data") await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) await crawler.export_data_csv(path='dataset.csv') def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} if __name__ == '__main__': asyncio.run(main()) ``` Now, let's replace the contents of `warehouse-watchdog/src/main.py` with this: ``` from apify import Actor from .crawler import main as crawl async def main(): async with Actor: await crawl() ``` We import our scraper as a function and await the result inside the Actor block. Unlike the sample scraper, the one we made in the previous lesson doesn't expect any input data, so we can omit the code that handles that part. Next, we'll change to the `warehouse-watchdog` directory in our terminal and verify that everything works locally before deploying the project to the cloud: ``` $ apify run Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src [apify] INFO Initializing Actor... 
[apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"}) [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 0 │ │ requests_failed │ 0 │ │ retry_histogram │ [0] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ None │ │ requests_finished_per_minute │ 0 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.0 │ │ requests_total │ 0 │ │ crawler_runtime │ 0.016736 │ └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 [BeautifulSoupCrawler] INFO Looking for product detail pages [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker [BeautifulSoupCrawler] INFO Saving a product variant [BeautifulSoupCrawler] INFO Saving a product variant ... ``` ## Updating the Actor configuration[​](#updating-the-actor-configuration "Direct link to Updating the Actor configuration") The Actor configuration from the template tells the platform to expect input, so we need to update that before running our scraper in the cloud. Inside `warehouse-watchdog`, there's a directory called `.actor`. Within it, we'll edit the `input_schema.json` file, which looks like this by default: ``` { "title": "Python Crawlee BeautifulSoup Scraper", "type": "object", "schemaVersion": 1, "properties": { "start_urls": { "title": "Start URLs", "type": "array", "description": "URLs to start with", "prefill": [ { "url": "https://apify.com" } ], "editor": "requestListSources" } }, "required": ["start_urls"] } ``` Hidden dot files On some systems, `.actor` might be hidden in the directory listing because it starts with a dot. Use your editor's built-in file explorer to locate it. We'll remove the expected properties and the list of required ones. After our changes, the file should look like this: ``` { "title": "Python Crawlee BeautifulSoup Scraper", "type": "object", "schemaVersion": 1, "properties": {} } ``` Trailing commas in JSON Make sure there's no trailing comma after `{}`, or the file won't be valid JSON. ## Deploying the scraper[​](#deploying-the-scraper "Direct link to Deploying the scraper") Now we can proceed to deployment: ``` $ apify push Info: Created Actor with name warehouse-watchdog on Apify. Info: Deploying Actor 'warehouse-watchdog' to Apify. Run: Updated version 0.0 for Actor warehouse-watchdog. Run: Building Actor warehouse-watchdog ... Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.0.1 ? Do you want to open the Actor detail in your browser? (Y/n) ``` After opening the link in our browser, assuming we're logged in, we should see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud. ![Actor\'s detail page, screen Source, tab Input](/assets/images/actor-input-85254e66fa557db99c1229b45180c609.webp) When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more. 
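The data doesn't have to stay in the console UI either. As a rough sketch of one way to pull it down programmatically, the snippet below uses the `apify-client` Python package (`pip install apify-client`); the API token and the Actor name are placeholders you'd replace with your own values, and the field names match what our scraper saves.

```
from apify_client import ApifyClient

# Placeholders: substitute your own Apify API token and Actor name
client = ApifyClient("YOUR-APIFY-TOKEN")
actor = client.actor("your-username/warehouse-watchdog")

# Fetch items from the dataset of the most recent run
last_run = actor.last_run()
items = last_run.dataset().list_items().items
for item in items[:5]:
    print(item["title"], item["price"])
```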
![Actor's detail page, screen Source, tab Output](/assets/images/actor-output-27017a6add46795ea603405961e45293.webp) Accessing data We don't need to click buttons to download the data. We can also retrieve it using Apify's API, the `apify datasets` CLI command, or the Python SDK. Learn more in the [Dataset docs](https://docs.apify.com/platform/storage/dataset). ## Running the scraper periodically[​](#running-the-scraper-periodically "Direct link to Running the scraper periodically") Now that our scraper is deployed, let's automate its execution. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Let's click **Create new**, review the periodicity (default: daily), and specify the Actor to run. Then we'll click **Enable**—that's it! From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, [monitor stats and charts](https://docs.apify.com/platform/monitoring), and even set up alerts. ![Schedule detail page](/assets/images/actor-schedule-517425bbd37b0c4989a5324ccf9b4ec7.webp) ## Adding support for proxies[​](#adding-support-for-proxies "Direct link to Adding support for proxies") If monitoring shows that our scraper frequently fails to reach the Warehouse Shop website, it's likely being blocked. To avoid this, we can [configure proxies](https://docs.apify.com/platform/proxy) so our requests come from different locations, reducing the chances of detection and blocking. Proxy configuration is a type of Actor input, so let's start by reintroducing the necessary code. We'll update `warehouse-watchdog/src/main.py` like this: ``` from apify import Actor from .crawler import main as crawl async def main(): async with Actor: input_data = await Actor.get_input() if actor_proxy_input := input_data.get("proxyConfig"): proxy_config = await Actor.create_proxy_configuration(actor_proxy_input=actor_proxy_input) else: proxy_config = None await crawl(proxy_config) ``` Next, we'll add `proxy_config` as an optional parameter in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, we only need to pass it to `BeautifulSoupCrawler()`, and the class will handle the rest: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler async def main(proxy_config = None): crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config) crawler.log.info(f"Using proxy: {'yes' if proxy_config else 'no'}") @crawler.router.default_handler async def handle_listing(context): context.log.info("Looking for product detail pages") await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") ... ``` Finally, we'll modify the Actor configuration in `warehouse-watchdog/.actor/input_schema.json` to include the `proxyConfig` input parameter: ``` { "title": "Python Crawlee BeautifulSoup Scraper", "type": "object", "schemaVersion": 1, "properties": { "proxyConfig": { "title": "Proxy config", "description": "Proxy configuration", "type": "object", "editor": "proxy", "prefill": { "useApifyProxy": true, "apifyProxyGroups": [] }, "default": { "useApifyProxy": true, "apifyProxyGroups": [] } } } } ``` To verify everything works, we'll run the scraper locally. We'll use the `apify run` command again, but this time with the `--purge` option to ensure we're not reusing data from a previous run: ``` $ apify run --purge Info: All default local stores were purged.
Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src [apify] INFO Initializing Actor... [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"}) [BeautifulSoupCrawler] INFO Using proxy: no [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 0 │ │ requests_failed │ 0 │ │ retry_histogram │ [0] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ None │ │ requests_finished_per_minute │ 0 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.0 │ │ requests_total │ 0 │ │ crawler_runtime │ 0.014976 │ └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 [BeautifulSoupCrawler] INFO Looking for product detail pages [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker [BeautifulSoupCrawler] INFO Saving a product variant [BeautifulSoupCrawler] INFO Saving a product variant ... ``` In the logs, we should see `Using proxy: no`, because local runs don't include proxy settings. All requests will be made from our own location, just as before. Now, let's update the cloud version of our scraper with `apify push`: ``` $ apify push Info: Deploying Actor 'warehouse-watchdog' to Apify. Run: Updated version 0.0 for Actor warehouse-watchdog. Run: Building Actor warehouse-watchdog (timestamp) ACTOR: Found input schema referenced from .actor/actor.json ... ? Do you want to open the Actor detail in your browser? (Y/n) ``` Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. We should see the new **Proxy config** option, which defaults to **Datacenter - Automatic**. ![Actor\'s detail page, screen Source, tab Input with proxies](/assets/images/actor-input-proxies-f7747e47adb97be0c3909f8b8870324f.webp) We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform: ``` (timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository. (timestamp) ACTOR: Creating Docker container. (timestamp) ACTOR: Starting Docker container. (timestamp) [apify] INFO Initializing Actor... 
(timestamp) [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"}) (timestamp) [BeautifulSoupCrawler] INFO Using proxy: yes (timestamp) [BeautifulSoupCrawler] INFO Current request statistics: (timestamp) ┌───────────────────────────────┬──────────┐ (timestamp) │ requests_finished │ 0 │ (timestamp) │ requests_failed │ 0 │ (timestamp) │ retry_histogram │ [0] │ (timestamp) │ request_avg_failed_duration │ None │ (timestamp) │ request_avg_finished_duration │ None │ (timestamp) │ requests_finished_per_minute │ 0 │ (timestamp) │ requests_failed_per_minute │ 0 │ (timestamp) │ request_total_duration │ 0.0 │ (timestamp) │ requests_total │ 0 │ (timestamp) │ crawler_runtime │ 0.036449 │ (timestamp) └───────────────────────────────┴──────────┘ (timestamp) [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 (timestamp) [crawlee.storages._request_queue] INFO The queue still contains requests locked by another client (timestamp) [BeautifulSoupCrawler] INFO Looking for product detail pages (timestamp) [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker (timestamp) [BeautifulSoupCrawler] INFO Saving a product variant ... ``` ## Congratulations![​](#congratulations "Direct link to Congratulations!") We've reached the end of the course—congratulations! Together, we've built a program that: * Crawls a shop and extracts product and pricing data. * Exports the results in several formats. * Uses concise code, thanks to a scraping framework. * Runs on a cloud platform with monitoring and alerts. * Executes periodically without manual intervention, collecting data over time. * Uses proxies to avoid being blocked. We hope this serves as a solid foundation for your next scraping project. Perhaps you'll even [start publishing scrapers](https://docs.apify.com/platform/actors/publishing) for others to use—for a fee? --- # Saving data with Python **In this lesson, we'll save the data we scraped in popular formats, such as CSV or JSON. We'll use Python's standard `csv` and `json` modules to export the files.** *** We managed to scrape data about products and print it, with each product separated by a new line and each field separated by the `|` character. This already produces structured text that can be parsed, i.e., read programmatically. ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None ... ``` However, the format of this text is rather *ad hoc* and does not adhere to any specific standard that others could follow. It's unclear what to do if a product title already contains the `|` character or how to represent multi-line product descriptions. No ready-made library can handle all the parsing. We should use widely popular formats that have well-defined solutions for all the corner cases and that other programs can read without much effort. Two such formats are CSV (*Comma-separated values*) and JSON (*JavaScript Object Notation*). ## Collecting data[​](#collecting-data "Direct link to Collecting data") Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable.
This will take three changes to our program: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") data = [] for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text.strip() price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price data.append({"title": title, "min_price": min_price, "price": price}) print(data) ``` Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list at once. ``` $ python main.py [{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...] ``` Pretty print If you find the complex data structures printed by `print()` difficult to read, try using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) from the `pprint` module instead. ## Saving data as CSV[​](#saving-data-as-csv "Direct link to Saving data as CSV") The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets. In Python, it's convenient to read and write CSV files, thanks to the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First, let's try something small in Python's interactive REPL to familiarize ourselves with the basic usage: ``` >>> import csv >>> with open("data.csv", "w") as file: ... writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"]) ... writer.writeheader() ... writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"}) ... writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"}) ... ``` We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents: ``` name,age,hobbies Alice,24,"kickbox, Python" Bob,42,"reading, TypeScript" ``` In the CSV format, if values contain commas, we should enclose them in quotes. You can see that the writer automatically handled this. When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it as well. If you're using a different operating system, try opening the file with any spreadsheet program you have. ![CSV example preview](/assets/images/csv-example-370c0acb3f87d0f1598505b6f595f35a.png) Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this!
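As a quick aside, if we ever want to check such a file from Python instead of a spreadsheet, the same module reads it back with `csv.DictReader`. A minimal sketch, assuming the `data.csv` file we just created in the REPL:

```
import csv

# Read back the data.csv file created in the REPL session above
with open("data.csv") as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row)

# {'name': 'Alice', 'age': '24', 'hobbies': 'kickbox, Python'}
# {'name': 'Bob', 'age': '42', 'hobbies': 'reading, TypeScript'}
```

Notice that every value comes back as a string. CSV itself has no notion of numbers or other types, so any conversion is up to the program reading the file.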
First, let's add `csv` to our imports: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import csv ``` Next, instead of printing the data, we'll finish the program by exporting it to CSV. Replace `print(data)` with the following: ``` with open("products.csv", "w") as file: writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"]) writer.writeheader() for row in data: writer.writerow(row) ``` If we run our scraper now, it won't display any output, but it will create a `products.csv` file in the current working directory, which contains all the data about the listed products. ![CSV preview](/assets/images/csv-5b03f7c2a2b8a1f19f48f71ab832054d.png) ## Saving data as JSON[​](#saving-data-as-json "Direct link to Saving data as JSON") The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries. In Python, there's a [`json`](https://docs.python.org/3/library/json.html) standard library module, which is so straightforward that we can start using it in our code right away. We'll need to begin with imports: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import csv import json ``` Next, let’s append one more export to the end of the source code of our scraper: ``` with open("products.json", "w") as file: json.dump(data, file) ``` That’s it! If we run the program now, it should also create a `products.json` file in the current working directory: ``` $ python main.py Traceback (most recent call last): ... raise TypeError(f'Object of type {o.__class__.__name__} ' TypeError: Object of type Decimal is not JSON serializable ``` Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly: ``` def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") with open("products.json", "w") as file: json.dump(data, file, default=serialize) ``` Now the program should work as expected, producing a JSON file with the following content: ``` [{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...] ``` If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash: ``` {"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"} ``` Pretty JSON While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output. Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\u00fan b\u00f2 Nam B\u00f4`.
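For instance, here's a small sketch combining both options with the `serialize()` function from above; the single sample record is made up for illustration:

```
import json
from decimal import Decimal

def serialize(obj):
    # Same Decimal-to-string fallback as in our scraper
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

data = [{"title": "Bún bò Nam Bô", "min_price": Decimal("7.50"), "price": None}]

# indent=2 adds line breaks and indentation, ensure_ascii=False keeps the accents readable
print(json.dumps(data, default=serialize, indent=2, ensure_ascii=False))
```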
We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages. *** ## Exercises[​](#exercises "Direct link to Exercises") In this lesson, you learned how to create export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them. ### Process your CSV[​](#process-your-csv "Direct link to Process your CSV") Open the `products.csv` file in a spreadsheet app. Use the app to find all products with a min price greater than $500. Solution Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account: 1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data. 2. Select the header row. Go to **Data > Create filter**. 3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data. ![CSV in Google Sheets](/assets/images/csv-sheets-898267aa223628c99e056cb8ce345dcd.png) ### Process your JSON[​](#process-your-json "Direct link to Process your JSON") Write a new Python program that reads `products.json`, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp). Solution ``` import json from pprint import pp from decimal import Decimal with open("products.json", "r") as file: products = json.load(file) for product in products: if Decimal(product["min_price"]) > 500: pp(product) ``` --- # Scraping product variants with Python **In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.** *** We'll need to figure out how to extract variants from the product detail page, and then change how we add items to the data list so we can add multiple items after scraping one product URL. ## Locating variants[​](#locating-variants "Direct link to Locating variants") First, let's extract information about the variants. If we go to [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv) and open the DevTools, we can see that the buttons for switching between variants look like this: ``` 55" 65" ``` Nice! We can extract the variant names, but we also need to extract the price for each variant. Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display this information. ![Switching variants](/assets/images/variants-js-304898fbc90d9c188535e88eae04ba1d.gif) If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible. After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with a class `no-js`, which contains all the data!
``` <div class="product-form__option no-js"> <label>Variant</label> <select> <option>55" - $1,398.00</option> <option>65" - $2,198.00</option> </select> </div> ``` These elements aren't visible to regular visitors. They're there just in case JavaScript fails to work, otherwise they're hidden. This is a great find because it allows us to keep our scraper lightweight. ## Extracting variants[​](#extracting-variants "Direct link to Extracting variants") Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need: ``` ... listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) product_soup = download(item["url"]) vendor = product_soup.select_one(".product-meta__vendor").text.strip() if variants := product_soup.select(".product-form__option.no-js option"): for variant in variants: data.append(item | {"variant_name": variant.text.strip()}) else: item["variant_name"] = None data.append(item) ... ``` The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper. Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty. Modern Python syntax Since Python 3.8, you can use `:=` to simplify checking if an assignment resulted in a non-empty value. It's called an *assignment expression* or *walrus operator*. You can learn more about it in the [docs](https://docs.python.org/3/reference/expressions.html#assignment-expressions) or in the [proposal document](https://peps.python.org/pep-0572/). Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https://docs.python.org/3/library/stdtypes.html#dict) aren't clear enough, check out the [proposal document](https://peps.python.org/pep-0584/) for more details. If we run the program now, we'll see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page. ``` [ ... { "variant_name": null, "title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit", "min_price": "324.00", "price": "324.00", "url": "https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1", "vendor": "Klipsch" }, ... ] ``` Some products will break into several items, each with a different variant name. We don't know their exact prices from the product listing, just the min price. In the next step, we should be able to parse the actual price from the variant name for those items. ``` [ ...
{ "variant_name": "Red - $178.00", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": null, "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, { "variant_name": "Black - $178.00", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": null, "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, ... ] ``` Perhaps surprisingly, some products with variants will have the price field set. That's because the shop sells all variants of the product for the same price, so the product listing shows the price as a fixed amount, like *$74.95*, instead of *from $74.95*. ``` [ ... { "variant_name": "Red - $74.95", "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95", "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", "vendor": "JBL" }, ... ] ``` ## Parsing price[​](#parsing-price "Direct link to Parsing price") The items now contain the variant as text, which is good for a start, but we want the price to be in the `price` key. Let's introduce a new function to handle that: ``` def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} ``` First, we split the text into two parts, then we parse the price as a decimal number. This part is similar to what we already do for parsing product listing prices. The function returns a dictionary we can merge with `item`. ## Saving price[​](#saving-price "Direct link to Saving price") Now, if we use our new function, we should finally get a program that can scrape exact prices for all products, even if they have variants. 
The whole code should look like this now: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json import csv from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() html_code = response.text return BeautifulSoup(html_code, "html.parser") def parse_product(product, base_url): title_element = product.select_one(".product-item__title") title = title_element.text.strip() url = urljoin(base_url, title_element["href"]) price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price return {"title": title, "min_price": min_price, "price": price, "url": url} def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} def export_csv(file, data): fieldnames = list(data[0].keys()) writer = csv.DictWriter(file, fieldnames=fieldnames) writer.writeheader() for row in data: writer.writerow(row) def export_json(file, data): def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") json.dump(data, file, default=serialize, indent=2) listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) product_soup = download(item["url"]) vendor = product_soup.select_one(".product-meta__vendor").text.strip() if variants := product_soup.select(".product-form__option.no-js option"): for variant in variants: data.append(item | parse_variant(variant)) else: item["variant_name"] = None data.append(item) with open("products.csv", "w") as file: export_csv(file, data) with open("products.json", "w") as file: export_json(file, data) ``` Let's run the scraper and see if all the items in the data contain prices: ``` [ ... { "variant_name": "Red", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": "178.00", "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, { "variant_name": "Black", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": "178.00", "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, ... ] ``` Success! We managed to build a Python application for watching prices! Is this the end? Maybe! In the next lesson, we'll use a scraping framework to build the same application, but with less code, faster requests, and better visibility into what's happening while we wait for the program to finish. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. 
If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Build a scraper for watching Python jobs[​](#build-a-scraper-for-watching-python-jobs "Direct link to Build a scraper for watching Python jobs") You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria: * Tagged as "Database" * Posted within the last 60 days For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data: * Job title * Company * URL to the job posting * Date of posting Your output should look something like this: ``` {'title': 'Senior Full Stack Developer', 'company': 'Baserow', 'url': 'https://www.python.org/jobs/7705/', 'posted_on': datetime.date(2024, 9, 16)} {'title': 'Senior Python Engineer', 'company': 'Active Prime', 'url': 'https://www.python.org/jobs/7699/', 'posted_on': datetime.date(2024, 9, 5)} ... ``` You can find everything you need for working with dates and times in Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module, including `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, and `timedelta()`. Solution After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually. ``` from pprint import pp import httpx from bs4 import BeautifulSoup from urllib.parse import urljoin from datetime import datetime, date, timedelta today = date.today() jobs_url = "https://www.python.org/jobs/type/database/" response = httpx.get(jobs_url) response.raise_for_status() soup = BeautifulSoup(response.text, "html.parser") for job in soup.select(".list-recent-jobs li"): link = job.select_one(".listing-company-name a") time = job.select_one(".listing-posted time") posted_at = datetime.fromisoformat(time["datetime"]) posted_on = posted_at.date() posted_ago = today - posted_on if posted_ago <= timedelta(days=60): title = link.text.strip() company = job.select_one(".listing-company-name").contents[-1].strip() url = urljoin(jobs_url, link["href"]) pp({"title": title, "company": company, "url": url, "posted_on": posted_on}) ``` --- ## [📄️ DevTools: Inspecting](https://docs.apify.com/academy/scraping-basics-python/devtools-inspecting.md) [Lesson about using the browser tools for developers to inspect and manipulate the structure of a website.](https://docs.apify.com/academy/scraping-basics-python/devtools-inspecting.md) ## [📄️ DevTools: Locating HTML elements](https://docs.apify.com/academy/scraping-basics-python/devtools-locating-elements.md) [Lesson about using the browser tools for developers to manually find products on an e-commerce website.](https://docs.apify.com/academy/scraping-basics-python/devtools-locating-elements.md) ## [📄️ DevTools: Extracting data](https://docs.apify.com/academy/scraping-basics-python/devtools-extracting-data.md) [Lesson about using the browser tools for developers to manually extract product data from an e-commerce website.](https://docs.apify.com/academy/scraping-basics-python/devtools-extracting-data.md) ## [📄️ Downloading HTML](https://docs.apify.com/academy/scraping-basics-python/downloading-html.md) [Lesson about building a Python application for watching prices.
Using the HTTPX library to download HTML code of a product listing page.](https://docs.apify.com/academy/scraping-basics-python/downloading-html.md) ## [📄️ Parsing HTML](https://docs.apify.com/academy/scraping-basics-python/parsing-html.md) [Lesson about building a Python application for watching prices. Using the Beautiful Soup library to parse HTML code of a product listing page.](https://docs.apify.com/academy/scraping-basics-python/parsing-html.md) ## [📄️ Locating HTML elements](https://docs.apify.com/academy/scraping-basics-python/locating-elements.md) [Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate products on the product listing page.](https://docs.apify.com/academy/scraping-basics-python/locating-elements.md) ## [📄️ Extracting data from HTML](https://docs.apify.com/academy/scraping-basics-python/extracting-data.md) [Lesson about building a Python application for watching prices. Using string manipulation to extract and clean data scraped from the product listing page.](https://docs.apify.com/academy/scraping-basics-python/extracting-data.md) ## [📄️ Saving data](https://docs.apify.com/academy/scraping-basics-python/saving-data.md) [Lesson about building a Python application for watching prices. Using standard library to save data scraped from product listing pages in popular formats such as CSV or JSON.](https://docs.apify.com/academy/scraping-basics-python/saving-data.md) ## [📄️ Getting links from HTML](https://docs.apify.com/academy/scraping-basics-python/getting-links.md) [Lesson about building a Python application for watching prices. Using the Beautiful Soup library to locate links to individual product pages.](https://docs.apify.com/academy/scraping-basics-python/getting-links.md) ## [📄️ Crawling websites](https://docs.apify.com/academy/scraping-basics-python/crawling.md) [Lesson about building a Python application for watching prices. Using the HTTPX library to follow links to individual product pages.](https://docs.apify.com/academy/scraping-basics-python/crawling.md) ## [📄️ Scraping product variants](https://docs.apify.com/academy/scraping-basics-python/scraping-variants.md) [Lesson about building a Python application for watching prices. Using browser DevTools to figure out how to extract product variants and exporting them as separate items.](https://docs.apify.com/academy/scraping-basics-python/scraping-variants.md) ## [📄️ Using a framework](https://docs.apify.com/academy/scraping-basics-python/framework.md) [Lesson about building a Python application for watching prices. Using the Crawlee framework to simplify creating a scraper.](https://docs.apify.com/academy/scraping-basics-python/framework.md) ## [📄️ Using a platform](https://docs.apify.com/academy/scraping-basics-python/platform.md) [Lesson about building a Python application for watching prices. Using the Apify platform to deploy a scraper.](https://docs.apify.com/academy/scraping-basics-python/platform.md) --- # Crawling websites with Python **In this lesson, we'll follow links to individual product pages. We'll use HTTPX to download them and BeautifulSoup to process them.** *** In previous lessons we've managed to download the HTML code of a single page, parse it with BeautifulSoup, and extract relevant data from it. We'll do the same now for each of the products. Thanks to the refactoring, we have functions ready for each of the tasks, so we won't need to repeat ourselves in our code. 
This is what you should see in your editor now: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json import csv from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() html_code = response.text return BeautifulSoup(html_code, "html.parser") def parse_product(product, base_url): title_element = product.select_one(".product-item__title") title = title_element.text.strip() url = urljoin(base_url, title_element["href"]) price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price return {"title": title, "min_price": min_price, "price": price, "url": url} def export_json(file, data): def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") json.dump(data, file, default=serialize, indent=2) def export_csv(file, data): fieldnames = list(data[0].keys()) writer = csv.DictWriter(file, fieldnames=fieldnames) writer.writeheader() for row in data: writer.writerow(row) listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) data.append(item) with open("products.json", "w") as file: export_json(file, data) with open("products.csv", "w") as file: export_csv(file, data) ``` ## Extracting vendor name[​](#extracting-vendor-name "Direct link to Extracting vendor name") Each product URL points to a so-called *product detail page*, or PDP. If we open one of the product URLs in the browser, e.g. the one about [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), we can see that it contains a vendor name, [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit), number of reviews, product images, product variants, stock availability, description, and perhaps more. ![Product detail page](/assets/images/pdp-5399b26afde645f8ab2426f24cef87b9.png) Depending on what's valuable for our use case, we can now use the same techniques as in previous lessons to extract any of the above. As a demonstration, let's scrape the vendor name. In browser DevTools, we can see that the HTML around the vendor name has the following structure: ``` Sony XBR-950G BRAVIA 4K HDR Ultra HD TV ... Sony SKU: SON-985594-XBR-65 ... 3 reviews ... ``` It looks like using a CSS selector to locate the element with the `product-meta__vendor` class, and then extracting its text, should be enough to get the vendor name as a string: ``` vendor = product_soup.select_one(".product-meta__vendor").text.strip() ``` But where do we put this line in our program? ## Crawling product detail pages[​](#crawling-product-detail-pages "Direct link to Crawling product detail pages") In the `data` loop we're already going through all the products. Let's expand it to include downloading the product detail page, parsing it, extracting the vendor's name, and adding it as a new key in the item's dictionary: ``` ... 
listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) product_soup = download(item["url"]) item["vendor"] = product_soup.select_one(".product-meta__vendor").text.strip() data.append(item) ... ``` If we run the program now, it'll take longer to finish since it's making 24 more HTTP requests. But in the end, it should produce exports with a new field containing the vendor's name: ``` [ { "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95", "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", "vendor": "JBL" }, { "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null, "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv", "vendor": "Sony" }, ... ] ``` ## Extracting price[​](#extracting-price "Direct link to Extracting price") Scraping the vendor's name is nice, but the main reason we started checking the detail pages in the first place was to figure out how to get a price for each product. From the product listing, we could only scrape the min price, and remember—we’re building a Python app to track prices! Looking at the [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv), it's clear that the listing only shows min prices, because some products have variants, each with a different price. And different stock availability. And different SKUs… ![Morpheus revealing the existence of product variants](/assets/images/variants-32d39ca999dadb22954e83b0dedd782a.png) In the next lesson, we'll scrape the product detail pages so that each product variant is represented as a separate item in our dataset. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape calling codes of African countries[​](#scrape-calling-codes-of-african-countries "Direct link to Scrape calling codes of African countries") This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to Wikipedia pages for all African states and territories. Follow each link and extract the *calling code* from the info table. Print the URL and the calling code for each country. Start with this URL: ``` https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa ``` Your program should print the following: ``` https://en.wikipedia.org/wiki/Algeria +213 https://en.wikipedia.org/wiki/Angola +244 https://en.wikipedia.org/wiki/Benin +229 https://en.wikipedia.org/wiki/Botswana +267 https://en.wikipedia.org/wiki/Burkina_Faso +226 https://en.wikipedia.org/wiki/Burundi None https://en.wikipedia.org/wiki/Cameroon +237 ... 
``` Hint: Locating cells in tables is sometimes easier if you know how to [navigate up](https://beautiful-soup-4.readthedocs.io/en/latest/index.html#going-up) in the HTML element soup. Solution ``` import httpx from bs4 import BeautifulSoup from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() return BeautifulSoup(response.text, "html.parser") def parse_calling_code(soup): for label in soup.select("th.infobox-label"): if label.text.strip() == "Calling code": data = label.parent.select_one("td.infobox-data") return data.text.strip() return None listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" listing_soup = download(listing_url) for name_cell in listing_soup.select(".wikitable tr td:nth-child(3)"): link = name_cell.select_one("a") country_url = urljoin(listing_url, link["href"]) country_soup = download(country_url) calling_code = parse_calling_code(country_soup) print(country_url, calling_code) ``` ### Scrape authors of F1 news articles[​](#scrape-authors-of-f1-news-articles "Direct link to Scrape authors of F1 news articles") This is a follow-up to an exercise from the previous lesson, so feel free to reuse your code. Scrape links to the Guardian's latest F1 news articles. For each article, follow the link and extract both the author's name and the article's title. Print the author's name and the title for all the articles. Start with this URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like this: ``` Daniel Harris: Sports quiz of the week: Johan Neeskens, Bond and airborne antics Colin Horgan: The NHL is getting its own Drive to Survive. But could it backfire? Reuters: US GP ticket sales ‘took off’ after Max Verstappen stopped winning in F1 Giles Richards: Liam Lawson gets F1 chance to replace Pérez alongside Verstappen at Red Bull PA Media: Lewis Hamilton reveals lifelong battle with depression after school bullying ... ``` Hints: * You can use [attribute selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) to select HTML elements based on their attribute values. * Sometimes a person authors the article, but other times it's contributed by a news agency. Solution ``` import httpx from bs4 import BeautifulSoup from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() return BeautifulSoup(response.text, "html.parser") def parse_author(article_soup): link = article_soup.select_one('aside a[rel="author"]') if link: return link.text.strip() address = article_soup.select_one('aside address') if address: return address.text.strip() return None listing_url = "https://www.theguardian.com/sport/formulaone" listing_soup = download(listing_url) for item in listing_soup.select("#maincontent ul li"): link = item.select_one("a") article_url = urljoin(listing_url, link["href"]) article_soup = download(article_url) title = article_soup.select_one("h1").text.strip() author = parse_author(article_soup) print(f"{author}: {title}") ``` --- # Extracting data from a web page with browser DevTools **In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.** *** In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data. Now how do we extract the data? 
## Finding product details[​](#finding-product-details "Direct link to Finding product details")

Previously, we've figured out how to save the subwoofer product card to a variable in the **Console**:

```
products = document.querySelectorAll('.product-item');
subwoofer = products[2];
```

The product details are within the element as text, so maybe if we extract the text, we could work out the individual values?

```
subwoofer.textContent;
```

That indeed outputs all the text, but in a form that would be hard to break down into relevant pieces.

![Printing text content of the parent element](/assets/images/devtools-extracting-text-67926f7b912635ebd5518af391019c91.png)

We'll first need to locate the relevant child elements and extract the data from each of them individually.

## Extracting title[​](#extracting-title "Direct link to Extracting title")

We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. Of those, `product-item__title` seems like a great choice for locating the element.

![Finding child elements](/assets/images/devtools-product-details-83a62a55f65911b057ec16c4c7cde7b6.png)

JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Among properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element:

```
title = subwoofer.querySelector('.product-item__title');
title.textContent;
```

Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title:

![Extracting product title](/assets/images/devtools-extracting-title-a40e8cdc7afe941acf105ada55a1ef1f.png)

## Extracting price[​](#extracting-price "Direct link to Extracting price")

To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. Both are `span` elements with the `price` class.

![Finding child elements](/assets/images/devtools-product-details-83a62a55f65911b057ec16c4c7cde7b6.png)

We could either rely on the fact that the sale price is likely to always be the one that is highlighted, or that it's always the first price. For now we'll rely on the latter and let `querySelector()` simply return the first result:

```
price = subwoofer.querySelector('.price');
price.textContent;
```

It works, but the price isn't alone in the result. Before we'd use such data, we'd need to do some **data cleaning**:

![Extracting product price](/assets/images/devtools-extracting-price-5886f33b1f72fe38581cfe8521154987.png)

But that's okay for now. We're just testing the waters, so that we have an idea of what our scraper will need to do. Once we get to extracting prices in Python, we'll figure out how to get the values as numbers.

In the next lesson, we'll start with our Python project. First, we'll figure out how to download the Sales page without a browser and make it accessible in a Python program.

***

## Exercises[​](#exercises "Direct link to Exercises")

These challenges are here to help you test what you’ve learned in this lesson.
Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Extract the price of IKEA's most expensive artificial plant[​](#extract-the-price-of-ikeas-most-expensive-artificial-plant "Direct link to Extract the price of IKEA's most expensive artificial plant") At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use JavaScript's [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number. Solution 1. Open the [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/). 2. Sort the products by price, from high to low, so the most expensive plant appears first in the listing. 3. Activate the element selection tool in your DevTools. 4. Click on the price of the first and most expensive plant. 5. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value. 6. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price. 7. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`. 8. Convert the price text into a number by executing `parseInt(price.textContent)`. 9. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek). ### Extract the name of the top wiki on Fandom Movies[​](#extract-the-name-of-the-top-wiki-on-fandom-movies "Direct link to Extract the name of the top wiki on Fandom Movies") On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selectors and HTML element manipulation in the **Console** to extract the name of the top wiki. Use JavaScript's [`trim()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/trim) method to remove white space around the name. ![Fandom\'s Movies page](/assets/images/devtools-exercise-fandom-2ca7a8be3f2ec2f4cb54aeb459a20591.png) Solution 1. Open the [Movies page](https://www.fandom.com/topics/movies). 2. Activate the element selection tool in your DevTools. 3. Click on the list item for the top Fandom wiki in the category. 4. Notice that it has a class `topic_explore-wikis__link`. 5. In the **Console**, execute `document.querySelector('.topic_explore-wikis__link')`. This returns the element representing the top list item. They use the selector only for the **Top Wikis** list, and because `document.querySelector()` returns the first matching element, you're almost done. 6. 
Save the element in a variable by executing `item = document.querySelector('.topic_explore-wikis__link')`. 7. Get the element's text without extra white space by executing `item.textContent.trim()`. At the time of writing, this returns `"Pixar Wiki"`. ### Extract details about the first post on Guardian's F1 news[​](#extract-details-about-the-first-post-on-guardians-f1-news "Direct link to Extract details about the first post on Guardian's F1 news") On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo. ![F1 news page](/assets/images/devtools-exercise-guardian2-387ac6e6789398edf8ec27dcf214d0a0.png) Solution 1. Open the [F1 news page](https://www.theguardian.com/sport/formulaone). 2. Activate the element selection tool in your DevTools. 3. Click on the first post. 4. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead. 5. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post. 6. Extract the post's title by executing `post.querySelector('h3').textContent`. 7. Extract the lead paragraph by executing `post.querySelector('span div').textContent`. 8. Extract the photo URL by executing `post.querySelector('img').src`. --- # Inspecting web pages with browser DevTools **In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.** *** A browser is the most complete tool for navigating websites. Scrapers are like automated browsers—and sometimes, they actually are automated browsers. The key difference? There's no user to decide where to go or eyes to see what's displayed. Everything has to be pre-programmed. All modern browsers provide developer tools, or *DevTools*, for website developers to debug their work. We'll use them to understand how websites are structured and identify the behavior our scraper needs to mimic. Here's the typical workflow for creating a scraper: 1. Inspect the target website in DevTools to understand its structure and determine how to extract the required data. 2. Translate those findings into code. 3. If the scraper fails due to overlooked edge cases or, over time, due to website changes, go back to step 1. Now let's spend some time figuring out what the detective work in step 1 is about. ## Opening DevTools[​](#opening-devtools "Direct link to Opening DevTools") Google Chrome is currently the most popular browser, and many others use the same core. That's why we'll focus on [Chrome DevTools](https://developer.chrome.com/docs/devtools) here. However, the steps are similar in other browsers, as Safari has its [Web Inspector](https://developer.apple.com/documentation/safari-developer-tools/web-inspector) and Firefox also has [DevTools](https://firefox-source-docs.mozilla.org/devtools-user/). Now let's peek behind the scenes of a real-world website—say, Wikipedia. We'll open Google Chrome and visit [wikipedia.org](https://www.wikipedia.org/). Then, let's press **F12**, or right-click anywhere on the page and select **Inspect**. 
![Wikipedia with Chrome DevTools open](/assets/images/devtools-wikipedia-912f0473b3c31f441ab1659205bd1e08.png)

Websites are built with three main technologies: HTML, CSS, and JavaScript. In the **Elements** tab, DevTools shows the HTML and CSS of the current page:

![Elements tab in Chrome DevTools](/assets/images/devtools-elements-tab-19ef2bf359464d39570f06f182dbc92e.png)

Screen adaptations

DevTools may appear differently depending on your screen size. For instance, on smaller screens, the CSS panel might move below the HTML elements panel instead of appearing in the right pane.

Think of [HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) elements as the frame that defines a page's structure. A basic HTML element includes an opening tag, a closing tag, and attributes. Here's an `article` element with an `id` attribute. It wraps `h1` and `p` elements, both containing text. Some text is emphasized using `em`.

```
<article id="...">
  <h1>First Level Heading</h1>
  <p>Paragraph with <em>emphasized text</em>.</p>
</article>
```

HTML, a markup language, describes how everything on a page is organized, how elements relate to each other, and what they mean. It doesn't define how elements should look—that's where [CSS](https://developer.mozilla.org/en-US/docs/Learn/CSS) comes in. CSS is like the velvet covering the frame. Using styles, we can select elements and assign rules that tell the browser how they should appear. For instance, we can style all elements with `heading` in their `class` attribute to make the text blue and uppercase.

```
.heading {
  color: blue;
  text-transform: uppercase;
}
```

While HTML and CSS describe what the browser should display, [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/JavaScript) is a general-purpose programming language that adds interaction to the page. In DevTools, the **Console** tab allows ad-hoc experimenting with JavaScript. If you don't see it, press `ESC` to toggle the Console. Running commands in the Console lets us manipulate the loaded page—we'll try this shortly.

![Console in Chrome DevTools](/assets/images/devtools-console-4ef1db79ae90e6fa619d2dc5ccc86b57.png)

## Selecting an element[​](#selecting-an-element "Direct link to Selecting an element")

In the top-left corner of DevTools, let's find the icon with an arrow pointing to a square.

![Chrome DevTools element selection tool](/assets/images/devtools-element-selection-58f754a14f1c856aae8960432f7ebe73.png)

We'll click the icon and hover our cursor over Wikipedia's subtitle, **The Free Encyclopedia**. As we move our cursor, DevTools will display information about the HTML element under it. We'll click on the subtitle. In the **Elements** tab, DevTools will highlight the HTML element that represents the subtitle.

![Chrome DevTools element hover](/assets/images/devtools-hover-c780c2944cc8718fc2131d83e1b0b1e3.png)

The highlighted section should look something like this:

```
<strong class="jsl10n localized-slogan" data-jsl10n="portal.slogan">The Free Encyclopedia</strong>
```

If we were experienced creators of scrapers, our eyes would immediately spot what's needed to make a program that fetches Wikipedia's subtitle. The program would need to download the page's source code, find a `strong` element with `localized-slogan` in its `class` attribute, and extract its text.

HTML and whitespace

In HTML, whitespace isn't significant, i.e., it only makes the code readable.
The following code snippets are equivalent:

```
<strong class="jsl10n localized-slogan" data-jsl10n="portal.slogan">The Free Encyclopedia</strong>
```

```
<strong class="jsl10n localized-slogan" data-jsl10n="portal.slogan">
  The Free Encyclopedia
</strong>
```

## Interacting with an element[​](#interacting-with-an-element "Direct link to Interacting with an element")

We won't be creating Python scrapers just yet. Let's first get familiar with what we can do in the JavaScript console and how we can further interact with HTML elements on the page.

In the **Elements** tab, with the subtitle element highlighted, let's right-click the element to open the context menu. There, we'll choose **Store as global variable**. The **Console** should appear, with a `temp1` variable ready.

![Global variable in Chrome DevTools Console](/assets/images/devtools-console-variable-e7bc489b7be25174922e4a1880217dba.png)

The Console allows us to run JavaScript in the context of the loaded page, similar to Python's [interactive REPL](https://realpython.com/interacting-with-python/). We can use it to play around with elements.

For a start, let's access some of the subtitle's properties. One such property is `textContent`, which contains the text inside the HTML element. The last line in the Console is where your cursor is. We'll type the following and hit **Enter**:

```
temp1.textContent;
```

The result should be `'The Free Encyclopedia'`. Now let's try this:

```
temp1.outerHTML;
```

This should return the element's entire HTML as a string. Finally, we'll run the next line to change the text of the element:

```
temp1.textContent = 'Hello World!';
```

When we change elements in the Console, those changes reflect immediately on the page!

![Changing textContent in Chrome DevTools Console](/assets/images/devtools-console-textcontent-24a921cedf18c995feac29a2120ad52e.png)

But don't worry—we haven't hacked Wikipedia. The change only happens in our browser. If we reload the page, the change will disappear. This, however, is an easy way to craft a screenshot with fake content. That's why screenshots shouldn't be trusted as evidence.

We're not here to play around with elements, though—we want to create a scraper for an e-commerce website to watch prices. In the next lesson, we'll examine the website and use CSS selectors to locate HTML elements containing the data we need.

***

## Exercises[​](#exercises "Direct link to Exercises")

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

Real world

You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues).

### Find FIFA logo[​](#find-fifa-logo "Direct link to Find FIFA logo")

Open the [FIFA website](https://www.fifa.com/) and use the DevTools to figure out the URL of FIFA's logo image file. Hint: You're looking for an [`img`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/img) element with a `src` attribute.

Solution

1. Go to [fifa.com](https://www.fifa.com/).
2. Activate the element selection tool.
3. Click on the logo.
4. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu.
5. In the console, type `temp1.src` and hit **Enter**.
![DevTools exercise result](/assets/images/devtools-exercise-fifa-b92e335eb6684698a954601ef59dcb0f.png) ### Make your own news[​](#make-your-own-news "Direct link to Make your own news") Open a news website, such as [CNN](https://cnn.com). Use the Console to change the headings of some articles. Solution 1. Go to [cnn.com](https://cnn.com). 2. Activate the element selection tool. 3. Click on a heading. 4. Send the highlighted element to the **Console** using the **Store as global variable** option from the context menu. 5. In the console, type `temp1.textContent = 'Something something'` and hit **Enter**. ![DevTools exercise result](/assets/images/devtools-exercise-cnn-0eb495bb32a2a11eb795e83096b65949.png) --- # Locating HTML elements on a web page with browser DevTools **In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.** *** Inspecting Wikipedia and tweaking its subtitle is fun, but let's shift gears and focus on building an app to track prices on an e-commerce site. As part of the groundwork, let's check out the site we'll be working with. ## Meeting the Warehouse store[​](#meeting-the-warehouse-store "Direct link to Meeting the Warehouse store") Instead of artificial scraping playgrounds or sandboxes, we'll scrape a real e-commerce site. Shopify, a major e-commerce platform, has a demo store at [warehouse-theme-metal.myshopify.com](https://warehouse-theme-metal.myshopify.com/). It strikes a good balance between being realistic and stable enough for a tutorial. Our scraper will track prices for all products listed on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). Balancing authenticity and stability Live sites like Amazon are complex, loaded with promotions, frequently changing, and equipped with anti-scraping measures. While those challenges are manageable, they're advanced topics. For this beginner course, we're sticking to a lightweight, stable environment. That said, we designed all the additional exercises to work with live websites. This means occasional updates might be needed, but we think it's worth it for a more authentic learning experience. ## Finding a product card[​](#finding-a-product-card "Direct link to Finding a product card") As mentioned in the previous lesson, before building a scraper, we need to understand structure of the target page and identify the specific elements our program should extract. Let's figure out how to select details for each product on the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). ![Warehouse store with DevTools open](/assets/images/devtools-warehouse-193f0152a0cd14df5068bc13512c31ee.png) The page displays a grid of product cards, each showing a product's title and picture. Let's open DevTools and locate the title of the **Sony SACS9 Active Subwoofer**. We'll highlight it in the **Elements** tab by clicking on it. ![Selecting an element with DevTools](/assets/images/devtools-product-title-29537d86966f0c3ae781b4cf8d53ef7e.png) Next, let's find all the elements containing details about this subwoofer—its price, number of reviews, image, and more. In the **Elements** tab, we'll move our cursor up from the `a` element containing the subwoofer's title. On the way, we'll hover over each element until we highlight the entire product card. Alternatively, we can use the arrow-up key. The `div` element we land on is the **parent element**, and all nested elements are its **child elements**. 
![Selecting an element with hover](/assets/images/devtools-hover-product-72db0f66037c498f4b84e2405cc5e80c.png)

At this stage, we could use the **Store as global variable** option to send the element to the **Console**. While helpful for manual inspection, this isn't something a program can do. Scrapers typically rely on [CSS selectors](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors) to locate elements on a page, and these selectors often target elements based on their `class` attributes. The product card we highlighted has markup like this:

```
<div class="product-item product-item--vertical 1/3--tablet-and-up 1/4--desk">
  ...
</div>
```

The `class` attribute can hold multiple values separated by whitespace. This particular element has four classes. Let's move to the **Console** and experiment with CSS selectors to locate this element.

## Programmatically locating a product card[​](#programmatically-locating-a-product-card "Direct link to Programmatically locating a product card")

Let's jump into the **Console** and write some JavaScript. Don't worry—we don't need to know the language, and yes, this is a helpful step on our journey to creating a scraper in Python.

In browsers, JavaScript represents the current page as the [`Document`](https://developer.mozilla.org/en-US/docs/Web/API/Document) object, accessible via `document`. This object offers many useful methods, including [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelector). This method takes a CSS selector as a string and returns the first HTML element that matches. We'll try typing this into the **Console**:

```
document.querySelector('.product-item');
```

It will return the HTML element for the first product card in the listing:

![Using querySelector() in DevTools Console](/assets/images/devtools-queryselector-803d0a68e972691c71ad6551e8ab27d6.webp)

CSS selectors can get quite complex, but the basics are enough to scrape most of the Warehouse store. Let's cover two simple types and how they can combine.

The [type selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Type_selectors) matches elements by tag name. For example, `h1` would match the highlighted element:

```
<h1>Title</h1>
<p>Paragraph.</p>
```

The [class selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Class_selectors) matches elements based on their class attribute. For instance, `.heading` (note the dot) would match the following:

```
<h1 class="heading">Title</h1>
<h2 class="heading">Subtitle</h2>
<p>Paragraph</p>
<p class="heading">Heading</p>
```

You can combine selectors to narrow results. For example, `p.lead` matches `p` elements with the `lead` class, but not `p` elements without the class or elements with the class but a different tag name:

```
<p class="lead">Lead paragraph.</p>
<p>Paragraph</p>
<div class="lead">Paragraph</div>
```

How did we know `.product-item` selects a product card? By inspecting the markup of the product card element. After checking its classes, we chose the one that best fit our purpose. Testing in the **Console** confirmed it—selecting by the most descriptive class worked.

## Choosing good selectors[​](#choosing-good-selectors "Direct link to Choosing good selectors")

Multiple approaches often exist for creating a CSS selector that targets the element we want. We should pick selectors that are simple, readable, unique, and semantically tied to the data. These are **resilient selectors**. They're the most reliable and likely to survive website updates. We better avoid randomly generated attributes like `class="F4jsL8"`, as they tend to change without warning.

The product card has four classes: `product-item`, `product-item--vertical`, `1/3--tablet-and-up`, and `1/4--desk`.
Only the first one checks all the boxes. A product card *is* a product item, after all. The others seem more about styling—defining how the element looks on the screen—and are probably tied to CSS rules. This class is also unique enough in the page's context. If it were something generic like `item`, there would be a higher risk that developers of the website might use it for unrelated elements. In the **Elements** tab, we can see a parent element `product-list` that contains all the product cards marked as `product-item`. This structure aligns with the data we're after. ![Overview of all the product cards in DevTools](/assets/images/devtools-product-list-9ebeb190d65fc7f7ae765caaa6eb128b.png) ## Locating all product cards[​](#locating-all-product-cards "Direct link to Locating all product cards") In the **Console**, hovering our cursor over objects representing HTML elements highlights the corresponding elements on the page. This way we can verify that when we query `.product-item`, the result represents the JBL Flip speaker—the first product card in the list. ![Highlighting a querySelector() result](/assets/images/devtools-hover-queryselector-747bedb6133e1d9919c10bdb0e6cc599.png) But what if we want to scrape details about the Sony subwoofer we inspected earlier? For that, we need a method that selects more than just the first match: [`querySelectorAll()`](https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll). As the name suggests, it takes a CSS selector string and returns all matching HTML elements. Let's type this into the **Console**: ``` document.querySelectorAll('.product-item'); ``` The returned value is a [`NodeList`](https://developer.mozilla.org/en-US/docs/Web/API/NodeList), a collection of nodes. Browsers understand an HTML document as a tree of nodes. Most nodes are HTML elements, but there are also text nodes for plain text, and others. We'll expand the result by clicking the small arrow, then hover our cursor over the third element in the list. Indexing starts at 0, so the third element is at index 2. There it is—the product card for the subwoofer! ![Highlighting a querySelectorAll() result](/assets/images/devtools-hover-queryselectorall-b2c49b9d01f9bf3a25ee797b0e652896.png) To save the subwoofer in a variable for further inspection, we can use index access with brackets, just like with Python lists (or JavaScript arrays): ``` products = document.querySelectorAll('.product-item'); subwoofer = products[2]; ``` Even though we're just playing with JavaScript in the browser's **Console**, we're inching closer to figuring out what our Python program will need to do. In the next lesson, we'll dive into accessing child elements and extracting product details. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). 
### Locate headings on Wikipedia's Main Page[​](#locate-headings-on-wikipedias-main-page "Direct link to Locate headings on Wikipedia's Main Page") On English Wikipedia's [Main Page](https://en.wikipedia.org/wiki/Main_Page), use CSS selectors in the **Console** to list the HTML elements representing headings of the colored boxes (including the grey ones). ![Wikipedia\'s Main Page headings](/assets/images/devtools-exercise-wikipedia-5d47de5c50985ec7cc87b3a220f9d14c.png) Solution 1. Open the [Main Page](https://en.wikipedia.org/wiki/Main_Page). 2. Activate the element selection tool in your DevTools. 3. Click on several headings to examine the markup. 4. Notice that all headings are `h2` elements with the `mp-h2` class. 5. In the **Console**, execute `document.querySelectorAll('h2')`. 6. At the time of writing, this selector returns 8 headings. Each corresponds to a box, and there are no other `h2` elements on the page. Thus, the selector is sufficient as is. ### Locate products on Shein[​](#locate-products-on-shein "Direct link to Locate products on Shein") Go to Shein's [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) category. In the **Console**, use CSS selectors to list all HTML elements representing the products. ![Products in Shein\'s Jewelry \& Accessories category](/assets/images/devtools-exercise-shein-e289fb63ac18c6aa8ea8ed48e6ffd805.png) Solution 1. Visit the [Jewelry & Accessories](https://shein.com/RecommendSelection/Jewelry-Accessories-sc-017291431.html) page. Close any pop-ups or promotions. 2. Activate the element selection tool in your DevTools. 3. Click on the first product to inspect its markup. Repeat with a few others. 4. Observe that all products are `section` elements with multiple classes, including `product-card`. 5. Since `section` is a generic wrapper, focus on the `product-card` class. 6. In the **Console**, execute `document.querySelectorAll('.product-card')`. 7. At the time of writing, this selector returns 120 results, all representing products. No further narrowing is necessary. ### Locate articles on Guardian[​](#locate-articles-on-guardian "Direct link to Locate articles on Guardian") Go to Guardian's [page about F1](https://www.theguardian.com/sport/formulaone). Use the **Console** to find all HTML elements representing the articles. Hint: Learn about the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator). ![Articles on Guardian\'s page about F1](/assets/images/devtools-exercise-guardian1-f3587204f6314c9a25a1955f94420ebc.png) Solution 1. Open the [page about F1](https://www.theguardian.com/sport/formulaone). 2. Activate the element selection tool in your DevTools. 3. Click on an article to inspect its structure. Check several articles, including the ones with smaller cards. 4. Note that all articles are `li` elements, but their classes (e.g., `dcr-1qmyfxi`) are dynamically generated and unreliable. 5. Using `document.querySelectorAll('li')` returns too many results, including unrelated items like navigation links. 6. Inspect the page structure. The `main` element contains the primary content, including articles. Use the descendant combinator to target `li` elements within `main`. 7. In the **Console**, execute `document.querySelectorAll('main li')`. 8. At the time of writing, this selector returns 21 results. All appear to represent articles, so the solution works! 
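A small aside to the last step, not required by the exercise: if you want to double-check how many elements a selector matches without expanding the result, the returned [`NodeList`](https://developer.mozilla.org/en-US/docs/Web/API/NodeList) has a `length` property you can read right in the **Console**:

```
document.querySelectorAll('main li').length;
```

The number it prints should match the count you see when expanding the result.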
---

# Downloading HTML with Python

**In this lesson we'll start building a Python application for watching prices. As a first step, we'll use the HTTPX library to download HTML code of a product listing page.**

***

Using browser tools for developers is crucial for understanding the structure of a particular page, but it's a manual task. Let's start building our first automation, a Python program which downloads HTML code of the product listing.

## Starting a Python project[​](#starting-a-python-project "Direct link to Starting a Python project")

Before we start coding, we need to set up a Python project. Let's create a new directory with a virtual environment. Inside the directory and with the environment activated, we'll install the HTTPX library:

```
$ pip install httpx
...
Successfully installed ... httpx-0.0.0
```

Installing packages

Being comfortable around Python project setup and installing packages is a prerequisite of this course, but if you wouldn't say no to a recap, we recommend the [Installing Packages](https://packaging.python.org/en/latest/tutorials/installing-packages/) tutorial from the official Python Packaging User Guide.

Now let's test that everything works. Inside the project directory we'll create a new file called `main.py` with the following code:

```
import httpx

print("OK")
```

Running it as a Python program will verify that our setup is okay and we've installed HTTPX:

```
$ python main.py
OK
```

Troubleshooting

If you see errors or for any other reason cannot run the code above, it means that your environment isn't set up correctly. We're sorry, but figuring out the issue is outside the scope of this course.

## Downloading product listing[​](#downloading-product-listing "Direct link to Downloading product listing")

Now onto coding! Let's change our code so it downloads the HTML of the product listing instead of printing `OK`. The [documentation of the HTTPX library](https://www.python-httpx.org/) provides us with examples of how to use it. Inspired by those, our code will look like this:

```
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)

print(response.text)
```

If we run the program now, it should print the downloaded HTML:

```
$ python main.py
<!doctype html>
...
<title>Sales</title>
...
```

Running `httpx.get(url)`, we made an HTTP request and received a response. It's not particularly useful yet, but it's a good start for our scraper.

Client and server, request and response

HTTP is a network protocol powering the internet. Understanding it well is an important foundation for successful scraping, but for this course, it's enough to know just the basic flow and terminology:

* HTTP is an exchange between two participants.
* The *client* sends a *request* to the *server*, which replies with a *response*.
* In our case, `main.py` is the client, and the technology running at `warehouse-theme-metal.myshopify.com` replies to our request as the server.

## Handling errors[​](#handling-errors "Direct link to Handling errors")

Websites can return various errors, such as when the server is temporarily down, is applying anti-scraping protections, or is simply buggy. In HTTP, each response has a three-digit *status code* that indicates whether it is an error or a success.

All status codes

If you've never worked with HTTP response status codes before, briefly scan their [full list](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to get at least a basic idea of what you might encounter.
For further education on the topic, we recommend [HTTP Cats](https://http.cat/) as a highly professional resource. A robust scraper skips or retries requests on errors. Given the complexity of this task, it's best to use libraries or frameworks. For now, we'll at least make sure that our program visibly crashes and prints what happened in case there's an error. First, let's ask for trouble. We'll change the URL in our code to a page that doesn't exist, so that we get a response with [status code 404](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404). This could happen, for example, when the product we are scraping is no longer available: ``` https://warehouse-theme-metal.myshopify.com/does/not/exist ``` We could check the value of `response.status_code` against a list of allowed numbers, but HTTPX already provides `response.raise_for_status()`, a method that analyzes the number and raises the `httpx.HTTPError` exception if our request wasn't successful: ``` import httpx url = "https://warehouse-theme-metal.myshopify.com/does/not/exist" response = httpx.get(url) response.raise_for_status() print(response.text) ``` If you run the code above, the program should crash: ``` $ python main.py Traceback (most recent call last): File "/Users/.../main.py", line 5, in response.raise_for_status() File "/Users/.../.venv/lib/python3/site-packages/httpx/_models.py", line 761, in raise_for_status raise HTTPStatusError(message, request=request, response=self) httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://warehouse-theme-metal.myshopify.com/does/not/exist' For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404 ``` Letting our program visibly crash on error is enough for our purposes. Now, let's return to our primary goal. In the next lesson, we'll be looking for a way to extract information about products from the downloaded HTML. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape AliExpress[​](#scrape-aliexpress "Direct link to Scrape AliExpress") Download HTML of a product listing page, but this time from a real world e-commerce website. For example this page with AliExpress search results: ``` https://www.aliexpress.com/w/wholesale-darth-vader.html ``` Solution ``` import httpx url = "https://www.aliexpress.com/w/wholesale-darth-vader.html" response = httpx.get(url) response.raise_for_status() print(response.text) ``` ### Save downloaded HTML as a file[​](#save-downloaded-html-as-a-file "Direct link to Save downloaded HTML as a file") Download HTML, then save it on your disk as a `products.html` file. You can use the URL we've been already playing with: ``` https://warehouse-theme-metal.myshopify.com/collections/sales ``` Solution Right in your Terminal or Command Prompt, you can create files by *redirecting output* of command line programs: ``` python main.py > products.html ``` If you want to use Python instead, it offers several ways how to create files. 
The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html): ``` import httpx from pathlib import Path url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() Path("products.html").write_text(response.text) ``` ### Download an image as a file[​](#download-an-image-as-a-file "Direct link to Download an image as a file") Download a product image, then save it on your disk as a file. While HTML is *textual* content, images are *binary*. You may want to scan through the [HTTPX QuickStart](https://www.python-httpx.org/quickstart/) for guidance. You can use this URL pointing to an image of a TV: ``` https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg ``` Solution Python offers several ways how to create files. The solution below uses [pathlib](https://docs.python.org/3/library/pathlib.html): ``` from pathlib import Path import httpx url = "https://warehouse-theme-metal.myshopify.com/cdn/shop/products/sonyxbr55front_f72cc8ff-fcd6-4141-b9cc-e1320f867785.jpg" response = httpx.get(url) response.raise_for_status() Path("tv.jpg").write_bytes(response.content) ``` --- # Extracting data from HTML with Python **In this lesson we'll finish extracting product data from the downloaded HTML. With help of basic string manipulation we'll focus on cleaning and correctly representing the product price.** *** Locating the right HTML elements is the first step of a successful data extraction, so it's no surprise that we're already close to having the data in the correct form. The last bit that still requires our attention is the price: ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00 ... ``` Let's summarize what stands in our way if we want to have it in our Python program as a number: * A dollar sign precedes the number, * the number contains decimal commas for better human readability, and * some prices start with `From`, which reveals there is a certain complexity in how the shop deals with prices. ## Representing price[​](#representing-price "Direct link to Representing price") The last bullet point is the most important to figure out before we start coding. We thought we'll be scraping numbers, but in the middle of our effort, we discovered that the price is actually a range. It's because some products have variants with different prices. Later in the course we'll get to crawling, i.e. following links and scraping data from more than just one page. That will allow us to get exact prices for all the products, but for now let's extract just what's in the listing. Ideally we'd go and discuss the problem with those who are about to use the resulting data. For their purposes, is the fact that some prices are just minimum prices important? What would be the most useful representation of the range for them? Maybe they'd tell us that it's okay if we just remove the `From` prefix? ``` price_text = product.select_one(".price").contents[-1] price = price_text.removeprefix("From ") ``` In other cases, they'd tell us the data must include the range. And in cases when we just don't know, the safest option is to include all the information we have and leave the decision on what's important to later stages. One approach could be having the exact and minimum prices as separate values. 
If we don't know the exact price, we leave it empty: ``` price_text = product.select_one(".price").contents[-1] if price_text.startswith("From "): min_price = price_text.removeprefix("From ") price = None else: min_price = price_text price = min_price ``` Built-in string methods If you're not proficient in Python's string methods, [.startswith()](https://docs.python.org/3/library/stdtypes.html#str.startswith) checks the beginning of a given string, and [.removeprefix()](https://docs.python.org/3/library/stdtypes.html#str.removeprefix) removes something from the beginning of a given string. The whole program would look like this: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text price_text = product.select_one(".price").contents[-1] if price_text.startswith("From "): min_price = price_text.removeprefix("From ") price = None else: min_price = price_text price = min_price print(title, min_price, price, sep=" | ") ``` ## Removing white space[​](#removing-white-space "Direct link to Removing white space") Often, the strings we extract from a web page start or end with some amount of whitespace, typically space characters or newline characters, which come from the [indentation](https://en.wikipedia.org/wiki/Indentation_\(typesetting\)#Indentation_in_programming) of the HTML tags. We call the operation of removing whitespace *stripping* or *trimming*, and it's so useful in many applications that programming languages and libraries include ready-made tools for it. Let's add Python's built-in [.strip()](https://docs.python.org/3/library/stdtypes.html#str.strip): ``` title = product.select_one(".product-item__title").text.strip() price_text = product.select_one(".price").contents[-1].strip() ``` Handling strings in Beautiful Soup Beautiful Soup offers several attributes when it comes to working with strings: * `.string`, which often is like `.text`, * `.strings`, which [returns a list of all nested textual nodes](https://beautiful-soup-4.readthedocs.io/en/latest/#strings-and-stripped-strings), * `.stripped_strings`, which does the same but with whitespace removed. These might be useful in some complex scenarios, but in our case, they won't make scraping the title or price any shorter or more elegant. ## Removing dollar sign and commas[​](#removing-dollar-sign-and-commas "Direct link to Removing dollar sign and commas") We got rid of the `From` and possible whitespace, but we still can't save the price as a number in our Python program: ``` >>> price = "$1,998.00" >>> float(price) Traceback (most recent call last): File "", line 1, in ValueError: could not convert string to float: '$1,998.00' ``` Interactive Python The demonstration above is inside the Python's [interactive REPL](https://realpython.com/interacting-with-python/). It's a useful playground where you can try how code behaves before you use it in your program. We need to remove the dollar sign and the decimal commas. 
For this type of cleaning, [regular expressions](https://docs.python.org/3/library/re.html) are often the best tool for the job, but in this case [`.replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace) is also sufficient:

```
price_text = (
    product
    .select_one(".price")
    .contents[-1]
    .strip()
    .replace("$", "")
    .replace(",", "")
)
```

## Representing money in programs[​](#representing-money-in-programs "Direct link to Representing money in programs")

Now we should be able to add `float()`, so that we have the prices not as text, but as numbers:

```
if price_text.startswith("From "):
    min_price = float(price_text.removeprefix("From "))
    price = None
else:
    min_price = float(price_text)
    price = min_price
```

Great! If only we hadn't overlooked an important pitfall called [floating-point error](https://en.wikipedia.org/wiki/Floating-point_error_mitigation). In short, computers save floating point numbers in a way that isn't always reliable:

```
>>> 0.1 + 0.2
0.30000000000000004
```

These errors are small and usually don't matter, but sometimes they can add up and cause unpleasant discrepancies. That's why it's typically best to avoid floating point numbers when working with money. Let's instead use the [`Decimal()`](https://docs.python.org/3/library/decimal.html) type from Python's standard library:

```
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()

    price_text = (
        product
        .select_one(".price")
        .contents[-1]
        .strip()
        .replace("$", "")
        .replace(",", "")
    )
    if price_text.startswith("From "):
        min_price = Decimal(price_text.removeprefix("From "))
        price = None
    else:
        min_price = Decimal(price_text)
        price = min_price

    print(title, min_price, price, sep=" | ")
```

If we run the code above, we have nice, clean data about all the products!

```
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
...
```

Well, not to spoil the excitement, but in its current form, the data isn't very useful. In the next lesson we'll save the product details to a file that data analysts can use or other programs can read.

***

## Exercises[​](#exercises "Direct link to Exercises")

These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself!

Real world

You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues).

### Scrape units on stock[​](#scrape-units-on-stock "Direct link to Scrape units on stock")

Change our scraper so that it extracts how many units of each product are in stock. Your program should print the following. Note the unit amounts at the end of each line:

```
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 672
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 77
Sony SACS9 10" Active Subwoofer | 7
Sony PS-HX500 Hi-Res USB Turntable | 15
Klipsch R-120SW Powerful Detailed Home Speaker - Unit | 0
Denon AH-C720 In-Ear Headphones | 236
...
``` Solution ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text.strip() units_text = ( product .select_one(".product-item__inventory") .text .removeprefix("In stock,") .removeprefix("Only") .removesuffix(" left") .removesuffix("units") .strip() ) if "Sold out" in units_text: units = 0 else: units = int(units_text) print(title, units, sep=" | ") ``` ### Use regular expressions[​](#use-regular-expressions "Direct link to Use regular expressions") Simplify the code from previous exercise. Use [regular expressions](https://docs.python.org/3/library/re.html) to parse the number of units. You can match digits using a range like `[0-9]` or by a special sequence `\d`. To match more characters of the same type you can use `+`. Solution ``` import re import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text.strip() units_text = product.select_one(".product-item__inventory").text if re_match := re.search(r"\d+", units_text): units = int(re_match.group()) else: units = 0 print(title, units, sep=" | ") ``` ### Scrape publish dates of F1 news[​](#scrape-publish-dates-of-f1-news "Direct link to Scrape publish dates of F1 news") Download Guardian's page with the latest F1 news and use Beautiful Soup to parse it. Print titles and publish dates of all the listed articles. This is the URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like the following. Note the dates at the end of each line: ``` Brad Pitt in the paddock: how F1 the Movie went deep to keep fans coming | Fri Jun 20 2025 Wolff hits out at Red Bull protest after Russell’s Canadian GP win | Tue Jun 17 2025 F1 the Movie review – spectacular macho melodrama handles Brad Pitt with panache | Tue Jun 17 2025 Hamilton reveals distress over ‘devastating’ groundhog accident at Canadian F1 GP | Mon Jun 16 2025 ... ``` Hints: * HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601. * Beautiful Soup gives you [access to attributes as if they were dictionary keys](https://beautiful-soup-4.readthedocs.io/en/latest/#attributes). * In Python you can create `datetime` objects using `datetime.fromisoformat()`, a [built-in method for parsing ISO 8601 strings](https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat). * To get the date, you can call `.strftime('%a %b %d %Y')` on `datetime` objects. 
Solution ``` import httpx from bs4 import BeautifulSoup from datetime import datetime url = "https://www.theguardian.com/sport/formulaone" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for article in soup.select("#maincontent ul li"): title = article.select_one("h3").text.strip() date_iso = article.select_one("time")["datetime"].strip() date = datetime.fromisoformat(date_iso) print(title, date.strftime('%a %b %d %Y'), sep=" | ") ``` --- # Using a scraping framework with Python **In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.** *** Before rewriting our code, let's point out several caveats in our current solution: * *Hard to maintain:* All the data we need from the listing page is also available on the product page. By scraping both, we have to maintain selectors for two HTML documents. Instead, we could scrape links from the listing page and process all data on the product pages. * *Slow:* The program runs sequentially, which is generously considerate toward the target website, but extremely inefficient. * *No logging:* The scraper gives no sense of progress, making it tedious to use. Debugging issues becomes even more frustrating without proper logs. * *Boilerplate code:* We implement downloading and parsing HTML, or exporting data to CSV, although we're not the first people to encounter and solve these problems. * *Prone to anti-scraping:* If the target website implemented anti-scraping measures, a bare-bones program like ours would stop working. * *Browser means rewrite:* We got lucky extracting variants. If the website didn't include a fallback, we might have had no choice but to spin up a browser instance and automate clicking on buttons. Such a change in the underlying technology would require a complete rewrite of our program. * *No error handling:* The scraper stops if it encounters issues. It should allow for skipping problematic products with warnings or retrying downloads when the website returns temporary errors. In this lesson, we'll tackle all the above issues while keeping the code concise thanks to a scraping framework. Why Crawlee and not Scrapy From the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter—not just because we're the company financing its development. We genuinely believe beginners to scraping will like it more, since it lets them create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints. ## Installing Crawlee[​](#installing-crawlee "Direct link to Installing Crawlee") When starting with the Crawlee framework, we first need to decide which approach to downloading and parsing we prefer. We want the one based on Beautiful Soup, so let's install the `crawlee` package with the `beautifulsoup` extra specified in brackets. The framework has a lot of dependencies, so expect the installation to take a while. ``` $ pip install crawlee[beautifulsoup] ... Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ... 
``` ## Running Crawlee[​](#running-crawlee "Direct link to Running Crawlee") Now let's use the framework to create a new version of our scraper. First, let's rename the `main.py` file to `oldmain.py`, so that we can keep peeking at the original implementation while working on the new one. Then, in the same project directory, we'll create a new, empty `main.py`. The initial content will look like this: ``` import asyncio from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): if title := context.soup.title: print(title.text.strip()) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) if __name__ == '__main__': asyncio.run(main()) ``` In the code, we do the following: 1. We import the necessary modules and define an asynchronous `main()` function. 2. Inside `main()`, we first create a crawler object, which manages the scraping process. In this case, it's a crawler based on Beautiful Soup. 3. Next, we define a nested asynchronous function called `handle_listing()`. It receives a `context` parameter, and Python type hints show it's of type `BeautifulSoupCrawlingContext`. Type hints help editors suggest what we can do with the object. 4. We use a Python decorator (the line starting with `@`) to register `handle_listing()` as the *default handler* for processing HTTP responses. 5. Inside the handler, we extract the page title from the `soup` object and print its text without whitespace. 6. At the end of the function, we run the crawler on a product listing URL and await its completion. 7. The last two lines ensure that if the file is executed directly, Python will properly run the `main()` function using its asynchronous event loop. Don't worry if some of this is new. We don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html), decorators, or type hints work. Let's stick to the practical side and observe what the program does when executed: ``` $ python main.py [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 0 │ │ requests_failed │ 0 │ │ retry_histogram │ [0] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ None │ │ requests_finished_per_minute │ 0 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.0 │ │ requests_total │ 0 │ │ crawler_runtime │ 0.010014 │ └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 Sales [crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish [BeautifulSoupCrawler] INFO Final request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 1 │ │ requests_failed │ 0 │ │ retry_histogram │ [1] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ 0.308998 │ │ requests_finished_per_minute │ 185 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.308998 │ │ requests_total │ 1 │ │ crawler_runtime │ 0.323721 │ └───────────────────────────────┴──────────┘ ``` If our previous scraper didn't give us any sense of progress, Crawlee feeds us with perhaps too much information for the purposes of a small program. Among all the logging, notice the line `Sales`. That's the page title! 
We managed to create a Crawlee scraper that downloads the product listing page, parses it with Beautiful Soup, extracts the title, and prints it. Advanced Python features You don't need to be an expert in asynchronous programming, decorators, or type hints to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/), [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/), and [Python Type Checking](https://realpython.com/python-type-checking/). ## Crawling product detail pages[​](#crawling-product-detail-pages "Direct link to Crawling product detail pages") The code now features advanced Python concepts, so it's less accessible to beginners, and the size of the program is about the same as if we worked without a framework. The tradeoff of using a framework is that primitive scenarios may become unnecessarily complex, while complex scenarios may become surprisingly primitive. As we rewrite the rest of the program, the benefits of using Crawlee will become more apparent. For example, it takes a single line of code to extract and follow links to products. Three more lines, and we have parallel processing of all the product detail pages: ``` import asyncio from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): await context.enqueue_links(label="DETAIL", selector=".product-list a.product-item__title") @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): print(context.request.url) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) if __name__ == '__main__': asyncio.run(main()) ``` First, it's necessary to inspect the page in browser DevTools to figure out the CSS selector that allows us to locate links to all the product detail pages. Then we can use the `enqueue_links()` method to find the links and add them to Crawlee's internal HTTP request queue. We tell the method to label all the requests as `DETAIL`. Below that, we give the crawler another asynchronous function, `handle_detail()`. We again inform the crawler that this function is a handler using a decorator, but this time it's not a default one. This handler will only take care of HTTP requests labeled as `DETAIL`. For now, all it does is print the request URL. If we run the code, we should see how Crawlee first downloads the listing page and then makes parallel requests to each of the detail pages, printing their URLs along the way: ``` $ python main.py [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ ... └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable ... 
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish [BeautifulSoupCrawler] INFO Final request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 25 │ │ requests_failed │ 0 │ │ retry_histogram │ [25] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ 0.349434 │ │ requests_finished_per_minute │ 318 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 8.735843 │ │ requests_total │ 25 │ │ crawler_runtime │ 4.713262 │ └───────────────────────────────┴──────────┘ ``` In the final stats, we can see that we made 25 requests (1 listing page + 24 product pages) in less than 5 seconds. Your numbers might differ, but regardless, it should be much faster than making the requests sequentially. ## Extracting data[​](#extracting-data "Direct link to Extracting data") The Beautiful Soup crawler provides handlers with the `context.soup` attribute, which contains the parsed HTML of the handled page. This is the same `soup` object we used in our previous program. Let's locate and extract the same data as before: ``` async def main(): ... @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), } print(item) ``` Fragile code The code above assumes the `.select_one()` call doesn't return `None`. If your editor checks types, it might even warn that `text` is not a known attribute of `None`. This isn't robust and could break, but in our program, that's fine. We expect the elements to be there, and if they're not, we'd rather the scraper break quickly—it's a sign something's wrong and needs fixing. Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper. The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page: ``` async def main(): ... @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), } print(item) ``` Finally, the variants. We can reuse the `parse_variant()` function as-is, and in the handler we'll again take inspiration from what we had in `main.py`. 
The full program will look like this: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), "variant_name": None, } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: print(item | parse_variant(variant)) else: print(item) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} if __name__ == '__main__': asyncio.run(main()) ``` If we run this scraper, we should get the same data for the 24 products as before. Crawlee has saved us a lot of effort by managing downloading, parsing, and parallelization. The code is also cleaner, with two separate and labeled handlers. Crawlee doesn't do much to help with locating and extracting the data—that part of the code remains almost the same, framework or not. This is because the detective work of finding and extracting the right data is the core value of custom scrapers. With Crawlee, we can focus on just that while letting the framework take care of everything else. ## Saving data[​](#saving-data "Direct link to Saving data") When we're at *letting the framework take care of everything else*, let's take a look at what it can do about saving data. As of now the product detail page handler prints each item as soon as the item is ready. Instead, we can push the item to Crawlee's default dataset: ``` async def main(): ... @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): price_text = ( ... ) item = { ... } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: await context.push_data(item | parse_variant(variant)) else: await context.push_data(item) ``` That's it! If we run the program now, there should be a `storage` directory alongside the `main.py` file. Crawlee uses it to store its internal state. If we go to the `storage/datasets/default` subdirectory, we'll see over 30 JSON files, each representing a single item. ![Single dataset item](/assets/images/dataset-item-de374828159ee4b4a61e17ea6cee3c07.png) We can also export all the items to a single file of our choice. We'll do it at the end of the `main()` function, after the crawler has finished scraping: ``` async def main(): ... await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) await crawler.export_data_csv(path='dataset.csv') ``` After running the scraper again, there should be two new files in your directory, `dataset.json` and `dataset.csv`, containing all the data. 
If we peek into the JSON file, it should have indentation. ## Logging[​](#logging "Direct link to Logging") Crawlee gives us stats about HTTP requests and concurrency, but we don't get much visibility into the pages we're crawling or the items we're saving. Let's add some custom logging: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): context.log.info("Looking for product detail pages") await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") @crawler.router.handler("DETAIL") async def handle_detail(context: BeautifulSoupCrawlingContext): context.log.info(f"Product detail page: {context.request.url}") price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), "variant_name": None, } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: context.log.info("Saving a product variant") await context.push_data(item | parse_variant(variant)) else: context.log.info("Saving a product") await context.push_data(item) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) crawler.log.info("Exporting data") await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) await crawler.export_data_csv(path='dataset.csv') def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} if __name__ == '__main__': asyncio.run(main()) ``` Depending on what we find helpful, we can tweak the logs to include more or less detail. The `context.log` or `crawler.log` objects are [standard Python loggers](https://docs.python.org/3/library/logging.html). If we compare `main.py` and `oldmain.py` now, it's clear we've cut at least 20 lines of code compared to the original program, even with the extra logging we've added. Throughout this lesson, we've introduced features to match the old scraper's functionality, but at each phase, the code remained clean and readable. Plus, we've been able to focus on what's unique to the website we're scraping and the data we care about. In the next lesson, we'll use a scraping platform to set up our application to run automatically every day. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). 
### Build a Crawlee scraper of F1 Academy drivers[​](#build-a-crawlee-scraper-of-f1-academy-drivers "Direct link to Build a Crawlee scraper of F1 Academy drivers") Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Academy) drivers listed on the official [Drivers](https://www.f1academy.com/Racing-Series/Drivers) page. Each item you push to Crawlee's default dataset should include the following data: * URL of the driver's f1academy.com page * Name * Team * Nationality * Date of birth (as a `date()` object) * Instagram URL If you export the dataset as JSON, it should look something like this: ``` [ { "url": "https://www.f1academy.com/Racing-Series/Drivers/29/Emely-De-Heus", "name": "Emely De Heus", "team": "MP Motorsport", "nationality": "Dutch", "dob": "2003-02-10", "instagram_url": "https://www.instagram.com/emely.de.heus/", }, { "url": "https://www.f1academy.com/Racing-Series/Drivers/28/Hamda-Al-Qubaisi", "name": "Hamda Al Qubaisi", "team": "MP Motorsport", "nationality": "Emirati", "dob": "2002-08-08", "instagram_url": "https://www.instagram.com/hamdaalqubaisi_official/", }, ... ] ``` Hints: * Use Python's `datetime.strptime(text, "%d/%m/%Y").date()` to parse dates in the `DD/MM/YYYY` format. Check out the [docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) for more details. * To locate the Instagram URL, use the attribute selector `a[href*='instagram']`. Learn more about attribute selectors in the [MDN docs](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors). Solution ``` import asyncio from datetime import datetime from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context: BeautifulSoupCrawlingContext): await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER") @crawler.router.handler("DRIVER") async def handle_driver(context: BeautifulSoupCrawlingContext): info = {} for row in context.soup.select(".common-driver-info li"): name = row.select_one("span").text.strip() value = row.select_one("h4").text.strip() info[name] = value detail = {} for row in context.soup.select(".driver-detail--cta-group a"): name = row.select_one("p").text.strip() value = row.select_one("h2").text.strip() detail[name] = value await context.push_data({ "url": context.request.url, "name": context.soup.select_one("h1").text.strip(), "team": detail["Team"], "nationality": info["Nationality"], "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(), "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"), }) await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"]) await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) if __name__ == '__main__': asyncio.run(main()) ``` ### Use Crawlee to find the ratings of the most popular Netflix films[​](#use-crawlee-to-find-the-ratings-of-the-most-popular-netflix-films "Direct link to Use Crawlee to find the ratings of the most popular Netflix films") The [Global Top 10](https://www.netflix.com/tudum/top10) page has a table listing the most popular Netflix films worldwide. Scrape the movie names from this page, then search for each movie on [IMDb](https://www.imdb.com/). Assume the first search result is correct and retrieve the film's rating. 
Each item you push to Crawlee's default dataset should include the following data: * URL of the film's IMDb page * Title * Rating If you export the dataset as JSON, it should look something like this: ``` [ { "url": "https://www.imdb.com/title/tt32368345/?ref_=fn_tt_tt_1", "title": "The Merry Gentlemen", "rating": "5.0/10" }, { "url": "https://www.imdb.com/title/tt32359447/?ref_=fn_tt_tt_1", "title": "Hot Frosty", "rating": "5.4/10" }, ... ] ``` To scrape IMDb data, you'll need to construct a `Request` object with the appropriate search URL for each movie title. The following code snippet gives you an idea of how to do this: ``` ... from urllib.parse import quote_plus async def main(): ... @crawler.router.default_handler async def handle_netflix_table(context: BeautifulSoupCrawlingContext): requests = [] for name_cell in context.soup.select(...): name = name_cell.text.strip() imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft" requests.append(Request.from_url(imdb_search_url, label="...")) await context.add_requests(requests) ... ... ``` When navigating to the first search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, letting you specify the max number of HTTP requests to enqueue. Solution ``` import asyncio from urllib.parse import quote_plus from crawlee import Request from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_netflix_table(context: BeautifulSoupCrawlingContext): requests = [] for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"): name = name_cell.text.strip() imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft" requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH")) await context.add_requests(requests) @crawler.router.handler("IMDB_SEARCH") async def handle_imdb_search(context: BeautifulSoupCrawlingContext): await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1) @crawler.router.handler("IMDB") async def handle_imdb(context: BeautifulSoupCrawlingContext): rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']" rating_text = context.soup.select_one(rating_selector).text.strip() await context.push_data({ "url": context.request.url, "title": context.soup.select_one("h1").text.strip(), "rating": rating_text, }) await crawler.run(["https://www.netflix.com/tudum/top10"]) await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) if __name__ == '__main__': asyncio.run(main()) ``` --- # Getting links from HTML with Python **In this lesson, we'll locate and extract links to individual product pages. We'll use BeautifulSoup to find the relevant bits of HTML.** *** The previous lesson concludes our effort to create a scraper. Our program now downloads HTML, locates and extracts data from the markup, and saves the data in a structured and reusable way. For some use cases, this is already enough! In other cases, though, scraping just one page is hardly useful. The data is spread across the website, over several pages. ## Crawling websites[​](#crawling-websites "Direct link to Crawling websites") We'll use a technique called crawling, i.e. following links to scrape multiple pages. The algorithm goes like this: 1. Visit the start URL. 2. Extract new URLs (and data), and save them. 3. 
Visit one of the newly found URLs and save data and/or more URLs from it. 4. Repeat steps 2 and 3 until you have everything you need. This will help us figure out the actual prices of products, as right now, for some, we're only getting the min price. Implementing the algorithm will require quite a few changes to our code, though. ## Restructuring code[​](#restructuring-code "Direct link to Restructuring code") Over the course of the previous lessons, the code of our program grew to almost 50 lines containing downloading, parsing, and exporting: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json import csv url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") data = [] for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text.strip() price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price data.append({"title": title, "min_price": min_price, "price": price}) def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") with open("products.json", "w") as file: json.dump(data, file, default=serialize) with open("products.csv", "w") as file: writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"]) writer.writeheader() for row in data: writer.writerow(row) ``` Let's introduce several functions to make the whole thing easier to digest. First, we can turn the beginning of our program into this `download()` function, which takes a URL and returns a `BeautifulSoup` instance: ``` def download(url): response = httpx.get(url) response.raise_for_status() html_code = response.text return BeautifulSoup(html_code, "html.parser") ``` Next, we can put parsing into a `parse_product()` function, which takes the product item element and returns the dictionary with data: ``` def parse_product(product): title = product.select_one(".product-item__title").text.strip() price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price return {"title": title, "min_price": min_price, "price": price} ``` Now the JSON export. For better readability of it, let's make a small change here and set the indentation level to two spaces: ``` def export_json(file, data): def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") json.dump(data, file, default=serialize, indent=2) ``` The last function we'll add will take care of the CSV export. We'll make a small change here as well. Having to specify the field names is not ideal. What if we add more field names in the parsing function? We'd always have to remember to go and edit the export function as well. If we could figure out the field names in place, we'd remove this dependency. 
One way would be to infer the field names from the dictionary keys of the first row: ``` def export_csv(file, data): fieldnames = list(data[0].keys()) writer = csv.DictWriter(file, fieldnames=fieldnames) writer.writeheader() for row in data: writer.writerow(row) ``` Fragile code The code above assumes the `data` variable contains at least one item, and that all the items have the same keys. This isn't robust and could break, but in our program, this isn't a problem, and omitting these corner cases allows us to keep the code examples more succinct. Now let's put it all together: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json import csv def download(url): response = httpx.get(url) response.raise_for_status() html_code = response.text return BeautifulSoup(html_code, "html.parser") def parse_product(product): title = product.select_one(".product-item__title").text.strip() price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price return {"title": title, "min_price": min_price, "price": price} def export_json(file, data): def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") json.dump(data, file, default=serialize, indent=2) def export_csv(file, data): fieldnames = list(data[0].keys()) writer = csv.DictWriter(file, fieldnames=fieldnames) writer.writeheader() for row in data: writer.writerow(row) listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product) data.append(item) with open("products.json", "w") as file: export_json(file, data) with open("products.csv", "w") as file: export_csv(file, data) ``` The program is much easier to read now. With the `parse_product()` function handy, we could also replace the convoluted loop with one that only takes up four lines of code. Refactoring We turned the whole program upside down, and at the same time, we didn't make any actual changes! This is [refactoring](https://en.wikipedia.org/wiki/Code_refactoring): improving the structure of existing code without changing its behavior. ![Refactoring](/assets/images/refactoring-269525c424ecc082de290a3ecfd4d9ce.gif) ## Extracting links[​](#extracting-links "Direct link to Extracting links") With everything in place, we can now start working on a scraper that also scrapes the product pages. For that, we'll need the links to those pages. Let's open the browser DevTools and remind ourselves of the structure of a single product item: ![Product card\'s child elements](/assets/images/child-elements-83a62a55f65911b057ec16c4c7cde7b6.png) Several methods exist for transitioning from one page to another, but the most common is a link element, which looks something like this: ``` <a href="https://example.com">Text of the link</a> ``` In DevTools, we can see that each product title is, in fact, also a link element. We already locate the titles, so that makes our task easier. We just need to edit the code so that it extracts not only the text of the element but also the `href` attribute. 
Beautiful Soup elements support accessing attributes as if they were dictionary keys: ``` def parse_product(product): title_element = product.select_one(".product-item__title") title = title_element.text.strip() url = title_element["href"] ... return {"title": title, "min_price": min_price, "price": price, "url": url} ``` In the previous code example, we've also added the URL to the dictionary returned by the function. If we run the scraper now, it should produce exports where each product contains a link to its product page: ``` [ { "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95", "url": "/products/jbl-flip-4-waterproof-portable-bluetooth-speaker" }, { "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null, "url": "/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv" }, ... ] ``` Hmm, but that isn't what we wanted! Where is the beginning of each URL? It turns out the HTML contains so-called *relative links*. ## Turning relative links into absolute[​](#turning-relative-links-into-absolute "Direct link to Turning relative links into absolute") Browsers reading the HTML know the base address and automatically resolve such links, but we'll have to do this manually. The function [`urljoin`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin) from Python's standard library will help us. Let's add it to our imports first: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json import csv from urllib.parse import urljoin ``` Next, we'll change the `parse_product()` function so that it also takes the base URL as an argument and then joins it with the relative URL to the product page: ``` def parse_product(product, base_url): title_element = product.select_one(".product-item__title") title = title_element.text.strip() url = urljoin(base_url, title_element["href"]) ... return {"title": title, "min_price": min_price, "price": price, "url": url} ``` Now we'll pass the base URL to the function in the main body of our program: ``` listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) data.append(item) ``` When we run the scraper now, we should see full URLs in our exports: ``` [ { "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95", "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker" }, { "title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null, "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv" }, ... ] ``` Ta-da! We've managed to get links leading to the product pages. In the next lesson, we'll crawl these URLs so that we can gather more details about the products in our dataset. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. 
If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape links to countries in Africa[​](#scrape-links-to-countries-in-africa "Direct link to Scrape links to countries in Africa") Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print links to Wikipedia pages of all the states and territories mentioned in all tables. Start with this URL: ``` https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa ``` Your program should print the following: ``` https://en.wikipedia.org/wiki/Algeria https://en.wikipedia.org/wiki/Angola https://en.wikipedia.org/wiki/Benin https://en.wikipedia.org/wiki/Botswana ... ``` Solution ``` import httpx from bs4 import BeautifulSoup from urllib.parse import urljoin listing_url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" response = httpx.get(listing_url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for name_cell in soup.select(".wikitable tr td:nth-child(3)"): link = name_cell.select_one("a") url = urljoin(listing_url, link["href"]) print(url) ``` ### Scrape links to F1 news[​](#scrape-links-to-f1-news "Direct link to Scrape links to F1 news") Download Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print links to all the listed articles. Start with this URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like the following: ``` https://www.theguardian.com/world/2024/sep/13/africa-f1-formula-one-fans-lewis-hamilton-grand-prix https://www.theguardian.com/sport/2024/sep/12/mclaren-lando-norris-oscar-piastri-team-orders-f1-title-race-max-verstappen https://www.theguardian.com/sport/article/2024/sep/10/f1-designer-adrian-newey-signs-aston-martin-deal-after-quitting-red-bull https://www.theguardian.com/sport/article/2024/sep/02/max-verstappen-damns-his-undriveable-monster-how-bad-really-is-it-and-why ... ``` Solution ``` import httpx from bs4 import BeautifulSoup from urllib.parse import urljoin listing_url = "https://www.theguardian.com/sport/formulaone" response = httpx.get(listing_url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for item in soup.select("#maincontent ul li"): link = item.select_one("a") url = urljoin(listing_url, link["href"]) print(url) ``` Note that some cards contain two links. One leads to the article, and one to the comments. If we selected all the links in the list by `#maincontent ul li a`, we would get incorrect output like this: ``` https://www.theguardian.com/sport/article/2024/sep/02/example https://www.theguardian.com/sport/article/2024/sep/02/example#comments ``` --- # Locating HTML elements with Python **In this lesson we'll locate product data in the downloaded HTML. We'll use BeautifulSoup to find those HTML elements which contain details about each product, such as title or price.** *** In the previous lesson we've managed to print text of the page's main heading or count how many products are in the listing. Let's combine those two. What happens if we print `.text` for each product card? 
``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): print(product.text) ``` Well, it definitely prints *something*… ``` $ python main.py Save $25.00 JBL JBL Flip 4 Waterproof Portable Bluetooth Speaker Black +7 Blue +6 Grey ... ``` To get details about each product in a structured way, we'll need a different approach. ## Locating child elements[​](#locating-child-elements "Direct link to Locating child elements") As in the browser DevTools lessons, we need to change the code so that it locates child elements for each product card. ![Product card\'s child elements](/assets/images/child-elements-83a62a55f65911b057ec16c4c7cde7b6.png) We should be looking for elements which have the `product-item__title` and `price` classes. We already know how that translates to CSS selectors: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): titles = product.select(".product-item__title") first_title = titles[0].text prices = product.select(".price") first_price = prices[0].text print(first_title, first_price) ``` Let's run the program now: ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker Sale price$74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV Sale priceFrom $1,398.00 ... ``` There's still some room for improvement, but it's already much better! ## Locating a single element[​](#locating-a-single-element "Direct link to Locating a single element") Often, we want to assume in our code that a certain element exists only once. It's a bit tedious to work with lists when you know you're looking for a single element. For this purpose, Beautiful Soup offers the `.select_one()` method. Like `document.querySelector()` in browser DevTools, it returns just one result or `None`. Let's simplify our code! ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text price = product.select_one(".price").text print(title, price) ``` This program does the same as the one we already had, but its code is more concise. Fragile code We assume that the selectors we pass to the `select()` or `select_one()` methods return at least one element. If they don't, calling `[0]` on an empty list or `.text` on `None` would crash the program. If you perform type checking on your Python program, the code examples above will trigger warnings about this. Not handling these cases allows us to keep the code examples more succinct. Additionally, if we expect the selectors to return elements but they suddenly don't, it usually means the website has changed since we wrote our scraper. Letting the program crash in such cases is a valid way to notify ourselves that we need to fix it. 
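If you ever do want the lookups to be more defensive, one option is to check for `None` before touching `.text`. The following is just an optional sketch of that idea, not something the rest of this lesson builds on:

```
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

for product in soup.select(".product-item"):
    title_element = product.select_one(".product-item__title")
    price_element = product.select_one(".price")

    # Skip product cards that don't have the structure we expect
    if title_element is None or price_element is None:
        print("Skipping a product card with unexpected structure")
        continue

    print(title_element.text, price_element.text)
```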
## Precisely locating price[​](#precisely-locating-price "Direct link to Precisely locating price") In the output we can see that the price isn't located precisely: ``` JBL Flip 4 Waterproof Portable Bluetooth Speaker Sale price$74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV Sale priceFrom $1,398.00 ... ``` For each product, our scraper also prints the text `Sale price`. Let's look at the HTML structure again. Each bit containing the price looks something like this: ``` <span class="price"> <span class="visually-hidden">Sale price</span> $74.95 </span> ``` When translated to a tree of Python objects, the element with class `price` will contain several *nodes*: * Textual node with white space, * a `span` HTML element, * a textual node representing the actual amount and possibly also white space. We can use Beautiful Soup's `.contents` property to access individual nodes. It returns a list of nodes like this: ``` ["\n", <span class="visually-hidden">Sale price</span>, "$74.95"] ``` It seems like we can read the last element to get the actual amount from a list like the above. Let's fix our program: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text price = product.select_one(".price").contents[-1] print(title, price) ``` If we run the scraper now, it should print only the amounts: ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker $74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV From $1,398.00 ... ``` ## Formatting output[​](#formatting-output "Direct link to Formatting output") The results seem to be correct, but they're hard to verify because the prices visually blend with the titles. Let's set a different separator for the `print()` function: ``` print(title, price, sep=" | ") ``` The output is much nicer this way: ``` $ python main.py JBL Flip 4 Waterproof Portable Bluetooth Speaker | $74.95 Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | From $1,398.00 ... ``` Great! We have managed to use CSS selectors and walk the HTML tree to get a list of product titles and prices. But wait a second—what's `From $1,398.00`? One does not simply scrape a price! We'll need to clean that. But that's a job for the next lesson, which is about extracting data. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape Wikipedia[​](#scrape-wikipedia "Direct link to Scrape Wikipedia") Download Wikipedia's page with the list of African countries, use Beautiful Soup to parse it, and print short English names of all the states and territories mentioned in all tables. This is the URL: ``` https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa ``` Your program should print the following: ``` Algeria Angola Benin Botswana ... 
``` Solution ``` import httpx from bs4 import BeautifulSoup url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for table in soup.select(".wikitable"): for row in table.select("tr"): cells = row.select("td") if cells: third_column = cells[2] title_link = third_column.select_one("a") print(title_link.text) ``` Because some rows contain [table headers](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th), we skip processing a row if `row.select("td")` doesn't find any [table data](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) cells. ### Use CSS selectors to their max[​](#use-css-selectors-to-their-max "Direct link to Use CSS selectors to their max") Simplify the code from the previous exercise. Use a single for loop and a single CSS selector. You may want to check out the following pages: * [Descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) * [`:nth-child()` pseudo-class](https://developer.mozilla.org/en-US/docs/Web/CSS/:nth-child) Solution ``` import httpx from bs4 import BeautifulSoup url = "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_in_Africa" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for name_cell in soup.select(".wikitable tr td:nth-child(3)"): print(name_cell.select_one("a").text) ``` ### Scrape F1 news[​](#scrape-f1-news "Direct link to Scrape F1 news") Download Guardian's page with the latest F1 news, use Beautiful Soup to parse it, and print titles of all the listed articles. This is the URL: ``` https://www.theguardian.com/sport/formulaone ``` Your program should print something like the following: ``` Wolff confident Mercedes are heading to front of grid after Canada improvement Frustrated Lando Norris blames McLaren team for missed chance Max Verstappen wins Canadian Grand Prix: F1 – as it happened ... ``` Solution ``` import httpx from bs4 import BeautifulSoup url = "https://www.theguardian.com/sport/formulaone" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") for title in soup.select("#maincontent ul li h3"): print(title.text) ``` --- # Parsing HTML with Python **In this lesson we'll look for products in the downloaded HTML. We'll use BeautifulSoup to turn the HTML into objects which we can work with in our Python program.** *** From lessons about browser DevTools we know that the HTML elements representing individual products have a `class` attribute which, among other values, contains `product-item`. ![Products have the ‘product-item’ class](/assets/images/product-item-b3160bf23207c2794e7c37a2e8517815.png) As a first step, let's try counting how many products are on the listing page. ## Processing HTML[​](#processing-html "Direct link to Processing HTML") After downloading, the entire HTML is available in our program as a string. We can print it to the screen or save it to a file, but not much more. However, since it's a string, could we use [string operations](https://docs.python.org/3/library/stdtypes.html#string-methods) or [regular expressions](https://docs.python.org/3/library/re.html) to count the products? While somewhat possible, such an approach is tedious, fragile, and unreliable. 
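To get a feel for why, here's a naive attempt at counting the products with plain string operations. It's only an illustration of the problem, not code we'll build on:

```
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

html_code = response.text

# Counts every occurrence of the substring, including class names such as
# "product-item__title", so the result is much higher than the actual
# number of product cards on the page
print(html_code.count("product-item"))
```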
To work with HTML, we need a robust tool dedicated to the task: an *HTML parser*. It takes text with HTML markup and turns it into a tree of Python objects. Why regex can't parse HTML While [Bobince's infamous StackOverflow answer](https://stackoverflow.com/a/1732454/325365) is funny, it doesn't explain much. In formal language theory, HTML's hierarchical and nested structure makes it a [context-free language](https://en.wikipedia.org/wiki/Context-free_language). Regular expressions match patterns in [regular languages](https://en.wikipedia.org/wiki/Regular_language), which are much simpler. This difference makes it hard for a regex to handle HTML's nested tags. HTML's complex syntax rules and various edge cases also add to the difficulty. We'll choose [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/) as our parser, as it's a popular library renowned for its ability to process even non-standard, broken markup. This is useful for scraping, because real-world websites often contain all sorts of errors and discrepancies. ``` $ pip install beautifulsoup4 ... Successfully installed beautifulsoup4-4.0.0 soupsieve-0.0 ``` Now let's use it for parsing the HTML. The `BeautifulSoup` object allows us to work with the HTML elements in a structured way. As a demonstration, we'll first get the `<h1>` element, which represents the main heading of the page. ![Element of the main heading](/assets/images/h1-c1c17bca77070ce481691298487e3952.png) We'll update our code to the following: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") print(soup.select("h1")) ``` Then let's run the program: ``` $ python main.py [<h1 class="collection__title heading h1">Sales</h1>] ``` Our code lists all `h1` elements it can find on the page. There's just one on this page, so in the result we can see a list with a single item. What if we want to print just the text? Let's change the end of the program to the following: ``` headings = soup.select("h1") first_heading = headings[0] print(first_heading.text) ``` If we run our scraper again, it prints the text of the first `h1` element: ``` $ python main.py Sales ``` Dynamic websites The Warehouse returns full HTML in its initial response, but many other sites add content via JavaScript after the page loads or after user interaction. In such cases, what we see in DevTools may differ from `response.text` in Python. Learn how to handle these scenarios in our [API Scraping](https://docs.apify.com/academy/api-scraping.md) and [Puppeteer & Playwright](https://docs.apify.com/academy/puppeteer-playwright.md) courses. ## Using CSS selectors[​](#using-css-selectors "Direct link to Using CSS selectors") Beautiful Soup's `.select()` method runs a *CSS selector* against a parsed HTML document and returns all the matching elements. It's like calling `document.querySelectorAll()` in browser DevTools. 
Scanning through [usage examples](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) will help us to figure out code for counting the product cards: ``` import httpx from bs4 import BeautifulSoup url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") products = soup.select(".product-item") print(len(products)) ``` In CSS, `.product-item` selects all elements whose `class` attribute contains value `product-item`. We call `soup.select()` with the selector and get back a list of matching elements. Beautiful Soup handles all the complexity of understanding the HTML markup for us. On the last line, we use `len()` to count how many items there is in the list. ``` $ python main.py 24 ``` That's it! We've managed to download a product listing, parse its HTML, and count how many products it contains. In the next lesson, we'll be looking for a way to extract detailed information about individual products. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues). ### Scrape F1 teams[​](#scrape-f1-teams "Direct link to Scrape F1 teams") Print a total count of F1 teams listed on this page: ``` https://www.formula1.com/en/teams ``` Solution ``` import httpx from bs4 import BeautifulSoup url = "https://www.formula1.com/en/teams" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") print(len(soup.select(".group"))) ``` ### Scrape F1 drivers[​](#scrape-f1-drivers "Direct link to Scrape F1 drivers") Use the same URL as in the previous exercise, but this time print a total count of F1 drivers. Solution ``` import httpx from bs4 import BeautifulSoup url = "https://www.formula1.com/en/teams" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") print(len(soup.select(".f1-team-driver-name"))) ``` --- # Using a scraping platform with Python **In this lesson, we'll deploy our application to a scraping platform that automatically runs it daily. We'll also use the platform's API to retrieve and work with the results.** *** Before starting with a scraping platform, let's highlight a few caveats in our current setup: * *User-operated:* We have to run the scraper ourselves. If we're tracking price trends, we'd need to remember to run it daily. And if we want alerts for big discounts, manually running the program isn't much better than just checking the site in a browser every day. * *No monitoring:* If we have a spare server or a Raspberry Pi lying around, we could use [cron](https://en.wikipedia.org/wiki/Cron) to schedule it. But even then, we'd have little insight into whether it ran successfully, what errors or warnings occurred, how long it took, or what resources it used. * *Manual data management:* Tracking prices over time means figuring out how to organize the exported data ourselves. 
* *Anti-scraping risks:* If the target website detects our scraper, they can rate-limit or block us. Sure, we could run it from a coffee shop's Wi-Fi, but eventually, they'd block that too—risking seriously annoying our barista.

In this lesson, we'll use a platform to address all of these issues. Generic cloud platforms like [GitHub Actions](https://github.com/features/actions) can work for simple scenarios. But platforms dedicated to scraping, like [Apify](https://apify.com/), offer extra features such as monitoring scrapers, managing retrieved data, and overcoming anti-scraping measures.

Why Apify

Scraping platforms come in many varieties, offering a wide range of tools and approaches. As the course authors, we're obviously biased toward Apify—we think it's both powerful and complete. That said, the main goal of this lesson is to show how deploying to *any platform* can make life easier. Plus, everything we cover here fits within [Apify's free tier](https://apify.com/pricing).

## Registering[​](#registering "Direct link to Registering")

First, let's [create a new Apify account](https://console.apify.com/sign-up). We'll go through a few checks to confirm we're human and our email is valid—annoying but necessary to prevent abuse of the platform.

Apify serves both as infrastructure where you can privately deploy and run your own scrapers, and as a marketplace where anyone can offer their ready-made scrapers to others for rent. But let's hold off on exploring the Apify Store for now.

## Getting access from the command line[​](#getting-access-from-the-command-line "Direct link to Getting access from the command line")

To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. On macOS, we can install the CLI using [Homebrew](https://brew.sh), otherwise we'll first need [Node.js](https://nodejs.org/en/download).

After following the [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation), we'll verify that we installed the tool by printing its version:

```
$ apify --version
apify-cli/0.0.0 system-arch00 node-v0.0.0
```

Now let's connect the CLI with the cloud platform using our account from the previous step:

```
$ apify login
...
Success: You are logged in to Apify as user1234!
```

## Starting a real-world project[​](#starting-a-real-world-project "Direct link to Starting a real-world project")

Until now, we've kept our scrapers simple, each with just a single Python module like `main.py`, and we've added dependencies only by installing them with `pip` inside a virtual environment. If we sent our code to a friend, they wouldn't know what to install to avoid import errors. The same goes for deploying to a cloud platform.

To share our project, we need to package it. The best way is following the official [Python Packaging User Guide](https://packaging.python.org/), but for this course, we'll take a shortcut with the Apify CLI. In our terminal, let's change to a directory where we usually start new projects. Then, we'll run the following command:

```
apify create warehouse-watchdog --template=python-crawlee-beautifulsoup
```

It will create a new subdirectory called `warehouse-watchdog` for the new project, containing all the necessary files: ``` Info: Python version 0.0.0 detected. Info: Creating a virtual environment in ... ... Success: Actor 'warehouse-watchdog' was created. To run it, run "cd warehouse-watchdog" and "apify run".
Info: To run your code in the cloud, run "apify push" and deploy your code to Apify Console. Info: To install additional Python packages, you need to activate the virtual environment in the ".venv" folder in the actor directory. ``` ## Adjusting the template[​](#adjusting-the-template "Direct link to Adjusting the template") Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing several Python files, including `main.py`. This is a sample Beautiful Soup scraper provided by the template. The file contains a single asynchronous function, `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then passes that input to a small crawler built on top of the Crawlee framework. Every program that runs on the Apify platform first needs to be packaged as a so-called [Actor](https://apify.com/actors)—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code. ![The expected file structure](/assets/images/actor-file-structure-8ca91fac30aeb65a0b69ece59f2609d0.webp) We'll now adjust the template so that it runs our program for watching prices. As the first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with final, unchanged code from the previous lesson: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler async def main(): crawler = BeautifulSoupCrawler() @crawler.router.default_handler async def handle_listing(context): context.log.info("Looking for product detail pages") await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") @crawler.router.handler("DETAIL") async def handle_detail(context): context.log.info(f"Product detail page: {context.request.url}") price_text = ( context.soup .select_one(".product-form__info-content .price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) item = { "url": context.request.url, "title": context.soup.select_one(".product-meta__title").text.strip(), "vendor": context.soup.select_one(".product-meta__vendor").text.strip(), "price": Decimal(price_text), "variant_name": None, } if variants := context.soup.select(".product-form__option.no-js option"): for variant in variants: context.log.info("Saving a product variant") await context.push_data(item | parse_variant(variant)) else: context.log.info("Saving a product") await context.push_data(item) await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"]) crawler.log.info("Exporting data") await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2) await crawler.export_data_csv(path='dataset.csv') def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} if __name__ == '__main__': asyncio.run(main()) ``` Now, let's replace the contents of `warehouse-watchdog/src/main.py` with this: ``` from apify import Actor from .crawler import main as crawl async def main(): async with Actor: await crawl() ``` We import our scraper as a function and await the result inside the Actor block. Unlike the sample scraper, the one we made in the previous lesson doesn't expect any input data, so we can omit the code that handles that part. 
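For completeness, if our scraper did accept input, the part we're omitting could look roughly like the sketch below. This is a hedged illustration, not code used in the rest of this course; it assumes the `start_urls` key from the template's default input schema and simply logs what it receives:

```
from apify import Actor

from .crawler import main as crawl


async def main():
    async with Actor:
        # Actor.get_input() returns the run's input as a dictionary
        # (or None when no input is provided, hence the fallback).
        input_data = await Actor.get_input() or {}
        # The template's "start_urls" input is a list of {"url": ...} objects.
        start_urls = [item["url"] for item in input_data.get("start_urls", [])]
        Actor.log.info(f"Received {len(start_urls)} start URLs")
        # Our crawler has its start URL hardcoded, so it ignores the input.
        await crawl()
```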
Next, we'll change to the `warehouse-watchdog` directory in our terminal and verify that everything works locally before deploying the project to the cloud: ``` $ apify run Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src [apify] INFO Initializing Actor... [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"}) [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 0 │ │ requests_failed │ 0 │ │ retry_histogram │ [0] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ None │ │ requests_finished_per_minute │ 0 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.0 │ │ requests_total │ 0 │ │ crawler_runtime │ 0.016736 │ └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 [BeautifulSoupCrawler] INFO Looking for product detail pages [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker [BeautifulSoupCrawler] INFO Saving a product variant [BeautifulSoupCrawler] INFO Saving a product variant ... ``` ## Updating the Actor configuration[​](#updating-the-actor-configuration "Direct link to Updating the Actor configuration") The Actor configuration from the template tells the platform to expect input, so we need to update that before running our scraper in the cloud. Inside `warehouse-watchdog`, there's a directory called `.actor`. Within it, we'll edit the `input_schema.json` file, which looks like this by default: ``` { "title": "Python Crawlee BeautifulSoup Scraper", "type": "object", "schemaVersion": 1, "properties": { "start_urls": { "title": "Start URLs", "type": "array", "description": "URLs to start with", "prefill": [ { "url": "https://apify.com" } ], "editor": "requestListSources" } }, "required": ["start_urls"] } ``` Hidden dot files On some systems, `.actor` might be hidden in the directory listing because it starts with a dot. Use your editor's built-in file explorer to locate it. We'll remove the expected properties and the list of required ones. After our changes, the file should look like this: ``` { "title": "Python Crawlee BeautifulSoup Scraper", "type": "object", "schemaVersion": 1, "properties": {} } ``` Trailing commas in JSON Make sure there's no trailing comma after `{}`, or the file won't be valid JSON. ## Deploying the scraper[​](#deploying-the-scraper "Direct link to Deploying the scraper") Now we can proceed to deployment: ``` $ apify push Info: Created Actor with name warehouse-watchdog on Apify. Info: Deploying Actor 'warehouse-watchdog' to Apify. Run: Updated version 0.0 for Actor warehouse-watchdog. Run: Building Actor warehouse-watchdog ... Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.0.1 ? Do you want to open the Actor detail in your browser? (Y/n) ``` After opening the link in our browser, assuming we're logged in, we should see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud. 
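Clicking through the web interface isn't the only way to work with the results. Once a run finishes, its items sit in a dataset that we can also read programmatically. A minimal, hedged sketch using the `apify-client` package (the token and the Actor reference are placeholders to replace with your own values):

```
from apify_client import ApifyClient

client = ApifyClient(token="YOUR_API_TOKEN")  # placeholder, not a real token

# Reference the Actor as "username/actor-name" and take its most recent run.
actor_client = client.actor("YOUR_USERNAME/warehouse-watchdog")
run_client = actor_client.last_run()

# Iterate over the items the run pushed to its default dataset.
for item in run_client.dataset().iterate_items():
    print(item["title"], item["price"])
```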
![Actor\'s detail page, screen Source, tab Input](/assets/images/actor-input-85254e66fa557db99c1229b45180c609.webp) When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more. ![Actor\'s detail page, screen Source, tab Output](/assets/images/actor-output-27017a6add46795ea603405961e45293.webp) Accessing data We don't need to click buttons to download the data. It's possible to retrieve it also using Apify's API, the `apify datasets` CLI command, or the Python SDK. Learn more in the [Dataset docs](https://docs.apify.com/platform/storage/dataset). ## Running the scraper periodically[​](#running-the-scraper-periodically "Direct link to Running the scraper periodically") Now that our scraper is deployed, let's automate its execution. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules). Let's click **Create new**, review the periodicity (default: daily), and specify the Actor to run. Then we'll click **Enable**—that's it! From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, [monitor stats and charts](https://docs.apify.com/platform/monitoring), and even set up alerts. ![Schedule detail page](/assets/images/actor-schedule-517425bbd37b0c4989a5324ccf9b4ec7.webp) ## Adding support for proxies[​](#adding-support-for-proxies "Direct link to Adding support for proxies") If monitoring shows that our scraper frequently fails to reach the Warehouse Shop website, it's likely being blocked. To avoid this, we can [configure proxies](https://docs.apify.com/platform/proxy) so our requests come from different locations, reducing the chances of detection and blocking. Proxy configuration is a type of Actor input, so let's start by reintroducing the necessary code. We'll update `warehouse-watchdog/src/main.py` like this: ``` from apify import Actor from .crawler import main as crawl async def main(): async with Actor: input_data = await Actor.get_input() if actor_proxy_input := input_data.get("proxyConfig"): proxy_config = await Actor.create_proxy_configuration(actor_proxy_input=actor_proxy_input) else: proxy_config = None await crawl(proxy_config) ``` Next, we'll add `proxy_config` as an optional parameter in `warehouse-watchdog/src/crawler.py`. Thanks to the built-in integration between Apify and Crawlee, we only need to pass it to `BeautifulSoupCrawler()`, and the class will handle the rest: ``` import asyncio from decimal import Decimal from crawlee.crawlers import BeautifulSoupCrawler async def main(proxy_config = None): crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config) crawler.log.info(f"Using proxy: {'yes' if proxy_config else 'no'}") @crawler.router.default_handler async def handle_listing(context): context.log.info("Looking for product detail pages") await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL") ... 
``` Finally, we'll modify the Actor configuration in `warehouse-watchdog/src/.actor/input_schema.json` to include the `proxyConfig` input parameter: ``` { "title": "Python Crawlee BeautifulSoup Scraper", "type": "object", "schemaVersion": 1, "properties": { "proxyConfig": { "title": "Proxy config", "description": "Proxy configuration", "type": "object", "editor": "proxy", "prefill": { "useApifyProxy": true, "apifyProxyGroups": [] }, "default": { "useApifyProxy": true, "apifyProxyGroups": [] } } } } ``` To verify everything works, we'll run the scraper locally. We'll use the `apify run` command again, but this time with the `--purge` option to ensure we're not reusing data from a previous run: ``` $ apify run --purge Info: All default local stores were purged. Run: /Users/course/Projects/warehouse-watchdog/.venv/bin/python3 -m src [apify] INFO Initializing Actor... [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"}) [BeautifulSoupCrawler] INFO Using proxy: no [BeautifulSoupCrawler] INFO Current request statistics: ┌───────────────────────────────┬──────────┐ │ requests_finished │ 0 │ │ requests_failed │ 0 │ │ retry_histogram │ [0] │ │ request_avg_failed_duration │ None │ │ request_avg_finished_duration │ None │ │ requests_finished_per_minute │ 0 │ │ requests_failed_per_minute │ 0 │ │ request_total_duration │ 0.0 │ │ requests_total │ 0 │ │ crawler_runtime │ 0.014976 │ └───────────────────────────────┴──────────┘ [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 [BeautifulSoupCrawler] INFO Looking for product detail pages [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker [BeautifulSoupCrawler] INFO Saving a product variant [BeautifulSoupCrawler] INFO Saving a product variant ... ``` In the logs, we should see `Using proxy: no`, because local runs don't include proxy settings. All requests will be made from our own location, just as before. Now, let's update the cloud version of our scraper with `apify push`: ``` $ apify push Info: Deploying Actor 'warehouse-watchdog' to Apify. Run: Updated version 0.0 for Actor warehouse-watchdog. Run: Building Actor warehouse-watchdog (timestamp) ACTOR: Found input schema referenced from .actor/actor.json ... ? Do you want to open the Actor detail in your browser? (Y/n) ``` Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. We should see the new **Proxy config** option, which defaults to **Datacenter - Automatic**. ![Actor\'s detail page, screen Source, tab Input with proxies](/assets/images/actor-input-proxies-f7747e47adb97be0c3909f8b8870324f.webp) We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform: ``` (timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository. (timestamp) ACTOR: Creating Docker container. (timestamp) ACTOR: Starting Docker container. (timestamp) [apify] INFO Initializing Actor... 
(timestamp) [apify] INFO System info ({"apify_sdk_version": "0.0.0", "apify_client_version": "0.0.0", "crawlee_version": "0.0.0", "python_version": "0.0.0", "os": "xyz"}) (timestamp) [BeautifulSoupCrawler] INFO Using proxy: yes (timestamp) [BeautifulSoupCrawler] INFO Current request statistics: (timestamp) ┌───────────────────────────────┬──────────┐ (timestamp) │ requests_finished │ 0 │ (timestamp) │ requests_failed │ 0 │ (timestamp) │ retry_histogram │ [0] │ (timestamp) │ request_avg_failed_duration │ None │ (timestamp) │ request_avg_finished_duration │ None │ (timestamp) │ requests_finished_per_minute │ 0 │ (timestamp) │ requests_failed_per_minute │ 0 │ (timestamp) │ request_total_duration │ 0.0 │ (timestamp) │ requests_total │ 0 │ (timestamp) │ crawler_runtime │ 0.036449 │ (timestamp) └───────────────────────────────┴──────────┘ (timestamp) [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0 (timestamp) [crawlee.storages._request_queue] INFO The queue still contains requests locked by another client (timestamp) [BeautifulSoupCrawler] INFO Looking for product detail pages (timestamp) [BeautifulSoupCrawler] INFO Product detail page: https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker (timestamp) [BeautifulSoupCrawler] INFO Saving a product variant ... ```

## Congratulations\![​](#congratulations "Direct link to Congratulations!")

We've reached the end of the course—congratulations! Together, we've built a program that:

* Crawls a shop and extracts product and pricing data.
* Exports the results in several formats.
* Uses concise code, thanks to a scraping framework.
* Runs on a cloud platform with monitoring and alerts.
* Executes periodically without manual intervention, collecting data over time.
* Uses proxies to avoid being blocked.

We hope this serves as a solid foundation for your next scraping project. Perhaps you'll even [start publishing scrapers](https://docs.apify.com/platform/actors/publishing) for others to use—for a fee?

---

# Saving data with Python

**In this lesson, we'll save the data we scraped in popular formats, such as CSV or JSON. We'll use Python's standard library to export the files.**

***

We managed to scrape data about products and print it, with each product separated by a new line and each field separated by the `|` character. This already produces structured text that can be parsed, i.e., read programmatically.

```
$ python main.py
JBL Flip 4 Waterproof Portable Bluetooth Speaker | 74.95 | 74.95
Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | 1398.00 | None
...
```

However, the format of this text is rather *ad hoc* and does not adhere to any specific standard that others could follow. It's unclear what to do if a product title already contains the `|` character or how to represent multi-line product descriptions. No ready-made library can handle all the parsing. We should use widely popular formats that have well-defined solutions for all the corner cases and that other programs can read without much effort. Two such formats are CSV (*Comma-separated values*) and JSON (*JavaScript Object Notation*).

## Collecting data[​](#collecting-data "Direct link to Collecting data")

Producing results line by line is an efficient approach to handling large datasets, but to simplify this lesson, we'll store all our data in one variable.
This'll take three changes to our program: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal url = "https://warehouse-theme-metal.myshopify.com/collections/sales" response = httpx.get(url) response.raise_for_status() html_code = response.text soup = BeautifulSoup(html_code, "html.parser") data = [] for product in soup.select(".product-item"): title = product.select_one(".product-item__title").text.strip() price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price data.append({"title": title, "min_price": min_price, "price": price}) print(data) ``` Before looping over the products, we prepare an empty list. Then, instead of printing each line, we append the data of each product to the list in the form of a Python dictionary. At the end of the program, we print the entire list at once. ``` $ python main.py [{'title': 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', 'min_price': Decimal('74.95'), 'price': Decimal('74.95')}, {'title': 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', 'min_price': Decimal('1398.00'), 'price': None}, ...] ``` Pretty print If you find the complex data structures printed by `print()` difficult to read, try using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) from the `pprint` module instead. ## Saving data as JSON[​](#saving-data-as-json "Direct link to Saving data as JSON") The JSON format is popular primarily among developers. We use it for storing data, configuration files, or as a way to transfer data between programs (e.g., APIs). Its origin stems from the syntax of objects in the JavaScript programming language, which is similar to the syntax of Python dictionaries. In Python, we can read and write JSON using the [`json`](https://docs.python.org/3/library/json.html) standard library module. We'll begin with imports: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json ``` Next, instead of printing the data, we'll finish the program by exporting it to JSON. Let's replace the line `print(data)` with the following: ``` with open("products.json", "w") as file: json.dump(data, file) ``` That's it! If we run the program now, it should also create a `products.json` file in the current working directory: ``` $ python main.py Traceback (most recent call last): ... raise TypeError(f'Object of type {o.__class__.__name__} ' TypeError: Object of type Decimal is not JSON serializable ``` Ouch! JSON supports integers and floating-point numbers, but there's no guidance on how to handle `Decimal`. To maintain precision, it's common to store monetary values as strings in JSON files. But this is a convention, not a standard, so we need to handle it manually. 
We'll pass a custom function to `json.dump()` to serialize objects that it can't handle directly:

```
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)
```

If we run our scraper now, it won't display any output, but it will create a `products.json` file in the current working directory, which contains all the data about the listed products:

```
[{"title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95"}, {"title": "Sony XBR-950G BRAVIA 4K HDR Ultra HD TV", "min_price": "1398.00", "price": null}, ...]
```

If you skim through the data, you'll notice that the `json.dump()` function handled some potential issues, such as escaping double quotes found in one of the titles by adding a backslash:

```
{"title": "Sony SACS9 10\" Active Subwoofer", "min_price": "158.00", "price": "158.00"}
```

Pretty JSON

While a compact JSON file without any whitespace is efficient for computers, it can be difficult for humans to read. You can pass `indent=2` to `json.dump()` for prettier output. Also, if your data contains non-English characters, set `ensure_ascii=False`. By default, Python encodes everything except [ASCII](https://en.wikipedia.org/wiki/ASCII), which means it would save [Bún bò Nam Bô](https://vi.wikipedia.org/wiki/B%C3%BAn_b%C3%B2_Nam_B%E1%BB%99) as `B\u00fan b\u00f2 Nam B\u00f4`.

## Saving data as CSV[​](#saving-data-as-csv "Direct link to Saving data as CSV")

The CSV format is popular among data analysts because a wide range of tools can import it, including spreadsheet apps like LibreOffice Calc, Microsoft Excel, Apple Numbers, and Google Sheets.

In Python, we can read and write CSV using the [`csv`](https://docs.python.org/3/library/csv.html) standard library module. First, let's try something small in Python's interactive REPL to familiarize ourselves with the basic usage:

```
>>> import csv
>>> with open("data.csv", "w") as file:
...     writer = csv.DictWriter(file, fieldnames=["name", "age", "hobbies"])
...     writer.writeheader()
...     writer.writerow({"name": "Alice", "age": 24, "hobbies": "kickbox, Python"})
...     writer.writerow({"name": "Bob", "age": 42, "hobbies": "reading, TypeScript"})
...
```

We first opened a new file for writing and created a `DictWriter()` instance with the expected field names. We instructed it to write the header row first and then added two more rows containing actual data. The code produced a `data.csv` file in the same directory where we're running the REPL. It has the following contents:

```
name,age,hobbies
Alice,24,"kickbox, Python"
Bob,42,"reading, TypeScript"
```

In the CSV format, if a value contains commas, we should enclose it in quotes. When we open the file in a text editor of our choice, we can see that the writer automatically handled this.

When browsing the directory on macOS, we can see a nice preview of the file's contents, which proves that the file is correct and that other programs can read it. If you're using a different operating system, try opening the file with any spreadsheet program you have.

![CSV example preview](/assets/images/csv-example-370c0acb3f87d0f1598505b6f595f35a.png)

Now that's nice, but we didn't want Alice, Bob, kickbox, or TypeScript. What we actually want is a CSV containing `Sony XBR-950G BRAVIA 4K HDR Ultra HD TV`, right? Let's do this!
First, let's add `csv` to our imports:

```
import httpx
from bs4 import BeautifulSoup
from decimal import Decimal
import json
import csv
```

Next, let's add one more data export to the end of the source code of our scraper:

```
def serialize(obj):
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError("Object not JSON serializable")

with open("products.json", "w") as file:
    json.dump(data, file, default=serialize)

with open("products.csv", "w") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "min_price", "price"])
    writer.writeheader()
    for row in data:
        writer.writerow(row)
```

The program should now also produce a CSV file with the following content:

![CSV preview](/assets/images/csv-5b03f7c2a2b8a1f19f48f71ab832054d.png)

We've built a Python application that downloads a product listing, parses the data, and saves it in a structured format for further use. But the data still has gaps: for some products, we only have the min price, not the actual prices. In the next lesson, we'll attempt to scrape more details from all the product pages.

***

## Exercises[​](#exercises "Direct link to Exercises")

In this lesson, we created export files in two formats. The following challenges are designed to help you empathize with the people who'd be working with them.

### Process your JSON[​](#process-your-json "Direct link to Process your JSON")

Write a new Python program that reads `products.json`, finds all products with a min price greater than $500, and prints each one using [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp).

Solution

```
import json
from pprint import pp
from decimal import Decimal

with open("products.json", "r") as file:
    products = json.load(file)

for product in products:
    if Decimal(product["min_price"]) > 500:
        pp(product)
```

### Process your CSV[​](#process-your-csv "Direct link to Process your CSV")

Open the `products.csv` file we created in the lesson using a spreadsheet application. Then, in the app, find all products with a min price greater than $500.

Solution

Let's use [Google Sheets](https://www.google.com/sheets/about/), which is free to use. After logging in with a Google account:

1. Go to **File > Import**, choose **Upload**, and select the file. Import the data using the default settings. You should see a table with all the data.
2. Select the header row. Go to **Data > Create filter**.
3. Use the filter icon that appears next to `min_price`. Choose **Filter by condition**, select **Greater than**, and enter **500** in the text field. Confirm the dialog. You should see only the filtered data.

![CSV in Google Sheets](/assets/images/csv-sheets-898267aa223628c99e056cb8ce345dcd.png)

---

# Scraping product variants with Python

**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.**

***

We'll need to figure out how to extract variants from the product detail page, and then change how we add items to the data list so we can add multiple items after scraping one product URL.

## Locating variants[​](#locating-variants "Direct link to Locating variants")

First, let's extract information about the variants. If we go to [Sony XBR-950G BRAVIA](https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv) and open the DevTools, we can see that the buttons for switching between variants look like this:

```
55"
65"
```

Nice! We can extract the variant names, but we also need to extract the price for each variant.
Switching the variants using the buttons shows us that the HTML changes dynamically. This means the page uses JavaScript to display this information. ![Switching variants](/assets/images/variants-js-304898fbc90d9c188535e88eae04ba1d.gif) If we can't find a workaround, we'd need our scraper to run JavaScript. That's not impossible. Scrapers can spin up their own browser instance and automate clicking on buttons, but it's slow and resource-intensive. Ideally, we want to stick to plain HTTP requests and Beautiful Soup as much as possible. After a bit of detective work, we notice that not far below the `block-swatch-list` there's also a block of HTML with a class `no-js`, which contains all the data! ``` Variant 55" - $1,398.00 65" - $2,198.00 ``` These elements aren't visible to regular visitors. They're there just in case JavaScript fails to work, otherwise they're hidden. This is a great find because it allows us to keep our scraper lightweight. ## Extracting variants[​](#extracting-variants "Direct link to Extracting variants") Using our knowledge of Beautiful Soup, we can locate the options and extract the data we need: ``` ... listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) product_soup = download(item["url"]) vendor = product_soup.select_one(".product-meta__vendor").text.strip() if variants := product_soup.select(".product-form__option.no-js option"): for variant in variants: data.append(item | {"variant_name": variant.text.strip()}) else: item["variant_name"] = None data.append(item) ... ``` The CSS selector `.product-form__option.no-js` matches elements with both `product-form__option` and `no-js` classes. Then we're using the [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator) to match all `option` elements somewhere inside the `.product-form__option.no-js` wrapper. Python dictionaries are mutable, so if we assigned the variant with `item["variant_name"] = ...`, we'd always overwrite the values. Instead of saving an item for each variant, we'd end up with the last variant repeated several times. To avoid this, we create a new dictionary for each variant and merge it with the `item` data before adding it to `data`. If we don't find any variants, we add the `item` as is, leaving the `variant_name` key empty. Modern Python syntax Since Python 3.8, you can use `:=` to simplify checking if an assignment resulted in a non-empty value. It's called an *assignment expression* or *walrus operator*. You can learn more about it in the [docs](https://docs.python.org/3/reference/expressions.html#assignment-expressions) or in the [proposal document](https://peps.python.org/pep-0572/). Since Python 3.9, you can use `|` to merge two dictionaries. If the [docs](https://docs.python.org/3/library/stdtypes.html#dict) aren't clear enough, check out the [proposal document](https://peps.python.org/pep-0584/) for more details. If we run the program now, we'll see 34 items in total. Some items don't have variants, so they won't have a variant name. However, they should still have a price set—our scraper should already have that info from the product listing page. ``` [ ... 
{ "variant_name": null, "title": "Klipsch R-120SW Powerful Detailed Home Speaker - Unit", "min_price": "324.00", "price": "324.00", "url": "https://warehouse-theme-metal.myshopify.com/products/klipsch-r-120sw-powerful-detailed-home-speaker-set-of-1", "vendor": "Klipsch" }, ... ] ``` Some products will break into several items, each with a different variant name. We don't know their exact prices from the product listing, just the min price. In the next step, we should be able to parse the actual price from the variant name for those items. ``` [ ... { "variant_name": "Red - $178.00", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": null, "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, { "variant_name": "Black - $178.00", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": null, "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, ... ] ``` Perhaps surprisingly, some products with variants will have the price field set. That's because the shop sells all variants of the product for the same price, so the product listing shows the price as a fixed amount, like *$74.95*, instead of *from $74.95*. ``` [ ... { "variant_name": "Red - $74.95", "title": "JBL Flip 4 Waterproof Portable Bluetooth Speaker", "min_price": "74.95", "price": "74.95", "url": "https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker", "vendor": "JBL" }, ... ] ``` ## Parsing price[​](#parsing-price "Direct link to Parsing price") The items now contain the variant as text, which is good for a start, but we want the price to be in the `price` key. Let's introduce a new function to handle that: ``` def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} ``` First, we split the text into two parts, then we parse the price as a decimal number. This part is similar to what we already do for parsing product listing prices. The function returns a dictionary we can merge with `item`. ## Saving price[​](#saving-price "Direct link to Saving price") Now, if we use our new function, we should finally get a program that can scrape exact prices for all products, even if they have variants. 
The whole code should look like this now: ``` import httpx from bs4 import BeautifulSoup from decimal import Decimal import json import csv from urllib.parse import urljoin def download(url): response = httpx.get(url) response.raise_for_status() html_code = response.text return BeautifulSoup(html_code, "html.parser") def parse_product(product, base_url): title_element = product.select_one(".product-item__title") title = title_element.text.strip() url = urljoin(base_url, title_element["href"]) price_text = ( product .select_one(".price") .contents[-1] .strip() .replace("$", "") .replace(",", "") ) if price_text.startswith("From "): min_price = Decimal(price_text.removeprefix("From ")) price = None else: min_price = Decimal(price_text) price = min_price return {"title": title, "min_price": min_price, "price": price, "url": url} def parse_variant(variant): text = variant.text.strip() name, price_text = text.split(" - ") price = Decimal( price_text .replace("$", "") .replace(",", "") ) return {"variant_name": name, "price": price} def export_json(file, data): def serialize(obj): if isinstance(obj, Decimal): return str(obj) raise TypeError("Object not JSON serializable") json.dump(data, file, default=serialize, indent=2) def export_csv(file, data): fieldnames = list(data[0].keys()) writer = csv.DictWriter(file, fieldnames=fieldnames) writer.writeheader() for row in data: writer.writerow(row) listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales" listing_soup = download(listing_url) data = [] for product in listing_soup.select(".product-item"): item = parse_product(product, listing_url) product_soup = download(item["url"]) vendor = product_soup.select_one(".product-meta__vendor").text.strip() if variants := product_soup.select(".product-form__option.no-js option"): for variant in variants: data.append(item | parse_variant(variant)) else: item["variant_name"] = None data.append(item) with open("products.json", "w") as file: export_json(file, data) with open("products.csv", "w") as file: export_csv(file, data) ``` Let's run the scraper and see if all the items in the data contain prices: ``` [ ... { "variant_name": "Red", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": "178.00", "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, { "variant_name": "Black", "title": "Sony XB-950B1 Extra Bass Wireless Headphones with App Control", "min_price": "128.00", "price": "178.00", "url": "https://warehouse-theme-metal.myshopify.com/products/sony-xb950-extra-bass-wireless-headphones-with-app-control", "vendor": "Sony" }, ... ] ``` Success! We managed to build a Python application for watching prices! Is this the end? Maybe! In the next lesson, we'll use a scraping framework to build the same application, but with less code, faster requests, and better visibility into what's happening while we wait for the program to finish. *** ## Exercises[​](#exercises "Direct link to Exercises") These challenges are here to help you test what you’ve learned in this lesson. Try to resist the urge to peek at the solutions right away. Remember, the best learning happens when you dive in and do it yourself! Real world You're about to touch the real web, which is practical and exciting! But websites change, so some exercises might break. 
If you run into any issues, please leave a comment below or [file a GitHub Issue](https://github.com/apify/apify-docs/issues).

### Build a scraper for watching Python jobs[​](#build-a-scraper-for-watching-python-jobs "Direct link to Build a scraper for watching Python jobs")

You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:

* Tagged as "Database"
* Posted within the last 60 days

For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data:

* Job title
* Company
* URL to the job posting
* Date of posting

Your output should look something like this:

```
{'title': 'Senior Full Stack Developer', 'company': 'Baserow', 'url': 'https://www.python.org/jobs/7705/', 'posted_on': datetime.date(2024, 9, 16)}
{'title': 'Senior Python Engineer', 'company': 'Active Prime', 'url': 'https://www.python.org/jobs/7699/', 'posted_on': datetime.date(2024, 9, 5)}
...
```

You can find everything you need for working with dates and times in Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module, including `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, and `timedelta()`.

Solution

After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually. Note that reading the company name relies on an assumption about the page's markup: it's taken as the text that follows the title link.

```
from pprint import pp

import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from datetime import datetime, date, timedelta

today = date.today()
jobs_url = "https://www.python.org/jobs/type/database/"
response = httpx.get(jobs_url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for job in soup.select(".list-recent-jobs li"):
    link = job.select_one(".listing-company-name a")
    time = job.select_one(".listing-posted time")
    posted_at = datetime.fromisoformat(time["datetime"])
    posted_on = posted_at.date()
    posted_ago = today - posted_on
    if posted_ago <= timedelta(days=60):
        # Company name: assumed to be the text node right after the title link
        company = list(job.select_one(".listing-company-name").stripped_strings)[-1]
        pp({
            "title": link.text.strip(),
            "company": company,
            "url": urljoin(jobs_url, link["href"]),
            "posted_on": posted_on,
        })
```

We've censored out the **User ID** in the image because it is private information which should not be shared with anyone who is not trusted. The same goes for your **Personal API Token**.

Copy the **Personal API Token** and return to your terminal, entering this command:

```
apify login -t YOUR_TOKEN_HERE
```

If you see a log which looks like this, you're in:

```
Success: You are logged in to Apify as YOUR_USERNAME!
```

---

# What's EditThisCookie?

**Learn how to add, delete, and modify different cookies in your browser for testing purposes using the EditThisCookie Chrome extension.**

***

**EditThisCookie** is a Chrome extension to manage your browser's cookies. It can be added through the [Chrome Web Store](https://chromewebstore.google.com/detail/editthiscookie-v3/ojfebgpkimhlhcblbalbfjblapadhbol). After adding it to Chrome, you'll see a button with a delicious cookie icon next to any other Chrome extensions you might have installed. Clicking on it will open a pop-up window with a list of all saved cookies associated with the currently opened page domain.
![EditThisCookie popup](/assets/images/edit-this-cookie-popup-25db9f40d6518c224bc602a2a8d23acf.png) ## Functionalities[​](#functions "Direct link to Functionalities") At the top of the popup, there is a row of buttons. From left to right, here is an explanation for each one: ### Delete all cookies[​](#delete-all-cookies "Direct link to Delete all cookies") Clicking this button will remove all cookies associated with the current domain. For example, if you're logged into your Apify account and delete all the cookies, the website will ask you to log in again. ### Reset[​](#reset "Direct link to Reset") A refresh button. ### Add a new cookie[​](#add-a-new-cookie "Direct link to Add a new cookie") Manually add a new cookie for the current domain. ### Import cookies[​](#import-cookies "Direct link to Import cookies") Allows you to add cookies in bulk. For example, if you have saved some cookies inside your crawler, or someone provided you with some cookies for the purpose of testing a certain website in your browser, they can be imported and automatically applied with this button. ### Export cookies[​](#export-cookies "Direct link to Export cookies") Copies an array of cookies associated with the current domain to the clipboard. The cookies can then be later inspected, added to your crawler, or imported by someone else using EditThisCookie. ### Search[​](#search "Direct link to Search") Allows you to filter through cookies by name. ### Options[​](#options "Direct link to Options") Will open a new browser tab with a bunch of EditThisCookie options. The options page allows you to tweak a few settings such as changing the export format, but you will most likely never need to change anything there. ![EditThisCookie options](/assets/images/edit-this-cookie-options-5d3bf2114475f966187a349bc0af84ec.png) --- # What is Insomnia **Learn about Insomnia, a valuable tool for testing requests and proxies when building scalable web scrapers.** *** Despite its name, the [Insomnia](https://insomnia.rest/download) desktop application has absolutely nothing to do with having a lack of sleep. Rather, it is a tool to build and test APIs. If you've already read about [Postman](https://docs.apify.com/academy/tools/postman.md), you already know what Insomnia can be used for, as they both practically do the same exact things. While Insomnia shares similarities with Postman, such as the ability to send requests with specific headers, cookies, and payloads, it has a few notable differences. One key difference is Insomnia's feature to display the entire request timeline. Insomnia can be downloaded from its [official website](https://insomnia.rest/download), and its features can be read about in the [official documentation](https://docs.insomnia.rest/). ## The Insomnia interface[​](#insomnia-interface "Direct link to The Insomnia interface") After opening the app, you'll first need to create a new request. After creating the request, you'll see an interface that looks like this: ![Insomnia interface](/assets/images/insomnia-interface-8db85779b777d023aee844fcd478b402.jpg) Let's break down the main sections: ### List of requests[​](#list-of-requests "Direct link to List of requests") You can configure multiple requests with a custom payload, headers, cookies, parameters, etc. They are automatically saved in the list of requests until deleted. 
### Address bar[​](#address-bar "Direct link to Address bar") The place where you select the type of request to send (**GET**, **POST**, **PUT**, **DELETE**, etc.), specify the URI of the request and send the request with the **Send** button. ### Request options[​](#request-options "Direct link to Request options") Here, you can add a request payload, specify authorization parameters, add query parameters, and attach headers to the request. ### Response[​](#response "Direct link to Response") Where the response body is displayed after the request has been sent. Like in Postman, the request can be viewed in preview mode, pretty-printed, or in its raw form. This section also has the **Headers** and **Cookies** tabs, which respectively show the request headers and cookies. ## Request timeline[​](#request-timeline "Direct link to Request timeline") The one feature of Insomnia that separates it from Postman is the **Timeline**. ![Request timeline](/assets/images/insomnia-timeline-9700132bcd3fc1ca8145b5ea9a1eb062.jpg) This feature allows you to see information about the request that is not present in the response body. ## Using proxies in Insomnia[​](#using-proxies "Direct link to Using proxies in Insomnia") In order to use a proxy, you need to specify the proxy's parameters in Insomnia's preferences. In preferences, scroll down to the **HTTP Network Proxy** section under the **General** tab and specify the full proxy URL there: ![Configuring a proxy](/assets/images/insomnia-proxy-1cacb438369ed0cb8054b86acb5a716f.png) ## Managing the cookies cache[​](#managing-cookies-cache "Direct link to Managing the cookies cache") Insomnia keeps the cookies for the requests you have already sent before. This might result in you receiving a different response within your scraper from what you're receiving in Insomnia, as a necessary cookie is not present in the request sent by the scraper. To check whether or not some cookies associated with a certain request have been cached, click on the **Cookies** button at the top of the list of requests: ![Click on the \"Cookies\" button](/assets/images/insomnia-cookies-4cf492e7c0821caccb4cc924559a83f9.png) This will bring up the **Manage cookies** window, where all cached cookies can be viewed, edited, or deleted. ![The \"Manage Cookies\" tab](/assets/images/insomnia-manage-cookies-115b3de173313e250cbe1eddfa3665b5.jpg) ## Postman or Insomnia[​](#postman-or-insomnia "Direct link to Postman or Insomnia") The application you choose to use is completely up to your personal preference, and will not affect your development workflow. If viewing timelines of the requests you send is important to you, then you should go with Insomnia; however, if that doesn't matter, choose the one that has the most intuitive interface for you. --- # What is ModHeader? **Discover a super useful Chrome extension called ModHeader, which allows you to modify your browser's HTTP request headers.** *** If you read about [Postman](https://docs.apify.com/academy/tools/postman.md), you might remember that you can use it to modify request headers before sending a request. This is great, but the main problem is that Postman can only make static requests - meaning, it is unable to load JavaScript or any [dynamic content](https://docs.apify.com/academy/concepts/dynamic-pages.md). [ModHeader](https://chrome.google.com/webstore/detail/idgpnmonknjnojddfkpgkljpfnnfcklj) is a Chrome extension which can be used to modify the HTTP headers of the requests you make with your browser. 
This means that, for example, if your scraper, which uses the headless browser Puppeteer, is being blocked due to an improper **User-Agent** header, you can use ModHeader to test the target website and quickly solve the issue.

## The ModHeader interface[​](#interface "Direct link to The ModHeader interface")

After you install the ModHeader extension, you should see it pinned in Chrome's task bar. When you click it, you'll see an interface like this pop up:

![Modheader\'s interface](/assets/images/modheader-086410fa4720e60dcbbdee0b5ea62d4d.jpg)

Here, you can add headers, remove headers, and even save multiple collections of headers that you can toggle between (which are called **Profiles** within the extension itself).

## Use cases[​](#use-cases "Direct link to Use cases")

When scraping dynamic websites, sometimes some specific headers are required to access certain pages. The most commonly required headers are `User-Agent` and `referer`. ModHeader, and other tools like it, make it easy to test requests to these websites right in your browser before writing logic for your scraper.

---

# What is Postman?

**Learn about Postman, a valuable tool for testing requests and proxies when building scalable web scrapers.**

***

[Postman](https://www.postman.com/) is a powerful collaboration platform for API development and testing. For scraping use-cases, it's mainly used to test requests and proxies (such as checking the response body of a raw request, without loading any additional resources such as JavaScript or CSS). This tool can do much more than that, but we will not be discussing all of its capabilities here. Postman allows us to test requests with cookies, headers, and payloads so that we can be entirely sure what the response looks like for a request URL we plan to eventually use in a scraper.

The desktop app can be downloaded from its [official download page](https://www.postman.com/downloads/), or the web app can be used with a signup - no download required. If this is your first time working with a tool like Postman, we recommend checking out their [Getting Started guide](https://learning.postman.com/docs/introduction/overview/).

## Understanding the interface[​](#understanding-the-interface "Direct link to Understanding the interface")

![A basic outline of Postman\'s interface](/assets/images/postman-interface-d0cd1626d8425c1081db491b5625ea06.png)

The following four sections are essential to getting familiar with Postman:

### Tabs[​](#tabs "Direct link to Tabs")

Multiple test endpoints/requests can be opened at one time, each of which will be held within its own tab.

### Address bar[​](#address-bar "Direct link to Address bar")

The section in which you select the type of request to send, the URL of the request, and of course, send the request with the **Send Request** button.

### Request options[​](#request-options "Direct link to Request options")

This is a very useful section where you can view and edit structured query parameters, as well as specify any authorization parameters, headers, or payloads.

### Response[​](#response "Direct link to Response")

After sending a request, the response's body will be found here, along with its cookies and headers. The response body can be viewed in various formats - **Pretty-Print**, **Raw**, or **Preview**.

## Using and testing proxies[​](#using-proxies "Direct link to Using and testing proxies")

In order to use a proxy, the proxy's server and configuration must be provided in the **Proxy** tab in Postman settings.
![Proxy configuration in Postman settings](/assets/images/postman-proxy-d3a16a565dd112c68a1517a861e7fe00.png)

After configuring a proxy, the next request sent will attempt to use it. To switch off the proxy, its details don't need to be deleted. The **Add a custom proxy configuration** option in settings needs to be un-ticked to disable it.

## Managing the cookies cache[​](#managing-cookies "Direct link to Managing the cookies cache")

Postman keeps a cache of the cookies from all previous responses of a certain domain, which can be a blessing, but also a curse. Sometimes, you might notice that a request is going through just fine with Postman, but that your scraper is being blocked. More often than not in these cases, the reason is that the endpoint being reached requires a valid `cookie` header to be present when sending the request, and because of Postman's cache, it is sending a valid cookie within each request's headers, while your scraper is not. Another reason this may happen is that you are sending Postman requests without a proxy (using your local IP address), while your scraper is using a proxy that could potentially be getting blocked.

To check whether any cookies associated with a certain request are cached in Postman, click on the **Cookies** button in any opened request tab:

![Button to view the cached cookies](/assets/images/postman-cookies-button-25f42087846a7a0d29fdf088eeff5756.png)

Clicking on this button opens a **MANAGE COOKIES** window, where a list of all cached cookies per domain can be seen. If we had been previously sending multiple requests to **github.com**, within this window we would be able to find cached cookies associated with github.com. Cookies can also be edited (to update some specific values), or deleted (to send a "clean" request without any cached data) here.

![Managing cookies in Postman with the \"MANAGE COOKIES\" window](/assets/images/postman-manage-cookies-5f057bee3fff880af0026b1caf8ecdf5.png)

### Some alternatives to Postman[​](#alternatives "Direct link to Some alternatives to Postman")

* [Hoppscotch](https://hoppscotch.io/)
* [Insomnia](https://docs.apify.com/academy/tools/insomnia.md)
* [Testfully](https://testfully.io/)

---

# What's Proxyman?

**Learn about Proxyman, a tool for viewing all network requests that are coming through your system. Filter by response type, by a keyword, or by application.**

***

Though the name sounds very similar to [Postman](https://docs.apify.com/academy/tools/postman.md), [**Proxyman**](https://proxyman.io/) is used for a different purpose. Rather than for manually sending and analyzing the responses of requests, Proxyman is a tool for macOS that allows you to view and analyze the HTTP/HTTPS requests that are going through your device. This is done by routing all of your requests through a proxy, which intercepts them and allows you to view data about them. Because it's just a proxy, the HTTP/HTTPS requests going through iOS devices, Android devices, and even iOS simulators can also be viewed with Proxyman.

If you've already gone through the [**Locating and learning** lesson](https://docs.apify.com/academy/api-scraping/general-api-scraping/locating-and-learning.md) in the **API scraping** section, you can think of Proxyman as an advanced Network Tab, where you can see requests that you sometimes can't see in regular browser DevTools.
## The basics[​](#the-basics "Direct link to The basics") Though the application offers a whole lot of advanced features, there are only a few main features you'll be utilizing when using Proxyman for scraper development purposes. Let's open up Proxyman and take a look at some of the basic features: ### Apps[​](#apps "Direct link to Apps") The **Apps** tab allows you to both view all of the applications on your machine which are sending requests, as well as filter requests based on application. ![Apps tab in Proxyman](/assets/images/proxyman-apps-tab-3653fe914c2d03b6f7091d22ee04f804.png) ### Results[​](#results "Direct link to Results") Let's open up Safari and visit **apify.com**, then check back in Proxyman to see all of the requests Safari has made when visiting the website. ![Results in Proxyman](/assets/images/proxyman-results-86853f0d8dcbf9dd1d8c4e6ddf63aa6a.jpg) We can see all of the requests related to us visiting **apify.com**. Then, by clicking a request, we can see a whole lot of information about it. The most important information for you, however, will usually be the request and response **headers** and **body**. ![View a request](/assets/images/proxyman-view-request-1244c56bbe015b469af10732e896e4cf.jpg) ### Filtering[​](#filtering "Direct link to Filtering") Sometimes, there can be hundreds (or even thousands) of requests that appear in the list. Rather than spending your time rooting through all of them, you can use the plethora of filtering methods that Proxyman offers to find exactly what you are looking for. ![Filter requests with the filter options](/assets/images/proxyman-filter-b685c20107702e86a2e70fcdebc2eb2f.png) ## Alternatives[​](#alternatives "Direct link to Alternatives") Since Proxyman is only available for macOS, it's only appropriate to list some alternatives to it that are accessible to our Windows and Linux friends: * [Burp Suite](https://portswigger.net/burp) * [Charles Proxy](https://www.charlesproxy.com/documentation/installation/) * [Fiddler](https://www.telerik.com/fiddler) --- # Quick JavaScript Switcher **Discover a handy tool for disabling JavaScript on a certain page to determine how it should be scraped. Great for detecting SPAs.** *** **Quick JavaScript Switcher** is a Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed. If JavaScript is enabled - clicking the button will switch it off and reload the page. The next click will re-enable JavaScript and refresh the page. This extension is useful for checking whether a certain website will work without JavaScript (and thus could be parsed without using a browser with a plain HTTP request) or not. 
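The same check can also be done programmatically: request the page with a plain HTTP client and see whether the content you're after is already present in the raw HTML. A small sketch reusing the Warehouse store and selector from the Python course earlier in these docs (any URL and selector of your own would work the same way):

```
import httpx
from bs4 import BeautifulSoup

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
products = soup.select(".product-item")

if products:
    print(f"Found {len(products)} products in the initial HTML, no browser needed")
else:
    print("Nothing found in the initial HTML, the page likely relies on JavaScript")
```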
![JavaScript toggled on (enabled)](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABGoAAABGCAMAAACjSIJqAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAJzaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIgogICAgICAgICAgICB4bWxuczp0aWZmPSJodHRwOi8vbnMuYWRvYmUuY29tL3RpZmYvMS4wLyI+CiAgICAgICAgIDxleGlmOlVzZXJDb21tZW50PlNjcmVlbnNob3Q8L2V4aWY6VXNlckNvbW1lbnQ+CiAgICAgICAgIDxleGlmOlBpeGVsWURpbWVuc2lvbj43MDwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4xMTMwPC9leGlmOlBpeGVsWERpbWVuc2lvbj4KICAgICAgICAgPHRpZmY6T3JpZW50YXRpb24+MTwvdGlmZjpPcmllbnRhdGlvbj4KICAgICAgICAgPHRpZmY6UmVzb2x1dGlvblVuaXQ+MjwvdGlmZjpSZXNvbHV0aW9uVW5pdD4KICAgICAgPC9yZGY6RGVzY3JpcHRpb24+CiAgIDwvcmRmOlJERj4KPC94OnhtcG1ldGE+CqgXyU0AAAAJcEhZcwAAFiUAABYlAUlSJPAAAAL9UExURTU2OjQ2O8U0Lfs7L6qqqiwiJL29viAhJP87MP8mAD4+QVQnJu05L/z8/J0vKk8mJtY2LkAkJiUhJDM0OAdorCEhI7EyLLgzLN43Lv07MHwrKfY6MDc3Oz82OAZqsV5eYfQ6MCIhJTgkJS8iJF0oJ0VFSEQlJigiJEgmJqgxK2VlaDYjJeM3LllaXUkjIVkjIXYqKFNUVv////olAzEyNaMwK28pKFxcXzskJek4L1ZXWTIjJesnDJAuKvI5L01NUNs2Lmlpa9E1Lvn5+WYpKPYmBMgmEkIiIcw0LVEjIEhJTDo7PWJiZIktKeSlSI6Ljc4mEcMlE4QsKikqLW5ucNve5Hd3edUlD3Fwcghlpy4WFfo7LxRbkdPW2owkHJWVllFRVD0gIXcjHk0qKv47MOyrS/K0UzQhIm4jHsEzLS4uMLonGYSDg399fpYkGlQuLuqxVJMuKurs8SFMbxlUg6GhotsmDzcaGFkzMeqoROWhO/zFZTM4LShEXJ8lGbQlFl4jIN4mDI8vKuWXLU40NkE0MZgvK+Xn6oIwLCRIZtjb4HFCIT4lJg5hnYIkHeMmC4hOH9JrBvHy83kyLv3NcipAVaclGGUjH+SqT8SKOhoSErAtIjE5QvAmBiw+UKavpKTVoh1QeK0kF9t3CUlDPs2VRLS0tYmJi/m7Wr40LCYPD1ZLQcNlDsIrHCUnKS88Sl01NN2hSInEhq97OtkqFOgmCKCam5xlKtWaQ1GbS7szLGszMbRUC+OLF2FRUNArGEs9M9aCHJTOkt/f4LGOUP/ZgAdvuWRONe27ZMOwsGA+J1NfVMd8JEp1Srh0JGVbR/oxGaikpn4jHYJdMyuUJzd5M5dxO3BcWs3MzJxNEqatrdaQLe/Hc9OrY84qGVloWTVaNoBpRLJhFU6HSqSCTVivVZJ6UF09PcPlw3VlZLXjtLWcY5yRkEZZRkVnRDNuMGyhanbDdNQqFvjWis/0z927creqrKClnfbQgaQuJp2rnr6nc7ukcTU2LTQ0L07vMukAAB4NSURBVHja7Z0HXFNZvsdBASEwgAmKIBADhKYBMgiRgAYQBKkKDNKkiIBS7KK4T0As2LA3VByxK3ZFndHRsaw6jmXVHXVmnNnpvWx/+/a9z37e/5x7k9yE3OQGh6bnN59BQnLuvcacL79/OeeamBARERER9Xq5/LXehbwLREREnax6K6tFF8nbQERE1Ln6qxXo2HzyRhAREXVqAFUfg2BTTmDzsmnJ8h1XwvE/rrZiwq/sWL6EvENEXcwa1x3483fsJHkvXh651p+1MqCz9a7kfSLqWp08hj97f91IEsQvCWhsY6w4KMaWwIaoe2AT/j8kjnoJtDEcg+TYyItLdLLEdcnFkccwjMI3kndLqX5EXSP/LdQvwnGTPMib0cMkEh1PKxvO2dLggLhquQHH4rq8Cr1uBzE2BDVdLY+8NgyboPFTzMm70QOBc7yME2k+5exWsPv5lLCGoKbrVVcbRNGmfFIpeTd6IG3SBAZJg9LB5Rzx4VqO0sOENQQ13WFtHo2jc4bj8vzJ29HzpB82pq5X4J+unnP10bSe+BqCmm5T4ZoJNG3a9kwh5qbHORu9YRTK0yw3ZoItR/kawhmCmm6jzZggGjcTaraSPHEPMzZ6si/GeBpKyNeQOlSnoYYfmJSUFN0Dp1BIalKSp3dPuBLzKbXNVircEHfTk3ScLYhyhTxvubFTDPI14SSE6izUmNtbWlo69u95H6I+fS0tbYf2lKsZm6cyN1Zra9eM5ZNZ3kOCKJbKt21HqIH4ZEtIQ1Cj64Pmvnr16i6qR3vXTRunwk3QuJop/iIy03sqa1xjOhQLQdQVQ2wNQQ3b1Tt3Xe8L4GaMutE9ZlzNJH/ib7qbNQLdaZeqjkyyKqMTPAQ1BDWd9tH2f7RnkXr1TFBVbV6dO5nx3Ziv0TFZzhpZfWJUoc4S1LxiqPEMDAy08+6BqMEK25pX3sZYrtc8bsuauhFk2neL2tehlnQ0EEKBF9lT4hVDDferd+6uxQOldXm1azXWB4/bkzfF35tM/i5WmQ5zckyjQS/hsw/+CPr1s5+lQj8/oVScwDLLjum1Qy4uLvSfJoIXntAvtHnATtOOjeNJKlPOmZmdS6mU8Ahqeg1qqPTN2Ek1Y5o1diQIH7NlzVZ/snrqNxOPZ2S6Btr3RjI7hxM+u/fW119//daX9355PzEyMlESn+CSwEvQMdlHGmzjc0noGCPiD4JSsNB3B/0EHTsOZszOFR1CjbjSjKFKMUFNb0INpaF1a7ZoAQeCqtppkwIKSd74hVVRYWQIBWsSNHaJ/uzen6/t3r372uk/f3nPLzM90k/K4ynkCh2/1y9aWV3RZwr2tT548KB1H4iXYJyFSXxdUw3CDvmaEzsRbWa3dMTRZJhpKZPXnagRWTsvG+joO9LZRzlHPNxHjBihERWIhsJPwkSsI5Ty8EwNntrH0TY6TNsKeNoejoobuCxQ/QQ6CzrJ0BDbgSG6Wwr7O9s6OgarnvR2H+ETAVef5K
N1eYwLtZvpGGcf55jqxby0MPrvI7JOCh7oO/O3axQMC5g0rbwqSGvbpTaEnDricjqei7GwSDPwEi1bE66ZcRF88Na1v//9u+++++HaWwg16ZGSeIVfeqKC56IjyxPOPlcffHPjx//73x9vfPXV57du3WrlGWMthKOZ2rChoUHeAdbQdia3xcRoW8PLN2unfF63ocbcOSrHkpJTnDMFgtRQeLSMSYvoYfCTVA/WETQKnAc50c8MsWWmTPnqJ0JnKotMAwcjfyIKQU8FQ0Vnao6TE1WBsotwcgq16+fjO4oaY0OPCQy1VEmnsxGF2PdVXtpUBlBQcSuaLwpUPjss8DfFgKiw7lFNeVW73d9iqsZvyQPmkNUNRsrPwsLPOFsD771GVviPX+/++w+7r+0G1Hzwp0iIoGRyYXpmpKQ9a1zh34ltru578OM/nv/rb3/71+3bt288fvzkm89bjXA2wtdfGz36NcwZ/LVhVkOBMayh0DJ7Nv4+d47RkFLMNdOhuYpuQo1XlJMlQ328sJlA1iHOhzHIFia9jTX7CCy7uL6MJyKsVQ1v7odzGE8Mop/AqPGwxThBqFEXu+0GwdnsrO3VJ4qzE3FATRg6pEo5M8M0UNPfUf2s07LOaEseCh5nz5i29rtLBq0dV1uTN6XOP4xghIvMLCzMDOFdc67Am2yihZrvdv/lL6d3n0ao8fNLTPQD4KRHyhQJ2ulZ7aEMV/D5j//42++RNm36/SdfPblxav83DxI4B1AINa9hAWvg/1kNr1+P586azTuxqWlcgZhjmtu401hPo5M0wBpet6BmaB+KDqNsaEoc9lJSIMdT3RtrHjXZ0nLgaj0j0ESnnhlsQ3Mlwo5+YnUf6idOkzXohFFja2PJgprUiMlMcNkZRo2SaDnKK/AdwUCNcx8m73ICO69sJCoMmJS3Z/wiXXvcBrVNGL9n2popAf6lpPWYTQUWoAKjilC6UHNNiRo/P79IsDRCYWRmpJTHGTUuD27cRpj53e82AWqef/LJtpKS/U9uuRiPGkyb0RsaXnt9Io+ro9k8GxHGZKHbQvyw5ebsF4+eWGOozkeNCP+ijwtx5/cTeY3EQU4ScvshKF5SmwIqfgrx1jMCnvEdAgAKhvSNaHUIDrFsqT43Ph4SOtPO29wuddhk1RMINY6YNDYRI4dqoyYnZ7LlsJnWHkOtg20m0zDqHxLojK7E1zkw0Ktd8lWEPc0wW88wyNhQCBsZpkbNKPg/KinaOiQ4FB0vzqvzp4yH/1awOeOrWPbVbq4aU74FsFM3tpCkdZhKQahJ0f0cX/lWHdePmreufQdJ4WuAml8RasDT+MmEfpFCRTtXwoqafd98sglAc+jQ72iVVC9Y0PSklTNqGKSBb0dvuH79eoMFV9bkIjcD6eCb1J8LHbJ2GZWtyTBjVWY3oAZHSrbKzlcfFLEMRE7AA7kYRgQVDBSx99E3Ar6PQF6B/iB4B6sjriQbVQREO5wh1krUwPBhzqv5Kh6oUcPwSz4oZBviyTdQgXIOZYZtOPwaHK0+NMCGHueJDj/YswvnjndswJQ1NbVjqprZNvAPCsfcyZsEUVbhK57bEVhYzJ1rYSHQ+WRpzLQwHRGUNi9+/fL0Dz9ACWr36S9/eR/iJ6FEAqDxk8W3n+msqGm98XzTJzfO7C8Byhw6lLx//4UFixcvbnr8LmdXM5rBGvA1G2Y1vDErn1vJe4XDLvxny02q9GQ652Yj/MQUiQtyxAwXkwnlN2kmw+WIux41yK1EqYmSaqN8iDIo6ggKk4fKbrCOwE5I/QyuFOEgJcweBg9TBlPUPMdPUImVwz4iRpTDRI0adfgxdcnsqMEXyUgQeSL/QwV9FGqGhHgzoZTaLROa7z5266Q8vdTBhmfthPG1e2ry1kzaGuBfGPZqlc8zLCxkMguLDN1OEbK4NfhTMlwPLwRQ7D597TToz/d+lguFMplcKolMTywQJ7RrkmFDDe/zTzY9/+rOqjMlAJpDJU3rVp1HrPn4zC1uzXgC4Ws0akZTrgZQM6thw0ecULMzN+sEdjNzbuaa4gjqROPNRrq35sQuwwdQ99NktK98V3Y9apDhYCw7QrPTxk4FBFUEhShCg4d1RL+ZNho5YgQSfADteT1wCBUNUaiJ8NEoEzFQ4xTirZGV7osdCjtq8HmYzyArRvknCjWqzA3Qz1KJzu5UGHidR4Cd8RPWGrp1URCwZ0x5LdieNeB7AvxjS71fXtQcsDiQlgZfdH988duxpVArWaPNC/HP994CffnlvT+9L1UopDKhXOIXKZMWKMTaNSgW1OBMzfMnnhg1JYeKmx6PXbW/GGzN3cccV0AIUTMNlaahy90INQ85dR1vdmg8gc1Li1vWZsrFbG686ZY7e+HmhbNbDGeIeSqsSNQ/lKh+yOty1ITY+vpaqz60IgwOPDn5fZwY4EBZmKle+kdgHIxSp5I9RvTvjxtx3s5hmhoKCdiw4ACKUXXWQg2zAmYXocQVO2rQeQbZaRfoqZeiQ9sw0twof+zboxYw8Uv9A8DuTNuCwNNsxemmaeGLJowbX75nSw0G0NaAsf6xI8J6L4Li4+PT0zPy8yF2gga+Coii8vMz0tPhxxqxKM3ePbFp+njxs+KXXz749QNYl/C+PJ7Hi5dDBJWYCFGUrEB7iYJO1LhApuYfmzbdvnH+zP4Z4GlKii+cefzkVDGyNeff5eJLykxkz643bBitbuF7DUVQGz4Sx3MoYs12a6RczQoHN2WZ+0RLlttNN4esORxqURJtT6OZv5F0d7cwsiyMBMtgenJiE6A73mCMwKnkCE9tr48HD2ScLiw6MBBvAohQE8FggxZqghmLp0WIfDhDxIoafB6NheG4bEYBBR36bcY1oFMP7MEN0yKP2LF1WyetyavZUz4G0BNkZYSCmtuAQWPGl9diCiEMTdlaByACEpWa91QUpVswBMsSeMzH6Qwmq/6etfP18KJALvnT+0hSaBF2cYFOYYkM4ihhIlqiwMXVJDz4cROkhGckn0oungGkSU7etm3GjG2Amnfu3zJsaxRl27Ozs0tXL229/PT7yoqUlIrKzKdPn17GylQYHD/HLYuKknZlOdC2Bv4/saKlZfYuLq5IGT/l665KVXYrajysUQikBAfOtdSHqSASatfPwAivCFxEPhw4QqOKi0MxnZxC832qFytqNKrRKMuMr5kVNfg8IXxtn0OdAB2aiS5k03rV2gx+aSy4Hgi3ptXsqR0/ZsKi8Birjioopjm8bW3VhHEIR4hHW2pqpgGS8tY8mgRUmrJ1a11dQMBYQJN/bGyhe2mph7m5N5/f2bV5GY0Vs/x8RBZR+k8/PXyP0tPYWLiYsWMDAgLqtjL+JoxbrWvzQipBuWAkeTx4EBewNTL0CNWgeFxQg8pPKBtcUjIDaVty8m1Az7am+3ffucshMVzgAagpnEcJoJON/kR/C7uiIq9vZQbH57o5LKQAA9/l7mS09HFTiu5iUyb945TuQY03LCY4bD+K7pejwdHPMUfdADNEVWbSN8KZ7npxivANUdfJcS9MCJ8FNcwoRgs1GhWik
TZ0RMWKGnQeJ82qErpuKgxDh2bmZnodanSWtQA/4H2mgPmZVrOltnz8uAmL2po7jiAOkAqKiWlubg4Pb2tbC1pUVVU1AWkcrTEqjdfSGIao1+KBE+AIVYsWoYO1tYX/ARQTExMUxN3EHbvI6moS/SAZLJMIJQqEGoUskUYNrIZy0Ujr6kZN61e3gTQlYGZmzADaFMOfz58Das7fv3p3nWHUKMy3A11iY2O9fIAu6L/pRUU+PkXTp8+fP73IcME718GN6hKGKjeEUMqik7IAZZA652imSDV/LNVpdroGNfzAqUOYzb8a4KAKwu4wYLI6R8s6QsUaZG7ikmhzgxc56CwsI9QwrYYmagZba+d8ccKHFTXoPDYaY3D2yF6FGkab0MuBGlYH5DEUGBSAIAQUQhjaAxwaA5N70drwTkVR1+sKG2ogEZwok0DhSSKUi3k8cQHKDUskEnmBWAB35i0zmKtpvVFyaMb+pgvFxYg1Jejr7dszkvffvcsJNVJvcDXzjoAKi+zAyQBjplvPm1foNX3+9Ol2Qi6oyaXpkuuW5TZnp4ap2bnZ0HiW/K8qW9z1qLFmtNBOjhqozrz0G4pelhRG+wV7VY6WfUQ/9aIl3MXiiwvN9Tba2OCGGhsjUYN8T4RdO9REqVAz8lVBjcFEkLdHqXthLApI6rZunQJAegREyps2raZmC1CptrwcuRDkO8BxrG0LD29uRlajiyGCHBSyOX8Ix/YJeyfslhivOXuRvQIVL5UXSGUyqVQIyJHLpeKyMjHK25QN2L59+4Djhl3NjRklF86c2Q9JmuRTFy6cAtTMSE6url58de9RDqgpQKgZ8f3cfx5cMq/IB5HG2if26U8/fb8EbE1RK4dcjTJFA9kaN+DOZqWnQSmbll09EzV8zZ2/PRiosR5EG5Mox2Uh/fnRwxhTXFV2QkhQrb3UNwIdPHrgMNXKJXwSlDge0gHUaCaHELAi9AZQ6DyD7NqdgEoGE9T8Np8kvre5uUdYaelQ9xGFhYUQHqDsA8qi0AowLNVr8Uh/dIxYONYI96GlpWHu5957z6KijPLDaVCEspjbboW3inlVG9l54QJGJj6+AOpNCnkikix+uAnaRGJ4QkFipFyjCKUTNYLWr04VI9QkL0hObjp/vil5xjbI11Qv+JgbauL527OP/BPVnTZUFsZC4DTf/1uL1+Dhs2/B1yx1MVyByqIrTxBCZcGDrDlKJ2O6ApZFGeo17p4ACudr1bPTXY0aD9xvYq/acEEDHOgBYsTqCHUtSv8I+peml7OjjZNqZRMuTHmKjEaNJp5QWlh/rgadZ5TmlUxVJYMJanqJ4cJ4oYvb8Ra46q0tOgJcO0WkL7frAhthiaWJQoVYIZVL5DK/RLkk3U8hSMg8MGuDxcFIw2nhz/dXn4IG4VOnqqvPrFp1/hSkbYqrL9w/f5QTangDwNXsw3CZuKSwf6z1fP9zuOD93rfTTxZdjjc0fmGWg3IlArCm0S3LwS0LmmpAs3PdWnpqWlgLNXhpAZXlxUsGGFNOAxzecU54fROKXVSFIr0jGL/+PFFKFoc82AaFeBuNGs0Ur6OyWsWKGnweDaSJolRFNIKaXqMDqq1qYNMaXV18uOmobZJIcxGUDl6IocAtL1AoCqBvT5KekZ6ZkSkTizMsGhqeHZhrEDW8x03V1ckoegLUQKPwqWKEmqbz675Yv54LarZvzy7Nzr5cWfF0dX+oRBX6Ty/6/tw/zz0tmg954Vaxwf2wGpGtUaZnduW6uUF2GIIq6KzJWsGhAlWpc2kl71znFrsxalJVs9NHvYoJzThmfiNQI7WKKtkwvR0HM+a2/hHMeY5yJ04w9bGJSmIUuz2jQkP7eBlEjUaxWxRnsK8Gn0ejQK5eAkFQ02uE1j8pv5+rc4MsQE34GmSq9bbwQQilkMhRmzBkagoKhJkZ6ZGRGRnpcilUpiofHjCIGpfWMwuqkwE2C5JP7W9qurBt26lTEErd/+Loei6uxmTAkWwoQEGde15sYfaRI9mF/kX+/j7+RSfnzz85fx+PS7IGFaFMqRjKdEVulhuSQ2MLp76a7mnho9rwVRPNU71qGzXGMTvZZmokea1xGINApc6bsI/wjAsNfdtL66zABFHcZI0FCzhZi02VAdQw6+C48RdTgxU1+Dwa+Fim7kIkqOktKsBBE2/uXB5uGNaxk0R4cx71C6VMP2rEBTKJQgHtNBK5tECWCXtIJGYcPJefIR6wPSEzxfDChHc/b1pQjWADX4tLSraBp6lecAFSNesvrTLcwif33n5kHkrSFPn0Lz2y9Psl2ajQPX3+yZPzT347falh1OxyAGUtVLLGxHTzitmzZ69YeIJbf42BhQmdtAaKb2/J7IvBMRCerN5aDbbmaAW1GjWiqVBrSq0PZcx6PSMQkmzsmOHLZCpwwis3o1WNNXhVJD69AdQwFjOIUJMPlYhhX5iA8DGYEUHhAJDGIkFNbxFaapmWgVr5MtJ0L7l8pPzHH64/gOLFS1HlSQHOpqAAllpGJqZnHDxw/VmlGLYz5xlemGBy68nHC7CqkZKri4urFyx+Z+/eox/e4bCwOt48+0g2dNUU+cw7cnni6OvfZ0PodPLkyW9PTp++5DKHnSSwrWncrGyo0dycz5RzBAW+htd1yy1R12yoKgXztmoXB7wUSN2vL8Ib4jHCIdSqctg+x9JJDQr2ETjGYTAAxWmYDiPQym61E8JLpawN52oYW1VQpoZapM2OGp8IDaJipOYwNpEgqOkNyoetPs3otmHY9jNfTwpZf8uvwAXSwhK8/kkqBdRE+kWCszn4bKLZwQrZ8AGGXY2p663zH8PqysUfN+1HwLnQdKYJSLN+/aV1SzmsmEyIF7e6Q1tNdvbqlIZZbzTMSllaOA8V7Xz2XVYYXgVlarKr0QGxZqEJs3sP17ohk2MYNe03kTjXBZtIoLSJchLy0SxTpnnRtB+sTNmGBY+y1EQNSiDDBlXMTSP0jEAHHhStTJZ4o4fUwCTG6anmYWqWG0JNTlKYsrg0GRkWvirXq3tR9rJRTNY4hzIS2AQ1vUX06oSKCvob9lceN7S6wCVBjOY0LLWUS4UYNYlinuzgwwNmlXKeYdTAAW6dv3v16tX75+8j5Nxdt+rDq++sP3rpwzsCTptIuKRPfAj3ZXk4cdYbSLPwo4qHP4k5bhm6wg2xJgtvxqdGC/p2xUIOwzO7Y2ss/Pve0t7ZJ8w6ler0pWyBCLsFG7QDXj+8o94QJ83UKr1Vr7rPVt8IL+Q3Bs/E7BB5HkbLtqnThCG/Y9nH2Ytvbo0LU/TmwoZQY+l0ONq8n7c1taGeMohDvmpIcGB0/3ZrHdxRaGbZJ2SoqJ+5NUrUqOM2gppeonjMl7lQ7o6fi7+NZ31pmd5tzJGvAWOTIBCAq4G9amBBgl+6MD5tgMCvoiIjI4PTNuYud9at+3D9+i/uX31n8eL766asOwqJmg/vLOW0DYSLQA5eBraNeGMlIs3KlW+gBw2j3+S6iblpC1ScgDaM7j2MHNOWOZyyNd2y4ecyG42VBMoNo6gtXYAPgyJwI8yoENwJ
bKMqbeOeGNj/jlFC1jMiFZ/FKTTOsU9oXyYdPOm2vyFUt41yiyoDxe4odKa+NvTW46prWjaKdW9heuvhycMiqEFO9WH9CGp6ldD67gP0Ou70A5pruvXFT9o3Z8GkGT58OPwPtiZerEC2BnYylw8fwD9eJq+Yy/HmLEvHrjq6d+/ed0AAmUtH4cuqpcO5ocZEsnIiaKVKE9Hjhv/mfOM5U5SuQcZmzkL1thE7F+bC9hI9dhtzfvAojZscqAIieudxSvbR3jjWUe8Ug1O46n3KDYzwHqi5NkpFB5FnBOMmC6HOHv24oMaTeTj1CnBqATl7Gx9jp/KRGndMIKjpBYJdhVNUG30KUlh3GG53cxbtW85h0ggwbsDbJIhhoTdsLZwBjiZRKqx4aKF+ob5bzgn+858759fvRVqPdWndHY6kQaiZqCUjUWOyk2KNA6ztngN7Yu2CXbHm5Cp3l+ipN2fx7KO6MYlNPWPOmY+0Ud2fCWa9B57e6k2pcNpXc1LrGcFPUi9KsBzEvNWSe7By0Cj1/aEMoMaanxQ6WXksRl+OdVRf1vtA+TDwFKd1HyiCml6gA2Ya5e0CswNsr9Sa71o30sWsEVCoQa/kyROFMj+h8OAzvOvmMwZq9N5I11TgemvdF0e/+OLSpUtHjx69tG7+fwQc793tYiJDUROyMzhXsxIHUitn/VtgxGYQs7No2KAWviwIp9xu5nLfzbx7bjkn8pzpGDUsoo+v9n0kfVJ946IOB0dTM9fbM2lZqp23/tQP+wjzEN+4YTY29o62gVp3PAoLsXV82zc12p3LticUaoBjzr597A/bhmgei+8VHRjCdo1eqW/bDxsWdTjJh9zOtvdJkK7Vs5eWzuNkakyWw4YSJrpYg2QiMIlPhGZhudjvwEq4bcGzcxXqlx2zslqub4dgk3fvrAJRX5YKuJIGuYo3af0bf/0I6c03K7gfAO3yCUu8HdRyQ1vwcUcVL7Nbb6TbC6REDRERy29O7fkKGZcYVxZWmAiGl7kkpuRD4UksRzdqkRWo1yC5xrTL8mgfwNV1qevSpa5IRmFC8F86ZWKc6D5hvC6hcc5C43bIMhFXMkFTybYggqCGiIhD+QnpLJs5wahJK0uca1Yh1VFmBjt0lkuKVt3a92JyMe4IqNK9CzI0uY25uS0rdhm5Fx92NpLKFAik8lMqJeyNgwQ1RERcwidQPewq0ZGpX2VlVW/S4yXo5OMT1BAR6dBxHZMFBUIbjZ9jG1kDr1dKBDVERBwSNVi20B9jNDRcoR/HlpCGoIaISAdphrNSo9zIGWZa3hE+EdQQ1BC9uqTBsZCxaRdI8HQk6iKoIaghevnzNOwZUmjj09sio6v6pKd9j6CGoIbo1VWavsTLFeN8DfI0n5Lw6ZVGTT8RiEwronafizL9SV5orrEq5wgP13J0exdCmlccNUREuiyNofYS108BH+Gcsi8bw4mnIaghItLhaNI49LG57sB3iFpuACGuy6vQ63YQ0hDUEBFpcOZ4GdeePORWrGKOjby4RCdIXJdcHHkM30sK3I8pgQxBDRERnbaDW2wPN6Yrz5bTbcljbImlISIiehG51p81BJqz9QQ0REREL6wly3dcCdfpbmLCr+xYvoS8Q0REREREREREREREREREREREPVT/DwgRw9zKJMOxAAAAAElFTkSuQmCC) ![JavaScript toggled off (disabled)](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABGoAAABGCAMAAACjSIJqAAAABGdBTUEAALGPC/xhBQAAAAFzUkdCAK7OHOkAAAJzaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8eDp4bXBtZXRhIHhtbG5zOng9ImFkb2JlOm5zOm1ldGEvIiB4OnhtcHRrPSJYTVAgQ29yZSA2LjAuMCI+CiAgIDxyZGY6UkRGIHhtbG5zOnJkZj0iaHR0cDovL3d3dy53My5vcmcvMTk5OS8wMi8yMi1yZGYtc3ludGF4LW5zIyI+CiAgICAgIDxyZGY6RGVzY3JpcHRpb24gcmRmOmFib3V0PSIiCiAgICAgICAgICAgIHhtbG5zOmV4aWY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vZXhpZi8xLjAvIgogICAgICAgICAgICB4bWxuczp0aWZmPSJodHRwOi8vbnMuYWRvYmUuY29tL3RpZmYvMS4wLyI+CiAgICAgICAgIDxleGlmOlVzZXJDb21tZW50PlNjcmVlbnNob3Q8L2V4aWY6VXNlckNvbW1lbnQ+CiAgICAgICAgIDxleGlmOlBpeGVsWURpbWVuc2lvbj43MDwvZXhpZjpQaXhlbFlEaW1lbnNpb24+CiAgICAgICAgIDxleGlmOlBpeGVsWERpbWVuc2lvbj4xMTMwPC9leGlmOlBpeGVsWERpbWVuc2lvbj4KICAgICAgICAgPHRpZmY6T3JpZW50YXRpb24+MTwvdGlmZjpPcmllbnRhdGlvbj4KICAgICAgICAgPHRpZmY6UmVzb2x1dGlvblVuaXQ+MjwvdGlmZjpSZXNvbHV0aW9uVW5pdD4KICAgICAgPC9yZGY6RGVzY3JpcHRpb24+CiAgIDwvcmRmOlJERj4KPC94OnhtcG1ldGE+CqgXyU0AAAAJcEhZcwAAFiUAABYlAUlSJPAAAAL3UExURTU2OiciJL29vttFN6qqqv07MDU2OyAhJP87MP8mACEhJC8jJaEuGyUhIzQjJfI5L/v7+z0+QTo2OO85L4gtKSoiJfs7L11eYXwrKf47MPU6MC4iJCMhJOo4L3AiFgdnqmhpaykqLdw3Lv///2RlZ9M2Lm5vcfg6MP0nADkkJsc0LJYuKr8kBUgmJltbXlYnJyshIp4kDO8mAAZrsjUiIbwkBkAlJvMmAEEhHqYwK3cjFU1OUFY0Mp4vK/wmAOkmAE4nJ3UqKJckDlFTVl4jGtclAmMpKIAjE1ZWWOYnAo8jEPYmAE8iHGFhY6ckCzshH0hIS3kyJ6wxLMslBE02NIItKEU1NLYtErIyLPklAFYiG101MOaoSsUlBDQ1OVhZXDMzNrkzLGojF8ArDkciHouMjq0kCfgnADo6Pf4mAJWVldwqBtnc4fC0VOupRQZpr2QiGENER4kjEdTW2uU4L28qKGYpKKGio+IlAWoyLNwkAsMzLS0+T9AkA1woJy8vMtGYRRpTgevt8J6en800LeKiRilCV9MnBZ4uG3V1d48tKt3g5dg2LuagOX9/f7IjCbckBygVFOuuUOXn6yZGYDE5RDAXFfzFZTM4LWooKeKVLh9OdCJKawtjooSDhPT191AsLhFdlxVZjaouF1dJQqQuGTgaGMWLO2JTTv3Ncqavo20/Ir4zLRwPDrcmKLS0tZgvHZEvIElCPvm7Wnp4etE1LvO6ttJnAdt4CEorK/3YguFybr9nEq17O1YtLsxzFZ1oLLh4KeOMF4tVJWNONJKGhuNpXQdvudeCHa9ZEu28Zd1ENXlKIpNwPW5hR3FgXvN2dP2Zlc
G2texeTYZHGZtPE8yIKb4+NKaKVLSNTf3AvNyqVJN6ULyhZtONge3HdNexaWg+QD9PUk1iZcYYHblTBoVhNZwNEXdqavaJhfaUjv6qptHOzqufoIQWG6OTk8fHx7eqrJ1COL97cLBKTeMvLX1CQ6kdJNRXUN9TOpscIc9NNu2QjJtg20QAACDUSURBVHja7Z13XBTX2seBwMIuEbMLiFICKFJUVlyQIoggoKACi5RFkSbgAgYQBQUlUpSmiAUwrw1BQDRqLNFEE3s0Mb4a441JjMm9qTc9N7m9vX+8zzkzuzuzbJk1UjTn97nBWZYzMxc4X35POWeMjIgeqUSVIeSbQERENNhKi5ElJJFvAxER0eCamh4+X9JTSWBDREQ0uKhZHsoviOkrJ7AhIiIaTNYkyCo8ATY9CSEi8t0gIiIaNKUk58SEQhglS65MId8NIiKiwYONTU4FwGZR4wlibYiIiAbV2ciWB/D5BbNkJyuTCG2e2Gg5pDL55Immp1lqOnEyuZL8iRkoc6JBke2KaYkxnny+Z1TcjMhAY/INGckSCIxdXbMM5kxCk6yxQiKHnzJTnnJJRaOsiRhagpoh+/3NXTk9SgK/e6FRy+ZG5jqQ78hIl2u1IaCxkS2X87VKvlxmQ2BDUDNU8omcnb42AP2lW7t15dIF8Y7kWzLC/zxw9TYpCTLwrAWSqBxZTx87gOrrkeVESQr4ntDMSeoCBDVD9rvrt3RaOvY2fEn6tBmRuX4C8k0Z2dYmg8OsCWmC9ikwLicT0pJEat5FJEpKSziJLE/o8iaySoWgZmhpMyVqVigKpdZum/0swc0I/3npDaNElTIJcKQnQXvGX5SUgJo5JbJKEkQR1AwpbVbM25ZYEYCD+Khts2dELogfRb4tI9bY6CNNTgBfkqMnOuKlJORI+AE5hDUENUOsUQtmxKU3xuBM4qK1U6bNWLGY2JsRKmOdGZvKHDk/pi9NL0JEaX0xfHlOJYEMQc2Qyy9yblx6VMx8hJuCWYnL5s1YsSDQgfBm5NlQHawJkc3nV5zgtMYt6UQFP0BG8jUENcMUS62MS3++IqAA8Sa0Ij1uHvY3JJ56TFiT0iThx5zguJo26UQMX9JE6lAENcP1e+yweO606YlRs7C94YfGJG6dPWNpJDE4jwFrRAnL+ZI+zuv2k/ok/OUJJF1DUDOc+YDcpSvjpjy/luYNXxI1BRucBYHE4YyIn4+28Ck0NCeN+/xKywkNJSHU4KDGONnGxmbOCOzED4f7Chu0JjrjsUGTypwm26of65bDgmfnxU1JjIoJoLrbPWc1IuCAw8mN9yEWZ1ilsQ4lsokxzKUgFxRjQ2zNYKDG0trCwmLhCGzDHz/OwsLGdpBO7r/RDP5vWwTZqh1zkQ/4m2nb0lUGhy9Zm7512kpCnOFVtUZT4ynvGZB7yS+NsHNzs4sozdeQ2+mRe+qzNTye4igj49dOaN6vGdzxkKOF4ta8DaamG/JaxUKCGm6oEXj5+/sbbsmsysZZKPDCPOauUfGRM+bFTU9sXDuLyhgjixNFE2dxbqAfQc6jllCo53dBw7QHj7I8gTWz84X5vHyp2M3FxU0szccv9QzRpAxR/sM5H2keW24ZvIcnVEf7Qw0ubTVlqLWUoIYLavDdzzE4WZJsj+nigfDCPDY8gYOAgxxOjGQRbXEKJGsTp8fNnvssIAcSOT5kIdUjUmenwSGUqEleIGPlhIXB0cFCYYSbO8gtQohfqmWGZQXyJt0cEZ7qPfTOod5TIGG+YYCIncDW5tiH8jVVHehke/cYGe6qhM2manIXEtQMGmpSxwBbwi0F6scPJ8f4xUvnzl42HRNHXqDYnAAhZxlyOSuwzXEg2eNfl4sxMXHV8yVZA+MnvuQkAxs8YXCsu1uwNNMFocYlUxrs5h4bLGRNdtFJCV9nBCU89Mn934Pu379/+9q1a71CQ1hh98wz+5/B/+0H1e1va4t+CNZ07MVjDuwxfKyw0HSACoW/bdTsCg8P93YcHNRMArwctRx4/OsyzYGIOHHT0xOjKmYpTQ6fPz8mCjFn3gzscwh0HlJuJiZuhtqaykZ+VAKLNGKXZne7aBQ/oQgq2s692UWsxpqEKH6jjpbhU+/c/+zT10A3b9786Ouvb3/yTq8BzsZuwjOU9lMf2+ragg3hBfWlYGfQcfcBgyEVvMZUg9YE/6ZRw/3uDUfNREgEz7QdePwo5Oi3YMWMldOWIeSAy5lfoGTOollrn0/fFjd7JQRXKxB14v18CHW4ytTExFRf5k59tiRX8BmVbsjR2Lm4I8bExropPsIn7ILzmenZtBx+RbIO0vweg+a11w4ffu2zT25/9HHNJ4fyOYNCiRoaOHVtE56TcmdNQwc2Nd3taAjvwL4OQz2NRtIAa4QENYOHmsm2A48fpUbFI+TMjts2JfH5qApgDmP3t0WSCnA6WxnUAew4GJNMsnZFmIAi9HyRehEKxUKMVI0wAoEFcIN544L+Ra9cooVM1CShqEvbXM1/5/5NhJkXXjj8wuHXPv3s44+LimpuXxM9JGqeqWt7ZsKqUq6jq1DgxDNqKGnHL2vf2mtYskZT9KQ1hiKoeVSo2WQ78HgwJHAIXEwzJx0KVhBbzQ9lbP8WGhCDvM6yaYCdGQzuEL/DVh5CTZ6277Gx5gjqhJzfw6BAvjTWxc3ODjkaFzc3BBv8CrsaRrKmhy8/odXUfPLZYQDNyy8DarCKirds6b/dyxk1FGEm4P+hhM1zm59rMxFyHH4AuRme0Z63qH8bSvZVGZStOW2qVe7DgBrH9ckzs7ODNoX7K/7IWnqBWBVlR/QZZSZ14AiFrFbbBK1blzpzjvpUtl0dtG7ixtTJ5ao3rOiLeIXPTF2tuX7tnRx0FE6meNPWy8t7NNx9uL/a7TFudOrM7I0tk1LHegnUUWNjO/B4cOUTvyByqQI6z0cNoA6/YP6siijEnbhp88DvYPAg8gT6AXoe40qWa6GyTi0sdH2oU2SYmKxZY2KSofldh7kLfDRFUE2e/D4Rq/4EWRo3O7HYLtbNLRb964Zq3uyZLurjezZpm6y9H908/NlHF/qLXgDcvFxfU1NTv2X3lv6vTxniahSgwaypa9tcV8it5L3XuQr/W/vWHup0tW91V6E2HxAX5JQyXIw7lN8i3Bkup3SoUWNpMx43tIHM1pVTE3iyB7xKZk7mclQgpt/VNIJGgY31OPod601M32SseqMs2ZaRql09SjAWvYVcRssYMzMKAVOtzczKrMzDsj2oMU70vdjYWyi1WpMJEISXKa5jlqpIMZePNsP3O87MzGydh+o4yHbo5p6PX+7iFc/OnTctbhukdJ5vhABLHTsIPBJkeKYg8lCWh0IPYg+yPcaPC3xcN5gcp1kjPG6y4aFYc9rERCw2MTmt2dTEL0qnYKNWg3qaz3+aYWp40LsX6+4ijogW24HE0RFiF/dY6ORj99awR6lnag5/ev/IkQsYNUX9b9y6XLNl9+7dF65xsxUZdjgfrIqiADV1bftNOaGmoxubGBQ4HeBRAVX3W/uoYpRRVZX+E6j6aZoV7FX5n
NYhRo23cm5iZePKDLYO65hFGlQgdrLSPgJrqhJBSBNVlsfrDPON8bThwKgxRmemUKMqdk+F65dZrXdSXWgSHqMPNVaTxjCu4zHWWIVJxXkYXzCUqGGmdPwCkddB2Nk6nXY7wJ0Aeaj6Ttuh82fFrG1MpNkzD8EHfM+KSBV+gD8+owQjkDQma2i+uIIzeSjWHDc57uoKHzSjxg/SX+lzF/uoR1BsaAhLgyMi7FzcMqPt7KKDg9HHTAifIoKDS1k1KO2o4b3z0c3XPr29HqHm5aKXd/S/sfhWTTGg5oOvOS4Hx301ihLUfkwaQM0aThmXBmdIA6Mb3ePr3EC5mIbut3y797Q3NOzZoz9DLFRiRaz6pFj5SeGQosaLpoO1NQ2DVPzGOtR6woiOHJyUU1PLCDTRx2M02DvR03miFf2GPz3EjE0njJqFlHHRgJpwaybRzljpR43XROoKHoo7oOpMIww17LROfO5izJ3ZCvA0UuSRzF/kOXCbf8/5EqAPhFzp0xF+MH+UAMIEQghCDEIQMh4eCjFJYzBrpFKpu/vpwkI4BzTwdcLYwsLT7u7wadaX+WEUJ65c7KUDNTxpRCbETLFu4ky7TKlQKIV/xG6x8EocIc3nghoeytQcvvnR5Qs1O8DTFO2oufD17Y93gK25cZnTEs1qkfjH59rq9qta+HAEtd+0VKq/iMXb60u7mr0lvrUKp1O7z/ct3xLnWg61qEx1T4Ol7OjLHErUCPBqoHVzrATmjt6TsYvAEdFYe3YEFY5eT3XUMQLeOQrT2Hoy2A9H/3IMFzohYoyHlNn4j7IM22StAAuFGoo0Ti3JA1DjMQYFW2G2XusX2ivGeJeH4zubCR04XgOmkeMkdJ3xm9bbQsYmCJ8Y+xqv8nAcVy2EUavDVcdTR1o84gjgWRC54lmKPMu2IfSA6QH2VMwC26MRPpT7AQBVAIHAAE3ftpViEIYQotBSikMAIopECEXxGEZAo0HAEZs0hrHG3YQhCMGEzNfu6qhBsJnHfhgyGxoR0bGQp7EDuECLcH4+dApnQiAF+ZrYTG6oyYfwCVLCO5bUL1lSBKRZUv8xVKCWAGquXuFQhAquPrZ9504//5TeQ3e+Pd15/Xrn6W/v3DlE6dtgveNrEWpwsORcQtsa+K9q7549e6u4gE4RPxVqrkq1DiVqcKSkrMaEIXJgJ2DpxI6gsmHen7HUNQKOER/mUNPX0RZHXBQOcOZnkpVA5XCsrRSogRfjy2lqsFGD/BJ9A2EIDqMpk6WjArXJgxm24fDL3lugSgXT8RTreERrFEIPmB5gz0psexB8EH3A+SDrA94H+BNaoP1hR/xQeQDG0NqoRgARJhGgCFg0bTaiEcYR8AgBCYhEMQlTCXMJkykX0ykQAwojyg9jygGTCsHK2HgUkqOjo0AgGEAag1gjENNYMS0sdDc29vH5trDwp99ROkRf2g/dRq7S5zWeYMJGDTWZKBeMkzRSHqpHRaMXkCF2iWWtTtCKGmRqcI5mR9GOHYg0S4oAPR/3X/ng6qVb+iOoCEtATfxBpJ1Y6CgwNzcsLOycf4hYf/2pxLmdAswB35LuDoNXa+ZpLja505/OG0rUILcySUUUBIWNlorcjCqCsnJSehStI7ATUr2DwGOxHoHHCw0erwimzOco36CCmVQrxVXUUaM6GX5N3bJ21OCbPKMyO7vsGUHSY4kaDb7Hxw/TZ4UCP2B+KABB6AUEQgiiGQQQ0k0hRjp60aL5ARLJLGASghJQ6flE4FL6lCnTp28DNm1dtgzwhAGFEIUghbQSaS6lGUr99Q9/+CuCF0Ph6HPhz+oXDP/LH0B/mTFXIXSNv7wOmjlvHrr0bHwT0+IYd7+cARt11MRi1IjFmcEINcFgcCjUuEULGSUcrajpvX8TSFMEiNlxE5maJUtufvopoObylRuX3tCPmmBbQM3BQH9/b2/vMKyuri70MSQk5GKX/uaaAyW+VJewUXuJs2+t4o4VBSi91NlAMyVCjYAazc4go2amB6vui2YnlfxdbW/BmI+IIjR4tI4wRyELI0c8STGhsdnYpYIDygNhW4RRc0Y1Qg01ZusdWVnpMWECnajBiZw5jKAIWbHRXk8UarS4HwBQICIQhSDEIBpCQCGMoUQKRBSJYoBFACNEI/miUI44MkQFnqFIi2jJ5XJPkHw+llwp6l38pfB2AXUfGm8HDdd1wcRKjdDICI4Qx0J9WxxN5WpKI1BuWIzAU5rl6lpdrR81H0EquKa/ZgcSYs2Oops3i5bUXLkEqNG/01+ED0LN9mPbtwciwpzr6rp4MezgwXj/rhA4iOXgakoO0HQ54OtcUtvBMjUdeoMoLflfZbZ4KFEzNigoiLHsSAWOUWjeKxcKoSwMTRGtIzAOrFWpZEvY7AE34qCZrTI1FBJwbIYDKAYb1FDDrIDhT+CckHbUoCpXC+M6VIKJzh4/0ajRSSEHhCHgEAUiMA7IKcyjYBS3bNlWMEXTAUiISDSTgEoISwhMCE1IFJ4oQkkwpUCYG4vkLFxo4ZYn35MjpQoKPAsQWShSyWlOydH16GtL8I0wxgQ0Pp2mGRpSaUR0RDQkgaNRajgzM7q0uhrxRlhtfuzYMdU+fjpQs6Oo5gLkhMHXQEsNmJsdYG2Ki3ffOHueA2qCEWq8vv3pz9eTDp7zPneu6+JUb/87P//8bRf4mrBeDrkaZ0XlqWof+JruBoWngc917HmsUKMmZFlocKBDezoa8UIjdhnrGYETthO9BzT1WasVe6wgS4s79hBqmGxQQw2zoVdQRvfeaEcNvg6rMw8nnOgVCL9V1HALy0b5AJEASYhJACWgUiTkbZZiMs2YoQhkUNDECGEoQWwFuAJtozT93++9928IvKZTANOr6Wzhc8Ap3vuvDIVtoP+++eab/6JiNxy80eGbpwo0fWlai93C0i++iPjyy+DgzFgksTRLhDaRyMoPdnOJVu2TpQ01Gb0X6ncg1Cypr6/vv3y5fwmKoZYUb+GIGumoY9uP/RnVnepa4wO9z10M8e793TPw8sfei2kXU/QGQFDjpitP0CgMmeGSfbUNCsfT3r1Xb0/PSAqgWHN1KuKGAhw410JPSASRMitzPSNwwtjCI7WcvT3D6tHaODVpDCviUkfNeke1vp4gnajBIV/YgBZhuhRPUDNkpW6UBK5GeWLIFxt7f/i/Sn0Y5qMmY1VKWYBSyopTQI3bhC5uS01Q1XtAlcyHXkcf8HxfmvYKFI/XUfXFl19++cUXwRFgaqDyFC2GLSUy8t2P19X9kueiPy18u6YYNwjXFxdfOHLkcn1RPXiamiuXz1/iEkCVjjq2c/spEwSXVUmB/oFhId4/4YL370IupoUdkuob3+7srFiJAOmafeBxfJ0PoKaa9j3dVBbn8UkLY9nOsQk62mJP98vR4DBHFWoaBCi7wloKrXmEouvFrCVota3ylyN5IAIYqGEue1RDjZe6e8IRlVbUoOuYscbg+6bDMIKaoWWNouD04VMMfWhA/55iqxrYtEZTF58PfjLg/Ma4SF0tfLCbZ0ja5//4/PNeaOUDa+PeDD07
zXal0maTtrYfj6/Ri5qUd/qLi6HQDahZcuXWrcv1OxB0+t+49/alV/WjJiMf0jR+O3cear1+x98/HhLE3l3n7vz855/vnAsJSevq1ZsX7uimssHU2aogdQPsAeA4+0LPsCHFblYEJdwwHMVutGRgowezVY4FDiqCQvHTONX+4lpHMDvsPNYl0+YGZ4W9zA1Gjb2VeiEbJ3y0omayB+NOVFboDEHNcLLmKZa4ngGtf1Icr9G4QZbPfAo0Ap3dwkYpu/6z9auv/jn5H4cyMyOC7ZpPwwpv+BAdDRsNt645rr9b+NqFLcVLEGrq62v6+2uoo/4rb59/9fw9Dg+AAdRAcRsq3QcD43duP7YzMDcsN9fbuystJC0t5JSQU7KmZC+VrYGKU/uBfSW+IOfuWk59NSOphc98/UTVkoFxkxYypitO0IQb0wTZqAx0tI8wVy1awu3EMx2USR8rw1HDxga6B92oQdeZOBA1kwhqhpM1CDBvKj9wPUEEbhUWbtggxA3DGnaS8AmQR8VFOprrXJjAS6oMahntMfrvZ/72j0PRsEQBtsWKdWvO21DYXGpunu+epx81p77uR6wprt9SXFxcVIRSwsX1uz+4cfbVt4/oL3ZHo7QwStKEefvHH0u5k7TTG2pQF4EzaWmfd6XoR00DOBhF8x7esKahHfXvNXRw66/RszDh0a+BGsXe+ZuJmvWjaWMCa653eQnK7RlTPFuRSkFla2XnsK4R6OThqaqVS/giCAGjHwI17OQQOssZS32upsVqwAUWklzNsLCGXm6JIDPhlaeeemXCmwagBi21dD2NWvlOu2pecmmcuGwF/iXQudwyfPK/3r179+537/7rL5W9EbAhn7uLm/vpzuObf2wthe3MhRyWW167/cEWpGKsevRhy+6rZ8+++r7+Z4IblUp9tm/f6Q+1J++Dxw6t2r/5zk4UOgFn0kK6usQcdpLAtmZfg6KhRrVgnMepm4+x3FI4YKPhQVhuifO1qtnppUKNJV4/sFG54QILHOgFakvxR9GMooitcwRdLfJOPkrzCNWMcGHK3/BcjbWVRoOiFTXoOuwx5i1mpAI1TKxRbCKBvMwrE155BeGGO2oKYatPU7ptGLb9LNRQNIukfgXUN5Fgu5p/Hn33wYMHrzz44fWX/tQbAau6YceaZneXvF9Wmea12lWbc0BN/rXLH+yGhQgf9Pcj4tRAozCQ5tVX/3gvhcOKyXyptNdr+3bEm+ttdZvb6q6fiz8Ijde5504dCpbqRQ0P17iBNe1GzO49HvWsmCr9qBm4icSGQd1EQg01+CWV5cVLBhgZYhY4bFGBGSIoGw/VEgHdIxh/dOagVhwc8mAbpHG5kW7UsFO8OEntoAs1+DospDk6KUJAgprh0lMUayjScEYNvTqhs5M+0ME0nfvVfNXy3YMfvv/u7g/vfvXN51/CGm+EGrdSobhzzXHT1mghl/1qRMCaG5duXLl8ZTcg58obt96/cRVI8/6RLG6LA9xXmRZez1uzqm4zUt0q0+ugnwpLOW4ZurcEx1B4Mz4eo7QG5al2Llcf2q2xMFtU7bphqqYVxANmfiOZBY4gijFoliuXUesewfQ2KOQxg6mvyvnQ2jXeyemopf4KFLPY7ai/rwZfh1Ugn8M4C0HNY4QaKebLGih3S9fgQ6nWL63WsQsfL2Pt379/8P177737PYUaMd6KL1bqap7l1tna3NzMZRc+o6xrt+69f+nS+fM3rl69euXe0nvnzyLSpHDaBoKXEQ1eBraN2PwcJfSqrW3/i1z3JuahEMoZuoYb2J6Gt6eWU7ZmaDf8xJNQtUybseYATW5md10QK8mLfII13vZOlTfRPmLX+LKyVEaXjBXNBIETe1MbHAvNtNWPGmZ5vVy5slwravB1WO1CKGWtiPwIaoYPNQYHUGh993F6Hbf7cbU13ewfupHOvYW/+vv/PfgOoeb1r75Jg+aaWLSVOTTvZZk7ulZHd27gsrcwsObcgiPnz549C6S5eunt9/+IPU1KFjfUGImfWwWiQfMcOoCXbWs4P3iOYg001NQ2qLaN6Gjv3sepBjXE25gbW7MmLo6BsE0xVmuwxe21KtQIUK6jHH25crCOEd5qqVwcvqASOZrxY1RNxHhVJGaFHtQwFjMIziiXg2tfmIDS2PaMCMq4bNwTt9zycUSN4Wlh2FU4T7nRZ0ae9h2GBz6chfXEhKyvNn734O677353F3I1XcL8UrEb2lq4ubPzdGy0XaepCacnJhhlZFWnvX/pLNKrWH+8lybiRhpATeYqDWrbwP0Zlx21viUUbLpr97Y3VEERqrbblyNphvrhLGiililSH7boFb0KscWMsbOVuQBviMcIh1AzSyrkgc1UHXjaRziWsferClOkdv3ZKxMQeVSbSOhADWNlQrlqkTZGjcbNPsPUrBCK4Dwe600knhBXY3Cx+7gpq7wdYXpc21cOyJewnwP1n9TXf/jh7vd3774+M1LE4wmjYXdhgE3nj3grvF8UqNHzHCiIglKuvfH2+fNvg87DP/fSsjIyuKJGrAidIHjajI/hQ92LBjxON2OPMzY2zr4Uc6CLD+0wzPUEQ/rIObx8gJ6ExjhfS9MCkcJekd+worbeVKHGEdWe0FBVU42uEchVjFd2+tkeVTa1MC9PNQ9TcY4+1HgoMjzh6CVtWBydLLRtQr6QeR2qdVBJOIKaYczVGNjCl+Gu1rPn6i7kaGrYT7fk7bL52+tYfys/l2WUZSSNdW9uzix1O47m+y8bOulUjb6nW8Jsz0g6cusI1q0jaRA8cX5CCi/4RZbwTH/xxVZDHrGS0QC7SVCUoUiDtuDjjiqh+9A9SBf/vbfYWO5tOdWG2iSGThLjtc9OeF9x75nQDzN6HDu1uo7aHlN9Oz7NI7yR37CnmoQdV29E2z5Qy7at8GbE2fB4BYepVGGKKpXrQw3sRD51lLntempDPcVXIl81enL5aqsB9XO8MY5F9hy4A4f1mG9KU0NQM0x6yIUJ3CQYOF/Zz+xOuTg26KWXXvrnpsoukVFWdbUoNq8QCk+lsGMWhFLBUkWqRu8zu2G6i1JSUkTUB4OexJTxPxolMjLoJLhPuATJt2QfShEb9Dzd0lYmaFq1LYh4FD+TIPZKAtUf+2zMEvuW8RgZ1lPRPLZwUiZxcU8MRFsCdkZEywicBLIwK5uUvZFq41PQYRfdZjOaau4bTZe+9RS7Eawsxii2CVZmnOnNiDUFUcnUwohx48dTg8xUQCGoGRYJ/Bms+dD/EZ+9WsN0SVjOX66IoHiilDl/+uabb/60Igxmdla1a3XsBtPO6HwdQ3Rbm4yMDIMwwx7Ny+BxfqqKehhmxKvaW3ugu7v7QG17lYF78WFnk9maB4FUYV5rpvZunkfxMzHOHqP5IQdezGccnAlzxF5AtVMMzvoycjO6R9iuY12FcZldToxPlymWeetBjT/zdNns7f205GvG2ltoeGICQc2TKVdN0yVE5invSVGWikEiIxHMcu2zMKVH7ikLMSJ6RGugVqtWLjkxp5nDTIXhKUOz3BJPb9WmVDgZw57UOkaMmmy
tepxKC3OYV6qH8vlQypPrQY2V8czRms61nno4jObUMANPRxk7eBHUPHky1txxZxMDHoUdnuhMwYrA1MTYiAhoHtlyS8HqydktZS2TguaoTbIwm9QzZ47OXE/NXOM5myYne+t+lICOEZZjF04sG+10Bp5iqXYZq7FB2UeDkudYctmgn0INDLJJ3Tgxe7LabjfG3uVjy7XdY9imdS1O489kJ/uT52E/2bGZlm7dEFloaE4a9/mVlhMaSkzN4KzsfgykQA0RkYGkwS5F0pfEdXol9UkGuCCCGoIaIiI9pIHcS5OEH3OCI2uSTsTwJU0pBDMENUREmvI0utY6hsjm8yu4sSbpRAU/gIRPBDVERBrlqnvKVObI+TF9+veTEaX1xfDlOZUEMgQ1REQagqdqfQipzAngS3IS9MRFKQk5En5ATiVJ1BDUEBFpsDT6u+hElTIJP3R5T0KSVoyIkhJ6lofyJTJCGoIaghqigY7GNYvTrAlpAo7Il8tOJqQlidRgIhIlpSWclC2XA42aSJ6GoIaghmgAZ6o5T5uUBFmMJ79AEpUj6+l7mqW+HllOlKSA7xkjSyC1J4IaghoiBmQExq4c/YzSu4TYIOOiVWB5bEJI8PSbl2KnZiIixW+EwUNEIQlNssYKiVzteeGecklFo6wpgYCGiIjo0UgUUpl88kQTO4BqOnEyuZJwhoiIiIiIiIiIiIiIiIiIiIhoBOv/ARDcmR9uREnwAAAAAElFTkSuQmCC) --- # What is SwitchyOmega? **Discover SwitchyOmega, a Chrome extension to manage and switch between proxies, which is extremely useful when testing proxies for a scraper.** *** SwitchyOmega is a Chrome extension for managing and switching between proxies which can be added in the [Chrome Webstore](https://chrome.google.com/webstore/detail/padekgcemlokbadohgkifijomclgjgif). After adding it to Chrome, you can see the SwitchyOmega icon somewhere amongst all your other Chrome extension icons. Clicking on it will display a menu, where you can select various different connection profiles, as well as open the extension's options. ![The SwitchyOmega interface](/assets/images/switchyomega-7d5c6885b1d3687c6ffc571b704d837b.png) ## Options[​](#options "Direct link to Options") The options page has the following: * General settings/interface settings (which you can keep to their default values). * A list of proxy profiles (separate profiles can be added for different proxy groups, or for different countries for the residential proxy group, etc). * The **New profile** button * The main section, which shows the selected settings sub-section or selected proxy profile connection settings. ![SwitchyOmega options page](/assets/images/switchyomega-options-cb20092151c893cd64a6af8544c87af7.png) ## Adding a new proxy[​](#adding-a-new-proxy "Direct link to Adding a new proxy") After clicking on **New profile**, you'll be greeted with a **New profile** popup, where you can give the profile a name and select the type of profile you'd like to create. To add a proxy profile, select the respective option and click **Create**. ![Adding a proxy profile](/assets/images/switchyomega-proxy-profile-8627d2d260a00412f24bbc3ea78b6654.png) Then, you need to fill in the proxy settings: ![Adding proxy settings](/assets/images/switchyomega-proxy-settings-e516191ea6516e9c48ef67d48126542f.png) If the proxy requires authentication, click on the lock icon and fill in the details within the popup. ![Authenticating a proxy](/assets/images/switchyomega-auth-60b6cb1ca738f1bc3354ab46331eba2a.png) Don't forget to click on **Apply changes** within the left-hand side menu under **Actions**! ## Selecting proxy profiles[​](#selecting-profiles "Direct link to Selecting proxy profiles") And that's it! All of your proxy profiles will appear in the menu. When one is chosen, the page you are currently on will be reloaded using the selected proxy profile. 
![SwitchyOmega menu](/assets/images/switchyomega-menu-99479bd509281e5e1b4ecdbec61c7302.png) --- # User-Agent Switcher **Learn how to switch your User-Agent header to different values in order to monitor how a certain site responds to the changes.** *** **User-Agent Switcher** is a Chrome extension that allows you to quickly change your **User-Agent** and see how a certain website would behave with different user agents. After adding it to Chrome, you'll see a **Chrome UA Spoofer** button in the extension icons area. Clicking on it will open up a list of various **User-Agent** groups. ![User-Agent Switcher groups](/assets/images/user-agent-switcher-groups-a8c7591377232d9facfa25982020362c.png) Clicking on a group will display a list of possible User-Agents to set. ![Default available Internet Explorer agents](/assets/images/user-agent-switcher-agents-16779d6057631ab46b517d13510ae07a.png) After setting the **User-Agent**, the page will be refreshed. ## Configuration[​](#configuration "Direct link to Configuration") The extension configuration page allows you to edit the **User-Agent** list in case you want to add a specific User-Agent that isn't already provided. You can find some other options, but most likely you will never need to modify those. ![User-Agent Switcher configuration page](/assets/images/user-agent-switcher-config-e71fa1540a5cc4b886fbc77bc7a7fe95.png) --- # Tutorials 📚 **Learn about various different specific topics related to web-scraping and web-automation with the Apify Academy tutorial lessons!** *** In web scraping, there are a whole lot of niche cases that you will run into. Because our goal with the Apify Academy is to totally prepare you for any battle you may face in your web-automation projects, we've decided to create the **Tutorials** area of the Academy. This area contains various one-off lessons about different specific topics related to web-scraping. --- # Web scraping basics for JavaScript devs **Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.** *** Welcome to **Web scraping basics for JavaScript devs**, a comprehensive, practical and long form web scraping course that will take you from an absolute beginner to a successful web scraper developer. If you're looking for a quick start, we recommend trying [this tutorial](https://blog.apify.com/web-scraping-javascript-nodejs/) instead. This course is made by [Apify](https://apify.com), the web scraping and automation platform, but we will use only open-source technologies throughout all academy lessons. This means that the skills you learn will be applicable to any scraping project, and you'll be able to run your scrapers on any computer. No Apify account needed. If you would like to learn about the Apify platform and how it can help you build, run and scale your web scraping and automation projects, see the [Apify platform course](https://docs.apify.com/academy/apify-platform.md), where we'll teach you all about Apify serverless infrastructure, proxies, API, scheduling, webhooks and much more. ## Why learn scraper development?[​](#why-learn "Direct link to Why learn scraper development?") With so many point-and-click tools and no-code software that can help you extract data from websites, what is the point of learning web scraper development? Contrary to what their marketing departments say, a point-and-click or no-code tool will never be as flexible, as powerful, or as optimized as a custom-built scraper. 
Any software can do only what it was programmed to do. If you build your own scraper, it can do anything you want. And you can always quickly change it to do more, less, or the same, but faster or cheaper. The possibilities are endless once you know how scraping really works. Scraper development is a fun and challenging way to learn web development, web technologies, and understand the internet. You will reverse-engineer websites and understand how they work internally, what technologies they use and how they communicate with their servers. You will also master your chosen programming language and core programming concepts. When you truly understand web scraping, learning other technologies like React or Next.js will be a piece of cake. ## Course Summary[​](#summary "Direct link to Course Summary") When we set out to create the Academy, we wanted to build a complete guide to web scraping - a course that a beginner could use to create their first scraper, as well as a resource that professionals will continuously use to learn about advanced and niche web scraping techniques and technologies. All lessons include code examples and code-along exercises that you can use to immediately put your scraping skills into action. This is what you'll learn in the **Web scraping basics for JavaScript devs** course: * [Web scraping basics for JavaScript devs](https://docs.apify.com/academy/web-scraping-for-beginners.md) * [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) * [Basics of crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling.md) * [Best practices](https://docs.apify.com/academy/web-scraping-for-beginners/best-practices.md) ## Requirements[​](#requirements "Direct link to Requirements") You don't need to be a developer or a software engineer to complete this course, but basic programming knowledge is recommended. Don't be afraid, though. We explain everything in great detail in the course and provide external references that can help you level up your web scraping and web development skills. If you're new to programming, pay very close attention to the instructions and examples. A seemingly insignificant thing like using `[]` instead of `()` can make a lot of difference. > If you don't already have basic programming knowledge and would like to be well-prepared for this course, we recommend learning about [JavaScript basics](https://developer.mozilla.org/en-US/curriculum/core/javascript-fundamentals/) and [CSS Selectors](https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors). As you progress to the more advanced courses, the coding will get more challenging, but will still be manageable to a person with an intermediate level of programming skills. Ideally, you should have at least a moderate understanding of the following concepts: ### JavaScript + Node.js[​](#javascript-and-node "Direct link to JavaScript + Node.js") It is recommended to understand at least the fundamentals of JavaScript and be proficient with Node.js prior to starting this course. 
If you are not yet comfortable with asynchronous programming (with promises and `async...await`), loops (and the different types of loops in JavaScript), modularity, or working with external packages, we would recommend studying the following resources before coming back and continuing this section: * [`async...await` (YouTube)](https://www.youtube.com/watch?v=vn3tm0quoqE\&ab_channel=Fireship) * [JavaScript loops (MDN)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration) * [Modularity in Node.js](https://javascript.plainenglish.io/how-to-use-modular-patterns-in-nodejs-982f0e5c8f6e) ### General web development[​](#general-web-development "Direct link to General web development") Throughout the next lessons, we will sometimes use certain technologies and terms related to the web without explaining them. This is because knowledge of them is **assumed** (unless we're showing something out of the ordinary). * [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML) * [HTTP protocol](https://developer.mozilla.org/en-US/docs/Web/HTTP) * [DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/browser-devtools.md) ### jQuery or Cheerio[​](#jquery-or-cheerio "Direct link to jQuery or Cheerio") We'll be using the [**Cheerio**](https://www.npmjs.com/package/cheerio) package a lot to parse data from HTML. This package provides an API using jQuery syntax to help traverse downloaded HTML within Node.js. ## Next up[​](#next "Direct link to Next up") The course begins with a small bit of theory and moves into some realistic and practical examples of extracting data from the most popular websites on the internet using your browser console. [Let's get to it!](https://docs.apify.com/academy/web-scraping-for-beginners/introduction.md) > If you already have experience with HTML, CSS, and browser DevTools, feel free to skip to the [Basics of crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling.md) section. --- # Best practices when writing scrapers **Understand the standards and best practices that we here at Apify abide by to write readable, scalable, and maintainable code.** *** Every developer has their own style, which evolves as they grow and learn. While one dev might prefer a more [functional](https://en.wikipedia.org/wiki/Functional_programming) style, another might find an [imperative](https://en.wikipedia.org/wiki/Imperative_programming) approach to be more intuitive. We at Apify understand this, and have written this best practices lesson with that in mind. The goal of this lesson is not to force you into a specific paradigm or to make you think that you're doing things wrong, but instead to provide you with some insight into the standards and best practices that we at Apify follow to ensure readable, maintainable, scalable code. ## Code style[​](#code-style "Direct link to Code style") When it comes to code style when writing scrapers, there are some general things we recommend. ### Clean code[​](#clean-code "Direct link to Clean code") Praise [clean code](https://blog.risingstack.com/javascript-clean-coding-best-practices-node-js-at-scale/)! Use proper variable and function names that are descriptive of what they are, and split your code into smaller [pure](https://en.wikipedia.org/wiki/Pure_function) functions.
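As a quick illustration (the function names and data shape below are made up for this example), compare a cryptic one-liner with the same logic split into small, descriptively named pure functions:

```js
// Hard to follow: nondescript names and several responsibilities in one expression.
const f = (ps) => ps.filter((p) => p.price > 0).map((p) => ({ ...p, price: `$${p.price.toFixed(2)}` }));

// Easier to read and test: each pure function does one thing and says what it does.
const hasKnownPrice = (product) => product.price > 0;

const formatPrice = (price) => `$${price.toFixed(2)}`;

const formatProduct = (product) => ({ ...product, price: formatPrice(product.price) });

const formatPricedProducts = (products) => products.filter(hasKnownPrice).map(formatProduct);
```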
### Constant variables[​](#constants "Direct link to Constant variables") Define any [constant variables](https://softwareengineering.stackexchange.com/questions/250619/best-practices-reasons-for-string-constants-in-javascript) that globally apply to the scraper in a single file named **constants.js**, from where they will all be imported. Constant variable names should be in `UPPERCASE_WITH_UNDERSCORES` style. > If you have a whole lot of constant variables, they can be in a folder named **constants** organized into different files. ### Use ES6 JavaScript[​](#use-es6 "Direct link to Use ES6 JavaScript") If you're writing your scraper in JavaScript, use [ES6](https://www.w3schools.com/js/js_es6.asp) features and ditch the old ones that they replace. This means using `const` and `let` instead of `var`, `includes` instead of `indexOf`, etc. > To learn more about some of the most popular (and awesome) ES6+ features, check out [this](https://medium.com/@matthiasvstephens/why-is-es6-so-awesome-88bff6857849) article. ### No magic numbers[​](#no-magic-numbers "Direct link to No magic numbers") Avoid using [magic numbers](https://en.wikipedia.org/wiki/Magic_number_\(programming\)) as much as possible. Either declare them as a **constant** variable in your **constants.js** file, or if they are only used once, add a comment explaining what the number is. Don't write code like this: ``` const x = (y) => (y - 32) * (5 / 9); ``` That is quite confusing due to the nondescriptive naming and the magic numbers. Do this instead: ``` // Converts a Fahrenheit value to Celsius const fahrenheitToCelsius = (fahrenheit) => (fahrenheit - 32) * (5 / 9); ``` ### Use comments\![​](#use-comments "Direct link to Use comments!") Don't be shy to add comments to your code! Even when using descriptive function and variable naming, it might still be a good idea to add a comment in places where you had to make a tough decision or an unusual choice. > If you're a true pro, use [JSDoc](https://jsdoc.app/) to comment and document your code. ## Logging[​](#logging "Direct link to Logging") Logging helps you understand exactly what your scraper is doing. Generally, having more logs is better than having fewer. Especially make sure to log your `catch` blocks - no error should pass unseen unless there is a good reason. For scrapers that will run longer than usual, keep track of some useful stats (such as **itemsScraped** or **errorsHit**) and log them to the console on an interval. The meaning of your log messages should make sense to an outsider who is not familiar with the inner workings of your scraper. Avoid log lines with just numbers or just URLs - always identify what the number/string means. Here is an example of an "incorrect" log message: ``` 300 https://example.com/1234 1234 ``` And here is that log message translated into something that makes much more sense to the end user: ``` Index 1234 --- https://example.com/1234 --- took 300 ms ``` ## Input[​](#input "Direct link to Input") When it comes to accepting input into a scraper, two main best practices should be followed. ### Set limits[​](#set-limits "Direct link to Set limits") When allowing your users to pass input properties which could break the scraper (such as **timeout** set to **0**), be sure to disallow ridiculous values. Set a maximum/minimum number allowed, maximum array input length, etc. ### Validate[​](#validate "Direct link to Validate") Validate the input provided by the user! This should be the very first thing your scraper does.
If the fields in the input are missing or in an incorrect type/format, either parse the value and correct it programmatically or throw an informative error telling the user how to fix the error. > On the Apify platform, you can use the [input schema](https://docs.apify.com/academy/deploying-your-code/input-schema.md) to both validate inputs and generate a clean UI for those using your scraper. ## Error handling[​](#error-handling "Direct link to Error handling") Errors are bound to occur in scrapers. Perhaps it got blocked, or perhaps the data scraped was corrupted in some way. Whatever the reason, a scraper shouldn't completely crash when an error occurs. Use `try...catch` blocks to catch errors and log useful messages. The log messages should indicate where the error happened, and what type of error happened. Bad error log message: ``` Cannot read property “0” from undefined ``` Good error log message: ``` Could not parse an address, skipping the page. Url: https://www.example-website.com/people/1234 ``` This doesn't mean that you should absolutely litter your code with `try...catch` blocks, but it does mean that they should be placed in error-prone areas (such as API calls or testing a string with a specific regular expression). > If the error that has occurred renders that run of the scraper completely useless, exit the process immediately. Logging is the minimum you should be doing though. For example, if you have an entire object of scraped data and just the **price** field fails to be parsed, you might not want to throw away the rest of that data. Rather, it could still be pushed to the output and a log message like this could appear: ``` We could not parse the price of product: Men's Trainers Orange, pushing anyways. ``` This really depends on your use case though. If you want 100% clean data, you might not want to push incomplete objects and just retry (ideally) or log an error message instead. ## Recap[​](#recap "Direct link to Recap") Wow, that's a whole lot of things to abide by! How will you remember all of them? Try to follow these three points: 1. Describe your code as you write it with good naming, constants, and comments. It **should read like a book**. 2. Add log messages at points throughout your code so that when it's running, you (and everyone else) know what's going on. 3. Handle errors appropriately. Log the error and either retry, or continue on. Only throw if the error will be caught or if the error is absolutely detrimental to the scraper's run. --- # Challenge **Test your knowledge acquired in the previous sections of this course by building an Amazon scraper using Crawlee's CheerioCrawler!** *** Before moving onto the other courses in the academy, we recommend following along with this section, as it combines everything you've learned in the previous lessons into one cohesive project that helps you prove to yourself that you've thoroughly understood the material. We recommend that you make sure you've gone through both the [data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) and [crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling.md) sections of this course to ensure the smoothest development process. ## Learning 🧠[​](#learning "Direct link to Learning 🧠") Before continuing, it is highly recommended to do the following: * Look over [how to build a crawler in Crawlee](https://crawlee.dev/docs/introduction/first-crawler) and ideally **code along**. 
* Read [this short article](https://docs.apify.com/academy/node-js/request-labels-in-apify-actors) about [**request labels**](https://crawlee.dev/api/core/class/Request#label) (this will be extremely useful later on). * Check out [this tutorial](https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md) about dynamic pages. * Read about the [RequestQueue](https://crawlee.dev/api/core/class/RequestQueue). ## Our task[​](#our-task "Direct link to Our task") On Amazon, we can use this link to get to the results page of any product we want: ``` https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=KEYWORD ``` Our crawler's input will look like this: ``` { "keyword": "iphone" } ``` The goal at hand is to scrape all of the products from the first page of results for whatever keyword was provided (for our test case, it will be **iPhone**), then to scrape all available offers of each product and push the results to the dataset. For context, the offers for a product look like this: ![Amazon product offers](/assets/images/product-offers-20910dbac0f5cc3df6089143b924ac5b.jpg) In the end, we'd like our final output to look something like this: ``` [ { "title": "Apple iPhone 6 a1549 16GB Space Gray Unlocked (Certified Refurbished)", "asin": "B07P6Y7954", "itemUrl": "https://www.amazon.com/Apple-iPhone-Unlocked-Certified-Refurbished/dp/B00YD547Q6/ref=sr_1_2?s=wireless&ie=UTF8&qid=1539772626&sr=1-2&keywords=iphone", "description": "What's in the box: Certified Refurbished iPhone 6 Space Gray 16GB Unlocked , USB Cable/Adapter. Comes in a Generic Box with a 1 Year Limited Warranty.", "keyword": "iphone", "sellerName": "Blutek Intl", "offer": "$162.97" }, { "title": "Apple iPhone 6 a1549 16GB Space Gray Unlocked (Certified Refurbished)", "asin": "B07P6Y7954", "itemUrl": "https://www.amazon.com/Apple-iPhone-Unlocked-Certified-Refurbished/dp/B00YD547Q6/ref=sr_1_2?s=wireless&ie=UTF8&qid=1539772626&sr=1-2&keywords=iphone", "description": "What's in the box: Certified Refurbished iPhone 6 Space Gray 16GB Unlocked , USB Cable/Adapter. Comes in a Generic Box with a 1 Year Limited Warranty.", "keyword": "iphone", "sellerName": "PLATINUM DEALS", "offer": "$169.98" }, { "...": "..." } ] ``` > The `asin` is the ID of the product, which is data present on the Amazon website. Each of the items in the dataset will represent a scraped offer and will have the same `title`, `asin`, `itemUrl`, and `description`. The offer-specific fields will be `sellerName` and `offer`. ## First up[​](#next "Direct link to First up") From this course, you should have all the knowledge to build this scraper by yourself. Give it a try, then come back to compare your scraper with our solution. The challenge can be completed using either [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) or [PlaywrightCrawler](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler). Playwright is significantly slower but doesn't get blocked as much. You will learn the most by implementing both. Let's start off this section by [initializing and setting up](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/initializing-and-setting-up.md) our project with the Crawlee CLI (don't worry, no additional installation is required). --- # Initialization & setting up **When you extract links from a web page, you often end up with a lot of irrelevant URLs. 
Learn how to filter the links to only keep the ones you need.** *** The Crawlee CLI speeds up the process of setting up a Crawlee project. Navigate to the directory you'd like your project's folder to live in, then open up a terminal instance and run the following command: ``` npx crawlee create amazon-crawler ``` Once you run this command, you'll be prompted with a menu which you can navigate using your arrow keys. Each of these options will generate different boilerplate code when selected. We're going to work with CheerioCrawler today, so we'll select the **CheerioCrawler template project** template, and then press **Enter**. ![Crawlee CLI \"create\" command](/assets/images/crawlee-create-2f29a900f2ab57f734d05c25cc37feed.png) Once it's completed, open up the **amazon-crawler** folder that was generated by the `npx crawlee create` command. We're going to modify the **main.js** boilerplate to fit our needs: ``` // main.js import { CheerioCrawler, KeyValueStore, log } from 'crawlee'; import { router } from './routes.js'; // Grab our keyword from the input const { keyword } = await KeyValueStore.getInput(); const crawler = new CheerioCrawler({ requestHandler: router, // If you have access to Apify Proxy, you can use residential proxies and // high retry count which helps with blocking // If you don't, your local IP address will likely be fine for a few requests if you scrape slowly. // proxyConfiguration: await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'] }), // maxRequestRetries: 10, }); log.info('Starting the crawl.'); await crawler.run([{ // Turn the keyword into a link we can make a request with url: `https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`, label: 'START', userData: { keyword, }, }]); log.info('Crawl finished.'); ``` ``` // routes.js import { createCheerioRouter } from 'crawlee'; export const router = createCheerioRouter(); router.addDefaultHandler(({ log }) => { log.info('Route reached.'); }); ``` Finally, we'll add the following input to an **INPUT.json** file in the project's root directory (next to `package.json`, `node_modules` and others): ``` { "keyword": "iphone" } ``` > This is how we'll be inputting data into our scraper from now on. Don't worry though, we'll only need to work in the **main.js** and **routes.js** files! ## Next up[​](#next "Direct link to Next up") Cool! We're ready. But [let's discuss a bit about modularity](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/modularity.md) before moving forward! --- # Modularity **Before you build your first web scraper with Crawlee, it is important to understand the concept of modularity in programming.** *** Now that we've gotten our first request going, the first challenge is going to be selecting all of the resulting products on the page. Back in the browser, we'll use the DevTools hover tool to inspect a product. ![Result products](/assets/images/result-items-b1e131e323428eb8baaaf4e991bb79b5.jpg) **Bingo!** Each product seems to have a `data-asin` attribute, which includes the ASIN (product ID) data we want. Now, we can select each of these elements with this selector: `div > div[data-asin]:not([data-asin=""])`. Then, we'll scrape some data about each product, and push a request to the main product page so we can grab hold of the description. But, before we start scraping, let's pause to talk a bit about the important concept of **modularity**.
You may have noticed the **src** folder inside of your project, which by default has a **routes.js** file in it. We're going to use this to create modularized functions which can then be conditionally executed by our crawler. ``` // routes.js import { createCheerioRouter } from 'crawlee'; import { BASE_URL } from './constants.js'; export const router = createCheerioRouter(); router.addDefaultHandler(({ log }) => { log.info('Route reached.'); }); // Add a handler to our router to handle requests with the 'START' label router.addHandler('START', async ({ $, crawler, request }) => { const { keyword } = request.userData; const products = $('div > div[data-asin]:not([data-asin=""])'); // loop through the resulting products for (const product of products) { const element = $(product); const titleElement = $(element.find('.a-text-normal[href]')); const url = `${BASE_URL}${titleElement.attr('href')}`; // scrape some data from each and to a request // to the crawler for its page await crawler.addRequests([{ url, label: 'PRODUCT', userData: { // Pass the scraped data about the product to the next // request so that it can be used there data: { title: titleElement.first().text().trim(), asin: element.attr('data-asin'), itemUrl: url, keyword, }, }, }]); } }); router.addHandler('PRODUCT', ({ log }) => log.info('on a product page!')); ``` Also notice that we are importing `BASE_URL` from **constants.js**. Here is what that file looks like: ``` // constants.js export const BASE_URL = 'https://www.amazon.com'; ``` And here is what our **main.js** file currently looks like: ``` // main.js import { CheerioCrawler, log, KeyValueStore } from 'crawlee'; import { router } from './routes.js'; import { BASE_URL } from './constants.js'; const { keyword = 'iphone' } = (await KeyValueStore.getInput()) ?? {}; const crawler = new CheerioCrawler({ requestHandler: router, }); await crawler.addRequests([ { // Use BASE_URL here instead url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`, label: 'START', userData: { keyword, }, }, ]); log.info('Starting the crawl.'); await crawler.run(); log.info('Crawl finished.'); ``` One of the main reasons we **modularize** our code is to prevent massive and difficult to read files by separating concerns into separate files. In our **main.js** file, we're handling the initialization, configuration, and running of our crawler. In **routes.js**, we determine how the crawler should handle different routes, and in **constants.js** we define non-changing values that will be used throughout the project. Organized code makes everyone happy, including you - the one developing the scraper! Spaghetti is super awesome, [but not when it comes to programming](https://www.urbandictionary.com/define.php?term=spaghetti+code) 🍝 This can even be optimized further by putting our `label` items into **constants.js**, like so: ``` // constants.js export const BASE_URL = 'https://www.amazon.com'; export const labels = { START: 'START', PRODUCT: 'PRODUCT', OFFERS: 'OFFERS', }; ``` Then, the labels can be used by importing `labels` and accessing `labels.START`, `labels.PRODUCT`, or `labels.OFFERS`. This is not necessary, but it is best practice, as it can prevent dumb typos that can cause nasty bugs 🐞 For the rest of this lesson, all of the examples using labels will be using the imported versions. 
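For instance, the `PRODUCT` handler registration from above could reference the imported object instead of a raw string. This is only a short sketch; the handler body itself is unchanged:

```js
// routes.js
import { createCheerioRouter } from 'crawlee';
import { labels } from './constants.js';

export const router = createCheerioRouter();

// Referencing labels.PRODUCT keeps every label name defined in one place,
// which helps avoid typos in the label strings.
router.addHandler(labels.PRODUCT, ({ log }) => log.info('on a product page!'));
```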
> If you haven't already read the **Best practices** lesson in the **Web scraping basics for JavaScript devs** course, please [give it a read](https://docs.apify.com/academy/web-scraping-for-beginners/best-practices.md). ## Next up[​](#next "Direct link to Next up") Now that we've gotten that out of the way, we can finally continue with our Amazon scraper. [Let's do it](https://docs.apify.com/academy/web-scraping-for-beginners/challenge/scraping-amazon.md)! --- # Scraping Amazon **Build your first web scraper with Crawlee. Let's extract product information from Amazon to give you an idea of what real-world scraping looks like.** *** In our quick chat about modularity, we finished the code for the results page and added a request for each product to the crawler's **RequestQueue**. Here, we need to scrape the description, so it shouldn't be too hard: ``` // routes.js // ... router.addHandler(labels.PRODUCT, async ({ $ }) => { const element = $('div#productDescription'); const description = element.text().trim(); console.log(description); // works! }); ``` Great! But wait, where do we go from here? We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up [Proxyman](https://docs.apify.com/academy/tools/proxyman.md) to analyze requests that might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers: ![View offers button](/assets/images/view-offers-button-11e242f1d72d36745d86c7a0f114e637.jpg) After clicking this button and checking back in Proxyman, we discovered this link: > You can find the request below in the network tab just fine, but with Proxyman, it is much easier and faster due to the extended filtering options. ``` https://www.amazon.com/gp/aod/ajax/ref=auto_load_aod?asin=B07ZPKBL9V&pc=dp ``` The `asin` [query parameter](https://www.branch.io/glossary/query-parameters/) matches up with our product's ASIN, which means we can use this for any product of which we have the ASIN. Here's what this page looks like: ![View offers page](/assets/images/offers-page-f9880ec59ed265ab8bf0acce5e2a212b.jpg) Wow, that's ugly. But for our scenario, this is really great. When we click the **View offers** button, we usually have to wait for the offers to load and render, which would mean we'd have to switch our entire crawler to a **PuppeteerCrawler** or **PlaywrightCrawler**. The data on this page we've just found appears to be loaded statically, which means we can still use CheerioCrawler and keep the scraper as efficient as possible 😎 > It's totally possible to scrape the same data as this crawler using [Puppeteer or Playwright](https://docs.apify.com/academy/puppeteer-playwright.md); however, with this offers link found in Proxyman, we can follow the same workflow much more quickly with static HTTP requests using CheerioCrawler. First, we'll create a request for each product's offers page: ``` // routes.js // ...
router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => { const { data } = request.userData; const element = $('div#productDescription'); // Add to the request queue await crawler.addRequests([{ url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`, label: labels.OFFERS, userData: { data: { ...data, description: element.text().trim(), }, }, }]); }); ``` Finally, we can handle the offers in a separate handler: ``` // routes.js router.addHandler(labels.OFFERS, async ({ $, request }) => { const { data } = request.userData; for (const offer of $('#aod-offer')) { const element = $(offer); await Dataset.pushData({ ...data, sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(), offer: element.find('.a-price .a-offscreen').text().trim(), }); } }); ``` ## Final code[​](#final-code "Direct link to Final code") That should be it! Let's make sure we've all got the same code: ``` // constants.js export const BASE_URL = 'https://www.amazon.com'; export const labels = { START: 'START', PRODUCT: 'PRODUCT', OFFERS: 'OFFERS', }; ``` ``` // routes.js import { createCheerioRouter, Dataset } from 'crawlee'; import { BASE_URL, labels } from './constants.js'; export const router = createCheerioRouter(); router.addHandler(labels.START, async ({ $, crawler, request }) => { const { keyword } = request.userData; const products = $('div > div[data-asin]:not([data-asin=""])'); for (const product of products) { const element = $(product); const titleElement = $(element.find('.a-text-normal[href]')); const url = `${BASE_URL}${titleElement.attr('href')}`; await crawler.addRequests([ { url, label: labels.PRODUCT, userData: { data: { title: titleElement.first().text().trim(), asin: element.attr('data-asin'), itemUrl: url, keyword, }, }, }, ]); } }); router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => { const { data } = request.userData; const element = $('div#productDescription'); await crawler.addRequests([ { url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`, label: labels.OFFERS, userData: { data: { ...data, description: element.text().trim(), }, }, }, ]); }); router.addHandler(labels.OFFERS, async ({ $, request }) => { const { data } = request.userData; for (const offer of $('#aod-offer')) { const element = $(offer); await Dataset.pushData({ ...data, sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(), offer: element.find('.a-price .a-offscreen').text().trim(), }); } }); ``` ``` // main.js import { CheerioCrawler, KeyValueStore, log } from 'crawlee'; import { router } from './routes.js'; import { BASE_URL, labels } from './constants.js'; // Grab our keyword from the input const { keyword = 'iphone' } = (await KeyValueStore.getInput()) ?? {}; const crawler = new CheerioCrawler({ requestHandler: router, }); // Add our initial requests await crawler.addRequests([ { // Turn the inputted keyword into a link we can make a request with url: `${BASE_URL}/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=${keyword}`, label: labels.START, userData: { keyword, }, }, ]); log.info('Starting the crawl.'); await crawler.run(); log.info('Crawl finished.'); ``` ## Wrap up 💥[​](#wrap-up "Direct link to Wrap up 💥") Nice work! You've officially built your first scraper with Crawlee! You're now ready to take on the rest of the Apify Academy with confidence. For now, this is the last section of the **Web scraping basics for JavaScript devs** course.
If you want to learn more about web scraping, we recommend venturing out and following the other lessons in the Academy. We will keep updating the Academy with more content regularly until we cover all the advanced and expert topics we promised at the beginning. --- # Basics of crawling **Learn how to crawl the web with your scraper. How to extract links and URLs from web pages and how to manage the collected links to visit new pages.** *** Welcome to the second section of our **Web scraping basics for JavaScript devs** course. In the [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) section, we learned how to extract data from a web page. Specifically, from a template Shopify site called [Warehouse store](https://warehouse-theme-metal.myshopify.com/). ![on-sale category of Warehouse store](/assets/images/warehouse-store-9982b9901e6a56a6f69035fcd2af4ad5.png) In this section, we will take a look at moving between web pages, which we call **crawling**. We will extract data about all the on-sale products on [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). To do that, we will need to crawl the individual product pages. ## How do you crawl?[​](#how-to-crawl "Direct link to How do you crawl?") Crawling websites is a fairly straightforward process. We'll start by opening the first web page and extracting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) section. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case our scraper crashes with an error, we won't have to extract them again. And, finally, we will visit those URLs one by one. At any point, we can extract URLs, data, or both. Crawling can be separate from data extraction, but it's not a requirement and, in most projects, it's actually easier and faster to do both at the same time. To summarize, it goes like this: 1. Visit the start URL. 2. Extract new URLs (and data) and save them. 3. Visit one of the newly found URLs and save data and/or more URLs from it. 4. Repeat 2 and 3 until you have everything you need. ## Next up[​](#next "Direct link to Next up") First, let's make sure we all understand the foundations. In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/recap-extraction-basics.md) we will review the scraper code we already have from the [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) section of the course. --- # Exporting data **Learn how to export the data you scraped using Crawlee to CSV or JSON.** *** In the previous lessons, you learned that: 1. You can use `Dataset.pushData()` to save data to the default dataset. 2. The default dataset files are saved in the `./storage/datasets/default` folder. But when we look inside the folder, we see that there are a lot of files, and we don't want to work with those manually. We can use the dataset itself to export the data. ## Exporting data to CSV[​](#export-csv "Direct link to Exporting data to CSV") Crawlee's `Dataset` provides a way to export all your scraped data into one big CSV file. You can then open it in Excel or any other data processor.
To do that, you need to call [`Dataset.exportToCSV()`](https://crawlee.dev/api/core/class/Dataset#exportToCSV) after collecting all the data. That means, after your crawler run finishes. ``` // ... await crawler.run(); // Add this line to export to CSV. await Dataset.exportToCSV('results'); ``` After you add this one line and run the code, you'll find your CSV with all the scraped products in here: ``` ./storage/key-value-stores/default/results.csv ``` info [Key-value store](https://crawlee.dev/docs/guides/result-storage#key-value-store) is another of Crawlee's storages. It's best for saving files like CSVs, PDFs or images, but also large JSONs or crawler statistics. ## Exporting data to JSON[​](#export-json "Direct link to Exporting data to JSON") Exporting to JSON is very similar to exporting to CSV, but we'll use a different function: [`Dataset.exportToJSON`](https://crawlee.dev/api/core/class/Dataset#exportToJSON). Exporting to JSON is useful when you don't want to work with each item separately, but would rather have one big JSON file with all the results. ``` // ... await crawler.run(); // Add this line to export to JSON. await Dataset.exportToJSON('results'); ``` You will find the resulting JSON here: ``` ./storage/key-value-stores/default/results.json ``` ## Final scraper code[​](#filtering-data "Direct link to Final scraper code") ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; const crawler = new PlaywrightCrawler({ // We removed the headless: false option to hide the browser windows. requestHandler: async ({ parseWithCheerio, request, enqueueLinks }) => { console.log(`Fetching URL: ${request.url}`); if (request.label === 'start-url') { await enqueueLinks({ selector: 'a.product-item__title', }); return; } // Fourth, parse the browser's page with Cheerio. const $ = await parseWithCheerio(); const title = $('h1').text().trim(); const vendor = $('a.product-meta__vendor').text().trim(); const price = $('span.price').contents()[2].nodeValue; const reviewCount = parseInt($('span.rating__caption').text(), 10); const description = $('div[class*="description"] div.rte').text().trim(); const recommendedProducts = $('.product-recommendations a.product-item__title') .map((i, el) => $(el).text().trim()) .toArray(); await Dataset.pushData({ title, vendor, price, reviewCount, description, recommendedProducts, }); }, }); await crawler.addRequests([{ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales', label: 'start-url', }]); await crawler.run(); await Dataset.exportToCSV('results'); ``` ## Next up[​](#next "Direct link to Next up") And this is it for the [**Basics of crawling**](https://docs.apify.com/academy/web-scraping-for-beginners/crawling.md) section of the [**Web scraping basics for JavaScript devs**](https://docs.apify.com/academy/web-scraping-for-beginners.md) course. If you want to learn more, test your knowledge of the methods and concepts you learned in this course by moving forward with the [**challenge**](https://docs.apify.com/academy/web-scraping-for-beginners/challenge.md). --- # Filtering links **When you extract links from a web page, you often end up with a lot of irrelevant URLs. Learn how to filter the links to only keep the ones you need.** *** Web pages are full of links, but frankly, most of them are useless to us when scraping. Filtering links can be approached in two ways: Targeting the links we're interested in by using unique CSS selectors, or extracting all links and then using pattern matching to find the sought after URLs. 
In real scraping scenarios, both of these approaches are often combined for the most effective URL filtering. ## Filtering with unique CSS selectors[​](#css-filtering "Direct link to Filtering with unique CSS selectors") In the previous lesson, we grabbed all the links from the HTML document. * DevTools * Node.js with Cheerio ``` document.querySelectorAll('a'); ``` ``` $('a'); ``` ### Attribute selector[​](#attribute-selector "Direct link to Attribute selector") That's not the only way to do it, however. Since we're interested in the `href` attributes, a very reasonable first filter is to exclusively target the `<a>` tags that have the `href` attribute (yes, anchor tags without the attribute can and do exist). You can do that by using the [CSS attribute selector](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors). * DevTools * Node.js ``` document.querySelectorAll('a[href]'); ``` ``` $('a[href]'); ``` Adding the `[href]` part of the selector will save you from nasty bug hunts on certain pages. ### Link specific selectors[​](#specific-selectors "Direct link to Link specific selectors") Let's go back to the [Sales category of Warehouse](https://warehouse-theme-metal.myshopify.com/collections/sales) and see how we could capture only the links to product detail pages. After inspecting the product cards in DevTools, you'll find that the links are available together with the product's title. Getting them will therefore be very similar to getting the product titles in the previous section. ![product detail page link](/assets/images/filtering-product-detail-link-995bf048903e4b4205b52fd364719005.png) * DevTools * Node.js ``` document.querySelectorAll('a.product-item__title'); ``` ``` $('a.product-item__title'); ``` When we print all the URLs in the DevTools console, we can see that we've correctly filtered only the product detail page URLs. ``` for (const a of document.querySelectorAll('a.product-item__title')) { console.log(a.href); } ``` info If you try this in Node.js instead of DevTools, you will not get the full URLs, but only so-called **relative links**. We will explain what those are and how to work with them in the next lesson. ![Product URLs printed to console](/assets/images/filtering-product-urls-aeaa901d618beff306b45fa48cc138e1.png) ## Filtering with pattern-matching[​](#pattern-matching-filter "Direct link to Filtering with pattern-matching") Another common way to filter links (or any text, really) is by matching patterns with regular expressions. > [Learn more about regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) When we inspect the product URLs, we'll find that they all look like the following: ``` https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable ``` That is, they all begin with exactly the same pattern and only differ in the last portion of the path. We could write the pattern like this: ``` https://warehouse-theme-metal.myshopify.com/products/{PRODUCT_NAME} ``` This means that we can create a regular expression that matches those URLs. You can do it in many ways.
For simplicity, let's go with this one: ``` https?:\/\/warehouse-theme-metal\.myshopify\.com\/products\/[\w\-]+ ``` This regular expression matches all URLs that use either the `http` or `https` protocol and point to `warehouse-theme-metal.myshopify.com/products/` immediately followed by any number of letters, numbers, underscores, or dashes `-`. > A great way to learn more about regular expression syntax and to test your expressions is to use tools like [regex101.com](https://regex101.com/) or [regexr.com](https://regexr.com/). It's okay if you don't get the hang of it right away! To test our regular expression in the DevTools console, we'll first create a [`RegExp`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp) object and then test the URLs with the [`regExp.test(string)`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/test) function. ``` // To demonstrate pattern matching, we use only the 'a' // selector to select all links on the page. for (const a of document.querySelectorAll('a')) { const regExp = /https?:\/\/warehouse-theme-metal\.myshopify\.com\/products\/[\w-]+/; const url = a.href; if (regExp.test(url)) console.log(url); } ``` When you run this code in DevTools Console on the [Sales category of Warehouse](https://warehouse-theme-metal.myshopify.com/collections/sales), you'll see that it produces a slightly different set of URLs than the CSS filter did. ![filtering-regex-urls.png](/assets/images/filtering-regex-urls-a4ac344fc73c37d7593ea276bca73872.png) That's because we selected all the links on the page and apparently there are more ways to get to the product detail pages. After careful inspection, we find that we can get there not only by clicking the title, but also by clicking the product's image, which leads to duplicates. Some products also have review links that lead to a specific subsection of the product detail page. With that said, yes, filtering with CSS selectors is often the better and more reliable option. But sometimes, it's not enough, and knowing about pattern matching with regular expressions expands your scraping toolbox and helps you tackle more complex scenarios. ## Next Up[​](#next "Direct link to Next Up") In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/relative-urls.md) we'll see how rewriting this code to Node.js is not so simple and learn about absolute and relative URLs in the process. --- # Finding links **Learn what a link looks like in HTML and how to find and extract their URLs when web scraping using both DevTools and Node.js.** *** Many kinds of links exist on the internet, and we'll cover all the types in the advanced Academy courses. For now, let's think of links as [HTML anchor elements](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a) with `<a>` tags. A typical link looks like this: ``` <a href="https://example.com">This is a link to example.com</a> ``` On a webpage, the link above will look like this: [This is a link to example.com](https://example.com) When you click it, your browser will navigate to the URL in the `<a>` tag's `href` attribute (`https://example.com`). > `href` means **H**ypertext **REF**erence. You don't need to remember this - just know that `href` typically means some sort of link. ## Extracting links 🔗[​](#extracting-links "Direct link to Extracting links 🔗") If a link is an HTML element, and the URL is an attribute, this means that we can extract links the same way as we extracted data.
To test this theory in the browser, we can try running the following code in our DevTools console on any website. ``` // Select all the elements. const links = document.querySelectorAll('a'); // For each of the links... for (const link of links) { // get the value of its 'href' attribute... const url = link.href; // and print it to console. console.log(url); } ``` Go to the [Warehouse store Sales category](https://warehouse-theme-metal.myshopify.com/collections/sales), open the DevTools Console, paste the above code and run it. ![links extracted from Warehouse store](/assets/images/warehouse-links-37f7c3164546c93f7b75ca83cf6e0773.png) ***Boom*** 💥, all the links from the page have now been printed to the console. Most of the links point to other parts of the website, but some links lead to other domains like facebook.com or instagram.com. ## Extracting link URLs in Node.js[​](#Extracting-links-in-node "Direct link to Extracting link URLs in Node.js") DevTools Console is a fun playground, but Node.js is way more useful. Let's create a new file in our project called **crawler.js** and add some basic crawling code that prints all the links from the [Sales category of Warehouse](https://warehouse-theme-metal.myshopify.com/collections/sales). We'll start from a boilerplate that's very similar to the scraper we built in [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/node-js-scraper.md). [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCAqIGFzIGNoZWVyaW8gZnJvbSAnY2hlZXJpbyc7XFxuaW1wb3J0IHsgZ290U2NyYXBpbmcgfSBmcm9tICdnb3Qtc2NyYXBpbmcnO1xcblxcbmNvbnN0IHN0b3JlVXJsID0gJ2h0dHBzOi8vd2FyZWhvdXNlLXRoZW1lLW1ldGFsLm15c2hvcGlmeS5jb20vY29sbGVjdGlvbnMvc2FsZXMnO1xcblxcbmNvbnN0IHJlc3BvbnNlID0gYXdhaXQgZ290U2NyYXBpbmcoc3RvcmVVcmwpO1xcbmNvbnN0IGh0bWwgPSByZXNwb25zZS5ib2R5O1xcblxcbmNvbnN0ICQgPSBjaGVlcmlvLmxvYWQoaHRtbCk7XFxuXFxuLy8gLS0tLS0tLSBuZXcgY29kZSBiZWxvd1xcblxcbmNvbnN0IGxpbmtzID0gJCgnYScpO1xcblxcbmZvciAoY29uc3QgbGluayBvZiBsaW5rcykge1xcbiAgICBjb25zdCB1cmwgPSAkKGxpbmspLmF0dHIoJ2hyZWYnKTtcXG4gICAgY29uc29sZS5sb2codXJsKTtcXG59XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.28PdE3s27h6nCqUFLj6UYLwH9RJRqGQBH5KqnfjfBGw\&asrc=run_on_apify) ``` import * as cheerio from 'cheerio'; import { gotScraping } from 'got-scraping'; const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales'; const response = await gotScraping(storeUrl); const html = response.body; const $ = cheerio.load(html); // ------- new code below const links = $('a'); for (const link of links) { const url = $(link).attr('href'); console.log(url); } ``` Aside from importing libraries and downloading HTML, we load the HTML into Cheerio and then use it to retrieve all the `` elements. After that, we iterate over the collected links and print their `href` attributes, which we access using the [`.attr()`](https://cheerio.js.org/docs/api/classes/Cheerio#attr) method. When you run the above code, you'll see quite a lot of links in the terminal. Some of them may look wrong, because they don't start with the regular `https://` protocol. We'll learn what to do with them in the following lessons. 
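If you're curious about the mix, here's a minimal sketch (reusing the `$` and `links` variables from the snippet above) that counts how many of the extracted `href` values already include a protocol and how many don't:

```
// Count how many extracted href values are already absolute URLs.
let absoluteCount = 0;
let relativeCount = 0;

for (const link of links) {
    const url = $(link).attr('href');
    // Some anchor elements have no href attribute at all, so skip those.
    if (!url) continue;
    if (url.startsWith('http')) absoluteCount += 1;
    else relativeCount += 1;
}

console.log(`Absolute: ${absoluteCount}, relative: ${relativeCount}`);
```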
## Next Up[​](#next "Direct link to Next Up") The [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/filtering-links.md) will teach you how to select and filter links, so that your crawler will always work only with valid and useful URLs. --- # Your first crawl **Learn how to crawl the web using Node.js, Cheerio and an HTTP client. Extract URLs from pages and use them to visit more websites.** *** In the previous lessons, we learned what crawling is and how to extract URLs from a page's HTML. The only thing that remains is to write the code—let's get right to it! > If the code starts to look too complex to you, don't worry. We're showing it for educational purposes, so that you can learn how crawling works. Near the end of this course, we'll show you a much easier and faster way to crawl, using a specialized scraping library. If you want, you can skip the details and [go there now](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md). ## Processing URLs[​](#processing-urls "Direct link to Processing URLs") In the previous lessons, we collected and filtered all the URLs pointing to individual products in the [Sales category of Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). To crawl the URLs, we must take the whole list we collected and download the HTML of each of the pages. See the comments for changes and additions to the code. ``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const WEBSITE_URL = 'https://warehouse-theme-metal.myshopify.com'; const storeUrl = `${WEBSITE_URL}/collections/sales`; const response = await gotScraping(storeUrl); const html = response.body; const $ = cheerio.load(html); const productLinks = $('a.product-item__title'); // Prepare an empty array for our product URLs. const productUrls = []; for (const link of productLinks) { const relativeUrl = $(link).attr('href'); const absoluteUrl = new URL(relativeUrl, WEBSITE_URL); // Collect absolute product URLs. productUrls.push(absoluteUrl); } // Loop over the stored URLs to process // each product page individually. for (const url of productUrls) { // Download HTML. const productResponse = await gotScraping(url); const productHtml = productResponse.body; // Load into Cheerio to parse the HTML. const $productPage = cheerio.load(productHtml); // Extract the product's title from the <h1> tag. const productPageTitle = $productPage('h1').text().trim(); // Print the title to the terminal to // confirm we downloaded the correct pages. console.log(productPageTitle); } ``` If you run the crawler from your terminal, it will print the titles of all the products on sale in the Warehouse store. ## Handling errors[​](#handling-errors "Direct link to Handling errors") The code above is correct, but it's not robust. If something goes wrong, it will crash. That something could be a network error, an internet connection error, or the websites you're trying to reach could be experiencing problems at that moment. Hitting any error like that would cause the current crawler to stop entirely, which means we would lose all the data it had collected so far. In programming, you handle errors by catching and handling them, typically by logging that the error occurred and/or retrying. > The scraping library we'll [show you in the following lessons](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md) handles errors and retries automatically for you.
``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const WEBSITE_URL = 'https://warehouse-theme-metal.myshopify.com'; const storeUrl = `${WEBSITE_URL}/collections/sales`; const response = await gotScraping(storeUrl); const html = response.body; const $ = cheerio.load(html); const productLinks = $('a.product-item__title'); const productUrls = []; for (const link of productLinks) { const relativeUrl = $(link).attr('href'); const absoluteUrl = new URL(relativeUrl, WEBSITE_URL); productUrls.push(absoluteUrl); } for (const url of productUrls) { // Everything else is exactly the same. // We only wrapped the code in try/catch blocks. // The try block passes all errors into the catch block. // So, instead of crashing the crawler, they can be handled. try { // The try block attempts to execute our code const productResponse = await gotScraping(url); const productHtml = productResponse.body; const $productPage = cheerio.load(productHtml); const productPageTitle = $productPage('h1').text().trim(); console.log(productPageTitle); } catch (error) { // In the catch block, we handle errors. // This time, we will print // the error message and the url. console.error(error.message, url); } } ``` At the time of writing, none of the links have failed; however, as you crawl more pages, you will surely hit a few errors 😉. The important thing is that the crawler will no longer crash if an error does in fact occur, and that it will be able to download the HTML from the working product links. > If you thought that the crawl was taking too long to complete, the [scraping library](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md) we keep referring to will help once again. It automatically parallelizes the downloads and processing of HTML, which leads to significant speed improvements. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/scraping-the-data.md), we will complete the scraper by extracting data about all the products from their individual pages. --- # Headless browsers **Learn how to scrape the web with a headless browser using only a few lines of code. Chrome, Firefox, Safari, Edge - all are supported.** *** A headless browser is a browser that runs without a user interface (UI). This means that it's normally controlled by automated scripts. Headless browsers are very popular in scraping because they can help you render JavaScript or programmatically behave like a human user to prevent blocking. The two most popular libraries for controlling headless browsers are [Puppeteer](https://pptr.dev/) and [Playwright](https://playwright.dev/). **Crawlee** supports both. ## Building a Playwright scraper[​](#playwright-scraper "Direct link to Building a Playwright scraper") > Our focus will be on Playwright, which boasts additional features and better documentation. Notably, it originates from the same team responsible for Puppeteer. Crawlee has a built-in support for building Playwright scrapers. Let's reuse code of the Cheerio scraper from the previous lesson. It'll take us just a few changes to turn it into a full headless scraper. First, we must install Playwright into our project. It's not included in Crawlee, because it's quite large as it bundles all the browsers. ``` npm install playwright ``` After Playwright installs, we can proceed with updating the scraper code. Let's create a new file called `browser.js` and put the new code there. 
As always, the comments in the example describe changes in the code. Everything else is the same as before. ``` // First, import PlaywrightCrawler instead of CheerioCrawler import { PlaywrightCrawler, Dataset } from 'crawlee'; const crawler = new PlaywrightCrawler({ // Second, tell the browser to run with visible UI, // so that we can see what's going on. headless: false, // Third, replace $ with parseWithCheerio function. requestHandler: async ({ parseWithCheerio, request, enqueueLinks }) => { console.log(`Fetching URL: ${request.url}`); if (request.label === 'start-url') { await enqueueLinks({ selector: 'a.product-item__title', }); return; } // Fourth, parse the browser's page with Cheerio. const $ = await parseWithCheerio(); const title = $('h1').text().trim(); const vendor = $('a.product-meta__vendor').text().trim(); const price = $('span.price').contents()[2].nodeValue; const reviewCount = parseInt($('span.rating__caption').text(), 10); const description = $('div[class*="description"] div.rte').text().trim(); await Dataset.pushData({ title, vendor, price, reviewCount, description, }); }, }); await crawler.addRequests([{ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales', label: 'start-url', }]); await crawler.run(); ``` tip The `parseWithCheerio` function is available even in `CheerioCrawler` and all the other Crawlee crawlers. If you think you'll often switch up the crawlers, you can use it to further reduce the number of needed line changes. When you run the code with `node browser.js`, you'll see a browser window open and then the individual pages getting scraped, each in a new browser tab. That's it. In 4 lines of code, we transformed our crawler from a static HTTP crawler to a headless browser crawler. The crawler now runs the same as before, but uses a Chromium browser instead of plain HTTP requests. This isn't possible without Crawlee. Using Playwright in combination with Cheerio like this is only one of many ways how you can utilize Playwright (and Puppeteer) with Crawlee. In the advanced courses of the Academy, we will go deeper into using headless browsers for scraping and web automation (RPA) use cases. ## Running in headless mode[​](#running-headless "Direct link to Running in headless mode") We said that headless browsers didn't have a UI, but while scraping with the above scraper code, you could definitely see the browser. That's because we added the `headless: false` option. This is useful for debugging and seeing what's going on in the browser. Once your scraper is complete, you can remove the line and the crawler will run without a UI. You can also switch between headless and headful (with UI) using the [`CRAWLEE_HEADLESS`](https://crawlee.dev/docs/guides/configuration#crawlee_headless) environment variable. This allows you to change the mode without touching your code. * MacOS/Linux * Windows CMD * Windows Powershell ``` CRAWLEE_HEADLESS=1 node browser.js ``` ``` set CRAWLEE_HEADLESS=1 && node browser.js ``` ``` $env:CRAWLEE_HEADLESS=1; & node browser.js ``` ## Dynamically loaded data[​](#dynamic-data "Direct link to Dynamically loaded data") One of the important benefits of using a browser is that it allows you to extract data that's dynamically loaded, such as data that's only fetched after a user scrolls or interacts with the page. In our case, it's the "**You may also like**" section of the product detail pages. Those products aren't available in the initial HTML, but the browser loads them later using an API. 
![headless-dynamic-data.png](/assets/images/headless-dynamic-data-556e6fe0874146dbff6ccef48365ed66.png) tip We discuss dynamic data at length in the [How to scrape dynamic pages](https://docs.apify.com/academy/node-js/dealing-with-dynamic-pages.md) tutorial, and we also have a special lesson dedicated to it in our [Puppeteer & Playwright course](https://docs.apify.com/academy/puppeteer-playwright/page/waiting.md). If we added an appropriate selector to our original `CheerioCrawler` code, it would not extract the information, but a browser automatically fetches and renders this extra data. Let's add this new extractor to our code. It collects the names of the recommended products. ``` // ... const recommendedProducts = $('.product-recommendations a.product-item__title') .map((i, el) => $(el).text().trim()) .toArray(); // ... await Dataset.pushData({ // ... recommendedProducts, }); ``` And here's the complete, runnable code: ``` import { PlaywrightCrawler, Dataset } from 'crawlee'; const crawler = new PlaywrightCrawler({ // We removed the headless: false option to hide the browser windows. requestHandler: async ({ parseWithCheerio, request, enqueueLinks }) => { console.log(`Fetching URL: ${request.url}`); if (request.label === 'start-url') { await enqueueLinks({ selector: 'a.product-item__title', }); return; } // Fourth, parse the browser's page with Cheerio. const $ = await parseWithCheerio(); const title = $('h1').text().trim(); const vendor = $('a.product-meta__vendor').text().trim(); const price = $('span.price').contents()[2].nodeValue; const reviewCount = parseInt($('span.rating__caption').text(), 10); const description = $('div[class*="description"] div.rte').text().trim(); // We added one more extractor to get all the recommended products. const recommendedProducts = $('.product-recommendations a.product-item__title') .map((i, el) => $(el).text().trim()) .toArray(); await Dataset.pushData({ title, vendor, price, reviewCount, description, // And we saved the extracted product names. recommendedProducts, }); }, }); await crawler.addRequests([{ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales', label: 'start-url', }]); await crawler.run(); ``` When you run the code, you'll find the recommended product names correctly extracted in the dataset files. If you tried the same with our earlier `CheerioCrawler` code, you would find the `recommendedProducts` array empty in your results. That's because Cheerio can't make the API call to retrieve the additional data, like a browser can. ## Next up[​](#next "Direct link to Next up") We learned how to scrape with Cheerio and Playwright, but how do we export the data for further processing? Let's learn that in the [next and final lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/exporting-data.md) of the Basics of crawling section. --- # Professional scraping 👷 **Learn how to build scrapers quicker and get better and more robust results by using Crawlee, an open-source library for scraping in Node.js.** *** While it's definitely an interesting exercise to do all the programming manually, and we hope you enjoyed it, it's neither the most effective, nor the most efficient way of scraping websites. Because we scrape for a living at Apify, we've built a library that we use to scrape tens of millions of pages every day. It's called [**Crawlee**](https://crawlee.dev/), and it is, and **always will be**, completely **open-source** and **free** to use. 
You don't need to sign up for an Apify account or use the Apify platform. You can use it on your personal computer, on any server, or in any cloud environment you want. We mentioned the benefits of developing using a dedicated scraping library in the previous lessons, but to recap: 1. **Faster development time** because you don't have to write boilerplate code. 2. **Fewer bugs**. Crawlee is fully unit-tested and battle-tested on millions of scraper runs. 3. **Faster and cheaper scrapers** because Crawlee automatically scales based on system resources, and we optimize its performance regularly. 4. **More robust scrapers**. Annoying details like retries, proxy management, error handling, and result storage are all handled out-of-the-box by Crawlee. 5. **Helpful community**. You can [join our Discord](https://discord.gg/qkMS6pU4cF) or talk to us [on GitHub](https://github.com/apify/crawlee/discussions). We're almost always there to talk about scraping and programming in general. tip If you're still not convinced, [read this story](https://apify.com/success-stories/daltix-analytics-scrapy-python-to-apify) about how a data analytics company saved 90% of scraping costs by switching from Scrapy (a scraping library for Python) to Crawlee. We were pretty surprised ourselves, to be honest. Crawlee factors away and manages the dull and repetitive parts of web scraper development under the hood, such as: * Auto-scaling * Request concurrency * Queueing requests * Data storage * Using and rotating [proxies](https://docs.apify.com/academy/anti-scraping/mitigation/proxies.md) * Puppeteer/Playwright setup overhead * [See all the features](https://crawlee.dev/docs/introduction) Crawlee and its resources can be found in various different places: 1. [Official Crawlee documentation](https://crawlee.dev/) 2. [Crawlee GitHub repository (source code, issues)](https://github.com/apify/crawlee) 3. [Crawlee on npm](https://www.npmjs.com/package/crawlee) ## Install Crawlee[​](#crawlee-installation "Direct link to Install Crawlee") To use Crawlee, we have to install it from npm. Let's add it to our project from the previous lessons by executing this command in your project's folder. ``` npm install crawlee ``` After the installation completes, create a new file called **crawlee.js** and add the following code to it: ``` import { CheerioCrawler } from 'crawlee'; console.log('Crawlee works!'); ``` We are using the new ESM `import` syntax here (see [Node.js docs](https://nodejs.org/dist/latest-v16.x/docs/api/esm.html#enabling)). To be able to use it, we need to turn our project to `module` in the `package.json` file: ``` { "name": "my-scraping-project", "type": "module", "dependencies": { "crawlee": "^3.0.0" } } ``` Then, run the code using `node` as usual: ``` node crawlee.js ``` You'll see "**Crawlee works!**" printed to the console. If it doesn't work, it means Crawlee didn't install correctly. If that's the case, try deleting the `node_modules` directory and `package-lock.json` file in your project and install Crawlee again. > You don't need to `import` any other libraries like Cheerio or Got-Scraping. That's because they're both included in Crawlee's [`CheerioCrawler`](https://crawlee.dev/docs/guides/cheerio-crawler-guide). ## Prepare the scraper[​](#coding-the-scraper "Direct link to Prepare the scraper") `CheerioCrawler` automatically visits URLs, downloads HTML using **Got-Scraping**, and parses it with **Cheerio**. 
The benefit of this over writing the code yourself is that it automatically handles the URL queue, errors, retries, proxies, parallelizes the downloads, and much more. Overall, it removes the need to write a lot of boilerplate code. To create a crawler with Crawlee, you only need to provide it with a request handler - a function that gets executed for each page it visits. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ // This function will run on every page. // Among other things, it gives you access // to parsed HTML with the Cheerio $ function. requestHandler: async ({ $, request }) => { console.log('URL:', request.url); // Print the heading of each visited page. console.log('Title:', $('h1').text().trim()); }, }); ``` But the above code still doesn't crawl anything. We need to provide it with URLs to crawl. To do that, we call the crawler's `addRequests` function. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: async ({ $, request }) => { console.log('URL:', request.url); console.log('Title:', $('h1').text().trim()); }, }); // Add the Sales category of Warehouse store to the queue of URLs. await crawler.addRequests([ 'https://warehouse-theme-metal.myshopify.com/collections/sales', ]); await crawler.run(); ``` When you run the above code, you'll see some internal Crawlee logs and then the two messages your code printed: ``` URL: https://warehouse-theme-metal.myshopify.com/collections/sales Title: Sales ``` > `crawler.addRequests` uses the [`RequestQueue`](https://crawlee.dev/docs/guides/request-storage#request-queue) under the hood. It's a persistent storage, which means that if your crawler crashes, it doesn't have to start over, but it can continue from where it left off. ### Summary[​](#summary "Direct link to Summary") 1. We added the first URL to the crawler using the `addRequests` function. 2. `CheerioCrawler` will automatically take the URL from the queue, download its HTML using Got Scraping, and parse it using Cheerio. 3. The crawler executes the [`requestHandler`](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#requestHandler), where we extract the page's data using the [`$`](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext) variable. You can also access the request itself using the [`request`](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext#request) variable. ## Crawling links[​](#crawling-links "Direct link to Crawling links") The current scraper only visits the Sales category page, but we want detailed data for all the products. We can use the [`enqueueLinks()`](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext#enqueueLinks) function to add more URLs to the queue. The function automatically extracts URLs from the current page based on a provided CSS selector and adds them to the queue. Once added, the crawler will automatically crawl them. ``` import { CheerioCrawler } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: async ({ $, request, enqueueLinks }) => { console.log('URL:', request.url); console.log('Title:', $('h1').text().trim()); // We only want to enqueue the URLs from the start URL. if (request.label === 'start-url') { // enqueueLinks will add all the links // that match the provided selector. await enqueueLinks({ // The selector comes from our earlier code. 
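// Only product title links matched by it will be added to the queue.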
selector: 'a.product-item__title', }); } }, }); // Instead of using a string with URL, we're now // using a request object to add more options. await crawler.addRequests([{ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales', // We label the Request to identify // it later in the requestHandler. label: 'start-url', }]); await crawler.run(); ``` When you run the code, you'll see the names and URLs of all the products printed to the console. You'll also see that it crawls faster than the manually written code. This is thanks to the parallelization of the requests. > If the crawler gets stuck for you at the end, it's ok. It's not actually stuck, but waiting to retry any pages that may have failed. ## Extracting data[​](#extracting-data "Direct link to Extracting data") We have the crawler in place, and it's time to extract data. We already have the extraction code from the previous lesson, so we can copy and paste it into the `requestHandler` with tiny changes. Instead of printing results to the terminal, we will save it to disk. ``` // To save data to disk, we need to import Dataset. import { CheerioCrawler, Dataset } from 'crawlee'; const crawler = new CheerioCrawler({ requestHandler: async ({ $, request, enqueueLinks }) => { console.log(`Fetching URL: ${request.url}`); if (request.label === 'start-url') { await enqueueLinks({ selector: 'a.product-item__title', }); // When on the start URL, we don't want to // extract any data after we extract the links. return; } // We copied and pasted the extraction code // from the previous lesson with small // refactoring: e.g. `$productPage` to `$`. const title = $('h1').text().trim(); const vendor = $('a.product-meta__vendor').text().trim(); const price = $('span.price').contents()[2].nodeValue; const reviewCount = parseInt($('span.rating__caption').text(), 10); const description = $('div[class*="description"] div.rte').text().trim(); // Instead of printing the results to // console, we save everything to a file. await Dataset.pushData({ title, vendor, price, reviewCount, description, }); }, }); await crawler.addRequests([{ url: 'https://warehouse-theme-metal.myshopify.com/collections/sales', label: 'start-url', }]); await crawler.run(); ``` When you run the code as usual, you'll see the product URLs printed to the terminal and you'll find the scraped data saved to your disk. Thanks to using the [`Dataset.pushData()`](https://crawlee.dev/docs/introduction/saving-data#whats-datasetpushdata) function, Crawlee automatically created a `storage` directory in your project's location and saved the results there. Each product has its data stored as a separate JSON file. ``` ./storage/datasets/default/*.json ``` Thanks to **Crawlee**, we were able to create a **faster and more robust scraper**, but **with less code** than what was needed for the scraper in the earlier lessons. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/headless-browser.md) we'll show you how to turn this plain HTTP crawler into a **headless browser** scraper in only a few lines of code. --- # Recap of data extraction basics **Review our e-commerce website scraper and refresh our memory about its code and the programming techniques we used to extract and save the data.** *** We finished off the [first section](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) of the *Web scraping basics for JavaScript devs* course by creating a web scraper in Node.js. 
The scraper collected all the on-sale products from [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). Let's see the code with some comments added. ``` // First, we imported all the libraries we needed to // download, extract, and convert the data we wanted import { writeFileSync } from 'fs'; import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; import { parse } from 'json2csv'; // Here, we fetched the website's HTML and saved it to a new variable. const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales'; const response = await gotScraping(storeUrl); const html = response.body; // We used Cheerio, a popular library, to parse (process) // the downloaded HTML so that we could manipulate it. const $ = cheerio.load(html); // Using the .product-item CSS selector, we collected all the HTML // elements which contained data about individual products. const products = $('.product-item'); // Then, we prepared a new array to store the results. const results = []; // And looped over all the elements to extract // information about the individual products. for (const product of products) { // The product's title was in an <a> element // with the CSS class: product-item__title const titleElement = $(product).find('a.product-item__title'); const title = titleElement.text().trim(); // The product's price was in a <span> element // with the CSS class: price const priceElement = $(product).find('span.price'); // Because the <span> also included some useless data, // we had to extract the price from a specific HTML node. const price = priceElement.contents()[2].nodeValue.trim(); // We added the data to the results array // in the form of an object with keys and values. results.push({ title, price }); } // Finally, we formatted the results // as a CSV file instead of a JS object const csv = parse(results); // Then, we saved the CSV to the disk writeFileSync('products.csv', csv); ``` tip If some of the code is hard for you to understand, please review the [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) section. We will not go through the details again in this section about crawling. caution We are using JavaScript features like `import` statements and top-level `await`. If you see errors like *Cannot use import outside of a module*, please review the [Project setup lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/project-setup.md#modern-javascript), where we explain how to enable those features. ## Next up[​](#next "Direct link to Next up") The [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/finding-links.md) is all about finding links to crawl on the [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). --- # Relative URLs **Learn about absolute and relative URLs used on web pages and how to work with them when parsing HTML with Cheerio in your scraper.** *** You might have noticed in the previous lesson that while printing URLs to the DevTools console, they would always show in full length, like this: ``` https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones ``` But in the Elements tab, when checking the `href` attributes of the `<a>` elements, the URLs would look like this: ``` /products/denon-ah-c720-in-ear-headphones ``` What's up with that? This short version of the URL is called a **relative URL**, and the full length one is called an **absolute URL**.
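You can see the difference directly in the DevTools console. The `href` *property* of an anchor element is resolved by the browser into an absolute URL, while the `href` *attribute* keeps the raw value from the HTML. A quick sketch, assuming you run it on the Warehouse store's Sales category page with the `a.product-item__title` selector from the earlier lessons:

```
const link = document.querySelector('a.product-item__title');

// The property is resolved by the browser to an absolute URL.
console.log(link.href);
// e.g. https://warehouse-theme-metal.myshopify.com/products/...

// The attribute holds the raw value from the HTML - a relative URL.
console.log(link.getAttribute('href'));
// e.g. /products/...
```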
> [Learn more about absolute and relative URLs](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_URL#absolute_urls_vs._relative_urls). We'll see why the difference between relative URLs and absolute URLs is important a bit later in this lesson. ## Browser vs Node.js: The Differences[​](#browser-vs-node "Direct link to Browser vs Node.js: The Differences") Let's update the Node.js code from the [Finding links lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/finding-links.md) to see why links with relative URLs can be a problem. ``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales'; const response = await gotScraping(storeUrl); const html = response.body; const $ = cheerio.load(html); const productLinks = $('a.product-item__title'); for (const link of productLinks) { const url = $(link).attr('href'); console.log(url); } ``` When you run this file in your terminal, you'll immediately see the difference. Unlike in the browser, where looping over elements produced absolute URLs, here in Node.js it only produces the relative ones. This is bad, because we can't use the relative URLs to crawl. They don't include all the necessary information. ## Resolving URLs[​](#resolving-urls "Direct link to Resolving URLs") Luckily, there's a process called resolving URLs that creates absolute URLs from relative ones. We need two things. The relative URL, such as `/products/denon-ah-c720-in-ear-headphones`, and the URL of the website where we found the relative URL (which is `https://warehouse-theme-metal.myshopify.com` in our case). ``` const websiteUrl = 'https://warehouse-theme-metal.myshopify.com'; const relativeUrl = '/products/denon-ah-c720-in-ear-headphones'; const absoluteUrl = new URL(relativeUrl, websiteUrl); console.log(absoluteUrl.href); ``` In Node.js, when you create a `new URL()`, you can optionally pass a second argument, the base URL. When you do, the URL in the first argument will be resolved using the URL in the second argument. Note that the URL created from `new URL()` is an object, not a string. To get the URL in a string format, we use the `url.href` property, or alternatively the `url.toString()` function. When we plug this into our crawler code, we will get the correct - absolute - URLs. ``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; // Split the base URL from the category to use it later. const WEBSITE_URL = 'https://warehouse-theme-metal.myshopify.com'; const storeUrl = `${WEBSITE_URL}/collections/sales`; const response = await gotScraping(storeUrl); const html = response.body; const $ = cheerio.load(html); const productLinks = $('a.product-item__title'); for (const link of productLinks) { const relativeUrl = $(link).attr('href'); // Resolve relative URLs using the website's URL const absoluteUrl = new URL(relativeUrl, WEBSITE_URL); console.log(absoluteUrl.href); } ``` Cheerio can't resolve the URL itself, because until you provide the necessary information - it doesn't know where you originally downloaded the HTML from. The browser always knows which page you're on, so it will resolve the URLs automatically. ## Next up[​](#next "Direct link to Next up") The [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/first-crawl.md) will teach you how to use the collected URLs to crawl all the individual product pages. 
--- # Scraping data **Learn how to add data extraction logic to your crawler, which will allow you to extract data from all the websites you crawled.** *** At the [very beginning of this course](https://docs.apify.com/academy/web-scraping-for-beginners.md), we learned that the term web scraping usually means a combined process of data extraction and crawling. And this is exactly what we'll do in this lesson. We will take the crawling code from the previous lesson, and we will combine it with data extraction code and turn everything into a web scraper. ## Extracting data from a product detail page[​](#extracting-data "Direct link to Extracting data from a product detail page") The term product detail page (or PDP) is commonly used on e-commerce websites to describe the page where you can find detailed information about a product. In the Warehouse store, there's, for example, [this page describing Denon headphones](https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones). Let's start writing a script that extracts data from this single PDP. We can use this familiar code as a boilerplate. ``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const productUrl = 'https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones'; const response = await gotScraping(productUrl); const html = response.body; const $ = cheerio.load(html); // Attribute extraction code will go here. ``` We will use the techniques learned in the [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) section to find and extract the following product attributes: * title * vendor * price * number of reviews * description ![scraping-title.png](/assets/images/scraping-title-b9819987fa0d66c00b3e394c42ba8f98.png) > For brevity, we won't explain how to extract every attribute step-by-step. Review the [Basics of data extraction](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md) section to learn about DevTools and extracting data. ### Title[​](#title "Direct link to Title") Getting the title is quite straightforward. We recommend using `h1` for titles where available, because it's the semantically correct way and therefore unlikely to change. ``` const title = $('h1').text().trim(); ``` ### Vendor[​](#vendor "Direct link to Vendor") Vendor name is available as a link with the `product-meta__vendor` class. We're only interested in the text though. ``` const vendor = $('a.product-meta__vendor').text().trim(); ``` ### Price[​](#price "Direct link to Price") We will take a shortcut here and only extract the price as a string that includes currency. In production scrapers, you might want to split it into two fields. ``` const price = $('span.price').contents()[2].nodeValue; ``` ### Number of reviews[​](#number-of-reviews "Direct link to Number of reviews") For the review count, we use the `parseInt()` function to get only the number. Otherwise, we would receive a string like **2 reviews** from this element. ``` const reviewCount = parseInt($('span.rating__caption').text(), 10); ``` ### Description[​](#description "Direct link to Description") Getting the description is fairly straightforward as well, but notice the two selectors separated by a space: `div[class*="description"] div.rte`. This is called a [descendant combinator](https://developer.mozilla.org/en-US/docs/Web/CSS/Descendant_combinator), and it allows you to search for child elements within parent elements. 
Using any of the selectors separately would lead to unwanted strings in our result. ``` const description = $('div[class*="description"] div.rte').text().trim(); ``` ### Complete extraction code[​](#complete-extraction-code "Direct link to Complete extraction code") This is the final code after putting all the extractors together with the initial boilerplate. It will scrape all the requested attributes from the single URL and print them to the terminal. Save it into a new file called `product.js` and run it with `node product.js` to see for yourself. ``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const productUrl = 'https://warehouse-theme-metal.myshopify.com/products/denon-ah-c720-in-ear-headphones'; const response = await gotScraping(productUrl); const html = response.body; const $ = cheerio.load(html); const title = $('h1').text().trim(); const vendor = $('a.product-meta__vendor').text().trim(); const price = $('span.price').contents()[2].nodeValue; const reviewCount = parseInt($('span.rating__caption').text(), 10); const description = $('div[class*="description"] div.rte').text().trim(); const product = { title, vendor, price, reviewCount, description, }; console.log(product); ``` ## Crawling product detail pages[​](#crawling "Direct link to Crawling product detail pages") Let's compare the above data extraction example with the crawling code we wrote in the last lesson: ``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const WEBSITE_URL = 'https://warehouse-theme-metal.myshopify.com'; const storeUrl = `${WEBSITE_URL}/collections/sales`; const response = await gotScraping(storeUrl); const html = response.body; const $ = cheerio.load(html); const productLinks = $('a.product-item__title'); const productUrls = []; for (const link of productLinks) { const relativeUrl = $(link).attr('href'); const absoluteUrl = new URL(relativeUrl, WEBSITE_URL); productUrls.push(absoluteUrl); } for (const url of productUrls) { try { const productResponse = await gotScraping(url); const productHtml = productResponse.body; const $productPage = cheerio.load(productHtml); const productPageTitle = $productPage('h1').text().trim(); console.log(productPageTitle); } catch (error) { console.error(error.message, url); } } ``` We can see that the code is quite similar. Both scripts download HTML and then process the HTML. To understand how to put them together, we'll go back to the [original process of crawling](https://docs.apify.com/academy/web-scraping-for-beginners/crawling.md). 1. Visit the start URL. 2. Extract the next URLs (and data) and save them. 3. Visit one of the collected URLs and save data and/or more URLs. 4. Repeat step 3 until you have everything you need. Using this flow as guidance, we should be able to connect the pieces of code together to build a scraper which crawls through the products found in the [Sales category of Warehouse](https://warehouse-theme-metal.myshopify.com/collections/sales), and then scrapes the **title**, **vendor**, **price**, **review count**, and **description** of each of them. ## Building the final scraper[​](#building-scraper "Direct link to Building the final scraper") Let's create a brand-new file called **final.js** and write our scraper code there. We'll show the code step by step for easier orientation. At the end, we'll combine the pieces into a runnable example. We'll start by adding our imports and constants at the top of the file, no changes there. 
``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const WEBSITE_URL = 'https://warehouse-theme-metal.myshopify.com'; ``` Then we need to **visit the start URL**. To scrape all the on-sale product links, we need the Sales page as the start URL. ``` // ... const storeUrl = `${WEBSITE_URL}/collections/sales`; const response = await gotScraping(storeUrl); const html = response.body; ``` After that, we need to **extract the next URLs** we want to visit (the product detail page URLs). Thus far, the code is exactly the same as the **crawler.js** code. ``` // ... const $ = cheerio.load(html); const productLinks = $('a.product-item__title'); const productUrls = []; for (const link of productLinks) { const relativeUrl = $(link).attr('href'); const absoluteUrl = new URL(relativeUrl, WEBSITE_URL); productUrls.push(absoluteUrl); } ``` Now the code will start to differ. We will use the crawling logic from earlier to visit all the URLs, but we will replace the placeholder extraction logic we had there. The placeholder logic only extracted the product's title, but we want the vendor, price, number of reviews and description as well. ``` // ... // A new array to save each product in. const results = []; // An optional array we can save errors to. const errors = []; for (const url of productUrls) { try { // Download HTML of each product detail. const productResponse = await gotScraping(url); const $productPage = cheerio.load(productResponse.body); // Use the data extraction logic from above. // If copy pasting, be careful about $ -> $productPage. const title = $productPage('h1').text().trim(); const vendor = $productPage('a.product-meta__vendor').text().trim(); const price = $productPage('span.price').contents()[2].nodeValue; const reviewCount = parseInt($productPage('span.rating__caption').text(), 10); const description = $productPage('div[class*="description"] div.rte').text().trim(); results.push({ title, vendor, price, reviewCount, description, }); } catch (error) { // Save information about the error to the // "errors" array to see what's happened. errors.push({ url, msg: error.message }); } } ``` Finally, let's combine the above code blocks into a full runnable example. When you run the below code, it will scrape detailed information about all the products on the first page of the [Warehouse Sales category](https://warehouse-theme-metal.myshopify.com/collections/sales). We added a few console logs throughout the code to see what's going on. 
``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const WEBSITE_URL = 'https://warehouse-theme-metal.myshopify.com'; const storeUrl = `${WEBSITE_URL}/collections/sales`; console.log('Fetching products on sale.'); const response = await gotScraping(storeUrl); const html = response.body; const $ = cheerio.load(html); const productLinks = $('a.product-item__title'); const productUrls = []; for (const link of productLinks) { const relativeUrl = $(link).attr('href'); const absoluteUrl = new URL(relativeUrl, WEBSITE_URL); productUrls.push(absoluteUrl); } console.log(`Found ${productUrls.length} products.`); const results = []; const errors = []; for (const url of productUrls) { try { console.log(`Fetching URL: ${url}`); const productResponse = await gotScraping(url); const $productPage = cheerio.load(productResponse.body); const title = $productPage('h1').text().trim(); const vendor = $productPage('a.product-meta__vendor').text().trim(); const price = $productPage('span.price').contents()[2].nodeValue; const reviewCount = parseInt($productPage('span.rating__caption').text(), 10); const description = $productPage('div[class*="description"] div.rte').text().trim(); results.push({ title, vendor, price, reviewCount, description, }); } catch (error) { errors.push({ url, msg: error.message }); } } console.log('RESULTS:', results); console.log('ERRORS:', errors); ``` And here's an example of the results you will see after running the above code. We truncated the descriptions for readability. There should be 24 products in your list. ``` [ { title: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', vendor: 'JBL', price: '$74.95', reviewCount: 2, description: 'JBL Flip 4 is the next generation in the ...', }, { title: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', vendor: 'Sony', price: '$1,398.00', reviewCount: 3, description: 'Unlock the world of ultimate colors and ...', }, { title: 'Sony SACS9 10" Active Subwoofer', vendor: 'Sony', price: '$158.00', reviewCount: 3, description: 'Put more punch in your movie ...', }, ]; ``` That's it for the absolute basics of crawling, but we're not done yet. We scraped 24 products from the first page of the Sales category, but the category actually has 50 products on 3 pages. You will learn how to visit all the pages and scrape all the products in the following lessons. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/crawling/pro-scraping.md) we will rewrite the scraper using an open-source web scraping library called [Crawlee](https://crawlee.dev). It will make the scraper more robust while speeding up development at the same time. --- # Basics of data extraction **Learn about HTML, CSS, and JavaScript, the basic building blocks of a website, and how to use them in web scraping and data extraction.** *** Every web scraping project starts with some detective work. To a human, it's completely obvious where the data is on the web page, but a computer needs very precise instructions to find the data we want. We can leverage three elementary components of each website to give those instructions: HTML, CSS, and JavaScript ## HTML[​](#html "Direct link to HTML") For the browser to be able to show you the web page with all its text and images, the data needs to be present somewhere. This data source is called HTML (HyperText Markup Language) and it gets downloaded to your computer whenever you open a website. 
If you want to extract data from a website, you need to show your computer where to find it in the HTML. > To learn more about markup, we recommend the [resources about HTML](https://developer.mozilla.org/en-US/docs/Learn/HTML) provided by MDN, the official documentation of the web. ## CSS[​](#css "Direct link to CSS") CSS (Cascading Style Sheets) is a language that is used to give websites their style. It controls shapes, colors, positioning and even animations. The style is then added to the page's HTML and together, they define the page's look and structure. In web scraping, we can leverage CSS to find the data we want using CSS selectors. > To learn more about styles and selectors, we recommend the [resources about CSS](https://developer.mozilla.org/en-US/docs/Learn/CSS) provided by MDN, the official documentation of the web. ## JavaScript[​](#javascript "Direct link to JavaScript") HTML and CSS give websites their structure and style, but they are static. To be able to meaningfully interact with a website, you need to throw JavaScript into the mix. It is the language of the web, and you don't need to be a programmer to learn the basics. You don't even need any special software, because you can try it right now, in your browser. > To learn more about programming in the browser, we recommend the [resources about JavaScript](https://developer.mozilla.org/en-US/docs/Learn/JavaScript) provided by MDN, the official documentation of the web. ## Next up[​](#next "Direct link to Next up") We will show you [how to use the browser DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/browser-devtools.md) to inspect and interact with a web page. --- # Starting with browser DevTools **Learn about browser DevTools, a valuable tool in the world of web scraping, and how you can use them to extract data from a website.** *** Even though DevTools stands for developer tools, everyone can use them to inspect a website. Each major browser has its own DevTools. We will use Chrome DevTools as an example, but the advice is applicable to any browser, as the tools are extremely similar. To open Chrome DevTools, you can press **F12** or right-click anywhere in the page and choose **Inspect**. Now go to [Wikipedia](https://www.wikipedia.org/) and open your DevTools there. ![Wikipedia with Chrome DevTools open](/assets/images/browser-devtools-wikipedia-d20b19ea46ed30572858ddc63d9e0f23.png) ## Elements tab[​](#elements-tab "Direct link to Elements tab") When you first open Chrome DevTools on Wikipedia, you will start on the Elements tab (in Firefox it's called the **Inspector**). You can use this tab to inspect the page's HTML on the left-hand side, and its CSS on the right. The items in the HTML view are called [**elements**](https://docs.apify.com/academy/concepts/html-elements.md). ![Elements tab in Chrome DevTools](/assets/images/browser-devtools-elements-tab-fb7aa7fc2b9442bb7fd94dbc6955e4c8.png) > On a screen that is narrow or has a small resolution, the CSS information can appear under the HTML tab, not on the right. Each element is enclosed in an HTML tag. For example `<head>`, `<body>`, and `<title>` are all tags. When you add something inside of those tags, like `<title>Hello!</title>`, you create an element. You can also see elements inside other elements in the **Elements** tab. This is called nesting, and it gives the page its structure. At the bottom, there's the **JavaScript console**, which is a powerful tool that you can use to manipulate the website.
If the console is not there, you can press **ESC** to toggle it. All of this might look super complicated at first, but don't worry, there's no need to understand everything yet - we'll walk you through all the important things you need to know. ![Console in Chrome DevTools](/assets/images/browser-devtools-console-0752bf16933c5b7b8858dac3bbd80694.png) ## Selecting an element[​](#selecting-an-element "Direct link to Selecting an element") In the top left corner of DevTools, there's a little arrow icon with a square. ![Chrome DevTools element selection tool](/assets/images/browser-devtools-element-selection-c1cf8032d6d23ad5941c7ebf2b0f1ae5.png) Click it and then hover your mouse over **The Free Encyclopedia**, Wikipedia's subtitle. DevTools will show you information about the HTML element being hovered over. Now click the element. It will be selected in the **Elements** tab, which allows for further inspection of the element and its content. ![Chrome DevTools element hover effect](/assets/images/browser-devtools-hover-b85b0699eef969e79c92fda46154bbe2.png) ## Interacting with an element[​](#interacting-with-elements "Direct link to Interacting with an element") After you select the subtitle element, right-click the highlighted element in the Elements tab to show a menu with available actions. For now, select **Store as global variable** (**Use in Console** in Firefox). You'll see that a new variable called `temp1` (`temp0` in Firefox) appeared in your DevTools Console. You can now use the Console to access the element's properties using JavaScript. For example, if you wanted to scrape the text inside the element, you could use the `textContent` property to get it. Copy and paste (or type) the following command into your Console and press Enter. The text of your `temp1` element - The Free Encyclopedia - will display in the Console. ``` temp1.textContent; ``` Now run this command to get the HTML of the element: ``` temp1.outerHTML; ``` And finally, run the next command to change the text of the element. ``` temp1.textContent = 'Hello World!'; ``` By changing HTML elements from the Console, you can change what's displayed on the page. This change only happens on your own computer so don't worry, you haven't hacked Wikipedia. ![Chrome DevTools JavaScript command execution](/assets/images/browser-devtools-console-commands-9f82a9905f884595024c32ee4519760a.png) > In JavaScript, the web page is called `document`. From the Console you can interact with it in many ways. Go through [document basics](https://developer.mozilla.org/en-US/docs/Web/API/Document_object_model/Using_the_Document_Object_Model) to learn more. ## Next up[​](#next "Direct link to Next up") In this lesson, we learned the absolute basics of interaction with a page using the DevTools. In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools.md), you will learn how to extract data from it. We will extract data about the on-sale products on the [Warehouse store](https://warehouse-theme-metal.myshopify.com). It isn't a real store, but a full-featured demo of a Shopify online store. And that is perfect for our purposes. Shopify is one of the largest e-commerce platforms in the world, and it uses all the latest technologies that a real e-commerce web application would use. Learning to scrape a Shopify store is useful, because you can immediately apply the learnings to millions of websites. 
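If you want to experiment a little more before moving on, below is a small sketch of extra Console commands that build on what you just tried. It assumes you still have a page open in DevTools; the exact values you see will depend on that page.

```
// The whole web page is available in the Console as the `document` object.
document.title; // the text shown in the browser tab
document.URL;   // the address of the currently open page

// Just like with `temp1` above, you can change it,
// and again the change only happens in your own browser.
document.title = 'Hello World!';
```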
--- # Prepare your computer for programming **Set up your computer to be able to code scrapers with Node.js and JavaScript. Download Node.js and npm and run a Hello World script.** *** Before you can start writing scraper code, you need to have your computer set up for it. In this lesson, we will show you all the tools you need to install to successfully write your first scraper. ## Install Node.js[​](#install-node "Direct link to Install Node.js") Let's start with the installation of Node.js. Node.js is an engine for running JavaScript, quite similar to the browser console we used in the previous lessons. You feed it JavaScript code, and it executes it for you. Why not just use the browser console? Because it's limited in its capabilities. Node.js is way more powerful and is much better suited for coding scrapers. If you're on macOS, use [this tutorial to install Node.js](https://blog.apify.com/how-to-install-nodejs/). If you're using Windows [visit the official Node.js website](https://nodejs.org/en/download/). And if you're on Linux, use your package manager to install `nodejs`. ## Install a text editor[​](#install-an-editor "Direct link to Install a text editor") Many text editors are available for you to choose from when programming. You might already have a preferred one so feel free to use that. Make sure it has syntax highlighting and support for Node.js. If you don't have a text editor, we suggest starting with VSCode. It's free, very popular, and well maintained. [Download it here](https://code.visualstudio.com/download). Once you downloaded and installed it, you can open a folder where we will build your scraper. We recommend starting with a new, empty folder. ![How to open a folder in VSCode](/assets/images/vscode-open-folder-4fe8ed6d37a7d37b1c2d8c9356b7a8bb.png) ## Hello world! 👋[​](#hello-world "Direct link to Hello world! 👋") Before we start, let's confirm that Node.js was successfully installed on your computer. To do that, run those two commands in your terminal and see if they correctly print your Node.js and npm versions. The next lessons **require Node.js version 16 or higher**. If you skipped Node.js installation and want to use your existing version of Node.js, **make sure that it's 16 or higher**. ``` node -v npm -v ``` If you installed VSCode in the previous paragraph, you can use the integrated terminal. ![How to open a terminal in VSCode](/assets/images/vscode-open-terminal-44dc7539448cf0e3c67f123f664dbfeb.png) > If you're still wondering what a "terminal" is, we suggest googling for a terminal tutorial for your operating system because individual terminals are different. Sometimes a little, sometimes a lot. After confirming that `node` is correctly installed on your computer, use your text editor to create a file called **hello.js** in your folder. ![How to create a file in VSCode](/assets/images/vscode-create-file-85dd6193a61846dcc6bc584b9c83ef6d.png) Now add this piece of code to **hello.js** and save the file. ``` console.log('Hello World'); ``` Finally, run the below command in your terminal: ``` node hello.js ``` You should see **Hello World** printed in your terminal. If you do, congratulations, you are now officially a programmer! 🚀 ![Hello world in VSCode](/assets/images/vscode-hello-world-993a4d46e1828928f34c468db5bf5810.png) ## Next up[​](#next "Direct link to Next up") You have your computer set up correctly for development, and you've run your first script. Great! 
In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/project-setup.md) we'll set up your project to download a website's HTML using Node.js instead of a browser. --- # Extracting data with DevTools **Continue learning how to extract data from a website using browser DevTools, CSS selectors, and JavaScript via the DevTools console.** *** In the previous parts of the DevTools tutorial, we were able to extract information about a single product from the Sales collection of the [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). If you missed the previous lessons, please go through them to understand the basic concepts. You don't need any of the code from there, though. We will start from scratch. ## Find all product elements[​](#find-all-products "Direct link to Find all product elements") First, we will use the `querySelectorAll()` function from the previous lessons to get a list of all the product elements. Run this command in your Console: ``` const products = document.querySelectorAll('.product-item'); products.length; ``` The `length` property of `products` tells us how many products we have in the list. It says **24** and if you count the number of products on the page, you'll find that it's correct. Good, that means our CSS selector is working perfectly to get all the products. ![Print all products](/assets/images/devtools-count-products-d590f5142b901919a0c31b50df5b0538.png) ## Looping over elements[​](#looping-over-elements "Direct link to Looping over elements") > [Visit this tutorial](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration) if you need to refresh the concept of loops in programming. Now, we will loop over each product and print their titles. We will use a so-called `for..of` loop to do it. It is a loop that iterates through all items of an array. Run the following command in the Console. Some notes: * The `a.product-item__title` selector and the extraction code come from the previous lesson. * The `console.log()` function prints the results to the Console. * The `trim()` function makes sure there are no useless whitespace characters around our data. ``` for (const product of products) { const titleElement = product.querySelector('a.product-item__title'); const title = titleElement.textContent.trim(); console.log(title); } ``` > [Learn more](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of) about the `for..of` loop. ![Print all products\' text](/assets/images/devtools-product-titles-707d5bbdd367173c973801350f61e859.png) ## Extracting more data[​](#extracting-data-in-loop "Direct link to Extracting more data") We will add the price extraction from the previous lesson to the loop. We will also save all the data to an array so that we can work with it. Run this in the Console: > The `results.push()` function takes its argument and pushes (adds) it to the `results` array. [Learn more about it here](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/push). ``` const results = []; for (const product of products) { const titleElement = product.querySelector('a.product-item__title'); const title = titleElement.textContent.trim(); const priceElement = product.querySelector('span.price'); const price = priceElement.childNodes[2].nodeValue.trim(); results.push({ title, price }); } ``` After running the code, you'll see **24** printed to the Console. That's because the `results` array includes 24 products. 
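Before printing everything, you can peek at a single item to see the shape of the data the loop collected. This is just a quick check to run in the Console; the exact title and price you get depend on what the store currently shows.

```
// Inspect the first product object collected by the loop above.
results[0];
// → for example: { title: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', price: '$74.95' }
```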
Now, run this command in the Console to print all the products: ``` console.log(results); ``` ![Print all products\' data](/assets/images/devtools-print-all-products-79895abb91e05c4de5321f849b084bbf.png) > You may notice that some prices include the word **From**, indicating that the price is not final. If you wanted to process this data further, you would want to remove this from the price and instead save this information to another field. ## Summary[​](#summary "Direct link to Summary") Let's recap the web scraping process. First, we used DevTools to **find the element** that holds data about a single product. Then, inside this **parent** element we **found child elements** that contained the data (title, price) we were looking for. Second, we used the `document.querySelector()` function and its `All` variant to **find the data programmatically**, using their **CSS selectors**. And third, we wrapped this data extraction logic in a **loop** to automatically find the data not only for a single product, but for **all the products** on the page. 🎉 ## Next up[​](#next "Direct link to Next up") And that's it! With a bit of trial and error, you will be able to extract data from any webpage that's loaded in your browser. This is a useful skill on its own. It will save you time copy-pasting stuff when you need data for a project. More importantly though, it taught you the basics to start programming your own scrapers. In the [next lessons](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/computer-preparation.md), we will teach you how to create your own web data extraction script using JavaScript and Node.js. --- # Extracting data with Node.js **Continue learning how to create a web scraper with Node.js and Cheerio. Learn how to parse HTML and print the results of the data your scraper has collected.** *** In the first part of the Node.js tutorial we downloaded the HTML of our [Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales) and parsed it with Cheerio. Now, we will replicate the extraction logic from the [Extracting Data with DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools.md) lessons and finish our scraper. ## Querying data with Cheerio[​](#querying-with-cheerio "Direct link to Querying data with Cheerio") As a reminder, the data we need for each product on the page is available in the elements that have the `product-item` class attribute. ![Selecting an element from the Elements tab](/assets/images/devtools-collection-class-0c9aab05cfd98e0e534e0c9adbac28da.png) To get all the elements with that class using Cheerio, we call the `$` function with the appropriate CSS selector. Same as we would with the `document.querySelectorAll()` function. ``` // In browser DevTools Console const products = document.querySelectorAll('.product-item'); ``` ``` // In Node.js with Cheerio const products = $('.product-item'); ``` We will use the same approach as in the previous DevTools lessons. Using a `for..of` loop we will iterate over the list of products we saved in the `products` variable. The code is a little different from DevTools, because we're using Node.js and Cheerio instead of a browser's native DOM manipulation functions, but the principle is exactly the same. Replace the code in your **main.js** with the following, and run it with `node main.js` in your terminal. 
``` // main.js import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales'; // Download HTML with Got Scraping const response = await gotScraping(storeUrl); const html = response.body; // Parse HTML with Cheerio const $ = cheerio.load(html); // Find all products on the page const products = $('.product-item'); // Loop through all the products // and print their text to terminal for (const product of products) { const productElement = $(product); const productText = productElement.text(); console.log(productText); } ``` After you run this script, you will see the data of all 24 products printed in your terminal. The output will be messy, but that's OK. Next, we will clean it. ## Extracting product data[​](#extracting-data "Direct link to Extracting product data") To clean the output, we need to repeat the process from the DevTools lessons and add individual data point extraction to the loop. From those lessons, we know that each of our product cards includes an `<a>` element which holds the product's title, and a `<span>` element which includes the product's price. ![Finding child elements in Elements tab](/assets/images/devtools-find-child-elements-c503170e9925d6e564f97317bf7503b6.png) We will loop over all the `products` and extract the data points from each of them using the `for..of` loop. For reference, this is the part of the code from the DevTools lesson where we collected the data using the browser **DevTools Console**: ``` // This code will only work in the browser, and NOT in Node.js const results = []; for (const product of products) { const titleElement = product.querySelector('a.product-item__title'); const title = titleElement.textContent.trim(); const priceElement = product.querySelector('span.price'); const price = priceElement.childNodes[2].nodeValue.trim(); results.push({ title, price }); } ``` And this snippet shows the same piece of code when using **Node.js and Cheerio**: ``` const results = []; for (const product of products) { const titleElement = $(product).find('a.product-item__title'); const title = titleElement.text().trim(); const priceElement = $(product).find('span.price'); const price = priceElement.contents()[2].nodeValue.trim(); results.push({ title, price }); } ``` The main difference is that we used the [`.find()`](https://cheerio.js.org/classes/Cheerio.html#find) function to select the title and price elements, and the `.contents()` function instead of the `childNodes` attribute. If you find the differences confusing, don't worry about it. It will begin to feel very natural after a bit of practice. The final scraper code looks like this. Replace the code in your **main.js** file with this code and run it using `node main.js` in your terminal.
``` // main.js import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales'; // Download HTML with Got Scraping const response = await gotScraping(storeUrl); const html = response.body; // Parse HTML with Cheerio const $ = cheerio.load(html); // Find all products on the page const products = $('.product-item'); const results = []; for (const product of products) { const titleElement = $(product).find('a.product-item__title'); const title = titleElement.text().trim(); const priceElement = $(product).find('span.price'); const price = priceElement.contents()[2].nodeValue.trim(); results.push({ title, price }); } console.log(results); ``` After running the code, you will see this output in your terminal: ``` [ { title: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', price: '$74.95', }, { title: 'Sony XBR-950G BRAVIA 4K HDR Ultra HD TV', price: 'From $1,398.00', }, { title: 'Sony SACS9 10" Active Subwoofer', price: '$158.00', }, { title: 'Sony PS-HX500 Hi-Res USB Turntable', price: '$398.00', }, { title: 'Klipsch R-120SW Powerful Detailed Home Speaker - Unit', price: '$324.00', }, // ...and more ]; ``` Congratulations! You completed the **Basics of data extraction** section of the Web scraping basics for JavaScript devs course. A quick recap of what you learned: 1. The basic terminology around web scraping, crawling, HTML, CSS and JavaScript. 2. How to use browser DevTools and Console to inspect web pages and manipulate them using CSS and JavaScript. 3. How to install Node.js and set up your computer for building scrapers. 4. How to download a website's HTML using Got Scraping and then parse it using Cheerio to extract valuable data. Great job! 👏🎉 # Next up What's next? While we were able to extract the data, it's not super useful to have it printed to the terminal. In the [next, bonus lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/save-to-csv.md), we will learn how to convert the data to a CSV and save it to a file. --- # Scraping with Node.js **Learn how to use JavaScript and Node.js to create a web scraper, plus take advantage of the Cheerio and Got-scraping libraries to make your job easier.** *** Finally, we have everything ready to start scraping! Yes, the setup was a bit daunting, but luckily, you only have to do it once. We have our project, we have our **main.js** file, so let's add some code to it. ## Downloading HTML[​](#downloading-html "Direct link to Downloading HTML") We will use the `got-scraping` library to download the HTML of products that are [on sale in the Warehouse store](https://warehouse-theme-metal.myshopify.com/collections/sales). We already worked with this page earlier in the [Extracting Data with DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools.md) lessons. Replace the contents of your **main.js** file with this code: ``` // main.js import { gotScraping } from 'got-scraping'; const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales'; const response = await gotScraping(storeUrl); const html = response.body; console.log(html); ``` Now run the script using the `node main.js` command from the previous lesson. After a brief moment, you should see the page's HTML printed to your terminal. > `gotScraping` is an `async` function and the `await` keyword is used to pause execution of the script until it returns the `response`. 
If you're new to this, go through an [introduction to asynchronous JavaScript](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous). ## Parsing HTML[​](#parsing-html "Direct link to Parsing HTML") Having the HTML printed to the terminal is not very helpful. To extract the data, we first have to parse it. Parsing the HTML allows us to query the individual HTML elements, similarly to the way we did it in the browser in the [Extracting Data with DevTools](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/using-devtools.md) lessons. To parse the HTML, we will use the `cheerio` library. Replace the code in your **main.js** with the following code: ``` // main.js import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales'; // Download HTML with Got Scraping const response = await gotScraping(storeUrl); const html = response.body; // Parse HTML with Cheerio const $ = cheerio.load(html); const headingElement = $('h1'); const headingText = headingElement.text(); // Print page title to terminal console.log(headingText); ``` When you run the above script, **Sales** will be printed to the terminal. That's because it's the heading of the Sales page of the Warehouse store, which is located in an `<h1>` element. ![Scraping page heading](/assets/images/node-scraper-title-be9c173ce57d6c4af63aec0ed0c9bcdd.png) Great, we successfully parsed the HTML and extracted the text of the `<h1>` element from it using Node.js and Cheerio. Let's break the code down. The script first downloaded the page's HTML using the Got Scraping library. Then, it parsed the downloaded `html` with `cheerio` using the `load()` function, and allowed us to work with it using the `$` variable (the `$` name is an old convention). The next `$('h1')` function call looked inside the parsed HTML and found the `<h1>` element. Finally, the script extracted the text from the element using the `.text()` function and printed it to the terminal with `console.log()`. > `$('h1')` is very similar to calling `document.querySelector('h1')` in the browser and `element.text()` is similar to `element.textContent` from the earlier DevTools lessons. [Visit the cheerio documentation](https://github.com/cheeriojs/cheerio#readme) to learn more about its syntax. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/node-continued.md) we will learn more about Cheerio and use it to extract all the products' data from the Warehouse store. --- # Setting up your project **Create a new project with npm and Node.js. Install necessary libraries, and test that everything works before starting the next lesson.** *** When you open a website in a browser, the browser first downloads the page's HTML. To do the same thing with Node.js, we will install a program - an npm module - to help us with it. npm modules are installed using `npm`, which is another program, automatically installed with Node.js. > The [npmjs.com](https://www.npmjs.com/) registry offers a huge collection of open-source libraries for Node.js. You can (and you should) utilize it to save time and tap into the amazing open-source community around JavaScript and Node.js. ## Creating a new project with npm[​](#creating-a-project "Direct link to Creating a new project with npm") Before we can install npm modules, we need to create an npm project.
To do that, you can create a new directory or use the one that you already have open in VSCode (you can delete the **hello.js** file now) and from that directory run this command in your terminal: ``` npm init -y ``` It will set up an empty npm project for you and create a file called **package.json**. This is a very important file in Node.js programming as it contains information about the project. ![npm init with VSCode](/assets/images/vscode-npm-init-9a14ce01ee1ea6ec3e6d8c9bb7b27209.png) ### Use modern JavaScript[​](#modern-javascript "Direct link to Use modern JavaScript") Node.js and npm support two types of projects; let's call them legacy and modern. For backwards compatibility, the legacy version is used by default. To switch to the modern version, open your **package.json** and add this line to the end of the JSON object. Don't forget to add a comma to the end of the previous line 😉 ``` "type": "module" ``` ![Update package.json with VSCode](/assets/images/vscode-type-module-4981eb4a264712ba7e0a951578b13c10.png) > More recent versions of npm might already have `"type": "commonjs",` pre-defined; if so, simply replace `commonjs` with `module`. If you want to learn more about JSON and its syntax, we recommend [this tutorial on MDN](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON). ## Installing necessary libraries[​](#install-libraries "Direct link to Installing necessary libraries") Now that we have a project set up, we can install npm modules into the project. Let's install libraries that will help us with downloading and processing websites' HTML. In the project directory, run the following command, which will install two libraries into your project: **got-scraping** and Cheerio. ``` npm install got-scraping cheerio ``` [**got-scraping**](https://github.com/apify/got-scraping) is a library that's made especially for scraping and downloading a page's HTML. It's based on the popular [**got** library](https://github.com/sindresorhus/got), which means any features of **got** are also available in **got-scraping**. Both **got** and **got-scraping** are HTTP clients. To learn more about HTTP, [visit this MDN tutorial](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP). [Cheerio](https://github.com/cheeriojs/cheerio) is a popular Node.js library for parsing and processing HTML. If you know how to work with [jQuery](https://jquery.com/), you'll find Cheerio familiar. ## Test everything[​](#testing "Direct link to Test everything") With the libraries installed, create a new file in the project's folder called **main.js**. This is where we will put all our code. Before we start scraping, though, let's do a check that everything was installed correctly. Add this piece of code inside **main.js**. ``` import { gotScraping } from 'got-scraping'; import * as cheerio from 'cheerio'; console.log('it works!'); ``` Those `import` statements tell Node.js that it should give you access to the **got-scraping** library under the `gotScraping` variable and the Cheerio library under the `cheerio` variable. Now run this command in your terminal: ``` node main.js ``` If you see **it works!** printed in your terminal, great job! You set up everything correctly. If you see an error that says *Cannot use import statement outside a module*, go back to the [Use modern JavaScript](#modern-javascript) paragraph and add the `type` property to your **package.json**. If you see a different error, try copying and pasting it into Google, and you'll find a solution soon.
![Test your setup with VSCode](/assets/images/vscode-test-setup-ee57f6c24936b1661402600019b6373a.png) ## Next up[​](#next "Direct link to Next up") With the project set up, the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/node-js-scraper.md) will show you how to use **got-scraping** to download the website's HTML and extract data from it with Cheerio. --- # Saving results to CSV **Learn how to save the results of your scraper's collected data to a CSV file that can be opened in Excel, Google Sheets, or any other spreadsheets program.** *** In the last lesson, we were able to extract data about all the on-sale products from [Warehouse Store](https://warehouse-theme-metal.myshopify.com/collections/sales). That's great. But we ended up with results printed to the terminal, which is not very useful for further processing. In this lesson, we'll learn how to save that data into a CSV file that you can then open in Excel or Google Sheets. ## Converting to CSV[​](#converting-to-csv "Direct link to Converting to CSV") It might look like a big programming challenge to transform a JavaScript object into a CSV, but thanks to npm, this is going to be a walk in the park. Google search **json to csv npm**. You will find that there's a library called [`json2csv`](https://www.npmjs.com/package/json2csv) that can convert a JavaScript object to CSV format with a single function call. *Perfect!* To install `json2csv`, run this command in your terminal. You need to be in the project's folder - the folder which has the `package.json` file. ``` npm i json2csv ``` First, we need to import the `parse()` function from the library. ``` import { parse } from 'json2csv'; ``` Next, we need to parse the `results` array from the previous lesson with the imported function. ``` const csv = parse(results); ``` The full code, including the earlier scraping part, now looks like this. Replace the contents of your **main.js** file with this code:

```
// main.js
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';
import { parse } from 'json2csv';

const storeUrl = 'https://warehouse-theme-metal.myshopify.com/collections/sales';

// Download HTML with Got Scraping
const response = await gotScraping(storeUrl);
const html = response.body;

// Parse HTML with Cheerio
const $ = cheerio.load(html);

// Find all products on the page
const products = $('.product-item');

const results = [];
for (const product of products) {
    const titleElement = $(product).find('a.product-item__title');
    const title = titleElement.text().trim();
    const priceElement = $(product).find('span.price');
    const price = priceElement.contents()[2].nodeValue.trim();
    results.push({ title, price });
}

// Convert the results to CSV and print them to the terminal
const csv = parse(results);
console.log(csv);
```

> **Why use a Shopify demo and not a real e-commerce store like Amazon?** Because real websites are usually bulkier, littered with promotions, and they change very often. Many have multiple versions of pages, and you never know in advance which one you will get. It will be important to learn how to deal with these challenges in the future, but for this beginner course, we want to have a light and stable environment. > > Some other courses use so-called scraping playgrounds or sandboxes. Those are websites made solely for the purpose of learning scraping. We find those too dumbed down and not representative of real websites. The Shopify demo is a full-featured, real-world website. ## Getting structured data from HTML[​](#getting-structured-data "Direct link to Getting structured data from HTML") When you open up the [Sales section of Warehouse](https://warehouse-theme-metal.myshopify.com/collections/sales), you'll see that there's a grid of products on the page with names and pictures of products. We will learn how to extract all this information. ![Warehouse store with DevTools open](/assets/images/devtools-collection-warehouse-9ab5f52429a865e8965f889abd5e2180.png) Open DevTools and select the name of the **Sony SACS9 Active Subwoofer**. When you click on it, it will get highlighted in the Elements tab.
![Selecting an element with DevTools](/assets/images/devtools-collection-product-name-49124d4c5af086ac2577f349e1ffd989.png) Great, you have selected the element which contains the name of the subwoofer. Now we want to find all the elements that contain all the information about this subwoofer: price, number of reviews, image, and everything else you might need. We will use the **Elements** tab to do that. You can hover over the elements in the Elements tab, and they will get highlighted on the page as you move the mouse. Start from the previously selected element with the subwoofer's name and move your mouse up, hovering over each element, until you find the one that highlights the entire product card. Alternatively, you can press the up arrow a few times to get the same result. The element that contains all the information about the subwoofer is called a **parent element**, and all the nested elements, including the subwoofer's name, price and everything else, are **child elements**. ![Selecting an element with hover](/assets/images/devtools-collection-product-hover-dd23c141bea5f9739410eaa06e83ee80.png) Now that we know how the parent element looks, we can extract its data, including the data of its children. Notice that the element has a `class` attribute with multiple values like `product-item` or `product-item--vertical`. Let's use those classes in the Console to extract data. ![Class attribute in DevTools](/assets/images/devtools-collection-class-0c9aab05cfd98e0e534e0c9adbac28da.png) ## Selecting elements in Console[​](#selecting-elements "Direct link to Selecting elements in Console") We know how to find an element manually using the DevTools, but that's not very useful for automated scraping. We need to tell the computer how to find it as well. We can do that using JavaScript and CSS selectors. The function to do that is called [`document.querySelector()`](https://docs.apify.com/academy/concepts/querying-css-selectors.md) and it will find the first element in the page's HTML matching the provided [CSS selector](https://docs.apify.com/academy/concepts/css-selectors.md). For example `document.querySelector('div')` will find the first `<div>` element. And `document.querySelector('.my-class')` (notice the period `.`) will find the first element with the class `my-class`, such as `<div class="my-class">` or `<p class="my-class">`. You can also combine selectors: `document.querySelector('p.my-class')` will only match `<p>` elements that have the class `my-class`, but not `<div class="my-class">` elements. Let's try to use `document.querySelector()` to find the **Sony subwoofer**. Earlier we mentioned that the parent element of the subwoofer had, among others, the `product-item` class. We can use the class to look up the element. Copy or type (don't miss the period `.` in `.product-item`) the following function into the Console and press Enter. ``` document.querySelector('.product-item'); ``` It will produce a result like this, but it **won't be** the Sony subwoofer. ![Query a selector with JavaScript](/assets/images/devtools-collection-query-9f650202d95e659b6124adfc5bc1754d.png) > **About the missing semicolon:** In the screenshot, there is a missing semicolon `;` at the end of the line. In JavaScript, semicolons are optional, so it makes no difference. When we look more closely by hovering over the result in the Console, we find that instead of the Sony subwoofer, we found a JBL Flip speaker. Why? Because earlier we explained that `document.querySelector('.product-item')` finds the **first element** with the `product-item` class, and the JBL speaker is the first product in the list.
![Hover over a query result](/assets/images/devtools-collection-query-hover-06b6ee753dee883803dc079bf2abecf1.png) We need a different function: [`document.querySelectorAll()`](https://docs.apify.com/academy/concepts/querying-css-selectors.md) (notice the `All` at the end). This function does not find only the first element, but all the elements that match the provided selector. Run the following function in the Console: ``` document.querySelectorAll('.product-item'); ``` It will return a `NodeList` (a type of array) with many results. Expand the results by clicking the small arrow button and then hover over the third (number 2, indexing starts at 0) element in the list. You'll find that it's the Sony subwoofer we're looking for. ![Hover over a query result](/assets/images/devtools-collection-query-all-8d4bc26a48292f70f45bd4fe99fbbb0a.png) Naturally, this is the method we use mostly in web scraping, because we're usually interested in scraping all the products from a page, not just a single product. Elements or nodes? The list is called a `NodeList`, because browsers understand a HTML document as a tree of nodes. Most of the nodes are HTML elements, but there can be also text nodes for plain text, and others. ## How to choose good selectors[​](#choose-good-selectors "Direct link to How to choose good selectors") Often you can select the same element with different CSS selectors. Try to choose selectors that are **simple**, **human-readable**, **unique** and **semantically connected** to the data. Selectors that meet these criteria are sometimes called **resilient selectors**, because they're the most reliable and least likely to change with website updates. If you can, avoid randomly generated attributes like `class="F4jsL8"`. They change often and without warning. The `product-item` class is simple, human-readable, and semantically connected with the data. The subwoofer is one of the products. A product item. Those are strong signals that this is a good selector. It's also sufficiently unique in the website's context. If the selector was only an `item`, for example, there would be a higher chance that the website's developers would add this class to something unrelated. Like an advertisement. And it could break your extraction code. ## Extracting data from elements[​](#extraction-from-elements "Direct link to Extracting data from elements") Now that we found the element, we can start poking into it to extract data. First, let's save the element to a variable so that we can work with it repeatedly. Run these commands in the Console: ``` const products = document.querySelectorAll('.product-item'); const subwoofer = products[2]; ``` > If you're wondering what an array is or what `products[2]` means, read the [JavaScript arrays basics](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/First_steps/Arrays). Now that we have the subwoofer saved in a variable, run another command in the Console to print its text: ``` subwoofer.textContent; ``` ![Print text content of parent element](/assets/images/devtools-print-parent-text-1e295ece2af49af7bd9cf50aba9d2868.png) As you can see, we were able to extract information about the subwoofer, but the format is still not very useful - there's a lot of content that we don't need. For further processing (ex. in a spreadsheet), we would like to have each piece of data as a separate field (column). To do that, we will look at the HTML structure in more detail. 
### Finding child elements[​](#finding-child-elements "Direct link to Finding child elements") In the [Getting structured data from HTML](#getting-structured-data) section, we were browsing the elements in the **Elements** tab to find the element that contains all the data. We can use the same approach to find the individual data points as well. Start from the element that contains all the data (the parent element with the `product-item` class) and then inspect all the elements nested within it. You'll discover that: * the product's name is an `<a>` element with the class `product-item__title`, and * the price is held inside a `<span>` element with the class `price`. Note that there are two prices: the sale price and the regular price. We want the sale price. We will use this knowledge soon to extract the data. ![Finding child elements in Elements tab](/assets/images/devtools-find-child-elements-c503170e9925d6e564f97317bf7503b6.png) ### Selecting child elements[​](#selecting-child-elements "Direct link to Selecting child elements") The `document.querySelector()` function looks for a specific element in the whole HTML `document`, so if we called it with `h3`, it would find the first `<h3>` node in the `document`. But we can replace the `document` with any other parent element and the function will limit its search to child elements of the chosen parent. Earlier we selected the parent element of the Sony subwoofer and saved it to a variable called `subwoofer`. Let's use this variable to search inside the subwoofer element and find the product's name and price. Run two commands in the Console. The first will find the element with the subwoofer's name and save it to a variable called `title`. The second will extract the name and print it. ``` const title = subwoofer.querySelector('a.product-item__title'); title.textContent; ``` ![Extract product title](/assets/images/devtools-extract-product-title-b1934c7e0566336835fad2001dd126b7.png) Great! We found a way to programmatically extract the name of the product. We're getting somewhere. Next, run the following two commands in the Console. ``` const price = subwoofer.querySelector('span.price'); price.textContent; ``` ![Extract product price](/assets/images/devtools-extract-product-price-b6f83ca557a4c83972c75b78631df80e.png) It worked, but the price was not alone in the result. We extracted it together with some extra text. This is very common in web scraping. Sometimes it's impossible to separate the data we need by element selection alone, and we have to clean the data using other methods. ### Cleaning extracted data[​](#cleaning-extracted-data "Direct link to Cleaning extracted data") When it comes to data cleaning, there are two main approaches you can take. It's beneficial to understand both, as one approach may be feasible in a given situation while the other is not. 1. Remove the elements that add noise to your data from the selection. Then extract the pre-cleaned data. 2. Extract the data with noise. Use [regular expressions](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) or other text manipulation techniques to parse the data and keep only the parts we're interested in. First, let's look at **removing the noise before extraction**. When you look closely at the element that contains the price, you'll see that it includes another `<span>` element with the text **Sale price**. This nested `<span>` is what adds noise to our data, and we have to get rid of it.
![Noise in element selection](/assets/images/devtools-cleaning-noise-d6f3a7365fa9d2306086e00d8cd7fa88.png) When we call `subwoofer.querySelector('span.price')` it selects the whole `<span class="price">` element. Unfortunately, it also includes the nested `<span>` element with the **Sale price** text that we're not interested in. We can, however, use JavaScript to get only the actual text of the selected element, without any child elements. Run this command in the Console: ``` price.childNodes[2].nodeValue; ``` Why the third child node? Because the first one represents the empty space before the nested `<span>`, the second one is the nested `<span>` itself, and the third one is the price. In any case, we were able to extract the clean price. ![Clean price selection](/assets/images/devtools-clean-price-8411206ba8f39485d232cb7c1e5d066a.png) The second option we have is to **take the noisy price data and clean it with string manipulation**. The data looks like this: ``` \n Sale price$158.00 ``` This can be approached in a variety of ways. To start, let's look at a naive solution: ``` price.textContent.split('$')[1]; ``` ![Split price from noise](/assets/images/devtools-split-price-f86125408dc6f5ad844bb3d154931bc6.png) And there you go. Notice that this time we extracted the price without the `$` dollar sign. This could be desirable (for example, if we wanted to convert the price from a string to a number) or not, depending on the individual circumstances of the scraping project. Which method to choose? Neither is the perfect solution. The first method could break if the website's developers change the structure of the `<span>` elements and the price is no longer in the third position - a very small change that can happen at any moment. The second method seems more reliable, but only until the website adds prices in another currency or decides to replace `$` with `USD`. It's up to you, the scraping developer, to decide which of the methods will be more resilient on the website you scrape. In production, we would probably use a regular expression like the following, or a specialized library for parsing prices from strings, but for this tutorial, we'll keep it simple. ``` price.textContent.match(/((\d+,?)+.?(\d+)?)/)[0]; ``` ## Next up[​](#next "Direct link to Next up") This concludes our lesson on extracting and cleaning data using DevTools. Using CSS selectors, we were able to find the HTML element that contains data about our favorite Sony subwoofer and then extract the data. In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction/devtools-continued.md), we will learn how to extract information not only about the subwoofer, but about all the products on the page. --- # Introduction **Start learning about web scraping, web crawling, data extraction, and popular tools to start developing your own scraper.** *** Web scraping or crawling? Web data extraction, mining, or collection? You can find various definitions on the web. Let's agree on explanations that we will use throughout this beginner course on web scraping. ## What is web data extraction?[​](#what-is-data-extraction "Direct link to What is web data extraction?") Web data extraction (or collection) is a process that takes a web page, like an Amazon product page, and collects useful information from the page, such as the product's name and price. Web pages are an unstructured data source and the goal of web data extraction is to make information from websites structured, so that it can be processed by data analysis tools or integrated with computer systems.
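To make the idea of "structured" more concrete, the result of data extraction is typically a set of records with named fields. Here is a minimal sketch of what one extracted product record could look like as a JavaScript object; the field names and values are made up for illustration.

```
// A hypothetical structured record extracted from a product page.
const product = {
    name: 'Example Wireless Headphones',
    price: 119.99,
    currency: 'USD',
    inStock: true,
};
```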
The main sources of data on a web page are HTML documents and API calls, but also images, PDFs, etc. ![product data extraction from Amazon](/assets/images/beginners-data-extraction-2bb4d2e434080f9d29cb78c66c3a8ac2.png) ## What is crawling?[​](#what-is-crawling "Direct link to What is crawling?") Where web data extraction focuses on a single page, web crawling (sometimes called spidering 🕷) is all about movement between pages or websites. The purpose of crawling is to travel across the website to find pages with the information we want. Crawling and collection can happen either simultaneously, while moving from page to page, or separately, where one scraper focuses solely on finding pages with data, and another scraper collects the data. The main purpose of crawling is to collect URLs or other links that can be used to move around. ## What is web scraping?[​](#what-is-web-scraping "Direct link to What is web scraping?") We use web scraping as an umbrella term for crawling, web data extraction and all other activities that have the purpose of converting unstructured data from the web to a structured format ready for integration or data analysis. In the advanced courses, you'll learn that web scraping is about much more than just HTML and URLs. ## Next up[​](#next "Direct link to Next up") In the [next lesson](https://docs.apify.com/academy/web-scraping-for-beginners/data-extraction.md), you will learn about the basic building blocks of each web page. HTML, CSS and JavaScript. --- Download OpenAPI * [YAML](https://docs.apify.com/api/openapi.yaml) * [JSON](https://docs.apify.com/api/openapi.json) # Apify API The Apify API (version 2) provides programmatic access to the [Apify platform](https://docs.apify.com). The API is organized around [RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer) HTTP endpoints. You can download the complete OpenAPI schema of Apify API in the [YAML](http://docs.apify.com/api/openapi.yaml) or [JSON](http://docs.apify.com/api/openapi.json) formats. The source code is also available on [GitHub](https://github.com/apify/apify-docs/tree/master/apify-api/openapi). All requests and responses (including errors) are encoded in [JSON](http://www.json.org/) format with UTF-8 encoding, with a few exceptions that are explicitly described in the reference. * To access the API using [Node.js](https://nodejs.org/en/), we recommend the [`apify-client`](https://docs.apify.com/api/client/js) [NPM package](https://www.npmjs.com/package/apify-client). * To access the API using [Python](https://www.python.org/), we recommend the [`apify-client`](https://docs.apify.com/api/client/python) [PyPI package](https://pypi.org/project/apify-client/). The clients' functions correspond to the API endpoints and have the same parameters. This simplifies development of apps that depend on the Apify platform. Important Request Details * `Content-Type` header: For requests with a JSON body, you must include the `Content-Type: application/json` header. * Method override: You can override the HTTP method using the `method` query parameter. This is useful for clients that can only send `GET` requests. For example, to call a `POST` endpoint, append `?method=POST` to the URL of your `GET` request. ## Authentication[​](#authentication "Direct link to Authentication") You can find your API token on the [Integrations](https://console.apify.com/account#/integrations) page in the Apify Console. 
To use your token in a request, either: * Add the token to your request's `Authorization` header as `Bearer <token>`. E.g., `Authorization: Bearer xxxxxxx`. [More info](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Authorization). (Recommended). * Add it as the `token` parameter to your request URL. (Less secure). Using your token in the request header is more secure than using it as a URL parameter because URLs are often stored in browser history and server logs. This creates a chance for someone unauthorized to access your API token. **Do not share your API token or password with untrusted parties.** For more information, see our [integrations](https://docs.apify.com/platform/integrations) documentation. ## Basic usage[](#basic-usage "Direct link to Basic usage") To run an Actor, send a POST request to the [Run Actor](#/reference/actors/run-collection/run-actor) endpoint using either the Actor ID code (e.g. `vKg4IjxZbEYTYeW8T`) or its name (e.g. `janedoe~my-actor`): `https://api.apify.com/v2/acts/[actor_id]/runs` If the Actor is not runnable anonymously, you will receive a 401 or 403 [response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status). This means you need to add your [secret API token](https://console.apify.com/account#/integrations) to the request's `Authorization` header ([recommended](#/introduction/authentication)) or as a URL query parameter `?token=[your_token]` (less secure). Optionally, you can include the query parameters described in the [Run Actor](#/reference/actors/run-collection/run-actor) section to customize your run. If you're using Node.js, the best way to run an Actor is using the `Apify.call()` method from the [Apify SDK](https://sdk.apify.com/docs/api/apify#apifycallactid-input-options). It runs the Actor using the account you are currently logged into (determined by the [secret API token](https://console.apify.com/account#/integrations)). The result is an [Actor run object](https://sdk.apify.com/docs/typedefs/actor-run) and its output (if any). A typical workflow is as follows: 1. Run an Actor or task using the [Run Actor](#/reference/actors/run-collection/run-actor) or [Run task](#/reference/actor-tasks/run-collection/run-task) API endpoints. 2. Monitor the Actor run by periodically polling its progress using the [Get run](#/reference/actor-runs/run-object-and-its-storages/get-run) API endpoint. 3. Fetch the results from the [Get items](#/reference/datasets/item-collection/get-items) API endpoint using the `defaultDatasetId`, which you receive in the Run request response. Additional data may be stored in a key-value store. You can fetch them from the [Get record](#/reference/key-value-stores/record/get-record) API endpoint using the `defaultKeyValueStoreId` and the store's `key`. **Note**: Instead of periodic polling, you can also run your [Actor](#/reference/actors/run-actor-synchronously) or [task](#/reference/actor-tasks/runs-collection/run-task-synchronously) synchronously. This will ensure that the request waits up to 300 seconds (5 minutes) for the run to finish and returns its output. If the run takes longer, the request will time out and throw an error. ## Response structure[](#response-structure "Direct link to Response structure") Most API endpoints return a JSON object with the `data` property: ``` { "data": { ...
} } ``` However, there are a few explicitly described exceptions, such as Dataset [Get items](#/reference/datasets/item-collection/get-items) or Key-value store [Get record](#/reference/key-value-stores/record/get-record) API endpoints, which return data in other formats. In case of an error, the response has the HTTP status code in the range of 4xx or 5xx and the `data` property is replaced with `error`. For example: ``` { "error": { "type": "record-not-found", "message": "Store was not found." } } ``` See [Errors](#/introduction/errors) for more details. ## Pagination[](#pagination "Direct link to Pagination") All API endpoints that return a list of records (e.g. [Get list of Actors](#/reference/actors/actor-collection/get-list-of-actors)) enforce pagination in order to limit the size of their responses. Most of these API endpoints are paginated using the `offset` and `limit` query parameters. The only exception is [Get list of keys](#/reference/key-value-stores/key-collection/get-list-of-keys), which is paginated using the `exclusiveStartKey` query parameter. **IMPORTANT**: Each API endpoint that supports pagination enforces a certain maximum value for the `limit` parameter, in order to reduce the load on Apify servers. The maximum limit could change in the future, so you should never rely on a specific value; instead, check the responses of these API endpoints. ### Using offset[](#using-offset "Direct link to Using offset") Most API endpoints that return a list of records enable pagination using the following query parameters:

| | |
| -------- | ----------- |
| `limit` | Limits the response to contain a specific maximum number of items, e.g. `limit=20`. |
| `offset` | Skips a number of items from the beginning of the list, e.g. `offset=100`. |
| `desc` | By default, items are sorted in the order in which they were created or added to the list. This feature is useful when fetching all the items, because it ensures that items created after the client started the pagination will not be skipped. If you specify the `desc=1` parameter, the items will be returned in the reverse order, i.e. from the newest to the oldest items. |

The response of these API endpoints is always a JSON object with the following structure: ``` { "data": { "total": 2560, "offset": 250, "limit": 1000, "count": 1000, "desc": false, "items": [ { 1st object }, { 2nd object }, ... { 1000th object } ] } } ``` The following table describes the meaning of the response properties:

| Property | Description |
| -------- | ----------- |
| `total` | The total number of items available in the list. |
| `offset` | The number of items that were skipped at the start. This is equal to the `offset` query parameter if it was provided, otherwise it is `0`. |
| `limit` | The maximum number of items that can be returned in the HTTP response. It equals the `limit` query parameter if it was provided, or the maximum limit enforced for the particular API endpoint, whichever is smaller. |
| `count` | The actual number of items returned in the HTTP response. |
| `desc` | `true` if data were requested in descending order and `false` otherwise. |
| `items` | An array of requested items. |

### Using key[](#using-key "Direct link to Using key") The records in the [key-value store](https://docs.apify.com/platform/storage/key-value-store) are not ordered based on numerical indexes, but rather by their keys in the UTF-8 binary order. Therefore the [Get list of keys](#/reference/key-value-stores/key-collection/get-list-of-keys) API endpoint only supports pagination using the following query parameters:

| | |
| ------------------- | ----------- |
| `limit` | Limits the response to contain a specific maximum number of items, e.g. `limit=20`. |
| `exclusiveStartKey` | Skips all records with keys up to and including the given key, in the UTF-8 binary order. |

The response of the API endpoint is always a JSON object with the following structure: ``` { "data": { "limit": 1000, "isTruncated": true, "exclusiveStartKey": "my-key", "nextExclusiveStartKey": "some-other-key", "items": [ { 1st object }, { 2nd object }, ... { 1000th object } ] } } ``` The following table describes the meaning of the response properties:

| Property | Description |
| ----------------------- | ----------- |
| `limit` | The maximum number of items that can be returned in the HTTP response. It equals the `limit` query parameter if it was provided, or the maximum limit enforced for the particular endpoint, whichever is smaller. |
| `isTruncated` | `true` if there are more items left to be queried. Otherwise `false`. |
| `exclusiveStartKey` | The last key that was skipped at the start. It is `null` for the first page. |
| `nextExclusiveStartKey` | The value for the `exclusiveStartKey` parameter to query the next page of items. |

## Errors[](#errors "Direct link to Errors") The Apify API uses common HTTP status codes: `2xx` range for success, `4xx` range for errors caused by the caller (invalid requests) and `5xx` range for server errors (these are rare). Each error response contains a JSON object defining the `error` property, which is an object with the `type` and `message` properties that contain the error code and a human-readable error description, respectively. For example: ``` { "error": { "type": "record-not-found", "message": "Store was not found."
} } ``` Here is the table of the most common errors that can occur for many API endpoints:

| status | type | message |
| ------ | --------------------- | ----------- |
| `400` | `invalid-request` | POST data must be a JSON object |
| `400` | `invalid-value` | Invalid value provided: Comments required |
| `400` | `invalid-record-key` | Record key contains invalid character |
| `401` | `token-not-provided` | Authentication token was not provided |
| `404` | `record-not-found` | Store was not found |
| `429` | `rate-limit-exceeded` | You have exceeded the rate limit of 30 requests per second |
| `405` | `method-not-allowed` | This API endpoint can only be accessed using the following HTTP methods: OPTIONS, POST |

## Rate limiting[](#rate-limiting "Direct link to Rate limiting") All API endpoints limit the rate of requests in order to prevent overloading of Apify servers by misbehaving clients. There are two kinds of rate limits - a global rate limit and a per-resource rate limit. ### Global rate limit[](#global-rate-limit "Direct link to Global rate limit") The global rate limit is set to *250 000 requests per minute*. For [authenticated](#/introduction/authentication) requests, it is counted per user, and for unauthenticated requests, it is counted per IP address. ### Per-resource rate limit[](#per-resource-rate-limit "Direct link to Per-resource rate limit") The default per-resource rate limit is *30 requests per second per resource*, which in this context means a single Actor, a single Actor run, a single dataset, a single key-value store, etc. The default rate limit is applied to every API endpoint except a few select ones, which have higher rate limits. Each API endpoint returns its rate limit in the `X-RateLimit-Limit` header. These endpoints have a rate limit of *100 requests per second per resource*: * CRUD ([get](#/reference/key-value-stores/record/get-record), [put](#/reference/key-value-stores/record/put-record), [delete](#/reference/key-value-stores/record/delete-record)) operations on key-value store records These endpoints have a rate limit of *200 requests per second per resource*: * [Run Actor](#/reference/actors/run-collection/run-actor) * [Run Actor task asynchronously](#/reference/actor-tasks/runs-collection/run-task-asynchronously) * [Run Actor task synchronously](#/reference/actor-tasks/runs-collection/run-task-synchronously) * [Metamorph Actor run](#/reference/actors/metamorph-run/metamorph-run) * [Push items](#/reference/datasets/item-collection/put-items) to dataset * CRUD ([add](#/reference/request-queues/request-collection/add-request), [get](#/reference/request-queues/request-collection/get-request), [update](#/reference/request-queues/request-collection/update-request), [delete](#/reference/request-queues/request-collection/delete-request)) operations on requests in request queues ### Rate limit exceeded errors[](#rate-limit-exceeded-errors "Direct link to Rate limit exceeded errors") If the client is sending too many requests, the API endpoints respond with the HTTP status code `429 Too Many Requests` and the following body: ``` { "error": { "type": "rate-limit-exceeded", "message": "You have exceeded the rate limit of ...
requests per second" } } ``` ### Retrying rate-limited requests with exponential backoff[​](#retrying-rate-limited-requests-with-exponential-backoff "Direct link to Retrying rate-limited requests with exponential backoff") If the client receives the rate limit error, it should wait a certain period of time and then retry the request. If the error happens again, the client should double the wait period and retry the request, and so on. This algorithm is known as *exponential backoff* and it can be described using the following pseudo-code: 1. Define a variable `DELAY=500` 2. Send the HTTP request to the API endpoint 3. If the response has status code not equal to `429` then you are done. Otherwise: * Wait for a period of time chosen randomly from the interval `DELAY` to `2*DELAY` milliseconds * Double the future wait period by setting `DELAY = 2*DELAY` * Continue with step 2 If all requests sent by the client implement the above steps, the client will automatically use the maximum available bandwidth for its requests. Note that the Apify API clients [for JavaScript](https://docs.apify.com/api/client/js) and [for Python](https://docs.apify.com/api/client/python) use the exponential backoff algorithm transparently, so that you do not need to worry about it. ## Referring to resources[​](#referring-to-resources "Direct link to Referring to resources") There are three main ways to refer to a resource you're accessing via API. * the resource ID (e.g. `iKkPcIgVvwmztduf8`) * `username~resourcename` - when using this access method, you will need to use your API token, and access will only work if you have the correct permissions. * `~resourcename` - for this, you need to use an API token, and the `resourcename` refers to a resource in the API token owner's account. ## Authentication[​](#authentication "Direct link to Authentication") * HTTP: Bearer Auth * API Key: apiKey Bearer token provided in the `Authorization` header (e.g., `Authorization: Bearer your_token`—recommended). [More info](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Authorization). Use your API token to authenticate requests. You can find it on the [Integrations page](https://console.apify.com/account#/integrations) in Apify Console. This method is more secure than query parameters, as headers are not logged in browser history or server logs. Security Do not share your API token (or account password) with untrusted parties. *When is authentication required?* * *Required* for private Actors, tasks, or resources (e.g., builds of private Actors). * *Required* when using named formats for IDs (e.g., `username~store-name` for stores or `username~queue-name` for queues). * *Optional* for public Actors or resources (e.g., builds of public Actors can be queried without a token). For more information, see our [integrations documentation](https://docs.apify.com/platform/integrations). | Security Scheme Type: | http | | -------------------------- | ------ | | HTTP Authorization Scheme: | bearer | API token provided as a query parameter (e.g., `?token=your_token`—less secure). Use your API token to authenticate requests. You can find it on the [Integrations page](https://console.apify.com/account#/integrations) in Apify Console. Security Do not share your API token (or account password) with untrusted parties. *When is authentication required?* * *Required* for private Actors, tasks, or resources (e.g., builds of private Actors). 
* *Required* when using named formats for IDs (e.g., `username~store-name` for stores or `username~queue-name` for queues). * *Optional* for public Actors or resources (e.g., builds of public Actors can be queried without a token). For more information, see our [integrations documentation](https://docs.apify.com/platform/integrations). | Security Scheme Type: | apiKey | | --------------------- | ------ | | Query parameter name: | token | --- # Abort build ``` POST /v2/acts/:actorId/builds/:buildId/abort ``` deprecated This endpoint has been deprecated and may be replaced or removed in future versions of the API. **\[DEPRECATED]** API endpoints related to build of the Actor were moved under new namespace [`actor-builds`](#/reference/actor-builds). Aborts an Actor build and returns an object that contains all the details about the build. Only builds that are starting or running are aborted. For builds with status `FINISHED`, `FAILED`, `ABORTING` and `TIMED-OUT` this call does nothing. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get default build ``` GET /v2/acts/:actorId/builds/default ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorClient#default_build)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorClient#defaultBuild)Get the default build for an Actor. Use the optional `waitForFinish` parameter to synchronously wait for the build to finish. This avoids the need for periodic polling when waiting for the build to complete. This endpoint does not require an authentication token. Instead, calls are authenticated using the Actor's unique ID. However, if you access the endpoint without a token, certain attributes (e.g., `usageUsd` and `usageTotalUsd`) will be hidden. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get build ``` GET /v2/acts/:actorId/builds/:buildId ``` deprecated API endpoints related to build of the Actor were moved under new namespace [`actor-builds`](#/reference/actor-builds). Gets an object that contains all the details about a specific build of an Actor. By passing the optional `waitForFinish` parameter the API endpoint will synchronously wait for the build to finish. This is useful to avoid periodic polling when waiting for an Actor build to finish. This endpoint does not require the authentication token. Instead, calls are authenticated using a hard-to-guess ID of the build. However, if you access the endpoint without the token, certain attributes, such as `usageUsd` and `usageTotalUsd`, will be hidden. 
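For illustration, a minimal sketch (not part of the original reference) of calling this endpoint from Node.js 18+ with the built-in `fetch`; the `[actor_id]`, `[build_id]` and `[your_token]` values are placeholders, and `waitForFinish=60` is only an example value:

```
// Ask the API to wait (up to 60 seconds) for the build to finish, instead of polling.
const res = await fetch(
  'https://api.apify.com/v2/acts/[actor_id]/builds/[build_id]?waitForFinish=60',
  { headers: { Authorization: 'Bearer [your_token]' } }, // token is optional here, but reveals usage attributes
);
const { data } = await res.json();
console.log(data.status);
```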
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get list of builds ``` GET /v2/acts/:actorId/builds ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/BuildCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/BuildCollectionClient#list)Gets the list of builds of a specific Actor. The response is a JSON with the list of objects, where each object contains basic information about a single build. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 records. By default, the records are sorted by the `startedAt` field in ascending order, therefore you can use pagination to incrementally fetch all builds while new ones are still being started. To sort the records in descending order, use the `desc=1` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Build Actor ``` POST /v2/acts/:actorId/builds ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorClientAsync#build)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorClient#build)Builds an Actor. The response is the build object as returned by the [Get build](#/reference/actors/build-object/get-build) endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Delete Actor ``` DELETE /v2/acts/:actorId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorClient#delete)Deletes an Actor. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get Actor ``` GET /v2/acts/:actorId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorClient#get)Gets an object that contains all the details about a specific Actor. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get OpenAPI definition ``` GET /v2/acts/:actorId/builds/:buildId/openapi.json ``` Get the OpenAPI definition for Actor builds. 
Two similar endpoints are available: * [First endpoint](https://docs.apify.com/api/v2/act-openapi-json-get.md): Requires both `actorId` and `buildId`. Use `default` as the `buildId` to get the OpenAPI schema for the default Actor build. * [Second endpoint](https://docs.apify.com/api/v2/actor-build-openapi-json-get.md): Requires only `buildId`. Get the OpenAPI definition for a specific Actor build. To fetch the default Actor build, simply pass `default` as the `buildId`. Authentication is based on the build's unique ID. No authentication token is required. note You can also use the [`/api/v2/actor-build-openapi-json-get`](https://docs.apify.com/api/v2/actor-build-openapi-json-get.md) endpoint to get the OpenAPI definition for a build. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update Actor ``` PUT /v2/acts/:actorId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorClient#update)Updates settings of an Actor using values specified by an Actor object passed as JSON in the POST payload. If the object does not define a specific property, its value will not be updated. The response is the full Actor object as returned by the [Get Actor](#/reference/actors/actor-object/get-actor) endpoint. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). If you want to make your Actor [public](https://docs.apify.com/platform/actors/publishing) using `isPublic: true`, you will need to provide the Actor's `title` and the `categories` under which that Actor will be classified in Apify Store. For this, it's best to use the [constants from our `apify-shared-js` package](https://github.com/apify/apify-shared-js/blob/2d43ebc41ece9ad31cd6525bd523fb86939bf860/packages/consts/src/consts.ts#L452-L471). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Abort run ``` POST /v2/acts/:actorId/runs/:runId/abort ``` deprecated This endpoint has been deprecated and may be replaced or removed in future versions of the API. **\[DEPRECATED]** API endpoints related to run of the Actor were moved under new namespace [`actor-runs`](#/reference/actor-runs). Aborts an Actor run and returns an object that contains all the details about the run. Only runs that are starting or running are aborted. For runs with status `FINISHED`, `FAILED`, `ABORTING` and `TIMED-OUT` this call does nothing. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get run ``` GET /v2/acts/:actorId/runs/:runId ``` deprecated This endpoint has been deprecated and may be replaced or removed in future versions of the API. **\[DEPRECATED]** API endpoints related to run of the Actor were moved under new namespace [`actor-runs`](#/reference/actor-runs). 
Gets an object that contains all the details about a specific run of an Actor. By passing the optional `waitForFinish` parameter the API endpoint will synchronously wait for the run to finish. This is useful to avoid periodic polling when waiting for an Actor run to complete. This endpoint does not require the authentication token. Instead, calls are authenticated using a hard-to-guess ID of the run. However, if you access the endpoint without the token, certain attributes, such as `usageUsd` and `usageTotalUsd`, will be hidden. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Metamorph run ``` POST /v2/acts/:actorId/runs/:runId/metamorph ``` deprecated This endpoint has been deprecated and may be replaced or removed in future versions of the API. **\[DEPRECATED]** API endpoints related to run of the Actor were moved under new namespace [`actor-runs`](#/reference/actor-runs). Transforms an Actor run into a run of another Actor with a new input. This is useful if you want to use another Actor to finish the work of your current Actor run, without the need to create a completely new run and wait for it to finish. For the users of your Actors, the metamorph operation is transparent; they will just see that your Actor got the work done. There is a limit on how many times you can metamorph a single run. You can check the limit in [the Actor runtime limits](https://docs.apify.com/platform/limits#actor-limits). Internally, the system stops the Docker container corresponding to the Actor run and starts a new container using a different Docker image. All the default storages are preserved and the new input is stored under the `INPUT-METAMORPH-1` key in the same default key-value store. For more information, see the [Actor docs](https://docs.apify.com/platform/actors/development/programming-interface/metamorph). ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Resurrect run ``` POST /v2/acts/:actorId/runs/:runId/resurrect ``` **\[DEPRECATED]** API endpoints related to run of the Actor were moved under new namespace [`actor-runs`](#/reference/actor-runs). Resurrects a finished Actor run and returns an object that contains all the details about the resurrected run. Only finished runs, i.e. runs with status `FINISHED`, `FAILED`, `ABORTED` and `TIMED-OUT`, can be resurrected. The run status will be updated to `RUNNING` and its container will be restarted with the same storages (the same behaviour as when the run gets migrated to a new server). For more information, see the [Actor docs](https://docs.apify.com/platform/actors/running/runs-and-builds#resurrection-of-finished-run). ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Without input ``` GET /v2/acts/:actorId/run-sync ``` Runs a specific Actor and returns its output. The run must finish within 300 seconds, otherwise the API endpoint returns a timeout error. The Actor is not passed any input. Beware that it might be impossible to maintain an idle HTTP connection for a long period of time, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status.
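As an illustration (not part of the original reference), a minimal Node.js 18+ sketch of calling this endpoint; `[actor_id]` and `[your_token]` are placeholders and the response body is simply printed as text:

```
// Run the Actor synchronously with no input and print whatever output it returns.
const res = await fetch('https://api.apify.com/v2/acts/[actor_id]/run-sync', {
  headers: { Authorization: 'Bearer [your_token]' },
});
if (res.status === 408) throw new Error('The run did not finish within 300 seconds.');
console.log(await res.text());
```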
To run the Actor asynchronously, use the [Run Actor](#/reference/actors/run-collection/run-actor) API endpoint instead. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 201 * 400 * 408 **Response Headers** **Response Headers** **Response Headers** --- # Run Actor synchronously without input and get dataset items ``` GET /v2/acts/:actorId/run-sync-get-dataset-items ``` Runs a specific Actor and returns its dataset items. The run must finish within 300 seconds, otherwise the API endpoint returns a timeout error. The Actor is not passed any input. It allows you to send all the options of the [Get Dataset Items](#/reference/datasets/item-collection/get-items) API endpoint as query parameters. Beware that it might be impossible to maintain an idle HTTP connection for a long period of time, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status. To run the Actor asynchronously, use the [Run Actor](#/reference/actors/run-collection/run-actor) API endpoint instead. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 201 * 400 * 408 **Response Headers** * **X-Apify-Pagination-Offset** **X-Apify-Pagination-Limit** **X-Apify-Pagination-Count** **X-Apify-Pagination-Total** **Response Headers** **Response Headers** --- # Run Actor synchronously with input and get dataset items ``` POST /v2/acts/:actorId/run-sync-get-dataset-items ``` Runs a specific Actor and returns its dataset items. The POST payload including its `Content-Type` header is passed as `INPUT` to the Actor (usually `application/json`). The HTTP response contains the Actor's dataset items, and the format of the items depends on the dataset items' `format` parameter. You can use all the options of the [Get Dataset Items](#/reference/datasets/item-collection/get-items) API endpoint as query parameters. The Actor is started with the default options; you can override them using URL query parameters. If the Actor run exceeds 300 seconds, the HTTP response will return the 408 status code (Request Timeout). Beware that it might be impossible to maintain an idle HTTP connection for a long period of time, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status. To run the Actor asynchronously, use the [Run Actor](#/reference/actors/run-collection/run-actor) API endpoint instead. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 201 * 400 * 408 **Response Headers** * **X-Apify-Pagination-Offset** **X-Apify-Pagination-Limit** **X-Apify-Pagination-Count** **X-Apify-Pagination-Total** **Response Headers** **Response Headers** --- # Run Actor synchronously with input and return output ``` POST /v2/acts/:actorId/run-sync ``` Runs a specific Actor and returns its output. The POST payload including its `Content-Type` header is passed as `INPUT` to the Actor (usually `application/json`). The HTTP response contains the Actor's `OUTPUT` record from its default key-value store. The Actor is started with the default options; you can override them using various URL query parameters. If the Actor run exceeds 300 seconds, the HTTP response will have status 408 (Request Timeout).
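For example, a minimal sketch (not part of the original reference) of calling this endpoint from Node.js 18+; `[actor_id]`, `[your_token]` and the input object are placeholders:

```
// Run the Actor synchronously; the JSON body and its Content-Type become the Actor's INPUT.
const res = await fetch('https://api.apify.com/v2/acts/[actor_id]/run-sync', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer [your_token]',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ someInputField: 'some value' }),
});
console.log(res.status, await res.text()); // 201 with the OUTPUT record, or 400/408 on failure
```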
Beware that it might be impossible to maintain an idle HTTP connection for a long period of time, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status. To run the Actor asynchronously, use the [Run Actor](#/reference/actors/run-collection/run-actor) API endpoint instead. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 201 * 400 * 408 **Response Headers** **Response Headers** **Response Headers** --- # Get list of runs ``` GET /v2/acts/:actorId/runs ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RunCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RunCollectionClient#list)Gets the list of runs of a specific Actor. The response is a list of objects, where each object contains basic information about a single Actor run. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 array elements. By default, the records are sorted by the `startedAt` field in ascending order, therefore you can use pagination to incrementally fetch all records while new ones are still being created. To sort the records in descending order, use `desc=1` parameter. You can also filter runs by status ([available statuses](https://docs.apify.com/platform/actors/running/runs-and-builds#lifecycle)). ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get last run ``` GET /v2/acts/:actorId/runs/last ``` This is not a single endpoint, but an entire group of endpoints that lets you retrieve and manage the last run of a given Actor or any of its default storages. All the endpoints require an authentication token. The endpoints accept the same HTTP methods and query parameters as the respective storage endpoints. The base path that represents the last Actor run object is: `/v2/acts/{actorId}/runs/last{?token,status}` Using the `status` query parameter, you can ensure that you only get a run with a certain status (e.g. `status=SUCCEEDED`). The output of this endpoint and other query parameters are the same as in the [Run object](#/reference/actors/run-object) endpoint. In order to access the default storages of the last Actor run, i.e. log, key-value store, dataset and request queue, use the following endpoints: * `/v2/acts/{actorId}/runs/last/log{?token,status}` * `/v2/acts/{actorId}/runs/last/key-value-store{?token,status}` * `/v2/acts/{actorId}/runs/last/dataset{?token,status}` * `/v2/acts/{actorId}/runs/last/request-queue{?token,status}` These API endpoints have the same usage as the equivalent storage endpoints. For example, `/v2/acts/{actorId}/runs/last/key-value-store` has the same HTTP method and parameters as the [Key-value store object](#/reference/key-value-stores/store-object) endpoint.
Additionally, each of the above API endpoints supports all sub-endpoints of the original one: #### Key-value store[](#key-value-store "Direct link to Key-value store") * `/v2/acts/{actorId}/runs/last/key-value-store/keys{?token,status}` [Key collection](#/reference/key-value-stores/key-collection) * `/v2/acts/{actorId}/runs/last/key-value-store/records/{recordKey}{?token,status}` [Record](#/reference/key-value-stores/record) #### Dataset[](#dataset "Direct link to Dataset") * `/v2/acts/{actorId}/runs/last/dataset/items{?token,status}` [Item collection](#/reference/datasets/item-collection) #### Request queue[](#request-queue "Direct link to Request queue") * `/v2/acts/{actorId}/runs/last/request-queue/requests{?token,status}` [Request collection](#/reference/request-queues/request-collection) * `/v2/acts/{actorId}/runs/last/request-queue/requests/{requestId}{?token,status}` [Request](#/reference/request-queues/request) * `/v2/acts/{actorId}/runs/last/request-queue/head{?token,status}` [Queue head](#/reference/request-queues/queue-head) For example, to download data from a dataset of the last succeeded Actor run in XML format, send an HTTP GET request to the following URL: ``` https://api.apify.com/v2/acts/{actorId}/runs/last/dataset/items?token={yourApiToken}&format=xml&status=SUCCEEDED ``` In order to save new items to the dataset, send an HTTP POST request with a JSON payload to the same URL. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Run Actor ``` POST /v2/acts/:actorId/runs ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorClientAsync#call)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorClient#start)Runs an Actor and immediately returns without waiting for the run to finish. The POST payload including its `Content-Type` header is passed as `INPUT` to the Actor (usually `application/json`). The Actor is started with the default options; you can override them using various URL query parameters. The response is the Run object as returned by the [Get run](#/reference/actor-runs/run-object-and-its-storages/get-run) API endpoint. If you want to wait for the run to finish and receive the actual output of the Actor as the response, please use one of the [Run Actor synchronously](#/reference/actors/run-actor-synchronously) API endpoints instead. To fetch the Actor run results that are typically stored in the default dataset, you'll need to pass the ID from the `defaultDatasetId` field of the response JSON to the [Get items](#/reference/datasets/item-collection/get-items) API endpoint. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Delete version ``` DELETE /v2/acts/:actorId/versions/:versionNumber ``` Deletes a specific version of an Actor's source code. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 204 **Response Headers** --- # Delete environment variable ``` DELETE /v2/acts/:actorId/versions/:versionNumber/env-vars/:envVarName ``` Deletes a specific environment variable.
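As a hypothetical illustration (the Actor ID, version number, variable name and token below are placeholders, not values from the reference), the call could look like this in Node.js 18+:

```
// Delete one environment variable of a specific Actor version and check for the 204 response.
const res = await fetch(
  'https://api.apify.com/v2/acts/[actor_id]/versions/0.1/env-vars/MY_ENV_VAR',
  { method: 'DELETE', headers: { Authorization: 'Bearer [your_token]' } },
);
console.log(res.status); // 204 on success
```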
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get environment variable ``` GET /v2/acts/:actorId/versions/:versionNumber/env-vars/:envVarName ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarClientAsync#get)Gets a [EnvVar object](#/reference/actors/environment-variable-object) that contains all the details about a specific environment variable of an Actor. If `isSecret` is set to `true`, then `value` will never be returned. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update environment variable ``` PUT /v2/acts/:actorId/versions/:versionNumber/env-vars/:envVarName ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarClientAsync#update)Updates Actor environment variable using values specified by a [EnvVar object](#/reference/actors/environment-variable-object) passed as JSON in the POST payload. If the object does not define a specific property, its value will not be updated. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). The response is the [EnvVar object](#/reference/actors/environment-variable-object) as returned by the [Get environment variable](#/reference/actors/environment-variable-object/get-environment-variable) endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get list of environment variables ``` GET /v2/acts/:actorId/versions/:versionNumber/env-vars ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarCollectionClientAsync#list)Gets the list of environment variables for a specific version of an Actor. The response is a JSON object with the list of [EnvVar objects](#/reference/actors/environment-variable-object), where each contains basic information about a single environment variable. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create environment variable ``` POST /v2/acts/:actorId/versions/:versionNumber/env-vars ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarCollectionClientAsync#create)Creates an environment variable of an Actor using values specified in a [EnvVar object](#/reference/actors/environment-variable-object) passed as JSON in the POST payload. The request must specify `name` and `value` parameters (as strings) in the JSON payload and a `Content-Type: application/json` HTTP header. 
``` { "name": "ENV_VAR_NAME", "value": "my-env-var" } ``` The response is the [EnvVar object](#/reference/actors/environment-variable-object) as returned by the [Get environment variable](#/reference/actors/environment-variable-object/get-environment-variable) endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Get version ``` GET /v2/acts/:actorId/versions/:versionNumber ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorVersionClientAsync#get)Gets a [Version object](#/reference/actors/version-object) that contains all the details about a specific version of an Actor. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update version ``` PUT /v2/acts/:actorId/versions/:versionNumber ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorVersionClientAsync#update)Updates Actor version using values specified by a [Version object](#/reference/actors/version-object) passed as JSON in the POST payload. If the object does not define a specific property, its value will not be updated. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). The response is the [Version object](#/reference/actors/version-object) as returned by the [Get version](#/reference/actors/version-object/get-version) endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get list of versions ``` GET /v2/acts/:actorId/versions ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorVersionCollectionClientAsync#list)Gets the list of versions of a specific Actor. The response is a JSON object with the list of [Version objects](#/reference/actors/version-object), where each contains basic information about a single version. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create version ``` POST /v2/acts/:actorId/versions ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorVersionCollectionClientAsync#create)Creates a version of an Actor using values specified in a [Version object](#/reference/actors/version-object) passed as JSON in the POST payload. The request must specify `versionNumber` and `sourceType` parameters (as strings) in the JSON payload and a `Content-Type: application/json` HTTP header. Each `sourceType` requires its own additional properties to be passed to the JSON payload object. 
These are outlined in the [Version object](#/reference/actors/version-object) table below and in more detail in the [Apify documentation](https://docs.apify.com/platform/actors/development/deployment/source-types). For example, if an Actor's source code is stored in a [GitHub repository](https://docs.apify.com/platform/actors/development/deployment/source-types#git-repository), you will set the `sourceType` to `GIT_REPO` and pass the repository's URL in the `gitRepoUrl` property. ``` { "versionNumber": "0.1", "sourceType": "GIT_REPO", "gitRepoUrl": "https://github.com/my-github-account/actor-repo" } ``` The response is the [Version object](#/reference/actors/version-object) as returned by the [Get version](#/reference/actors/version-object/get-version) endpoint. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Get list of webhooks ``` GET /v2/acts/:actorId/webhooks ``` Gets the list of webhooks of a specific Actor. The response is a JSON with the list of objects, where each object contains basic information about a single webhook. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 records. By default, the records are sorted by the `createdAt` field in ascending order. To sort the records in descending order, use the `desc=1` parameter. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Abort build ``` POST /v2/actor-builds/:buildId/abort ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/BuildClientAsync#abort)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/BuildClient#abort)Aborts an Actor build and returns an object that contains all the details about the build. Only builds that are starting or running are aborted. For builds with status `FINISHED`, `FAILED`, `ABORTING` and `TIMED-OUT` this call does nothing. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete build ``` DELETE /v2/actor-builds/:buildId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/BuildClient#delete)Delete the build. The build that is the current default build for the Actor cannot be deleted. Only users with build permissions for the Actor can delete builds.
## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get build ``` GET /v2/actor-builds/:buildId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/BuildClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/BuildClient#get)Gets an object that contains all the details about a specific build of an Actor. By passing the optional `waitForFinish` parameter the API endpoint will synchronously wait for the build to finish. This is useful to avoid periodic polling when waiting for an Actor build to finish. This endpoint does not require the authentication token. Instead, calls are authenticated using a hard-to-guess ID of the build. However, if you access the endpoint without the token, certain attributes, such as `usageUsd` and `usageTotalUsd`, will be hidden. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get log ``` GET /v2/actor-builds/:buildId/log ``` Check out [Logs](#/reference/logs) for full reference. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get OpenAPI definition ``` GET /v2/actor-builds/:buildId/openapi.json ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/BuildClient#get_open_api_definition)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/BuildClient#getOpenApiDefinition)Get the OpenAPI definition for Actor builds. Two similar endpoints are available: * [First endpoint](https://docs.apify.com/api/v2/act-openapi-json-get.md): Requires both `actorId` and `buildId`. Use `default` as the `buildId` to get the OpenAPI schema for the default Actor build. * [Second endpoint](https://docs.apify.com/api/v2/actor-build-openapi-json-get.md): Requires only `buildId`. Get the OpenAPI definition for a specific Actor build. Authentication is based on the build's unique ID. No authentication token is required. note You can also use the [`/api/v2/act-openapi-json-get`](https://docs.apify.com/api/v2/act-openapi-json-get.md) endpoint to get the OpenAPI definition for a build. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Actor builds - Introduction The API endpoints described in this section enable you to manage and delete Apify Actor builds. Note that if any returned build object contains usage in dollars, your effective unit pricing at the time of the query was used to compute this dollar equivalent, and hence it should be used only for informative purposes. You can learn more about platform usage in the [documentation](https://docs.apify.com/platform/actors/running/usage-and-resources#usage).
## [Get user builds list](https://docs.apify.com/api/v2/actor-builds-get.md) [`/actor-builds`](https://docs.apify.com/api/v2/actor-builds-get.md) ## [Get build](https://docs.apify.com/api/v2/actor-build-get.md) [`/actor-builds/{buildId}`](https://docs.apify.com/api/v2/actor-build-get.md) ## [Delete build](https://docs.apify.com/api/v2/actor-build-delete.md) [`/actor-builds/{buildId}`](https://docs.apify.com/api/v2/actor-build-delete.md) ## [Abort build](https://docs.apify.com/api/v2/actor-build-abort-post.md) [`/actor-builds/{buildId}/abort`](https://docs.apify.com/api/v2/actor-build-abort-post.md) ## [Get log](https://docs.apify.com/api/v2/actor-build-log-get.md) [`/actor-builds/{buildId}/log`](https://docs.apify.com/api/v2/actor-build-log-get.md) ## [Get OpenAPI definition](https://docs.apify.com/api/v2/actor-build-openapi-json-get.md) [`/actor-builds/{buildId}/openapi.json`](https://docs.apify.com/api/v2/actor-build-openapi-json-get.md) --- # Get user builds list ``` GET /v2/actor-builds ``` Gets a list of all builds for a user. The response is a JSON array of objects, where each object contains basic information about a single build. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 records. By default, the records are sorted by the `startedAt` field in ascending order. Therefore, you can use pagination to incrementally fetch all builds while new ones are still being started. To sort the records in descending order, use the `desc=1` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Abort run ``` POST /v2/actor-runs/:runId/abort ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RunClientAsync#abort)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RunClient#abort)Aborts an Actor run and returns an object that contains all the details about the run. Only runs that are starting or running are aborted. For runs with status `FINISHED`, `FAILED`, `ABORTING` and `TIMED-OUT` this call does nothing. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete run ``` DELETE /v2/actor-runs/:runId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RunClient#delete)Delete the run. Only finished runs can be deleted. Only the person or organization that initiated the run can delete it. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get run ``` GET /v2/actor-runs/:runId ``` This is not a single endpoint, but an entire group of endpoints that lets you retrieve the run or any of its default storages. The endpoints accept the same HTTP methods and query parameters as the respective storage endpoints. The base path that represents the Actor run object is: `/v2/actor-runs/{runId}{?token}` In order to access the default storages of the Actor run, i.e. 
log, key-value store, dataset and request queue, use the following endpoints: * `/v2/actor-runs/{runId}/log{?token}` * `/v2/actor-runs/{runId}/key-value-store{?token}` * `/v2/actor-runs/{runId}/dataset{?token}` * `/v2/actor-runs/{runId}/request-queue{?token}` These API endpoints have the same usage as the equivalent storage endpoints. For example, `/v2/actor-runs/{runId}/key-value-store` has the same HTTP method and parameters as the [Key-value store object](#/reference/key-value-stores/store-object) endpoint. Additionally, each of the above API endpoints supports all sub-endpoints of the original one: #### Log[](#log "Direct link to Log") * `/v2/actor-runs/{runId}/log` [Log](#/reference/logs) #### Key-value store[](#key-value-store "Direct link to Key-value store") * `/v2/actor-runs/{runId}/key-value-store/keys{?token}` [Key collection](#/reference/key-value-stores/key-collection) * `/v2/actor-runs/{runId}/key-value-store/records/{recordKey}{?token}` [Record](#/reference/key-value-stores/record) #### Dataset[](#dataset "Direct link to Dataset") * `/v2/actor-runs/{runId}/dataset/items{?token}` [Item collection](#/reference/datasets/item-collection) #### Request queue[](#request-queue "Direct link to Request queue") * `/v2/actor-runs/{runId}/request-queue/requests{?token}` [Request collection](#/reference/request-queues/request-collection) * `/v2/actor-runs/{runId}/request-queue/requests/{requestId}{?token}` [Request](#/reference/request-queues/request) * `/v2/actor-runs/{runId}/request-queue/head{?token}` [Queue head](#/reference/request-queues/queue-head) For example, to download data from a dataset of the Actor run in XML format, send an HTTP GET request to the following URL: ``` https://api.apify.com/v2/actor-runs/{runId}/dataset/items?format=xml ``` In order to save new items to the dataset, send an HTTP POST request with a JSON payload to the same URL. Gets an object that contains all the details about a specific run of an Actor. By passing the optional `waitForFinish` parameter the API endpoint will synchronously wait for the run to finish. This is useful to avoid periodic polling when waiting for an Actor run to complete. This endpoint does not require the authentication token. Instead, calls are authenticated using a hard-to-guess ID of the run. However, if you access the endpoint without the token, certain attributes, such as `usageUsd` and `usageTotalUsd`, will be hidden. ## Request[](#request "Direct link to Request") ## Responses[](#responses "Direct link to Responses") * 200 **Response Headers** --- # Metamorph run ``` POST /v2/actor-runs/:runId/metamorph ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RunClientAsync#metamorph)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RunClient#metamorph)Transforms an Actor run into a run of another Actor with a new input. This is useful if you want to use another Actor to finish the work of your current Actor run, without the need to create a completely new run and wait for it to finish. For the users of your Actors, the metamorph operation is transparent; they will just see that your Actor got the work done.
Internally, the system stops the Docker container corresponding to the Actor run and starts a new container using a different Docker image. All the default storages are preserved and the new input is stored under the `INPUT-METAMORPH-1` key in the same default key-value store. For more information, see the [Actor docs](https://docs.apify.com/platform/actors/development/programming-interface/metamorph). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update status message ``` PUT /v2/actor-runs/:runId ``` You can set a single status message on your run that will be displayed in the Apify Console UI. During an Actor run, you will typically do this in order to inform users of your Actor about the Actor's progress. The request body must contain the `runId` and `statusMessage` properties. The `isStatusMessageTerminal` property is optional and indicates whether the status message is the very last one. In the absence of a status message, the platform will try to substitute sensible defaults. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Reboot run ``` POST /v2/actor-runs/:runId/reboot ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RunClientAsync#reboot)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RunClient#reboot)Reboots an Actor run and returns an object that contains all the details about the rebooted run. Only runs that are running, i.e. runs with status `RUNNING`, can be rebooted. The run's container will be restarted, so any data not persisted in the key-value store, dataset, or request queue will be lost. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Actor runs - Introduction The API endpoints described in this section enable you to manage and delete Apify Actor runs. If any returned run object contains usage in dollars, the amount is computed from your effective unit pricing at the time of the query, and it should therefore be used only for informative purposes. You can learn more about platform usage in the [documentation](https://docs.apify.com/platform/actors/running/usage-and-resources#usage). 
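For illustration, here is a minimal sketch of managing a run with the endpoints described above, using Python's `requests` library. The token and run ID are placeholders, and the `data` response envelope is an assumption based on the usual Apify API response shape.

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"   # placeholder
RUN_ID = "YOUR_RUN_ID"         # placeholder
BASE_URL = "https://api.apify.com/v2"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

# GET /v2/actor-runs/{runId} returns the run object; the run details are
# assumed to sit under the "data" key of the JSON response.
run = requests.get(f"{BASE_URL}/actor-runs/{RUN_ID}", headers=HEADERS).json()["data"]
print(run["status"])

# POST /v2/actor-runs/{runId}/abort only has an effect while the run is
# starting or running; for already finished runs the call does nothing.
if run["status"] == "RUNNING":
    aborted = requests.post(f"{BASE_URL}/actor-runs/{RUN_ID}/abort", headers=HEADERS)
    print(aborted.json()["data"]["status"])
```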
## [Get user runs list](https://docs.apify.com/api/v2/actor-runs-get.md) [`/actor-runs`](https://docs.apify.com/api/v2/actor-runs-get.md) ## [Get run](https://docs.apify.com/api/v2/actor-run-get.md) [`/actor-runs/{runId}`](https://docs.apify.com/api/v2/actor-run-get.md) ## [Update status message](https://docs.apify.com/api/v2/actor-run-put.md) [`/actor-runs/{runId}`](https://docs.apify.com/api/v2/actor-run-put.md) ## [Delete run](https://docs.apify.com/api/v2/actor-run-delete.md) [`/actor-runs/{runId}`](https://docs.apify.com/api/v2/actor-run-delete.md) ## [Abort run](https://docs.apify.com/api/v2/actor-run-abort-post.md) [`/actor-runs/{runId}/abort`](https://docs.apify.com/api/v2/actor-run-abort-post.md) ## [Metamorph run](https://docs.apify.com/api/v2/actor-run-metamorph-post.md) [`/actor-runs/{runId}/metamorph`](https://docs.apify.com/api/v2/actor-run-metamorph-post.md) ## [Reboot run](https://docs.apify.com/api/v2/actor-run-reboot-post.md) [`/actor-runs/{runId}/reboot`](https://docs.apify.com/api/v2/actor-run-reboot-post.md) ## [Resurrect run](https://docs.apify.com/api/v2/post-resurrect-run.md) [`/actor-runs/{runId}/resurrect`](https://docs.apify.com/api/v2/post-resurrect-run.md) ## [Charge events in run](https://docs.apify.com/api/v2/post-charge-run.md) [`/actor-runs/{runId}/charge`](https://docs.apify.com/api/v2/post-charge-run.md) --- # Get user runs list ``` GET /v2/actor-runs ``` Gets a list of all runs for a user. The response is a list of objects, where each object contains basic information about a single Actor run. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 array elements. By default, the records are sorted by the `startedAt` field in ascending order. Therefore, you can use pagination to incrementally fetch all records while new ones are still being created. To sort the records in descending order, use `desc=1` parameter. You can also filter runs by status ([available statuses](https://docs.apify.com/platform/actors/running/runs-and-builds#lifecycle)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete task ``` DELETE /v2/actor-tasks/:actorTaskId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskClient#delete)Delete the task specified through the `actorTaskId` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get task ``` GET /v2/actor-tasks/:actorTaskId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/TaskClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskClient#get)Get an object that contains all the details about a task. 
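As a minimal sketch of this call with Python's `requests` library (the token is a placeholder; the task can be addressed either by its ID or by a `username~task-name` pair):

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"   # placeholder
TASK_ID = "janedoe~my-task"    # task ID or username~task-name

# GET /v2/actor-tasks/{actorTaskId} returns the task object.
resp = requests.get(
    f"https://api.apify.com/v2/actor-tasks/{TASK_ID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```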
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get task input ``` GET /v2/actor-tasks/:actorTaskId/input ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/TaskClientAsync#get_input)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskClient#getInput)Returns the input of a given task. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update task input ``` PUT /v2/actor-tasks/:actorTaskId/input ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/TaskClientAsync#update_input)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskClient#updateInput)Updates the input of a task using values specified by an object passed as JSON in the PUT payload. If the object does not define a specific property, its value is not updated. The response is the full task input as returned by the [Get task input](#/reference/tasks/task-input-object/get-task-input) endpoint. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update task ``` PUT /v2/actor-tasks/:actorTaskId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/TaskClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskClient#update)Updates the settings of a task using values specified by an object passed as JSON in the PUT payload. If the object does not define a specific property, its value is not updated. The response is the full task object as returned by the [Get task](#/reference/tasks/task-object/get-task) endpoint. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Run task synchronously ``` GET /v2/actor-tasks/:actorTaskId/run-sync ``` Runs a specific task and returns its output. The run must finish in 300 seconds, otherwise the HTTP request fails with a timeout error (this won't abort the run itself). 
Beware that it might be impossible to maintain an idle HTTP connection for an extended period, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status. To run the task asynchronously, use the [Run task asynchronously](#/reference/actor-tasks/run-collection/run-task) endpoint instead. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 * 400 * 408 **Response Headers** **Response Headers** Request Timeout: the HTTP request exceeded the 300-second limit **Response Headers** --- # Run task synchronously and get dataset items ``` GET /v2/actor-tasks/:actorTaskId/run-sync-get-dataset-items ``` Runs a specific task and returns its dataset items. The run must finish in 300 seconds, otherwise the HTTP request fails with a timeout error (this won't abort the run itself). You can send all the same options in parameters as the [Get Dataset Items](#/reference/datasets/item-collection/get-items) API endpoint. Beware that it might be impossible to maintain an idle HTTP connection for an extended period, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status. To run the task asynchronously, use the [Run task asynchronously](#/reference/actor-tasks/run-collection/run-task) endpoint instead. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 * 400 * 408 **Response Headers** * **X-Apify-Pagination-Offset** **X-Apify-Pagination-Limit** **X-Apify-Pagination-Count** **X-Apify-Pagination-Total** **Response Headers** Request Timeout: the HTTP request exceeded the 300-second limit **Response Headers** --- # Run task synchronously and get dataset items ``` POST /v2/actor-tasks/:actorTaskId/run-sync-get-dataset-items ``` Runs an Actor task and synchronously returns its dataset items. The run must finish in 300 seconds, otherwise the HTTP request fails with a timeout error (this won't abort the run itself). Optionally, you can override the Actor input configuration by passing a JSON object as the POST payload and setting the `Content-Type: application/json` HTTP header. Note that if the object in the POST payload does not define a particular input property, the Actor run uses the default value defined by the task (or the Actor's input schema if not defined by the task). You can send all the same options in parameters as the [Get Dataset Items](#/reference/datasets/item-collection/get-items) API endpoint. Beware that it might be impossible to maintain an idle HTTP connection for an extended period, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status. Input fields from the Actor task configuration can be overridden with values passed as the POST payload. Just make sure to set the `Content-Type` header to `application/json` and that the input is an object. To run the task asynchronously, use the [Run task](#/reference/actor-tasks/run-collection/run-task) API endpoint instead. 
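To make the input override concrete, here is a minimal sketch with Python's `requests` library. The task ID, token, and the `someInputField` property are placeholders, and the query parameters mirror the Get Dataset Items options mentioned above.

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"   # placeholder
TASK_ID = "janedoe~my-task"    # placeholder task ID or username~task-name

# POST /v2/actor-tasks/{actorTaskId}/run-sync-get-dataset-items
# The JSON body overrides selected task input fields; anything not set here
# falls back to the task defaults (or the Actor's input schema).
resp = requests.post(
    f"https://api.apify.com/v2/actor-tasks/{TASK_ID}/run-sync-get-dataset-items",
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
    params={"format": "json", "limit": 100},        # same options as Get Dataset Items
    json={"someInputField": "overridden value"},    # hypothetical input property
    timeout=360,   # the run itself must finish within 300 seconds
)
resp.raise_for_status()
items = resp.json()   # with format=json the body is an array of dataset items
print(len(items))
```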
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 * 400 **Response Headers** * **X-Apify-Pagination-Offset** **X-Apify-Pagination-Limit** **X-Apify-Pagination-Count** **X-Apify-Pagination-Total** **Response Headers** --- # Run task synchronously ``` POST /v2/actor-tasks/:actorTaskId/run-sync ``` Runs an Actor task and synchronously returns its output. The run must finish in 300 seconds, otherwise the HTTP request fails with a timeout error (this won't abort the run itself). Optionally, you can override the Actor input configuration by passing a JSON object as the POST payload and setting the `Content-Type: application/json` HTTP header. Note that if the object in the POST payload does not define a particular input property, the Actor run uses the default value defined by the task (or the Actor's input schema if not defined by the task). Beware that it might be impossible to maintain an idle HTTP connection for an extended period, due to client timeout or network conditions. Make sure your HTTP client is configured to have a long enough connection timeout. If the connection breaks, you will not receive any information about the run and its status. Input fields from the Actor task configuration can be overridden with values passed as the POST payload. Just make sure to set the `Content-Type` header to `application/json` and that the input is an object. To run the task asynchronously, use the [Run task](#/reference/actor-tasks/run-collection/run-task) API endpoint instead. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 * 400 **Response Headers** **Response Headers** --- # Get list of task runs ``` GET /v2/actor-tasks/:actorTaskId/runs ``` Gets a list of runs of a specific task. The response is a list of objects, where each object contains essential information about a single task run. The endpoint supports pagination using the `limit` and `offset` parameters, and it does not return more than 1000 array elements. By default, the records are sorted by the `startedAt` field in ascending order; therefore you can use pagination to incrementally fetch all records while new ones are still being created. To sort the records in descending order, use the `desc=1` parameter. You can also filter runs by status ([available statuses](https://docs.apify.com/platform/actors/running/runs-and-builds#lifecycle)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get last run ``` GET /v2/actor-tasks/:actorTaskId/runs/last ``` This is not a single endpoint, but an entire group of endpoints that lets you retrieve and manage the last run of a given Actor task or any of its default storages. All the endpoints require an authentication token. The endpoints accept the same HTTP methods and query parameters as the respective storage endpoints. The base path that represents the last Actor task run object is: `/v2/actor-tasks/{actorTaskId}/runs/last{?token,status}` Using the `status` query parameter, you can ensure that you only get a run with a certain status (e.g. `status=SUCCEEDED`). The output of this endpoint and other query parameters are the same as in the [Run object](https://docs.apify.com/api/v2/actor-run-get.md) endpoint. In order to access the default storages of the last Actor task run, i.e. 
log, key-value store, dataset and request queue, use the following endpoints: * `/v2/actor-tasks/{actorTaskId}/runs/last/log{?token,status}` * `/v2/actor-tasks/{actorTaskId}/runs/last/key-value-store{?token,status}` * `/v2/actor-tasks/{actorTaskId}/runs/last/dataset{?token,status}` * `/v2/actor-tasks/{actorTaskId}/runs/last/request-queue{?token,status}` These API endpoints have the same usage as the equivalent storage endpoints. For example, `/v2/actor-tasks/{actorTaskId}/runs/last/key-value-store` has the same HTTP method and parameters as the [Key-value store object](https://docs.apify.com/api/v2/storage-key-value-stores.md) endpoint. Additionally, each of the above API endpoints supports all sub-endpoints of the original one: ##### Storage endpoints[​](#storage-endpoints "Direct link to Storage endpoints") * [Dataset - introduction](https://docs.apify.com/api/v2/storage-datasets.md) * [Key-value store - introduction](https://docs.apify.com/api/v2/storage-key-value-stores.md) * [Request queue - introduction](https://docs.apify.com/api/v2/storage-request-queues.md) For example, to download data from a dataset of the last succeeded Actor task run in XML format, send an HTTP GET request to the following URL: ``` https://api.apify.com/v2/actor-tasks/{actorTaskId}/runs/last/dataset/items?token={yourApiToken}&format=xml&status=SUCCEEDED ``` In order to save new items to the dataset, send an HTTP POST request with a JSON payload to the same URL. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Run task ``` POST /v2/actor-tasks/:actorTaskId/runs ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/TaskClientAsync#call)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskClient#start)Runs an Actor task and immediately returns without waiting for the run to finish. Optionally, you can override the Actor input configuration by passing a JSON object as the POST payload and setting the `Content-Type: application/json` HTTP header. Note that if the object in the POST payload does not define a particular input property, the Actor run uses the default value defined by the task (or the Actor's input schema if not defined by the task). The response is the Actor run object as returned by the [Get run](#/reference/actor-runs/run-object-and-its-storages/get-run) endpoint. If you want to wait for the run to finish and receive the actual output of the Actor run as the response, use one of the [Run task synchronously](#/reference/actor-tasks/run-task-synchronously) API endpoints instead. To fetch the Actor run results that are typically stored in the default dataset, you'll need to pass the ID from the `defaultDatasetId` field in the response JSON to the [Get items](#/reference/datasets/item-collection/get-items) API endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Get list of webhooks ``` GET /v2/actor-tasks/:actorTaskId/webhooks ``` Gets the list of webhooks of a specific Actor task. 
The response is a JSON with the list of objects, where each object contains basic information about a single webhook. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 records. By default, the records are sorted by the `createdAt` field in ascending order. To sort the records in descending order, use the `desc=1` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Actor tasks - Introduction The API endpoints described in this section enable you to create, manage, delete, and run Apify Actor tasks. For more information, see the [Actor tasks documentation](https://docs.apify.com/platform/actors/running/tasks). note For all the API endpoints that accept the `actorTaskId` parameter to specify a task, you can pass either the task ID (e.g. `HG7ML7M8z78YcAPEB`) or a tilde-separated username of the task's owner and the task's name (e.g. `janedoe~my-task`). Some of the API endpoints return run objects. If any such run object contains usage in dollars, the amount is computed from your effective unit pricing at the time of the query, and it should therefore be used only for informative purposes. You can learn more about platform usage in the [documentation](https://docs.apify.com/platform/actors/running/usage-and-resources#usage). ## [Get list of tasks](https://docs.apify.com/api/v2/actor-tasks-get.md) [`/actor-tasks`](https://docs.apify.com/api/v2/actor-tasks-get.md) ## [Create task](https://docs.apify.com/api/v2/actor-tasks-post.md) [`/actor-tasks`](https://docs.apify.com/api/v2/actor-tasks-post.md) ## [Get task](https://docs.apify.com/api/v2/actor-task-get.md) [`/actor-tasks/{actorTaskId}`](https://docs.apify.com/api/v2/actor-task-get.md) ## [Update task](https://docs.apify.com/api/v2/actor-task-put.md) [`/actor-tasks/{actorTaskId}`](https://docs.apify.com/api/v2/actor-task-put.md) ## [Delete task](https://docs.apify.com/api/v2/actor-task-delete.md) [`/actor-tasks/{actorTaskId}`](https://docs.apify.com/api/v2/actor-task-delete.md) ## [Get task input](https://docs.apify.com/api/v2/actor-task-input-get.md) [`/actor-tasks/{actorTaskId}/input`](https://docs.apify.com/api/v2/actor-task-input-get.md) ## [Update task input](https://docs.apify.com/api/v2/actor-task-input-put.md) [`/actor-tasks/{actorTaskId}/input`](https://docs.apify.com/api/v2/actor-task-input-put.md) ## [Get list of webhooks](https://docs.apify.com/api/v2/actor-task-webhooks-get.md) [`/actor-tasks/{actorTaskId}/webhooks`](https://docs.apify.com/api/v2/actor-task-webhooks-get.md) ## [Get list of task runs](https://docs.apify.com/api/v2/actor-task-runs-get.md) [`/actor-tasks/{actorTaskId}/runs`](https://docs.apify.com/api/v2/actor-task-runs-get.md) ## [Run task](https://docs.apify.com/api/v2/actor-task-runs-post.md) [`/actor-tasks/{actorTaskId}/runs`](https://docs.apify.com/api/v2/actor-task-runs-post.md) ## [Run task synchronously](https://docs.apify.com/api/v2/actor-task-run-sync-get.md) [`/actor-tasks/{actorTaskId}/run-sync`](https://docs.apify.com/api/v2/actor-task-run-sync-get.md) ## [Run task synchronously](https://docs.apify.com/api/v2/actor-task-run-sync-post.md) [`/actor-tasks/{actorTaskId}/run-sync`](https://docs.apify.com/api/v2/actor-task-run-sync-post.md) ## [Run task synchronously and get dataset items](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-get.md) 
[`/actor-tasks/{actorTaskId}/run-sync-get-dataset-items`](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-get.md) ## [Run task synchronously and get dataset items](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md) [`/actor-tasks/{actorTaskId}/run-sync-get-dataset-items`](https://docs.apify.com/api/v2/actor-task-run-sync-get-dataset-items-post.md) ## [Get last run](https://docs.apify.com/api/v2/actor-task-runs-last-get.md) [`/actor-tasks/{actorTaskId}/runs/last`](https://docs.apify.com/api/v2/actor-task-runs-last-get.md) --- # Get list of tasks ``` GET /v2/actor-tasks ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/TaskCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskCollectionClient#list)Gets the complete list of tasks that a user has created or used. The response is a list of objects in which each object contains essential information about a single task. The endpoint supports pagination using the `limit` and `offset` parameters, and it does not return more than a 1000 records. By default, the records are sorted by the `createdAt` field in ascending order; therefore you can use pagination to incrementally fetch all tasks while new ones are still being created. To sort the records in descending order, use the `desc=1` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create task ``` POST /v2/actor-tasks ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/TaskCollectionClientAsync#create)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/TaskCollectionClient#create)Create a new task with settings specified by the object passed as JSON in the POST payload. The response is the full task object as returned by the [Get task](#/reference/tasks/task-object/get-task) endpoint. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Actors - Introduction The API endpoints in this section allow you to manage Apify Actors. For more details about Actors, refer to the [Actor documentation](https://docs.apify.com/platform/actors). For API endpoints that require the `actorId` parameter to identify an Actor, you can provide either: * The Actor ID (e.g., `HG7ML7M8z78YcAPEB`), or * A tilde-separated combination of the Actor owner's username and the Actor name (e.g., `janedoe~my-actor`). 
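For illustration, either addressing form can be used interchangeably in the endpoints listed below; a minimal sketch with Python's `requests` library (the token is a placeholder, and the two example identifiers are the ones used above):

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"   # placeholder
headers = {"Authorization": f"Bearer {API_TOKEN}"}

# GET /v2/acts/{actorId} accepts either form of the Actor identifier:
requests.get("https://api.apify.com/v2/acts/HG7ML7M8z78YcAPEB", headers=headers)  # Actor ID
requests.get("https://api.apify.com/v2/acts/janedoe~my-actor", headers=headers)   # username~actor-name
```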
## [Get list of Actors](https://docs.apify.com/api/v2/acts-get.md) [`/acts`](https://docs.apify.com/api/v2/acts-get.md) ## [Create Actor](https://docs.apify.com/api/v2/acts-post.md) [`/acts`](https://docs.apify.com/api/v2/acts-post.md) ## [Get Actor](https://docs.apify.com/api/v2/act-get.md) [`/acts/{actorId}`](https://docs.apify.com/api/v2/act-get.md) ## [Update Actor](https://docs.apify.com/api/v2/act-put.md) [`/acts/{actorId}`](https://docs.apify.com/api/v2/act-put.md) ## [Delete Actor](https://docs.apify.com/api/v2/act-delete.md) [`/acts/{actorId}`](https://docs.apify.com/api/v2/act-delete.md) --- # Actor builds - Introduction The API endpoints in this section allow you to manage your Apify Actor builds. ## [Get list of builds](https://docs.apify.com/api/v2/act-builds-get.md) [`/acts/{actorId}/builds`](https://docs.apify.com/api/v2/act-builds-get.md) ## [Build Actor](https://docs.apify.com/api/v2/act-builds-post.md) [`/acts/{actorId}/builds`](https://docs.apify.com/api/v2/act-builds-post.md) ## [Get default build](https://docs.apify.com/api/v2/act-build-default-get.md) [`/acts/{actorId}/builds/default`](https://docs.apify.com/api/v2/act-build-default-get.md) ## [Get OpenAPI definition](https://docs.apify.com/api/v2/act-openapi-json-get.md) [`/acts/{actorId}/builds/{buildId}/openapi.json`](https://docs.apify.com/api/v2/act-openapi-json-get.md) ## [Get build](https://docs.apify.com/api/v2/act-build-get.md) [`/acts/{actorId}/builds/{buildId}`](https://docs.apify.com/api/v2/act-build-get.md) ## [Abort build](https://docs.apify.com/api/v2/act-build-abort-post.md) [`/acts/{actorId}/builds/{buildId}/abort`](https://docs.apify.com/api/v2/act-build-abort-post.md) --- # Actor runs - Introduction The API endpoints in this section allow you to manage your Apify Actor runs. Some API endpoints return run objects. If a run object includes usage costs in dollars, note that these values are calculated based on your effective unit pricing at the time of the query. As a result, the dollar amounts should be treated as informational only and not as exact figures. For more information about platform usage and resource calculations, see the [Usage and Resources documentation](https://docs.apify.com/platform/actors/running/usage-and-resources#usage). 
## [Get list of runs](https://docs.apify.com/api/v2/act-runs-get.md) [`/acts/{actorId}/runs`](https://docs.apify.com/api/v2/act-runs-get.md) ## [Run Actor](https://docs.apify.com/api/v2/act-runs-post.md) [`/acts/{actorId}/runs`](https://docs.apify.com/api/v2/act-runs-post.md) ## [Run Actor synchronously with input and return output](https://docs.apify.com/api/v2/act-run-sync-post.md) [`/acts/{actorId}/run-sync`](https://docs.apify.com/api/v2/act-run-sync-post.md) ## [Without input](https://docs.apify.com/api/v2/act-run-sync-get.md) [`/acts/{actorId}/run-sync`](https://docs.apify.com/api/v2/act-run-sync-get.md) ## [Run Actor synchronously with input and get dataset items](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md) [`/acts/{actorId}/run-sync-get-dataset-items`](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-post.md) ## [Run Actor synchronously without input and get dataset items](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-get.md) [`/acts/{actorId}/run-sync-get-dataset-items`](https://docs.apify.com/api/v2/act-run-sync-get-dataset-items-get.md) ## [Resurrect run](https://docs.apify.com/api/v2/act-run-resurrect-post.md) [`/acts/{actorId}/runs/{runId}/resurrect`](https://docs.apify.com/api/v2/act-run-resurrect-post.md) ## [Get last run](https://docs.apify.com/api/v2/act-runs-last-get.md) [`/acts/{actorId}/runs/last`](https://docs.apify.com/api/v2/act-runs-last-get.md) ## [Get run](https://docs.apify.com/api/v2/act-run-get.md) [`/acts/{actorId}/runs/{runId}`](https://docs.apify.com/api/v2/act-run-get.md) ## [Abort run](https://docs.apify.com/api/v2/act-run-abort-post.md) [`/acts/{actorId}/runs/{runId}/abort`](https://docs.apify.com/api/v2/act-run-abort-post.md) ## [Metamorph run](https://docs.apify.com/api/v2/act-run-metamorph-post.md) [`/acts/{actorId}/runs/{runId}/metamorph`](https://docs.apify.com/api/v2/act-run-metamorph-post.md) --- # Actor versions - Introduction The API endpoints in this section allow you to manage your Apify Actor versions.

* The version object contains the source code of a specific version of an Actor.
* The `sourceType` property indicates where the source code is hosted, and based on its value the Version object has the following additional property:

| **Value** | **Description** |
| --- | --- |
| `"SOURCE_FILES"` | Source code is comprised of multiple files specified in the `sourceFiles` array. Each item of the array is an object with the following fields: `name` (file path and name), `format` (format of the content, either `"TEXT"` or `"BASE64"`) and `content` (file content). Source files can be shown and edited in the Apify Console's Web IDE. |
| `"GIT_REPO"` | Source code is cloned from a Git repository, whose URL is specified in the `gitRepoUrl` field. |
| `"TARBALL"` | Source code is downloaded using a tarball or Zip file from a URL specified in the `tarballUrl` field. |
| `"GITHUB_GIST"` | Source code is taken from a GitHub Gist, whose URL is specified in the `gitHubGistUrl` field. |
For more information about source code and Actor versions, check out [Source code](https://docs.apify.com/platform/actors/development/actor-definition/source-code) in the Actors documentation. ## [Get list of versions](https://docs.apify.com/api/v2/act-versions-get.md) [`/acts/{actorId}/versions`](https://docs.apify.com/api/v2/act-versions-get.md) ## [Create version](https://docs.apify.com/api/v2/act-versions-post.md) [`/acts/{actorId}/versions`](https://docs.apify.com/api/v2/act-versions-post.md) ## [Get version](https://docs.apify.com/api/v2/act-version-get.md) [`/acts/{actorId}/versions/{versionNumber}`](https://docs.apify.com/api/v2/act-version-get.md) ## [Update version](https://docs.apify.com/api/v2/act-version-put.md) [`/acts/{actorId}/versions/{versionNumber}`](https://docs.apify.com/api/v2/act-version-put.md) ## [Delete version](https://docs.apify.com/api/v2/act-version-delete.md) [`/acts/{actorId}/versions/{versionNumber}`](https://docs.apify.com/api/v2/act-version-delete.md) ## [Get list of environment variables](https://docs.apify.com/api/v2/act-version-env-vars-get.md) [`/acts/{actorId}/versions/{versionNumber}/env-vars`](https://docs.apify.com/api/v2/act-version-env-vars-get.md) ## [Create environment variable](https://docs.apify.com/api/v2/act-version-env-vars-post.md) [`/acts/{actorId}/versions/{versionNumber}/env-vars`](https://docs.apify.com/api/v2/act-version-env-vars-post.md) ## [Get environment variable](https://docs.apify.com/api/v2/act-version-env-var-get.md) [`/acts/{actorId}/versions/{versionNumber}/env-vars/{envVarName}`](https://docs.apify.com/api/v2/act-version-env-var-get.md) ## [Update environment variable](https://docs.apify.com/api/v2/act-version-env-var-put.md) [`/acts/{actorId}/versions/{versionNumber}/env-vars/{envVarName}`](https://docs.apify.com/api/v2/act-version-env-var-put.md) ## [Delete environment variable](https://docs.apify.com/api/v2/act-version-env-var-delete.md) [`/acts/{actorId}/versions/{versionNumber}/env-vars/{envVarName}`](https://docs.apify.com/api/v2/act-version-env-var-delete.md) --- # Webhook collection - Introduction The API endpoint in this section allows you to get a list of webhooks of a specific Actor. ## [Get list of webhooks](https://docs.apify.com/api/v2/act-webhooks-get.md) [`/acts/{actorId}/webhooks`](https://docs.apify.com/api/v2/act-webhooks-get.md) --- # Get list of Actors ``` GET /v2/acts ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorCollectionClient#list)Gets the list of all Actors that the user created or used. The response is a list of objects, where each object contains basic information about a single Actor. To only get Actors created by the user, add the `my=1` query parameter. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 records. By default, the records are sorted by the `createdAt` field in ascending order, therefore you can use pagination to incrementally fetch all Actors while new ones are still being created. To sort the records in descending order, use the `desc=1` parameter. 
You can also sort by your last run by using the `sortBy=stats.lastRunStartedAt` query parameter. In this case, descending order means the most recently run Actor appears first. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create Actor ``` POST /v2/acts ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ActorCollectionClientAsync#create)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ActorCollectionClient#create)Creates a new Actor with settings specified in an Actor object passed as JSON in the POST payload. The response is the full Actor object as returned by the [Get Actor](#/reference/actors/actor-object/get-actor) endpoint. The HTTP request must have the `Content-Type: application/json` HTTP header! The Actor needs to define at least one version of the source code. For more information, see [Version object](#/reference/actors/version-object). If you want to make your Actor [public](https://docs.apify.com/platform/actors/publishing) using `isPublic: true`, you will need to provide the Actor's `title` and the `categories` under which that Actor will be classified in Apify Store. For this, it's best to use the [constants from our `apify-shared-js` package](https://github.com/apify/apify-shared-js/blob/2d43ebc41ece9ad31cd6525bd523fb86939bf860/packages/consts/src/consts.ts#L452-L471). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Delete dataset ``` DELETE /v2/datasets/:datasetId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/DatasetClient#delete)Deletes a specific dataset. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get dataset ``` GET /v2/datasets/:datasetId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/DatasetClient#get)Returns dataset object for given dataset ID. note Keep in mind that attributes `itemCount` and `cleanItemCount` are not propagated right away after data are pushed into a dataset. There is a short period (up to 5 seconds) during which these counters may not match with exact counts in dataset items. 
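The propagation delay mentioned in the note matters if you read these counters right after pushing data; here is a minimal sketch with Python's `requests` library (the dataset ID and token are placeholders, and the `data` response envelope is an assumption based on the usual Apify API response shape):

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"     # placeholder
DATASET_ID = "YOUR_DATASET_ID"   # placeholder

# GET /v2/datasets/{datasetId} returns the dataset object; itemCount and
# cleanItemCount may lag a few seconds behind recently pushed items.
resp = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()
dataset = resp.json()["data"]    # assumed "data" envelope
print(dataset["itemCount"], dataset["cleanItemCount"])
```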
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get items ``` GET /v2/datasets/:datasetId/items ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync#stream_items)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/DatasetClient#listItems)Returns data stored in the dataset in a desired format. ### Response format[​](#response-format "Direct link to Response format") The format of the response depends on the `format` query parameter. The `format` parameter can have one of the following values: `json`, `jsonl`, `xml`, `html`, `csv`, `xlsx` and `rss`. The following table describes how each format is treated.

| Format | Items |
| --- | --- |
| `json`, `jsonl`, `xml` | The response is a JSON, JSONL or XML array of raw item objects. |
| `html`, `csv`, `xlsx` | The response is an HTML, CSV or XLSX table, where columns correspond to the properties of the item and rows correspond to each dataset item. |
| `rss` | The response is an RSS file. Each item is displayed as child elements of one `<item>` element. |

Note that CSV, XLSX and HTML tables are limited to 2000 columns and the column names cannot be longer than 200 characters. JSON, XML and RSS formats do not have such restrictions. ### Hidden fields[​](#hidden-fields "Direct link to Hidden fields") The top-level fields starting with the `#` character are considered hidden. These are useful to store debugging information and can be omitted from the output by providing the `skipHidden=1` or `clean=1` query parameters. For example, if you store the following object to the dataset:

```
{
    "productName": "iPhone Xs",
    "description": "Welcome to the big screens.",
    "#debug": {
        "url": "https://www.apple.com/lae/iphone-xs/",
        "crawledAt": "2019-01-21T16:06:03.683Z"
    }
}
```

The `#debug` field will be considered hidden and can be omitted from the results. This is useful to provide nice cleaned data to end users, while keeping debugging info available if needed. The Dataset object returned by the API contains the number of such clean items in the `dataset.cleanItemCount` property. ### XML format extension[​](#xml-format-extension "Direct link to XML format extension") When exporting results to XML or RSS formats, the names of object properties become XML tags and the corresponding values become the tags' children. For example, the following JavaScript object:

```
{
    name: "Paul Newman",
    address: [
        { type: "home", street: "21st", city: "Chicago" },
        { type: "office", street: null, city: null }
    ]
}
```

will be transformed to the following XML snippet:

```
<name>Paul Newman</name>
<address>
    <type>home</type>
    <street>21st</street>
    <city>Chicago</city>
</address>
<address>
    <type>office</type>
    <street/>
    <city/>
</address>
```

If the JavaScript object contains a property named `@` then its sub-properties are exported as attributes of the parent XML element. If the parent XML element does not have any child elements then its value is taken from a JavaScript object property named `#`. 
For example, the following JavaScript object:

```
{
    "address": [
        { "@": { "type": "home" }, "street": "21st", "city": "Chicago" },
        { "@": { "type": "office" }, "#": "unknown" }
    ]
}
```

will be transformed to the following XML snippet:

```
<address type="home">
    <street>21st</street>
    <city>Chicago</city>
</address>
<address type="office">unknown</address>
```

This feature is also useful to customize your RSS feeds generated for various websites. By default, the whole result is wrapped in an `<items>` element and each page object is wrapped in an `<item>` element. You can change this using the `xmlRoot` and `xmlRow` URL parameters. ### Pagination[​](#pagination "Direct link to Pagination") The generated response supports [pagination](#/introduction/pagination). The pagination is always performed with the granularity of a single item, regardless of whether the `unwind` parameter was provided. By default, the **Items** in the response are sorted by the time they were stored to the database, therefore you can use pagination to incrementally fetch the items as they are being added. No limit exists to how many items can be returned in one response. If you specify the `desc=1` query parameter, the results are returned in the reverse of the order in which they were stored (i.e. from newest to oldest items). Note that only the order of **Items** is reversed, but not the order of the `unwind` array elements. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** * **X-Apify-Pagination-Offset** **X-Apify-Pagination-Limit** **X-Apify-Pagination-Count** **X-Apify-Pagination-Total** --- # Store items ``` POST /v2/datasets/:datasetId/items ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync#push_items)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/DatasetClient#pushItems)Appends an item or an array of items to the end of the dataset. The POST payload is a JSON object or a JSON array of objects to save into the dataset. If the data you attempt to store in the dataset is invalid (meaning any of the items received by the API fails the validation), the whole request is discarded and the API will return a response with status code 400. For more information about dataset schema validation, see [Dataset schema](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation). **IMPORTANT:** The maximum request payload size for the dataset is 5 MB. If your array exceeds this size, you'll need to split it into several smaller arrays. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 * 400 **Response Headers** * **Location** **Response Headers** --- # Update dataset ``` PUT /v2/datasets/:datasetId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/DatasetClient#update)Updates a dataset's name using a value specified by a JSON object passed in the PUT payload. 
The response is the updated dataset object, as returned by the [Get dataset](#/reference/datasets/dataset-collection/get-dataset) API endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get dataset statistics ``` GET /v2/datasets/:datasetId/statistics ``` Returns statistics for a given dataset. Provides only [field statistics](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation#dataset-field-statistics). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 --- # Get list of datasets ``` GET /v2/datasets ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/DatasetCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/DatasetCollectionClient#list)Lists all of a user's datasets. The response is a JSON array of objects, where each object contains basic information about one dataset. By default, the objects are sorted by the `createdAt` field in ascending order, therefore you can use pagination to incrementally fetch all datasets while new ones are still being created. To sort them in descending order, use the `desc=1` parameter. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 array elements. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create dataset ``` POST /v2/datasets ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/DatasetCollectionClientAsync#get_or_create)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/DatasetCollectionClient#getOrCreate)Creates a dataset and returns its object. Keep in mind that data stored under an unnamed dataset follows the [data retention period](https://docs.apify.com/platform/storage#data-retention). If the `name` parameter is provided, it creates a dataset with that name. If a dataset with the given name already exists, the endpoint returns its object instead. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Getting started with Apify API The Apify API provides programmatic access to the [Apify platform](https://docs.apify.com). The API is organized around [RESTful](https://en.wikipedia.org/wiki/Representational_state_transfer) HTTP endpoints. The basic workflow when using the Apify API is the following: 1. Your application communicates with the Apify API by sending requests to run Actors and receiving results back. 2. When you request to run an Actor, the Apify API creates and manages an Actor run instance on the platform. 3. 
The Actor processes data and stores results in Apify's storage systems: * **Dataset**: Structured storage optimized for tabular or list-type data, ideal for scraped items or processed results. * **Key-Value Store**: Flexible storage for various data types (including images, JSON, HTML, and text), perfect for configuration settings and non-tabular outputs. ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") Before you can start using the API, check if you have all the necessary prerequisites: * An Apify account with an API token. * A tool to make HTTP requests (cURL, Postman, or your preferred programming language). ## Authentication[​](#authentication "Direct link to Authentication") You must authenticate all API requests presented on this page. You can authenticate using your API token: ``` Authorization: Bearer YOUR_API_TOKEN ``` You can find your API token in the Apify Console under **[Settings > Integrations](https://console.apify.com/settings/integrations)**. ### Verify your account[​](#verify-your-account "Direct link to Verify your account") To check your API credentials or account details: Endpoint ``` GET https://api.apify.com/v2/users/me ``` Expected response codes: * `200` ## Basic workflow[​](#basic-workflow "Direct link to Basic workflow") The most common workflow involving Apify API consists of the following steps: 1. Running an Actor. 2. Retrieving the results. ### 1. Run an Actor[​](#1-run-an-actor "Direct link to 1. Run an Actor") #### Synchronously[​](#synchronously "Direct link to Synchronously") For shorter runs where you need immediate results: Endpoint ``` POST https://api.apify.com/v2/acts/:actorId/run-sync ``` Expected response codes: * `201` * `400` * `408` #### Asynchronously[​](#asynchronously "Direct link to Asynchronously") For longer-running operations or when you don't need immediate results. Endpoint ``` POST https://api.apify.com/v2/acts/:actorId/runs ``` Expected response codes: * `201` ### 2. Retrieve results[​](#2-retrieve-results "Direct link to 2. Retrieve results") #### From a Dataset[​](#from-a-dataset "Direct link to From a Dataset") Most Actors store their results in a dataset: Endpoint ``` GET https://api.apify.com/v2/datasets/:datasetId/items ``` Optional query parameters: * `format=json` (default), other possible formats are: * jsonl * xml * html * csv * xlsx * rss * `limit=100` (number of items to retrieve) * `offset=0` (pagination offset) Expected response codes: * `200` #### From a Key-value store[​](#from-a-key-value-store "Direct link to From a Key-value store") Endpoint ``` GET https://api.apify.com/v2/key-value-stores/:storeId/records/:recordKey ``` Expected response codes: * `200` * `302` ### Additional operations[​](#additional-operations "Direct link to Additional operations") #### Get log[​](#get-log "Direct link to Get log") You can get a log for a specific run or build of an Actor. Endpoint ``` GET https://api.apify.com/v2/logs/:buildOrRunId ``` Expected response codes: * `200` #### Monitor run status[​](#monitor-run-status "Direct link to Monitor run status") Endpoint ``` GET https://api.apify.com/v2/actor-runs/:runId ``` Expected response codes: * `200` #### Store data in Dataset[​](#store-data-in-dataset "Direct link to Store data in Dataset") To store your own data in a Dataset: Endpoint ``` POST https://api.apify.com/v2/datasets/:datasetId/items ``` If any item in the request fails validation, the entire request will be rejected. 
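As a minimal sketch of the call above with Python's `requests` library (the dataset ID, token, and item fields are placeholders), the payload is a JSON array of objects:

```python
import requests

API_TOKEN = "YOUR_API_TOKEN"     # placeholder
DATASET_ID = "YOUR_DATASET_ID"   # placeholder

items = [
    {"url": "https://example.com", "title": "Example"},     # hypothetical items
    {"url": "https://example.org", "title": "Example 2"},
]

# POST /v2/datasets/{datasetId}/items appends the items to the dataset.
resp = requests.post(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
    json=items,
)
print(resp.status_code)   # 201 on success, 400 if any item fails validation
```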
Expected response codes: * `201` * `400` #### Store data in Key-value store[​](#store-data-in-key-value-store "Direct link to Store data in Key-value store") To store your own data in a Key-value store: Endpoint ``` PUT https://api.apify.com/v2/key-value-stores/:storeId/records/:recordKey ``` Include your data in the request body and set the appropriate `Content-Type` header. Expected response codes: * `201` ## HTTP Status Code Descriptions[​](#http-status-code-descriptions "Direct link to HTTP Status Code Descriptions") ### `200` OK[​](#200-ok "Direct link to 200-ok") The request has succeeded. ### `201` Created[​](#201-created "Direct link to 201-created") The request has been fulfilled and a new resource has been created. ### `302` Found[​](#302-found "Direct link to 302-found") A redirection response indicating that the requested resource has been temporarily moved to a different URL. ### `400` Bad Request[​](#400-bad-request "Direct link to 400-bad-request") The server cannot process the request due to a client error, such as malformed request syntax, invalid request parameters, or an invalid data format. This occurs when: * The request body contains invalid data * Required parameters are missing * Data validation fails for Dataset items ### `408` Request Timeout[​](#408-request-timeout "Direct link to 408-request-timeout") The server timed out waiting for the request to complete. ## Next steps[​](#next-steps "Direct link to Next steps") * Explore more advanced API endpoints in our full [API reference](https://docs.apify.com/api/v2.md). * Learn about webhooks to get notified when your runs finish. * Check out Apify client libraries for the following programming languages: * [JavaScript](https://docs.apify.com/api/client/js) * [Python](https://docs.apify.com/api/client/python) --- # Delete store ``` DELETE /v2/key-value-stores/:storeId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#delete)Deletes a key-value store. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get store ``` GET /v2/key-value-stores/:storeId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#get)Gets an object that contains all the details about a specific key-value store. 
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get list of keys ``` GET /v2/key-value-stores/:storeId/keys ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync#list_keys)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#listKeys)Returns a list of objects describing keys of a given key-value store, as well as some information about the values (e.g. size). This endpoint is paginated using `exclusiveStartKey` and `limit` parameters * see [Pagination](https://docs.apify.com/api/v2.md#using-key) for more details. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update store ``` PUT /v2/key-value-stores/:storeId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#update)Updates a key-value store's name using a value specified by a JSON object passed in the PUT payload. The response is the updated key-value store object, as returned by the [Get store](#/reference/key-value-stores/store-object/get-store) API endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete record ``` DELETE /v2/key-value-stores/:storeId/records/:recordKey ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync#delete_record)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#deleteRecord)Removes a record specified by a key from the key-value store. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get record ``` GET /v2/key-value-stores/:storeId/records/:recordKey ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync#get_record)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#getRecord)Gets a value stored in the key-value store under a specific key. 
The response body has the same `Content-Encoding` header as was set in [Put record](#tag/Key-value-storesRecord/operation/keyValueStore_record_put). If the request does not define the `Accept-Encoding` HTTP header with the right encoding, the record will be decompressed. Most HTTP clients support decompression by default. When using such a client, the `Accept-Encoding` header is set by the client and the body is decompressed automatically. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 * 302 **Response Headers** **Response Headers** * **Location** --- # Check if a record exists ``` HEAD /v2/key-value-stores/:storeId/records/:recordKey ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync#record_exists)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#recordExists)Checks if a value is stored in the key-value store under a specific key. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 * 404 The record exists **Response Headers** The record does not exist **Response Headers** --- # Store record ``` PUT /v2/key-value-stores/:storeId/records/:recordKey ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync#set_record)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#setRecord)Stores a value under a specific key to the key-value store. The value is passed as the PUT payload and it is stored with a MIME content type defined by the `Content-Type` header and with encoding defined by the `Content-Encoding` header. To save bandwidth and storage and to speed up your upload, send the request payload compressed with Gzip compression and add the `Content-Encoding: gzip` header. It is possible to use another compression type by setting the `Content-Encoding` request header accordingly. Below is a list of supported `Content-Encoding` types. * Gzip compression: `Content-Encoding: gzip` * Deflate compression: `Content-Encoding: deflate` * Brotli compression: `Content-Encoding: br` ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Get list of key-value stores ``` GET /v2/key-value-stores ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreCollectionClient#list)Gets the list of key-value stores owned by the user.
The response is a list of objects, where each object contains basic information about a single key-value store. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 array elements. By default, the records are sorted by the `createdAt` field in ascending order, therefore you can use pagination to incrementally fetch all key-value stores while new ones are still being created. To sort the records in descending order, use the `desc=1` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create key-value store ``` POST /v2/key-value-stores ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreCollectionClientAsync#get_or_create)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreCollectionClient#getOrCreate)Creates a key-value store and returns its object. The response is the same object as returned by the [Get store](#/reference/key-value-stores/store-object/get-store) endpoint. Keep in mind that data stored under an unnamed store follows the [data retention period](https://docs.apify.com/platform/storage#data-retention). If the `name` parameter is used, it creates a store with the given name. If there is already another store with the same name, the endpoint does not create a new one and returns the existing object instead. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Get log ``` GET /v2/logs/:buildOrRunId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/LogClientAsync#stream)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/LogClient#stream)Retrieves logs for a specific Actor build or run. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Logs - Introduction The API endpoints described in this section are used to download the logs generated by Actor builds and runs. Note that only the trailing 5M characters of the log are stored; the rest is discarded. note Note that the endpoints do not require the authentication token, the calls are authenticated using a hard-to-guess ID of the Actor build or run.
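As an illustration, a minimal sketch of downloading a log via the Get log endpoint with `fetch`; the build or run ID is a placeholder, and per the note above no token is sent:

```
// Download the log of an Actor build or run (placeholder ID, no token required).
const buildOrRunId = 'HG7ML7M8z78YcAPEB'; // hypothetical run ID
const res = await fetch(`https://api.apify.com/v2/logs/${buildOrRunId}`);
const logText = await res.text(); // the log is returned as plain text
console.log(logText.slice(-1000)); // print the trailing part of the log
```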
## [Get log](https://docs.apify.com/api/v2/log-get.md) [`/logs/{buildOrRunId}`](https://docs.apify.com/api/v2/log-get.md) --- # Charge events in run ``` POST /v2/actor-runs/:runId/charge ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RunClientAsync#charge)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RunClient#charge)Charge for events in the run of your [pay per event Actor](https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event). The event you are charging for must be one of the configured events in your Actor. If the Actor is not set up as pay per event, or if the event is not configured, the endpoint will return an error. The endpoint must be called from the Actor run itself, with the same API token that the run was started with. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 The charge was successful. Note that you still have to make sure in your Actor that the total charge for the run respects the maximum value set by the user, as the API does not check this. Above the limit, the charges reported as successful in API will not be added to your payouts, but you will still bear the associated costs. Use the Apify charge manager or SDK to avoid having to deal with this manually. --- # Resurrect run ``` POST /v2/actor-runs/:runId/resurrect ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RunClientAsync#resurrect)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RunClient#resurrect)Resurrects a finished Actor run and returns an object that contains all the details about the resurrected run. Only finished runs, i.e. runs with status `FINISHED`, `FAILED`, `ABORTED` and `TIMED-OUT` can be resurrected. Run status will be updated to RUNNING and its container will be restarted with the same storages (the same behaviour as when the run gets migrated to the new server). For more information, see the [Actor docs](https://docs.apify.com/platform/actors/running/runs-and-builds#resurrection-of-finished-run). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete request queue ``` DELETE /v2/request-queues/:queueId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#delete)Deletes given queue. 
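As a sketch, deleting a queue with `fetch`; the queue ID and token are placeholders, and a successful call returns `204` with no body:

```
// Delete a request queue (placeholder queue ID and token).
const queueId = 'WkzbQMuFYuamGv3YF'; // hypothetical queue ID
const res = await fetch(`https://api.apify.com/v2/request-queues/${queueId}`, {
  method: 'DELETE',
  headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
});
console.log(res.status); // 204 on success
```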
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get request queue ``` GET /v2/request-queues/:queueId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#get)Returns queue object for given queue ID. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get head ``` GET /v2/request-queues/:queueId/head ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#list_head)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#listHead)Returns given number of first requests from the queue. The response contains the `hadMultipleClients` boolean field which indicates that the queue was accessed by more than one client (with unique or empty `clientKey`). This field is used by [Apify SDK](https://sdk.apify.com) to determine whether the local cache is consistent with the request queue, and thus optimize performance of certain operations. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get head and lock ``` POST /v2/request-queues/:queueId/head/lock ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#list_and_lock_head)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#listAndLockHead)Returns the given number of first requests from the queue and locks them for the given time. If this endpoint locks the request, no other client or run will be able to get and lock these requests. The response contains the `hadMultipleClients` boolean field which indicates that the queue was accessed by more than one client (with unique or empty `clientKey`). 
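A minimal sketch of locking the queue head with `fetch`; the queue ID and token are placeholders, and the `lockSecs` and `limit` query parameters as well as the `items` field in the response are assumptions based on the official clients, not confirmed by this page:

```
// Fetch and lock the first requests of a queue (placeholder ID and token).
// The `lockSecs`/`limit` query parameters and the `items` response field are assumed.
const queueId = 'WkzbQMuFYuamGv3YF';
const res = await fetch(
  `https://api.apify.com/v2/request-queues/${queueId}/head/lock?lockSecs=60&limit=25`,
  { method: 'POST', headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` } },
);
const { data } = await res.json();
console.log(data.hadMultipleClients, data.items?.length);
```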
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update request queue ``` PUT /v2/request-queues/:queueId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#update)Updates a request queue's name using a value specified by a JSON object passed in the PUT payload. The response is the updated request queue object, as returned by the [Get request queue](#/reference/request-queues/queue-collection/get-request-queue) API endpoint. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete request ``` DELETE /v2/request-queues/:queueId/requests/:requestId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#delete)Deletes given request from queue. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get request ``` GET /v2/request-queues/:queueId/requests/:requestId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#get)Returns request from queue. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete request lock ``` DELETE /v2/request-queues/:queueId/requests/:requestId/lock ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#delete_request_lock)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#deleteRequestLock)Deletes a request lock. The request lock can be deleted only by the client that has locked it using [Get and lock head operation](#/reference/request-queues/queue-head-with-locks). 
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Prolong request lock ``` PUT /v2/request-queues/:queueId/requests/:requestId/lock ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#prolong_request_lock)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#prolongRequestLock)Prolongs request lock. The request lock can be prolonged only by the client that has locked it using [Get and lock head operation](#/reference/request-queues/queue-head-with-locks). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update request ``` PUT /v2/request-queues/:queueId/requests/:requestId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#update)Updates a request in a queue. Mark request as handled by setting `request.handledAt = new Date()`. If `handledAt` is set, the request will be removed from head of the queue (and unlocked, if applicable). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete requests ``` DELETE /v2/request-queues/:queueId/requests/batch ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#batch_delete_requests)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#batchDeleteRequests)Batch-deletes given requests from the queue. The number of requests in a batch is limited to 25. The response contains an array of unprocessed and processed requests. If any delete operation fails because the request queue rate limit is exceeded or an internal failure occurs, the failed request is returned in the `unprocessedRequests` response parameter. You can re-send these delete requests. It is recommended to use an exponential backoff algorithm for these retries. Each request is identified by its ID or uniqueKey parameter. You can use either of them to identify the request. 
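A sketch of a batch delete with `fetch`; the queue ID, request identifiers, and token are placeholders, and the payload shape (a JSON array of objects carrying an `id` or `uniqueKey`) is an assumption based on the description above:

```
// Batch-delete up to 25 requests from a queue (placeholder IDs and token).
// The payload shape (array of { id } or { uniqueKey } objects) is assumed.
const queueId = 'WkzbQMuFYuamGv3YF';
const res = await fetch(`https://api.apify.com/v2/request-queues/${queueId}/requests/batch`, {
  method: 'DELETE',
  headers: {
    Authorization: `Bearer ${process.env.APIFY_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify([{ id: 'dnjkDMKLmdlkmlkmld' }, { uniqueKey: 'https://example.com/page' }]),
});
console.log(res.ok); // failed deletions should be retried with exponential backoff
```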
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Add requests ``` POST /v2/request-queues/:queueId/requests/batch ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#batch_add_requests)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#batchAddRequests)Adds requests to the queue in a batch. The number of requests in a batch is limited to 25. The response contains an array of unprocessed and processed requests. If any add operation fails because the request queue rate limit is exceeded or an internal failure occurs, the failed request is returned in the `unprocessedRequests` response parameter. You can re-send these requests. It is recommended to use an exponential backoff algorithm for these retries. If a request with the same `uniqueKey` was already present in the queue, the endpoint returns the ID of the existing request. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** --- # List requests ``` GET /v2/request-queues/:queueId/requests ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#list_requests)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#paginateRequests)Returns a list of requests. This endpoint is paginated using the `exclusiveStartId` and `limit` parameters. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Add request ``` POST /v2/request-queues/:queueId/requests ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#add_request)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#addRequest)Adds a request to the queue. The response contains the ID of the request and information on whether it was already present in the queue or already handled. If a request with the same `uniqueKey` is already present in the queue, the endpoint returns the ID of the existing request.
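A minimal sketch of adding a single request with `fetch`; the queue ID and token are placeholders, and the response field names shown are assumptions matching the description above:

```
// Add a single request to the queue (placeholder queue ID and token).
const queueId = 'WkzbQMuFYuamGv3YF';
const res = await fetch(`https://api.apify.com/v2/request-queues/${queueId}/requests`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.APIFY_TOKEN}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://example.com', uniqueKey: 'https://example.com' }),
});
const { data } = await res.json();
// `wasAlreadyPresent`/`wasAlreadyHandled` are assumed field names for the returned info.
console.log(data.requestId, data.wasAlreadyPresent, data.wasAlreadyHandled);
```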
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** --- # Unlock requests ``` POST /v2/request-queues/:queueId/requests/unlock ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync#unlock_requests)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#unlockRequests)Unlocks requests in the queue that are currently locked by the client. * If the client is within an Actor run, it unlocks all requests locked by that specific run plus all requests locked by the same `clientKey`. * If the client is outside of an Actor run, it unlocks all requests locked using the same `clientKey`. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 Number of requests that were unlocked --- # Get list of request queues ``` GET /v2/request-queues ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueCollectionClient#list)Lists all of a user's request queues. The response is a JSON array of objects, where each object contains basic information about one queue. By default, the objects are sorted by the `createdAt` field in ascending order, therefore you can use pagination to incrementally fetch all queues while new ones are still being created. To sort them in descending order, use the `desc=1` parameter. The endpoint supports pagination using `limit` and `offset` parameters and it will not return more than 1000 array elements. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create request queue ``` POST /v2/request-queues ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/RequestQueueCollectionClientAsync#get_or_create)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/RequestQueueCollectionClient#getOrCreate)Creates a request queue and returns its object. Keep in mind that requests stored under an unnamed queue follow the [data retention period](https://docs.apify.com/platform/storage#data-retention). If the `name` parameter is used, it creates a queue with the given name. If a queue with the given name already exists, the endpoint returns its object.
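For instance, a sketch of creating (or getting) a named queue with `fetch`; the queue name and token are placeholders:

```
// Create a request queue named "my-crawl-queue", or get it if it already exists.
const res = await fetch('https://api.apify.com/v2/request-queues?name=my-crawl-queue', {
  method: 'POST',
  headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
});
const { data } = await res.json();
console.log(data.id, data.name); // the existing queue object is returned if the name is taken
```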
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Delete schedule ``` DELETE /v2/schedules/:scheduleId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ScheduleClient#delete)Deletes a schedule. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get schedule ``` GET /v2/schedules/:scheduleId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ScheduleClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ScheduleClient#get)Gets the schedule object with all details. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get schedule log ``` GET /v2/schedules/:scheduleId/log ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ScheduleClientAsync#get_log)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ScheduleClient#getLog)Gets the schedule log as a JSON array containing information about up to 1,000 invocations of the schedule. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update schedule ``` PUT /v2/schedules/:scheduleId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ScheduleClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ScheduleClient#update)Updates a schedule using values specified by a schedule object passed as JSON in the PUT payload. If the object does not define a specific property, its value will not be updated. The response is the full schedule object as returned by the [Get schedule](#/reference/schedules/schedule-object/get-schedule) endpoint. **The request needs to specify the `Content-Type: application/json` HTTP header!** When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Schedules - Introduction This section describes API endpoints for managing schedules. Schedules are used to automatically start your Actors at certain times. Each schedule can be associated with a number of Actors and Actor tasks.
It is also possible to override the settings of each Actor (task) similarly to when invoking the Actor (task) using the API. For more information, see [Schedules documentation](https://docs.apify.com/platform/schedules). Each schedule is assigned actions for it to perform. Actions can be of two types: `RUN_ACTOR` and `RUN_ACTOR_TASK`. For details, see the documentation of the [Get schedule](#/reference/schedules/schedule-object/get-schedule) endpoint. ## [Get list of schedules](https://docs.apify.com/api/v2/schedules-get.md) [`/schedules`](https://docs.apify.com/api/v2/schedules-get.md) ## [Create schedule](https://docs.apify.com/api/v2/schedules-post.md) [`/schedules`](https://docs.apify.com/api/v2/schedules-post.md) ## [Get schedule](https://docs.apify.com/api/v2/schedule-get.md) [`/schedules/{scheduleId}`](https://docs.apify.com/api/v2/schedule-get.md) ## [Update schedule](https://docs.apify.com/api/v2/schedule-put.md) [`/schedules/{scheduleId}`](https://docs.apify.com/api/v2/schedule-put.md) ## [Delete schedule](https://docs.apify.com/api/v2/schedule-delete.md) [`/schedules/{scheduleId}`](https://docs.apify.com/api/v2/schedule-delete.md) ## [Get schedule log](https://docs.apify.com/api/v2/schedule-log-get.md) [`/schedules/{scheduleId}/log`](https://docs.apify.com/api/v2/schedule-log-get.md) --- # Get list of schedules ``` GET /v2/schedules ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ScheduleCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ScheduleCollectionClient#list)Gets the list of schedules that the user created. The endpoint supports pagination using the `limit` and `offset` parameters. It will not return more than 1000 records. By default, the records are sorted by the `createdAt` field in ascending order. To sort the records in descending order, use the `desc=1` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create schedule ``` POST /v2/schedules ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/ScheduleCollectionClientAsync#create)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/ScheduleCollectionClient#create)Creates a new schedule with settings provided by the schedule object passed as JSON in the payload. The response is the created schedule object. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Datasets - Introduction This section describes API endpoints to manage Datasets.
Dataset is a storage for structured data, where each record stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - you can only add new records to it but you cannot modify or remove existing records. Typically it is used to store crawling results. For more information, see the [Datasets documentation](https://docs.apify.com/platform/storage/dataset). note Some of the endpoints do not require the authentication token, the calls are authenticated using the hard-to-guess ID of the dataset. ## [Get list of datasets](https://docs.apify.com/api/v2/datasets-get.md) [`/datasets`](https://docs.apify.com/api/v2/datasets-get.md) ## [Create dataset](https://docs.apify.com/api/v2/datasets-post.md) [`/datasets`](https://docs.apify.com/api/v2/datasets-post.md) ## [Get dataset](https://docs.apify.com/api/v2/dataset-get.md) [`/datasets/{datasetId}`](https://docs.apify.com/api/v2/dataset-get.md) ## [Update dataset](https://docs.apify.com/api/v2/dataset-put.md) [`/datasets/{datasetId}`](https://docs.apify.com/api/v2/dataset-put.md) ## [Delete dataset](https://docs.apify.com/api/v2/dataset-delete.md) [`/datasets/{datasetId}`](https://docs.apify.com/api/v2/dataset-delete.md) ## [Get items](https://docs.apify.com/api/v2/dataset-items-get.md) [`/datasets/{datasetId}/items`](https://docs.apify.com/api/v2/dataset-items-get.md) ## [Store items](https://docs.apify.com/api/v2/dataset-items-post.md) [`/datasets/{datasetId}/items`](https://docs.apify.com/api/v2/dataset-items-post.md) ## [Get dataset statistics](https://docs.apify.com/api/v2/dataset-statistics-get.md) [`/datasets/{datasetId}/statistics`](https://docs.apify.com/api/v2/dataset-statistics-get.md) --- # Key-value stores - Introduction This section describes API endpoints to manage Key-value stores. Key-value store is a simple storage for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots, Actor inputs and outputs, web pages, PDFs or to persist the state of crawlers. For more information, see the [Key-value store documentation](https://docs.apify.com/platform/storage/key-value-store). note Some of the endpoints do not require the authentication token, the calls are authenticated using a hard-to-guess ID of the key-value store. 
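For instance, a minimal sketch of storing a JSON record with the Store record endpoint via `fetch`; the store ID, record key, and token are placeholders:

```
// Store a JSON record under the key "OUTPUT" (placeholder store ID and token).
const storeId = 'WkzbQMuFYuamGv3YF';
const res = await fetch(`https://api.apify.com/v2/key-value-stores/${storeId}/records/OUTPUT`, {
  method: 'PUT',
  headers: {
    Authorization: `Bearer ${process.env.APIFY_TOKEN}`,
    'Content-Type': 'application/json', // MIME type stored together with the record
  },
  body: JSON.stringify({ status: 'done', itemCount: 42 }),
});
console.log(res.status); // 201 when the record is stored
```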
## [Get list of key-value stores](https://docs.apify.com/api/v2/key-value-stores-get.md) [`/key-value-stores`](https://docs.apify.com/api/v2/key-value-stores-get.md) ## [Create key-value store](https://docs.apify.com/api/v2/key-value-stores-post.md) [`/key-value-stores`](https://docs.apify.com/api/v2/key-value-stores-post.md) ## [Get store](https://docs.apify.com/api/v2/key-value-store-get.md) [`/key-value-stores/{storeId}`](https://docs.apify.com/api/v2/key-value-store-get.md) ## [Update store](https://docs.apify.com/api/v2/key-value-store-put.md) [`/key-value-stores/{storeId}`](https://docs.apify.com/api/v2/key-value-store-put.md) ## [Delete store](https://docs.apify.com/api/v2/key-value-store-delete.md) [`/key-value-stores/{storeId}`](https://docs.apify.com/api/v2/key-value-store-delete.md) ## [Get list of keys](https://docs.apify.com/api/v2/key-value-store-keys-get.md) [`/key-value-stores/{storeId}/keys`](https://docs.apify.com/api/v2/key-value-store-keys-get.md) ## [Get record](https://docs.apify.com/api/v2/key-value-store-record-get.md) [`/key-value-stores/{storeId}/records/{recordKey}`](https://docs.apify.com/api/v2/key-value-store-record-get.md) ## [Check if a record exists](https://docs.apify.com/api/v2/key-value-store-record-head.md) [`/key-value-stores/{storeId}/records/{recordKey}`](https://docs.apify.com/api/v2/key-value-store-record-head.md) ## [Store record](https://docs.apify.com/api/v2/key-value-store-record-put.md) [`/key-value-stores/{storeId}/records/{recordKey}`](https://docs.apify.com/api/v2/key-value-store-record-put.md) ## [Delete record](https://docs.apify.com/api/v2/key-value-store-record-delete.md) [`/key-value-stores/{storeId}/records/{recordKey}`](https://docs.apify.com/api/v2/key-value-store-record-delete.md) --- # Request queues - Introduction This section describes API endpoints to create, manage, and delete request queues. Request queue is a storage for a queue of HTTP URLs to crawl, which is typically used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The storage supports both breadth-first and depth-first crawling orders. For more information, see the [Request queue documentation](https://docs.apify.com/platform/storage/request-queue). note Some of the endpoints do not require the authentication token, the calls are authenticated using the hard-to-guess ID of the queue. 
## [Get list of request queues](https://docs.apify.com/api/v2/request-queues-get.md) [`/request-queues`](https://docs.apify.com/api/v2/request-queues-get.md) ## [Create request queue](https://docs.apify.com/api/v2/request-queues-post.md) [`/request-queues`](https://docs.apify.com/api/v2/request-queues-post.md) ## [Get request queue](https://docs.apify.com/api/v2/request-queue-get.md) [`/request-queues/{queueId}`](https://docs.apify.com/api/v2/request-queue-get.md) ## [Update request queue](https://docs.apify.com/api/v2/request-queue-put.md) [`/request-queues/{queueId}`](https://docs.apify.com/api/v2/request-queue-put.md) ## [Delete request queue](https://docs.apify.com/api/v2/request-queue-delete.md) [`/request-queues/{queueId}`](https://docs.apify.com/api/v2/request-queue-delete.md) ## [Add requests](https://docs.apify.com/api/v2/request-queue-requests-batch-post.md) [`/request-queues/{queueId}/requests/batch`](https://docs.apify.com/api/v2/request-queue-requests-batch-post.md) ## [Delete requests](https://docs.apify.com/api/v2/request-queue-requests-batch-delete.md) [`/request-queues/{queueId}/requests/batch`](https://docs.apify.com/api/v2/request-queue-requests-batch-delete.md) --- # Requests - Introduction This section describes API endpoints to create, manage, and delete requests within request queues. Request queue is a storage for a queue of HTTP URLs to crawl, which is typically used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The storage supports both breadth-first and depth-first crawling orders. For more information, see the [Request queue documentation](https://docs.apify.com/platform/storage/request-queue). note Some of the endpoints do not require the authentication token, the calls are authenticated using the hard-to-guess ID of the queue. ## [List requests](https://docs.apify.com/api/v2/request-queue-requests-get.md) [`/request-queues/{queueId}/requests`](https://docs.apify.com/api/v2/request-queue-requests-get.md) ## [Add request](https://docs.apify.com/api/v2/request-queue-requests-post.md) [`/request-queues/{queueId}/requests`](https://docs.apify.com/api/v2/request-queue-requests-post.md) ## [Get request](https://docs.apify.com/api/v2/request-queue-request-get.md) [`/request-queues/{queueId}/requests/{requestId}`](https://docs.apify.com/api/v2/request-queue-request-get.md) ## [Update request](https://docs.apify.com/api/v2/request-queue-request-put.md) [`/request-queues/{queueId}/requests/{requestId}`](https://docs.apify.com/api/v2/request-queue-request-put.md) ## [Delete request](https://docs.apify.com/api/v2/request-queue-request-delete.md) [`/request-queues/{queueId}/requests/{requestId}`](https://docs.apify.com/api/v2/request-queue-request-delete.md) --- # Requests locks - Introduction This section describes API endpoints to create, manage, and delete request locks within request queues. Request queue is a storage for a queue of HTTP URLs to crawl, which is typically used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The storage supports both breadth-first and depth-first crawling orders. For more information, see the [Request queue documentation](https://docs.apify.com/platform/storage/request-queue). note Some of the endpoints do not require the authentication token, the calls are authenticated using the hard-to-guess ID of the queue.
## [Unlock requests](https://docs.apify.com/api/v2/request-queue-requests-unlock-post.md) [`/request-queues/{queueId}/requests/unlock`](https://docs.apify.com/api/v2/request-queue-requests-unlock-post.md) ## [Get head](https://docs.apify.com/api/v2/request-queue-head-get.md) [`/request-queues/{queueId}/head`](https://docs.apify.com/api/v2/request-queue-head-get.md) ## [Get head and lock](https://docs.apify.com/api/v2/request-queue-head-lock-post.md) [`/request-queues/{queueId}/head/lock`](https://docs.apify.com/api/v2/request-queue-head-lock-post.md) ## [Prolong request lock](https://docs.apify.com/api/v2/request-queue-request-lock-put.md) [`/request-queues/{queueId}/requests/{requestId}/lock`](https://docs.apify.com/api/v2/request-queue-request-lock-put.md) ## [Delete request lock](https://docs.apify.com/api/v2/request-queue-request-lock-delete.md) [`/request-queues/{queueId}/requests/{requestId}/lock`](https://docs.apify.com/api/v2/request-queue-request-lock-delete.md) --- # Store - Introduction [Apify Store](https://apify.com/store) is home to thousands of public Actors available to the Apify community. The API endpoints described in this section are used to retrieve these Actors. note These endpoints do not require the authentication token. ## [Get list of Actors in store](https://docs.apify.com/api/v2/store-get.md) [`/store`](https://docs.apify.com/api/v2/store-get.md) --- # Get list of Actors in store ``` GET /v2/store ``` Gets the list of public Actors in Apify Store. You can use `search` parameter to search Actors by string in title, name, description, username and readme. If you need detailed info about a specific Actor, use the [Get Actor](#/reference/actors/actor-object/get-actor) endpoint. The endpoint supports pagination using the `limit` and `offset` parameters. It will not return more than 1,000 records. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get public user data ``` GET /v2/users/:userId ``` Returns public information about a specific user account, similar to what can be seen on public profile pages (e.g. ). This operation requires no authentication token. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Users - Introduction The API endpoints described in this section return information about user accounts. ## [Get public user data](https://docs.apify.com/api/v2/user-get.md) [`/users/{userId}`](https://docs.apify.com/api/v2/user-get.md) ## [Get private user data](https://docs.apify.com/api/v2/users-me-get.md) [`/users/me`](https://docs.apify.com/api/v2/users-me-get.md) ## [Get monthly usage](https://docs.apify.com/api/v2/users-me-usage-monthly-get.md) [`/users/me/usage/monthly`](https://docs.apify.com/api/v2/users-me-usage-monthly-get.md) ## [Get limits](https://docs.apify.com/api/v2/users-me-limits-get.md) [`/users/me/limits`](https://docs.apify.com/api/v2/users-me-limits-get.md) ## [Update limits](https://docs.apify.com/api/v2/users-me-limits-put.md) [`/users/me/limits`](https://docs.apify.com/api/v2/users-me-limits-put.md) --- # Get private user data ``` GET /v2/users/me ``` Returns information about the current user account, including both public and private information. The user account is identified by the provided authentication token. The fields `plan`, `email` and `profile` are omitted when this endpoint is accessed from Actor run. 
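A minimal sketch of calling this endpoint with `fetch`; the token is a placeholder, and the response is assumed to wrap the user object in a `data` property:

```
// Fetch private data of the current user account (placeholder token).
const res = await fetch('https://api.apify.com/v2/users/me', {
  headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
});
const { data } = await res.json();
console.log(data.username); // `plan`, `email` and `profile` are omitted inside Actor runs
```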
## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get limits ``` GET /v2/users/me/limits ``` Returns a complete summary of your account's limits. It is the same information you will see on your account's [Limits page](https://console.apify.com/billing#/limits). The returned data includes the current usage cycle, a summary of your limits, and your current usage. ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update limits ``` PUT /v2/users/me/limits ``` Updates the account's limits manageable on your account's [Limits page](https://console.apify.com/billing#/limits). Specifically, the `maxMonthlyUsageUsd` and `dataRetentionDays` limits (see the request body schema for more details). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** --- # Get monthly usage ``` GET /v2/users/me/usage/monthly ``` Returns a complete summary of your usage for the current usage cycle, an overall sum, as well as a daily breakdown of usage. It is the same information you will see on your account's [Billing page](https://console.apify.com/billing#/usage). The information includes your usage of storage, data transfer, and request queues. Using the `date` parameter will show your usage in the usage cycle that includes that date. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Delete webhook ``` DELETE /v2/webhooks/:webhookId ``` Clients[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/WebhookClient#delete)Deletes a webhook. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 204 **Response Headers** --- # Get webhook dispatch ``` GET /v2/webhook-dispatches/:dispatchId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/WebhookDispatchClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/WebhookDispatchClient#get)Gets webhook dispatch object with all details. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get collection ``` GET /v2/webhooks/:webhookId/dispatches ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/WebhookClientAsync#dispatches)Gets a given webhook's list of dispatches.
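A minimal sketch of listing a webhook's dispatches with `fetch`; the webhook ID and token are placeholders, and the paginated response shape (`total`, `items`) is an assumption consistent with other list endpoints in this API:

```
// List dispatches of a specific webhook (placeholder webhook ID and token).
const webhookId = 'WkzbQMuFYuamGv3YF';
const res = await fetch(`https://api.apify.com/v2/webhooks/${webhookId}/dispatches`, {
  headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
});
const { data } = await res.json();
console.log(data.total, data.items.length); // paginated list of dispatch objects
```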
## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Get webhook ``` GET /v2/webhooks/:webhookId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/WebhookClientAsync#get)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/WebhookClient#get)Gets webhook object with all details. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Update webhook ``` PUT /v2/webhooks/:webhookId ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/WebhookClientAsync#update)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/WebhookClient#update)Updates a webhook using values specified by a webhook object passed as JSON in the PUT payload. If the object does not define a specific property, its value will not be updated. The response is the full webhook object as returned by the [Get webhook](#/reference/webhooks/webhook-object/get-webhook) endpoint. The request needs to specify the `Content-Type: application/json` HTTP header! When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](#/introduction/authentication)). ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Test webhook ``` POST /v2/webhooks/:webhookId/test ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/WebhookClientAsync#test)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/WebhookClient#test)Tests a webhook. Creates a webhook dispatch with a dummy payload. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** --- # Get list of webhooks ``` GET /v2/webhooks ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/WebhookCollectionClientAsync#list)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/WebhookCollectionClient#list)Gets the list of webhooks that the user created. The endpoint supports pagination using the `limit` and `offset` parameters and it will not return more than 1000 records. By default, the records are sorted by the `createdAt` field in ascending order.
To sort the records in descending order, use the `desc=1` parameter. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 200 **Response Headers** --- # Create webhook ``` POST /v2/webhooks ``` Clients[![Apify API Python Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/python.svg)](https://docs.apify.com/api/client/python/reference/class/WebhookCollectionClientAsync#create)[![Apify API JavaScript Client Reference](https://raw.githubusercontent.com/apify/openapi/b1206ac2adf8f39b05e5a09bf32c2802af58d851/assets/javascript.svg)](https://docs.apify.com/api/client/js/reference/class/WebhookCollectionClient#create)Creates a new webhook with settings provided by the webhook object passed as JSON in the payload. The response is the created webhook object. To avoid duplicating a webhook, use the `idempotencyKey` parameter in the request body. Multiple calls to create a webhook with the same `idempotencyKey` will only create the webhook with the first call and return the existing webhook on subsequent calls. Idempotency keys must be unique, so use a UUID or another random string with enough entropy. To assign the new webhook to an Actor or task, the request body must contain `requestUrl`, `eventTypes`, and `condition` properties. * `requestUrl` is the webhook's target URL, to which data is sent as a POST request with a JSON payload. * `eventTypes` is a list of events that will trigger the webhook, e.g. when the Actor run succeeds. * `condition` should be an object containing the ID of the Actor or task to which the webhook will be assigned. * `payloadTemplate` is a JSON-like string, whose syntax is extended with the use of variables. * `headersTemplate` is a JSON-like string, whose syntax is extended with the use of variables. The following header values will be overwritten with defaults: "host", "Content-Type", "X-Apify-Webhook", "X-Apify-Webhook-Dispatch-Id", "X-Apify-Request-Origin". * `description` is an optional string. * `shouldInterpolateStrings` is a boolean indicating whether to interpolate variables contained inside strings in the `payloadTemplate`. For example, a webhook object in the payload may look like this:

```
{
  "isAdHoc": false,
  "requestUrl": "https://example.com",
  "eventTypes": ["ACTOR.RUN.SUCCEEDED", "ACTOR.RUN.ABORTED"],
  "condition": {
    "actorId": "janedoe~my-actor",
    "actorTaskId": "W9bs9JE9v7wprjAnJ"
  },
  "payloadTemplate": "",
  "headersTemplate": "",
  "description": "my awesome webhook",
  "shouldInterpolateStrings": false
}
```

**Important**: The request must specify the `Content-Type: application/json` HTTP header. ## Request[​](#request "Direct link to Request") ## Responses[​](#responses "Direct link to Responses") * 201 **Response Headers** * **Location** --- # Webhook dispatches - Introduction This section describes API endpoints to get webhook dispatches. ## [Get collection](https://docs.apify.com/api/v2/webhook-dispatches-get.md) [`/webhook-dispatches`](https://docs.apify.com/api/v2/webhook-dispatches-get.md) ## [Get webhook dispatch](https://docs.apify.com/api/v2/webhook-dispatch-get.md) [`/webhook-dispatches/{dispatchId}`](https://docs.apify.com/api/v2/webhook-dispatch-get.md) --- # Webhooks - Introduction This section describes API endpoints to manage webhooks. Webhooks provide an easy and reliable way to configure the Apify platform to carry out an action (e.g. an HTTP request to another service) when a certain system event occurs. For example, you can use webhooks to start another Actor when an Actor run finishes or fails.
For more information, see [Webhooks documentation](https://docs.apify.com/platform/integrations/webhooks). ## [Get list of webhooks](https://docs.apify.com/api/v2/webhooks-get.md) [`/webhooks`](https://docs.apify.com/api/v2/webhooks-get.md) ## [Create webhook](https://docs.apify.com/api/v2/webhooks-post.md) [`/webhooks`](https://docs.apify.com/api/v2/webhooks-post.md) ## [Get webhook](https://docs.apify.com/api/v2/webhook-get.md) [`/webhooks/{webhookId}`](https://docs.apify.com/api/v2/webhook-get.md) ## [Update webhook](https://docs.apify.com/api/v2/webhook-put.md) [`/webhooks/{webhookId}`](https://docs.apify.com/api/v2/webhook-put.md) ## [Delete webhook](https://docs.apify.com/api/v2/webhook-delete.md) [`/webhooks/{webhookId}`](https://docs.apify.com/api/v2/webhook-delete.md) ## [Test webhook](https://docs.apify.com/api/v2/webhook-test-post.md) [`/webhooks/{webhookId}/test`](https://docs.apify.com/api/v2/webhook-test-post.md) ## [Get collection](https://docs.apify.com/api/v2/webhook-dispatches-get.md) [`/webhooks/{webhookId}/dispatches`](https://docs.apify.com/api/v2/webhook-dispatches-get.md) --- # Apify Legal ## Company details (Impressum)[​](#company-details-impressum "Direct link to Company details (Impressum)") **Apify Technologies s.r.o.** Registered seat: Vodickova 704/36, 110 00 Prague 1, Czech Republic. VAT ID: CZ04788290 (EU), GB373153700 (UK). Company ID: 04788290. Czech limited liability company registered in the [Commercial Register](https://or.justice.cz/ias/ui/rejstrik-firma.vysledky?subjektId=924944\&typ=PLATNY) kept by the Municipal Court of Prague, File No.: C 253224. Represented by managing director Jan Čurn. IBAN: CZ0355000000000027434378. SWIFT / BIC: RZBCCZPP. ### Contacts[​](#contacts "Direct link to Contacts") General: Legal team contact: Privacy team contact: Apify Trust Center: ### Trademarks[​](#trademarks "Direct link to Trademarks") "APIFY" is a word trademark registered with USPTO (4517178), EUIPO (011628377), UKIPO (UK00911628377), and DPMA (3020120477984).
## Terms and Conditions[​](#terms-and-conditions "Direct link to Terms and Conditions") * [Apify General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) * [Apify Store Publishing Terms and Conditions](https://docs.apify.com/legal/store-publishing-terms-and-conditions.md) * [Apify Affiliate Program Terms and Conditions](https://docs.apify.com/legal/affiliate-program-terms-and-conditions.md) * [Apify Data Processing Addendum](https://docs.apify.com/legal/data-processing-addendum.md) * [Apify Open Source Fair Share Program Terms and Conditions](https://docs.apify.com/legal/fair-share-program-terms-and-conditions.md) * [Apify Candidate Referral Program Terms](https://docs.apify.com/legal/candidate-referral-program-terms.md) ## Policies[​](#policies "Direct link to Policies") * [Apify Acceptable Use Policy](https://docs.apify.com/legal/acceptable-use-policy.md) * [Apify Privacy Policy](https://docs.apify.com/legal/privacy-policy.md) * [Apify Cookie Policy](https://docs.apify.com/legal/cookie-policy.md) * [Apify GDPR Information](https://docs.apify.com/legal/gdpr-information.md) * [Apify Whistleblowing Policy](https://docs.apify.com/legal/whistleblowing-policy.md) --- # Apify Acceptable Use Policy Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, Nové Město, 110 00 Prague 1, Czech Republic, Company ID No.: 04788290, registered in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 (hereinafter referred to as “**we**” or “**Apify**”), is committed to making sure that the Platform and the Website are being used only for legitimate and legal purposes. By agreeing to our General Terms and Conditions or simply by using the Platform or the Website, you also agree to be bound by this Acceptable Use Policy. ## 1. General Provisions[​](#1-general-provisions "Direct link to 1. General Provisions") **1.1.** Apify takes firm action against any illegal or non-compliant use of the Platform and the Website and will assist law enforcement in investigating any illegal activities. Any use of the Platform or the Website for illegal, fraudulent, or otherwise unacceptable activities is strictly prohibited. **1.2.** You agree to use the Platform and the Website only for legal and legitimate purposes and to avoid the Prohibited Activities under Article 2.1. **1.3.** Any capitalized terms that are not defined in this Acceptable Use Policy shall have the meaning ascribed to them in the General Terms and Conditions. ## 2. Prohibited Activities[​](#2-prohibited-activities "Direct link to 2. Prohibited Activities") **2.1.** Prohibited activities include, but are not limited to: 1. denial-of-service (DDoS) attacks or any other actions that cause undue burden on any servers or infrastructure; 2. unsolicited mass messaging; 3. any fraudulent or deceptive behavior (such as phishing, malware, impersonation, spoofing, ad fraud, click fraud, etc.); 4. any artificial interaction (such as upvotes, shares, etc.); 5. creating fake accounts or deceptive content (such as disinformation, clickbait, misleading ad, scam emails, etc.); 6. any manipulation of Search Engine Optimization (i.e., fake clicks in search engine results); 7. engaging in surveys in exchange for any financial or in-kind benefit; 8. resale of any Platform features without obtaining Apify’s prior written approval; 9. engaging in activities that contravene applicable laws, regulations, or the rights of any third party; 10. 
any activity that may harm the reputation, goodwill or interests of Apify; 11. any additional activity that Apify deems immoral or undesirable on its Platform or Website (together as the “**Prohibited Activities**”). ## 3. Our Rights[​](#3-our-rights "Direct link to 3. Our Rights") **3.1.** In case Apify identifies any of the Prohibited Activities on the Platform or the Website, it is authorized to block, delete, or otherwise restrict any such non-compliant User or Actor from the Platform or Website without notice. Apify may limit the use of the Platform or the Website in its sole discretion to prevent any direct or indirect damage to Apify or any third party. **3.2.** Apify shall not be liable towards you or any third party for exercising its rights according to this Acceptable Use Policy. ## 4. Reporting[​](#4-reporting "Direct link to 4. Reporting") **4.1.** We encourage users to report any misuse or suspicious activity on our Platform through our contact email [support@apify.com](mailto:hello@apify.com). --- # Apify Affiliate Program Terms and Conditions Effective date: May 14, 2024 Latest version effective from: July 5, 2025 *** **Apify Technologies s.r.o.**, with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg. no. 04788290, recorded in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 ("**we**" or "**Apify**") thanks you ("**you**" or the "**Affiliate Partner**") for your interest in joining Apify's Affiliate Program (the "**Affiliate Program**"). These terms and conditions (the "**Affiliate Program Terms**") shall govern your rights and obligations in relation to your participation in the Affiliate Program in addition to [Apify General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) (the "**General Terms**"). Terms starting with a capital letter used in these Affiliate Program Terms have the meaning defined either here or in the General Terms. Provisions of the General Terms regarding liability, indemnity, governing law and choice of jurisdiction are incorporated herein by reference. ## 1. Joining the Affiliate Program[​](#1-joining-the-affiliate-program "Direct link to 1. Joining the Affiliate Program") **1.1.** You may join the Affiliate Program by creating an account on Apify's [FirstPromoter portal](https://apify.firstpromoter.com/) ("**FirstPromoter**"). By ticking the box "*I agree to the Apify Affiliate Program terms*", you claim that you are over 18 years old and agree to adhere to these Affiliate Program Terms, in addition to the General Terms. If you act on behalf of a company when accepting these Affiliate Program Terms, you also hereby declare to be authorized to perform such legal actions on behalf of the company (herein the term "**you**" shall mean the relevant company). ## 2. Apify's obligations[​](#2-apifys-obligations "Direct link to 2. Apify's obligations") **2.1.** Upon joining the Affiliate Program, Apify will make available to you your unique referral link (the "**Referral Link**") and promotional materials, which you may use in promoting Apify's services. **2.2.** Apify shall pay to the Affiliate Partner a commission (the "**Commission**") for referred users' use of the Platform according to the provisions below. ## 3. Referred Customers[​](#3-referred-customers "Direct link to 3. 
Referred Customers") **3.1.** "**Referred Customer**" is a natural person or a company who subscribed to Apify's Platform using the Referral Link in compliance with these Affiliate Program Terms. **3.2.** Tracking of the potential Referred Customer is performed by cookies lasting 45 days. If the cookies expire or are deleted, a customer may not be recognised as a Referred Customer. **3.3.** If the Affiliate Partner identifies any customer that, in their opinion, shall be considered Referred Customer but has not been recognised by the tracking, the Affiliate Partner may report this information to Apify together with sufficient evidence. Apify shall in its discretion in good faith consider the Affiliate Partner's report and evidence, and decide whether or not such a customer shall be deemed Referred Customer. **3.4.** In case of any uncertain or suspicious circumstances, Apify shall have the right to ultimately decide whether a customer is to be deemed Referred Customer or not, in its sole discretion. ## 4. Commission[​](#4-commission "Direct link to 4. Commission") **4.1.** Unless agreed otherwise, the Commission shall be computed as 20% of all payments made by each Referred Customer to Apify in the first 3 months from the date when that Referred Customer started paying for Services (as defined in the General Terms) and then increased to 30% for all payments made by each Referred Customer to Apify, up to 2,500 USD per Referred Customer. ## 5. Payment terms[​](#5-payment-terms "Direct link to 5. Payment terms") **5.1.** Apify shall pay the earned Commission to the Affiliate Partner monthly, within 15 business days after the end of each calendar month, based on an invoice that shall be issued to Apify. The minimum Commission payable is 100 USD. Commission in any given calendar month lower than 100 USD will be rolled over to the following month. **5.2.** The Commission may be paid either via PayPal or bank transfer. The Affiliate Partner shall specify the chosen payment method. **5.3.** Invoices are generated automatically through FirstPromoter by Apify. The Affiliate Partner shall provide and maintain up-to-date information required for this purpose in the portal. Apify may request that the Affiliate Partner issues an invoice using a different method, if the method through FirstPromoter portal is not available (e.g. due to missing Affiliate Partner's personal information). Apify shall not be obliged to pay any Commission until a valid invoice has been issued or generated. **5.4.** Affiliate acknowledges and agrees that Apify makes no representation or guarantee of any kind regarding revenue, business, profit, or customers under these Affiliate Program Terms. ## 6. Affiliate Partner's obligations[​](#6-affiliate-partners-obligations "Direct link to 6. Affiliate Partner's obligations") **6.1.** The Affiliate Partner must not promote Apify or use its Referral Link in any of the following ways: 1. in any materials or in connection with any services that are illegal, infringing on third party rights, fraudulent, harassing, defamatory, discriminatory or violent; 2. use any paid advertisements and/or advertise via any pay-per-click advertisement systems (e.g., Google Ads, Facebook Ads or LinkedIn Ads); 3. bid on the "Apify" keyword for any pay-per-click advertisement systems (e.g., Google Ads), including any misspellings, capitalizations or in combination with any other text; 4. 
for self-referral, i.e., its own or its friends', affiliates', related persons' or business partners' sign-ups to the Platform; 5. in any content that's available on the Platform, Website or any other assets owned, operated or maintained by Apify; or 6. for any illegal activity, including, without limitation, fraud or money laundering. Any customer referred in breach of this clause 6.1 shall not be considered a Referred Customer. **6.2.** For avoidance of doubt, the Affiliate Partner shall not be an agent of Apify. Nothing in these Affiliate Program Terms shall be construed as authorization to act or make representations on behalf of Apify. ## 7. License[​](#7-license "Direct link to 7. License") **7.1.** Apify hereby grants to the Affiliate Partner a limited, worldwide, revocable, non-exclusive, non-sublicensable and non-transferable license to use Apify's trademark, logo, trade name, service names and copyrighted material (the "**Intellectual Property Assets**") in its marketing, advertising or other content while promoting Apify services in accordance with these Affiliate Program Terms. **7.2.** The Affiliate Partner shall not alter, modify, adapt, translate, or create derivative works from the whole or any part of the Intellectual Property Assets or permit any part of the Intellectual Property Assets to be merged, combined with, or otherwise incorporated into any other product, unless the Affiliate Partner obtained prior written consent from Apify. The Affiliate Partner shall not use any language or display the Intellectual Property Assets in such a way as to create the impression that the Intellectual Property Assets belong to the Affiliate Partner. **7.3.** The Affiliate Partner shall not attack, question, or contest the validity of Apify's ownership of Intellectual Property Assets. **7.4.** If any infringement of any Intellectual Property Assets comes to the Affiliate Partner's attention, whether actual or threatened, the Affiliate Partner agrees to inform Apify as soon as possible. The Affiliate Partner also agrees to notify us of any claim by anyone that our products infringe the rights of any other person. The Affiliate Partner shall, at our request and expense, provide its cooperation to Apify in taking any reasonably required steps to address such claims. **7.5.** Apify reserves all rights not expressly granted in this Agreement, and does not transfer any right, title, or interest to any intellectual property rights. ## 8. Termination[​](#8-termination "Direct link to 8. Termination") **8.1.** The Affiliate Partner may terminate its participation in the Affiliate Program at any time and for any reason by providing Apify a written notice. Commission earned during that calendar month when the Affiliate ceased to be part of the Affiliate Program will be paid out in the usual term after the end of that calendar month. **8.2.** Apify may terminate the Affiliate Partner's participation in the Affiliate Program with immediate effect if the Affiliate Partner breaches any provision of these Affiliate Program Terms or any other terms agreed between the Affiliate Partner and Apify. All commissions shall be forfeited and the Affiliate Partner will not be entitled to any reimbursement. **8.3.** Apify may terminate the Affiliate Partner's participation in the Affiliate Program at any time and for any reason by providing the Affiliate Partner with at least thirty (30) days' written notice. 
Commission earned during that calendar month when the Affiliate Partner ceased to be a part of the Affiliate Program will be paid out in the usual term after the end of that calendar month. **8.4.** These Affiliate Program Terms shall terminate together with the termination of the Affiliate Partner's participation in the Affiliate Program. ## 9. Amendments[​](#9-amendments "Direct link to 9. Amendments") **9.1.** We may unilaterally amend the Affiliate Program Terms. We shall notify you of such an amendment at least 30 days in advance of its effective date. Should you disagree with such an amendment, you may withdraw from the Affiliate Program until the effective date of the amendments. Otherwise, you will be deemed to agree with the announced amendments. --- # Apify Candidate Referral Program Last Updated: April 14, 2025 *** Apify Technologies s.r.o., as the announcer (“**Apify**”), is constantly looking for new employees and prefers to recruit people based on credible references. Therefore, Apify is announcing this public candidate referral program. Apify undertakes that any individual (“**You**”) who: is neither: * currently employed by Apify; * a person who carries out business in human resources; * a recruiting agency, or a person cooperating with a recruiting agency in any other capacity; and recommends to Apify a suitable job candidate (“**Candidate**”): * for any open full-time position published on the Apify Job page available at: ; * who is not already in Apify’s applicant database, and is not active in the recruitment process (in which case you will be informed by Apify); * with whom Apify concludes an employment contract that establishes full-time employment between Apify and the Candidate (any candidates hired on a “DPP/DPČ” basis are excluded from this program); * who confirms that he/she was recommended to Apify by you and * whose employment with Apify is not terminated during the probationary period; will receive a reward of **CZK 20,000** from Apify for each such Candidate. If the Candidate is hired in a capacity other than full-time engagement, the reward will be prorated accordingly. If the Candidate transfers from part-time and/or “DPP/DPČ” to full-time engagement, you will not be entitled to any additional reward. A person will be considered a Candidate recommended by you only if you send the Candidate’s CV and contact details to the email address jobs\[at]apify\[dot]com. As it’s very important for Apify to respond promptly and avoid any inconveniences, Apify cannot accept any other method of recommendation. Sending resumes and information directly to jobs\[at]apify\[dot]com ensures that the entire Apify recruiting team receives the referral and can take care of the Candidate. When submitting the resume, please provide as much supporting information as possible about why Apify should hire the Candidate. You shall become entitled to the reward after the Candidate’s probationary period successfully passes. Apify will issue a protocol confirming the payout of the reward. Reward payment is based on your signature of the protocol. It is payable by bank transfer to the account specified in the protocol within thirty (30) days from the date of the protocol signature. Please note that the reward is subject to the applicable taxes. You are solely responsible for any related tax obligations (such as tax returns, etc.). You may recommend more than one Candidate. 
If you and someone else recommend one Candidate for the same open position in parallel, the reward will be provided to the individual who recommended the Candidate first for that specific open position. The current list of open positions is available on the Apify Job page: . This Apify Candidate Referral Program is effective from its publication and remains in effect as long as it is published on the website [www.apify.com](http://www.apify.com). Apify reserves the right to modify or revoke this Candidate Referral Program and its terms at any time and at its sole discretion by removing it from the website [www.apify.com](http://www.apify.com). --- # Apify Cookie Policy **Apify Technologies s.r.o.**, with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg. no. 04788290, recorded in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 ("**we**", "**us**" or the "**Provider**") welcomes you (“**you**” or the “**User**”) on our website apify.com (the “**Website**”). This Cookie Policy (the “**Policy**” or “**Cookie Policy**”) describes the way we use cookies on our Website and on our platform on console.apify.com (the “**Platform**”). Terms starting with a capital letter used in this Policy have the meaning defined in our [General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) (the “**Terms**”). By accessing the Website, Platform or using our Services, you acknowledge and agree to this Policy. If you do not agree to the terms of this Policy, please do not use the Website, Platform or any of our Services. Each time you use the Website, Platform or our Services, the current version of the Cookie Policy will apply. ## Cookies[​](#cookies "Direct link to Cookies") When you access our Website, Platform or use our Services, we may collect information regarding your IP address for the purposes of administering the Website or Platform and tracking Website or Platform usage. However, your IP address may vary each time you visit, or it may remain the same depending on the type of Internet connection you are using or the location from which you access the Website or Platform. We may also collect information about the websites that directed you to our Website or Platform after you clicked on a text or banner link or an ad from another website, or the day and time you visited our Website or Platform and how long you spent on the Website or Platform. We aggregate such information to help us to compile reports as to trends and other behavior about users visiting our Website. However, such information is anonymous and cannot be tied directly to you. We may also use “cookies” and your personal information to enhance your experience on the Website, Platform and with the Services and to provide you with personalized offers. A cookie is a small data file placed on your computer's hard drive that contains information that allows us to track your activity on the Website or Platform. The cookie itself does not contain any personal information; however, if you provide us with any personal information, the cookie may act as an identifier to tie your personal information to your IP address or computer. You may choose to delete cookies from your computer's hard drive at any time or to disable cookies on your computer. If you delete cookies that relate to the Website, we may not be able to identify you upon your return to the Website. 
Additionally, if you disable your computer's cookies, you may not be able to access certain features of the Website, Platform or Services that require that cookies be enabled. The Website's or Platform's cookies cannot be used to read data from your hard drive and cannot retrieve information from any other cookies created by other websites. Additionally, our cookies cannot be used as a virus, Trojan horse, worm, or any other malicious tool that could impair your use of your computer. Our cookies are used to help us better understand how you and other users use the Website or Platform, so we can continue to provide a better, more personalized user experience on the Website. We also share website usage information about our Website or Platform with those interested in running targeted promotional campaigns on the Website. For this purpose, we and our advertisers track some of the pages that you visit on the Website through the use of pixel tags (also called clear gifs). ## What types of cookies do we use?[​](#what-types-of-cookies-do-we-use "Direct link to What types of cookies do we use?") ### Strictly Necessary Cookies[​](#strictly-necessary-cookies "Direct link to Strictly Necessary Cookies") These cookies are necessary for the website to function and cannot be switched off in our systems. They are usually only set in response to actions made by you which amount to a request for services, such as setting your privacy preferences, logging in or filling in forms. You can set your browser to block or alert you about these cookies, but some parts of the site will not then work. These cookies do not store any personally identifiable information. ### Performance Cookies[​](#performance-cookies "Direct link to Performance Cookies") These cookies allow us to count visits and traffic sources, so we can measure and improve the performance of our site. They help us to know which pages are the most and least popular and see how visitors move around the site. All information these cookies collect is aggregated and therefore anonymous. If you do not allow these cookies, we will not know when you have visited our site, and will not be able to monitor its performance. ### Functional Cookies[​](#functional-cookies "Direct link to Functional Cookies") These cookies enable the website to provide enhanced functionality and personalisation. They may be set by us or by third party providers whose services we have added to our pages. If you do not allow these cookies then some or all of these services may not function properly. ### Targeting Cookies[​](#targeting-cookies "Direct link to Targeting Cookies") These cookies may be set through our site by our advertising partners. They may be used by those companies to build a profile of your interests and show you relevant adverts on other sites. They do not store directly personal information, but are based on uniquely identifying your browser and internet device. If you do not allow these cookies, you will experience less targeted advertising. ## How long do cookies last?[​](#how-long-do-cookies-last "Direct link to How long do cookies last?") None of our cookies last forever. You can always choose to delete cookies from your computer at any time. Even if you do not delete them yourself, our cookies are set to expire automatically after some time. Some cookies will be deleted as soon as you close your browser (so-called “session cookies”), some cookies will stay on your device until you delete them or they expire (so called “persistent cookies”). 
You can see from the table below the lifespan of each type of cookie that we use; session cookies are those marked with 0 days' expiration, all other cookies are persistent, and you can see the number of days they last before they automatically expire. The expiration periods work on a rolling basis, i.e., each time you visit our website again, the period restarts. | Cookie name | Cookie description | Type | Expiration (in days) | | ---------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | -------------------- | | AWSALB | AWS ELB application load balancer | Strictly necessary | 6 | | OptanonConsent | This cookie is set by the cookie compliance solution from OneTrust. It stores information about the categories of cookies the site uses and whether visitors have given or withdrawn consent for the use of each category. This enables site owners to prevent cookies in each category from being set in the user's browser, when consent is not given. The cookie has a normal lifespan of one year, so that returning visitors to the site will have their preferences remembered. It contains no information that can identify the site visitor. | Strictly necessary | 364 | | AWSALBCORS | This cookie is managed by AWS and is used for load balancing. | Strictly necessary | 6 | | ApifyProdUserId | This cookie is created by Apify after a user signs into their account and is used across Apify domains to identify if the user is signed in. | Strictly necessary | 0 | | ApifyProdUser | This cookie is created by Apify after a user signs into their account and is used across Apify domains to identify if the user is signed in. | Strictly necessary | 0 | | intercom-id-kod1r788 | This cookie is used by Intercom service to identify user sessions for customer support chat. | Strictly necessary | 270 | | intercom-session-kod1r788 | This cookie is used by Intercom service to identify user sessions for customer support chat. | Strictly necessary | 6 | | \_gaexp\_rc | \_ga | Performance | 0 | | \_hjTLDTest | When the Hotjar script executes we try to determine the most generic cookie path we should use, instead of the page hostname. This is done so that cookies can be shared across subdomains (where applicable). To determine this, we try to store the \_hjTLDTest cookie for different URL substring alternatives until it fails. After this check, the cookie is removed. | Performance | 0 | | \_hjSessionUser\_1441872 | Hotjar cookie that is set when a user first lands on a page with the Hotjar script. It is used to persist the Hotjar User ID, unique to that site on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID. | Performance | 364 | | \_hjIncludedInPageviewSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's pageview limit. 
| Performance | 0 | | \_ga | This cookie name is associated with Google Universal Analytics - which is a significant update to Google's more commonly used analytics service. This cookie is used to distinguish unique users by assigning a randomly generated number as a client identifier. It is included in each page request in a site and used to calculate visitor, session and campaign data for the sites analytics reports. By default it is set to expire after 2 years, although this is customisable by website owners. | Performance | 729 | | \_ga\_F50Z86TBGX | \_ga | Performance | 729 | | \_hjIncludedInSessionSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's daily session limit. | Performance | 0 | | \_hjFirstSeen | Identifies a new user's first session on a website, indicating whether or not Hotjar's seeing this user for the first time. | Performance | 0 | | \_gclxxxx | Google conversion tracking cookie | Performance | 89 | | \_hjAbsoluteSessionInProgress | This cookie is used by HotJar to detect the first pageview session of a user. This is a True/False flag set by the cookie. | Performance | 0 | | \_\_hssc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 0 | | \_gaexp | Used to determine a user's inclusion in an experiment and the expiry of experiments a user has been included in. | Performance | 43 | | \_hjIncludedInPageviewSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's pageview limit. | Performance | 0 | | \_gat\_UA-nnnnnnn-nn | This is a pattern type cookie set by Google Analytics, where the pattern element on the name contains the unique identity number of the account or website it relates to. It appears to be a variation of the \_gat cookie which is used to limit the amount of data recorded by Google on high traffic volume websites. | Performance | 0 | | \_\_hstc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 389 | | \_hjIncludedInSessionSample | This cookie is set to let Hotjar know whether that visitor is included in the data sampling defined by your site's daily session limit. | Performance | 0 | | \_hjSession\_1441872 | A cookie that holds the current session data. This ensures that subsequent requests within the session window will be attributed to the same Hotjar session. | Performance | 0 | | \_gid | This cookie name is associated with Google Universal Analytics. This appears to be a new cookie and as of Spring 2017 no information is available from Google. It appears to store and update a unique value for each page visited. | Performance | 0 | | \_gat | This cookie name is associated with Google Universal Analytics, according to documentation it is used to throttle the request rate - limiting the collection of data on high traffic sites. It expires after 10 minutes. | Performance | 0 | | \_\_hssrc | This cookie name is associated with websites built on the HubSpot platform. It is reported by them as being used for website analytics. | Performance | 0 | | ApifyAcqRef | This cookie is used by Apify to identify from which website the user came to Apify. | Performance | 364 | | ApifyAcqSrc | This cookie is used by Apify to identify from which website the user came to Apify. 
| Performance | 364 | | hubspotutk | This cookie name is associated with websites built on the HubSpot platform. HubSpot report that its purpose is user authentication. As a persistent rather than a session cookie it cannot be classified as Strictly Necessary. | Functional | 389 | | \_ALGOLIA | This cookie name is associated with websites built on the HubSpot platform. HubSpot report that its purpose is user authentication. As a persistent rather than a session cookie it cannot be classified as Strictly Necessary. | Functional | 179 | | kvcd | Social Media sharing tracking cookie. | Targeting | 0 | | \_gat\_gtag\_xxxxxxxxxxxxxxxxxxxxxxxxxxx | Google Analytics | Targeting | 0 | | km\_vs | Social Media sharing tracking cookie. | Targeting | 0 | *\*Please note that the table serves for general information purposes. The information included in it may change over time and the table may be updated from time to time accordingly.* --- # Apify Data Processing Addendum Last Updated: January 13, 2025 *** *If you wish to execute this DPA, continue [here](https://eform.pandadoc.com/?eform=5344745e-5f8e-44eb-bcbd-1a2f45dbd692) and follow instructions in the PandaDoc form.* *** This Apify Data Processing Addendum ("**DPA**") forms part of the Apify General Terms of Service and/or the Master Service Agreement ("**Agreement**") between Apify Technologies s.r.o. ("**Apify**") and Customer identified in the Agreement (referred to as the "**Parties**" or individually as the "**Party**"), and sets forth the Parties' obligations with respect to the Processing of Personal Data (definitions below). ## 1. Definitions[​](#1-definitions "Direct link to 1. Definitions") All capitalized terms not otherwise defined herein shall have the meaning set forth in the Agreement or the Data Protection Law, as applicable. In the event of a conflict between the terms of the Agreement and this DPA, the DPA will prevail. 1.1. "**Data Protection Law**" means all applicable laws, regulations, and other legally binding requirements in any jurisdiction relating to privacy, data protection, data security, breach notification, or the Processing of Personal Data, including, to the extent applicable, the General Data Protection Regulation, Regulation (EU) 2016/679 ("**GDPR**"), the United Kingdom Data Protection Act of 2018 ("**UK Privacy Act**"), the California Consumer Privacy Act, Cal. Civ. Code § 1798.100 et seq. and associated amendments and regulations thereto ("**CCPA**"). For the avoidance of any doubt, if Apify's Processing activities involving Personal Data are not within the scope of a given Data Protection Law, such law is not applicable for purposes of this DPA. 1.2. "**EU SCCs**" means the Standard Contractual Clauses issued pursuant to Commission Implementing Decision (EU) 2021/914 of 4 June 2021 on standard contractual clauses for the transfer of personal data to third countries pursuant to Regulation (EU) 2016/679 of the European Parliament and of the Council (available as of the DPA effective date at or any subsequent link published by the competent EU authorities). 1.3. "**Personal Data**" includes "personal data," "personal information," "personally identifiable information," and similar terms, and such terms will have the same meaning as defined by applicable Data Protection Laws, that are Processed by Apify on behalf of Customer in the course of providing Apify Platform and other Services under the Agreement. 1.4. 
"**UK Addendum**" means the International Data Transfer Addendum to the EU Commission Standard Contractual Clauses, as published by the UK Information Commissioner's Office and in force as of 21 March 2022. ## 2. Roles and Details of Processing[​](#2-roles-and-details-of-processing "Direct link to 2. Roles and Details of Processing") 2.1. **Customer as a Controller** 2.1.1. Under this DPA, Customer acts as a Controller or Processor (as applicable) of Personal Data and Apify will act as a (Sub)Processor or Service Provider (as defined in and as applicable under the Data Protection Laws) and will Process Personal Data in connection with the Apify Platform and other Services solely to fulfill Apify obligations to Customer under the Agreement, including this DPA; on Customer's behalf, pursuant to Customer's documented instructions; and in compliance with applicable Data Protection Laws ("**Permitted Purpose**"). 2.1.2. The scope, nature, purposes, and duration of the Processing, the types of Personal Data Processed, and the Data Subjects concerned are set forth in this DPA, including without limitation Schedule C to this DPA. 2.1.3. It is Customer's responsibility to ensure that Customer's instructions comply with Data Protection Laws. Apify is not responsible for determining what laws or regulations apply to Customer's business, or for determining whether Apify Platform or other Services meet the requirements of such laws. Customer will ensure that Processing Personal Data in accordance with its instructions will not cause Apify to violate any law or regulation, including Data Protection Laws. 2.1.4. Unless Parties mutually agree otherwise in writing, Customer shall not provide Apify any Personal Data for Processing that is subject to strict privacy regimes outside of the scope of this DPA, including but not limited to Family Educational Rights and Privacy Act, 20 U.S.C. § 1232g (FERPA), relating to criminal convictions and offenses or Personal Data collected or otherwise Processed by Customer subject to or in connection with FBI Criminal Justice Information Services or the related Security Policy; constituting protected health information subject to the Health Insurance Portability and Accountability Act of 1996 (HIPAA) or data subject to Payment Card Industry Data Security Standard (PCI-DSS). 2.2. **Apify as an Independent Controller**. Apify Processes some Personal Data as an independent Controller. Apify conducts such Processing outside of the scope of this DPA, however, in compliance with Data Protection Laws, and in a manner consistent with the purposes outlined in the [Apify Privacy Policy](https://docs.apify.com/legal/privacy-policy). Those exhaustive purposes are restated here for transparency and convenience: 2.2.1. user accounts provisioning, management and removal, customer support; account, billing, and customer relationship management and related customer correspondence; 2.2.2. complying with and resolving legal obligations, including responding to Data Subject requests for Personal Data Processed by Apify as Controller (for example, website data), tax requirements, agreements, and disputes; 2.2.3. abuse detection, prevention, and protection, and scanning to detect violations of Apify Terms and Conditions and, 2.2.4. creating aggregated statistical data for internal reporting, financial reporting, revenue planning, capacity planning, and forecast modeling (including product strategy). ## 3. Confidentiality of Processing[​](#3-confidentiality-of-processing "Direct link to 3. 
Confidentiality of Processing") 3.1. Apify shall ensure that any person it authorizes to Process the Personal Data (including Apify affiliates and their staff, agents, and subcontractors) (an "**Authorized Person**") shall be subject to a strict duty of confidentiality (whether a contractual duty or a statutory duty), and shall not permit any person to Process the Personal Data who is not under such a duty of confidentiality. 3.2. Apify shall ensure that only Authorized Persons will Process the Personal Data, and that such Processing shall be limited to the extent necessary to achieve the Permitted Purpose. Apify accepts responsibility for any breach of this DPA caused by the act, error or omission of an Authorized Person. ## 4. Security Measures[​](#4-security-measures "Direct link to 4. Security Measures") 4.1. Apify has adopted a variety of administrative, technical, physical, and organizational measures designed to protect the Apify Platform against accidental or unlawful destruction, loss, alteration, disclosure or access (collectively the "**Security Measures**"). 4.2. Apify will maintain its Security Measures to provide a level of protection that is appropriate to the risks concerning confidentiality, integrity, availability, and resilience of our systems and the Apify Platform while also taking into account the state of the art, implementation costs, the nature, scope, and purposes of Processing, as well as the probability of occurrence and the severity of the risk to the rights and freedoms of Data Subjects. Apify Security Measures are described in Schedule D. ## 5. Security Incidents[​](#5-security-incidents "Direct link to 5. Security Incidents") 5.1. Apify will notify Customer without undue delay (and in any event within 72 hours) of any known breach of security leading to the accidental or unlawful destruction, loss, alteration, unauthorized disclosure of, or access to, Personal Data Processed by Apify on Customer's behalf (a "**Security Incident**"). For clarity, the term Security Incident always excludes (a) unsuccessful attempts to penetrate computer networks or servers maintained by or for Apify; and (b) immaterial incidents that occur on a routine basis, such as security scans, brute-force attempts or "denial of service" attacks. 5.2. Apify will also provide reasonable assistance to Customer in its compliance with Customer's Security Incident-related obligations, including without limitation by: 5.2.1. taking steps to mitigate the effects of the Security Incident and reduce the risk to Data Subjects whose Personal Data was involved (such steps to be determined by Apify in its sole discretion); and 5.2.2. providing Customer with the following information, to the extent known: (i) the nature of the Security Incident, including, where possible, how the Security Incident occurred, the categories and approximate number of Data Subjects concerned, and the categories and approximate number of Personal Data records concerned; (ii) the likely consequences of the Security Incident; and (iii) the measures we have taken or propose to take to address the Security Incident, including where appropriate measures to mitigate its possible adverse effects. Where, and in so far as, it is not possible to provide all information at the same time, the initial notification will contain the information then available and further information will, as it becomes available, subsequently be provided without undue delay. 5.3. 
Apify's notification of or response to a Security Incident under this Section is not an acknowledgement of any fault or liability. 5.4. Customer is solely responsible for complying with its obligations under any incident notification laws. Customer must notify Apify promptly about any possible misuse of its user accounts or authentication credentials, or any Security Incident related to Apify Platform or other Services provided by Apify under the Agreement. ## 6. Subprocessors[​](#6-subprocessors "Direct link to 6. Subprocessors") 6.1. Customer authorizes Apify to engage third parties to Process Personal Data ("**Subprocessors**") listed in Schedule E ("**Apify Subprocessor(s)**"), provided that Apify provides at least ten (10) days' prior written notice of the addition of any Subprocessor (including the categories of Personal Data Processed, details of the Processing it performs or will perform, and the location of such Processing) by means of a notice on the Apify Subprocessors website. 6.2. Apify encourages Customer to periodically review the Apify Subprocessors website for the latest information on Apify Subprocessors, and especially before Customer provides Apify with any Personal Data. The Apify Subprocessors website contains a mechanism to subscribe to notifications of updates to the Subprocessor list, and Apify will provide details of any such changes solely via this subscription mechanism. Customer has the opportunity to object to such changes within ten (10) days after written notification. Suppose Customer objects to Apify's appointment of a new Subprocessor on reasonable grounds relating to the protection of its Personal Data. In that case, the Parties will promptly confer and discuss alternative arrangements to enable Apify to continue Processing of Personal Data. 6.3. In all cases, Apify shall impose in writing the same data protection obligations on any Subprocessor it appoints as those provided for by this DPA and Apify shall remain liable for any breach of this DPA that is caused by an act, error or omission of its Subprocessor to the extent it is liable for its own acts and omissions under the Agreement. ## 7. International Data Transfers[​](#7-international-data-transfers "Direct link to 7. International Data Transfers") 7.1. Customer appoints Apify to transfer Personal Data to the United States or any other country in which Apify or its Subprocessors operate as specified hereunder, and to store and Process Personal Data for Permitted Purpose, subject to the safeguards below and described elsewhere in this DPA. 7.2. Where Apify engages in an onward transfer of Personal Data, Apify shall ensure that, where legally required, a lawful data transfer mechanism is in place prior to transferring Personal Data from one country to another. 7.3. To the extent legally required, the EU SCCs form part of this DPA and will be deemed completed as set forth in Schedule A. In the event of a conflict between the terms of the EU SCCs and this DPA, the EU SCCs will prevail. 7.4. If, as a Controller, the Customer is situated in the United Kingdom (UK), the EU SCCs shall apply together with the UK Addendum to the SCCs, as specified in Schedule A, in relation to the transfer of Personal Data from the United Kingdom and shall be incorporated in this DPA. ## 8. Auditing Compliance[​](#8-auditing-compliance "Direct link to 8. Auditing Compliance") 8.1. 
Upon Customer's written request, and no more than once per twelve (12) calendar months, Apify will provide Customer with its most recent security review reports and/or applicable certifications for the Apify Platform and provide reasonable assistance and information to Customer to understand the information in such reports. 8.2. If Customer has a reasonable objection that the information provided is not sufficient to demonstrate Apify compliance with this DPA, Customer may conduct an audit, or select a mutually-agreed upon third-party to conduct an audit, of Apify practices related to Processing Personal Data in compliance with this DPA, at Customer's sole expense (an "**Audit**"). General compliance Audits shall occur not more than once every twelve (12) calendar months. 8.3. To the extent you use a third-party representative to conduct the Audit, Customer will ensure that such third-party representative is bound by obligations of confidentiality no less protective than those contained in this DPA and the Agreement. Customer will provide Apify with at least thirty (30) days prior written notice of its intention to conduct an Audit. Before any Audit, the Parties will mutually agree upon the scope, timing, and duration of the Audit, as well as the Apify reimbursement rate for which Customer will be responsible. All reimbursement rates will be reasonable, taking into account the resources expended by or on behalf of Apify. 8.4. Customer and its third-party representatives will conduct Audits: (i) acting reasonably, in good faith, and in a proportional manner, taking into account the nature and complexity of the Apify Platform; and (ii) in a manner that will result in minimal disruption to Apify's business operations and during Apify's regular business hours. Neither Customer nor its third-party representatives will be entitled to receive data or information of other Apify customers or any other Apify Confidential Information that is not directly relevant for the authorized purposes of the Audit in accordance with this provision. 8.5. Customer will promptly provide Apify with the Audit results upon completion of the Audit. All Audit related materials will be considered "Confidential Information" subject to the confidentiality provisions of the Agreement. ## 9. Personal Data Retention; Return and Destruction[​](#9-personal-data-retention-return-and-destruction "Direct link to 9. Personal Data Retention; Return and Destruction") 9.1. Apify will retain Personal Data in accordance with its standard data retention policies and procedures. Customer shall ensure to retrieve all Personal Data before termination or expiration of the Agreement. If Customer deletes its user account or following the termination of Agreement, Apify will have no obligation to maintain or provide Customer with copies of its Personal Data. 9.2. Except to the extent required otherwise by Data Protection Laws, Apify will, at Customer's choice and upon its written request, return to Customer or securely destroy all Personal Data upon such request or at termination or expiration of the Agreement. Apify will provide Customer with a certificate of destruction only upon Customer's written request. In case of local laws applicable to Apify that prohibit the return or deletion of Personal Data, Apify warrants that it will continue to ensure compliance with this DPA and will only Process the Personal Data to the extent and for as long as required under such local laws. ## 10. 
Data Subject Requests[​](#10-data-subject-requests "Direct link to 10. Data Subject Requests") 10.1. If Apify receives any requests from Data Subjects seeking to exercise any rights afforded to them under Data Protection Laws regarding their Personal Data, it will, to the extent legally permitted, promptly notify Customer or refer the Data Subjects to Customer for handling. Such requests related to Personal Data may include: access, rectification, restriction of Processing, erasure ("right to be forgotten"), data portability, objection to the Processing, or to not be subject to automated individual decision making (each, a "**Data Subject Request**"). 10.2. Apify will not respond to such Data Subject Requests itself, and Customer authorizes Apify to redirect the Data Subject Request as necessary to Customer for handling. If Customer is unable to directly respond to a Data Subject Request made by a Data Subject itself, Apify will, upon Customer's request, use commercially reasonable efforts to assist Customer in responding to the Data Subject Request, to the extent Apify is legally permitted to do so and the response to such Data Subject Request is required under Data Protection Laws. 10.3. To the extent legally permitted, Customer will be responsible for any costs arising from Apify's provision of this additional support to assist Customer with a Data Subject Request. ## 11. Data Protection Impact Assessment[​](#11-data-protection-impact-assessment "Direct link to 11. Data Protection Impact Assessment") 11.1. Apify will provide reasonable assistance to and cooperation with the other party for their performance of a data protection impact assessment or privacy impact assessment of Processing or proposed Processing activities, when required by applicable Data Protection Laws. ## 12. General Cooperation to Remediate[​](#12-general-cooperation-to-remediate "Direct link to 12. General Cooperation to Remediate") 12.1. If Apify believes or becomes aware that (i) its Processing of the Personal Data is likely to result in a high risk to the data protection rights and freedoms of Data Subjects; (ii) it can no longer meet its obligations under this DPA or applicable Data Protection Laws; or (iii) in its opinion an instruction from Customer infringes applicable Data Protection Laws; it shall promptly inform Customer of the same and await Customer's further instructions. Apify shall, taking into account the nature of Processing and the information available to Apify, provide Customer with all such reasonable and timely assistance as Customer may require in order to conduct a data protection impact assessment, and, if necessary, to consult with its relevant data protection authority. 12.2. Each Party shall promptly notify the other Party of any proceedings, in particular administrative or court proceedings, relating to Personal Data Processing hereunder, and of any administrative decision or judgment concerning the Processing of that Personal Data, as well as of any inspections pertaining to Personal Data Processing. 12.3. In the event that Data Protection Law, or a data protection authority or regulator, provides that the transfer or Processing of Personal Data under this DPA is no longer lawful or otherwise permitted, then the Parties shall agree to remediate the Processing (by amendment to this DPA or otherwise) in order to meet the necessary standards or requirements. ## 13. Representations and Warranties; Liability[​](#13-representations-and-warranties-liability "Direct link to 13. 
Representations and Warranties; Liability") 13.1. Customer represents and warrants that it is authorized to enter into this DPA, issue instructions, and make and receive any communications or notifications in relation to this DPA on behalf of Customer Affiliates. Customer further represents and guarantees that it has acquired all necessary consents from the Data Subjects for the Processing of their Personal Data or is subject to any other lawful basis under the applicable Data Protection Laws. Customer is fully responsible for compliance of the instructions, requests and recommendations issued to Apify with the Permitted Purpose of the Processing and any applicable Data Protection Laws. 13.2. Each Party represents, warrants, and covenants that it understands and will comply with the restrictions and obligations set forth in this DPA. Each Party further represents, warrants, and covenants that it will comply with all Data Protection Laws applicable to such Party in its role as Data Controller, Business, Data Processor, Service Provider, or Subprocessor (as applicable under Data Protection Laws). 13.3. Customer agrees to indemnify and hold Apify harmless against all claims, actions, third-party claims, losses, damages and expenses incurred by Apify in its capacity as Processor of the Personal Data of the Customer arising from (i) any Security Incident in terms of this Agreement if such Security Incident was caused by the Customer or (ii) any negligent act or omission by Customer in the exercise of the rights granted to it under the Privacy Protection Law and arising directly or indirectly out of or in connection with a breach of this DPA. 13.4. Except for Customer's indemnification obligations hereunder, each Party’s liability arising out of or related to this DPA is subject to the liability limitation provisions of the Agreement, and any reference in such section to the liability of a Party means the aggregate liability of that Party under the Agreement and this DPA together. ## 14. Final Provisions[​](#14-final-provisions "Direct link to 14. Final Provisions") 14.1. This DPA is effective from the date of its execution or from the Effective Date of the Agreement, which incorporates the DPA. The obligations placed upon Apify under this DPA shall survive so long as Apify and/or its Subprocessors Process Personal Data as described herein and/or under the terms of the Agreement. 14.2. Apify may update this DPA from time to time as laws, regulations, and industry standards evolve, or as Apify makes changes to its business or the Apify Platform. 14.3. If Apify makes changes that materially change the Parties’ rights or obligations under this DPA, Apify will provide additional notice in accordance with applicable legal requirements, such as via our website or through the Apify Platform. By continuing to access and use the Apify Platform and other Services after the "last updated" date of the revised DPA, Customer agrees to be bound by the revised DPA. 14.4. If any provision hereof is deemed to be invalid or unenforceable for any reason, all other provisions shall remain in force and the Parties shall be obliged to replace such invalid (unenforceable) provisions at the request of either Party with a provision which is valid and the economic effect of which is as close as possible to the economic effect of the replaced provision. 
**Schedules**: Schedule A: EU SCCs & UK Addendum Schedule B: CCPA Additional Terms Schedule C: Details of Processing Schedule D: Security Measures Schedule E: List of Apify Subprocessors ## Schedule A: EU SCCs and UK Addendum[​](#schedule-a-eu-sccs-and-uk-addendum "Direct link to Schedule A: EU SCCs and UK Addendum") Article 46 of the GDPR requires that a Processor that transfers data outside of the EEA to a non-adequate country must utilize a safeguard. Therefore, where: (a) Customer is not established in the EU and Personal Data Processing by Customer is not subject to GDPR (pursuant to Article 3(2) thereof); and (b) GDPR applies to international data transfer from EEA to countries outside the EEA (where Apify is involved in Processing data within the EEA on behalf of Customer); and (c) an international transfer of Personal Data cannot take place on the basis of an adequacy decision pursuant to Art 45 (3) GDPR; Parties will comply with the obligations in the EU SCCs, which shall form an integral part of this Addendum. Any undefined capitalized terms used in this Schedule A have the meanings assigned to such terms in the EU SCCs. For the purposes of EU Standard Contractual Clauses: 1.1. Module Four of the EU SCCs will apply. 1.2. The docking option under Clause 7 (Optional - Docking Clause) will not apply. 1.3. Clause 17 (Governing law) shall be completed as follows: "These Clauses shall be governed by the law of a country allowing for third-party beneficiary rights. The Parties agree that this shall be the law of the Czech Republic." 1.4. Clause 18 (Choice of forum and jurisdiction) shall be completed as follows: "Any dispute arising from these Clauses shall be resolved by the courts of the Czech Republic." ### Annex I(A): List of Parties[​](#annex-ia-list-of-parties "Direct link to Annex I(A): List of Parties") Data exporter: Name: Apify Technologies s.r.o. Address: Vodičkova 704/36, Nové Město, 110 00 Praha 1 Contact person’s name, position and contact details: Apify Privacy Team, privacy\[at]apify\[dot]com Activities relevant to the data transferred under these Clauses: Processing necessary to provide the Apify Platform and other Services by Apify to Customer and for any disclosures of Personal Data in accordance with the Agreement. Role: Processor or Subprocessor, as applicable Data importer: Name: Customer's name identified in the Agreement Address: Customer's address as provided in the Agreement Contact person’s name, position and contact details: As provided in Customer's user account at Apify Platform Activities relevant to the data transferred under these Clauses: Processing necessary to provide the Apify Platform and other Services by Apify to Customer and for any disclosures of Personal Data in accordance with the Agreement. Role: Controller or Processor, as applicable Annex I(B): Description of Processing & Transfer As provided in Schedule C to this DPA. ### UK Addendum[​](#uk-addendum "Direct link to UK Addendum") In relation to Personal Data that is protected by the UK GDPR, the UK Addendum will apply, completed as follows: Module Four of the EU SCCs shall also apply to transfers of such Personal Data, subject to sub-section (b) below; Tables 1 to 3 of the UK Addendum shall be deemed completed with relevant information from the EU SCCs, completed as set out in Schedule A of this DPA, and the option "neither party" shall be deemed checked in Table 4; and the start date of the UK Addendum (as set out in Table 1) shall be the date of this DPA. 
## Schedule B: CCPA Additional Terms[​](#schedule-b-ccpa-additional-terms "Direct link to Schedule B: CCPA Additional Terms") If and to the extent Apify is Processing Personal Data within the scope of the CCPA on Customer's behalf and in accordance with Customer's documented instructions, Apify will not: (a) sell the Personal Data as the term "selling" is defined in the CCPA; (b) share, rent, release, disclose, disseminate, make available, transfer, or otherwise communicate orally, in writing, or by electronic or other means, the Personal Data to a third party for cross-context behavioral advertising, whether or not for monetary or other valuable consideration, including transactions for cross-context behavioral advertising in which no money is exchanged; (c) retain, use, or disclose the Personal Data for any purpose other than for the business purposes specified in this DPA and the Agreement, or as otherwise permitted by the CCPA; (d) retain, use, or disclose the Personal Data outside of the direct business relationship with Customer; or (e) combine the Personal Data with personal information that it receives from or on behalf of a third party or collects from California residents, except that Apify may combine Personal Data to perform any business purpose as permitted by the CCPA or any regulations adopted or issued under the CCPA. The Parties acknowledge and agree that the exchange of Personal Data between them does not constitute a "sale" of Personal Data under the CCPA and does not form part of any monetary or other valuable consideration exchanged between them with respect to the Agreement or this DPA. ## Schedule C: Details of Processing[​](#schedule-c-details-of-processing "Direct link to Schedule C: Details of Processing") ### Categories of Data Subjects[​](#categories-of-data-subjects "Direct link to Categories of Data Subjects") Data Subjects may be any individuals about which Customer collects and instructs Apify to Process Personal Data, including its prospects, customers, vendors, employees, contact persons, website users, etc. ### Categories of Personal Data[​](#categories-of-personal-data "Direct link to Categories of Personal Data") Categories of Personal Data collected are solely at Customer's own discretion, resulting from Customer's use of Apify Platform and other Services, and may include name, title, contact details, ID data, professional or personal life data, connection data, localization data, etc. ### Sensitive Data Transferred[​](#sensitive-data-transferred "Direct link to Sensitive Data Transferred") Customer agrees not to transfer sensitive data without informing Apify. Transfer of sensitive data, if applicable and agreed upon in the Agreement, is done subject to additional safeguards that fully take into account the nature of such data and risks involved. ### Frequency of the Transfer[​](#frequency-of-the-transfer "Direct link to Frequency of the Transfer") Continuous during the term of the DPA. ### Nature of Processing[​](#nature-of-processing "Direct link to Nature of Processing") The nature of processing is storage and retrieval of Personal Data relating to the provision of Apify Platform and other Services by Apify to Customer. ### Purpose of Processing[​](#purpose-of-processing "Direct link to Purpose of Processing") As specified in Section 2.1.1. of the DPA above. 
### The period for which the personal data will be retained, or, if that is not possible, the criteria used to determine that period[​](#the-period-for-which-the-personal-data-will-be-retained-or-if-that-is-not-possible-the-criteria-used-to-determine-that-period "Direct link to The period for which the personal data will be retained, or, if that is not possible, the criteria used to determine that period")

As described in Section 9 of the DPA.

### For transfers to (sub-) Processors, also specify subject matter, nature and duration of the processing[​](#for-transfers-to-sub--processors-also-specify-subject-matter-nature-and-duration-of-the-processing "Direct link to For transfers to (sub-) Processors, also specify subject matter, nature and duration of the processing")

The Personal Data are transferred to further Subprocessors for the purposes of provision of infrastructure and/or software as a service in relation to the Permitted Purpose, for as long as needed in order to deliver the functionality.

## Schedule D: Security Measures[​](#schedule-d-security-measures "Direct link to Schedule D: Security Measures")

Apify shall implement appropriate technical and organizational measures in accordance with Data Protection Laws to ensure a level of security appropriate to the risk, which may include as appropriate: (a) the encryption of personal data; (b) the ability to ensure the ongoing confidentiality, integrity, availability and resilience of Processing systems and services; (c) the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident; (d) a process for regularly testing, assessing and evaluating the effectiveness of technical and organizational measures for ensuring the security of the Processing.

## Schedule E: List of Apify Subprocessors[​](#schedule-e-list-of-apify-subprocessors "Direct link to Schedule E: List of Apify Subprocessors")

The list of Apify Subprocessors is available at:

---

# Apify Open Source Fair Share Program Terms and Conditions

Effective Date: February 14, 2025

***

We offer you the opportunity to enroll in our Apify Open Source Fair Share Program ("**Fair Share Program**"), which is subject to the following Apify Open Source Fair Share Program Terms and Conditions ("**Fair Share Program Terms**"). The Fair Share Program is further governed by the [Affiliate Program Terms and Conditions](https://docs.apify.com/legal/affiliate-program-terms-and-conditions.md) and, to the extent applicable, by the [Apify Store Publishing Terms and Conditions](https://docs.apify.com/legal/store-publishing-terms-and-conditions.md), which are both incorporated herein by reference. In case of a conflict, these Apify Fair Share Program Terms shall prevail. Terms starting with a capital letter used in these Fair Share Program Terms have the meaning defined either here or in the Affiliate Program Terms and Conditions.

## 1. Eligibility[​](#1-eligibility "Direct link to 1. Eligibility")

The Fair Share Program is open to maintainers of GitHub repositories with projects suitable for web automation, data extraction, or related purposes (“**you**” or “**Participant**”). Participation is subject to review and approval by Apify.

## 2. Tiers[​](#2-tiers "Direct link to 2. Tiers")

The Fair Share Program offers three different tiers, each with varying levels of involvement and benefits:

### 2.1. Passive Tier[​](#21-passive-tier "Direct link to 2.1. Passive Tier")

2.1.1. **Joining Passive Tier**.
You can join the Passive Tier only by accepting a pull request from Apify for your GitHub repository. To participate, your GitHub repository must have the GitHub Sponsor button enabled. Participants in the Passive Tier will not have access to the FirstPromoter account and, therefore, will not have visibility into the traffic or detailed performance metrics. 2.1.2. **Commission**. Notwithstanding anything to the contrary in Section 4.1. of the Affiliate Program Terms, the Commission shall be calculated as 10% of all payments made by each Referred Customer to Apify in the first 3 months from the date when that Referred Customer started paying for Services (as defined in the General Terms) and then increased to 20% for all payments made by each Referred Customer to Apify, up to 2,500 USD per Referred Customer. 2.1.3. **Payment Terms**. Notwithstanding anything to the contrary in Sections 5.2. and 5.3, Commission for Referred Customers in the Passive Tier is paid monthly, exclusively via the GitHub Sponsor button. ### 2.2. Maintainer Tier[​](#22-maintainer-tier "Direct link to 2.2. Maintainer Tier") 2.2.1. **Joining Maintainer Tier**. To join the Maintainer Tier, you must first: (i) [join our Affiliate Program](https://apify.firstpromoter.com/signup/28997), (ii) [create an Account](https://console.apify.com/sign-up) at apify.com, and (iii) either successfully claim ownership of the Actor in Apify Store or link your GitHub OSS Public Repository containing an Actor code to the same Actor in Apify Store, subject to the Apify Store Publishing Terms. 2.2.2. **Commission & Remuneration**. In the Maintainer Tier you may receive standard Commission as outlined in the Affiliate Program. Additionally, you may also be eligible to receive remuneration under the Apify Store Publishing Terms in case you monetize your Actor. ### 2.3. Active Developer Tier[​](#23-active-developer-tier "Direct link to 2.3. Active Developer Tier") 2.3.1. **Joining Active Developer Tier**. In order to benefit from the Active Developer Tier, you must: (i) join the Maintainer Tier, (ii) monetize your Actor through the Pay-Per-Event monetization model which allows you to set custom pricing for each use of your Actor by Apify Users (you are required to optimize your Actor’s performance and configure it for Pay-Per-Event usage). 2.3.2. **Additional Incentive under Active Developer Tier**. In addition to the benefits arising from the Maintainer Tier, as an Active Developer you may receive a temporary discount on computing resources or other incentives for your open-source Actor subject to a separate agreement with Apify. ## 3. General[​](#3-general "Direct link to 3. General") 3.1. Participants are responsible for any applicable taxes, payment processing fees, or other charges related to receiving Commission under the Fair Share Program. Apify is not responsible for covering such costs. 3.2. Apify reserves the right to modify, suspend, or terminate the Fair Share Program at any time, with or without prior notice. Any changes will be posted on our Website, and continued participation constitutes acceptance of the updated Terms. --- # Apify GDPR Information The European Union (“**EU**”) General Data Protection Regulation (“**GDPR**”) replaces the 1995 EU Data Protection Directive. The GDPR strengthens the rights that individuals have regarding personal data relating to them and seeks to unify data protection laws across Europe, regardless of where that data is processed. 
Besides strengthening and standardizing user data privacy across the EU nations, the GDPR requires new or additional obligations from all organizations that handle the personal data of EU citizens, regardless of where the organizations themselves are located. Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg. no. 04788290 (also referred to as “**Apify**”, “**we**”, “**us**” or “**our**”), is deeply committed to providing its users with maximum security and privacy, and to complying with the GDPR. On this page, we will explain our methods to achieve GDPR compliance, both for ourselves and for our customers.

Please note that this document is not a legal contract - the legal relationship between Apify and its users is governed by the [General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) and Apify's [Privacy Policy](https://docs.apify.com/legal/privacy-policy.md). The Privacy Policy specifies in detail the extent of personal data we collect and process, the retention period of personal data and other details regarding personal data processing.

## Apify as the data controller[​](#apify-as-the-data-controller "Direct link to Apify as the data controller")

Apify acts as the data controller for the personal data we collect about you, the user of our Website, Platform or Services. Apify can also use various suppliers who process personal data as processors. These are always entities that provide sufficient guarantees for the implementation of appropriate technical and organizational measures to ensure proper and sufficient protection of your personal data (see also further below). Upon request, we will provide you with information about specific processors. Apify may also transmit personal data to state authorities or third parties if it has such a statutory obligation or is allowed to do so under statutory regulations.

First and foremost, we process data that is necessary for us to perform our contract with you (GDPR Article 6(1)(b)). Second, we process data to meet our obligations under the law (GDPR Article 6(1)(c)) — this primarily involves financial data and information that we need to meet our accountability obligations under the GDPR. Third, we process your personal data for our legitimate interests in line with GDPR Article 6(1)(f).

### What are these ‘legitimate interests’?[​](#what-are-these-legitimate-interests "Direct link to What are these ‘legitimate interests’?")

* Improving our Website, Platform and Services to help you reach new levels of productivity.
* Making sure that your data and Apify's systems are safe and secure.
* Responsible marketing of our product and its features.

### What rights do you have in connection with your personal data processing?[​](#what-rights-do-you-have-in-connection-with-your-personal-data-processing "Direct link to What rights do you have in connection with your personal data processing?")

1. **Right of access to personal data:** you have the right to obtain information on whether personal data is processed and, if so, the right to access this personal data.
2. **Right to rectification of inaccurate personal data and the right to have incomplete personal data completed:** if you believe that Apify processes inaccurate or incomplete personal data, you have the right to rectification of inaccurate data and the right to have incomplete data completed; the rectification or completion takes place without undue delay, and always with regard to technical possibilities.
3. **Right to erasure:** you have the right to have your personal data erased if (i) they are no longer necessary in relation to the purposes for which they were collected or otherwise processed, (ii) the processing was unlawful, (iii) you object to the processing and there are no overriding legitimate grounds for processing your personal data, or the law requires erasure, (iv) we are required to erase data under our legal obligation, or (v) you withdrew your consent to the processing of personal data (if processed based on such consent).
4. **Right to restriction of processing:** if you request to obtain restriction of processing, we are only allowed to store personal data, not further process it, with the exceptions set out in the GDPR. You may exercise the right to restriction in the following cases:
   * If you contest the accuracy of your personal data; in this case, the restrictions apply for the time necessary for us to verify the accuracy of the personal data.
   * If we process your personal data unlawfully, but instead of erasure you request only restriction of their use.
   * If we no longer need your personal data for the above-mentioned purposes of processing, but you request the data for the establishment, exercise, or defense of legal claims.
   * If you object to processing, the data processing is restricted pending verification of whether our legitimate interests override yours.
5. **Right to data portability:** if you wish us to transmit your personal data to another controller, you may exercise your right to data portability, if technically feasible. If the exercise of your right would adversely affect the rights and freedoms of other persons, we will not be able to comply with the request.
6. **Right to object:** you have the right to object to the processing of personal data which are processed for the purpose of protecting our legitimate interests or for the purpose of fulfilling a task performed in the public interest or in the exercise of public power. If Apify does not prove that there is a justified legitimate reason for the processing which overrides your interest or rights and freedoms, we shall terminate the processing on the basis of the objection without undue delay.
7. **Right to file a complaint:** you can file a complaint with the Office for Personal Data Protection if you claim that the processing of your data has violated your right to personal data protection or related legislation, including the above-mentioned rights. The Office for Personal Data Protection is located at the address Pplk. Sochora 27, 170 00 Prague 7. More information about its activities is available on the website .

As the controller for your personal data, Apify is committed to respecting all your rights under the GDPR. If you have any questions or feedback, please reach out to us by email at .
## Apify as the data processor[​](#apify-as-the-data-processor "Direct link to Apify as the data processor")

The data that you collect, store and process using Apify might contain personal data of your data subjects, and you are considered the data controller for this personal data. Using the Apify Platform to process the personal data of your customers means that you have engaged Apify as a data processor to carry out certain processing activities on your behalf. According to Article 28 of the GDPR, the relationship between the controller and the processor needs to be made in writing (electronic form is acceptable under subsection (9) of the same Article). This is where our [General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) and [Privacy Policy](https://docs.apify.com/legal/privacy-policy.md) come in. These two documents also serve as your data processing contract, setting out the instructions that you are giving to Apify with regard to processing the personal data you control and establishing the rights and responsibilities of both parties. Apify will only process your data based on your instructions as the data controller.

Apify is based in the EU; therefore, all our customers, including customers in the EU, have a contractual relationship with our EU legal entity, based in the Czech Republic.

### Data transfers[​](#data-transfers "Direct link to Data transfers")

One topic that often comes up with customers is data transfers outside of the European Economic Area (EEA). The GDPR establishes strict requirements for moving data outside of its scope of protection. This is only natural - otherwise it would be impossible for the law to fulfill its purpose. When Apify engages sub-processors outside the EU and EEA, it is our job to ensure that we transfer the data lawfully. We keep an up-to-date list of sub-processors and guarantee that the data is adequately protected even after it leaves the EU or EEA.

To ensure that all the personal data are protected when they are being transferred outside the EU or EEA, Apify is using a legal mechanism based on Standard Contractual Clauses (SCC) approved by the European Commission. These clauses provide sufficient safeguards for the protection of data when they are being transferred internationally. The SCC represents a contractual commitment between Apify, as a company transferring the data, and a third-party service provider, which is obliged to maintain the personal data secure and protected at all times.

Hopefully this information helps you to better navigate the EU's data protection requirements. If you have any questions with regard to the above, you are welcome to reach out to us at and we will do our best to explain things further. Upon signing a non-disclosure agreement with us, we will provide you with our GDPR Audit document where we provide the list of sub-processors, information about what kind of data we send to them and how we verify their compliance with the GDPR.

## What Apify is doing for GDPR compliance[​](#what-apify-is-doing-for-gdpr-compliance "Direct link to What Apify is doing for GDPR compliance")

As a company based in Europe, Apify is very much up to speed with the implications that the EU General Data Protection Regulation has for businesses. We appreciate the privacy needs of our users as well as their customers and, as such, have implemented — and will continue to improve — technical and organizational measures in line with the GDPR to safeguard the personal data processed by Apify.
### Internal processes, security and data transfers[​](#internal-processes-security-and-data-transfers "Direct link to Internal processes, security and data transfers")

A large part of GDPR compliance is making sure that there are procedures in place that ensure that data processes are mapped and auditable. We have added elements to our application development cycle to build features in accordance with the principles of Privacy by Design. Any access to the personal data that we process on your behalf is strictly limited. Our internal procedures and logs make sure that we meet the GDPR accountability requirements in this regard. We have established a process for onboarding third-party service providers and adopting tools that makes sure that these third parties meet the high expectations that Apify and its customers have when it comes to privacy and security.

### Readiness to comply with subject access requests[​](#readiness-to-comply-with-subject-access-requests "Direct link to Readiness to comply with subject access requests")

Data subjects' ownership of their personal data is at the heart of the GDPR. We have created a readiness to respond to data subject requests to delete, modify, or transfer their data. This means that our Customer Support specialists along with the Engineers that assist them in their work are well-prepared to help you in any matters involving your personal data, in addition to providing the excellent customer support experience that you are accustomed to.

### Documentation[​](#documentation "Direct link to Documentation")

Our [General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) and [Privacy Policy](https://docs.apify.com/legal/privacy-policy.md) are constantly being revised to increase transparency and to make sure the documents meet GDPR requirements. As these are the basis for our relationship with you, it is very important for us to comprehensively and openly explain our commitments and your rights in these documents. Additionally, we are constantly mapping all our data processing activities to be able to comply with the GDPR accountability requirements.

### Training[​](#training "Direct link to Training")

All of the above is supported by extensive training efforts within the company so that the GDPR-compliant processes we have put in place are followed. Sessions on data privacy and security are an integral part of our onboarding process, and each department receives training that is tailored to their work involving personal data.

---

# Apify General Terms and Conditions

Effective date: May 14, 2024

***

Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg. no. 04788290, recorded in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 ("**we**", "**us**" or the "**Provider**") welcomes you ("**you**" or the "**User**") on our website apify.com (the "**Website**"). These terms and conditions of use (the "**Terms**") shall govern your access to the Website, its use, and use of the Platform and the Services (as defined below).

## Summary[​](#summary "Direct link to Summary")

The Terms are the key document governing the relationship between you and us, so please read the whole text of the Terms. For your convenience, we have presented these terms in a short non-binding summary followed by the full legal terms.

| Section | What can you find there? |
| ------- | ------------------------ |
| [1. Acceptance of these Terms](#1-acceptance-of-these-terms) | These Terms become a binding contract at the moment you sign-up on our Website. |
| [2. Our Services](#2-our-services) | Overview of the Services that we are providing. |
| [3. User Account](#3-user-account) | In order to use our Services you will create a user account. You must use true and accurate information when creating a user account. |
| [4. License and Copyright](#4-license-and-copyright) | In order to use our Services, we are granting you a non-exclusive license. You must respect and adhere to copyright law protecting our Website, Platform and/or Services. |
| [5. Terms of Use of the Website, Platform, and Services](#5-terms-of-use-of-the-website-platform-and-services) | Using our Platform or Services for illegal or illegitimate purposes is prohibited. Should you use the Platform and/or the Services in breach of the Terms or the law, we will not be liable for any damage caused as a result. You must not allow anyone else to use your user account. |
| [6. Liability](#6-liability) | We are limiting our liability towards you in relation to specified situations. |
| [7. Warranty](#7-warranty) | We are providing our Services as is and do not make any representations, warranties or guarantees to you. |
| [8. Limitation of Liability; Force Majeure](#8-limitation-of-liability-force-majeure) | We are limiting the liability of both parties - you and us - for less predictable legal types of damages. Our overall limit of liability is set as the aggregate of amounts paid by you for the Services. Neither of us shall be liable for situations beyond one's reasonable control (force majeure). |
| [9. Your Obligation to Indemnify](#9-your-obligation-to-indemnify) | If you use our Services in breach of these Terms and as a result a third party will claim our liability, you agree to indemnify us against any such claim. |
| [10. Legal Disclaimer and Warning](#10-legal-disclaimer-and-warning) | If any of your targeted websites demands that we cease the respective automated activities or processes, we may need to suspend your use of the Services accordingly and/or abide by any related court order that we receive. |
| [11. Payment Terms](#11-payment-terms) | Find out about the payment terms for the Services; the up-to-date prices can be found at apify.com/pricing. |
| [12. Personal Data Protection, Commercial Offers, Confidentiality](#12-personal-data-protection-commercial-offers-confidentiality) | Personal Data Protection is regulated by the Privacy Policy available at apify.com/privacy-policy. For the duration of our agreement and 2 years after, you and us both agree to maintain mutual confidentiality. |
| [13. Term, Amendment, and Termination](#13-term-amendment-and-termination) | The Terms remain effective until your account is terminated. We can terminate the agreement immediately in case of non-payment for the Services or breach of the Terms and/or Privacy Policy. We may amend the Terms in the future; in such case, you will be notified 30 days in advance. |
| [14. Final Provisions](#14-final-provisions) | These Terms are governed by Czech law. 'Apify' is a registered trademark. |
## 1. Acceptance of these Terms[​](#1-acceptance-of-these-terms "Direct link to 1. Acceptance of these Terms")

By clicking the button “Sign up” during registration on our Website, you claim that you are over 18 years old and agree to adhere to these Terms and also to our:

* [Privacy Policy](https://docs.apify.com/legal/privacy-policy.md), and
* [Acceptable Use Policy](https://docs.apify.com/legal/acceptable-use-policy.md).

If you act on behalf of a company when accepting these Terms, you also declare to be authorized to perform such legal actions on behalf of the company (herein the term “you” shall mean the relevant company). You agree to adhere to these Terms also by the fact that you visit or use the Website, Platform (as defined below), any software, application or any other service running or available on the Platform or a service we provide or make accessible to you.

## 2. Our Services[​](#2-our-services "Direct link to 2. Our Services")

By means of our Website or by other means, we enable you to use the computer platform “Apify” (the "**Platform**") and some other services and functions (the "**Services**"). The Platform is a computer system operated by the Provider, remote access to which is available by means of servers within the apify.com domain or other domains (so-called “cloud system”). In addition to other things, the Platform makes it possible to:

* Download and extract structured data from the websites;
* Download, save, process, and publish data;
* Create, run, amend, and publish software programs and applications;
* Provide servers for remote access to other servers (so-called “proxy servers” or “proxies”);
* Publish the public user profile, including data.

Our other Services also include:

* Creation and adjustment of the Platform configuration and configuration of software programs or applications running on the Platform (for example, so-called “Actors”) according to your instructions, for instance, in order to perform the extraction of data specified by you from the websites designated by you (the "**Configuration**");
* Assistance to users with respect to the use of the Platform, including the analysis of data sources on third-party websites;
* Downloading, saving, processing, and publishing of your data, according to your instructions;
* Switching the Configuration or applications running on the Platform on or off whether manually or automatically, and their monitoring.

Some functions of the Platform as well as some other Services may be used free of charge upon registration on the Website (i.e., upon clicking “Sign up”). Use of other functions of the Platform and Services within the expanded Configuration is available against payment as is managed hereby and by the terms stipulated on the Website.

## 3. User Account[​](#3-user-account "Direct link to 3. User Account")

Upon registration on the Website, a user account shall be created for you. During Sign-up as well as at any later time you use the account and/or communicate with us, you are obliged to provide us with accurate and true data. Should any of the identification information be amended in the future, your obligation is to update such information in the Platform or in your user account without undue delay. You are also obliged to protect the login details of your account and to prevent any third parties from using your account.
We shall not be liable for any misuse of your account due to the breach of the aforementioned obligations. Moreover, we shall be authorised to block your account or delete it completely including all the data provided therein; you agree that such action shall be followed by no compensation. Unless otherwise agreed by us in writing (i.e. also by email), each user may create and use one user account at the most. Creating (directly or by means of a third party) and/or using multiple personal accounts (even for the use of various email addresses) without our written (incl. email) consent shall be considered a breach of this section with the same consequences as stipulated in the paragraph above. Use of organization accounts shall not be limited by this paragraph.

You shall: (i) notify us immediately of any unauthorized use of any password or account or any other known or suspected breach of security; (ii) report to us immediately and use reasonable efforts to stop immediately any copying or distribution of content or infringement of our Platform, Website, Configurations and/or Services that is known or suspected by you; and (iii) not impersonate another of our users or provide false identity information to gain access to or use the Platform, Website, Configurations or Services.

You acknowledge that when you contact our support team and request their help with some issue you are having with our Platform, Custom Solutions and/or Services, the members of the support team may access your account in order to help you with resolving that issue. The support team limits its actions on your account only to those necessary for providing the requested support.

## 4. License and Copyright[​](#4-license-and-copyright "Direct link to 4. License and Copyright")

You acknowledge that our Website, the Platform (and software and applications associated therewith) and Services provided by us are protected by Act No. 121/2000 Sb., on Copyright and on Related Rights and on Amendments to Certain Acts (the Copyright Act), as amended, and by other related legal regulations (the "**Copyright Act**" or the "**Copyright**"). We alone shall own all right, title and interest, including all intellectual property rights to the Website, Platform and other Services (or any outcome of such Services) within the maximum scope admissible by law (except for the rights that we exercise in compliance with a license granted to us by third parties). Furthermore, we shall own any suggestions, ideas, enhancement requests, feedback, recommendations or other information related to the Website, Platform, and other Services.

You acknowledge that any use of the software, systems, and functionalities of third parties available on the Website or the Platform shall be governed by special license terms of owners of the relevant copyright or open-source licenses that you must adhere to (by using the relevant software, system or functionality, you confirm that you have familiarised yourself with, and agree to, the license terms of the relevant third parties). Should you breach any of those terms, we shall not be liable for any such action.

In order to use the Website, Platform or Services provided by us, we provide you with a non-exclusive license without territorial restriction (i.e. worldwide license).
Within the scope of the license, you may use the Website or the Platform in their unchanged form (excluding the amendments, modifications, and updates of the Website or the Platform performed by us) for the purpose for which the Website or the Platform has been designed (as defined in Article 2 hereof) and in the manner and within the scope of these Terms. We shall grant you the license exclusively for the use of the Website and the Platform on our servers, or the servers maintained by us or designed for such purpose (e.g. Amazon Web Services or GitHub). Within the license you may not modify, adjust or connect the Website or the Platform (or any parts of it whatsoever) or its name with any other copyrighted work or use it in a collective work. Your right to use the Website and the Platform in compliance herewith and within the scope of their common functionality (i.e. to create your own work within their scope, e.g. the Configuration) shall not be affected. Furthermore, you may not create any reproductions of the Website or the Platform (or any parts of it whatsoever), of our software, and of applications (in any form), or disseminate them, except such parts of them that are open-source (i.e. so-called free license). You may neither provide the license as a whole or any authorisations forming part thereof wholly or partially to a third person (to provide a sublicense), nor may you assign any rights and obligations arising from the license.

In addition to the license above, if we are providing you with a Custom Solution then, unless agreed otherwise, we provide you with a non-exclusive license without territorial restriction (i.e. worldwide license) to use, modify and adjust the source code of the Actor(s) provided to you. The intellectual property rights to the source code are not transferred and belong to us or our contractors. You acknowledge that we may use (part of) the source code in other Services, Custom Solutions and/or public Actors in the Apify Store.

The provided license or the Terms shall not transfer any of our intellectual property rights to you (including the rights to trade-marks or brands or names). We neither provide you with any other license than the license specified explicitly hereby. Should you breach these Terms in any way, or should you have breached them in the past, we may reject or disable your license provision, use of the Website, Platform or other Services.

## 5. Terms of Use of the Website, Platform, and Services[​](#5-terms-of-use-of-the-website-platform-and-services "Direct link to 5. Terms of Use of the Website, Platform, and Services")

You may use the Platform and other Services solely for the purposes of data extraction from publicly accessible websites (including websites that may require the user's login) or from other sources you are authorised to access. The Platform or Services functionalities may be used solely on such publicly accessible websites or other sources where their use (including data extraction) is permitted explicitly or where you are authorised to do so. Should you instruct us to use the Platform or the Services on certain publicly accessible websites or other sources and to send you the extracted data subsequently, you hereby declare and guarantee that use of the functionalities of the Platform and/or Services as well as the data extraction on such websites is expressly permitted or authorised to you.
Should the extracted data comprise any sensitive data, confidential data or data protected by the Copyright or by other intellectual property right or any third-party right, you must not breach such rights by using the Platform or the Service in any way. Should you breach this prohibition, or should you use the Platform or Service functionalities directly or by means of our company on websites or other sources that do not permit their use, you shall be fully liable for such a breach and solely responsible for compensation of any damages incurred by and/or any claims of the affected third parties; we shall not be liable for any breach of third-party rights with respect to the usage of the Website, Platform or any Services.

You may not take any actions which could lead to unauthorised use of the Platform or the Services. The actions you are not authorised to take include, but are not limited to, the circumvention, elimination or limitation of any mechanisms possibly serving to the protection of our rights or of any information with respect to the Copyright to the Platform (e.g. our logo or any other designation). You may not allow any third parties to use and/or access the Platform or Services by using your user account. Use of the Platform and Services shall only be possible on the grounds of the license provided by us hereunder. Should you take such unauthorised actions, you agree to compensate us for any damages incurred by us in this respect. Furthermore, during the use of the Website, Platform (or any of its functionalities) and the Services you may not use them in violation of our [Acceptable Use Policy](https://docs.apify.com/legal/acceptable-use-policy.md).

You acknowledge that the Website, Platform or the Services may not be available constantly. We may perform their planned or unplanned downtime in order to perform the inspection, maintenance, update or replacement of hardware or software. Their availability may also be limited due to other reasons, including, but not limited to, power cuts, data network loss, other failures caused by third parties or caused to the devices of third parties or due to Force Majeure. We shall not be liable for the limitation of availability of the Website, Platform or the Services. Provided that we mediate to you provision of services by any third parties, you are obliged to adhere to the terms of use of the respective providers.

## 6. Liability[​](#6-liability "Direct link to 6. Liability")

We are not obliged to verify the manner in which you or other users use the Website, Platform, Configuration or Services and we shall not be liable for the manner of such usage. We assume that you use the Website, Platform and Services legally and ethically and that you have obtained permission, if necessary, to use them on the targeted websites and/or other data sources. We shall not be liable for the outcomes of activities for which you use our Website, Platform, Configuration or Services. Provided that a third-party service or product is established on the Platform or on any of its functionalities, we shall not be liable for such a service or product, their functioning or manner and consequences of their usage. In compliance with the provision of Section 5 of Act No. 480/2004 Sb., on certain Information Society Services and on Amendments to some Acts (Act on Certain Information Society Services), as amended, we shall not be liable for the contents of the information that you save on our Website, Platform or by means of them, or by means of the Configuration.
We shall not be liable for any of your unlawful actions in connection with the usage of the Website, Platform, Configuration or Services with respect to third parties (e.g. breach of intellectual property rights, rights to the name or company name, unfair competition, breach of terms of websites or applications and programs of third parties).

We shall not guarantee or be liable for the availability of the Website, Platform or Services (or products arising therefrom) or for their performance, reliability or responsiveness or any other performance or time parameters. We shall neither be liable for the functionality or availability of the services of other providers that we mediate to you solely. Nor shall we be liable for your breach of the service usage terms of such providers.

## 7. Warranty[​](#7-warranty "Direct link to 7. Warranty")

WE MAKE NO REPRESENTATION, WARRANTY, OR GUARANTY AS TO THE RELIABILITY, TIMELINESS, QUALITY, SUITABILITY, AVAILABILITY, ACCURACY OR COMPLETENESS OF THE PLATFORM AND ITS FUNCTIONALITIES, SERVICES OR ANY CONTENT. WE DO NOT REPRESENT OR WARRANT THAT (A) THE USE OF THE WEBSITE OR PLATFORM OR SERVICES WILL BE COMPLETELY SECURE, TIMELY, UNINTERRUPTED OR ERROR-FREE OR OPERATE IN COMBINATION WITH ANY OTHER HARDWARE, SOFTWARE, SYSTEM OR DATA, (B) THE WEBSITE, PLATFORM (AND ITS FUNCTIONALITIES), CONFIGURATIONS AND SERVICES WILL MEET YOUR REQUIREMENTS OR EXPECTATIONS, (C) ANY STORED DATA WILL BE ACCURATE OR RELIABLE, (D) THE QUALITY OF ANY PRODUCTS, SERVICES, INFORMATION, OR OTHER MATERIAL PURCHASED OR OBTAINED BY YOU THROUGH THE WEBSITE OR PLATFORM, CONFIGURATIONS OR SERVICES WILL MEET YOUR REQUIREMENTS OR EXPECTATIONS, (E) ERRORS OR DEFECTS WILL BE CORRECTED, OR (F) THE SERVICES OR THE SERVER(S) THAT MAKE THE WEBSITE, PLATFORM AND SERVICES AVAILABLE ARE FREE OF VIRUSES OR OTHER HARMFUL COMPONENTS. THE WEBSITE, PLATFORM (AND ITS FUNCTIONALITIES), CONFIGURATIONS AND SERVICES AND ALL CONTENT ARE PROVIDED TO YOU STRICTLY ON AN “AS IS” BASIS. ALL CONDITIONS, REPRESENTATIONS AND WARRANTIES, WHETHER EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT OF THIRD PARTY RIGHTS, ARE HEREBY DISCLAIMED TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW BY US.

We shall not be liable for any defects of the Website, Platform (or its functionalities), Configuration or Services arising due to unauthorised interference with the Website, Platform, Configuration or the use of the Platform or Configuration contradictory hereto. We shall neither be liable for errors with respect to the function or non-function of the Configurations arising due to changes of third-party websites (i.e. the websites from which the relevant Configuration is to extract data) after the Configuration was created.

## 8. Limitation of Liability; Force Majeure[​](#8-limitation-of-liability-force-majeure "Direct link to 8.
Limitation of Liability; Force Majeure") EXCEPT FOR ANY INDEMNIFICATION AND CONFIDENTIALITY OBLIGATIONS HEREUNDER, (i) IN NO EVENT SHALL EITHER PARTY BE LIABLE UNDER THESE TERMS FOR ANY CONSEQUENTIAL, SPECIAL, INDIRECT, EXEMPLARY, OR PUNITIVE DAMAGES WHETHER IN CONTRACT, TORT OR ANY OTHER LEGAL THEORY, EVEN IF SUCH PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES AND NOTWITHSTANDING ANY FAILURE OF ESSENTIAL PURPOSE OF ANY LIMITED REMEDY AND (ii) OUR AGGREGATE LIABILITY TO YOU UNDER THIS AGREEMENT FOR ANY CLAIM IS LIMITED TO THE AMOUNT PAID TO US BY YOU FOR THE SERVICES GIVING RISE TO THE CLAIM. Each party acknowledges that the other party has agreed to these terms relying on the limitations of liability stated herein and that those limitations are an essential basis of the bargain between the parties. Without limiting the foregoing and except for payment obligations, neither party shall have any liability for any failure or delay resulting from any condition beyond the reasonable control of such party, including but not limited to governmental action or acts of terrorism, earthquake or other acts of God, labour conditions, and power failures. ## 9. Your Obligation to Indemnify[​](#9-your-obligation-to-indemnify "Direct link to 9. Your Obligation to Indemnify") You agree to indemnify, defend and hold us, our agents, affiliates, subsidiaries, directors, officers, employees, and applicable third parties (e.g., all relevant partner(s), licensors, licensees, consultants and contractors) (“Indemnified Person(s)”) harmless from and against any third-party claim, liability, loss, and expense (including damage awards, settlement amounts, and reasonable legal fees), brought against any Indemnified Person(s), arising out of your use of the Website, Platform, Configurations or Services and/or your breach of any of these terms. You acknowledge and agree that each Indemnified Person has the right to assert and enforce its rights under this section directly on its own behalf as a third-party beneficiary. ## 10. Legal Disclaimer and Warning[​](#10-legal-disclaimer-and-warning "Direct link to 10. Legal Disclaimer and Warning") We may immediately suspend your use of the Website, Platform, Configurations and/or Services if we are contacted by your targeted data extraction source, website, or repository and asked to cease all extraction activity. If such an event occurs, we will not disclose your information without a court order mandating us to do so unless we in our best judgment determine that there would be an adverse consequence if we do not. If, however, we receive a court order demanding the release of your information to a third party, we will comply. If such action becomes necessary, you agree to indemnify and hold us and (as applicable) our parent(s), subsidiaries, affiliates, officers, directors, agents, contractors and employees, harmless from any claim or demand, including reasonable attorneys' fees, made by any third party arising from any complaint, suit, disagreement or other repercussions resulting from your use of the Website, Platform, Configurations or Services. Should any third party claim its rights against us in connection to your actions, we may immediately eliminate any contents gathered, saved or disseminated by you from servers used by us. 
In the event of a judicial dispute with a third party related to your actions, you are obliged to provide us with all necessary cooperation in order to resolve such a dispute successfully and you are also obliged to reimburse continuously any purposeful expenses arising to us due to such a dispute. With respect to this, should an obligation arise to reimburse any claim of a third party, you agree to pay us the full scope of the damages.

## 11. Payment Terms[​](#11-payment-terms "Direct link to 11. Payment Terms")

Within the scope of your user account on the Website or by means of the Website, you may order a paid license to use the Platform (an extended paid version as opposed to the basic version that is free of charge) or other Services in compliance with the up-to-date information and prices provided for the given license or Service on the Website or as agreed with us individually. Your orders made by means of the Website shall be binding. By entering the relevant details of your debit or credit card during the order-making process, you agree that the price of the ordered license (including the Overages as defined below) or Service (the price of the ordered license and the price of the ordered Service collectively as the “**Fees**”) shall be deducted from the card.

Unless agreed otherwise, you shall pay the Fees on a monthly or yearly basis or another basis selected when making the order (the “**Billing Period**”). The Billing Period shall commence on the date of your first payment. The Fees are payable on the first date of the Billing Period for which they are valid, except for the Overages as provided for in the following paragraph. You agree that the Fees for the relevant Billing Period shall be deducted by us from the debit or credit card the details of which were provided by you when ordering the relevant license.

In case your actual use of the Platform features exceeds your prepaid subscription, you shall be automatically charged for overages in compliance with the up-to-date information and prices provided for the given license or Service on the Website or as agreed with us elsewhere (the “**Overages**”). Overages not exceeding in aggregate a certain threshold (the “**Overage Threshold**”) will be added as a part of the invoice issued by us at the beginning of the next Billing Period. The default value of the Overage Threshold is USD 200 in any given Billing Period, but it may be modified by us in our sole discretion. In case you cancel your subscription, the Overages will be invoiced as a part of the final invoice. Any Overages exceeding in aggregate the Overage Threshold in any given Billing Period may be invoiced separately as soon as the Overage Threshold has been reached. For the avoidance of doubt, you may be invoiced for Overages multiple times if you exceed the Overage Threshold according to the previous sentence multiple times in any given Billing Period.

Should you order an upgrade of the license you are currently using, you shall pay the pro rata difference between the previously paid price of the currently used license and the price of the new license for the rest of the current Billing Period. The upgrade shall be performed immediately upon payment of the difference. Should you select a downgrade of the currently used license, you may not request the return of the paid remuneration for the current Billing Period (or its part); until the end of the current Billing Period you may use the license for the version already paid for.
Upon the effectiveness of the upgrade or downgrade, the remuneration deducted for the following Billing Period shall correspond to the newly selected license version, similarly to the definition above. We are a value-added tax (VAT) payer in compliance with the tax laws of the Czech Republic. VAT shall therefore always be added to the Fees payable in the amount stipulated by the respective tax laws. The Fees shall always be deemed as paid when credited to our bank account or our other electronic accounts held with a provider of payment services (e.g., PayPal or Braintree). We shall not refund any Fees or other amounts paid by you to us.

You hereby agree to pay all applicable Fees and/or charges under these terms, including any applicable taxes or charges imposed by any government entity, and that we may change our pricing at any time. There are no refunds for payments made once Services have been rendered or license provided. If you dispute any charge made for your use of the Platform, licenses or Services, you must notify us in writing within fifteen (15) days of any such charge; failure to so notify us shall result in the waiver by you of any claim relating to any such disputed charge. Charges shall be calculated solely based on invoicing records maintained by us for purposes of billing. No other measurements or statistics of any kind shall be accepted by us or have any effect under these terms.

## 12. Personal Data Protection, Commercial Offers, Confidentiality[​](#12-personal-data-protection-commercial-offers-confidentiality "Direct link to 12. Personal Data Protection, Commercial Offers, Confidentiality")

Personal Data Protection is regulated by the [Privacy Policy](https://docs.apify.com/legal/privacy-policy.md).

During the term of your use of the Website or Platform, licenses granted under these terms and Services and for a period of two (2) years following the termination or expiration of your subscription, each party agrees not to disclose Confidential Information of the other party to any third party without prior written consent except as provided herein (the "**Confidential Information**"). Confidential Information includes (i) subscription account data, including agent definitions, Customer Data, and User Content, (ii) except as provided in subsection (i) above, any other Website, Platform or Services information or access to technology prior to public disclosure provided by us to you and identified at the time of disclosure in writing as "Confidential." Confidential Information does not include information that has become publicly known through no breach by a party, or has been (i) independently developed without access to the other party's Confidential Information; (ii) rightfully received from a third party; or (iii) required to be disclosed by law or by a governmental authority.

## 13. Term, Amendment, and Termination[​](#13-term-amendment-and-termination "Direct link to 13. Term, Amendment, and Termination")

Unless mutually agreed otherwise in writing, the license agreement and the agreement on the provision of other Services concluded by and between us shall be for an indefinite period of time. Either we or you may terminate any such agreement by cancelling your user account on the Platform (user account may be cancelled in the account settings). In the event of the agreement termination or user account cancellation, you shall not be entitled to the refund of any remuneration you already paid to us (see Article 11, Payment Terms).
In the following cases we may further withdraw from an agreement concluded with you (by cancelling your user account) with immediate effect:

* Should you be in delay with payment of any amounts due to us for more than three days;
* Should you breach these Terms and/or Privacy Policy in any way whatsoever (including, but not limited to, the breach of our Copyright, terms of the provided license, terms of use of the Website, Platform or Services, or should you create more than one user account without our prior written consent).

When terminating the license agreement or cancelling your user account, we may eliminate any data entered in the user account and gathered there.

If necessary (including but not limited to the market development, development of the Platform and related technologies and with respect to new risks associated with the Platform and the safety of its use), we may unilaterally amend these Terms. Similarly, with respect to the market development we may unilaterally amend the remuneration amount of the license for use of the Platform or the remuneration for the Services. We shall notify you of such an amendment at least 30 days before its effectiveness by displaying the notification in your user account or otherwise on the Website or the Platform. Should you disagree with such an amendment, you may withdraw from the relevant agreement effective as at the date of the effectiveness of the announced amendments by cancelling your user account (cancellation may be performed in your account settings). In the event of the agreement termination and user account cancellation you shall not be entitled to a refund of any remuneration paid to us already (see Art. 11, Payment Terms). Provided you do not cancel your user account before the effective date of the amendment, you shall be deemed to agree with such an amendment.

## 14. Final Provisions[​](#14-final-provisions "Direct link to 14. Final Provisions")

These Terms and any other contracts and legal relationships concluded by and between us (unless otherwise explicitly provided) shall be governed by the laws of the Czech Republic. Any disputes arising herefrom between us shall be resolved by the courts of general jurisdiction in the Czech Republic. You agree that we may use your name, company name and logo as a reference in all types of promotion materials for marketing purposes free of charge. Apify is a trademark of Apify Technologies s.r.o., registered in the United States and other countries. Provided that we enter into a separate written license agreement or another contract with you, the provisions of which deviate herefrom, such different provisions shall take precedence over the respective provisions of these Terms. Should any of the provisions hereof be ascertained as invalid, ineffective or unenforceable, upon mutual agreement such a provision shall be replaced by a provision whose sense and purpose comes as closely as possible to the original provision. The invalidity, ineffectiveness or unenforceability of one provision shall not affect the validity and effectiveness of the remaining provisions hereof. Unless otherwise provided hereby, any changes and amendments hereto may only be made in writing.

## Version History[​](#version-history "Direct link to Version History")

This is the history of Apify General Terms and Conditions. If you're a new user, the latest Terms apply. If you're an existing user, see the table below to identify which terms and conditions were applicable to you at a given date.
| Version | Effective from | Effective until |
| --- | --- | --- |
| Latest (this document) | May 14, 2024 | |
| [Oct 2022](https://docs.apify.com/legal/old/general-terms-and-conditions-october-2022.md) | October 1, 2022 | June 13, 2024 |
| Older T\&Cs available upon request | | September 30, 2022 |

---

# Apify General Terms and Conditions October 2022

## Version History[​](#version-history "Direct link to Version History")

You are reading terms and conditions that are no longer effective. If you're a new user, the [latest terms](https://docs.apify.com/legal/general-terms-and-conditions.md) apply. If you're an existing user, see the table below to identify which terms and conditions were applicable to you at a given date.

| Version | Effective from | Effective until |
| --- | --- | --- |
| [Latest](https://docs.apify.com/legal/general-terms-and-conditions.md) | May 13, 2024 | |
| Oct 2022 (This document) | October 1, 2022 | June 12, 2024 |
| Older T\&Cs available upon request | | September 30, 2022 |

# Apify General Terms and Conditions October 2022

Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg. no. 04788290, recorded in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 (hereinafter referred to as “**we**” or the “**Provider**”) welcomes you (hereinafter referred to as “**you**” or the “**User**”) on our website apify.com (hereinafter referred to as the “**Website**”). These terms and conditions of use (hereinafter referred to as the “**Terms**”) shall govern your access to the Website, its use, and use of the Platform and the Services (as defined below).

## Summary[​](#summary "Direct link to Summary")

The Terms are the key document governing the relationship between you and us, so please read the whole text of the Terms. For your convenience, below is a short summary of each section of the Terms.

| Section | What can you find there? |
| --- | --- |
| [1. Acceptance of these Terms](#1-acceptance-of-these-terms) | These Terms become a binding contract at the moment you sign up on our Website. |
| [2. Our Services](#2-our-services) | Overview of the Services that we are providing. |
| [3. User Account](#3-user-account) | In order to use our Services you will create a user account. You must use true and accurate information when creating a user account. |
| [4. License and Copyright](#4-license-and-copyright) | In order to use our Services, we are granting you a non-exclusive license. You must respect and adhere to copyright law protecting our Website, Platform and/or Services. |
| [5. Terms of Use of the Website, Platform, and Services](#5-terms-of-use-of-the-website-platform-and-services) | Using our Platform or Services for illegal or illegitimate purposes is prohibited. Should you use the Platform and/or the Services in breach of the Terms or the law, we will not be liable for any damage caused as a result. You must not allow anyone else to use your user account. |
| [6. Liability](#6-liability) | We are limiting our liability towards you in relation to specified situations. |
| [7. Warranty](#7-warranty) | We are providing our Services as is and do not make any representations, warranties or guarantees to you. |
| [8. Limitation of Liability; Force Majeure](#8-limitation-of-liability-force-majeure) | We are limiting the liability of both parties - you and us - for less predictable legal types of damages. Our overall limit of liability is set as the aggregate of amounts paid by you for the Services. Neither of us shall be liable for situations beyond one's reasonable control (force majeure). |
| [9. Your Obligation to Indemnify](#9-your-obligation-to-indemnify) | If you use our Services in breach of these Terms and as a result a third party claims our liability, you agree to indemnify us against any such claim. |
| [10. Legal Disclaimer and Warning](#10-legal-disclaimer-and-warning) | If any of your targeted websites demands that we cease the respective automated activities or processes, we may need to suspend your use of the Services accordingly and/or obey any related court order that we receive. |
| [11. Payment Terms](#11-payment-terms) | Find out about the payment terms for the Services; the up-to-date prices can be found at apify.com/pricing. |
| [12. Personal Data Protection, Commercial Offers, Confidentiality](#12-personal-data-protection-commercial-offers-confidentiality) | Personal Data Protection is regulated by the Privacy Policy available at apify.com/privacy-policy. For the duration of our agreement and 2 years after, you and we both agree to maintain mutual confidentiality. |
| [13. Term, Amendment, and Termination](#13-term-amendment-and-termination) | The Terms remain effective until your account is terminated. We can terminate the agreement immediately in case of non-payment for the Services or breach of the Terms and/or Privacy Policy. We may amend the Terms in the future; in such a case you will be notified 30 days in advance. |
| [14. Final Provisions](#14-final-provisions) | These Terms are governed by Czech law. 'Apify' is a registered trademark. |

## 1. Acceptance of these Terms[​](#1-acceptance-of-these-terms "Direct link to 1. Acceptance of these Terms")

By clicking the button “Sign up” during registration on our Website, you claim that you are over 18 years old and agree to adhere to these Terms and also to the [terms of personal data protection](https://apify.com/privacy-policy) (hereinafter referred to as the “**Privacy Policy**”). If you act on behalf of a company when accepting these Terms and the Privacy Policy, you also hereby declare to be authorized to perform such legal actions on behalf of the company (herein the term “you” shall mean the relevant company). You agree to adhere to these Terms also by the fact that you visit or use the Website, Platform (as defined below), any software, application or any other service running or available on the Platform or a service we provide or make accessible to you.

## 2. Our Services[​](#2-our-services "Direct link to 2. Our Services")

By means of our Website or by other means, we enable you to use the computer platform “Apify” (hereinafter referred to as the “**Platform**”) and some other services and functions (hereinafter referred to as the “**Services**”).
The Platform is a computer system operated by the Provider, remote access to which is available by means of servers within the apify.com domain or other domains (so-called “cloud system”). In addition to other things, the Platform makes it possible to:

* Download and extract structured data from the websites;
* Download, save, process, and publish data;
* Create, run, amend, and publish software programs and applications;
* Provide servers for remote access to other servers (so-called “proxy servers” or “proxies”);
* Publish the public user profile, including data.

Our other Services also include:

* Creation and adjustment of the Platform configuration and configuration of software programs or applications running on the Platform (for example, so-called “Crawlers” or “Acts” or “Actors”) according to your instructions, for instance, in order to perform the extraction of data specified by you from the websites designated by you (hereinafter referred to as the “**Configuration**”);
* Assistance to users with respect to the use of the Platform, including the analysis of data sources on third-party websites;
* Downloading, saving, processing, and publishing of your data according to your instructions;
* Switching the Configuration or applications running on the Platform on or off whether manually or automatically, and their monitoring.

Some functions of the Platform as well as some other Services may be used free of charge upon registration on the Website (i.e. upon clicking “Sign up”). Use of other functions of the Platform and Services within the expanded Configuration is available against payment as is managed hereby and by the terms stipulated on the Website.

## 3. User Account[​](#3-user-account "Direct link to 3. User Account")

Upon registration on the Website, a user account shall be created for you. During Sign-up as well as at any later time you use the account and/or communicate with us, you are obliged to provide us with accurate and true data. Should any of the identification information be amended in the future, your obligation is to update such information in the Platform or in your user account without undue delay. You are also obliged to protect the login details of your account and to prevent any third parties from using your account. We shall not be liable for any misuse of your account due to the breach of the aforementioned obligations. Moreover, we shall be authorised to block your account or delete it completely including all the data provided therein; you agree that such action shall be followed by no compensation. Unless otherwise agreed by us in writing (i.e. also by email), each user may create and use one user account at the most. Creating (directly or by means of a third party) and/or using multiple personal accounts (even for the use of various email addresses) without our written (incl. email) consent shall be considered a breach of this section with the same consequences as stipulated in the paragraph above. Use of organization accounts shall not be limited by this paragraph.
You shall: (i) notify us immediately of any unauthorized use of any password or account or any other known or suspected breach of security; (ii) report to us immediately and use reasonable efforts to stop immediately any copying or distribution of content or infringement of our Platform, Website, Configurations and/or Services that is known or suspected by you; and (iii) not impersonate another of our users or provide false identity information to gain access to or use the Platform, Website, Configurations or Services.

You acknowledge that when you contact our support team and request their help with some issue you are having with our Platform, Custom Solutions and/or Services, the members of the support team may access your account in order to help you with resolving that issue. The support team limits its actions on your account only to those necessary for providing the requested support.

## 4. License and Copyright[​](#4-license-and-copyright "Direct link to 4. License and Copyright")

You acknowledge that our Website, the Platform (and software and applications associated therewith) and Services provided by us are protected by Act No. 121/2000 Sb., on Copyright and on Related Rights and on Amendments to Certain Acts (the Copyright Act), as amended, and by other related legal regulations (hereinafter referred to as the “**Copyright Act**” or the “**Copyright**”). We alone shall own all right, title and interest, including all intellectual property rights to the Website, Platform and other Services (or any outcome of such Services) within the maximum scope admissible by law (except for the rights that we exercise in compliance with a license granted to us by third parties). Furthermore, we shall own any suggestions, ideas, enhancement requests, feedback, recommendations or other information related to the Website, Platform, and other Services.

You acknowledge that any use of the software, systems, and functionalities of third parties available on the Website or the Platform shall be governed by special license terms of owners of the relevant copyright or open-source licenses that you must adhere to (by using the relevant software, system or functionality, you confirm that you have familiarised yourself with and agree to the license terms of the relevant third parties). Should you breach any of those terms, we shall not be liable for any such action.

In order to use the Website, Platform or Services provided by us, we provide you with a non-exclusive license without territorial restriction (i.e. worldwide license). Within the scope of the license, you may use the Website or the Platform in their unchanged form (excluding the amendments, modifications, and updates of the Website or the Platform performed by us) for the purpose for which the Website or the Platform has been designed (as defined in Article 2 hereof) and in the manner and within the scope of these Terms. We shall grant you the license exclusively for the use of the Website and the Platform on our servers, or the servers maintained by us or designed for such purpose (e.g. Amazon Web Services or GitHub). Within the license you may not modify, adjust or connect the Website or the Platform (or any parts of it whatsoever) or its name with any other copyrighted work or use it in a collective work. Your right to use the Website and the Platform in compliance herewith and within the scope of their common functionality (i.e. to create your own work within their scope, e.g. the Configuration) shall not be affected.
Furthermore, you may not create any reproductions of the Website or the Platform (or any parts of it whatsoever), of our software, and of applications (in any form), or to disseminate them except such parts of them that are open-source (i.e. so-called free license). You may not provide the license as a whole or any authorisations forming the part thereof wholly or partially to a third person (to provide a sub-license) and neither may you assign any rights and obligations arising from the license. In addition to the license above, if we are providing you with a Custom Solution then, unless agreed otherwise, we provide you with a non-exclusive license without territorial restriction (i.e. worldwide license) to use, modify and adjust the source code of the Actor(s) provided to you. The intellectual property rights to the source code are not transferred and belong to us or our contractors. You acknowledge that we may use (part of) the source code in other Services, Custom Solutions and/or public Actors in the Apify Store. The provided license or the Terms shall not transfer any of our intellectual property rights to you (including the rights to trade-marks or brands or names). We neither provide you with any other license than the license specified explicitly hereby. Should you breach these Terms in any way or had you breached them in the past, we may reject or disable your license provision, use of the Website, Platform or other Services. ## 5. Terms of Use of the Website, Platform, and Services[​](#5-terms-of-use-of-the-website-platform-and-services "Direct link to 5. Terms of Use of the Website, Platform, and Services") You may use the Platform and other Services solely for the purposes of data extraction from publicly accessible websites (including websites that may require the user's login) or from other sources you are authorised to access. The Platform or Services functionalities may be used solely on such publicly accessible websites or other sources where their use (including data extraction) is permitted explicitly or where you are authorised to do so. Should you instruct us to use the Platform or the Services on certain publicly accessible websites or other sources and to send you the extracted data subsequently, you hereby declare and guarantee that use of the functionalities of the Platform and/or Services as well as the data extraction on such websites is expressly permitted or authorised to you. Should the extracted data be comprised of any sensitive data, confidential data or data protected by the Copyright or by other intellectual property right or any third-party right, you must not breach such rights by using the Platform or the Service in any way. Should you breach this prohibition, or should you use the Platform or Service functionalities directly or by means of our company on websites or other sources that do not permit their use, you shall be fully liable for such a breach and solely responsible for compensation of any damages incurred by and/or any claims of the affected third parties; we shall not be liable for any breach of third-party rights with respect to the usage of the Website, Platform or any Services. You may not take any actions which could lead to unauthorised use of the Platform or the Services. The actions you are not authorised to take include, but are not limited to, the circumvention, elimination or limitation of any mechanisms possibly serving to the protection of our rights or of any information with respect to the Copyright to the Platform (e.g. 
our logo or any other designation). You may not allow any third parties to use and/or access the Platform or Services by using your user account. Use of the Platform and Services shall only be possible on the grounds of the license provided by us hereunder. Should you take such unauthorised actions, you agree to compensate us for any damages incurred by us in this respect. Furthermore, during the use of the Website, Platform (or any of its functionalities) and the Services you may not:

* Use them in a manner likely to unreasonably limit usage by our other customers, including but not limited to burdening the server on which the Platform is located by automated requests outside the interface designed for such a purpose;
* Gather, save, enable the transmission to third parties or enable access to the content that is (themselves or their accessibility) contradictory to the generally binding legal regulations effective in the Czech Republic and in any country in which you are a resident where the Website, Platform or Services are used or where detrimental consequences could arise by taking such actions, including but not limited to the content that:
  * interferes with the Copyright, with rights related to Copyright or with other intellectual property rights and/or confidential or any sensitive information;
  * breaches the applicable legal rules relevant to the protection from hatred for a nation, ethnic group, race, religion, class or another group of people or relevant to the limitation of rights and freedoms of its members or invasion of privacy, promotion of violence and animosity, gambling or the sales or usage of drugs;
  * interferes with the rights to the protection of competition law;
* Gather, save, enable the transmission to third parties or enable access to the content that is pornographic, humiliating or that refers to pornographic or humiliating materials;
* Gather, save, enable the transmission to third parties or enable access to the contents that bear a conspicuous resemblance to the contents, services or third-party applications for the purposes of confusing or deceiving Internet users (so-called phishing);
* Gather, save, enable the transmission to third parties or enable access to the contents that harm our good reputation or authorised interests (including hypertext links to the contents that harm our good reputation or authorised interests);
* Disseminate computer viruses or other harmful software;
* Use mechanisms, instruments or computer equipment or processes that have or potentially have a negative effect on the operation of devices used by us, on the security of the internet or internet users;
* Generate fraudulent impressions of or fraudulent clicks on your ad(s) or third-party ad(s) through any automated, deceptive, fraudulent or other invalid means, including but not limited to through repeated manual clicks, the use of robots, agents or other automated query tools and/or computer generated search requests, and/or the unauthorized use of other search engine optimization services and/or software.

You acknowledge that the Website, Platform or the Services may not be available constantly. We may perform their planned or unplanned downtime in order to perform the inspection, maintenance, update or replacement of hardware or software. Their availability may also be limited due to other reasons, including but not limited to, power cuts, data network loss, other failures caused by third parties or caused to the devices of third parties or due to Force Majeure.
We shall not be liable for the limitation of availability of the Website, Platform or the Services. Provided that we mediate to you provision of services by any third parties, you are obliged to adhere to the terms of use of the respective providers. ## 6. Liability[​](#6-liability "Direct link to 6. Liability") We are not obliged to verify the manner in which you or other users use the Website, Platform, Configuration or Services and we shall not be liable for the manner of such usage. We assume that you use the Website Platform and Services legally and ethically and that you have obtained permission, if necessary, to use it on the targeted websites and/or other data sources. We shall not be liable for the outcomes of activities for which you use our Website, Platform, Configuration or Services. Provided that a third-party service or product is established on the Platform or on any of its functionalities, we shall not be liable for such a service or product, their functioning or manner and consequences of their usage. In compliance with the provision of Section 5 of Act No. 480/2004 Sb., on certain Information Society Services and on Amendments to some Acts (Act on Certain Information Society Services), as amended, we shall not be liable for the contents of the information that you save on our Website, Platform or by means of them, or by means of the Configuration. We shall not be liable for any of your unlawful actions in connection to the usage of the Website, Platform, Configuration or Services with respect to third parties (e.g. breach of intellectual property rights, rights to the name or company name, unfair competition, breach of terms of websites or applications and programs of third parties). We shall not guarantee or be liable for the availability of the Website, Platform or Services (or products arising therefrom) or for their performance, reliability or responsiveness or any other performance or time parameters. We shall neither be liable for the functionality or availability of the services of other providers that we mediate to you solely. We shall neither be liable for your breach of service usage terms of such providers. ## 7. Warranty[​](#7-warranty "Direct link to 7. Warranty") WE MAKE NO REPRESENTATION, WARRANTY, OR GUARANTY AS TO THE RELIABILITY, TIMELINESS, QUALITY, SUITABILITY, AVAILABILITY, ACCURACY OR COMPLETENESS OF THE PLATFORM AND ITS FUNCTIONALITIES, SERVICES OR ANY CONTENT. WE DO NOT REPRESENT OR WARRANT THAT (A) THE USE OF THE WEBSITE OR PLATFORM OR SERVICES WILL BE COMPLETELY SECURE, TIMELY, UNINTERRUPTED OR ERROR-FREE OR OPERATE IN COMBINATION WITH ANY OTHER HARDWARE, SOFTWARE, SYSTEM OR DATA, (B) THE WEBSITE, PLATFORM (AND ITS FUNCTIONALITIES), CONFIGURATIONS AND SERVICES WILL MEET YOUR REQUIREMENTS OR EXPECTATIONS, (C) ANY STORED DATA WILL BE ACCURATE OR RELIABLE, (D) THE QUALITY OF ANY PRODUCTS, SERVICES, INFORMATION, OR OTHER MATERIAL PURCHASED OR OBTAINED BY YOU THROUGH THE WEBSITE OR PLATFORM, CONFIGURATIONS OR SERVICES WILL MEET YOUR REQUIREMENTS OR EXPECTATIONS, (E) ERRORS OR DEFECTS WILL BE CORRECTED, OR (F) THE SERVICES OR THE SERVER(S) THAT MAKE THE WEBSITE, PLATFORM AND SERVICES AVAILABLE ARE FREE OF VIRUSES OR OTHER HARMFUL COMPONENTS. THE WEBSITE, PLATFORM (AND ITS FUNCTIONALITIES), CONFIGURATIONS AND SERVICES AND ALL CONTENT IS PROVIDED TO YOU STRICTLY ON AN “AS IS” BASIS. 
ALL CONDITIONS, REPRESENTATIONS AND WARRANTIES, WHETHER EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT OF THIRD PARTY RIGHTS, ARE HEREBY DISCLAIMED TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW BY US. We shall not be liable for any defects of the Website, Platform (or its functionalities), Configuration or Services arising due to unauthorised interference with the Website, Platform, Configuration or the use of Platform or Configuration contradictory hereto. We shall neither be liable for errors with respect to the function or non-function of the Configurations arising due to the changes of third-party websites (i.e. website from which the relevant Configuration is to extract data) upon the Configuration creation. ## 8. Limitation of Liability; Force Majeure[​](#8-limitation-of-liability-force-majeure "Direct link to 8. Limitation of Liability; Force Majeure") EXCEPT FOR ANY INDEMNIFICATION AND CONFIDENTIALITY OBLIGATIONS HEREUNDER, (i) IN NO EVENT SHALL EITHER PARTY BE LIABLE UNDER THESE TERMS FOR ANY CONSEQUENTIAL, SPECIAL, INDIRECT, EXEMPLARY, OR PUNITIVE DAMAGES WHETHER IN CONTRACT, TORT OR ANY OTHER LEGAL THEORY, EVEN IF SUCH PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES AND NOTWITHSTANDING ANY FAILURE OF ESSENTIAL PURPOSE OF ANY LIMITED REMEDY AND (ii) OUR AGGREGATE LIABILITY TO YOU UNDER THIS AGREEMENT FOR ANY CLAIM IS LIMITED TO THE AMOUNT PAID TO US BY YOU FOR THE SERVICES GIVING RISE TO THE CLAIM. Each party acknowledges that the other party has agreed to these terms relying on the limitations of liability stated herein and that those limitations are an essential basis of the bargain between the parties. Without limiting the foregoing and except for payment obligations, neither party shall have any liability for any failure or delay resulting from any condition beyond the reasonable control of such party, including but not limited to governmental action or acts of terrorism, earthquake or other acts of God, labour conditions, and power failures. ## 9. Your Obligation to Indemnify[​](#9-your-obligation-to-indemnify "Direct link to 9. Your Obligation to Indemnify") You agree to indemnify, defend and hold us, our agents, affiliates, subsidiaries, directors, officers, employees, and applicable third parties (e.g., all relevant partner(s), licensors, licensees, consultants and contractors) (“Indemnified Person(s)”) harmless from and against any third-party claim, liability, loss, and expense (including damage awards, settlement amounts, and reasonable legal fees), brought against any Indemnified Person(s), arising out of your use of the Website, Platform, Configurations or Services and/or your breach of any of these terms. You acknowledge and agree that each Indemnified Person has the right to assert and enforce its rights under this section directly on its own behalf as a third-party beneficiary. ## 10. Legal Disclaimer and Warning[​](#10-legal-disclaimer-and-warning "Direct link to 10. Legal Disclaimer and Warning") We may immediately suspend your use of the Website, Platform, Configurations and/or Services if we are contacted by your targeted data extraction source, website, or repository and asked to cease all extraction activity. If such an event occurs, we will not disclose your information without a court order mandating us to do so unless we in our best judgment determine that there would be an adverse consequence if we do not. 
If, however, we receive a court order demanding the release of your information to a third party, we will comply. If such action becomes necessary, you agree to indemnify and hold us and (as applicable) our parent(s), subsidiaries, affiliates, officers, directors, agents, contractors and employees, harmless from any claim or demand, including reasonable attorneys' fees, made by any third party arising from any complaint, suit, disagreement or other repercussions resulting from your use of the Website, Platform, Configurations or Services.

Should any third party claim its rights against us in connection to your actions, we may immediately eliminate any contents gathered, saved or disseminated by you from servers used by us. In the event of a judicial dispute with a third party related to your actions, you are obliged to provide us with all necessary cooperation in order to resolve such a dispute successfully and you are also obliged to reimburse continuously any purposeful expenses arising to us due to such a dispute. With respect to this, should an obligation arise to reimburse any claim of a third party, you agree to pay us the full scope of the damages.

## 11. Payment Terms[​](#11-payment-terms "Direct link to 11. Payment Terms")

Within the scope of your user account on the Website or by means of the Website, you may order a paid license to use the Platform (an extended paid version as opposed to the basic version that is free of charge) or other Services in compliance with the up-to-date information and prices provided for the given license or Service on the Website or as agreed with us individually. Your orders made by means of the Website shall be binding. By entering the relevant details of your debit or credit card during the order-making process, you agree that the price of the ordered license or Service shall be deducted from the card.

Unless agreed otherwise, you shall pay the remuneration for the license to use the Platform on a monthly or yearly basis or another basis selected when making the order (“Billing Period”). The Billing Period shall commence on the date of your first payment. The remuneration is payable on the first day of the Billing Period for which it is valid. You agree that the remuneration for the relevant Billing Period shall be deducted by us from the debit or credit card the details of which were provided by you when ordering the relevant license.

Should you order an upgrade of the license you are currently using, you shall pay the pro rata difference between the previously paid price of the currently used license and the price of the new license for the rest of the current Billing Period. The upgrade shall be performed immediately upon payment of the difference. Should you select a downgrade of the currently used license, you may not request the return of the paid remuneration for the current Billing Period (or its part); until the end of the current Billing Period you may use the license for the version already paid for. Upon the effectiveness of the upgrade or downgrade, the remuneration deducted for the following Billing Period shall correspond to the newly selected license version similarly to the definition above.

We are a value-added tax (VAT) payer in compliance with the tax laws of the Czech Republic. VAT shall therefore always be added to our price in the amount stipulated by the respective tax laws.
The amount you are to pay shall always be deemed as paid when credited to our bank account or our other electronic accounts held with a provider of payment services (e.g. PayPal or Braintree). We shall not refund any remuneration or other amounts paid by you to us. You hereby agree to pay all applicable fees and/or charges under these terms, including any applicable taxes or charges imposed by any government entity, and you agree that we may change our pricing at any time. There are no refunds for payments made once Services have been rendered or license provided. If you dispute any charge made for your use of the Platform, licenses or Services, you must notify us in writing within fifteen (15) days of any such charge; failure to so notify us shall result in the waiver by you of any claim relating to any such disputed charge. Charges shall be calculated solely based on invoicing records maintained by us for purposes of billing. No other measurements or statistics of any kind shall be accepted by us or have any effect under these terms.

## 12. Personal Data Protection, Commercial Offers, Confidentiality[​](#12-personal-data-protection-commercial-offers-confidentiality "Direct link to 12. Personal Data Protection, Commercial Offers, Confidentiality")

Personal Data Protection is regulated by the Privacy Policy available on our Website at [apify.com/privacy-policy](https://apify.com/privacy-policy). You are obliged to adhere to the Privacy Policy.

During the term of your use of the Website or Platform, licenses granted under these terms and Services and for a period of two (2) years following the termination or expiration of your subscription, each party agrees not to disclose Confidential Information of the other party to any third party without prior written consent except as provided herein (hereinafter referred to as the “**Confidential Information**”). Confidential Information includes (i) subscription account data, including agent definitions, Customer Data, and User Content, (ii) except as provided in subsection (i) above, any other Website, Platform or Services information or access to technology prior to public disclosure provided by us to you and identified at the time of disclosure in writing as “Confidential.” Confidential Information does not include information that has become publicly known through no breach by a party, or has been (i) independently developed without access to the other party's Confidential Information; (ii) rightfully received from a third party; or (iii) required to be disclosed by law or by a governmental authority.

## 13. Term, Amendment, and Termination[​](#13-term-amendment-and-termination "Direct link to 13. Term, Amendment, and Termination")

Unless mutually agreed otherwise in writing, the license agreement and the agreement on the provision of other Services concluded by and between us shall be for an indefinite period of time. Either we or you may terminate any such agreement by cancelling your user account on the Platform (user account may be cancelled in the account settings). In the event of the agreement termination or user account cancellation, you shall not be entitled to the refund of any remuneration you already paid to us (see Article 11, Payment Terms).
In the following cases we may further withdraw from an agreement concluded with you (by cancelling your user account) with immediate effect: * Should you be in delay with payment of any amounts due to us for more than three days; * Should you breach these Terms and/or Privacy Policy in any way whatsoever (including, but not limited to, the breach of our Copyright, terms of the provided license, terms of use of the Website, Platform or Services, or should you create more than one user account without our prior written consent) When terminating the license agreement or cancelling your user account, we may eliminate any data entered in the user account and gathered there. If necessary (including but not limited to the market development, development of the Platform and related technologies and with respect to new risks associated with the Platform and the safety of its use), we may unilaterally amend the version hereof. Similarly, with respect to the market development we may unilaterally amend the remuneration amount of the license for use of the Platform or the remuneration for the Services. We shall notify you of such an amendment at least 30 days in advance before its effectiveness by displaying the notification in your user account or otherwise on the Website or the Platform. Should you disagree with such an amendment, you may withdraw from the relevant agreement effective as at the date of the effectiveness of the announced amendments by cancelling your user account (cancellation may be performed in your account settings). In the event of the agreement termination and user account cancellation you shall not be entitled to a refund of any remuneration paid to us already (see Art. 11, Payment Terms). Provided you shall not cancel your user account before the effective date of the amendment, you shall be deemed to agree with such an amendment. ## 14. Final Provisions[​](#14-final-provisions "Direct link to 14. Final Provisions") These Terms and any other contracts and legal relationships concluded by and between us (unless otherwise explicitly provided) shall be governed by the laws of the Czech Republic. Any disputes arising here from between us shall be resolved by the courts of general jurisdiction in the Czech Republic. You agree that we may use your name, company name and logo as a reference in all types of promotion materials for marketing purposes free of charge. Apify is a trademark of Apify Technologies s.r.o., registered in the United States and other countries. Provided that we enter into a separate written license agreement or another contract with you provisions of which deviate herefrom, such different provisions shall take precedence over the respective provisions of these Terms. Should any of the provisions hereof be ascertained as invalid, ineffective or unenforceable, upon mutual agreement such a provision shall be replaced by a provision whose sense and purpose comes as closely as possible to the original provision. The invalidity, ineffectiveness or unenforceability of one provision shall not affect the validity and effectiveness of the remaining provisions hereof. Unless otherwise provided hereby, any changes and amendments hereto may only be made in writing. --- # Apify Store Publishing Terms and Conditions December 2022 ## Version History[​](#version-history "Direct link to Version History") You are reading terms and conditions that are no longer effective. 
If you're a new user, the [latest terms](https://docs.apify.com/legal/store-publishing-terms-and-conditions.md) apply. If you're an existing user, see the table below to identify which terms and conditions were applicable to you at a given date.

| Version | Effective from | Effective until |
| --- | --- | --- |
| [Latest](https://docs.apify.com/legal/store-publishing-terms-and-conditions.md) | May 13, 2024 | |
| December 2022 (This document) | December 1, 2022 | June 12, 2024 |

# Apify Store Publishing Terms and Conditions December 2022

Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg. no. 04788290, recorded in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 (hereinafter referred to as “**we**” or the “**Provider**”) thanks you (hereinafter referred to as “**you**” or the “**User**”) for using our Platform console.apify.com (the “**Platform**”). These terms and conditions of use (the “**Apify Store Terms**”) shall govern your rights and obligations in relation to publishing and maintaining a public Actor in Apify Store at console.apify.com/store (“**Apify Store**”) in addition to our [General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) of use (the “**Terms**”). Terms starting with a capital letter used in these Apify Store Terms have the meaning defined in the Terms. Provisions of the Terms regarding liability, indemnity, amendments, governing law and choice of jurisdiction are incorporated herein by reference.

## 1. Publishing your Actor[​](#1-publishing-your-actor "Direct link to 1. Publishing your Actor")

Actors (i.e. the serverless cloud programs running on the Platform as defined in the Terms) can be either private or public. Public Actors are shown in Apify Store and can be run by anyone. Private Actors can only be accessed and started by their owner. If you decide to make your Actor public, the following rules apply.

## 2. Acceptance of these Terms[​](#2-acceptance-of-these-terms "Direct link to 2. Acceptance of these Terms")

By clicking the button “I agree”, you claim that you are over 18 years old and agree to adhere to these Apify Store Terms, in addition to the [Terms and the terms of personal data protection](https://docs.apify.com/legal/privacy-policy.md). If you act on behalf of a company when accepting these Apify Store Terms, you also hereby declare to be authorized to perform such legal actions on behalf of the company (herein the term “you” shall mean the relevant company).

## 3. Actor name, description and price[​](#3-actor-name-description-and-price "Direct link to 3. Actor name, description and price")

**3.1.** Each Actor has its own unique name. When you publish an Actor, you agree to assign to it a relevant, non-deceiving name.

**3.2.** When publishing your Actor, you agree to create, publish and maintain an up-to-date, pertinent description, documentation or other sources of information, allowing the users to use the Actor.

**3.3.** The Actor, its name, price, description and other information connected to it published in Apify Store can be changed at any time in the future. However, changes to the Actor's name and price are limited to a maximum of one change per calendar month.

## 4. Competition[​](#4-competition "Direct link to 4. Competition")

We encourage healthy competition.
Creating an Actor that offers a similar functionality or outcome as another Actor is permitted. However, you must avoid simply copying another's Actor and claiming it as yours. ## 5. Prohibited content[​](#5-prohibited-content "Direct link to 5. Prohibited content") **5.1.** You may create and publish an Actor of any kind, however to maintain a safe, legal and respectful environment on the Platform, we are asking you to avoid content and activities that are prohibited by the Terms (the “**Prohibited Content**”). **5.2.** While we are not obliged to monitor the content of all the Actors, if we find out that your Actor contains any of the Prohibited Content, we are entitled to unpublish and/or delete such Actor, in our sole discretion. ## 6. Provider's right to intervene[​](#6-providers-right-to-intervene "Direct link to 6. Provider's right to intervene") Without limitation to clause 5.2 above, we reserve the right to delete, unpublish, restrict or modify any unlawful, offensive, harmful or misleading content or public information in Apify Store or any Public Actor breaching this Apify Store Terms as we may see fit. ## 7. Privacy of the code[​](#7-privacy-of-the-code "Direct link to 7. Privacy of the code") By publishing your Actor on the Platform you are allowing us to view the code of that Actor. We may only access and inspect the code in limited circumstances where our interference is necessary for legal, compliance or security reasons, e.g. when investigating presence of any Prohibited Content, suspicion of credentials stealing or account hacking. ## 8. Maintenance of the Actor[​](#8-maintenance-of-the-actor "Direct link to 8. Maintenance of the Actor") By publishing your Actor you agree to use your best effort to maintain it in working condition and make updates to it from time to time as needed, in order to maintain a continuing functionality. ## 9. Testing[​](#9-testing "Direct link to 9. Testing") Please note that we are performing regular automated testing of the functionality and performance of all Actors published in Apify Store. Failing the testing may lead to the consequences described in clause 10 below. ## 10. Faulty Actor[​](#10-faulty-actor "Direct link to 10. Faulty Actor") If your Actor does not provide the declared functionality (a “**Faulty Actor**”) we are authorized to mark that Faulty Actor as “under maintenance” in the Apify Store. If you do not update or make changes to the Faulty Actor to recover its functionality and the Actor remains a Faulty Actor for the period of 30 days or more, we are authorized to mark that Faulty Actor as “deprecated” and/or remove that Faulty Actor from the Apify Store. You will not be reimbursed for removing the Faulty Actor if the removal is made in accordance with this Apify Store Terms. ## 11. Notified issues with an Actor[​](#11-notified-issues-with-an-actor "Direct link to 11. Notified issues with an Actor") **11.1.** The Platform users have the option to report an issue with a Public Actor. The issue is then notified by an email to the author of that Actor. Should you receive such a notification email about an issue with your Actor, you agree to address the issue by either fixing the issue, updating the Actor, its description or other information, or by responding to us with explanation as soon as possible, however no later than within 14 days. If you do not respond to the notified issue in accordance with this clause, your Actor may be treated as a Faulty Actor with the consequences as per clause 10 above. 
**11.2.** In addition to responding according to clause 11.1 above, you agree to respond to us, should we contact you regarding your Actor via email marked “urgent” in its subject, within three business days.

## 12. Pricing options[​](#12-pricing-options "Direct link to 12. Pricing options")

When you decide to set your Actor as paid, you may choose one of the two following options for setting its price:

**12.1. Monthly Rental**, which means that each user of your Actor will pay a flat monthly rental fee for their use of that Actor. You will set the price as X USD per month; or

**12.2. Price per Result** model, which means that each user of your Actor will pay a fee calculated according to the number of results for each run of that Actor. You will set the price as X USD per 1,000 results. In this model the users do not pay for the Platform usage.

## 13. Payments to you[​](#13-payments-to-you "Direct link to 13. Payments to you")

**13.1.** If you set your Actor as paid, you will be entitled to receive remuneration calculated as follows:

**13.1.1** 80% of the aggregate of the monthly rental fees paid by the users of the Actor; or

**13.1.2** 80% of the aggregate of Price per Result fees paid by the users of the Actor, further reduced by the cost of Platform usage.

You acknowledge that the remaining portion of the users' fees paid for the Actor shall belong to us.

*Example: You set the price for your paid Actor under the Price per Result model as USD 5 per 1,000 results. The Actor has one paying user who runs it once and gets 1,000 results. The Platform usage costs of the Actor run are USD 0.5. You will be entitled to remuneration of USD 3.5 (i.e. (80% of 5) - 0.5).*

**13.2.** You are responsible for filling in your correct payment details in your user account and keeping them up-to-date to enable us to make the payments to you.

**13.3.** Your entitlement to remuneration for the Actor will cease for the time that the Actor is a Faulty Actor. If the Faulty Actor is repaired or updated by you and starts operating in the desired manner again, your entitlement to remuneration in relation to the Actor will resume. As a result you will receive the portion of the monthly rental fees corresponding to the portion of the month for which the Actor was not a Faulty Actor.

**13.4.** We shall pay you the remuneration monthly. The minimum amount payable is USD 20 for PayPal and USD 100 for any other payout option. Remuneration in any given month lower than 20 or 100 USD (respectively) will be rolled over to the following month.

---

# Apify Privacy Policy

Last Updated: February 10, 2025

Welcome to the Apify Privacy Policy! Apify Technologies s.r.o. (“**Apify**,” “**we**,” “**our**” or “**us**”) operates the website apify.com (“**Website**”), provides its customers with the computer platform “Apify” (the “**Platform**”) and some other services and functions, as specified in the [Apify General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) (the “**Services**”). Apify is committed to transparency in the processing of information. This is where we describe how we handle your personal data. “**Personal data**” is any information that is directly linked or can be linked to you. Capitalized terms not otherwise defined in this Privacy Policy will have the meaning outlined in the [Apify General Terms and Conditions](https://docs.apify.com/legal/general-terms-and-conditions.md).
## When the Privacy Policy applies[​](#when-the-privacy-policy-applies "Direct link to When the Privacy Policy applies") Please note that this Privacy Policy applies where Apify is a “data controller” of your personal data. This includes when we collect information from and about visitors to our websites, job candidates, prospective users and customers, and users of the Platform, collectively referred to herein as “**you**.” ## When the Privacy Policy does not apply[​](#when-the-privacy-policy-does-not-apply "Direct link to When the Privacy Policy does not apply") You may collect and manage personal data when using Platform or other Services. In such a scenario, Apify is a “**data processor**", not a “**data controller**” (as defined by applicable privacy laws) of personal data that we process under your instructions and on your behalf. For clarity, this Privacy Policy does not apply to where Apify processes personal data as a data processor. Such processing activities are governed by a separately executed data processing agreement(s) between Apify and you. We are not responsible for your privacy or data security practices. You represent and warrant that you have all necessary rights, consents, or other legal basis for processing such personal data and instructing us to process them on your behalf. This Privacy Policy also does not apply to personal data about current and former Apify employees or contractors and agents acting in similar roles. **PLEASE READ THIS PRIVACY POLICY CAREFULLY TO UNDERSTAND HOW WE HANDLE YOUR PERSONAL DATA. IF YOU DO NOT AGREE TO THIS PRIVACY POLICY, PLEASE DO NOT USE OUR WEBSITE OR THE SERVICES.** ## Table of Contents[​](#table-of-contents "Direct link to Table of Contents") * [Personal Data We Collect](#personal-data-we-collect) * [How We Use Your Personal Data](#how-we-use-your-personal-data) * [How We Disclose Your Personal Data](#how-we-disclose-your-personal-data) * [How We Retain and Dispose Your Personal Data](#how-we-retain-and-dispose-your-personal-data) * [Your Rights and Your Choices](#your-rights-and-your-choices) * [Third-Party Links and Features](#third-party-links-and-features) * [International Transfer of Your Personal Data](#international-transfer-of-your-personal-data) * [How We Protect Your Personal Data](#how-we-protect-your-personal-data) * [Children and Privacy](#children-and-privacy) * [Aggregate Data](#aggregate-data) * [Territory-Specific Terms](#territory-specific-terms) * [Changes to our Privacy Policy](#changes-to-our-privacy-policy) * [Contact Us](#contact-us) ## Personal Data We Collect[​](#personal-data-we-collect "Direct link to Personal Data We Collect") ### Personal Data You Provide to Us[​](#personal-data-you-provide-to-us "Direct link to Personal Data You Provide to Us") We collect a variety of personal data that you provide directly to us. For example, we collect information from you when you: * create a user account to log into and use Platform and Services, including communicating with support or sales teams * register for a demo, webinar, conference, or other events * apply to a job offer. We need, including but not limited to, your name, email address, username, business information, billing information, information about your professional career and educational background, including current and old job positions, degrees, qualifications, and payment information. 
Additionally, you may provide us voluntarily with a short bio, homepage URL, GitHub username, Twitter username, and profile picture, which will be added to your public profile on the Platform. ### Personal Data We Collect through Automated Means[​](#personal-data-we-collect-through-automated-means "Direct link to Personal Data We Collect through Automated Means") Like most websites and technology services delivered over the internet, we automatically collect and store various information about your computer hardware and software when you visit our Websites and use our Platform and Services, including but not limited to the device name; relevant operating system type; hostname; IP address; language settings; date and time of access to the Platform; logs describing connections and containing statistics about data sent to and from other devices including page scraping activities. ### Cookies[​](#cookies "Direct link to Cookies") When you visit our Website or use our Platform or our Services, we may collect some personal data in the form of cookies (a cookie is a small data file placed on your computer's hard drive that contains information that allows us to track your activity on the Website and Platform). The cookie does not contain personal data; however, if you provide us with some personal data, the cookies may act as an identifier to tie your personal data to your IP address or computer. We are committed to following the highest standards of privacy protection in relation to cookies. Other than the cookies necessary for the Website to function, you can always choose not to allow the cookies while using our Website or Platform. Read more information about the cookies we use and how we use them and ensure your privacy at the same time in our [Cookie Policy](https://docs.apify.com/legal/cookie-policy.md). ## How We Use Your Personal Data[​](#how-we-use-your-personal-data "Direct link to How We Use Your Personal Data") At Apify, it is extremely important to us to process personal data securely, fairly, and transparently. We do so in accordance with applicable privacy laws, including the European Union's and the United Kingdom's General Data Protection Regulation (“**GDPR**”). We process your personal data for various purposes: * **Provide Platform and other Services**: Authenticate you and provide you with access to Platform and to administer our Services * **Provide paid Services**: We use secure third-party payment service providers to manage payment processing, which is collected through a secure payment process. 
* **Create your publicly visible Apify profile** but only populated with personal data and additional information that you choose to provide * **Provide customer support**: Respond to your requests for information and provide you with more effective and efficient customer support * **Send marketing communications**: Contact you by email, postal mail, or phone with news, updates, information, promotions, surveys, or contests relating to the Services or other services that may be of interest to you, in accordance with applicable legal requirements related to such communications * **Customize and optimize the content** you see on our Website * **Improve Platform and the Services**: Assessing trends and usage across the Website and Platform to help us determine what new features or integrations our Users may be interested in * **Conduct customer research**: Engage in analysis and research regarding the use of the Services, and improve our Services * **Secure our Services** and resolve technical issues being reported * **Meet legal requirements**: Comply with any procedures, laws, and regulations that apply to us where it is necessary for our legitimate interests or the legitimate interests of others * **Establish, exercise, or defend our legal rights** where it is needed for our legitimate interests or the legitimate interests of others * **Recruiting**: Evaluation and selection of applicants; including, for example, setting up and conducting interviews and tests, evaluating and assessing the results thereto, and as is otherwise needed in the recruitment processes, including the final recruitment. Additionally, we may process your personal data to include you in our talent pool and contact you should a suitable position be available if you have consented to this; such processing is legally permissible under Art. 6 (1)(a) of the GDPR. ## How We Disclose Your Personal Data[​](#how-we-disclose-your-personal-data "Direct link to How We Disclose Your Personal Data") We may disclose your personal data: * **Service Providers**: We provide access to or disclose your personal data to selected third parties who help us run our Website, provide Platform, or deliver our other Services, including billing and credit card verification, advertising and marketing, content and features, analytics, research, customer support, data storage, security, web hosting, fraud prevention, applicants tracking and legal services. * **Protection of Apify and Others**: By using the Services, you acknowledge and agree that we may access, retain, and disclose the personal data we collect and maintain about you if required to do so by applicable law or in a good faith belief that such access, retention or disclosure is reasonably necessary to: (a) enforce any contracts with you; (b) respond to claims that any content violates the rights of third parties; (c) protect the rights, property or personal safety of Apify, its agents and affiliates, its other users and/or the public; and/or (d) comply with legal process (e.g. a subpoena or court order). * **Joint Offerings**: From time to time, Apify may partner with other companies to offer products or services jointly. If you purchase or specifically express interest in a jointly offered product or service from us, Apify may share certain personal data collected in connection with your purchase or expression of interest with our joint promotion partner(s). 
Apify does not control its business partners' use of the personal data we share with them, and their use of the personal data will be in accordance with their own privacy policies. If you do not wish for your personal data to be shared in connection with any joint offerings, you may opt not to purchase or specifically express interest in a jointly offered product or service.
* **Public Forums**: Our websites may offer publicly accessible message boards, blogs, and community forums. Please keep in mind that if you directly disclose any personal data through our public message boards, blogs, or forums (including profile information associated with your user account), it may be read, collected, and used by any member of the public who accesses these Websites. Your posts and profile information may remain available even after terminating your user account. We urge you to consider the sensitivity of any information you may disclose in this way.
* **Compelled Disclosure**: We reserve the right to use or disclose your personal data if required by law or if we reasonably believe that use or disclosure is necessary to protect our rights, protect your safety or the safety of others, investigate fraud, or comply with a law, court order, or legal process.
* **Business transfers**: If Apify sells substantially all of its assets or one of Apify’s business units is acquired, your personal data will likely be one of the transferred assets.
* **Otherwise with Your Consent or at Your Direction**: In addition to the disclosures described in this Privacy Policy, we may disclose your personal information to third parties whenever you consent to or direct such disclosure.

## How We Retain and Dispose Your Personal Data[​](#how-we-retain-and-dispose-your-personal-data "Direct link to How We Retain and Dispose Your Personal Data")

We keep your personal data for no longer than necessary for the purposes for which it is processed. The length of time for which we retain information depends on the purposes for which we collect and use it and/or as required to comply with applicable laws.

## Your Rights and Your Choices[​](#your-rights-and-your-choices "Direct link to Your Rights and Your Choices")

### Correcting, Updating, and Accessing[​](#correcting-updating-and-accessing "Direct link to Correcting, Updating, and Accessing")

Upon your request and authentication of your identity, Apify will provide you with information about the personal data we have collected from you, whether we hold your personal data or process your personal data on behalf of a third party. Requests to access, change, or delete personal data made to Apify will be addressed within 30 days or earlier if required by applicable laws or regulations. If your name, e-mail or postal address, telephone number, or other personal data changes, you may update, correct, or omit the relevant information by contacting Apify at privacy\[at]apify\[dot]com or by updating your personal data on the Account settings page on the Website. In some situations, we may not be able to provide access to certain personal data. Where an access request is refused, we will notify you in writing, document the reasons for refusal and outline further steps which are available to you. When a challenge regarding the accuracy of personal data is not resolved to your satisfaction, we will annotate the personal data under our control with a note that the correction was requested but not made.
### Removal and Objection[​](#removal-and-objection "Direct link to Removal and Objection") If you prefer not to receive newsletters or other marketing emails from Apify, please let us know by clicking on the unsubscribe link within any newsletter or marketing email you receive. Please note that, regardless of your request, we may still use and disclose certain personal data as permitted by this Privacy Policy or as required by applicable law. For example, you may not opt out of certain transactional emails from us, such as those confirming your requests or providing you with updates regarding our legal terms. If you prefer not to receive marketing mail via the mail carrier, please let us know by contacting User service at support\[at]apify\[dot]com. Please note that such requests may take up to ten (10) days to become effective. For more information about your rights under EEA and U.K. GDPR, please refer to Clause “Territory-Specific Terms” below. ## Third-Party Links and Features[​](#third-party-links-and-features "Direct link to Third-Party Links and Features") The Website and Platform may contain links to third-party websites and features (such as the share and/or "like" button or interactive mini-programs). These features may collect your IP address and which page you are visiting on our sites, and may set a cookie to enable the feature to function properly. These features and widgets are hosted by a third party or hosted directly on our websites. This Privacy Policy does not apply to these features. Your interactions with these features are governed by the privacy policy and other policies of the companies providing them. Those websites may have their own privacy policies or no privacy policies at all. Apify is not responsible for those websites, and we provide the links solely for your convenience. ## International Transfer of Your Personal Data[​](#international-transfer-of-your-personal-data "Direct link to International Transfer of Your Personal Data") Your personal data is maintained and processed by us and our third-party service providers in the European Union and the United States and may also be maintained, processed, and stored in other jurisdictions that may have different data protection laws than those in your country of residence. If your information is transferred in these ways, please note that we comply with applicable legal requirements governing the transfer of information across borders. By using the Platform or Services, you agree to and acknowledge these transfers. ## How We Protect Your Personal Data[​](#how-we-protect-your-personal-data "Direct link to How We Protect Your Personal Data") We take appropriate security measures to protect against unauthorized access to or unauthorized alteration, disclosure, or destruction of personal data. These include internal reviews of our data collection, storage, and processing practices, security measures, and physical security measures to guard against unauthorized access to systems where we store personal data. We restrict access to collected information to Apify employees, service providers, and agents who need to know that information in order to operate, develop, or improve our services. These individuals are bound by confidentiality obligations. If you wish to learn more about our security practices, please see our Security Whitepaper. If you become aware of or suspect any unauthorized use of your Apify account, please contact us immediately using the information in the "Contact Us" section below.
## Children and Privacy[​](#children-and-privacy "Direct link to Children and Privacy") Our Website, Platform, and Services are not directed to children, and we will not request personal data from anyone who we know to be under the age of 18 unless we have obtained verifiable parental consent from a parent or legal guardian. If we become aware that a user under the age of 18 has registered with our Website, Platform, or Services, provided personal data, and Apify has not obtained prior verifiable consent from a parent or legal guardian, we will immediately remove the user’s personal data from our files. ## Aggregate Data[​](#aggregate-data "Direct link to Aggregate Data") Apify may also collect aggregate data. Aggregate data does not contain any personal data. It only contains usage statistics about your activities on the Website and Platform or in connection with the Services that cannot be used to identify, locate, or contact you (such as frequency of visits to the Website, data entered when using the Website, Website pages most frequently accessed, browser type, etc.). Generally, aggregate information is used collectively, and no single person can be identified by that compiled information. Apify uses aggregate information to provide its Services, determine the use of our Website (Platform), and monitor, audit, and analyze information pertaining to our business metrics. We may use aggregate information to improve the Website and Services, to monitor traffic and general usage patterns, and for other general business purposes. We may disclose aggregate information to third parties for various business reasons. Aggregate information will not include any personal information, and we will not disclose any personal information except as expressly stated elsewhere in this Privacy Policy. ## Territory-Specific Terms[​](#territory-specific-terms "Direct link to Territory-Specific Terms") ### EEA and the UK[​](#eea-and-the-uk "Direct link to EEA and the UK") #### Legal Basis for Processing[​](#legal-basis-for-processing "Direct link to Legal Basis for Processing") The legal bases for using your personal data as set out in this Privacy Policy are as follows: * Where we need to perform the contract we are about to enter into or have entered into with you for the Services * Where it is necessary for our legitimate interests (or those of a third party) and your interests and fundamental rights do not override those interests * Where we need to comply with a legal or regulatory obligation * Where we have your consent to process your personal data in a certain way #### Your Data Protection Rights[​](#your-data-protection-rights "Direct link to Your Data Protection Rights") Under applicable data protection laws, you may exercise certain rights regarding your personal data: * Right to Access. You have the right to obtain confirmation from us whether we are processing your personal data, as well as the right to obtain a copy of your personal data undergoing processing. * Right to Data Portability. You may receive the personal data that you have provided to us in a structured, commonly used, and machine-readable format, and you may have the right to transmit it to other data controllers without hindrance. This right only exists if the processing is based on your consent or a contract and is carried out by automated means. * Right to Rectification. You have the right to request the rectification of inaccurate personal data and to have incomplete data completed. * Right to Objection.
In some instances, you have the right to object to the processing of your personal data. * Right to Restrict Processing. In certain cases, you may request that we restrict the processing of your personal data. * Right to Erasure. You may request that we erase your personal data in some instances. * Right to Lodge a Complaint. You have the right to lodge a complaint with a supervisory authority. * Right to Refuse or Withdraw Consent. If we ask for your consent to process your personal data, you are free to refuse to give it. If you have given your consent, you may withdraw it at any time without any adverse consequences. The lawfulness of any processing of your personal data that occurred prior to the withdrawal of your consent will not be affected. * Right to Not Be Subject to Automated Decision-making. The types of automated decision-making referred to in Article 22(1) and (4) EU/UK General Data Protection Regulation (“**GDPR**”) do not take place in connection with your personal data. Should this change, we will inform you about why and how any such decision was made, the significance of it, and the possible consequences of it. You will also have the right to human intervention, to express your point of view, and to contest the decision. You may exercise these rights by contacting us using the details provided in Section “Contact Us” below. Please note that we may refuse to act on requests to exercise data protection rights in certain cases, such as where providing access might infringe someone else’s privacy rights or impact our legal obligations. #### International Transfers of Personal Data[​](#international-transfers-of-personal-data "Direct link to International Transfers of Personal Data") Due to the global nature of our operations, some of the recipients mentioned in Section 2 of the Notice may be located in countries outside the EEA or the U.K., which do not provide an adequate level of data protection as defined by data protection laws in the EEA and the U.K. Transfers to third parties located in such third countries take place using a valid data transfer mechanism, such as the EU Standard Contractual Clauses and/or the U.K. Addendum to such clauses, on the basis of permissible statutory derogations, or any other valid data transfer mechanism issued or approved by the EEA or U.K. authorities. Certain third countries have been officially recognized by the EEA and U.K. authorities as providing an adequate level of protection, and no further safeguards are necessary. Please reach out to us using the contact information in Section “Contact Us” below if you wish to receive further information about how we transfer personal data or, where available, a copy of the relevant data transfer mechanism. ## Changes to our Privacy Policy[​](#changes-to-our-privacy-policy "Direct link to Changes to our Privacy Policy") We update this Privacy Policy from time to time and encourage you to review it periodically. We will post any changes on this page. This Privacy Policy was last updated on the date indicated at the top of this Privacy Policy. Your continued use of the Website, Platform, and its Services after any changes or revisions to this Privacy Policy have been published shall indicate your agreement with the terms of such revised Privacy Policy. ## Contact Us[​](#contact-us "Direct link to Contact Us") Any notices or requests to Apify under this Privacy Policy shall be made to privacy\[at]apify\[dot]com or: By mail: Apify Technologies s.r.o.
Vodičkova 704/36, Nové Město 110 00 Praha 1 Czech Republic Attn: Apify Legal Team --- # Apify Store Publishing Terms and Conditions Last updated: February 26, 2025 *** Apify Technologies s.r.o., with its registered seat at Vodičkova 704/36, 110 00 Prague 1, Czech Republic, Company reg. no. 04788290, recorded in the Commercial Register kept by the Municipal Court of Prague, File No.: C 253224 (“**we**” or the “**Provider**”) thanks you (“**you**” or the “**User**”) for using our Platform console.apify.com (the “**Platform**”). These terms and conditions (the “**Apify Store Publishing Terms**”) shall govern your rights and obligations in relation to publishing and maintaining a public Actor in Apify Store at console.apify.com/store (“**Apify Store**”) in addition to our [general terms and conditions](https://docs.apify.com/legal/general-terms-and-conditions.md) (the “**General Terms**”). Terms starting with a capital letter used in these Apify Store Publishing Terms have the meaning defined either here or in the General Terms. Provisions of the General Terms regarding liability, indemnity, governing law and choice of jurisdiction are incorporated herein by reference. ## 1. Publishing your Actor[​](#1-publishing-your-actor "Direct link to 1. Publishing your Actor") **1.1.** Actors (i.e., the serverless cloud programs running on the Platform as defined in the General Terms) can be either private or public. Public Actors are shown in Apify Store and can be run by anyone. Private Actors can only be accessed and started by their owner. If you decide to make your Actor public, the following rules apply. ## 2. Acceptance of these terms[​](#2-acceptance-of-these-terms "Direct link to 2. Acceptance of these terms") **2.1.** By publishing an Actor in Apify Store, you represent that you are over 18 years old and agree to adhere to these Apify Store Publishing Terms, in addition to the General Terms. If you act on behalf of a company when accepting these Apify Store Publishing Terms, you also hereby declare to be authorized to perform such legal actions on behalf of the company (herein the term “**you**” shall mean the relevant company). ## 3. Actor name, description and price[​](#3-actor-name-description-and-price "Direct link to 3. Actor name, description and price") **3.1.** Each Actor has its own unique name. When you publish an Actor, you agree to assign to it a relevant, non-deceiving name. **3.2.** When publishing your Actor, you agree to create, publish and maintain an up-to-date, pertinent description, documentation or other sources of information, allowing Apify users to use the Actor. **3.3.** The Actor, its name, price, description and other information connected to it published in Apify Store can be changed at any time in the future. However, changes to the Actor's price that might potentially lead to a price increase for the end user, including pricing model changes, are limited to a maximum of one change per month. ## 4. Competition[​](#4-competition "Direct link to 4. Competition") **4.1.** We encourage healthy competition. Creating an Actor that offers similar functionality or outcome as another Actor is permitted. However, you must avoid simply copying another's Actor and claiming it as yours. 
**4.2.** We reserve the right to immediately unpublish and/or delete any Actor that, in our sole discretion, infringes on any rights of other Apify users and/or third parties, including, but not limited to, copyright infringement caused by copying content of other Apify users, such as their Actors' readmes, descriptions or parts thereof. ## 5. Prohibited activities[​](#5-prohibited-activities "Direct link to 5. Prohibited activities") **5.1.** You may create and publish an Actor of any kind. However, to maintain a safe, legal and respectful environment on the Platform, we ask you to avoid content and activities that are prohibited by any terms agreed between both parties and the [Acceptable Use Policy](https://docs.apify.com/legal/acceptable-use-policy.md) (the “**Prohibited Activities**”). **5.2.** While we are not obliged to monitor the content of all Actors, if we find out that your Actor: 1. contains any content falling under the Prohibited Activities; 2. has been created (at least in part) by performing the Prohibited Activities; or 3. itself performs any of the Prohibited Activities, we are authorized to unpublish and/or delete such an Actor, in our sole discretion. **5.3.** On Apify Store, you are not allowed to directly or indirectly offer, link to, or otherwise promote any product or service outside of the Platform unless we explicitly agree to it in writing. If you violate this prohibition in your Actors (including their accessories, e.g., the “read me” section of the Actor page on the Platform) or in any other content you publish on Apify Store, we are entitled to unpublish, modify, and/or delete such Actor and its accessories or content, in our sole discretion. ## 6. Provider's right to intervene[​](#6-providers-right-to-intervene "Direct link to 6. Provider's right to intervene") **6.1.** Without limitation to clause 5.2 above, we reserve the right to delete, unpublish, restrict or modify any unlawful, offensive, harmful or misleading content or public information in Apify Store or any Actor as we may see fit to protect the legitimate interests of Apify, its users, or any third parties. ## 7. Privacy of the code[​](#7-privacy-of-the-code "Direct link to 7. Privacy of the code") **7.1.** By publishing your Actor on Apify Store you are allowing us to view the source code of that Actor. We may only access and inspect the source code in limited circumstances where our interference is necessary for legal, compliance or security reasons, for example, when investigating the presence of any Prohibited Activities. ## 8. Maintenance of the Actor[​](#8-maintenance-of-the-actor "Direct link to 8. Maintenance of the Actor") **8.1.** By publishing your Actor you agree to use your best effort to maintain it in working condition and make updates to it from time to time as needed, in order to maintain continuing functionality. ## 9. Testing[​](#9-testing "Direct link to 9. Testing") **9.1.** We perform regular automated testing of the functionality and performance of all Actors published in Apify Store. Failing the test may lead to the consequences described in clause 10 below. ## 10. Faulty Actor[​](#10-faulty-actor "Direct link to 10. Faulty Actor") **10.1.** If your Actor does not provide the declared functionality (a “**Faulty Actor**”), we are authorized to mark that Faulty Actor as “under maintenance” in Apify Store.
If you do not update or make changes to the Faulty Actor to recover its functionality and the Actor remains a Faulty Actor for a period of 30 days or more, we are authorized to mark that Faulty Actor as “deprecated” and/or remove that Faulty Actor from Apify Store. You will not be reimbursed for the removal of the Faulty Actor. ## 11. Notified issues with an Actor[​](#11-notified-issues-with-an-actor "Direct link to 11. Notified issues with an Actor") **11.1.** Platform users have the option to report an issue with an Actor. The issue is then notified by email to the author of that Actor. Should you receive such a notification email about an issue with your Actor, you agree to address the issue by either fixing the issue, updating the Actor, its description or other information, or by contacting us with an explanation as soon as possible, but no later than within 14 days. If you do not address the notified issue in accordance with this clause, we are authorized to declare your Actor a Faulty Actor. **11.2.** In addition to addressing the issues according to clause 11.1 above, you agree to respond to us, should we contact you regarding your Actor via email marked “urgent” in its subject, within three business days. ## 12. Pricing options[​](#12-pricing-options "Direct link to 12. Pricing options") **12.1.** When you decide to set your Actor as monetized, you may choose one of the following options for setting its price: 1. **Monthly Rental** which means that each user of your Actor will pay a flat monthly rental fee for use of that Actor. You will set the price as X USD per month; 2. **Price per Result** model which means that each user of your Actor will pay a fee calculated according to the number of results of each run of that Actor. You will set the price as X USD per 1,000 results. In this model the users do not pay for the Platform usage; or 3. **Price per Event** model which allows you to programmatically charge for events in your Actor source code. You need to pre-define the events first when setting the Actor pricing. In this model, the users do not pay for the Platform usage. **12.2.** If you set your Actor as monetized, you will be entitled to receive remuneration calculated as follows: 1. 80% of the sum of the Monthly Rental fees paid by the users of the Actor; or 2. 80% of the sum of Price per Result or Price per Event fees paid by the users of the Actor, further reduced by the cost of Platform usage of the corresponding Actor runs. You acknowledge that the remaining portion of the users' fees paid for the Actor shall belong to us. *Example: You set the price for your monetized Actor under the Price per Result model as USD 5 per 1,000 results. The Actor has one paying user who runs it once and gets 1,000 results. The Platform usage costs of the Actor run are USD 0.5. You will be entitled to remuneration of USD 3.5 (i.e. (80% of 5) - 0.5).* **12.3.** You acknowledge that the amount of fees paid by the users and the Platform usage costs can change throughout the month due to unpaid invoices or refunds, and that any information about future or past profits or remuneration available to you in the Platform UI is only an estimate. Apify shall not be liable for the outcomes of any actions made based on such estimates. ## 13. Payment terms[​](#13-payment-terms "Direct link to 13. Payment terms") **13.1.** You are responsible for filling in your correct payment details in your user account and keeping them up-to-date to enable us to make payments to you.
**13.2.** Your entitlement to remuneration for an Actor will cease for the time that the Actor is a Faulty Actor. If you fix or update the Faulty Actor, and it becomes functional again as advertised, your entitlement to remuneration in relation to the Actor will resume. **13.3.** Unless both parties have agreed otherwise, your remuneration will be paid on the basis of an invoice that we will issue on your behalf. The invoice will be issued without an undue delay after the end of each calendar month. You may approve or dispute the invoice within 7 days of issuance. An invoice that's neither accepted nor disputed within that period shall be deemed approved. **13.4.** The minimum amount payable is USD 20 for PayPal and USD 100 for any other payout option (the "**Minimum payout**"). Remuneration in any given month lower than the Minimum payout will be rolled over to the following month until the sum of approved invoices exceeds the Minimum payout. Attributes of an invoice such as due date do not override the Minimum payout rule. **13.5.** We may, in our sole discretion, block, remove, deprecate, or otherwise restrict your Actor from the Platform, if your Actor contains, requires, or refers the users to any payment method, other than the Apify payment gateway. This includes, without limitation, any method that (i) directly or indirectly circumvents the system of remuneration according to these Apify Store Publishing Terms; or (ii) poses a security risk to us, the Platform, the users, or any third party (e.g., by creating a false impression that the user pays any fees or other payments for the Actor to Apify). We reserve the right to withhold any and all outstanding payments due to you for such Actor until we determine whether the Actor complies with these Apify Store Publishing Terms. **13.6.** In case any suspicions arise regarding the legitimacy of any user’s payment for your Actor (e.g., suspicions of a fraudulent payment) or if the user is past due with its payment obligations, before we pay you the remuneration for such user’s payment, we shall have the right, but not the obligation, to withhold the remuneration for such user’s payment for a period necessary for us to investigate any suspicious activity related to it or until paid by the user. You agree to provide us and/or any authorized third party (e.g., PayPal) with all reasonably requested cooperation. **13.7.** If any fraudulent or otherwise non-compliant activity is identified regarding a user’s account or payments, we may ban the user from using the Platform. If we ban such a user, we shall not be obligated to pay you any remuneration resulting from such fraudulent user’s payments. In case such activities are identified after we already paid you the remuneration for such user's payment, you shall be obligated, at our written request, to refund the corresponding part of the remuneration. **13.8.** If a payment of remuneration is withheld in accordance with these Apify Store Publishing Terms, you shall not be entitled to any interest or additional payments. ## 14. Amendments[​](#14-amendments "Direct link to 14. Amendments") **14.1.** We may unilaterally amend the Apify Store Publishing Terms. We shall notify you of such an amendment in advance. Should you disagree with such an amendment, you may unpublish all your Actors from Apify Store within 30 days from the notification. Otherwise, you will be deemed to agree with the announced amendments. 
--- # Apify Whistleblowing Policy \[verze v českém jazyce níže] Last updated: April 14, 2025 At Apify, we are committed to upholding the highest standards of integrity, ethics, and accountability. As part of this commitment and to comply with the EU Directive and Czech Republic laws, we have implemented an internal whistleblowing system to ensure prompt and transparent reporting of any concerns related to unethical behavior, violations of company policies, or any other wrongdoing. ## Applicability[​](#applicability "Direct link to Applicability") In compliance with the applicable law, Apify excludes the possibility for reports to be submitted by persons other than those performing or having performed: * dependent work for Apify in the employment relationship (i.e., both under an employment contract and agreements made outside of employment, so-called “DPP/DPČ”); * voluntary activities; or * professional practice or internship. The exclusion does not apply to job applicants. **Apify does not accept anonymous reports.** ## Who can help you?[​](#who-can-help-you "Direct link to Who can help you?") Your report will be received by our incident resolver. The incident resolver is bound by confidentiality and will protect your identity and the information you have provided. However, please note that if we find that the report is knowingly false, the protection does not apply, and you may be subject to a fine under the Whistleblower Protection Act. Apify may also take additional measures (protection under the Whistleblower Protection Act does not apply to you in such a case). ## How to submit a report?[​](#how-to-submit-a-report "Direct link to How to submit a report?") You can submit a report to the incident resolver in writing via: * email at whistleblowing\[at]apify\[dot]com; * mail at Apify’s registered office address: Vodičkova 704/36, Nové Město, 110 00 Praha 1 (label the letter as “TO: SR. COUNSEL - CONFIDENTIAL”); Or verbally: * by calling 770627132; or * in-person at a location agreed upon with the incident resolver (you can arrange this through the contact details provided for the incident resolvers above), where the report submission will be allowed within a reasonable period after your request, but no later than 14 days. The incident resolver will write a transcript of the verbal report. If you make a report in a way other than through the channels mentioned above (for example, by emailing another colleague within Apify), such a report will not be considered a report within the meaning of the Whistleblower Protection Act. Apify may handle it differently, and you will not be entitled to protection under the Whistleblower Protection Act. ### Reporting at the Ministry of Justice[​](#reporting-at-the-ministry-of-justice "Direct link to Reporting at the Ministry of Justice") Apart from the internal reporting system, the Whistleblower Protection Act provides additional reporting means. These external reporting avenues encompass: Utilizing the Czech Ministry of Justice reporting system, available at the following address: . Making a report through public disclosure, such as in the media (limited to cases explicitly outlined in § 7 para. 1 letter (c) of the Whistleblower Protection Act). 
## What happens after a report is submitted?[​](#what-happens-after-a-report-is-submitted "Direct link to What happens after a report is submitted?") Within seven calendar days of receiving your report, the Apify incident resolver will provide you with written confirmation of its receipt unless you have explicitly indicated a preference not to be informed. This confirmation also does not apply where disclosure could compromise your identity. The incident resolver will thoroughly investigate the circumstances of the reported conduct within Apify based on the information provided. Following this, provided the report is deemed reasonable, they will propose measures to prevent the reported conduct from continuing and rectify the situation. The incident resolver will monitor the acceptance and implementation of these measures. We assure you that Apify will not take any negative (retaliatory) actions against you as a result of the report, such as termination of employment, salary reduction, or other disadvantages. The incident resolver will inform you of the investigation findings within 30 days of receiving the report, indicating whether it was deemed reasonable. In more complex cases, the deadline may be extended by up to 30 days, but not more than twice. After specific measures are implemented, the incident resolver will promptly provide you with an update. ## Processing personal data in the whistleblowing agenda[​](#processing-personal-data-in-the-whistleblowing-agenda "Direct link to Processing personal data in the whistleblowing agenda") In connection with whistleblowing, Apify primarily receives personal data from the whistleblower, or based on Apify's own activities in assessing the reports. The purpose of processing personal data is to receive, assess, and handle reports, maintain a record of reports, prevent illegal activities (especially corruption, fraud, or unethical behavior), and enforce compliance with legal and internal regulations and obligations. The following personal data categories are typically processed in connection with whistleblowing: name, surname, date of birth, and contact address of the whistleblower; personal data present in the content of the report, including, for example, the identification of the person against whom the report is directed. The recipient of the personal data related to whistleblowing is the designated incident resolver, who receives individual reports, records them, assesses their validity, investigates, etc. Some personal data (excluding the identification of the whistleblower) may also be accessible to Apify's authorized employees (appropriately informed), based on the discretion of the incident resolver. External incident resolvers act as Apify's processors of personal data processed in connection with the whistleblowing agenda. Personal data may be further disclosed to a court, public prosecutor, police authority, National Security Office, Tax Office and Tax Directorate, Office for Personal Data Protection, and other entities based on legal obligations. The incident resolver is obliged to retain reports submitted through the internal reporting system and keep a record of data on received reports for a period of 5 years from the date of their receipt.
**NOTE: Due to the specificity of processing in this area, Apify notes that the exercise of certain rights related to the processing of personal data (especially the right to access) and information obligations may be restricted due to the legal obligation to protect the identity of the whistleblower and other individuals mentioned in the report to avoid possible disruption of the investigation of reported information.** *** # Ochrana oznamovatelů v Apify V Apify jsme se zavázali dodržovat nejvyšší standardy integrity, etiky a odpovědnosti. Jako součást tohoto závazku a pro účely plnění směrnice EU a relevantních zákonů České republiky jsme zavedli systém pro podávání a posuzování oznámení o možném protiprávním jednání s cílem zajistit promptní a transparentní řešení takových podnětů. ## Kdo je oprávněn podávat oznámení?[​](#kdo-je-oprávněn-podávat-oznámení "Direct link to Kdo je oprávněn podávat oznámení?") V souladu se zákonem Apify vylučuje možnost přijímání oznámení od jiných osob než těch, které pro Apify vykonávají nebo vykonávaly: * závislou práci v rámci základního pracovněprávního vztahu (tzn., jak v rámci pracovního poměru, tak dohod uzavřených mimo pracovní poměr, tzv. DPP/DPČ); * dobrovolnickou činnost; nebo * odborné praxe či stáže. Toto vyloučení se nevztahuje na uchazeče o zaměstnání. **Apify nepřijímá anonymní oznámení.** ## Kdo bude řešit Vaše oznámení?[​](#kdo-bude-řešit-vaše-oznámení "Direct link to Kdo bude řešit Vaše oznámení?") Vaši zprávu obdrží příslušná osoba, a to Apify Sr. Counsel. Příslušná osoba je vázána mlčenlivostí a bude chránit Vaši identitu a poskytnuté informace. Nicméně, je třeba zdůraznit, že v případě zjištění, že poskytnuté oznámení je vědomě nepravdivé, nebude na Vás vztahována ochrana a může dojít k uložení pokuty podle zákona o ochraně oznamovatelů. Kromě toho může Apify přijmout i jiná dodatečná opatření. Ochrana podle zákona o ochraně oznamovatelů se tedy v případě vědomě nepravdivého oznámení neuplatní. ## Jak podat oznámení?[​](#jak-podat-oznámení "Direct link to Jak podat oznámení?") Oznámení můžete podat výše uvedené příslušné osobě buďto písemně: * e-mailem na whistleblowing\[zavináč]apify\[tečka]com; * poštou zasláním na adresu sídla Apify: Vodičkova 704/36, Nové Město, 110 00 Praha 1 (dopis označte jako “K RUKÁM SR. COUNSEL - DŮVĚRNÉ”); nebo ústně, a to: * telefonicky na 770627132 v době mezi 10:00 - 12:00 v pondělí či ve středu (vyjma státních svátků); * osobně na předem dohodnutém místě s příslušnou osobou (schůzku si můžete domluvit s příslušnou osobou prostřednictvím uvedených kontaktních údajů výše). Schůzka s Vámi bude uskutečněna v rozumné lhůtě po oznámení Vašeho požadavku, nejpozději však do 14 dnů. O ústním oznámení sepíše příslušná osoba protokol. Pokud podáte oznámení jiným způsobem než prostřednictvím výše uvedených kanálů (např. e-mailem jinému kolegovi v Apify), taková zpráva nebude považována za oznámení ve smyslu zákona o ochraně oznamovatelů. Apify s ní může nakládat v jiném režimu, avšak nebudete mít nárok na ochranu podle zákona o ochraně oznamovatelů. ### Oznámení na Ministerstvo spravedlnosti[​](#oznámení-na-ministerstvo-spravedlnosti "Direct link to Oznámení na Ministerstvo spravedlnosti") Apify je dále povinna Vás informovat o skutečnosti, že kromě interního oznamovacího systému implementovaného Apify existuje i možnost podat oznámení prostřednictvím systému Ministerstva spravedlnosti dostupného na následující adrese . Veřejného zveřejnění, a to například v médiích. 
Tato možnost je však omezena na případy explicitně uvedené v § 7 odst. 1 písm. (c) zákona o ochraně oznamovatelů. ## Jak budou Vaše oznámení zpracována?[​](#jak-budou-vaše-oznámení-zpracována "Direct link to Jak budou Vaše oznámení zpracována?") Do sedmi kalendářních dnů od obdržení Vašeho oznámení Vám příslušná osoba potvrdí jeho přijetí, ledaže jste v oznámení explicitně uvedli, že o přijetí oznámení nechcete být vyrozuměn(a). Toto potvrzení Vám nebude zasláno také v případě, pokud by mohlo dojít k prozrazení Vaší identity jiné osobě. Příslušná osoba důkladně prozkoumá okolnosti uvedené v podaném oznámení. Shledá-li příslušná osoba oznámení jako důvodné, navrhne následně opatření k zabránění pokračování hlášeného chování a k nápravě situace. Přijetí a provedení těchto opatření bude monitorováno příslušnou osobou. Ujišťujeme vás, že Apify nevykoná žádné negativní (represivní) kroky vůči Vám v důsledku oznámení, jako je ukončení zaměstnání, snížení mzdy nebo jiné nevýhody. Příslušná osoba Vás informuje o výsledcích vyšetřování do 30 dnů od obdržení oznámení, a to včetně vyjádření, zda bylo posouzeno jako důvodné, či nikoli. V případě složitějších situací může lhůta být prodloužena o maximálně 30 dnů, avšak nikoli více než dvakrát za sebou. O konkrétních opatření, která byla provedena v návaznosti na Vaše oznámení, Vás bude příslušná osoba bez zbytečného odkladu informovat. ## Zpracování osobních údajů[​](#zpracování-osobních-údajů "Direct link to Zpracování osobních údajů") V souvislosti s agendou ochrany oznamovatelů získává Apify osobní údaje především od oznamovatele, případně na základě vlastní činnosti Apify při posuzování oznámení. Účelem zpracování osobních údajů je přijímání, posuzování a vyřizování případů oznámení, vedení evidence oznámení, předcházení protiprávní činnosti (zejména korupci, podvodům nebo neetickému jednání) a vymáhání dodržování právních a interních předpisů a povinností. V souvislosti s agendou ochrany oznamovatelů jsou obvykle zpracovávány následující kategorie osobních údajů: jméno, příjmení, datum narození a kontaktní adresa oznamovatele; osobní údaje vyskytující se v obsahu oznámení, včetně např. identifikace osoby, proti které oznámení směřuje. Příjemcem osobních údajů týkajících se agendy ochrany oznamovatelů je určená příslušná osoba, která jednotlivá oznámení přijímá, eviduje, posuzuje jejich oprávněnost, prošetřuje apod. K některým osobním údajům (s výjimkou identifikace oznamovatele) mohou mít na základě rozhodnutí příslušné osoby přístup také pověření a náležitě poučení zaměstnanci Apify. Osobní údaje mohou být dále poskytnuty soudu, státnímu zastupitelství, policejnímu orgánu, Národnímu bezpečnostnímu úřadu, finančnímu úřadu a finančnímu ředitelství, Úřadu pro ochranu osobních údajů a dalším subjektům na základě zákonných povinností. Příslušná osoba je povinna uchovávat oznámení podaná prostřednictvím interního oznamovacího systému a evidovat údaje o přijatých oznámeních po dobu 5 let ode dne jejich přijetí. **UPOZORNĚNÍ: Vzhledem ke specifičnosti zpracování v této oblasti Apify upozorňuje, že výkon některých práv souvisejících se zpracováním osobních údajů (zejména práva na přístup) a informačních povinností může být omezen z důvodu zákonné povinnosti chránit identitu oznamovatele a dalších osob uvedených v oznámení, aby nedošlo k případnému narušení šetření oznámených informací.** --- # Apify platform > **Apify** is a cloud platform that helps you build reliable web scrapers, fast, and automate anything you can do manually in a web browser. 
> > **Actors** are serverless cloud programs running on the Apify platform that can easily crawl websites with millions of pages, but also perform arbitrary computing jobs such as sending emails or data transformations. They can be started manually, using our API or scheduler, and they can be easily integrated with other apps. ## Getting started[​](#getting-started "Direct link to Getting started") **Learn how to run any Actor in Apify Store or create your own. A step-by-step guide through your first steps on the Apify platform.** #### [Run Actor from Apify Store](https://docs.apify.com/platform/actors/running.md) [Apify Store is home to thousands of public Actors available to the Apify community. It's the easiest way for you to start with Apify.](https://docs.apify.com/platform/actors/running.md) #### [Build your own Actor](https://docs.apify.com/platform/actors/development.md) [Read about the technical part of building Apify Actors. Learn to define Actor inputs, build new versions, persist Actor state, and choose base Docker images.](https://docs.apify.com/platform/actors/development.md) #### [Become an expert with Apify Academy](https://docs.apify.com/academy.md) [Learn everything about web scraping and automation with our free courses that will turn you into an expert scraper developer.](https://docs.apify.com/academy.md) ## Contents[​](#contents "Direct link to Contents") #### [Actors](https://docs.apify.com/platform/actors.md) [Learn how to develop, run and share your own web scraping and automation tools.](https://docs.apify.com/platform/actors.md) #### [Storage](https://docs.apify.com/platform/storage.md) [Store anything from images and key-value pairs to structured output data.](https://docs.apify.com/platform/storage.md) #### [Proxy](https://docs.apify.com/platform/proxy.md) [Learn to anonymously and reliably access websites in scraping/automation jobs.](https://docs.apify.com/platform/proxy.md) #### [Schedules](https://docs.apify.com/platform/schedules.md) [Learn how to automatically start your Actor and task runs.](https://docs.apify.com/platform/schedules.md) #### [Integrations](https://docs.apify.com/platform/integrations.md) [Learn how to connect the Apify platform with your projects.](https://docs.apify.com/platform/integrations.md) #### [Monitoring](https://docs.apify.com/platform/monitoring.md) [Learn how to check the performance of your Actors, validate your data and receive alerts.](https://docs.apify.com/platform/monitoring.md) --- # Actors **Learn how to develop, run and share serverless cloud programs. Create your own web scraping and automation tools and publish them on the Apify platform.** *** #### [Running Actors](https://docs.apify.com/platform/actors/running.md) [In this section, you learn how to run Apify Actors. You will learn about their configuration, versioning, data retention, usage, and pricing.](https://docs.apify.com/platform/actors/running.md) #### [Actor development](https://docs.apify.com/platform/actors/development.md) [Read about the technical part of building Apify Actors.
Learn to define Actor inputs, build new versions, persist Actor state, and choose base Docker images.](https://docs.apify.com/platform/actors/development.md) #### [Publishing and monetization](https://docs.apify.com/platform/actors/publishing.md) [Learn about publishing and monetizing your Actors on the Apify platform.](https://docs.apify.com/platform/actors/publishing.md) ## Actors overview[​](#actors-overview "Direct link to Actors overview") Actors are serverless cloud programs that can perform anything from a simple action, like filling out a web form, to a complex operation, like crawling an entire website or removing duplicates from a large dataset. Because Actors can persist their state and be restarted, their runs can be as short or as long as necessary, from seconds to hours, or even indefinitely. Basically, Actors are programs packaged as Docker images, which accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output. Additional context For more context, read the [Actor whitepaper](https://whitepaper.actor/). ## Actor components[​](#actor-components "Direct link to Actor components") Actors consist of these elements: * *Dockerfile* which specifies where the Actor's source code is and how to build and run it. * *Documentation* in the form of a README.md file. * *Input and output schemas* that describe what input the Actor requires and what results it produces. * Access to an out-of-the-box *storage system* for Actor data, results, and files. * *Metadata* such as the Actor name, description, author, and version. The documentation and input/output schemas help people understand what the Actor does, enter required inputs in the user interface or API, and integrate results into other workflows. Actors can call and interact with each other to build more complex systems from simple ones. ![Apify Actor diagram](/assets/images/apify-actor-drawing-9e5b2c6bbe7a85acac72e5c7a13290a4.png) ## Build Actors[​](#build-actors "Direct link to Build Actors") Build Actors to automate tasks, scrape data, or create custom workflows. The Apify platform gives you everything you need to develop, test, and deploy your code. Ready to start? Check out the [Actor development documentation](https://docs.apify.com/platform/actors/development.md). ## Running Actors[​](#running-actors "Direct link to Running Actors") You can run Actors manually in [Apify Console](https://console.apify.com/actors), using the [API](https://docs.apify.com/api.md), [CLI](https://docs.apify.com/cli), or [scheduler](https://docs.apify.com/platform/schedules.md). You can easily [integrate Actors](https://docs.apify.com/platform/integrations.md) with other apps, [share](https://docs.apify.com/platform/collaboration/access-rights.md) them with other people, [publish](https://docs.apify.com/platform/actors/publishing.md) them in [Apify Store](https://apify.com/store), and even [monetize](https://docs.apify.com/platform/actors/publishing/monetize.md) them. Try Actors To get a better idea of what Apify Actors are, visit [Apify Store](https://apify.com/store) and try out some of them! ![Apify Store](/assets/images/store-google-maps-scraper-5ea832e3679ab01dd58090a38f0c48c7.png) ## Public and private Actors[​](#public-and-private-actors "Direct link to Public and private Actors") Actors can be [public](https://docs.apify.com/platform/actors/running/actors-in-store.md) or private. Private Actors are yours to use and keep; no one will see them if you don't want them to.
Public Actors are [available to everyone](https://docs.apify.com/platform/actors/running/actors-in-store.md) in [Apify Store](https://apify.com/store). You can make them free to use, or you can [charge for them](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/). --- # Actor development **Read about the technical part of building Apify Actors. Learn to define Actor inputs, build new versions, persist Actor state, and choose base Docker images.** *** This section will guide you through the whole story of [Actor](https://docs.apify.com/platform/actors.md) development. You can follow chapters sequentially from [Quick start](https://docs.apify.com/platform/actors/development/quick-start.md), where you learn how to create your first Actor in just a few minutes, through the more technical sections describing the whole Actor model, up to the [Performance](https://docs.apify.com/platform/actors/development/performance.md) section, where you learn how to fine-tune your Actor to get the most out of the Apify platform. #### [Quick start](https://docs.apify.com/platform/actors/development/quick-start.md) [Create your first Actor using Apify Console IDE or locally.](https://docs.apify.com/platform/actors/development/quick-start.md) #### [Actor definition](https://docs.apify.com/platform/actors/development/actor-definition.md) [Learn how to turn your arbitrary code into an Actor simply by adding an Actor definition directory.](https://docs.apify.com/platform/actors/development/actor-definition.md) #### [Programming interface](https://docs.apify.com/platform/actors/development/programming-interface.md) [Learn about the programming interface of Apify Actors, important commands and features provided by the Apify SDK, and how to use them in your Actors.](https://docs.apify.com/platform/actors/development/programming-interface.md) #### [Deployment](https://docs.apify.com/platform/actors/development/deployment.md) [Learn how to deploy your Actors to the Apify platform and build them.](https://docs.apify.com/platform/actors/development/deployment.md) #### [Builds and runs](https://docs.apify.com/platform/actors/development/builds-and-runs.md) [Learn about Actor builds and runs, their lifecycle, versioning, and other properties.](https://docs.apify.com/platform/actors/development/builds-and-runs.md) #### [Performance](https://docs.apify.com/platform/actors/development/performance.md) [Learn how to get the maximum value out of your Actors, minimize costs, and maximize results.](https://docs.apify.com/platform/actors/development/performance.md) *** If your Actor allows for integration with external services, or performs some post-processing of datasets generated by other Actors, check out the section [integration-ready Actors](https://docs.apify.com/platform/integrations/actors/integration-ready-actors.md). After your development, you can jump to the section [Publishing and monetization](https://docs.apify.com/platform/actors/publishing.md) to learn about how to publish your Actor in [Apify Store](https://apify.com/store) and monetize it by renting it out to users of the platform. --- # Actor definition **Learn how to turn your arbitrary code into an Actor simply by adding an Actor definition directory.** *** A single isolated Actor consists of source code and various settings. You can think of an Actor as a cloud app or service that runs on the Apify platform. The run of an Actor is not limited to the lifetime of a single HTTP transaction. 
It can run for as long as necessary, even forever. Basically, Actors are programs packaged as [Docker images](https://hub.docker.com/), which accept a well-defined JSON input, perform an action, and optionally produce an output. Actors have the following elements: * The main **[actor.json](https://docs.apify.com/platform/actors/development/actor-definition/actor-json.md)** file contains **metadata** such as the Actor name, description, author, version, and links pointing to the other definition files below. * **[Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md)** which specifies where the Actor's source code is and how to build and run it. * **Documentation** in the form of a **README.md** file. * **[Input](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md)** and **[dataset schemas](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md)** that describe what input the Actor requires and what results it produces. * Access to an out-of-the-box **[storage](https://docs.apify.com/platform/storage.md)** system for Actor data, results, and files. The documentation and the input/dataset schemas make it possible for people to easily understand what the Actor does, enter the required inputs in either the user interface or the API, and integrate the Actor's results with their other workflows. Actors can easily call and interact with each other, making it possible to build more complex systems on top of simple ones. The Apify platform provides an open [API](https://docs.apify.com/api/v2.md), cron-style [scheduler](https://docs.apify.com/platform/schedules.md), [webhooks](https://docs.apify.com/platform/integrations/webhooks.md), and [integrations](https://docs.apify.com/platform/integrations.md) to services such as Zapier or Make, which make it easy for users to integrate Actors with their existing workflows. Anyone is welcome to [publish Actors](https://docs.apify.com/platform/actors/publishing.md) in [Apify Store](https://apify.com/store), and you can even [monetize your Actors](https://docs.apify.com/platform/actors/publishing/monetize.md). Actors can be developed and run locally and then easily deployed to the Apify platform using the [Apify CLI](https://docs.apify.com/cli) or a [GitHub integration](https://docs.apify.com/platform/integrations/github.md). For more details, see the [Deployment](https://docs.apify.com/platform/actors/development/deployment.md) section. > **To get a better idea of what Apify Actors are, visit [Apify Store](https://apify.com/store), and try out some of them!** --- # actor.json **Learn how to write the main Actor configuration in the `.actor/actor.json` file.** *** Your main Actor configuration is in the `.actor/actor.json` file at the root of your Actor's directory. This file links your local development project to an Actor on the Apify platform. It should include details like the Actor's name, version, build tag, and environment variables. Make sure to commit this file to your Git repository.
For example, the `.actor/actor.json` file can look like this: * Full actor.json * Minimal actor.json ``` { "actorSpecification": 1, // always 1 "name": "name-of-my-scraper", "version": "0.0", "buildTag": "latest", "minMemoryMbytes": 256, "maxMemoryMbytes": 4096, "environmentVariables": { "MYSQL_USER": "my_username", "MYSQL_PASSWORD": "@mySecretPassword" }, "usesStandbyMode": false, "dockerfile": "./Dockerfile", "readme": "./ACTOR.md", "input": "./input_schema.json", "storages": { "dataset": "./dataset_schema.json" }, "webServerSchema": "./web_server_openapi.json" } ``` ``` { "actorSpecification": 1, // always 1 "name": "name-of-my-scraper", "version": "0.0" } ``` ## Reference[​](#reference "Direct link to Reference") Deployment metadata Actor `name`, `version`, `buildTag`, and `environmentVariables` are currently only used when you deploy your Actor using the [Apify CLI](https://docs.apify.com/cli) and not when deployed, for example, via GitHub integration. There, it serves for informative purposes only. | Property | Type | Description | | ---------------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `actorSpecification` | Required | The version of the Actor specification. This property must be set to `1`, which is the only version available. | | `name` | Required | The name of the Actor. | | `version` | Required | The version of the Actor, specified in the format `[Number].[Number]`, e.g., `0.1`, `0.3`, `1.0`, `1.3`, etc. | | `buildTag` | Optional | The tag name to be applied to a successful build of the Actor. If not specified, defaults to `latest`. Refer to the [builds](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md) for more information. | | `environmentVariables` | Optional | A map of environment variables to be used during local development. These variables will also be applied to the Actor when deployed on the Apify platform. For more details, see the [environment variables](https://docs.apify.com/cli/docs/vars) section of Apify CLI documentation. | | `dockerfile` | Optional | The path to the Dockerfile to be used for building the Actor on the platform. If not specified, the system will search for Dockerfiles in the `.actor/Dockerfile` and `Dockerfile` paths, in that order. Refer to the [Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md) section for more information. | | `dockerContextDir` | Optional | The path to the directory to be used as the Docker context when building the Actor. The path is relative to the location of the `actor.json` file. This property is useful for monorepos containing multiple Actors. Refer to the [Actor monorepos](https://docs.apify.com/platform/actors/development/deployment/source-types.md#actor-monorepos) section for more details. | | `readme` | Optional | The path to the README file to be used on the platform. If not specified, the system will look for README files in the `.actor/README.md` and `README.md` paths, in that order of preference. 
Check out the [Apify Marketing Playbook](https://apify.notion.site/How-to-create-an-Actor-README-759a1614daa54bee834ee39fe4d98bc2) for guidance on how to write a quality README file. | | `input` | Optional | You can embed your [input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md) object directly in `actor.json` under the `input` field. You can also provide a path to a custom input schema. If not provided, the input schema at `.actor/INPUT_SCHEMA.json` or `INPUT_SCHEMA.json` is used, in this order of preference. | | `changelog` | Optional | The path to the CHANGELOG file displayed in the Information tab of the Actor in Apify Console next to Readme. If not provided, the CHANGELOG at `.actor/CHANGELOG.md` or `CHANGELOG.md` is used, in this order of preference. Your Actor doesn't need to have a CHANGELOG, but it is good practice to keep it updated for published Actors. | | `storages.dataset` | Optional | You can define the schema of the items in your dataset under the `storages.dataset` field. This can be either an embedded object or a path to a JSON schema file. [Read more](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md) about Actor dataset schemas. | | `minMemoryMbytes` | Optional | Specifies the minimum amount of memory in megabytes required by the Actor to run. Requires an *integer* value. If both `minMemoryMbytes` and `maxMemoryMbytes` are set, then `minMemoryMbytes` must be equal to or lower than `maxMemoryMbytes`. Refer to [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. | | `maxMemoryMbytes` | Optional | Specifies the maximum amount of memory in megabytes required by the Actor to run. It can be used to control the costs of a run, especially when developing pay-per-result Actors. Requires an *integer* value. Refer to [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. | | `usesStandbyMode` | Optional | Boolean specifying whether the Actor will have [Standby mode](https://docs.apify.com/platform/actors/development/programming-interface/standby.md) enabled. | | `webServerSchema` | Optional | Defines an OpenAPI v3 schema for the web server running in the Actor. This can be either an embedded object or a path to a JSON schema file. Use this when your Actor starts its own HTTP server and you want to describe its interface. | --- # Dataset Schema Specification **Learn how to define and present your dataset schema in a user-friendly output UI.** *** The dataset schema defines the structure and representation of data produced by an Actor, both in the API and the visual user interface.
## Example[​](#example "Direct link to Example") Let's consider an example Actor that calls `Actor.pushData()` to store data into dataset: main.js ``` import { Actor } from 'apify'; // Initialize the JavaScript SDK await Actor.init(); /** * Actor code */ await Actor.pushData({ numericField: 10, pictureUrl: 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png', linkUrl: 'https://google.com', textField: 'Google', booleanField: true, dateField: new Date(), arrayField: ['#hello', '#world'], objectField: {}, }); // Exit successfully await Actor.exit(); ``` To set up the Actor's output tab UI using a single configuration file, use the following template for the `.actor/actor.json` configuration: .actor/actor.json ``` { "actorSpecification": 1, "name": "Actor Name", "title": "Actor Title", "version": "1.0.0", "storages": { "dataset": { "actorSpecification": 1, "views": { "overview": { "title": "Overview", "transformation": { "fields": [ "pictureUrl", "linkUrl", "textField", "booleanField", "arrayField", "objectField", "dateField", "numericField" ] }, "display": { "component": "table", "properties": { "pictureUrl": { "label": "Image", "format": "image" }, "linkUrl": { "label": "Link", "format": "link" }, "textField": { "label": "Text", "format": "text" }, "booleanField": { "label": "Boolean", "format": "boolean" }, "arrayField": { "label": "Array", "format": "array" }, "objectField": { "label": "Object", "format": "object" }, "dateField": { "label": "Date", "format": "date" }, "numericField": { "label": "Number", "format": "number" } } } } } } } } ``` The template above defines the configuration for the default dataset output view. Under the `views` property, there is one view titled *Overview*. The view configuration consists of two main steps: 1. `transformation` - set up how to fetch the data. 2. `display` - set up how to visually present the fetched data. The default behavior of the Output tab UI table is to display all fields from `transformation.fields` in the specified order. You can customize the display properties for specific formats or column labels if needed. ![Output tab UI](/assets/images/output-schema-example-42bf91c1c1f39834fad5bbedf209acaa.png) ## Structure[​](#structure "Direct link to Structure") Output configuration files need to be located in the `.actor` folder within the Actor's root directory. You have two choices of how to organize files within the `.actor` folder. ### Single configuration file[​](#single-configuration-file "Direct link to Single configuration file") .actor/actor.json ``` { "actorSpecification": 1, "name": "this-is-book-library-scraper", "title": "Book Library scraper", "version": "1.0.0", "storages": { "dataset": { "actorSpecification": 1, "fields": {}, "views": { "overview": { "title": "Overview", "transformation": {}, "display": {} } } } } } ``` ### Separate configuration files[​](#separate-configuration-files "Direct link to Separate configuration files") .actor/actor.json ``` { "actorSpecification": 1, "name": "this-is-book-library-scraper", "title": "Book Library scraper", "version": "1.0.0", "storages": { "dataset": "./dataset_schema.json" } } ``` .actor/dataset\_schema.json ``` { "actorSpecification": 1, "fields": {}, "views": { "overview": { "title": "Overview", "transformation": {}, "display": { "component": "table" } } } } ``` Both of these methods are valid so choose one that suits your needs best. 
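To make the empty placeholders in the separate configuration file concrete, here is a minimal sketch of a `.actor/dataset_schema.json` with the `fields` schema and the view filled in. The `title` and `price` field names are purely illustrative:

```
{
    "actorSpecification": 1,
    "fields": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "title": { "type": "string" },
            "price": { "type": "number" }
        },
        "required": ["title"]
    },
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": ["title", "price"]
            },
            "display": {
                "component": "table",
                "properties": {
                    "title": { "label": "Title", "format": "text" },
                    "price": { "label": "Price", "format": "number" }
                }
            }
        }
    }
}
```

The `fields` object is an ordinary JSON schema describing one dataset item, while the view only references field names that the schema defines.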
## Handle nested structures[](#handle-nested-structures "Direct link to Handle nested structures") The most frequently used data formats present the data in a tabular format (Output tab table, Excel, CSV). If your Actor produces nested JSON structures, you need to transform the nested data into a flat tabular format. You can flatten the data in the following ways: * Use `transformation.flatten` to flatten the nested structure of specified fields. This transforms the nested object into a flat structure. For example, with `flatten:["foo"]`, the object `{"foo": {"bar": "hello"}}` is turned into `{"foo.bar": "hello"}`. Once the structure is flattened, it's necessary to use the flattened property name in both `transformation.fields` and `display.properties`; otherwise, fields might not be fetched or configured properly in the UI visualization. * Use `transformation.unwind` to deconstruct the nested children into parent objects. * Change the output structure in an Actor from nested to flat before the results are saved in the dataset. ## Dataset schema structure definitions[](#dataset-schema-structure-definitions "Direct link to Dataset schema structure definitions") The dataset schema structure defines the various components and properties that govern the organization and representation of the output data produced by an Actor. It specifies the structure of the data, the transformations to be applied, and the visual display configurations for the Output tab UI. ### DatasetSchema object definition[](#datasetschema-object-definition "Direct link to DatasetSchema object definition") | Property | Type | Required | Description | | -------------------- | ---------------------------- | -------- | -------------------------------------------------------------------------------------------- | | `actorSpecification` | integer | true | Specifies the version of the dataset schema structure document. Currently only version 1 is available. | | `fields` | JSONSchema compatible object | true | Schema of one dataset object. Use JSON Schema Draft 2020-12 or other compatible formats. | | `views` | DatasetView object | true | An object with a description of the API and UI views. | ### DatasetView object definition[](#datasetview-object-definition "Direct link to DatasetView object definition") | Property | Type | Required | Description | | ---------------- | ------------------------- | -------- | -------------------------------------------------------------------------------------------- | | `title` | string | true | The title is visible in the UI in the Output tab and in the API. | | `description` | string | false | The description is only available in the API response. | | `transformation` | ViewTransformation object | true | The definition of the data transformation applied when dataset data is loaded from the Dataset API. | | `display` | ViewDisplay object | true | The definition of the Output tab UI visualization. | ### ViewTransformation object definition[](#viewtransformation-object-definition "Direct link to ViewTransformation object definition") | Property | Type | Required | Description | | --------- | --------- | -------- | -------------------------------------------------------------------------------------------- | | `fields` | string\[] | true | Selects fields to be presented in the output. The order of fields matches the order of columns in the visualization UI.
If a field value is missing, it will be presented as **undefined** in the UI. | | `unwind` | string\[] | false | Deconstructs nested children into the parent object. For example, with `unwind:["foo"]`, the object `{"foo": {"bar": "hello"}}` is transformed into `{"bar": "hello"}`. | | `flatten` | string\[] | false | Transforms a nested object into a flat structure. For example, with `flatten:["foo"]` the object `{"foo":{"bar": "hello"}}` is transformed into `{"foo.bar": "hello"}`. | | `omit` | string\[] | false | Removes the specified fields from the output. Nested field names can be used as well. | | `limit` | integer | false | The maximum number of results returned. The default is all results. | | `desc` | boolean | false | By default, results are sorted in ascending order based on the write event into the dataset. If `desc:true`, the newest writes to the dataset will be returned first. | ### ViewDisplay object definition[](#viewdisplay-object-definition "Direct link to ViewDisplay object definition") | Property | Type | Required | Description | | ------------ | ------ | -------- | -------------------------------------------------------------------------------------------- | | `component` | string | true | Only the `table` component is available. | | `properties` | Object | false | An object with keys matching the `transformation.fields` and `ViewDisplayProperty` as values. If properties are not set, the table will be rendered automatically with fields formatted as `strings`, `arrays` or `objects`. | ### ViewDisplayProperty object definition[](#viewdisplayproperty-object-definition "Direct link to ViewDisplayProperty object definition") | Property | Type | Required | Description | | -------- | ------ | -------- | ----------------------------------------------------------------------------------- | | `label` | string | false | In the Table view, the label will be visible as the table column's header. | | `format` | One of - `text`- `number`- `date`- `link`- `boolean`- `image`- `array`- `object` | false | Describes how output data values are formatted to be rendered in the Output tab UI. | --- # Dataset validation **Specify the dataset schema within the Actors so you can add monitoring and validation at the field level.** *** To define a schema for a default dataset of an Actor run, you need to set the `fields` property in the dataset schema. info The schema defines a single item in the dataset. Be careful not to define the schema as an array; it always needs to be a schema of an object. Schema configuration is not available for named datasets or dataset views.
You can either do that directly through `actor.json`: .actor.json ``` { "actorSpecification": 1, "storages": { "dataset": { "actorSpecification": 1, "fields": { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "name": { "type": "string" } }, "required": ["name"] }, "views": {} } } } ``` Or in a separate file linked from the `.actor.json`: .actor.json ``` { "actorSpecification": 1, "storages": { "dataset": "./dataset_schema.json" } } ``` dataset\_schema.json ``` { "actorSpecification": 1, "fields": { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "name": { "type": "string" } }, "required": ["name"] }, "views": {} } ``` important The dataset schema needs to be a valid JSON schema draft-07, so the `$schema` line is important and must be exactly this value or it must be omitted: `"$schema": "http://json-schema.org/draft-07/schema#"` ## Dataset validation[](#dataset-validation "Direct link to Dataset validation") When you define a schema for your default dataset, the schema is then always used when you insert data into the dataset to perform validation (we use [AJV](https://ajv.js.org/)). If the validation succeeds, nothing changes from the current behavior: data is stored and an empty response with status code `201` is returned. If the data you attempt to store in the dataset is *invalid* (meaning any of the items received by the API fails validation), *the entire request will be discarded*. The API will return a response with status code `400` and the following JSON response: ``` { "error": { "type": "schema-validation-error", "message": "Schema validation failed", "data": { "invalidItems": [{ "itemPosition": "", "validationErrors": "" }] } } } ``` The type of the AJV validation error object is defined [here](https://github.com/ajv-validator/ajv/blob/master/lib/types/index.ts#L86).
If you use the Apify JS client or the Apify SDK and call the `pushData` function, you can access the validation errors in a `try...catch` block like this: * JavaScript * Python ``` try { const response = await Actor.pushData(items); } catch (error) { if (!error.data?.invalidItems) throw error; error.data.invalidItems.forEach((item) => { const { itemPosition, validationErrors } = item; }); } ``` ``` try: await Actor.push_data(items) except ApifyApiError as error: if "invalidItems" in error.data: validation_errors = error.data["invalidItems"] ``` ## Examples of common types of validation[](#examples-of-common-types-of-validation "Direct link to Examples of common types of validation") Optional field (price is optional in this case): ``` { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "name": { "type": "string" }, "price": { "type": "number" } }, "required": ["name"] } ``` Field with multiple types: ``` { "price": { "type": ["string", "number"] } } ``` Field with type `any`: ``` { "price": { "type": ["string", "number", "object", "array", "boolean"] } } ``` Enabling fields to be `null`: ``` { "name": { "type": "string", "nullable": true } } ``` Define the type of objects in an array: ``` { "comments": { "type": "array", "items": { "type": "object", "properties": { "author_name": { "type": "string" } } } } } ``` Define specific fields, but allow anything else to be added to the item: ``` { "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "properties": { "name": { "type": "string" } }, "additionalProperties": true } ``` See the [JSON schema reference](https://json-schema.org/understanding-json-schema/reference) for additional options. You can also use [conversion tools](https://www.liquid-technologies.com/online-json-to-schema-converter) to convert an existing JSON document into its JSON schema. ## Dataset field statistics[](#dataset-field-statistics "Direct link to Dataset field statistics") When you configure the dataset fields schema, we generate a field list and measure the following statistics: * **Null count:** how many items in the dataset have the field set to null * **Empty count:** how many items in the dataset have the field `undefined`, meaning that, for example, an empty string is not considered empty * **Minimum and maximum** * For numbers, this is calculated directly * For strings, this field tracks string length * For arrays, this field tracks the number of items in the array * For objects, this tracks the number of keys * For booleans, this tracks whether the boolean was set to true. Minimum is always 0, but maximum can be either 1 or 0 based on whether at least one item in the dataset has the boolean field set to true. You can use these statistics in [monitoring](https://docs.apify.com/platform/monitoring.md#alert-configuration). --- # Dockerfile **Learn about the available Docker images you can use as a base for your Apify Actors. Choose the right base image based on your Actor's requirements and the programming language you're using.** *** When developing an [Actor](https://docs.apify.com/platform/actors.md) on the Apify platform, you can choose from a variety of pre-built Docker images to serve as the base for your Actor. These base images come with pre-installed dependencies and tools, making it easier to set up your development environment and ensuring consistent behavior across different environments.
## Base Docker images[](#base-docker-images "Direct link to Base Docker images") Apify provides several Docker images that can serve as base images for Actors. All images come in two versions: * `latest` - This version represents the stable and production-ready release of the base image. * `beta` - This version is intended for testing new features. Use at your own risk. Pre-cached Docker images All Apify Docker images are pre-cached on Apify servers to speed up Actor builds and runs. The source code for generating these images is available in the [apify-actor-docker](https://github.com/apify/apify-actor-docker) repository. ### Node.js base images[](#nodejs-base-images "Direct link to Node.js base images") These images come with Node.js (versions `16`, `18`, `20`, or `22`), the [Apify SDK for JavaScript](https://docs.apify.com/sdk/js), and [Crawlee](https://crawlee.dev/) preinstalled. The `latest` tag corresponds to the latest LTS version of Node.js. | Image | Description | | ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------ | | [`actor-node`](https://hub.docker.com/r/apify/actor-node/) | Slim Alpine Linux image with only essential tools. Does not include headless browsers. | | [`actor-node-puppeteer-chrome`](https://hub.docker.com/r/apify/actor-node-puppeteer-chrome/) | Debian image with Chromium, Google Chrome, and the [`puppeteer`](https://github.com/puppeteer/puppeteer) library. | | [`actor-node-playwright-chrome`](https://hub.docker.com/r/apify/actor-node-playwright-chrome/) | Debian image with Chromium, Google Chrome, and the [`playwright`](https://github.com/microsoft/playwright) library. | | [`actor-node-playwright-firefox`](https://hub.docker.com/r/apify/actor-node-playwright-firefox/) | Debian image with Firefox and the [`playwright`](https://github.com/microsoft/playwright) library. | | [`actor-node-playwright-webkit`](https://hub.docker.com/r/apify/actor-node-playwright-webkit/) | Ubuntu image with WebKit and the [`playwright`](https://github.com/microsoft/playwright) library. | | [`actor-node-playwright`](https://hub.docker.com/r/apify/actor-node-playwright/) | Ubuntu image with [`playwright`](https://github.com/microsoft/playwright) and all its browsers (Chromium, Google Chrome, Firefox, WebKit). | See the [Docker image guide](https://docs.apify.com/sdk/js/docs/guides/docker-images) for more details. ### Python base images[](#python-base-images "Direct link to Python base images") These images come with Python (versions `3.8`, `3.9`, `3.10`, `3.11`, or `3.12`) and the [Apify SDK for Python](https://docs.apify.com/sdk/python) preinstalled. The `latest` tag corresponds to the latest Python 3 version supported by the Apify SDK. | Image | Description | | ----------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | | [`actor-python`](https://hub.docker.com/r/apify/actor-python) | Slim Debian image with only the Apify SDK for Python. Does not include headless browsers. | | [`actor-python-playwright`](https://hub.docker.com/r/apify/actor-python-playwright) | Debian image with [`playwright`](https://github.com/microsoft/playwright) and all its browsers. 
| | [`actor-python-selenium`](https://hub.docker.com/r/apify/actor-python-selenium) | Debian image with [`selenium`](https://github.com/seleniumhq/selenium), Google Chrome, and [ChromeDriver](https://developer.chrome.com/docs/chromedriver/). | ## Custom Dockerfile[](#custom-dockerfile "Direct link to Custom Dockerfile") Apify uses Docker to build and run Actors. If you create an Actor from a template, it already contains an optimized `Dockerfile` for the given use case. To use a custom `Dockerfile`, you can either: * Reference it from the `dockerfile` field in `.actor/actor.json`, * Store it in `.actor/Dockerfile` or `Dockerfile` in the root directory (searched in this order of preference). If no `Dockerfile` is provided, the system uses the following default: ``` FROM apify/actor-node:20 COPY package*.json ./ RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed NPM packages:" \ && (npm list --only=prod --no-optional --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version COPY . ./ ``` For more information about `Dockerfile` syntax and commands, see the [Dockerfile reference](https://docs.docker.com/reference/dockerfile/). Custom base images While `apify/actor-node` is a base Docker image provided by Apify, you can use other Docker images as the base for your Actors. However, using the Apify images has some performance advantages, as they are pre-cached on Apify servers. By default, Apify base Docker images with the Apify SDK and Crawlee start your Node.js application the same way as `npm start`, i.e., by running the `start` command specified in `package.json` under `scripts`. The default `package.json` is similar to: ``` { "description": "Anonymous Actor on the Apify platform", "version": "0.0.1", "license": "UNLICENSED", "main": "main.js", "scripts": { "start": "node main.js" }, "dependencies": { "apify": "^3.0.0", "crawlee": "^3.0.0" }, "repository": {} } ``` This means the system expects the source code to be in `main.js` by default. If you want to override this behavior, use a custom `package.json` and/or `Dockerfile`. Optimization tips You can check out various optimization tips for your Dockerfile in our [Performance](https://docs.apify.com/platform/actors/development/performance.md) documentation. --- # Actor input schema **Learn how to define and validate a schema for your Actor's input with code examples. Provide an autogenerated input UI for your Actor's users.** *** The input schema defines the input parameters for an Actor. It's a `JSON` object comprising various field types supported by the Apify platform. Based on the input schema, the Apify platform automatically generates a user interface for the Actor. It also validates the input data passed to the Actor when it's executed through the API or the Apify Console UI. The following is an example of an auto-generated UI for the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor. 
![Website Content Crawler input UI](/assets/images/input-ui-website-content-crawler-bdcadc69b124e19dd086b2225b501379.png) With an input schema defined as follows: ``` { "title": "Input schema for Website Content Crawler", "description": "Enter the start URL(s) of the website(s) to crawl, configure other optional settings, and run the Actor to crawl the pages and extract their text content.", "type": "object", "schemaVersion": 1, "properties": { "startUrls": { "title": "Start URLs", "type": "array", "description": "One or more URLs of the pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for the start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.", "editor": "requestListSources", "prefill": [{ "url": "https://docs.apify.com/" }] }, "crawlerType": { "sectionCaption": "Crawler settings", "title": "Crawler type", "type": "string", "enum": ["playwright:chrome", "cheerio", "jsdom"], "enumTitles": ["Headless web browser (Chrome+Playwright)", "Raw HTTP client (Cheerio)", "Raw HTTP client with JS execution (JSDOM) (experimental!)"], "description": "Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.", "default": "playwright:chrome" }, "maxCrawlDepth": { "title": "Max crawling depth", "type": "integer", "description": "The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have a depth of 0, the pages linked directly from the start URLs have a depth of 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl start URLs.", "minimum": 0, "default": 20 }, "maxCrawlPages": { "title": "Max pages", "type": "integer", "description": "The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.", "minimum": 0, "default": 9999999 }, // ... } } ``` The actual input object passed from the autogenerated input UI to the Actor then looks like this: ``` { "debugMode": false, "proxyConfiguration": { "useApifyProxy": true }, "saveHtml": false, "saveMarkdown": false, "saveScreenshots": false, "startUrls": [ { "url": "https://docs.apify.com/" } ] } ``` Next, let's take a look at [input schema specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1.md), and the possibility of using input schema to enable users to pass [secrets](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/secret-input.md). --- # Secret input **Learn about making some Actor input fields secret and encrypted. 
Ideal for passing passwords, API tokens, or login cookies to Actors.** *** The secret input feature lets you mark specific input fields of an Actor as sensitive. When you save the Actor's input configuration, the values of these marked fields get encrypted. The encrypted input data can only be decrypted within the Actor. This provides an extra layer of security for sensitive information like API keys, passwords, or other confidential data. ## How to set a secret input field[​](#how-to-set-a-secret-input-field "Direct link to How to set a secret input field") To make an input field secret, you need to add a `"isSecret": true` setting to the input field in the Actor's [input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md), like this: ``` { // ... "properties": { // ... "password": { "title": "Password", "type": "string", "description": "A secret, encrypted input field", "editor": "textfield", "isSecret": true }, // ... }, // ... } ``` The editor for this input field will then turn into a secret input, and when you edit the field value, it will be stored encrypted. ![Secret input editor](/assets/images/secret-input-editor-c5569783ff1c5e99f663baa6813a8b32.png) When you run the Actor through the API, the system automatically encrypts any input fields marked as secret before saving them to the Actor run's default key-value store. Type restriction This feature supports `string`, `object`, and `array` input types. Available editor types include: * `hidden` (for any supported input type) * `textfield` and `textarea` (for string inputs) * `json` (for `object` and `array` inputs) ## Read secret input fields[​](#read-secret-input-fields "Direct link to Read secret input fields") When you read the Actor input through `Actor.getInput()`, the encrypted fields are automatically decrypted. Decryption of string fields is supported since [JavaScript SDK](https://docs.apify.com/sdk/js/) 3.1.0; support for objects and arrays was added in [JavaScript SDK](https://docs.apify.com/sdk/js/) 3.4.2 and [Python SDK](https://docs.apify.com/sdk/python/) 2.7.0. ``` > await Actor.getInput(); { username: 'username', password: 'password' } ``` If you read the `INPUT` key from the Actor run's default key-value store directly, you will still get the original, encrypted input value. ``` > await Actor.getValue('INPUT'); { username: 'username', password: 'ENCRYPTED_VALUE:Hw/uqRMRNHmxXYYDJCyaQX6xcwUnVYQnH4fWIlKZL2Vhtq1rZmtoGXQSnhIXmF58+DjKlMZpTlK2zN3YUXk1ylzU6LfXyysOG/PISAfwm27FUgy3IfdgMyQggQ4MydLzdlzefX0mPRyixBviRcFhRTC+K7nK9lkATt3wJpj91YAZm104ZYkcd5KmsU2JX39vxN0A0lX53NjIenzs3wYPaPYLdjKIe+nqG9fHlL7kALyi7Htpy91ZgnQJ1s9saJRkKfWXvmLYIo5db69zU9dGCeJzUc0ca154O+KYYP7QTebJxqZNQsC8EH6sVMQU3W0qYKjuN8fUm1fRzyw/kKFacQ==:VfQd2ZbUt3S0RZ2ciywEWYVBbTTZOTiy' } ``` ## Encryption mechanism[​](#encryption-mechanism "Direct link to Encryption mechanism") The encryption mechanism used for encrypting the secret input fields is the same dual encryption as in [PGP](https://en.wikipedia.org/wiki/Pretty_Good_Privacy#/media/File:PGP_diagram.svg). The secret input field is encrypted using a random key, using the `aes-256-gcm` cipher, and then the key is encrypted using a 2048-bit RSA key. The RSA key is unique for each combination of user and Actor, ensuring that no Actor can decrypt input intended for runs of another Actor by the same user, and no user can decrypt input runs of the same Actor by a different user. This isolation of decryption keys enhances the security of sensitive input data. 
During Actor execution, the decryption keys are passed as environment variables, restricting the decryption of secret input fields to occur solely within the context of the Actor run. This approach prevents unauthorized access to sensitive input data outside the Actor's execution environment. ## Example Actor[](#example-actor "Direct link to Example Actor") If you want to test the secret input live, check out the [Example Secret Input](https://console.apify.com/actors/O3S2UlSKzkcnFHRRA) Actor in Apify Console. If you want to dig in deeper, you can check out its [source code](https://github.com/apify/actor-example-secret-input) on GitHub. --- # Actor input schema specification **Learn how to define and validate a schema for your Actor's input with code examples. Provide an autogenerated input UI for your Actor's users.** *** The Actor input schema serves three main purposes: * It ensures the input data supplied to the Actor adheres to the specified requirements and validation rules. * It is used by the Apify platform to generate a user-friendly interface for configuring and running your Actor. * It simplifies invoking your Actors from external systems by generating calling code and connectors for integrations. To define an input schema for an Actor, set the `input` field in the `.actor/actor.json` file to an input schema object (described below), or to a path to a JSON file containing the input schema object. For backwards compatibility, if the `input` field is omitted, the system looks for an `INPUT_SCHEMA.json` file either in the `.actor` directory or the Actor's top-level directory, but note that this functionality is deprecated and might be removed in the future. The maximum allowed size for the input schema file is 500 kB. When you provide an input schema, the Apify platform will validate the input data passed to the Actor on start (via the API or Apify Console) to ensure compliance before starting the Actor. If the input object doesn't conform to the schema, the caller receives an error and the Actor is not started. Validation aid You can use our [visual input schema editor](https://apify.github.io/input-schema-editor-react/) to guide you through the creation of the `INPUT_SCHEMA.json` file. To ensure the input schema is valid, here's a corresponding [JSON schema file](https://github.com/apify/apify-shared-js/blob/master/packages/input_schema/src/schema.json). You can also use the [`apify validate-schema`](https://docs.apify.com/cli/docs/reference#apify-validate-schema-path) command in the Apify CLI. ## Example[](#example "Direct link to Example") Imagine a simple web crawler that accepts an array of start URLs and a JavaScript function to execute on each visited page. 
The input schema for such a crawler could be defined as follows: ``` { "title": "Cheerio Crawler input", "description": "To update crawler to another site, you need to change startUrls and pageFunction options!", "type": "object", "schemaVersion": 1, "properties": { "startUrls": { "title": "Start URLs", "type": "array", "description": "URLs to start with", "prefill": [ { "url": "http://example.com" }, { "url": "http://example.com/some-path" } ], "editor": "requestListSources" }, "pageFunction": { "title": "Page function", "type": "string", "description": "Function executed for each request", "prefill": "async () => { return $('title').text(); }", "editor": "javascript" } }, "required": ["startUrls", "pageFunction"] } ``` The generated input UI will be: ![Apify Actor input schema example](/assets/images/input-schema-example-fa0b2cfdecdbe9dcdb2580759b8d1b12.png) If you switch the input to the **JSON** display using the toggle, then you will see the entered input stringified to `JSON`, as it will be passed to the Actor: ``` { "startUrls": [ { "url": "http://example.com" }, { "url": "http://example.com/some-path" } ], "pageFunction": "async () => { return $('title').text(); }" } ``` ## Structure[](#structure "Direct link to Structure") ``` { "title": "Cheerio Crawler input", "type": "object", "schemaVersion": 1, "properties": { /* define input fields here */ }, "required": [] } ``` | Property | Type | Required | Description | | ---------------------- | ------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | | `title` | String | Yes | Any text describing your input schema. | | `description` | String | No | Help text for the input that will be displayed above the UI fields. | | `type` | String | Yes | This is fixed and must be set to string `object`. | | `schemaVersion` | Integer | Yes | The version of the input schema specification against which your schema is written. Currently, only version `1` is available. | | `properties` | Object | Yes | This is an object mapping each field key to its specification. | | `required` | String | No | An array of field keys that are required. | | `additionalProperties` | Boolean | No | Controls if properties not listed in `properties` are allowed. Defaults to `true`. Set to `false` to make requests with extra properties fail. | Input schema differences Even though the structure of the Actor input schema is similar to JSON schema, there are some differences. We cannot guarantee that JSON schema tooling will work on input schema documents. For a more precise technical understanding of the matter, feel free to browse the code of the [@apify/input\_schema](https://github.com/apify/apify-shared-js/tree/master/packages/input_schema/src) package. ## Fields[](#fields "Direct link to Fields") Each field of your input is described under its key in the `inputSchema.properties` object. 
The field might have `integer`, `string`, `array`, `object`, or `boolean` type, and its specification contains the following properties: | Property | Value | Required | Description | | -------------------- | ---------------------------------------------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `type` | One of - `string`- `array`- `object`- `boolean`- `integer` | Yes | Allowed type for the input value. Cannot be mixed. | | `title` | String | Yes | Title of the field in UI. | | `description` | String | Yes | Description of the field that will be displayed as help text in the Actor input UI. | | `default` | Must match `type` property. | No | Default value that will be used when no value is provided. | | `prefill` | Must match `type` property. | No | Value that will be prefilled in the Actor input interface. | | `example` | Must match `type` property. | No | Sample value of this field for the Actor, to be displayed when the Actor is published in Apify Store. | | `sectionCaption` | String | No | If this property is set, then all fields following this field (this field included) will be separated into a collapsible section with the value set as its caption. The section ends at the last field or the next field which has the `sectionCaption` property set. | | `sectionDescription` | String | No | If the `sectionCaption` property is set, then you can use this property to provide an additional description for the section. The description will be visible right under the caption when the section is open. | ### Prefill vs. default vs. required[](#prefill-vs-default-vs-required "Direct link to Prefill vs. default vs. required") Here is a rule of thumb for whether an input field should have a `prefill`, `default`, or be required: * **Prefill** - Use for fields that don't have a reasonable default. The provided value is prefilled for the user to show them an example of using the field and to make it easy to test the Actor (e.g., search keyword, start URLs). In other words, this field is only used in the user interface but does not affect the Actor functionality and API. Note that if you add a new input option to your Actor, the Prefill value won't be used by existing integrations such as Actor tasks or API calls, but the Default will be if specified. This is useful for keeping backward compatibility when introducing a new flag or feature that you prefer new users to use. * **Required** - Use for fields that don't have a reasonable default and MUST be entered by the user (e.g., API token, password). * **Default** - Use for fields that MUST be set to some value for the Actor run, but where you don't need the user to change the default behavior (e.g., max pages to crawl, proxy settings). If the user omits the value when starting the Actor via any means (API, CLI, scheduler, or user interface), the platform automatically passes the Actor this default value. * **No particular setting** - Use for purely optional fields where it makes no sense to prefill any value (e.g., flags like debug mode or download files). In summary, you can use each option independently or use a combination of **Prefill + Required** or **Prefill + Default**, but the combination of **Default + Required** doesn't make sense to use. 
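As an illustration of the rule of thumb above, here is a short, hypothetical sketch of two input fields: `searchKeyword` uses **Prefill** (no default, the value is only a UI hint), while `maxPagesPerCrawl` uses **Default** (the platform passes `100` whenever the caller omits the value). Both field names are examples only:

```
{
    "searchKeyword": {
        "title": "Search keyword",
        "type": "string",
        "description": "Keyword to search for. The prefilled value is only shown in the UI and is not used by existing integrations.",
        "editor": "textfield",
        "prefill": "web scraping"
    },
    "maxPagesPerCrawl": {
        "title": "Max pages per crawl",
        "type": "integer",
        "description": "Maximum number of pages the Actor will open. Applied automatically when the caller omits the value.",
        "default": 100
    }
}
```

Adding `searchKeyword` to the schema's `required` array would turn it into the **Prefill + Required** combination mentioned above.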
## Additional properties[​](#additional-properties "Direct link to Additional properties") Most types also support additional properties defining, for example, the UI input editor. ### String[​](#string "Direct link to String") #### Code input[​](#code-input "Direct link to Code input") Example of a code input: ``` { "title": "Page function", "type": "string", "description": "Function executed for each request", "editor": "javascript", "prefill": "async () => { return $('title').text(); }" } ``` Rendered input: ![Apify Actor input schema page function](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABmEAAAFSCAMAAAA99ItlAAABRFBMVEX////MzMwQGBsSFx3yWQxswEre3t4ahP/8/P0Uf/8nLDCOjo7+/v3zXhTj4+PT1NT6+vpqbnDExsY/REgdJCecn6CAyWLt7u7q6urb3Nx6foGqrKz1dzcjeJPw8fG4uruWmZqc1YR1eXtwwk+mqav2gkj0aybg4eHO0NGSk5X09PT4nG6+v8Cio6SR0HhSWFv2jln29vaztreBhYdtcnRgZWj+9fDn5+jzZR2GiYtOUlbIycovf5j0cS739/exsrJZXWD95diKjpD1fD80Oj782cf5rYdCSEv+6+H3lmZ7xlxITVH5tJGusLLb8NIih/8xODv7xar//Pn6vqBdYWQ5P0IqMTWkp6hlamzU7crk5eX93s77yrKKzW/X2Nn80rz4pnzs9+av3ZzJ6b2AgoP0+/KUl5mkyNNrpLe9467J3uSz0drdick+AAAWQ0lEQVR42uzBQQEAAAQEMAp4CHT9YwliWwEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAA5ne7UkBHHt315o2FAZw/PG56LIkRE2VegzEEISqJIgvQ7aBKaXgRZF5Ya3soldj3/8j7Dk5x9iyM2fNIDCe301eenq865+TF8v+H9bixi05Q9IfQRle5miZB4wxxirTGHXCCWnAP5A9ImIfShkiXsPlloHjpL7wPOGnjhPwOoYxxiqzR83eJlBSitIzlCIj1YFLdYZO0CkOAmfYAcYYY9X4hEdtKGeDiPu2gAs0drZdAylCXJUJTB1eqXNiGGPsXb590L4ZzlxQmOvr6w1KEZQRImIGl2kgog+S62cTuNBy6Ah4QzhDvlDGGGPn+/BR+2A48/7CNIE01jYi1qGEOiKGpQpTVqBXMGESZPWlXsUEwBhjrMLCkBoizkCybpPUv4GCNcjW9QZMhLgFJfT7gT+C1zpijYhCiAaAJ8REhUOIKyBX8rRVf+7Jo2KOIKBZpYmQdeoLMaLTQixAGfnBc+KB0sg/feKnfgf+wHOCw7Uy0u7o6vATZYwx9q7CmK6SlSsMfEEc08btt1DaDEDpbZA8+D1EG6RGF3MzC46aqPkANmIPpAEd0yCLNvUon/ZuoYMRY24rSzZFrQ0QIX558yljVaWvcqqm+twlGGWOikrk1JZhzwl0djJgjDF2fmE+GM+XK0ysnhPeoqYrUUNtpgvjFjH58Y7CdFF5soBYe9Q2wlwYq1mM6BwKE6MSg8nSSXVTekBSJwS1BcYYY9UW5i5fw/iI+DhM6TzugHy3aW91H+8QdWHukQa2o3FehEISxYgYRdHCWBiyjT63aJPqmuHDbPiIedWuoiHtzKOofixMV9Zo9nMsR4S6MNiat2MkL+aLZD7kxBJIoi+P+XyZjDHGKi6MpxYHe8qHRYcZHXqyBToklmyOrf/Uzw6VWIBW5ISYC5PQ8YISs6dtIoPj0k5AO4M3d/p1YfzDiD7tdPXH2p7+3RQMao6AVyK9hhFODRhjjFVYmJcVIn4FsAiQRks9e7w6DJg8INoqLCs1YkftObswMUgzdSVuTlOFQNytbQ9NhYnVCNKkPUsVJgPpWhbO4M1iJcyc4Li0YYwxVklhMI7j5grJ3AXNGi3ouK/+9E8hp+/DPNG4MDemvbMLE4DUV3M8vPlNU2GeihE9RLxVhfEOyZn/rTCJ46RLLgxjjF3wxuW3E+cvfqc/toBYWfzUwlxfveRyBbm1qoP6mTY+uzADlYtijvbpwtCICHJXcm2lCgO57hmFyW/pAF8lY4yxCh0Ls4ksIDdjJEVhXmgjIBfkdbDwtcezC1MvCqOOsxOFUSPWkJvQbnK6MCcXKwnf6WeMsUpQYXYDUu+4x2+evLtP18lvV8m6xfrjbprrT6dfTxemZiiMnmN4sjDFCNW4gbkw5qeVj/hpZcYYq5C+01/oHJ7Ucm0qjCrO1gUS7hBtw412Q2Fo5DNIU0Nh9Bx7UFzLWBga0YRiDs9cGPMblwV+45IxxqphLsz08OVidVSF6dL2XjZgjKoOM8SHCUgiSW6NhfmC2AVp/ofCzA53d6wV4k9dmKAojBrR8vIRY8QNnFUYzwlAy2RW+FtjGGOsGubCvOj/8PL9SRfmZkM7u+YWJVunZ2vJorQQU2NhZvqlyDWaCqPnuB4BuPfqQTFw1aS6MHrEikZYn1XhzIUxf/MlqdUA+Jsvf7F3Ry8JA3EAx3+7h7Qmm65JLqGgF0WSMIvQB5UR9BBhDxXRQ0/9/39Dd9vFlB2BBLrk+3lQGIf49uW23R0A7IK7MPbmmLpK32NlCyNJrHJpXgdpm9S0JgtTnpGzMHOlvdwcK3dhtKHSFq0j8ymGGRvfXuaFKUa8ZyPqpcKwez8AVN56YWwdjMWrLYx4U6VNw0tbB5koK07EVZhi57KxszBGqqzHjhhPpZ0vixEjKRWGE8gAoPLKyxejB6WlI/3VlVwnDM9Ern/q0Gg/Z32ZnMiq0BZGa14cmRnOdegozJtkhvdKux83JXe+yAvTVupqdUR8eibrhRmXC+M+RTngFGUA2NmKS7fZZ68uPxLPSyQzKc43btz1+52G/ObD+6iLU/Eb4UBWNAeD0oh+r7nhf+/6/jBKgiCJhr7f5RYZAGxr15jNpdmSejuJOJXqC+a+NectMgCocmFOlPb4lT3Yj3vyH8wCL4o88gIAFS+MPMXKej0QAMBe225h5K71bOYv0/FAAAD7bYuFsWa9oCEAgL1XO9QpMZbFW2SGuSwAAPyhMFZt/QqFAQBQGABAFS1r1tJxBQAAAAAAAAAAAAAAAAAAfLN3xyYQw0AQRTW5kWJXIFCm/ps7cAO3ymx4r4nPsMECAAAAAAC8xJojAFAy5mpVMwBwYLaaO+nb7xcASq7dk7u4YMZuAFC2R2nFrERgADiyk1WZML0BwJFeGTHDhAHg1M5ofyWO/AAcupJKYRoAHF
IYAB4KA8BHKAwAD4UB4CMUBoCHwgD82K/D1yTiOI7jH/imyDCuy7ytLW3qg5qYYhmjIVYgLBru0YhlODAJrP//D2jeptPo8keczpP369lxDz73g+PeHBKCwgAAQhQGAJAQFAYAEKIwAICEoDAAgBCFAQAkBIUBAIQoDABgg7zcjUZhAAD/L80/DABggsIAABKCwgAAQhQGAJAQFAYAEKIwa5T3e7qRbflXiknxsJHbeaV7dOz7Ta1RzT/SamTKvrd8pLrTPWl5AuCEwqxF2uyxbl1aQ/HInNm1Q62Td6R5LTvRTCkIfirS90FdfzMMJgaKNAiCoW49NVvRW3JgVlw+UrSJKwFwsU2FeTRut8faSBeWuzumWU2x8Mwu09WMlvv07aPikbc/nuC5ZoJUqqSp7Gg0lLRw964qo5GmzlMTFUWqpFLnmmrYWVZRiv09Oer0+5rX3LF9l5HO3nuzsgAst02FqY4L19raRJ7ZgWZO7VSx6Jml5eTZg4exFSa7UM4vmYjCfA0vIgpTmovGj3q9XnEuTO1ffd63nBy9NttdPJe9cBtRwxoC4GB7CvOrUNjcwpTtSWchDF5MX3trrrcwtXc5e/umd6wbfbO8XAtTCgaLhZnz2bkwma59WEFhsl27kNuIWvabnbv9StsK4Dj+8/xiD4cTFHmoVJ1YnpxiEadSH0anoogVdwRn25W22ud1+//f7ya5kGu9KdRxus7dzwsNErzBF/maXJJzGIYxiFtTmL9EXT7+/H0WZpHcgy8aZmZYhYl+y8IsLtMTh2ePnB2oMNIQCoMH5PPhFyZGxjDYIMhwH4ZhDOD2FObPn/8C/mFhDo5qhWq5acNj12uFwkyxA6BSLL6Aa7dYPABaxTpa5UKt6b+2WCvMHDWgs0RuQzHBMPr7MDJwYbwt7NgvyoWZuu0+vMwXasWGfKa4YVlFoQmhXmzBcVm8gOOsWEfjcqaa73i/xVkuN6CxRmYmuDR3HofnkMvQFuZNqfRydPRxSfDicN8l81IqPR4dfVkStIV5W7qbvXs/GlCYO2QOGtFcLrfFcE6QrXg+PbW+lw5BeJXLpeCYz+XuAelcLkMmxarT8KySIQQMEp2enl4yhTGMr3aLCiP808LULM9Gw2tJxPJUgLb4KVxly7pwjwnKliMPV7tsec6gsUqmoFgYaP5kn+fJyS83SCnMgWU1C5ajCKDlbXvEyeKuJcn3ELFqcFhWGY6qFWlZLnfdI++lB7gmRE54M/1jcM2SGW1hRDF63sDhLcNb9ukKUxp1ZX/XFyZKTkAjxZ4dOKblg0kAYwluhQA8ijOcAvbZ02vlPhAwSIpk/EphnsAwjEGYwvhENmbK5YJlRWwIYmEjf1SLOIVB3rI6EOyIteGuKtbKO187gHw8Uxbr1qFxTNpQ/EKe4LrFCcVP2KIwsW0j0DJpK4URByqFmQ3rCGi5m+Ns1C5QyefzEdFCoRhQGPFmqgWvMELVWX0G14yR015hpBiZ1BdGfG5ZRCZ7V3jrPfX4sQhF79PHYvnxXUFTmJeiLndfOq+2tYXBKU+hsbK5uRlnYlOYlgd44dXkM3LLBpD2TlROkTHnuc3NHXJKrLoH1yOZJXWQwMLskfMwDKM/UxjFZdOGcGRZTQAV+S9/u9kGcCD3x2eWVfSKUqgATfnTljzwaZfPoHHKOFTjZBLXpahYhT2SobA/l4LeJHkMpTBWoQGg1YK94R5o4YUMhT8Poy+MZV26Z9O8wpwBbfELrpctSu5cKcwJmYbi/evX0a//LJksjP+yrPNUSbzc/7VQ7JADzMNMJng8682o/CKPI7dlZzTzMA/JVWgG0RbmhJyDYRj9mcJcV/F2vQ3LqsG3YUVsLy0V79sFBLkLL8izSgHIZ1DNkpu4bnFKkYawcrJM4dlSCJ87mZsKMxG7UpgGPE0ng3K77EEKk4ckClOAUHSOfq5ZJ+PHDKnzQPegcdPCyLXeulkJvETmHbnSrzD+gUY0zN/c78cMjzvnyrSFGScfaAbRFya6Ra4vJFMwDOOLTGGuurgs12o1ucd1pjR20XXp/nNfkUkRhZHdKUCION+CkWtQrZCrGNDYwimpuQAjTnI9hZ4DJYhHYivzjg1RvkEK01AKU5aNusA1k4cUEhMxeObIsWEXJivOkbmcWRqtTXK2f2HWyb1NxznDgDcVQ4ZT0BbmR3I6cBD74cOHk1DYyQTJ23sHCsP4Lv3XC9MpWFLe208LkZmL3pFN1e1MS+6xlcK05d49SJjHUN0jFzCw8Tg1l2hOL6yR4XG1MEf+hJLvbJDCQCnMpTwZ+ALXhabXExTW7O5cemzYhRlVRKHzjAz1L8w5e+TqaZIx6AszT+4NOghCO+Tx5gdzDGMYfZjCKCoRd2q/WJSFQcsrTrXd/aRZxTlXhuuFachXBIkzAVWafIrr7DHFLFyTyTjJxAdteshltTB1SAVRm67dAQoTUQtT9wujk+TCk+4ppSWxMPzCZEtddsBfM4z+hUkw/lTK2d2t5XZAYWbJiYEGkS/N2DAMow9TGFVdHgVU/F406k5kinBciIWOWEVTGDkdE2iNXIQiSY70n+kHoul1Csu/RKF1yIRamCakvBNDn74wlZsWBovkFhxj5Orwz5Jl8UW22lU7JGgLc8xjXPFcf5ZMesJDzSB6m6S5u7JhDMAURlGW0+IveoWRHxOrQtZkQ6zS0BUmYkVsBJv+bJeUIaP9CzP+zt0nbs4jyAIZ1RWmKGdRggpTlcW8YWGwznB3v/zbjQvz+2fPRWVr5FKgh+opxlWSh2ph9v2qh0NQRI+ZiJHLykx/Cr41MnAQzC0sfIAvw7A5hDGMAZjCKPJyWrwqC9Obfyl099tCFbrC1OSOGe0KNMbIHBRx7kDDfqhYxD7JnXQUwZKawsjlKjztXj3bkAretue/ujCLMa8w0QS3evvaEAJExWX7gYV5I6/EVC6C+dS73rIEz3vopMkf1Y98Ma5N7rQ/dx+SLUrjxJ9uGSFHNDcR1Q0CCuaafsP4arepMPMfP34Ut74UX5/jRi4tq7qLdl7O9HciRw2g4e6IZWuEprYwDfFMsQ27FalDZ5lx+EbIn9DffnhvDL5BCyOLV6sAlWah2ntvtV27V9Ij2HXrqwtzh4cn9+a4PUE+6L2TNII41/O/1hVGXvXy6T3U4mTv//67WD2aFYmJAr+WsqWAv+W5HVSYn8iplO1l5ZCJpyFgNhd/6iVjCsAEGesdMC7f8fO4kuC7zwYxhTGM78q/W5g/fuj64+Yz/e7l7V5hDizB/VFD+YhWW1sY1Lsr14Pmhrev3JYsiv7GQ+jjhExpC9Nwt9v5UlXemzwa61iuyFcXZp5dazY8cW4hSEm5a0x2tOetelsZSPLxSy82krYw98gkggoTPfXvGjMeJnnufHnqflQ5/gjA4n5vKmaNLkhTTCwGDBKiAN864zAMoz9TGNXBhiVs7HqFaRQsV7UD6cx5QlcY+Vqh0IFOaJ/rV8/xD8U2mdMVxr9PWqHZTUdeFKWXQ/k2jwIL09IVBmPJnX3yyfqSOsP0CgHs+9leYUY/L8zrl9krh
Xl/NysLg19lfR5/gkZGlkA7D4PUwmm3MJidoCORuQMck/e8bHefxcqHwwRJJZ/JgEFekZxAz2yCz2AYRn+3qTDD0GmeVdDTvjirtxroKX/xyv3KC7myTpocg7THJ48wFIth8nB9CTp2p9m8qECr0mo2cENJhuALxZnB0L3/dP/N2/fQmSUfYFChO0vb8yEM5h1PQ/pBkiTn4Xn0bDlB5mAYxiBMYQYkb1R2U1vcgycU5i8Yktg+yTl8S0naUMTIFXxDD3gewtDJQ5OYfpDfyIxfOGE1BMMwBmIKM5D27lnBsur4ztiTr2LP8S0lif+ZUIJ83nsQG5l/BMMwBmIK8zf7dkwDIBDAUPTYWJCBAVY2HIB/MSRVwI0N75n4Q9OP7kwybhBjucbPrMdxDmCewkwUZtufAcA3CgNABYUBIBQGgBIKA0AoDAAlFAaAUBgASigMAKEwAJRQGABCYQAooTAAhMIAUEJhAAiFAaCEwgAQCgNACYUBIBQGgBIKA0AoDAAlFAaAUBgASigMAKEwAJRQGABCYQAooTAAhMIAUEJhAAiFAaCEwgAQCgNACYUBIBQGgBIKAy97dUACAADAMKh/68NaHLSEQAwDwAnDABDDAHDCMADEMACcMAwAMQwAJwwDQAwDwAnDABDDAHDCMADEMACcMAwAMQwAJwwDQAwDwAnDABDDAHDCMADEMACcMAwAMQwAJwwDQAwDwAnDABDDAHDCMADEMACcMAwAMQwAJwwDQAwDwAnDABDDAHDCMADEMACMvTogAQAAYBjUv/XhLQZawgjDAHCGASDCMACcYQCIMAwAZxgAIgwDwBkGgAjDAHCGASDCMACcYQCIMAwAZxgAIgwDwBkGgAjDAHCGASDCMACcYQCIMAwAZxgAIgwDwBkGgAjDAHCGASDCMACcYQCIMAwAZxgAIgwDwBkGgAjDAHCGASDCMACcYQCIMAwAZxjGXh2QAAAAMAzq3/rwFgMtIUCEYQA4wwAQYRgAzjAARBgGgDMMABGGAeAMA0CEYQA4wwAQYRgAzjAARBgGgDMMABGGAeAMA0CEYQA4wwAQYRgAzjAARBgGgDMMABGGAeAMA0CEYQA4wwAQYRgAzjAARBgGgDMMABGGAeAMA0CEYQA4wwAQYRgAzjAARBgGGHt1QAIAAMAwqH/rw1sMtIRwhgEgwjAAnGEAiDAMAGcYACIMA8AZBoAIwwBwhgEgwjAAnGEAiDAMAGcYACIMA8AZBoAIwwBwhgEgwjAAnGEAiDAMAGcYACIMA8AZBoAIwwBwhgEgwjAAnGEAiDAMAGcYACIMA8AZBoAIwwBwhgEgwjAAnGEAiDAMAGcYACLGXh2cAAjDAACME3SMQPcfUMhbqEEQAndLnGEAKIYBYAjDAFAMA8AQhgGgGAaAIQwDQDEMAEMYBoBiGACGMAwAxTAADPF1mGUYAPrDrDhKwwDQHybjaBsGgP4wO84yAOBpGHkAUH4YpisvAGjIeGvnugDglZU7AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADu9uCABAAAAEDQ/9f9CBUAAAYCwt5ft48TOUMAAAAASUVORK5CYII=) #### Country selection[​](#country-selection "Direct link to Country selection") Example of country selection using a select input: ``` { "title": "Country", "type": "string", "description": "Select your country", "editor": "select", "default": "us", "enum": ["us", "de", "fr"], "enumTitles": ["USA", "Germany", "France"] } ``` Rendered input: ![Apify Actor input schema - country 
input](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABnQAAAGOCAMAAACZjx2dAAABMlBMVEX////e6//+/v78/PwrjP/k5OT7+/vl5eXx8fHv7+/n5+cRGBzZ2dnPz8/09PT4+Pjd3d3p6en6+vrz8/Po6Oj39/f29vbf39/MzMxmZmbf4OHj4+OOjo7h4eEdJCfW1tft7u7s7OyxsrImLDBtcnTc3NzU1NSqqqtXXF+4urvR0tLHyMnFxsaPkpQyOD1LUFSJjY9PVFfr6+uio6RfZGfAwcKtrq+Xmpt1eXo/REibnZ8qMTW7vb5RVltna26foKDU4PTb6PzM2Ovb29uVmJlGTFF/hYmjrr6UlZV7f4Frb3GytbWbprWHkp+BgoPX6f+Buv+lqKliaW5aYm22wtPOzs6uusqHiow2PUGSnatKnP/AzN5ocn2CjJkjKzDP5f+Mwf9xe4YbIyeos8Nlq//I1ObyxhpnAAAbs0lEQVR42uzBgQAAAACAoP2pF6kCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABm39560gjCMI4/YSLu+jbsi8huI4vl6NZKWlu0IPRwQWtsmzRpE9N+/09SdnZWQErrAbx6fhceBp317p93GImIiIiIiIiIiIiIiIiIiIiIaFMKxSLupuiZBzzoVo8LalfHx1e1AEREtDnNs348OmrgEW2rFu7WHFU1uLs91QCoqHr4j1ordlo1EBHRZmhHMqdNPJZQ9Q1uxRhYnqpW7h0d1DX6d+SCJI77vUmtNun14zjhtENEtBFnB5IbP8M9jdrtn7itvAG3UVEtZ/F50KSD0O6zWvkyTsrX3yTxZRlERLR2xzJ10Pl12papM9zPicgr3J5vB507RQemanD/6GBftfjP5hxiziGrQ0S0AdWXIjIMAZhfItKuPEp0aqrBHaKTe0h0mqoNrBJcxhMsmMSXPGEjIlq3kYh8LsLqi8gImSdH/dbxLizT7XYLSHnd7g6AoNuNgLDX72VB2OoORc7tIna6XQ/mRfIu/WoHGe1252tmVPfhFEr+G79SQCYMi9OVRtPLHl0My6q1MAzSPE4/ZozXaPph/k0YFmAqdsExVb/pBTeis6tawyqJm3O8s6R1GLhZJwEREa3XZ5GDbWTMWOQ5Usfnkmp/QioSER+pgchrdyRnPkjqogrgpTgAXosMkgORMToi7SwmOyLSwow3G198zTTy+wXlSqSpHfuLmqkDUI3yjFhRCDfBeNua2nfZaWimshAdRKvfE6rFSX7INjUquxDxDhsR0frkObhArndyMsLUC8ldrozOUDLD5ejYHI1xJiIvrkeocPG+tJfnJ+dn0Zkpr4jOnjr5sdmMyTbP+Xl0Fh+6rBVnnRnETwPvKE5ciVogIqI1smF4ixv0QEROR8O2iPxYFR056IxseJ4AyWCamvFg8NFGJ/V++AWmLXKC1HORDubUbQhcZLb8wI4qpTw6db9qFzzABE3V7SAws+i8UdW9SrVRV42KeXQa1aDpOhWkr5d2wy3V/YXoNFSb+Ksg7rvMHNlExh6yzyAionV6a6eR5SM3uQJQSieY8orotOsA3olIf+4igYvO8xCpC5GXmPJuXotTVVj7+RstW6qRi85+wc0l0eJFgmyhqO7im4lUd1x0Si4q6Z/kR9FevreZj07J7bSsFvdgTQIbYneu1uP5GhHReo3swdeiLREZIPUkbcqK6LSQOhe5WIrO2MA6FBEF0BJpm6XouKnEXP8XTtVGxyWiENmXlqLjZ5lygcqis329xY1ra+F8dAL3g8ueLl5dG7hJZxI/BRERbTg630SkjrwpX1dEp4bUV5HOUnSGcH6LfALwU+Q7Mnkesv29rCBusvFtSKLZQrgUHduQ5ixdu/NXoeeiUzBBVVVLeXRWXl9bHmm8VpzMBiAiIvrD3v31Jg2FcRx/0gZq+xj6APInLSAw2YaiM7K5sTm8UZPplRcmvv9XIuf0nBY5pdEo6szvc7GtJYPsZt/09Gn7G01LltfeMrNP2pr5dk90SLsri84lGe+Y3xN1dj6j+Pc/zotCr3Q7uvZoxMTEjU5Q3AxnqLM0yXckJjphIIaNjjOnvT86F4PBMkZ0AAAOYsHMJ/S95/pUjD0rs/qF6ETMHNGb/P2K6AxtdMZE9sdXOjrR9k0L3OikujRaYKPT2I6On4r8QnQuL4+JCMtrAACH0OTtpa/J8fEsG3Cuk3bOfJpFJ4vDu5+LDt0yL+nKTF6714aGRWOa9khnWOwIy490HlAemY4bnUhEorDRjWs/sLxWeUhzgUECAIDf7Ix5lGxtrIjmzDwj7YVuypCZhZSrn4zOCfOpPzIngNxBgnbRmFTHpCsixY64JDrNfNsTkbobHbFVCnai0xGJqkemCxiZBgA4iLfMvIpJe8P6wpqQme/yxbfXRC1m/kQbrRcV0bkpiU6bmee2R4WeiGcHzrr5KFo9+xbaRIiXRadWREdv93y7IJeQE526zZYn29Gx+aq+ONTCxaEAAIfR+srM61AV4JI3IiI6NxPRtSPmUTsbYluFRK073hcd9btP3ejQmpXPzoMNpENKTbXGXHSTmvZI1+wY2hp5NjqmJam57EYmbnRIsvf20u3o2FGFitvgGFP1h+M2OAAAB/KFlfOr06M8D6EK0YuPZ7zxzKyq8dF6fcN7ozPnjbMnTnQuWAkpV5yvUfxERGqRak/i2+jIMApko6WjKBtRmEeHOrIRREP1lUqiE6gXxo96ziBBTaRbecNPrd8nwg0/AQAOpj9ia7QgLbph4xMpD1ac2XtOJ/7KihMd/0gVjXYUQ2r1nhgt80Jg9tiU1HbvvdYWY+iVRSe2v5/uREdE8GgDAIC/LrgbZck5j8jofWTl/Zwy0elIbT69Zr79LjofTHQoeqnW4rKBtWvSbJV4TruKALQiUaLY1qjpP9JFaVDGGyc2OolJVk2UZn5qp5sv2umOBbIRtMwLTZHY9gwPcQMA+Ac0+tPlPKBt3Vn/eEIF/+HTCVXy2m2fHGvmUex2TgXBqvt1Ukx01J7Ypy2e81Et36cKft1zsyLy4EcfV53icdUAAPdR3zweoWx9zWGicxgiPY8qxCeDwfLZLE1nz5aDwQnW1gAA7pnm9ZORnn9zBSL1PxqdUCSkaul0YEwxtwYAcO8sWJlSibpI8Cej45kJ7Gpx2l8s+igOAMB9pKKz+kKlxknikSNOkjEdQiNJOgQAAP+xzuPUJwAAAAAAAAAAAAAAAAAA+MauHdswEAMxEDw82J8UK5EC91+DW3id4WymCWIBAgAAAAAAAAAAAAAAAAAA/MWeKwDwypq7fjADABdmtX2SM54CgBeecZJPv3PWKAB4baxu6+zE5gBwZSS7GTqnAODKaabOEjoA3BpZ1ZH4EABw6UmqIykA+LJXxzQAgAAQxPy7RsEPMHFJa6KXpAPAIh0AuqQDwCIdALqkA8AiHQC6pAPAIh0AuqQDwCIdALqkA8AiHQC6ntNREwDSAeBb0gFgkg4AWdIBYJIOAFnSgcO+HfSmCoRRGD6RVMdmKH
Cr00MbojWRBenKtQmLbu7Knf//r9xBQLHJXTYw5jwrIGH9JjPfJyL/FVh01sbgam5MhFayLez++TXCQLL8+AMREXlgvxydDenQK8gMjciydV7ipiItRETkgY0RnaTi1RN6jl4EERF5XGNEpySZmsRlMck1Ou/0VhARkcc1QnReSG7RiEqyQOdMr4SIiATE7mL8EO/slKJzICu0trfHJWkr0kFERIJh8zyvcaf2n+y0olOi41yCliW3C7KAiIgEY5d79X1zvN2EorMmecAPEclZQsYQEZFgHPNGPWxO4zih6OBM8ns5w1BG7oGSNBARkWCc7qrj2uacpnSnA8OLzZPDVUkugRW5h4iIBMOd2sy4wctfN6nowMRsVQu0EvI8A+Za1RERCUzdV6dvTj2tkemGSdmyMzQ+yBTehjxAREQC0lenb87E9nQ6LrP09mjE5JvxLLmBiIiEpKtO35xxo/OC3juZYWhdkoRnOJBARERCUudXNcaMjiVX6JXkJ+4kJB2AlAMLiIhIUOpbc0aNTkra4SqOAZBaO0fnTL4CM5LF8sKSFUREJCy7/OKIcaOT8Xa+VnSjaZYs0HLtadrnrTSzyw8iIhKWr9z7wsjRieiZa3O+4W1JPqPhKjJuZ9ae0EnJN4iIyD927iCnbSAKwPCTMw1JhRG0UjVqhSgbFhUH4Bb2ho03yYL7n6GW1UZIrQs2DWTC921Gmgv8GmneK8zt3d3tm2+Zjovc+3H+4bTuz/tP0avq4erk22nuXUQsH/8eWA0TOwAco31HJy7zTv05BsubvHMdESc538ROnfP3AOAY7T06sbrKg/vzRfx2WefB1ZdfC6Y/xs5Xq6YBjtXeo9Nbrq7X67MqHlucrdcGcgDemX1HBwBEB4CniQ4AxRIdAEaJDgDF+ksX5ASAgegAUC7RAWCM6ABQLtEBYIzoAFAu0QFgjOgAUC7RAWCM6ABQLtEBYIzoAFAu0QFgzOFEpwoAmKSaGZ0uNQEAkzSpiznatA0AmGSb2phjkzx1AJimSWkTs7SpUx0AJmi61MZMDyltG78JAHiWqtmm9BCztQkAJmjjBTZtlwDgWbp2E/9NBQB/iFAaAJ50aP2pAGACxQHgHw6jOxUAzCI5ALyilyZnAcBP9u5gNXUgAMOoRjJDs1CYooUEwb7/S94aisSOkyajd9VzZlEodP3xdxKl4KnsPOpN4ziO4ziFM6qsTlacZmfoAFB0S8/66jwujuwAUNBMfzYTq5Jj4gCwJj353FncHOsGgLrZs7A6kgPAa7OzpDmF5AQAuFPIzoLq5MnZX7r4UAvAHxYf6i77fOwsac74B/tOZgBYk6Bun42d+ejcZs5bjJfk26oBWKhJlxjfflZnvjlfmu/mpA0ArJDG6oyavDp5c67n6j3G3QYAVtnF+L77Vq5O9sGenZ0DwHopdpOlU6jOzwfX3uNlAwCrXfKpM9ec0aehA0CNFD+z93V++e9a6KLn1gCo0MQuzD9LkH34TYjtBgAqtDFk3+82+4pOCG27BYAKbRvKU2c6dG7NSe2wBYAKQ5tCyKZOceiEkNIgOgBUGYaUQihPnXzoHIbzFgAqnIdDPnXmho7oAPBMdPKpU7rSuTbncBYdAKqcz4drdUqXOvcvhgbRAeD56ITJC6J5dKY3OqIDwDPRmd7qNKPSlU4ao9NvAaBCP0YnZZc65aFzFB0AaqNznJk6t6EzjU4vOgBU6fu76DS7uaWTRAeA56OTFiydMEbnKDoA1EfnOEYnlJfO/ZXOSXQAqI3O6f5SJ4/O1+9EB4BXR2fMy3108iudU/+xBYAKH/3px6VOHp1GdAD4H9Fpflk6R9EB4JnoHH9bOqIDwOujU146tyemT3vRAfjHzt2sJg7FYRjnP9sXhlLoeAtdVbosUsgJWDdxY0Mw5ANUev+3MBqPx7aDjGZSmXKe3+pE3b88EA/6js7PX/6d6dOlE15eY3QAAP8+OuH1tdOlE0ZnxOgAAHoZjfzonF86oxsDAKCHm9EZpcPoAACGcEPpAAB6GL50flA6AICvLZ2OH51LS8clyYcHZ166acrcffpt6gwAEJETpfOjZ+m8SqkdzKWZdcpWndfEjhbSiwEAInJ+6fQfHVfrYJLbQbJ7NABARC4ondu+o1NLqmdVXk6ktTOv1FZhAIB4hNG5/Uvp9B+dNIxLMpUy88baqg0AEI/j6HxZ6RTS2PYK6d4f36T6UevEAADRuELpbKSFeUVR+VMjFaWUGQAgGlconVRq3+yzB7UukZ4NABCNK5SOLaU2S+yDXFr5nwMAYnGF0rF8ra37pnAW1FJulklzAwDE4sLSueszOpZP1GlXle25tR7MLJGWBgCIRTc6d2eVTv/RMVcstFdaZ+ZvI1hIuQEAIuFH5wtLx3P5/Flbme3cS3mylUkrAwBEYvjSebODxo9OUE2lqW2lOmq59RMAYjFw6SykzZ8PQbqWkq6B3pkZACAOA5fOy/uboydSZWZl0zjzxvuPnqQm69TSqwEA4jBw6cykZeLPhb/esz7GTKKudKrjxThuInEVDgBEYuDSca00TQ6bo5U/LKv9t7X0ZGar8BabPwMAojBc6YTU0WSV5eVid0htZylp+1Exf5I02y9Tal4uTQ0AEIWhSieYK2hz66SPOlj7G0DvLVhKlQEAYjBY6QT5s9+XOjHPvUzUGVf+CpzMgkZqDAAQgwFLJ0g3ZZkVzt5LijKr+EcOAMStX+kAAPCflA4AAJQOAOBClA4A4LuidAAAp1E6AIDvitIBAJxG6QAAvitKBwB+s18HJwDDABDDuv/UncAQQj4H0hLGNKcDwCqnA0BzOgCscjoANKcDwCqnA0BzOgCscjoANKcDwCqnA0BzOgCscjoANKcDwCqnA0BzOgCsujqdDwAuHJ2O6ADwgtMBoDkdAFY5HQCa0wFgldMBoDkdAFY5HQCa0wFgldMBoDkdAFY5HQCa0wFgldMBoDkdAFY5HQCa0wFgldMBoDkdAFY5HQCa0wH42bnb3qShMIzjd9a4iZcKEyWL06JDeRhQUETBTX3hICqJCUvM/P6fxPacAtGIQdMlnuT/e1PouWlfXrnKAYSKpgMA2I6mAwAIFU0HALAdTQcAEKrraDq17utPr+uRAQBwzU2nNIzlDJoGAMB1Np1KorWlAQBwfU3nOKs58+5xffFC0pUBAHBdTWevLSUH5rQkHRoAANfUdDpSvA6a/qbq7B13x5XIvFr1gdnd3njf9qrVfbOn44+RO98cH9rGy+5R1Tw/V+selSxTrlYj8+5WawYACETBTee99MFW7saKyz6Lpkqd5Uux3h0MXA2qS/2jbK3RsdqVm+mYE/VjpZKRpdxcfabUILv5RMrHnkpTAwAEotimU3NRsnb48LFlxsq9zUNnNtUqdCYNOb22vLGlqlPl5ntm+Zx3y6wnfTFnKV0YACAQxTadutT47Vl97oyv4rygpEd9H3YW99yKhkfnLnDORqedmTSw1Bsp+XCn+TYdPTXzc5PzeqshPTOLzlYVqp2FEAAgEMU2nbHUtl9FibTMV+PIh067to6j1+lxf
yY1HprZzTMpG5n6t7bw7aiuvNFUpDg9DPP4epLdDwAQimKbzuvV1oFoceLdNOuuH4bNpec+dG6tO9DUMhfSZ8u89yMrNenSzyW2qjYls0p+yWWWWQCAUBTbdBbS0HwFyd1xm9iW6+UTFzqJefVV1pxkK5lHUt02HkhtP/fenM8+lGaKyz6C2LwGAOEotumcSs9+Dp1vbq+ZGo6kud+9tgmd/ip0Opbp56FT63yeJQ2l2qs5Z+hDZ+TmS9ntAADBKLbpVKWBObcOUpdS0z0x23i/W+hUpspsDZ2Se77WknoGAAhGsU3HGlLJ1hLpgWs6w3Nvcd7cKXTKsfT9atnpnW4JHbtUXLa2zvgvawAISLFNx15JV7ZyX0p8jrRsbafQ6UmDyFKlbaFzLnXSxaEBAMJRcNO5LenAvGjiv3LpSpfmlI+PazuFznC1r2CxLXT2Y31pZa8BAOEouOnYRGp0LRM9k3ScvTiTLtyZL9LJTqHTyj/xONkWOvZV8UAzAwAEpOCmY/cbkp4t7yy+TiWNLNOU9GrUa6Vnkv2dQueWpEmn129oa+g05a4PAAhI0U3HDt9prb9nzli5RsV2Ch2by5tvDR3LQq1kAICAFNl0vOgikdPezI0HSsVXTywTq735CuiTZRar0Pkk3bbUKMkyahRJ7dWc80jKL3shvTIAQEgKbjreze5ycbv886l689D+Sun0RmR/MM9yCgAQkn9rOv+BAynhRzoAEJadms5/FzpR5WTGf30CQHDCbDpHSg0oOgAQmDCbzpEUT8oGAAhLmE1n/8ZTAwAEJ8ymAwAIUphNBwAQJJoOAGArmg4AIFg0HQDAVjQdAPjBfh3QAAAAIAzq39oEBvgGLSDLdAC4TAeALNMB4DIdALJMB4DLdADIMh0ALtMBIMt0ALhMB4As0wHgMh0AskwHgMt0AMgyHQAu0wEgy3QAuEwHgCzTAeAyHQCyTAeAy3QAyDIdAC7TASDLdAC4TAeALNMB4DIdALJMB4DLdADIMh0ALtMBIMt0ALhMB4As0wHgMh0AskwHGDt325s0FIdh/B9PZGY3bsgElxg7FTBsTFAHyNPwhduSscRkS8z8/p9EDqcTZmBGbYmr1+9dQ9Pw7s6VtAdYidIBANxblA4AYKV/vHSK5bmiAQCQYulUNVc1AABSLJ2q5iYGAEC6pXN2EusYAADpls5zAwAgvdJZPTrlnZLZ1mEnZ14x33nmLNje2TKzZ52CW7h7r/MmZ3Ovvnb2ygYAyIqUSyfS8UZL0hMzy7fk1TfNa0vuuqKpgbOZRz15p4cW5OvyzpkdAMiKlEsn0mlfYXS6ig0/x6PzTsFZSJ/3ig3M26goaDsDAGRC6qUjfXu3f1G0ciT19j936lI7Hh01Px00qpL8A91Yar5tdK8k+dbZ6Uvt7u7Hph8lAEAmJF86e1tBKR6detm8C+mLTbm+IhdGp/nUpibSkZkdSZXZ9UBqhydN/L/YrEhvDACQBel9p9OMR2fTbruW8mF0uuY9DmvUkhrmuUnr2FlJahbNG0mXBgDIgoRKZ/XoNO02dy09DqNTM++VNDbLSdEDm2tIE7v5vWcAgCxIvnSuB0E3fnvNbuQH7f4wkn6MjjNvezY6m9J7W3AhqRJIfQMAZEHqb68dW+DeKpiPzoOF0alJVVtwpAWRAQCyIP3vdCw4kTT5+Hq0O1k6Oh+k059KpzUKLkaHBgDIgrWVTkvqmHe2dHScJGdzDQ4MBYDsWVfplKSmzTSXjo7VpX3zXLVVd1aWhmWbefj8pQEAsmBtpTNUZcem3mr56LyWokJ8g7+eSHVnU5e8vQYAWbGu0rGe1P+4221rxejYuVQ5H12OJR2Y2aNv0unJ4esrSTUDAGTB2kqnNtRMpbdidFxbsaPbZ69p3wAAmZB86WzcHp26xZ7WNdUuDELJXN2MTlGamFd+F0mKjnctKIzljT8bACAbki2du7na3tbdN2wcvHhgc6XCwXOOmAaA7Piz0gEA4B8vHQDAf47SAQCsRukAAO6r3yydbUYHAPB3o7P9i9JhdAAAyY0OpQMAWIbSAQDcV5QOAGA1SgcAcF9ROgCABCReOjlGBwCQzOjk7igdRgf43t7d7DQIRGEYntQYSlNBLC1mEiI1UeLPSiMLNy7qDXj/N2OPIx6CM1aqTTR9n9l5A28+BimA340OSwcA0MPSAQD8W+GlI1g6AIDdLx3jsHQAALtfOoalAwAwf2HpFAYAgC0Uw5eOJToAgK0UdsDSOSE6AICfR+dk49LRnw4lOgCAbaOjPxwaXDpEBwCwg+hsXDop0QEA/CQ66aalQ3QAADuITnjp6OtrpR0ZAAAGG9my9/KaRif08bWJTQ0AAIOldhL49JrvM9PuTYLYZgYAgMEyG8/1jWnRi47vUqdk6gAAhktt6bvS0eh4L3ViW1QGAIBBDgobe650vEtHozNfZFeP1gAAMEBaFNli7qITWDqO/L37fC26f7yxyYEBAOBbRml+vG6O5790hPnq+dr8tHmYXoprcQSgdf58PgXQsVqtLi6ORVklwaETeL7mps5VU2Tl6ozsAH23z09TAJ+TY6MqaZvjv9Jx0fF8CSdtmru7bC3PJyIC8O7w5SUGoMZiNkuEPFzzvLvWjY536jTNsq1OPnEi0gOIuo4BuN68m7nmhIeORkcvddrquOhUUh3NTk/E4ezrkeiMOZx9PkqTo83RGx3PlU5g6kh0qqVmR8ID4E1djwFocDQ50pzQ0PFd6mh1JDpSHZcd1x3aAxAdwLXG1cYVJ9wcvdLxTh2tjkTnIzsqB1DXMwBaG02ONic8dDQ63a0j0RHLdXbWMgCtuk4AdC2ShZs52hzf0HkFClvibshIULoAAAAASUVORK5CYII=) #### `datepicker` editor[​](#datepicker-editor "Direct link to datepicker-editor") Example of date selection using absolute and relative `datepicker` editor: ``` { "absoluteDate": { "title": "Date", "type": "string", "description": "Select absolute date in format YYYY-MM-DD", "editor": "datepicker", "pattern": "^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$" }, "relativeDate": { "title": "Relative date", "type": "string", "description": "Select relative date in format: {number} {unit}", "editor": "datepicker", "dateType": "relative", "pattern": "^(\\d+)\\s*(day|week|month|year)s?$" }, "anyDate": { "title": "Any date", "type": "string", "description": "Select date in 
format YYYY-MM-DD or {number} {unit}", "editor": "datepicker", "dateType": "absoluteOrRelative", "pattern": "^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$|^(\\d+)\\s*(day|week|month|year)s?$" } } ``` The `absoluteDate` property renders a date picker that allows selection of a specific date and returns string value in `YYYY-MM-DD` format. Validation is ensured thanks to `pattern` field. In this case the `dateType` property is omitted, as it defaults to `"absolute"`. ![Apify Actor input schema - country input](/assets/images/input-schema-date-absolute-8dc55da08b38b38bcf29b531d7c11e5d.png) The `relativeDate` property renders an input field that enables the user to choose the relative date and returns the value in `{number} {unit}` format, for example `"2 days"`. The `dateType` parameter is set to `"relative"` to restrict input to relative dates only. ![Apify Actor input schema - country input](/assets/images/input-schema-date-relative-d22597d97577947def822c41950040c3.png) The `anyDate` property renders a date picker that accepts both absolute and relative dates. The Actor author is responsible for parsing and interpreting the selected date format. ![Apify Actor input schema - country input](/assets/images/input-schema-date-both-7331f7016c1256412af2ccfa05c96aa4.png) #### `fileupload` editor[​](#fileupload-editor "Direct link to fileupload-editor") The `fileupload` editor enables users to specify a file as input. The input is passed to the Actor as a string. It is the Actor author's responsibility to interpret this string, including validating its existence and format. ![Apify Actor input schema - fileupload input](/assets/images/input-schema-fileupload-input-f11ad6ffde04eb7892fd96c5cfc68002.png) The user provides either a URL or uploads the file to a key-value store (existing or new). ![Apify Actor input schema - fileupload input options](/assets/images/input-schema-fileupload-modal-fe1ec6bbc2752b95f5848b0a03f7e32e.png) Properties: | Property | Value | Required | Description | | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `editor` | One of:- `textfield`- `textarea`- `javascript`- `python`- `select`- `datepicker`- `fileupload`- `hidden` | Yes | Visual editor used for the input field. | | `pattern` | String | No | Regular expression that will be used to validate the input. If validation fails, the Actor will not run. | | `minLength` | Integer | No | Minimum length of the string. 
| | `maxLength` | Integer | No | Maximum length of the string. | | `enum` | \[String] | Required if `editor` is `select` | Using this field, you can limit values to the given array of strings. Input will be displayed as select box. | | `enumTitles` | \[String] | No | Titles for the `enum` keys described. | | `nullable` | Boolean | No | Specifies whether `null` is an allowed value. | | `isSecret` | Boolean | No | Specifies whether the input field will be stored encrypted. Only available with `textfield`, `textarea` and `hidden` editors. | | `dateType` | One of - `absolute`- `relative`- `absoluteOrRelative` | No | This property, which is only available with `datepicker` editor, specifies what date format should visual editor accept (The JSON editor accepts any string without validation.).- `absolute` value enables date input in `YYYY-MM-DD` format. To parse returned string regex like this can be used: `^(\d{4})-(0[1-9]\|1[0-2])-(0[1-9]\|[12]\d\|3[01])$`.- `relative` value enables relative date input in`{number} {unit}` format.Supported units are: days, weeks, months, years.The input is passed to the Actor as plain text (e.g., "3 weeks"). To parse it, regex like this can be used: `^(\d+)\s*(day\|week\|month\|year)s?$`.- `absoluteOrRelative` value enables both absolute and relative formats and user can switch between them. It's up to Actor author to parse a determine actual used format - regexes above can be used to check whether the returned string match one of them.Defaults to `absolute`. | Regex escape When using escape characters `\` for the regular expression in the `pattern` field, be sure to escape them to avoid invalid JSON issues. For example, the regular expression `https:\/\/(www\.)?apify\.com\/.+` would become `https:\\/\\/(www\\.)?apify\\.com\\/.+`. #### Advanced date and time handling[​](#advanced-date-and-time-handling "Direct link to Advanced date and time handling") While the `datepicker` editor doesn't support setting time values visually, you can allow users to handle more complex datetime formats and pass them via JSON. The following regex allows users to optionally extend the date with full ISO datetime format or pass `hours` and `minutes` as a relative date: `"pattern": "^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])(T[0-2]\\d:[0-5]\\d(:[0-5]\\d)?(\\.\\d+)?Z?)?$|^(\\d+)\\s*(minute|hour|day|week|month|year)s?$"` When implementing time-based fields, make sure to explain to your users through the description that the time values should be provided in UTC. This helps prevent timezone-related issues. ### Boolean[​](#boolean "Direct link to Boolean") Example options with group caption: ``` { "verboseLog": { "title": "Verbose log", "type": "boolean", "description": "Debug messages will be included in the log.", "default": true, "groupCaption": "Options", "groupDescription": "Various options for this Actor" }, "lightspeed": { "title": "Lightspeed", "type": "boolean", "description": "If checked then actors runs at the speed of light.", "prefill": true } } ``` Rendered input: ![Apify Actor input schema options](/assets/images/input-schema-options-7eae6d4ce07ec3adc9e10890f952db0f.png) Properties: | Property | Value | Required | Description | | ------------------ | ----------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------ | | `editor` | One of - `checkbox`- `hidden` | No | Visual editor used for the input field. 
| | `groupCaption` | String | No | If you want to groupmultiple checkboxes together,add this option to the firstof the group. | | `groupDescription` | String | No | Description displayed as help textdisplayed of group title. | | `nullable` | Boolean | No | Specifies whether null isan allowed value. | ### Integer[​](#integer "Direct link to Integer") Example: ``` { "title": "Memory", "type": "integer", "description": "Select memory in megabytes", "default": 64, "maximum": 1024, "unit": "MB" } ``` Rendered input: ![Apify Actor input schema memory](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoQAAACmCAMAAAB0km5pAAAAsVBMVEX///+zs7Pd3d0RGBzMzMzq6utubm7w8PDj4+OOjo7e3t78/Pz4+Pj39vfa2tr6+vr+/v7t7e3R0tLb3Nzu7u/l5eXh4uLV1tbg4ODn5+ianZ709PSGiIlhZmmpqarz8/PHycry8vK+vr6hoqNzd3nGx8cdJCewsLGoq6yVlZYmLDDOzs5QVVi1uLlFS06/wMF9gYNVWl0yODwrMTWQk5VcYWNscHLCxMVma204PkFtbW1KxWY8AAAKyUlEQVR42uzBgQAAAACAoP2pF6kCAAAAAAAAAAAAAAAAAACYfTNdbhMGAvBamOGQZEF9Yjs+QhLnjpPm6vu/WBethBzs0CSdMm2H70eQFqj/fLOrlejfiFIJfI6Z/I0fShSHXyLCq9PTq1BAS5MoKWUCDo5zDg0gGPPhU2T0xmdRjA3x4jOmoJ5wExk2IbQ0x6vneTfgmON8Cg0QsviDsnPzXMzY6MsS0qUOsYii/LQfhv3TPIoWbTZsjiOU7t5NxVlDEq4ZG8CH8Blb0uDrmdC8Pqv7mYto4ZeTRXThQ0sDkITIBCwdryEJY8b4JyUEIeCrEtJ1UuvgGHYYtxY2Bkk4B8t1QxJKMuPDEn4dV4dDxt61WFxEfXhDP7poK3IzkISXCojY25UweFmcv6ztrX5fgVjlqxkgvc55j4OFX3XymzEQot+PAdKb86Hq9/uCgkG/P6mYtS4FCNI0EHYipY4MZsZWie74ulniroWS+Mpa2YkUAGJAAULJQbpUFQkDxlJ4h4XJg8HxYjMWJhcuoKUBSELXmsx3JFRbTzPn2jMcfYu0sXcQ64R5dgdEfukV/KD5CofLW/yzmeCfBWhOPO8BdpiU1ZiHTDPiZbrKdCCWVLYJH2DAWEDWTSiW6VmC78rYPESRjGkm6o2E6v3GJowWtigjU9+I2fbIDUASPpWtiTjDoZEwefYMWyvh1gRW3+2AHPQsKyuhVnsDqOITmYaaPoJDoGNGmBEzjBIjofUuVoclFBSzFiY4cE+R4YZYlBIaoRUcZBPRm9OoK4IbY2QYbaClCVCXBTo1sm3J0YmRMCra5un03qM517nv5PHaK7jc6sERIDfF/GR6dIb3xyQh8vD8vNJ6LgHp4iAGx4zEMNL5UvooSEhzxJ/JNCYtlRgylgqRlBLy4s5AznwjXcKQcClkoTOnsstSqdYTuq92f0vCIUSUG+10ScijAOgKLU2AEp4/2tYExbozEq5sCsS5lxkJv+H1uTCsCByhewCwxvn3AAfjoiJbCakypzjqmDJ/Cw63PFszs3EicRAYCQcu4BqTUsKMmR3GFJ9QWkLyN4np38riODDpdmIldAvRw9X4lAa0hD02dfi0rcdNQBJmpjWJUS9uJMTLLQdEfccASTi3SW1hMyDqssHLFa2hcNQjCRdAYKV/BeTB83LYITWCwZCyFYVCknBYBvxDEhrV6NmUJBRWskFlGyjZlXBA9/fputaYinJgGuQutDRAIWHhyg3lqwswEv7wvJNAg+nxmSQ8ttltbDtptGSLiRA0Awyck4RDQEjUMwXQw1Ba3Xaxnki3TCSxgjIwOiChQueglCojCWGv/eVKSOYkrGmPqykv2EQLlyBbGkBLeIfLP2pLfCshjh33pju2Ek4AGZGER+7Epeg+SEJ3AKMr8yM1KI7Musdsl0wu0brNBfYlpCJLSJ08kzKwNpLxlDruqoQzyq31Eh5HUS5aCZuEJORFa6LbEjAScm+X2xoJrzFlAnGrK7aTkKr6ib7ReVdCMJAyqI9yAV6V0JjnciVt0byRkPZrDki4/IiE02mR59ty3CAkIejWBHV6gZ1MeN/RnHc6qxoJXz3vGhB656IiIU4u+RAjEg6X49I57jJhGYgPZkIKk3TDAxLGRcMdzKT4QDmuTXnHbWPSDCRh0Zr0irYEdtaEWyipkXCLyz5jFgY2FQkTzLHdXK8qLc4nZFR2GbJcEw7KwHBPQpKVl61Lti9h4Z56rzEJardoHO0WTXOQhNTGonYXUEq4RSUpJfXu7uIaCY/LA5eIOhInoQnOKcXuEtjC6JfFdYhGkYSTxAZS1/M6h0b0Lm0YrvclDKx0e42Jj8bXbVY72s3q5nAS3nkFvpPwG06fOA6usMjmNRKKS0yFfSAbr6EqYUyb2xzeMLPuSeY2YpgkCcmZlJEy5KmVkC5safe5k30JpdFYxFUJUV9Zc2xHbArz2mO7RiEJdWvivYKTEPLCndf5E16+yxoJUVfk9vmheDCzEjpu7Wc6DrfeI6XiLJvgJSWzdGBYBDKTNfWhiqumQwyMsmFM2h5eE2Yp3q9KSPO6DxiQbheg/YChWUhC3Zp4p05CZO4ZMM/VSQgv5YMx7EuYe8gYKozKhsR3B8FGwowRoRE2rn7AEDLDGg5JGDBiVFkTCsbC9lOuv5FX3Uzo1uQhMRLm1FPkD9qs+UCrYFUKcECnvOUO9AsdKB/1oCKhlfYHVEnd4cVgxJDJAKxgfDnR2nEglB/HhYSB7VASP9aSzkATWwln9sSPDqCTUJdrThLSb7Yftf5rJNnVNz+BDxCMu70ZHKRX/UrWbfFZEqU4gJMQI0JBLZyrpPb+gfdRbfXBz/vD9vP+/4p7TKZL2CPcN8JJ+Aegatz+R6ef7N1Ba8JAEIbhGbCurUnURE2KgdIFT7W3nvz/P6yaXtqiGAUTZ/I+R08fsx+JC7s6QJ/Px33J1+lKvHZbwtWh9lz5HKBds3N+O3OLOOmyhGmbu3rJZLTdjmigK7vj+deZnDJdV4tT3ayqudzDU7WeCwZoUiwFAAAAAAAAAAAAAAAAeHzLPCqcibmp40u5wqVczNiqloHTnp1SlXvblKpmflIs1xgEl5groUiIVp6FS1U62ILBEkpQtfG9MNdScJnFEkpp5FEYeRC2YrKEG41igSp7kjZMllBUxQIjMZ2hhBZjOkMJL
cZ0hhJajOkMJbQY05kbpp7Wdep1dY3EdOaGqWfjceZ1dY3EdIYSWozpDCW0GNMZSmgxpjOU0GJMZyihxZjOXDP1tGjE/T4WjdTd6hqJ6cw1U6/H/9TuVtdITGcoocWYzlz1Os4ax9dx1uB1DDYmjzAOUMLexwFK2Ps4QAl7HwcoYT/jeJ8U2XqRyB9JCIH/wqSE3YxjVo5+rOSX6cfhkxfBN3t3sJsgEIVR+LY5BFNGLFPK2LBouvb9369ioJp2ZRdm7s1/FsryBr4ARjLoodYH7I5DZqu3a28IYd03QZEQFmCXunn5bm1rRAgrxxEI4RfQbOe+YmtdhiKEVeMIhLD/oXco5QfhC5RWCKvGEQghsLfffQJJCOvGEQfhHvL5c9h93K6lmeHdhLBuHHEQjtDvC0unwdZ2i0wh/GdPDyoOwgYKW6vCBOyFUAgfhfCdpedmnDLQbr9VjiaEtV8mgyFs7FzKkO3cALkTwupxxEH4AUx2aQaS2WuGwYSwehxxEA7AvG6f4MtsgmJCWD+OOAhn4HD9A2+8UCQvXTZ8vTqoioTw3jFfgfa6OV8Q3taYqhNHHITWw9v1ytyZlbyGzoRV4wiEMAFHO9fePMule0IHOAIhtAko09QDpySEfnBEQtj1rOVkQugHRySEZs2JpZL+PEpzNFUtjlgIzdI4tgdTrnBEQ6gc7nUnR9fJmMESQo9jBksIPY4ZLCH0OGawhNDjmMG6c83qLa1ZrbRc8JYQBkgIPY4ZrDvXrN7SmtXK3153cnSdjBksIfQ4ZrCE0OOYwRJCj2MGSwg9jhksIfQ45nc7d4zCQAxDUdCq3AYTcLGdYe9/xSRtUjjtNzMnUPEQqNFhRJg45mFEmDjmYUT4NeajsRUaR0iEq3pjKzOOq1ZLMGo29iIjnDVagmdZhf9IjLBXhXxoGbVUuBcYYV8hi/DtrprddbIRF+E1q+4WYxRHitmDH8+xisMsH/sAAAAAAAAAAAAAAAAAAADgxwspTq1l8tBl8QAAAABJRU5ErkJggg==) Properties: | Property | Value | Required | Description | | ---------- | ---------------------------------- | -------- | ----------------------------------------------------------------------------- | | `editor` | One of: - `number`- `hidden` | No | Visual editor used for input field. | | `maximum` | Integer | No | Maximum allowed value. | | `minimum` | Integer | No | Minimum allowed value. | | `unit` | String | No | Unit displayed next to the field in UI,for example *second*, *MB*, etc. | | `nullable` | Boolean | No | Specifies whether null is an allowed value. | ### Object[​](#object "Direct link to Object") Example of proxy configuration: ``` { "title": "Proxy configuration", "type": "object", "description": "Select proxies to be used by your crawler.", "prefill": { "useApifyProxy": true }, "editor": "proxy" } ``` Rendered input: ![Apify Actor input schema proxy](/assets/images/input-schema-proxy-5407b712ef88623fe43dff44904786d1.png) The object where the proxy configuration is stored has the following structure: ``` { // Indicates whether Apify Proxy was selected. "useApifyProxy": Boolean, // Array of Apify Proxy groups. Is missing or null if // Apify Proxy's automatic mode was selected // or if proxies are not used. "apifyProxyGroups": String[], // Array of custom proxy URLs. // Is missing or null if custom proxies were not used. 
"proxyUrls": String[], } ``` Example of a black box object: ``` { "title": "User object", "type": "object", "description": "Enter object representing user", "prefill": { "name": "John Doe", "email": "janedoe@gmail.com" }, "editor": "json" } ``` Rendered input: ![Apify Actor input schema user object](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABdQAAAFYCAMAAABtUo5/AAAByFBMVEX////MzMxWmTsRFxzd3d2goKDl7+W5ubne3t4ahP8Uf/+FtnJXmj6KuXf+/v77+/xZmz1bnEBZmz/4+Pja6dR2rWDj4+OOjo7q6upXmTzt9Oq/2bRhoEmUv4Jsp1Xl5eWRvX/f39+Munr29vZzq13s7Oz3+vZcnUJkoUvn5+e71rCAs2ybw4twqVrl7+Hy8vMkeJNfn0VppVFPVFdmo0631KufxpD6/Pra3Ntzd3nz+PHS09Pq8ubh4eHFxsZ8sWj09PSy0KWYwYiCtG55r2RVWl1vqFf9/fxjoUrT5ModJCfu7u7K38KztbaOu3z6+vqpqquGi4xARkmwsbItfpgzOT3h7dvd6trN4cWNkZKbnJyjx5SBhoclLC/v9uyio6SmyZiIt3VbYGJITlC/wcLx8fJ+sWoih//X59GWmpvw8PDv8PDG3b2qra7o6OmpzJwrMTX39/fm8OOcn6GUlZXV1dbO0NHV5c59gYNlamzi7N+uzaFrcHLB2rlgZWg5P0LD27qkp6lZX2C7v77Q2tLIycrJ08ulyNO4vLuiq6W1uLnCy8RrpLewubI2hJ23ubseJCfX5uvJ3uSz0dp4rL2gxdBOk6kpMDT3+iLXAAAbFUlEQVR42uzBgQAAAACAoP2pF6kCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABm74xZ2waiAHzhDup2kAaBpM0IocGThy4KJiZLMNgYOTEZamoMsU1xQoJN4uBCSHFLoGTpX+7T3ZNOquzEcRvqlvcN1t3Te+dOX19Op4QgCIIgCOI/xL/8ZrLtaZgvrZ5VDbYFphmXGZt8nWXv1et7tsUIgiB2hcj3I20p3zfZH6HCOQ9ykSHnx2xrDE8csBdRFVDxckwhynBxhYjY09hdB+najCAIYjc443w/nbzn/CP7I5RB6m42cAGBL2xrQLL+pp22ulaEEObWUsfLeqyO4wzqoW2H9YHjdKhbJwhiN3htqeu1N+zUo2m/f7tKtRvbv/q7nTquc/jU19w4HTeddJwblxEEQWzGp7fIp2Lk35H6rD1qsA2woLJeWA4Vu7HUActiW0odr96TTq9l5zWyOkEQG/P2HfK2GPl3pI5sJXVDiBJ7kdSR7aXObCEstgbrxgnzkdC5oR0YgiB2WOrl78uHFkuJHrrLbxWcVMJwzMxa51teh7fdweM8J3X/9vIx1EUXOLyoLzv1aqZyr7t8UJ24H86hchmGOTMfCnGVCnUcBGNLP9mVkRb28TMfXOyqx73yU+KfB0E1nfgWlLSCaqS/38dpVurnQgRsDR3s08ejTnduYa/eYQRBEJtKfdX2yytKvdvkMSeJkQd9Ob9vJWmTDmRMGZJJ+TJKpX41kKsM1SJHnH+Qg2jBJR8bWHmtKk/mMG5z5JplqKTtt2kLyYGZttMVGfB8BpSEwgVvC1FRovdkCGcG1M5U2hVTkYqQeFFO6tH6PXnb6SSbMEDPRdHTGRiCIDaV+tuV8VeT+oInKEUvOTKcYdpnzvNSH/CEh0TqHzFwepGVunHHkQU6nSc8rpN6KTnIYhwI5MBAqXsYKJmrpW5hDKeG0CXKxp5AShZKPf+lRbqOquw5e9b4DRredrqMIAhiF6UeG/lo9DCZQvvMgFHcgR8vY5HfmyoNONpfMETJuL9//bnJeXOOS4DOF/v9ONPMSP04DvQmMOU9aUhZ2fvKAZeVJjdw3Z9M5kxjplvqNoxc33dBxLaaA1eHfuApzZtWWYjAsoxU6mZ8pzU7vEKJGwKwq5Z/gEcezwVUzKKqBwvlpG6vO0VpOQPU+BsGDJwxU1dGEASxi1K/Bh0bcK2BX0PGfDD10EL/dlHqZznhVSEyvYDBHI+jl5NWvHEGg7aW+gPHnwb2YVABp8aVLZiXYDBZ/aA0QoWzqsBTMD4MzlHqrSQwzj0oRalXhPBMBgSQEcVSx8UMT61V8bxzzPe01PVKRWwH/32hxYAR7rvUaf+FIIjdlPoxSN3Kd+GBHN1Bq67SpiZLQd3fpjs1oZL6UOZcTDk/Rqkrl5/IeDSVeyyXaeWk3z9bLXU/2RAvy24bFW0rqZfTgLtK6iWlbpUbKKlbibRbLANI3shKvYX3C+zlj770sFMPnT1GEASxg1L/wcG8ywpDFnEXLvkAI5mma3QKqhBqO0rqS5gqjR9pqX+BWrUYrHIn706fO9JYzWx/+yxRdkmJepwGDgpSl44WTNGSAQMD+eMtDdOaiV+kPtb3NYWWfNx1OrqBJwiC2Ojlo0/r4q8idX/IY6Z3dZMB9zyLIdMmLAVTvuKwz/mNkvpe+gi1r6Xe5BmOZOXZM1LX+tVv/qObQeozHShIXX56ut+3ZR4GqriqGdgCQanrc5TPSn3kOAOLpE4QxM5wJvtlZIj73dU7rjgNpeqzWAWp53v3E1hDSR3l+F0WodQbPMsQK5+R+jjp1FHcODSk1CMdaBSkjibP9PIGfGqp44nHgtTxvvus1Hu9GnzS9gtBELvC58zv2TKbnB+jRy/vTjnQDGQKb0u6l+1H7NQLq7zHYVPeLGs1X3M+zXXqR21Jp91+wMqnpa575lLqcDPt1P00UFrfqaOkyyuk7kGGOz6cWVrqxe0Z5MmWfEQPSgmC+PvEh8nHOJ7jRjhSkzaXG+YnTFOUukxpqpELNZdK6tfpnvp9bk99UahUGA1jtdRniWgP0qeePqhYSb2VBspr9tRNJgkgUJS6lSaUig9Kz5880qihI40EQewKbTxiCJixxGuxzWs16c6oz7mj9k9Kytij0e1KqX9Trw4BH2BUVlI/lXb1mmBzLfUFxFVzHY5GnqrsJno/QqlfshxmImI33Uwpg6Hx5SMjCQSY0spInR0kmyimB7ovSn2cSNz/tVO/EuLwqZePNPTyEUEQO4MFzuXHUWy1O44t+Snnwwj/0kVdpZy6sTBPOP+8UuoW6L8Zot3fJ+fU70GKAdTwkZZ6DaZnJgxuYdWBquRzmNfVsRlm4CtOGn1iZSb0wUXhS6mjgwMYHGrva6m3hHS5TC0ZRanP8L8Fy8tKHX8smK3/NQFINzY5/ZoAgiB+snc/L22DcQDGvy/JIbukByHrrbyEHnLqwUulVHoRwSJd620ggq0IG8rAtSCIssHO+5f3Jnnja2apI0NJ4/M5+CNNXm+P4W36vvVx7KV27+bZd1V85H82vZoemVJn
/fWMg+mdZwzWRd2eMs/m4b9d26gb82zUPbFRL1YFOL2dnqdnxrby3mzvwXw9isWYpT/dL+QJnT1bbhOtkyQNcDcvtREknfRAUsyE+/o6tlE3OubAMOkE9t/Aszn1IB2gm71ejnr6++YFvQylRFjQC0CdfPacE0nF5551urCzK4VjN6detvCso0Bs1M/skfnERd2YPp7ZL19p98a4Kq/9Ym+/L4qfrERs1BN7IJLMoS7WfnHnWD1ZF/ULPzfU5aib6yOW3gWwhQYHXu4+EOvHQ1bZ23aR7F0vdWsXXHTNdfbtKUu3R+mvmWfsdaUU9fAmf65mOnaDl/7Y75826uV3SnPjoW/osRTBbvXyjB9KbtIOAhv1fMCwHWTRH0kmKKI+KlYYiLLrw8gPnka9a15mkwwAj7bkw0eZ7uLT2dXiQp6Y9Felpo2XatiSjb4uVX8kJV+WwY57fv275MJktWqHpcFXpSZOem4c92SKFbZaJuAu6ubITuulfUsn4ebX5Rlthv7H7ewitrMDUJdlAt5KYh9qqcTNvzgu6q/Azr68uPG0H0U+G08DeHdRH+56xlKqCrU/fNuod3x/IhtFJx+tE557AfC+oq48406q6/n++C2jHtsZ+Y12IrW/ryg6gPcY9YebifyHjl4Xz0TrUF5DWw9DAYAcUf9La3QoAICNG09vT9QBAC7q1ofyEaIOAFuIqANAg1x+sC7XHAEAAAAAAAAAAAAAAAAAAAAAAAAAAABeMtIDBQCojYEeSWVaAQBqRktFfaWimP0gAKA2wjhSql/1Pn0QCwCgVuJBtXv1kVI0HQBqJ1aqyry6ZttkAKijqNKt+oAbdQCoo1gNKlylFO+RAkANhUpViroAAGqIqANAgxB1AGgQog4ADULUAaBBiDoANAhRB4AGIeoA0CBEHQAahKgDQIMQdQD4w97Z+KZVRQH85LzoTHggICQ+WghBBKRQBfnQ2oS0RdpqrVZdYTJr65Jmms4av6qt1UXN1KjzW/9dz/2g5Y7H+lpW167nl4wNHu/ec07v/V12edAHCJY6wzDMAwRLnWEY5gGCpc4wDPMAcQ6lfskAGOZMwOOS+T/47K1+njRgqTMMj0vmnPHIQ/08MK/UHz6AJw9zVuBxybjBUufJw5xTeFwybrDUefIw5xQel4wbLHWePMw5hccl4wZLnScPc07hccm4wVL3PnluPX/t8ttv8eRhzgiXLl19++tnP2mz1BkDlrpXqT/3kkW8R5Pnpt8/BjDp96/DCNy1nWN1Ems2t+Des+33r8o49j0e9E5mbm4JFJstojgFEp1u0p8Eg6EHfc0unJStZnMNTg0zzbVii5gAgxGHDI3L5y3iXZY6cycXQOqv/nX79j8jSv2KZa181xaTx4dI03MMcRxGwGjHw8FWVrIHA0QQC+CNFLVQavm2A3A0OcSqisPjQe9MIsZBEULBRyDR6TqIYDD0YBCLcFIKiBE4Ncw0d1EQAoMRhwyNy932T89a1ncsdaafiyD1p/96hnhzRKk/b71Nt/dP6iWUzI4k9SgqSp2zIvVaMpk401KfqvpbzUJstDQzyWQye++lTgPyB8v6gaXO9HERpP6HUProUn/Xev2+Sj3ZaDQq90Dq0WgliIh791PqU9HoGByQPLnU/Y31U5b6MkqCMyOnmTgVqd+yrB9Z6kwfF0Hqfz7zzO2/3xxZ6t9Y1+6f1DW1eyD1RYDMJmL2fkpdM7rUiVOV+lQCMdRKNW3EHTgRpy3171jqjMlFkPoft/8EOD2pT6VSi6vxRGVzFSS78XQlF52VmlpLpbbSFf98tVhJgSTQbVTKvo67tx+7cePGt96kPpks5tLJ6QOp90UwkyrATCNXrA2TOpHorQP1QrSS2OiAZjoezRWXp08g9cVUKxctzAMxm9rfzDVWrzdzzZisUDdaXkgv9zpZigt2vEn9ByrJo+4HqylBFRRmJ4VUdaqbzh2+UK76ypXGTO+58WJuYktL3aylmQm0EH2yLlmsyFJvlMupdj7e9ZymN6lnao1KJboNir3NcnlCJGZ2wlJnvPHgS10yutSfdZG6VqrPRsLuKO8q7DwAXEdJ2aGbORAPlFCSdPX2B5Zlve5J6jM2CoJ5lwgSGN5EQXS41MdQS0rF48S1+lSzdvXYUl9GSaktN7oF2RDdLMjwNHGdmbrjTeqXqSTvux8soqAIxEAnQUwsoKCmLF5ESToDRDurzlRSN2tpZtJB7PY6d+g2L58armDYc5qepL4fQoF+ldBCSTQz0Im71N9iqTP9sNQ9Sn3Xsr4ZInWi3CRRJJShnHLTl0Cyp5S6kxAioBsxo6dtOmkzSrfrI0k9IrpsVKj1ycEIRN+hqDBDZ6jUwZHxBEJkr2aLPLktnS4inWtRs+1+b8/BHQwenEO00xP0YDYgRVQOI2KOWqf4dshHUR8lrRekvVarfC+kXkun046SutmJduF4s0KqngeCBB9qNrJ6nStR0o0yEhGzlgOZlKVJ96KVxpJNT/6I+i5vjguvg6c0PUp9SUbbyKknFKn5aJRabw504iL1tmV9xVJn+mGpe5T6Dcv6XE+eeixWB4BYLNMT7B5AhpQREGJMTgGRJ4FIqSehgOiHdcQ16dQkAEwGMZwx2hmQunFwQOpF9fcyYnMwApL6OMXQRdwcLvUw2qq9Fp2Sd7AkFJ9FZ0zZMwHUXiwW0HEMYh5sO5jdBYANmV8Qy7CKWIE1uXhNb7ZBLWhZ0EwOkbpOdy22Zkjd5aB59YvRiZZ6Q6VbBVmKhKh3Ra5zMyq9uJK6WUszk2l5ehQJ0nkeGojLAJAWUveUpkepj+vn7RfVTzM7TeeGxCpjdmKOCv0/yH+tl9osdaYflvrRUn/lysoXlnXZ/XrgCGIOiBRiGw7J4riUeh5uIs5ARxgkhlgGgU/YwsCQuomL1Mk8h2o2IlBSHwPCxtZwqZfkIrOAOKm3jpcA1qWziBw6ATgOUXE+kbFF1kGcEyE2RAizxpOOkrqJKXUTU+oDnQTRFmvrkqpXFoMZ/Vp/TtYnooqHkYFaGpmMYVhWpZIUYs+DTfeJXSF1D2l6lfqYGhaaOb3lM4uYcu/E3Bb80bI+fH5lhaXOHMBSP1rqNyzi892H3aUuTSEn/756YHaONgaCmJVS34KOkGxbyCKPWIqmiYSYsIMEnnrqqZc9SH0VcbMnkCkzAiV1vazkhkt9XOo8TObSZ67LpSYhwysps3lnAZFOJKQZg+gXDvJR+yrNtdpGVLaaObbUn6SSgBepm50EcQGIKcQN6W1bhSftWcJsrxaRgVoamRREeUqyRimSeoaO6RUxDB7S9Cr1uLHEFxHndezRgU5cv77iRfFagKXOHMBS9/BKfeXqizRvnnOXup65M4h59TaXQkt9ESJCkdPC7HE8pAknQEs9j5gEwQY1PRBBAsNaPQvDpR5CR9ouAYKbiClx4iF7cBzCeEidRJTUDgqhT0Rpo2bNo9RNPEnd7CSoE5PLXQQPKQlF5nrFiwzU0shkDluiRkm1juY7SOkI5BulHtL0KPUm4mT/AmmDxMHyQCcuUv/lJeulL1eustS
ZA1jq3vbUr1jW1WFSL/QptYxop32pVAhDA1Inb6T9mvwIUt8WXWodVo0IPEtdvnGbQe3ERcQJ4Sr094jAcXAw5O8RuFNEy+Ld4olUipqPnZrUzU7041rqNxEX/JqaCLbS6zYyUEsjkzQ2YQ1xSxUsr5Y+oqyl7j1Nd6mbr801pd7TbBz3InXaoLrFe+pMPyx1b1J/jq5+8SB1upeTM7TkIvUuYhdGYRaxpraK/b3918iJpB5BedDWfquSffQGxInIYhY0g7YLY3BV9n2aUjc7MaUeM6/uDOF4X/HMWpqZ+DBBmpcmb1NtpxEbeu8qfII0DanP1+v1QK8b45NNFbRBgpjwIPVdy3qRr35h+mCpe5M6cc265kHq5O0ZIDKOi9TV/9+H88TKyso7cBc21O7rGmJaewLnTyT1pookqz0Tl1EvixjvwpNXrlx5Bdwoo10fJvVYb9M6N8x2BRm3Ox9TSR47WupmJ6bU1YbLwP6GfsPUrKWZSRdLvT31DRGjo9qZtzF8gjRb6Mz3rSyyX90NFowf0a6+SHPTg9TpOvX3WOqMAUvd6ydKPUl9W19B4kcXqddttDMgycCxrn6JXgdBpXfhhmon5mAWTiL1bUddHZLQOi1JkSwhluEu4b1A4X0DbiyrrPWZpoimdKtbOMx2eUTfaFe/GJ0MSr2MuHWYV0Otu2uOLKZZSzOTJaH/PcRxfxpFpSqqejWkCh8/zTnEMeNdiCIo2oghpftptcD6QIXZvYvU+ROlzDAugtR///777+kbvej291OXOt0L3RQz003qUFDfjzhfTdjHk3rQ8W3N16nVcEBvWORiMD1OXR9b6oWur4JaXUvUYAfqUS2YIv1NXpmqjVeOJ/V6CB1/nSzmD/ld9iVwuw75oLvtlE9tfyQyfXKpG50MSn0HMbgNAJ2mTXJfpaQXYW0BpdTNWpqZZORu1wQStkO1HaN29iZrjiF1z2lWEbPdSGTKlLr2d+k6Kb0ZEmtJEJ11uXKEAix15v5zJqX+8zM9fh5B6l9alz1IHRLKAI7tInV9VEzo40odNbMgyGQRUTxWChxX6gq7AJK0+Lf4cx2Ij0Lin+K+q9Q/o/DeBVfy4qSwuBmUelw3GlK2K+ABicNdJSI3gtTNTkyp64/+O2EH1Sv2hioeETFrOZBJGsN1eqiR8C3W/JMH5QtRhU+QZg4FvkGpT5V65wJR0HdwHTxI/SfL+oWlzvTDUvcq9auWdcuD1GMt9U2tZczKYzvQRuwoqRMFGwVZH7jw+FCpx0vKxeT0/i8zSWdcImhh6O5SD1Uau6BJOUiMr4Ik03BQsNAFF96h8F4Dd1YrKHASW0JEcSGi5Z6IfKJRe2NZ2W4WDygfpJe9m9SfPlLqA52E9eM9h45lURBMTx8sIqUkYtuo5WAmH9nYgH6SWTqxJt7yPEGamWZYSd3cUyfm5blo+0EwE0QinAezE3epr1jWTyx1pp+LIPUR0VL/lvRy7Zt3jp487b1uG4azuler7sKxmczXtjt1OCAzVtvPwOgEOrVqrO/uznp3PwaufGVZlx+FYWS2attL7hHFqrUOnAYZxKjHTmL5wszOwXn57iq419LMxI9YMbeFpqbEclm552leX+9u1UETWd9uexmXH7/4Nf1MdlnqjAlL3YvUiStvk9bfu7iTZ/43y3oDzhQziH44TQJziHbZt7e12Dl0+6pYSs4A6neUvv4t/45SZgCWugepE7u3vv3ljYs7eT61rK8fhTPD5M5OIYS4D6dLPtz3EeBE9GYAdkqIO3AGuHTp1i8/0RcSsdSZO2Gpe5L6hZ88b/z2269wdhjX+9WnTaaaSmTDdngDAEr6feUJOAvwuGSGwVLnyXMOySFiqFGH/xNfCInsNpwJeFwy7rDUefIwnslEFufhjMDjknGFpc6Thzmf8Lhk3GCpe+BSHzx5mLMCj0vGDZY6Tx7mnMLjknGDpc4wDMMMlzr0w1JnGOY/duqABgAAAGFQ/9bm8IMQcE3qACFSBwiROkCI1AFCpA4QInWAEKkDhEgdIETqACFSBwiROkCI1AFCpA4QInWAEKkDhEgdIETqACFSBwiROkCI1AFCpA4QInWAEKkDhEgdIETqACFSBwiROkCI1AFCpA4QInWAEKkDhEgdIETqACFSBwiROkCI1AFCpA4QInWAEKkDhEgdIETqMHbqQAYAAABgkL/1Pb6CCEakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakTuzUgQwAAADAIH/re3wFETAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAi9dipAxkAAACAQf7W9/gKIoARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYARqQOMSB1gROoAI1IHGJE6wIjUAUakDjAidYCR2K1jG4ZhKIaC+l0ql+4FLxFAjfafKhtEgArHIe4GYPlAUQcIIuoAQUQdIIioAwQRdYAgog4QRNQBgog6QBBRBwgi6gBBRB0giKgDBBF1gCCiDhBE1AGCiDpAEFEHCCLqAEFEHSCIqAMEEXWAIKIOEETUAYKIOkAQUQcIIuoAQUQdIIioAwRZRv3Y6vMQdYD7raN+1diY7aIOcL911M/qG7NT1AF+4LU86jXbht4AWLo56tcoeQb4G9+iflxn1btt6gXAw/S2bfZRADzG6LMBAAAAAAAAAMCnPTgkAAAAABD0/7UfzAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADwBAX4P65XDx+DAAAAAElFTkSuQmCC) Properties: | Property | Value | Required | Description | | --------------- | ---------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------ | | `editor` | One of - `json`- `proxy`- `hidden` | Yes | UI editor used for input. 
| | `patternKey` | String | No | Regular expression that will be usedto validate the keys of the object. | | `patternValue` | String | No | Regular expression that will be usedto validate the values of object. | | `maxProperties` | Integer | No | Maximum number of propertiesthe object can have. | | `minProperties` | Integer | No | Minimum number of propertiesthe object can have. | | `nullable` | Boolean | No | Specifies whether null isan allowed value. | | `isSecret` | Boolean | No | Specifies whether the input field will be stored encrypted. Only available with `json` and `hidden` editors. | ### Array[​](#array "Direct link to Array") Example of request list sources configuration: ``` { "title": "Start URLs", "type": "array", "description": "URLs to start with", "prefill": [{ "url": "https://apify.com" }], "editor": "requestListSources" } ``` Rendered input: ![Apify Actor input schema start urls array](/assets/images/input-schema-start-urls-bef8993a2dd7f9700624e248f733f2e3.png) Example of an array: ``` { "title": "Colors", "type": "array", "description": "Enter colors you know", "prefill": ["Red", "White"], "editor": "json" } ``` Rendered input: ![Apify Actor input schema colors array](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABm4AAAFcCAMAAAD29bGYAAABcVBMVEX////MzMzd3d1WmTvl7+URGBygoKC5ubne3t4ahP8Uf/8SFx2Ojo79/f3v7+/29fbg4OBjoUr4+fjq6upYmz78/P3j4+NWVlYkeJP7/Pvn5+cwMDDl5eWFtnJXmTzx8fHu9eyBtG3y8vOqq6t1fHttplX39/fs7Oy/2bWYm5y41KyLunkpMDTh7NuytLWWwIV0q15fn0YxNzva6dR7g4H09PSJuHZbnEAih//p6enHx8dbXV3k5OSwsbKho6MdJCf6+vqjx5RopE/b3NxwqVlPVFcxODrV1dWUlZWcw4vS09Ocn6AsfZd8sGhbYWJsplVobW9gZWh3rWFdnULO0NGRvX88PDz1+fMuf5glLDDi4uJwdXc2PD/M4MOw0KNITlE/RUjZ2tumydO/wMDG3b20ureJjY/CxMW7vb6py5yPkpR7f4K917Kkp6lspbeEh4nS5MrX59HX5uu+19/EzcY3hZ3O4ed+sL9Ok6k5hZ4zgZtN2BAyAAARcklEQVR42uzBgQAAAACAoP2pF6kCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABm3/5e04aiAI4fL1yajvwoN8RhEIRkKELpw5gwXbWCDGVCWaEv1UJbZHaslP7/bzvZuTdp2ms7dQ42zufBpElQ+tIv5/bKGGOMMcb+Tapa6fUqVQWMMcb+F7UarMdL1KYfRK+vqfZdrV8Fxhhjf0lrce9Ov/mwyqWUe7AxX4gA1nIgRAfW1xDCB4iEiOFFKnXdQW9UrY56A9dNecJhjLG/wplLctYCu/dSHsGmPEdU4bdEkUdDihAi3Dg30BWiCS8IJ24a5j+k7iQExhhjO3d7Io3D8Q5yE9Nf/9dFQhzk002wYW6ob7OXa1P6NcfcG8YYy33c1z5armylJ9HJ/OHsQqLFH8+NEmIGa+YGmkPYNDc03iSwipq4IygZuRNeT2OMMbL/Rtu3XNlG8xIjMw/wLFri2UXyp3PTwgaslRuyTW4CIeqwSqpnm8Yi7Y+Vnm9SYIwxttPctLExV3qWGOD5FMjN3qBfGT7PTby4T29988No1IDoS/odMsF5enQTWf7tb655HT/0Ew9IECh88Vt+k24GvhBOEGTlG+IriRJ8IAAtCGr4XMvvRGCoBN/AK+cGHCE8sKu6qVlSQ+1QJ4j3pzHGWB4X22La1rm5kvKkDiQ6lPITZHrXMnPRfpKb+FQiGofQZynb6YmUh3ju61s/ysGpFSOL74iM06K84A01E5lZlN0VmgdQF6JGn0eXnAD01BL6dEEHr+PQA2E5N/HqnW19N6TMuhXV2HNT3aA+MMYYo9zsr7y+hRgD8QDG7XI5BfRFGpNSbvxLqZ1GOjdfJcLc3BxK7XMEBlUkNrOH4evcCEeQ6orcxEJzFOWm0KE3N2KTm+JDrZQ70IHZAzRwG0BHYIwxtsPc3GIfUnjCOcGrZ9N3F3i4f5Sb4CqryWR5jYe5R7lBb9/dAczxZHp+lNXnOzwSUgKoFce+atR1KzyadIJOV9dCdbKsqCHkuQmzO0nTP8AbNZObOFANR49MeDxOhip7h+hxbppUMIuq26OTkQK0cOm5Hq+mMcbYTnPTxzx8sSywyXM8drJhplvkZok/tgFlUfmmc3MaAIqwUH1966t1G/SBScAxxoNyo2+18GRYbBUwuSm+fhM5FKTAXGjSEKQcZ+bpD0lMbsxCHVhVytvS2nq6GbkVYIwxtqvc0E6BAMqOqSroBs8GeW48PJx6gALs0Bnl5jCCTGKZkkxchno/tEOPeg5mhnLj588kltz4FCjqjFMc0EwIBYWEMqRzg4QQYFMeYxp9Ny2GHsYYY7vLzdSSmyO8dkCn1zSrUG7CIil3Ul5QbuaQPynvKgpWbUxLqB06Jj7lZpj/fye05CYWogWEnsXcHJdnJuRFQxWXc0Nb017NzcJ1B4pzwxhjT77m+XHV9W0X084tCfIA9NLY2zw34+LZezyNaGda0ajMVbsLRvGH3ySFhJgRyg3kw0vdkht8TSCfZgJ8Ln+Prs5NEjuCmNwUkXs1N+32GF95MY0xxnbOvlXgh5TvgTxIeZ3npoLPxnowwNPE5IbsXUry4Nly06JRpSgP5sYBorAzltxUs8aQuj03sUBr5GblGLPgrQKMMbZTXezDBzBa4/GIvu05pAtnUn7Kc3OD1yuA6BHP5MY4X76VmR/PFtPKe5O7+WKaDlOC9+zTTaN4l+bz3Pi/NrslgWpZFtNWb4Qu443QjDG2e/Q1z5/s3U1L40AAxvHJQNwozQs
JydIQWKiyQdiTIGhNX6CHBQ9LC720Luhldwuy7Pe/7YzzNIkmBA9aqj6/y6ZNVvH0Z5KZSVD7cCfEJ5WMWfnmgZsyNx7mRSsnZmknclPxxyvb/ikA5QiRlEAYiZQDk5uw2uemJTcR6mKu7Tdyoy+7FJrXnCrgdC7zrHCZJxHRThzYKjGhOd6o498mK/flrbZRmRtxt730cmUyVOUmur52ygc/oaikuCMWSxwIX7fDJCQzMQnUKVxSz82gnMyW6xA1c+Nsg5XUc4NwdWxiAyP9p3ETGyKinZjqRy5zTx25pzYezixQmUJVZTJAbpCSExWAUK+3GSI31UyBDeas3Qqo5gVguU0fi2gSjFhkjjPSxTd+LTcurhTnuLJldOPhU5UbFK3o2qLTsCwhuEUnEdGOLG1tcXO10v/+LfeqOZvf2qjKNjfu3Fb9WfzA7jb13ExXulrjgyvs8lka1Mcwskh1W4IYcVGHWerI7TMafZTlyA1WczpZusY4qPXZzTqKgsZUgVx95gsIiIj2izWxtyZ/cAfsmw2/Hu2ZNr2y4V6UuTE2ExvmvijVJ6D1HWk4U5wICmmgDfnTPdMGEtZuW27cRBrFk9xk6jRfr0ZEtGey+4mJzSIV4Mxt7XaDGQPb7vj/Hi49O3WF9rk2lEkXD6fujpvbCoQYAaVSS8NqoxnPMds7w5dEOshNbH5fIRUnMr/ua5mbCD/0Uv//4Fx1KDd18rBvgfPcl0cnfHk0EdHrL/OEc2s03mSizp9Z33PRFK+Xy9QVraJl83U3mDUG/bgvwMVimzAWXdxp3HlBHLsCABOrO4THvd54OEuS2XDc6x3zThoR0fM2sdlr7UMN5OY14F5ap2TUgxHnpBERvYvc6OHNYKe5CTG46RQm1nBosTVERO8mNyKQjrvL3BR6uQ4REX203PhB4LdVKMjEa4iDwBNERPThckNERG/G4ZHqinZRzUnTjpgbIiJ6ydzA4eNvmBsiImJuiIjobbk4hIuWb4iIiIiIiIiIiIiI6D+7dlDDMAwEUXQHQGSfzMQkzKL8SRRAesjmVuk9El8jDQAAAAAAAAAAAAC/7HkCAI+cueuVGQBomPXCJ1njKgB44Bor+VTbzBkFAI+N0983O1EbAFpGstvjZhUAtKz2vDnGDQBdI6d6Ei8BAJqupJ2bAoAmuQHgTm4A+EdyA8Cd3HzZtWOVhoE4juM3FC30ARxEUHAQDje3gtBZKBisT1CE+v4P4Am9knBDO+b+/XyWLEnIb7nvEgB6JDcAtOQGgB7JDQAtuQGgR3IDQEtuAOiR3ADQkhsAeiQ3ALSuPTe3E6ljcZYAVF/fY+sJubmMJQBnLe/H7ia6y83ipO9DOs4SALmZsThLAORmxuIsAaiWN2Od/yoQ5ZCOswRAbmYszhKAsLl53xY/fR/SZcnvtviQGyCMcLl5WxXr/0N6n/NLSoucN+lS+2F4LJdDzq/1+cYu7+otrYdh09z/dEjnHD+zvrws+VwVz3IDf+zdz2vaYBzH8Q98dthlh8hz0ENCHAScTkwYNMdhkIFsIaAbbgd7EHvYKKP7/eOv3/MkT52Z1jasgiTf1+FpbZP26yVvIg9E1MZJ5ab149e3n/+dm+cXF1/NRdolu0CPdKCFcy1KZj0cMCWfAXDIvj1/l0duDunOo0ewzubzBRQn+EfMANf8qyvsZ8f0yOKdfL24eCe5EULUyAnl5vHHtvbqv3PzVK+7uXFoJYv7y01qjrcSergtNyNyeLfcaC8kN0KIGjmd3Hxva0fNjeumCcnJ/eVmZFYr4nxvbtxwKrkRQojTyc2Xdvv36u0Rc+PZyzqH95abHjmFpZjY3OyQ3AghxAnl5tsKOHpuMCHfw3g0TpPEXaKwGE2c7rpqbvy/h7TINM/NYhw6sx5yD0ZG37ZsMMjI7kDD9gSSGyFEE5xObnLHz41LvoH2KaJmX2A4pzGplht7NxORPVyRI/M6c2hMbUSMADnFDZQmkNwIIRqgcbkJirsbX5FOOtPrGFpERmlCVs1NRAcdkgOck5dFUqJZQqoOtGdBEHjXuemGoUMGYRjOUJ5AciOEqL8a56bj+x0Avr/Yys05qXwAs+Ku4oGiauU/zQCMbG4Wvv/Inr/L9/3NIQgYm1IwMOf6eW5SACHZP7wzrTyBHdOsvuRGCFFLNc5NmUNmWTJncTPhkwkMNw9DVmQGsflahUsPAyrGSKkAKCpTjSU5OpSbnQm2SG6EELXUpNzknDW0SzIKjYAc2E3M2qxqbsakH9DN1yjPjQNtQbqHc1OeQHIjhKi9JuUmy+a0HRjxr3RTCZxVzc2SXMe8jPjeyauimMEgu4dzU55AciOEqL0G5cYDMIzJXtGV8MzqAx4TGG+q5qZFTsnWjN2Yrt2pdpfclCeQD9OEEPXXsNxc7wAbk2P8FTOC0a2aGygmjDCm43FcJTflCSQ3Qojaa1pukJHnwJJ0S79UMIIbcvP6w4cPr7GPQ3KGIbXlXXJj/34xgeRGCNEcjcvNkoyAjt1CprUApHmD4Hs35OalfhrAE+wTkhwDimTr9tycm/9jlCeQ3Agh6u+EcvN5tVq9arf1+vmIuUGSX/KnZLQEOv1MARiS6gq+w8q5GZB8AASkwu25+URG6xa00gSSGyFE/Z1Qbj62r308Zm4uyQhARlIp2kyk5hvyptw8uTE378m42GPg7OZmyI140zoD2J5AciOEaIDm5QYO2QMwVTTmLowzatEbcog9zLM1H2KfZ2QGYE2G5dy45dwo5Pw09mxutiaQ3Agh6u+EcnOPD4/+w94doyYQRWEYfWVA0qRIkTpVYNYgqUQE0VIEV+D+a0cQFC7KFCPOvZ6zgGH+5n7l2w460t1h+de1i6/jf9fu+JzPZvs2mvgHccnG49FALTVzsx73SC/6T363J4pLtnID1FIuN73Rj/Sqv/ztgTxLAIaSmxcc6d/57qc9kGcJwFBy8wZHus4SALmZsDpLAORmwuosASiZm6vcR7rOEgC5mbA6SwAK5gaA6fpot+QGgDO5ASAjuQEgkhsAMpIbACK5ASAjuQEgkhsAMpIbACK5ASAjuQEgkhsAMpIbACK5ASAjuQEgkhsAMpIbACK5ASAjuQEgkhsAMpIbACK5ASAjuQEgkhsAMpIbACK5ASAjuQEgkhtO7NUBDQAAAMKg/q0NYIJvUAKAIt0A8HQDQJFuAHi6AaBINwA83QBQpBsAnm4AKNINAE83ABTpBoCnGwCKdAPA0w0ARboB4OkGgCLdAPB0A0CRbgB4ugGgSDcAPN0AUKQbAJ5uACjSDQBPNwAU6QaApxsAinQDwNMNAEW6AeDpBoAi3QDwdANAkW4AeLoBoEg3jL06oAEAAEAY1L+1AZ7ADUoAULoB4JFuACjdAPBINwCUbgB4pBsASjcAPNINAKUbAB7pBoDSDQCPdANA6QaAR7oBoHQDwCPdAFC6AeCRbgAo3QDwSDcAlG4AeKQbAEo3ADzSDQClGwAe6QaA0g0Aj3QDQOkGgEe6AaB0A8Aj3QBQugHgkW4AKN0A8Eg3AJRuYOzVAQ0AAADCoP6tDW
CCb1ACoEg3ADzdAFCkGwCebgAo0g0ATzcAFOkGgKcbAIp0A8DTDQBFugHg6QaAIt0A8HQDQJFuAHi6AaBINwA83QBQpBsAnm4AKNINAE83ABTpBoCnGwCKdAPA0w0ARboB4OkGgCLdAPB0A0CRbgB4ugGgSDcAPN0AUKQbAJ5uACjSDYy9OqABAABAGNS/tQGewA1KAJRuAHikGwBKNwA80g0ApRsAHukGgNINAI90A0DpBoBHugGgdAPAI90AULoB4JFuACjdAPBINwCUbgB4pBsASjcAPNINAKUbAB7pBoDSDQCPdANA6QaAR7oBoHQDwCPdAFC6AeCRbgAo3QDwSDcAlG4AeKQbAEo3ADzSDQClG2Ds1UENgDAMAMDioGZmYv4FIaA86EJImtyZOJhINwBUugFgIt0AUOkGgIl0A0ClGwAm0g0AlW4AmEg3AFS6AWAi3QBQ6QaAiXQDQKUbACbSDQCVbgCY6NNutm4A6Hezoyd1A0C/m4yepRsA+t2saMoAgKdu5AFA9U835/ICgIaMIyv3BQCv7FwBAAAAAAAAAAAAAAAAAAAAAAAAAAAA3O3BAQkAAACAoP+v+xEqAAAAAAAAAAAAAAAAAAAAAAAAAAAAALAXmqX8a95ntpsAAAAASUVORK5CYII=) Properties: | Property | Value | Required | Description | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | | `editor` | One of - `json`- `requestListSources`- `pseudoUrls`- `globs`- `keyValue`- `stringList`- `select`- `hidden` | Yes | UI editor used for input. | | `placeholderKey` | String | No | Placeholder displayed forkey field when no value is specified.Works only with `keyValue` editor. | | `placeholderValue` | String | No | Placeholder displayed in value fieldwhen no value is provided.Works only with `keyValue` and`stringList` editors. | | `patternKey` | String | No | Regular expression thatwill be used to validatethe keys of items in the array.Works only with `keyValue`editor. | | `patternValue` | String | No | Regular expression thatwill be used to validate the valuesof items in the array.Works only with `keyValue` and`stringList` editors. | | `maxItems` | Integer | No | Maximum number of itemsthe array can contain. | | `minItems` | Integer | No | Minimum number of itemsthe array can contain. | | `uniqueItems` | Boolean | No | Specifies whether the arrayshould contain only unique values. | | `nullable` | Boolean | No | Specifies whether null isan allowed value. | | `items` | object | No | Specifies format of the items of the array, useful mainly for multiselect (see below) | | `isSecret` | Boolean | No | Specifies whether the input field will be stored encrypted. Only available with `json` and `hidden` editors. | Usage of this field is based on the selected editor: * `requestListSources` - value from this field can be used as input for the [RequestList](https://crawlee.dev/api/core/class/RequestList) class from Crawlee. * `pseudoUrls` - is intended to be used with a combination of the [PseudoUrl](https://crawlee.dev/api/core/class/PseudoUrl) class and the [enqueueLinks()](https://crawlee.dev/api/core/function/enqueueLinks) function from Crawlee. Editor type `requestListSources` supports input in formats defined by the [sources](https://crawlee.dev/api/core/interface/RequestListOptions#sources) property of [RequestListOptions](https://crawlee.dev/api/core/interface/RequestListOptions). Editor type `globs` maps to the Crawlee's [GlobInput](https://crawlee.dev/api/core#GlobInput) used by the [UrlPatterObject](https://crawlee.dev/api/core#UrlPatternObject). Editor type `select` allows the user to pick items from a select, providing multiple choices. 
Please check this example of how to define the multiselect field: ``` { "title": "Multiselect field", "description": "My multiselect field", "type": "array", "editor": "select", "items": { "type": "string", "enum": ["value1", "value2", "value3"], "enumTitles": ["Label of value1", "Label of value2", "Label of value3"] } } ``` To correctly define options for multiselect, you need to define the `items` property and then provide values and (optionally) labels in `enum` and `enumTitles` properties. ### Resource type[​](#resource-type "Direct link to Resource type") Resource type identifies what kind of Apify Platform object is referred to in the input field. For example, the Key-value store resource type can be referred to using a string ID. Currently, it supports storage resources only, allowing the reference of a Dataset, Key-Value Store or Request Queue. For Actor developers, the resource input value is a string representing the storage ID. The type of the property is either `string` or `array`. In case of `array` (for multiple resources) the return value is an array of IDs. In the user interface, a picker is provided for easy selection, where users can search and choose from their own storages or those they have access to. Example of a Dataset input: ``` { "title": "Dataset", "type": "string", "description": "Select a dataset", "resourceType": "dataset" } ``` Rendered input: ![Apify Actor input schema dataset](/assets/images/input-schema-dataset-2c784c5238066c743fd2c4fb28d34fed.png) The returned value is resource reference, in this example it's the dataset ID as can be seen in the JSON tab: ![Apify Actor input schema dataset](/assets/images/input-schema-dataset-json-1b7772d96d7ddb0a8be8604841754274.png) Example of multiple datasets input: ``` { "title": "Datasets", "type": "array", "description": "Select multiple datasets", "resourceType": "dataset" } ``` Rendered input: ![Apify Actor input schema datasets](/assets/images/input-schema-datasets-40376f5fbddb49389c41f2d34699de26.png) Properties: | Property | Value | Required | Description | | -------------- | --------------------------------------------------------------- | -------- | ---------------------------------------------------------------------------------- | | `type` | One of - `string`- `array` | Yes | Specifies the type of input - string for single value or array for multiple values | | `editor` | One of - `resourcePicker`- `hidden` | No | Visual editor used forthe input field. Defaults to `resourcePicker`. | | `resourceType` | One of - `dataset`- `keyValueStore`- `requestQueue` | Yes | Type of Apify Platform resource | | `minItems` | Integer | No | Minimum number of items the array can contain. Only for `type: array` | | `maxItems` | Integer | No | Maximum number of items the array can contain. Only for `type: array` | --- # Key-value Store Schema Specification **Learn how to define and present your key-value store schema to organize records into collections.** *** The key‑value store schema organizes keys into logical groups called collections, which can be used to filter and categorize data both in the API and the visual user interface. This organization helps users navigate and find specific data more efficiently, while schema‑defined rules (such as content types and JSON schema) ensure that stored values remain consistent and valid. 
## Example[​](#example "Direct link to Example") Consider an example Actor that calls `Actor.setValue()` to save a record into the key-value store: main.js ``` import { Actor } from 'apify'; // Initialize the JavaScript SDK await Actor.init(); /** * Actor code */ await Actor.setValue('document-1', 'my text data', { contentType: 'text/plain' }); // ... await Actor.setValue(`image-${imageID}`, imageBuffer, { contentType: 'image/jpeg' }); // Exit successfully await Actor.exit(); ``` To configure the key-value store schema, use the following template for the `.actor/actor.json` configuration: .actor/actor.json ``` { "actorSpecification": 1, "name": "Actor Name", "title": "Actor Title", "version": "1.0.0", "storages": { "keyValueStore": { "actorKeyValueStoreSchemaVersion": 1, "title": "Key-Value Store Schema", "collections": { "documents": { "title": "Documents", "description": "Text documents stored by the Actor.", "keyPrefix": "document-" }, "images": { "title": "Images", "description": "Images stored by the Actor.", "keyPrefix": "image-", "contentTypes": ["image/jpeg"] } } } } } ``` The template above defines the configuration for the default key-value store. Each collection can define its member keys using one of the following properties: * `keyPrefix` - All keys starting with the specified prefix will be included in the collection (e.g., all keys starting with "document-"). * `key` - A specific individual key that will be included in the collection. You must use either `key` or `keyPrefix` for each collection, but not both. Once the schema is defined, tabs for each collection will appear in the **Storage** tab of the Actor's run: ![Storages tab in Run](/assets/images/kv-store-schema-example-run-e1a76231461aade32585217df4854b53.png) The tabs also appear in the storage detail view: ![Storage detail](/assets/images/kv-store-schema-example-storage-aaa62ee4017e0c38f172315ddf766118.png) ### API Example[​](#api-example "Direct link to API Example") With the key-value store schema defined, you can use the API to list keys from a specific collection by using the `collection` query parameter when calling the [Get list of keys](https://docs.apify.com/api/v2/key-value-store-keys-get) endpoint: Get list of keys from a collection ``` GET https://api.apify.com/v2/key-value-stores/{storeId}/keys?collection=documents ``` Example response: ``` { "data": { "items": [ { "key": "document-1", "size": 254 }, { "key": "document-2", "size": 368 } ], "count": 2, "limit": 1000, "exclusiveStartKey": null, "isTruncated": false } } ``` You can also filter by key prefix using the `prefix` parameter: Get list of keys with prefix ``` GET https://api.apify.com/v2/key-value-stores/{storeId}/keys?prefix=document- ``` ### Schema Validation[​](#schema-validation "Direct link to Schema Validation") When you define a key-value store schema with specific `contentTypes` for collections, the Apify platform validates any data being stored against these specifications. For example, if you've specified that a collection should only contain JSON data with content type `application/json`, attempts to store data with other content types in that collection will be rejected. The validation happens automatically when you call `Actor.setValue()` or use the [Put record](https://docs.apify.com/api/v2/reference/key-value-stores/record/put-record) API endpoint. If you've defined a `jsonSchema` for a collection with content type `application/json`, the platform will also validate that the JSON data conforms to the specified schema. 
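For illustration, a collection that accepts only JSON records could combine `contentTypes` with a `jsonSchema`. The following sketch is hypothetical — the `results` collection, its key prefix, and the schema fields are made up for this example; only the property names come from the schema specification described on this page:

```
{
  "results": {
    "title": "Results",
    "description": "JSON results stored by the Actor.",
    "keyPrefix": "result-",
    "contentTypes": ["application/json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "url": { "type": "string" },
        "statusCode": { "type": "integer" }
      },
      "required": ["url"]
    }
  }
}
```

With such a collection in place, a call like `Actor.setValue('result-1', { url: 'https://apify.com', statusCode: 200 })` should pass validation, while a JSON record missing the `url` property would be rejected.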
This helps ensure data consistency and prevents storing malformed data. ## Structure[​](#structure "Direct link to Structure") Output configuration files need to be located in the `.actor` folder within the Actor's root directory. You have two choices of how to organize files within the `.actor` folder. ### Single configuration file[​](#single-configuration-file "Direct link to Single configuration file") .actor/actor.json ``` { "actorSpecification": 1, "name": "this-is-book-library-scraper", "title": "Book Library scraper", "version": "1.0.0", "storages": { "keyValueStore": { "actorKeyValueStoreSchemaVersion": 1, "title": "Key-Value Store Schema", "collections": { /* Define your collections here */ } } } } ``` ### Separate configuration files[​](#separate-configuration-files "Direct link to Separate configuration files") .actor/actor.json ``` { "actorSpecification": 1, "name": "this-is-book-library-scraper", "title": "Book Library scraper", "version": "1.0.0", "storages": { "keyValueStore": "./key_value_store_schema.json" } } ``` .actor/key\_value\_store\_schema.json ``` { "actorKeyValueStoreSchemaVersion": 1, "title": "Key-Value Store Schema", "collections": { /* Define your collections here */ } } ``` Choose the method that best suits your configuration. ## Key-value store schema structure definitions[​](#key-value-store-schema-structure-definitions "Direct link to Key-value store schema structure definitions") The key-value store schema defines the collections of keys and their properties. It allows you to organize and validate data stored by the Actor, making it easier to manage and retrieve specific records. ### Key-value store schema object definition[​](#key-value-store-schema-object-definition "Direct link to Key-value store schema object definition") | Property | Type | Required | Description | | --------------------------------- | ------- | -------- | --------------------------------------------------------------------------------------------------------------- | | `actorKeyValueStoreSchemaVersion` | integer | true | Specifies the version of key-value store schema structure document.Currently only version 1 is available. | | `title` | string | true | Title of the schema | | `description` | string | false | Description of the schema | | `collections` | Object | true | An object where each key is a collection ID and its value is a collection definition object (see below). | ### Collection object definition[​](#collection-object-definition "Direct link to Collection object definition") | Property | Type | Required | Description | | -------------- | ------------ | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | | `title` | string | true | The collection’s title, shown in the run's storage tab and in the storage detail view, where it appears as a tab for filtering records. | | `description` | string | false | A description of the collection that appears in UI tooltips. | | `key` | string | conditional\* | Defines a single specific key that will be part of this collection. | | `keyPrefix` | string | conditional\* | Defines a prefix for keys that should be included in this collection. | | `contentTypes` | string array | false | Allowed content types for records in this collection. Used for validation when storing data. 
| | `jsonSchema` | object | false | For collections with content type `application/json`, you can define a JSON schema to validate structure.Uses JSON Schema Draft 07 format. | \* Either `key` or `keyPrefix` must be specified for each collection, but not both. --- # Source code **Learn about the Actor's source code placement and its structure.** *** The Apify Actor's source code placement is defined by its [Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md). If you have created the Actor from one of Apify's [templates](https://apify.com/templates) then it's by convention placed in the `/src` directory. You have the flexibility to choose any programming language, technologies, and dependencies (such as Chrome browser, Selenium, Cypress, or others) for your projects. The only requirement is to define a Dockerfile that builds the image for your Actor, including all dependencies and your source code. ## Example setup[​](#example-setup "Direct link to Example setup") Let's take a look at the example JavaScript Actor's source code. The following Dockerfile: ``` FROM apify/actor-node:20 COPY package*.json ./ RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version \ && rm -r ~/.npm COPY . ./ CMD npm start --silent ``` This `Dockerfile` does the following tasks: 1. Builds the Actor from the `apify/actor-node:20` base image. ``` FROM apify/actor-node:20 ``` 2. Copies the `package.json` and `package-lock.json` files to the image. ``` COPY package*.json ./ ``` 3. Installs the npm packages specified in package.json, omitting development and optional dependencies. ``` RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version \ && rm -r ~/.npm ``` 4. Copies the rest of the source code to the image ``` COPY . ./ ``` 5. Runs the `npm start` command defined in `package.json` ``` CMD npm start --silent ``` Optimized build cache By copying the `package.json` and `package-lock.json` files and installing dependencies before the rest of the source code, you can take advantage of Docker's caching mechanism. This approach ensures that dependencies are only reinstalled when the `package.json` or `package-lock.json` files change, significantly reducing build times. Since the installation of dependencies is often the most time-consuming part of the build process, this optimization can lead to substantial performance improvements, especially for larger projects with many dependencies. ### `package.json`[​](#packagejson "Direct link to packagejson") The `package.json` file defines the `npm start` command: ``` { "name": "getting-started-node", "version": "0.0.1", "type": "module", "description": "This is an example of an Apify Actor.", "dependencies": { "apify": "^3.0.0" }, "devDependencies": {}, "scripts": { "start": "node src/main.js", "test": "echo \"Error: oops, the Actor has no tests yet, sad!\" && exit 1" }, "author": "It's not you; it's me", "license": "ISC" } ``` When the Actor starts, the `src/main.js` file is executed. --- # Automated tests **Learn how to automate ongoing testing and make sure your Actors perform over time. 
See code examples for configuring the Actor Testing Actor.** *** # Automated tests for Actors Automated testing is crucial for maintaining the reliability and performance of your Actors over time. This guide will help you set up automated tests using the [Actor Testing Actor](https://apify.com/pocesar/actor-testing). ## Set up automated tests[​](#set-up-automated-tests "Direct link to Set up automated tests") 1. Prepare test tasks - Create 1–5 separate testing tasks for your Actor. 2. Configure Actor testing - Set up a task using the Actor Testing Actor. 3. Validate tests - Run the test task multiple times until all tests pass. 4. Schedule tests - Set up a recurring schedule for your tests. 5. Monitor results - Review and address any issues on a weekly basis. ## Create test tasks[​](#create-test-tasks "Direct link to Create test tasks") Example of Actor testing tasks When creating test tasks: * Include a test for your Actor's default configuration * Set a low `maxItem` value to conserve credits * For large data tests, reduce test frequency to conserve credits ## Configure the Actor Testing Actor[​](#configure-the-actor-testing-actor "Direct link to Configure the Actor Testing Actor") Follow the [setup guide](https://apify.com/pocesar/actor-testing) in the Actor's README. Here are some recommended test scenarios: * Run status * Crash information from the log * Information from statistics (runtime, retries) * Information about and from within the dataset * Information about the key-value store ``` await expectAsync(runResult).toHaveStatus('SUCCEEDED'); ``` ``` await expectAsync(runResult).withLog((log) => { // Neither ReferenceError or TypeErrors should ever occur // in production code – they mean the code is over-optimistic // The errors must be dealt with gracefully and displayed with a helpful message to the user expect(log) .withContext(runResult.format('ReferenceError')) .not.toContain('ReferenceError'); expect(log) .withContext(runResult.format('TypeError')) .not.toContain('TypeError'); }); ``` ``` await expectAsync(runResult).withStatistics((stats) => { // In most cases, you want it to be as close to zero as possible expect(stats.requestsRetries) .withContext(runResult.format('Request retries')) .toBeLessThan(3); // What is the expected run time for the number of items? 
expect(stats.crawlerRuntimeMillis) .withContext(runResult.format('Run time')) .toBeWithinRange(1 * 60000, 10 * 60000); }); ``` ``` await expectAsync(runResult).withDataset(({ dataset, info }) => { // If you're sure, always set this number to be your exact maxItems expect(info.cleanItemCount) .withContext(runResult.format('Dataset cleanItemCount')) .toBe(3); // or toBeGreaterThan(1) or toBeWithinRange(1,3) // Make sure the dataset isn't empty expect(dataset.items) .withContext(runResult.format('Dataset items array')) .toBeNonEmptyArray(); const results = dataset.items; // Check dataset items to have the expected data format for (const result of results) { expect(result.directUrl) .withContext(runResult.format('Direct url')) .toStartWith('https://www.yelp.com/biz/'); expect(result.bizId) .withContext(runResult.format('Biz ID')) .toBeNonEmptyString(); } }); ``` ``` await expectAsync(runResult).withKeyValueStore(({ contentType }) => { // Check for the proper content type of the saved key-value item expect(contentType) .withContext(runResult.format('KVS contentType')) .toBe('image/gif'); }, // This also checks for existence of the key-value key { keyName: 'apify.com-scroll_losless-comp' }, ); ``` --- # Builds and runs **Learn about Actor builds and runs, their lifecycle, versioning, and other properties.** *** Actor **builds** and **runs** are fundamental concepts within the Apify platform. Understanding them is crucial for effective use of the platform. ## Building an Actor[​](#building-an-actor "Direct link to Building an Actor") When you start the build process for your Actor, you create a *build*. A build is a Docker image containing your source code and the required dependencies needed to run the Actor: ## Running an Actor[​](#running-an-actor "Direct link to Running an Actor") To create a *run*, you take your *build* and start it with some input: ## Lifecycle[​](#lifecycle "Direct link to Lifecycle") Actor builds and runs share a common lifecycle. Each build and run begins with the initial status **READY** and progress through one or more transitional statuses to reach a terminal status. *** | Status | Type | Description | | ---------- | ------------ | ------------------------------------------- | | READY | initial | Started but not allocated to any worker yet | | RUNNING | transitional | Executing on a worker machine | | SUCCEEDED | terminal | Finished successfully | | FAILED | terminal | Run failed | | TIMING-OUT | transitional | Timing out now | | TIMED-OUT | terminal | Timed out | | ABORTING | transitional | Being aborted by user | | ABORTED | terminal | Aborted by user | --- # Builds **Learn about Actor build numbers, versioning, and how to use specific Actor version in runs. Understand an Actor's lifecycle and manage its cache.** *** ## Understand Actor builds[​](#understand-actor-builds "Direct link to Understand Actor builds") Before an Actor can be run, it needs to be built. The build process creates a snapshot of a specific version of the Actor's settings, including its [source code](https://docs.apify.com/platform/actors/development/actor-definition/source-code.md) and [environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md). This snapshot is then used to create a Docker image containing everything the Actor needs for its run, such as `npm` packages, web browsers, etc. 
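To see what a build record contains, you can fetch a single build through the API. The snippet below is a minimal sketch using the Get build endpoint — the build ID and token are placeholders, and the exact endpoint path and response fields are described in the API reference:

```
// Fetch details of one build (the ID and token are placeholders).
const buildId = '<BUILD_ID>';
const response = await fetch(
  `https://api.apify.com/v2/actor-builds/${buildId}?token=${process.env.APIFY_TOKEN}`,
);
const { data: build } = await response.json();

// Among other fields, the build record carries its build number and lifecycle status.
console.log(build.buildNumber, build.status); // e.g. "1.2.345", "SUCCEEDED"
```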
### Build numbers[​](#build-numbers "Direct link to Build numbers") Each build is assigned a unique build number in the format *MAJOR.MINOR.BUILD* (e.g. *1.2.345*): * *MAJOR.MINOR* corresponds to the Actor version number * *BUILD* is an automatically incremented number starting at **1**. ### Build resources[​](#build-resources "Direct link to Build resources") By default, builds have the following resource allocations: * Timeout: *1800* seconds * Memory: `4096 MB` Check out the [Resource limits](https://docs.apify.com/platform/actors/running.md) section for more details. ## Versioning[​](#versioning "Direct link to Versioning") To support active development, Actors can have multiple versions of source code and associated settings, such as the base image and environment. Each version is denoted by a version number of the form *MAJOR.MINOR*, following [Semantic Versioning](https://semver.org/) principles. For example, an Actor might have: * Production version *1.1* * Beta version *1.2* that contains new features but is still backward compatible * Development version *2.0* that contains breaking changes. ## Tags[​](#tags "Direct link to Tags") Tags simplify the process of specifying which build to use when running an Actor. Instead of using a version number, you can use a tag such as *latest* or *beta*. Tags are unique, meaning only one build can be associated with a specific tag. To set a tag for builds of a specific Actor version: 1. Set the `Build tag` property. 2. When a new build of that version is successfully finished, it's automatically assigned the tag. By default, builds are set to the *latest* tag. ## Cache[​](#cache "Direct link to Cache") To speed up builds triggered via API, you can use the `useCache=1` parameter. This instructs the build process to use cached Docker images and layers instead of pulling the latest copies and building each layer from scratch. Note that the cached images and layers might not always be available on the server building the image; the `useCache` parameter only functions on a best-effort basis. Clean builds: when you run builds from the Console, cached data is used by default. You can also run a clean build without using the cache. To run a clean build: 1. Go to your Actor page. 2. Select **Source** > **Code**. 3. Locate the **Start** button. Next to it, click on the arrow and choose **Clean build**. --- # Runs **Learn about Actor runs, how to start them, and how to manage them.** *** When you start an Actor, you create a run. A run is a single execution of your Actor with a specific input in a Docker container. ## Starting an Actor[​](#starting-an-actor "Direct link to Starting an Actor") You can start an Actor in several ways: * Manually from the [Apify Console](https://console.apify.com/actors) UI * Via the [Apify API](https://docs.apify.com/api/v2/act-runs-post.md) * Using the [Scheduler](https://docs.apify.com/platform/schedules.md) provided by the Apify platform * By one of the available [integrations](https://docs.apify.com/platform/integrations.md) ## Input and environment variables[​](#input-and-environment-variables "Direct link to Input and environment variables") The run receives input via the `INPUT` record of its default [key-value store](https://docs.apify.com/platform/storage/key-value-store.md). Environment variables are also passed to the run.
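As a small illustration of the above, the JavaScript SDK exposes the `INPUT` record through `Actor.getInput()`, and platform-provided environment variables are available on `process.env` (the `APIFY_ACTOR_RUN_ID` variable used below is just one example from the environment variables reference):

```
import { Actor } from 'apify';

await Actor.init();

// Read the run input stored in the INPUT record of the default key-value store.
const input = await Actor.getInput();

// Environment variables set by the platform are exposed via process.env.
console.log(`Run ${process.env.APIFY_ACTOR_RUN_ID} started with input:`, input);

await Actor.exit();
```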
For more information about environment variables, check the [Environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md) section. ## Run duration and timeout[​](#run-duration-and-timeout "Direct link to Run duration and timeout") Actor runs can be short or long-running. To prevent infinite runs, you can set a timeout. The timeout is specified in seconds, and the default timeout varies based on the template from which you create your Actor. If the run doesn't finish within the timeout, it's automatically stopped, and its status is set to `TIMED-OUT`. --- # State persistence **Learn how to maintain an Actor's state to prevent data loss during unexpected restarts. Includes code examples for handling server migrations.** *** Long-running [Actor](https://docs.apify.com/platform/actors.md) jobs may need to migrate between servers. Without state persistence, your job's progress is lost during migration, causing it to restart from the beginning on the new server. This can be costly and time-consuming. To prevent data loss, long-running Actors should: * Periodically save (persist) their state. * Listen for [migration events](https://docs.apify.com/sdk/js/api/apify/class/PlatformEventManager). * Check for persisted state when starting, allowing them to resume from where they left off. For short-running Actors, the risk of restarts and the cost of repeated runs are low, so you can typically ignore state persistence. ## Understanding migrations[​](#understanding-migrations "Direct link to Understanding migrations") A migration occurs when a process running on one server must stop and move to another. During this process: * All in-progress processes on the current server are stopped * Unless you've saved your state, the Actor run will restart on the new server with an empty internal state * You only have a few seconds to save your work when a migration event occurs ### Causes of migration[​](#causes-of-migration "Direct link to Causes of migration") Migrations can happen for several reasons: * Server workload optimization * Server crashes (rare) * New feature releases and bug fixes ### Frequency of migrations[​](#frequency-of-migrations "Direct link to Frequency of migrations") Migrations don't follow a specific schedule. They can occur at any time due to the events mentioned above. ## Why state is lost during migration[​](#why-state-is-lost-during-migration "Direct link to Why state is lost during migration") By default, an Actor keeps its state in the server's memory. During a server switch, the run loses access to the previous server's memory. Even if data were saved on the server's disk, access to that would also be lost. Note that the Actor run's default dataset, key-value store, and request queue are preserved across migrations; by state, we mean the contents of runtime variables in the Actor's code. ## Implementing state persistence[​](#implementing-state-persistence "Direct link to Implementing state persistence") The [Apify SDKs](https://docs.apify.com/sdk.md) handle state persistence automatically. This is done using the `Actor.on()` method and the `migrating` event. * The `migrating` event is triggered just before a migration occurs, allowing you to save your state. * To retrieve previously saved state, you can use the [`Actor.getValue`](https://docs.apify.com/sdk/js/reference/class/Actor#getValue)/[`Actor.get_value`](https://docs.apify.com/sdk/python/reference/class/Actor#get_value) methods.
### Code examples[​](#code-examples "Direct link to Code examples") To manually persist state, use the `Actor.on` method in the Apify SDK: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... Actor.on('migrating', () => { Actor.setValue('my-crawling-state', { foo: 'bar', }); }); // ... await Actor.exit(); ``` ``` from apify import Actor, Event async def actor_migrate(_event_data): await Actor.set_value('my-crawling-state', {'foo': 'bar'}) async def main(): async with Actor: # ... Actor.on(Event.MIGRATING, actor_migrate) # ... ``` To check for state saved in a previous run: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... const previousCrawlingState = await Actor.getValue('my-crawling-state') || {}; // ... await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: # ... previous_crawling_state = await Actor.get_value('my-crawling-state') # ... ``` For improved Actor performance, consider [caching repeated page data](https://docs.apify.com/academy/expert-scraping-with-apify/saving-useful-stats.md). ## Speeding up migrations[​](#speeding-up-migrations "Direct link to Speeding up migrations") Once your Actor receives the `migrating` event, the Apify platform will shut it down and restart it on a new server within one minute. To speed this process up, once you have persisted the Actor state, you can manually reboot the Actor in the `migrating` event handler using the `Actor.reboot()` method available in the [Apify SDK for JavaScript](https://docs.apify.com/sdk/js/reference/class/Actor#reboot) or [Apify SDK for Python](https://docs.apify.com/sdk/python/reference/class/Actor#reboot). * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... Actor.on('migrating', async () => { // ... // save state // ... await Actor.reboot(); }); // ... await Actor.exit(); ``` ``` from apify import Actor, Event async def actor_migrate(_event_data): # ... # save state # ... await Actor.reboot() async def main(): async with Actor: # ... Actor.on(Event.MIGRATING, actor_migrate) # ... ``` --- # Deployment **Learn how to deploy your Actors to the Apify platform and build them.** *** Deploying an Actor involves uploading your [source code](https://docs.apify.com/platform/actors/development/actor-definition.md) and [building](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md) it on the Apify platform. Once deployed, you can run and scale your Actor in the cloud. ## Deploy using Apify CLI[​](#deploy-using-apify-cli "Direct link to Deploy using Apify CLI") The fastest way to deploy and build your Actor is by using the [Apify CLI](https://docs.apify.com/cli). If you've completed one of the tutorials from the [academy](https://docs.apify.com/academy.md), you should already have it installed. If not, follow the [Apify CLI installation instructions](https://docs.apify.com/cli/docs/installation). To deploy your Actor using the Apify CLI: 1. Log in to your Apify account: ``` apify login ``` 2. Navigate to the directory of your Actor on your local machine. 3. Deploy your Actor by running: ``` apify push ``` When you deploy using the CLI, your source code is uploaded as "multiple source files" and is visible and editable in the Web IDE. ![Web IDE](/assets/images/actor-source-7ca79d7c6231c42a7c7e8adb63d645fa.png) Source files size limit: The CLI deploys code as multiple source files up to 3 MB. Beyond that, it deploys as a Zip file.
### Pull an existing Actor[​](#pull-an-existing-actor "Direct link to Pull an existing Actor") You can also pull an existing Actor from the Apify platform to your local machine using the `apify pull` command: ``` apify pull [ACTORID] ``` This command fetches the Actor's files to your current directory. If the Actor is defined as a Git repository, it will be cloned; for Actors defined in the Web IDE, the command will fetch the files directly. You can specify a particular version of the Actor to pull by using the `--version` flag: ``` apify pull [ACTORID] --version=1.2 ``` If you don't provide the `ACTORID`, the command will update the Actor in the current directory based on its name in the `.actor/actor.json` file. ## Alternative deployment methods[​](#alternative-deployment-methods "Direct link to Alternative deployment methods") To deploy using other methods, first create the Actor manually through the Apify CLI or Apify Console, then change its source type: ![Actor source types](/assets/images/actor-source-types-dc67aa6a565172977737a5c58ea4200a.png) You can link your Actor to a Git repository, Gist, or a Zip file. For more information on alternative source types, check out the next chapter. --- # Continuous integration for Actors **Learn how to set up automated builds, deploys, and testing for your Actors.** *** Automating your Actor development process can save time and reduce errors, especially for projects with multiple Actors or frequent updates. Instead of manually pushing code, building Actors, and running tests, you can automate these steps to run whenever you push code to your repository. You can automate Actor builds and tests using your Git repository's automated workflows like [GitHub Actions](https://github.com/features/actions) or [Bitbucket Pipelines](https://www.atlassian.com/software/bitbucket/features/pipelines). Using Bitbucket? Follow our step-by-step guide to set up continuous integration for your Actors with Bitbucket Pipelines: [Read the Bitbucket CI guide](https://help.apify.com/en/articles/6988586-setting-up-continuous-integration-for-apify-actors-on-bitbucket). Set up continuous integration for your Actors using one of these methods: * [Trigger builds with a Webhook](#option-1-trigger-builds-with-a-webhook) * [Set up automated builds and tests with GitHub Actions](#option-2-set-up-automated-builds-and-tests-with-github-actions) Choose the method that best fits your workflow. ## Option 1: Trigger builds with a Webhook[​](#option-1-trigger-builds-with-a-webhook "Direct link to Option 1: Trigger builds with a Webhook") 1. Push your Actor to a GitHub repository. 2. Go to your Actor's detail page in Apify Console, click on the API tab in the top right, then select API Endpoints. Copy the **Build Actor** API endpoint URL. The format is as follows: ``` https://api.apify.com/v2/acts/YOUR-ACTOR-NAME/builds?token=YOUR-TOKEN-HERE&version=0.0&tag=beta&waitForFinish=60 ``` API token: Make sure you select the correct API token from the dropdown. 3. In your GitHub repository, go to Settings > Webhooks > Add webhook. 4. Paste the API URL into the Payload URL field and add the webhook. ![GitHub integration](/assets/images/ci-github-integration-2ee82ac772eb3280155b7027a4259528.png) Now your Actor will automatically rebuild on every push to the GitHub repository. ## Option 2: Set up automated builds and tests with GitHub Actions[​](#option-2-set-up-automated-builds-and-tests-with-github-actions "Direct link to Option 2: Set up automated builds and tests with GitHub Actions") 1.
Push your Actor to a GitHub repository. 2. Get your Apify API token from the [Apify Console](https://console.apify.com/settings/integrations). ![Apify token in app](/assets/images/ci-token-b623bba2addc8778a802a97aabda2ada.png) 3. Add your Apify token to GitHub secrets: 1. Go to your repository > Settings > Secrets and variables > Actions > New repository secret 2. Name the secret and paste in your token ![Add Apify token to secrets](/assets/images/ci-add-token-ff0020fecc1f213927e24987c5145a61.png) 4. Add the Build Actor API endpoint URL to GitHub secrets: 1. Go to your repository > Settings > Secrets and variables > Actions > New repository secret 2. In Apify Console, go to your Actor's detail page, click the API tab in the top right, and then select API Endpoints. Copy the **Build Actor** API endpoint URL. The format is as follows: ``` https://api.apify.com/v2/acts/YOUR-ACTOR-NAME/builds?token=YOUR-TOKEN-HERE&version=0.0&tag=latest&waitForFinish=60 ``` API token: Make sure you select the correct API token from the dropdown. 3. Name the secret and paste in your API endpoint ![Add build Actor URL to secrets](/assets/images/ci-add-build-url-b1d008e3713646e5a9c26de2dc84ba4c.png) 5. Create GitHub Actions workflow files: 1. In your repository, create the `.github/workflows` directory 2. Add `latest.yml`. If you want, you can also add `beta.yml` to build Actors from the develop branch (or other branches). * latest.yml * beta.yml Use your secret names: Make sure to use the exact secret names you set in the previous step. ``` name: Test and build latest version on: push: branches: - master - main jobs: test-and-build: runs-on: ubuntu-latest steps: # Install dependencies and run tests - uses: actions/checkout@v2 - run: npm install && npm run test # Build latest version - uses: distributhor/workflow-webhook@v1 env: webhook_url: ${{ secrets.BUILD_ACTOR_URL }} webhook_secret: ${{ secrets.APIFY_TOKEN }} ``` With this setup, pushing to the `main` or `master` branch tests the code and builds a new latest version. Use your secret names: Make sure to use the exact secret names you set in the previous step. ``` name: Test and build beta version on: push: branches: - develop jobs: test-and-build: runs-on: ubuntu-latest steps: # Install dependencies and run tests - uses: actions/checkout@v2 - run: npm install && npm run test # Build beta version - uses: distributhor/workflow-webhook@v1 env: webhook_url: ${{ secrets.BUILD_ACTOR_URL }} webhook_secret: ${{ secrets.APIFY_TOKEN }} ``` With this setup, pushing to the `develop` branch tests the code and builds a new beta version. ## Conclusion[​](#conclusion "Direct link to Conclusion") Setting up continuous integration (CI) for your Apify Actors ensures that your code is automatically tested and built whenever you push changes to your repository. This helps catch issues early and streamlines your deployment process, whether you're releasing to production or maintaining a beta branch. You can also integrate directly with GitHub; check out the [official Apify GitHub integration documentation](https://docs.apify.com/platform/integrations/github.md). --- # Source types **Learn about Apify Actor source types and how to deploy an Actor from GitHub using CLI or Gist.** *** This section explains the various source types available for Apify Actors and how to deploy an Actor from GitHub using CLI or Gist.
Apify Actors supports four source types: * [Web IDE](#web-ide) * [Git repository](#git-repository) * [Private repositories](#private-repositories) * [How to configure deployment keys](#how-to-configure-deployment-keys) * [Actor monorepos](#actor-monorepos) * [Zip file](#zip-file) * [GitHub Gist](#github-gist) ## Web IDE[​](#web-ide "Direct link to Web IDE") This is the default option when your Actor's source code is hosted on the Apify platform. It offers quick previews and updates to your source code, easy file and directory browsing, and direct testing of the [`INPUT_SCHEMA.json`](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md) on the Apify platform. A `Dockerfile` is mandatory for all Actors. When using the default NodeJS Dockerfile, you'll typically need `main.js` for your source code and `package.json` for [NPM](https://www.npmjs.com/) package configurations. For more information on creating custom Dockerfiles or using Apify's base images, refer to the [Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md#custom-dockerfile) and [base Docker images](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md#base-docker-images) documentation. ## Git repository[​](#git-repository "Direct link to Git repository") [YouTube video player](https://www.youtube-nocookie.com/embed/NEzT_p_RE1Q) Hosting your Actor's source code in a Git repository allows for multiple files and directories, a custom `Dockerfile` for build process control, and a user description fetched from `README.md`. Specify the repository location using the **Git URL** setting with `https`, `git`, or `ssh` protocols. To deploy an Actor from GitHub, set the **Source Type** to **Git repository** and enter the GitHub repository URL in the **Git URL** field. You can optionally specify a branch or tag by adding a URL fragment (e.g., `#develop`). To use a specific directory, add it after the branch/tag, separated by a colon (e.g., `#develop:some/dir`) GitHub integration You can easily set up an integration where the Actor is automatically rebuilt on every commit to the Git repository. For more details, see [GitHub integration](https://docs.apify.com/platform/integrations/github.md). ### Private repositories[​](#private-repositories "Direct link to Private repositories") When using a private Git repository for your Actor's source code, you need to configure a deployment key. This key grants Apify secure, read-only access to your repository. What are deployment keys Deployment keys are unique SSH keys that allow Apify to clone and build your Actor from a private repository. #### How to configure deployment keys[​](#how-to-configure-deployment-keys "Direct link to How to configure deployment keys") To configure the deployment key for your Actor's source code you need to: 1. In your Actor's settings, find the **Git URL** input field 2. Click the **deployment key** link below the input field 3. Follow the instruction to add the key to your Git hosting service. ![Deployment keys instruction window](/assets/images/deployment-keys-5929eb8db9a4de3afa0a87033b621d98.png) By using deployment keys, you enable secure, automated builds of your Actor from private repositories. 
Key usage limitations: Remember that each key can be used only once per Git hosting service (GitHub, Bitbucket, GitLab). ### Actor monorepos[​](#actor-monorepos "Direct link to Actor monorepos") To manage multiple Actors in a single repository, use the `dockerContextDir` property in the [Actor definition](https://docs.apify.com/platform/actors/development/actor-definition/actor-json.md) to set the Docker context directory (if not provided, the repository root is used). In the Dockerfile, copy both the Actor's source and any shared code into the Docker image. To enable sharing Dockerfiles between multiple Actors, the Actor build process passes the `ACTOR_PATH_IN_DOCKER_CONTEXT` build argument to the Docker build. It contains the relative path from `dockerContextDir` to the directory selected as the root of the Actor in the Apify Console (the "directory" part of the Actor's git URL). For an example, see the [`apify/actor-monorepo-example`](https://github.com/apify/actor-monorepo-example) repository. To build Actors from this monorepo, you would set the source URL (including branch name and folder) as `https://github.com/apify/actor-monorepo-example#main:actors/javascript-actor` and `https://github.com/apify/actor-monorepo-example#main:actors/typescript-actor` respectively. ## Zip file[​](#zip-file "Direct link to Zip file") Actors can also use source code from a Zip archive hosted on an external URL. This option supports multiple files and directories, allows for a custom `Dockerfile`, and uses `README.md` for the Actor description. If not using a [custom Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md#custom-dockerfile), ensure your main file is named `main.js`. Automatic use of ZIP file: This source type is used automatically when you deploy with the Apify CLI and the source size exceeds 3 MB. ## GitHub Gist[​](#github-gist "Direct link to GitHub Gist") For smaller projects, GitHub Gist offers a simpler alternative to full Git repositories or hosted Zip files. To use a GitHub Gist, create your Gist at [gist.github.com](https://gist.github.com/), set the **Source type** to **GitHub Gist**, and paste the Gist URL in the provided field. Like other source types, Gists can include multiple files, directories, and a custom Dockerfile. The Actor description is taken from `README.md`. By understanding these source types, you can choose the most appropriate option for hosting and deploying your Apify Actors. Each type offers unique advantages, allowing you to select the best fit for your project's size, complexity, and collaboration needs. --- # Performance **Learn how to get the maximum value out of your Actors, minimize costs, and maximize results.** *** ## Optimization Tips[​](#optimization-tips "Direct link to Optimization Tips") This guide provides tips to help you maximize the performance of your Actors, minimize costs, and achieve optimal results. ### Run batch jobs instead of single jobs[​](#run-batch-jobs-instead-of-single-jobs "Direct link to Run batch jobs instead of single jobs") Running a single job causes the Actor to start and stop for each execution, which is an expensive operation. If your Actor runs a web browser or other resource-intensive dependencies, their startup times further contribute to the cost. To minimize costs, we recommend running batch jobs instead of single jobs. For example, instead of starting an Actor for every URL you want to process, group the URLs into batches and run the Actor once for each batch.
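As an illustration only, here is a minimal sketch of this pattern using the `apify-client` package; the Actor name, batch size, and `startUrls` input shape are placeholders that depend on your own Actor:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const urls = ['https://example.com/page-1', 'https://example.com/page-2' /* , ... */];
const BATCH_SIZE = 100;

// Start one Actor run per batch of URLs instead of one run per URL.
for (let i = 0; i < urls.length; i += BATCH_SIZE) {
    const batch = urls.slice(i, i + BATCH_SIZE);
    // call() starts the run and waits for it to finish.
    await client.actor('username/my-batch-actor').call({
        startUrls: batch.map((url) => ({ url })),
    });
}
```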
This approach reuses the browser instance, resulting in a more cost-efficient implementation. ### Leverage Docker layer caching to speed up builds[​](#leverage-docker-layer-caching-to-speed-up-builds "Direct link to Leverage Docker layer caching to speed up builds") When you build a Docker image, Docker caches the layers that haven't changed. This means that if you modify only a small part of your Dockerfile, Docker doesn't need to rebuild the entire image but only the changed layers. This can save significant time and money. Consider the following Dockerfile: ``` FROM apify/actor-node:16 COPY package*.json ./ RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version \ && rm -r ~/.npm COPY . ./ CMD npm start --silent ``` We first copy the `package.json`, `package-lock.json` files , and install the dependencies before copying the rest of the source code. This way, we can take advantage of Docker's caching mechanism and only install the dependencies when the `package.json` or `package-lock.json` files change, making the build process much faster. Further optimization tips * We recommend using as few layers as possible in your Docker images. This helps to reduce overall image sizes and improve build times. * Use the [dive](https://github.com/wagoodman/dive) CLI tool to analyze the layers of a built Docker image. This tool provides insights into the composition of each layer, allowing you to understand what was added and helps you find ways to minimize their size. ### Use standardized images to accelerate Actor startup times[​](#use-standardized-images-to-accelerate-actor-startup-times "Direct link to Use standardized images to accelerate Actor startup times") Using one of [Apify's standardized images](https://github.com/apify/apify-actor-docker), can accelerate the Actor startup time. These images are cached on each worker machine, so only the layers you added in your Actor's [Dockerfile](https://docs.apify.com/platform/actors/development/actor-definition/dockerfile.md) need to be pulled. --- # Programming interface **Learn about the programming interface of Apify Actors, important commands and features provided by the Apify SDK, and how to use them in your Actors.** *** This chapter will guide you through all the commands you need to build your first Actor. This interface is provided by [Apify SDKs](https://docs.apify.com/sdk.md). The chapter starts with basic commands and guides you through system events and environment variables that are available to your Actor both locally and when running on Apify platform. 
#### [Basic commands](https://docs.apify.com/platform/actors/development/programming-interface/basic-commands.md) [Learn how to use basic commands of the Apify SDK for both JavaScript and Python.](https://docs.apify.com/platform/actors/development/programming-interface/basic-commands.md) #### [Environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md) [Learn how to provide your Actor with context that determines its behavior through a plethora of pre-defined environment variables offered by the Apify SDK.](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md) #### [Status messages](https://docs.apify.com/platform/actors/development/programming-interface/status-messages.md) [Use custom status messages to inform the user about the progress of your Actor run.](https://docs.apify.com/platform/actors/development/programming-interface/status-messages.md) #### [System events](https://docs.apify.com/platform/actors/development/programming-interface/system-events.md) [Learn about system events sent to your Actor and how to benefit from them.](https://docs.apify.com/platform/actors/development/programming-interface/system-events.md) #### [Container web server](https://docs.apify.com/platform/actors/development/programming-interface/container-web-server.md) [Learn how to run a web server inside your Actor, which enables you to communicate with the outside world via UI and API.](https://docs.apify.com/platform/actors/development/programming-interface/container-web-server.md) #### [Metamorph](https://docs.apify.com/platform/actors/development/programming-interface/metamorph.md) [The metamorph operation transforms an Actor run into the run of another Actor with a new input.](https://docs.apify.com/platform/actors/development/programming-interface/metamorph.md) #### [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby.md) [Learn how to use Actors in lightweight mode as a blazingly fast API server.](https://docs.apify.com/platform/actors/development/programming-interface/standby.md) --- # Basic commands **Learn how to use basic commands of the Apify SDK for both JavaScript and Python.** *** This page covers essential commands for the Apify SDK in JavaScript and Python. These commands are designed to be used within a running Actor, either in a local environment or on the Apify platform. ## Initialize your Actor[​](#initialize-your-actor "Direct link to Initialize your Actor") Before using any Apify SDK methods, initialize your Actor. This step prepares the Actor to receive events from the Apify platform, sets up machine and storage configurations, and clears previous local storage states. * JavaScript * Python Use the `init()` method to initialize your Actor. Pair it with `exit()` to properly terminate the Actor. For more information on `exit()`, go to [Exit Actor](#exit-actor). ``` import { Actor } from 'apify'; await Actor.init(); console.log('Actor starting...'); // ... await Actor.exit(); ``` Alternatively, use the `main()` function for environments that don't support top-level await. The `main()` function is syntactic sugar for `init()` and `exit()`. It will call `init()` before it executes its callback and `exit()` after the callback resolves. ``` import { Actor } from 'apify'; Actor.main(async () => { console.log('Actor starting...'); // ... 
}); ``` In Python, use an asynchronous context manager with the `with` keyword. The `init()` method will be called before the code block is executed, and the `exit()` method will be called after the code block is finished. ``` from apify import Actor async def main(): async with Actor: Actor.log.info('Actor starting...') # ... ``` ## Get input[​](#get-input "Direct link to Get input") Access the Actor's input object, which is stored as a JSON file in the Actor's default key-value store. The input is an object with properties. If the Actor defines the input schema, the input object is guaranteed to conform to it. * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); const input = await Actor.getInput(); console.log(input); // prints: {'option1': 'aaa', 'option2': 456} await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: actor_input: dict = await Actor.get_input() or {} Actor.log.info(actor_input) # prints: {'option1': 'aaa', 'option2': 456} ``` Usually, the file is called `INPUT`, but the exact key is defined in the `ACTOR_INPUT_KEY` environment variable. ## Key-value store access[​](#key-value-store-access "Direct link to Key-value store access") Use the [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md) to read and write arbitrary files * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // Save object to store (stringified to JSON) await Actor.setValue('my_state', { something: 123 }); // Save binary file to store with content type await Actor.setValue('screenshot.png', buffer, { contentType: 'image/png' }); // Get a record from the store (automatically parsed from JSON) const value = await Actor.getValue('my_state'); // Access another key-value store by its name const store = await Actor.openKeyValueStore('screenshots-store'); await store.setValue('screenshot.png', buffer, { contentType: 'image/png' }); await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: # Save object to store (stringified to JSON) await Actor.set_value('my_state', {'something': 123}) # Get a record from the store (automatically parsed from JSON) value = await Actor.get_value('my_state') # Log the obtained value Actor.log.info(f'value = {value}') # prints: value = {'something': 123} ``` ## Push results to the dataset[​](#push-results-to-the-dataset "Direct link to Push results to the dataset") Store larger results in a [Dataset](https://docs.apify.com/platform/storage/dataset.md), an append-only object storage Note that Datasets can optionally be equipped with the schema that ensures only certain kinds of objects are stored in them. * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // Append result object to the default dataset associated with the run await Actor.pushData({ someResult: 123 }); await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: # Append result object to the default dataset associated with the run await Actor.push_data({'some_result': 123}) ``` ## Exit Actor[​](#exit-actor "Direct link to Exit Actor") When an Actor's main process terminates, the Actor run is considered finished. The process exit code determines Actor's final status: * Exit code `0`: Status `SUCCEEDED` * Exit code not equal to `0`: Status `FAILED` By default, the platform sets a generic status message like *Actor exit with exit code 0*. However, you can provide more informative message using the SDK's exit methods. 
### Basic exit[​](#basic-exit "Direct link to Basic exit") Use the `exit()` method to terminate the Actor with a custom status message: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... // Actor will finish with 'SUCCEEDED' status await Actor.exit('Succeeded, crawled 50 pages'); ``` ``` from apify import Actor async def main(): async with Actor: # Actor will finish with 'SUCCEEDED' status await Actor.exit(status_message='Succeeded, crawled 50 pages') # INFO Exiting actor ({"exit_code": 0}) # INFO [Terminal status message]: Succeeded, crawled 50 pages ``` ### Immediate exit[​](#immediate-exit "Direct link to Immediate exit") To exit immediately without calling exit handlers: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... // Exit right away without calling `exit` handlers at all await Actor.exit('Done right now', { timeoutSecs: 0 }); ``` ``` from apify import Actor async def main(): async with Actor: # Exit right away without calling `exit` handlers at all await Actor.exit(event_listeners_timeout_secs=0, status_message='Done right now') # INFO Exiting actor ({"exit_code": 0}) # INFO [Terminal status message]: Done right now ``` ### Failed exit[​](#failed-exit "Direct link to Failed exit") To indicate a failed run: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... // Actor will finish with 'FAILED' status await Actor.exit('Could not finish the crawl, try increasing memory', { exitCode: 1 }); ``` ``` from apify import Actor async def main(): async with Actor: # Actor will finish with 'FAILED' status await Actor.exit(status_message='Could not finish the crawl, try increasing memory', exit_code=1) # INFO Exiting actor ({"exit_code": 1}) # INFO [Terminal status message]: Could not finish the crawl, try increasing memory ``` ### Preferred exit methods[​](#preferred-exit-methods "Direct link to Preferred exit methods") The SDK provides convenient methods for exiting Actors: * Use `exit()` with custom messages to inform users about the Actor's achievements or issues. * Use `fail()` as a shortcut for `exit()` when indicating an error. It defaults to an exit code of `1` and emits the `exit` event, allowing components to perform cleanup or state persistence. * The `exit()` method also emits the `exit` event, enabling cleanup or state persistence. Example of a failed exit using a shorthand method: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... // Or nicer way using this syntactic sugar: await Actor.fail('Could not finish the crawl, try increasing memory'); ``` ``` from apify import Actor async def main(): async with Actor: # ... or nicer way using this syntactic sugar: await Actor.fail(status_message='Could not finish the crawl. Try increasing memory') # INFO Exiting actor ({"exit_code": 1}) # INFO [Terminal status message]: Could not finish the crawl. Try increasing memory ``` ### Exit event handlers (JavaScript only)[​](#exit-event-handlers-javascript-only "Direct link to Exit event handlers (JavaScript only)") In JavaScript, you can register handlers for the `exit` event: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // Register a handler to be called on exit. // Note that the handler has `timeoutSecs` to finish its job. Actor.on('exit', ({ statusMessage, exitCode, timeoutSecs }) => { // Perform cleanup... }); await Actor.exit(); ``` ``` # 😔 Custom handlers are not supported in the Python SDK yet. 
``` --- # Container web server **Learn about how to run a web server inside your Actor to enable communication with the outside world through both UI and API.** *** Each Actor run is assigned a unique URL (e.g. `kmdo7wpzlshygi.runs.apify.net`) that allows HTTP access to an optional web server running inside the Actor's Docker container. This feature enhances your Actor's capabilities by enabling external communication. Using Actors as an API The container web server provides a way how to connect to one specific Actor run. To enable using your Actor as an API, with a pre-defined hostname, load balancing and autoscaling, check out [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby.md). ## Access the container URL[​](#access-the-container-url "Direct link to Access the container URL") You can find the container URL in three locations: * In the web application, on the Actor run details page as the **Container URL** field. * In the API as the `containerUrl` property of the [Run object](https://docs.apify.com/api/v2/actor-run-get.md). * In the Actor run's container as the `ACTOR_WEB_SERVER_URL` environment variable. ## Set up the web server[​](#set-up-the-web-server "Direct link to Set up the web server") The web server inside the container must listen on the port specified by the `ACTOR_WEB_SERVER_PORT` environment variable (typically: *4321*). To use a different port: 1. Go to your Actor version configuration 2. Define the `ACTOR_WEB_SERVER_PORT` environment variable with your desired port number. Check out [Custom environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md) for more details. ## Example: Start a simple web server[​](#example-start-a-simple-web-server "Direct link to Example: Start a simple web server") * JavaScript * Python Here's how to start a basic web server in your Actor using Express.js: ``` // npm install express import { Actor } from 'apify'; import express from 'express'; await Actor.init(); const app = express(); const port = process.env.ACTOR_WEB_SERVER_PORT; app.get('/', (req, res) => { res.send('Hello world from Express app!'); }); app.listen(port, () => console.log(`Web server is listening and can be accessed at ${process.env.ACTOR_WEB_SERVER_URL}!`)); // Let the Actor run for an hour await new Promise((r) => setTimeout(r, 60 * 60 * 1000)); await Actor.exit(); ``` Here's how to start a basic web server in your Actor using Flask: ``` # pip install flask import asyncio import os from apify import Actor from apify_shared.consts import ActorEnvVars from flask import Flask async def main(): async with Actor: # Create a Flask app app = Flask(__name__) # Define a route @app.route('/') def hello_world(): return 'Hello world from Flask app!' 
# Log the public URL url = os.environ.get(ActorEnvVars.WEB_SERVER_URL) Actor.log.info(f'Web server is listening and can be accessed at {url}') # Start the web server port = os.environ.get(ActorEnvVars.WEB_SERVER_PORT) app.run(host='0.0.0.0', port=port) ``` --- # Actor environment variables **Learn how to provide your Actor with context that determines its behavior through a plethora of pre-defined environment variables set by the Apify platform.** *** ## How to use environment variables in an Actor[​](#how-to-use-environment-variables-in-an-actor "Direct link to How to use environment variables in an Actor") You can set up environment variables for your Actor in two ways: * [Set up environment variables in `actor.json`](#set-up-environment-variables-in-actorjson) * [Set up environment variables in Apify Console](#set-up-environment-variables-in-apify-console) Environment variable precedence Your local `.actor/actor.json` file overrides variables set in Apify Console. To use Console variables, remove the `environmentVariables` key from the local file. Check out how you can [access environment variables in Actors](#access-environment-variables). ## System environment variables[​](#system-environment-variables "Direct link to System environment variables") Apify sets several system environment variables for each Actor run. These variables provide essential context and information about the Actor's execution environment. Here's a table of key system environment variables: | Environment Variable | Description | | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `ACTOR_ID` | ID of the Actor. | | `ACTOR_FULL_NAME` | Full technical name of the Actor, in the format `owner-username/actor-name`. | | `ACTOR_RUN_ID` | ID of the Actor run. | | `ACTOR_BUILD_ID` | ID of the Actor build used in the run. | | `ACTOR_BUILD_NUMBER` | Build number of the Actor build used in the run. | | `ACTOR_BUILD_TAGS` | A comma-separated list of tags of the Actor build used in the run. Note that this environment variable is assigned at the time of start of the Actor and doesn't change over time, even if the assigned build tags change. | | `ACTOR_TASK_ID` | ID of the Actor task. Empty if Actor is run outside of any task, e.g. directly using the API. | | `ACTOR_EVENTS_WEBSOCKET_URL` | Websocket URL where Actor may listen for [events](https://docs.apify.com/platform/actors/development/programming-interface/system-events.md) from Actor platform. | | `ACTOR_DEFAULT_DATASET_ID` | Unique identifier for the default dataset associated with the current Actor run. | | `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | Unique identifier for the default key-value store associated with the current Actor run. | | `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | Unique identifier for the default request queue associated with the current Actor run. | | `ACTOR_INPUT_KEY` | Key of the record in the default key-value store that holds the [Actor input](https://docs.apify.com/platform/actors/running/input-and-output.md#input). | | `ACTOR_MAX_PAID_DATASET_ITEMS` | For paid-per-result Actors, the user-set limit on returned results. Do not exceed this limit. | | `ACTOR_MAX_TOTAL_CHARGE_USD` | For pay-per-event Actors, the user-set limit on run cost. Do not exceed this limit. 
| | `APIFY_HEADLESS` | If **1**, web browsers inside the Actor should run in headless mode (no windowing system available). | | `APIFY_IS_AT_HOME` | Contains **1** if the Actor is running on Apify servers. | | `ACTOR_MEMORY_MBYTES` | Size of memory allocated for the Actor run, in megabytes. Can be used to optimize memory usage or finetuning of low-level external libraries. | | `APIFY_PROXY_PASSWORD` | Password for accessing Apify Proxy services. This password enables the Actor to utilize proxy servers on behalf of the user who initiated the Actor run. | | `APIFY_PROXY_PORT` | TCP port number to be used for connecting to the Apify Proxy. | | `APIFY_PROXY_STATUS_URL` | URL for retrieving proxy status information. Appending `?format=json` to this URL returns the data in JSON format for programmatic processing. | | `ACTOR_STANDBY_URL` | URL for accessing web servers of Actor runs in the [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby.md) mode. | | `ACTOR_STARTED_AT` | Date when the Actor was started. | | `ACTOR_TIMEOUT_AT` | Date when the Actor will time out. | | `APIFY_TOKEN` | API token of the user who started the Actor. | | `APIFY_USER_ID` | ID of the user who started the Actor. May differ from the Actor owner. | | `APIFY_USER_IS_PAYING` | If it is `1`, it means that the user who started the Actor is a paying user. | | `ACTOR_WEB_SERVER_PORT` | TCP port for the Actor to start an HTTP server on. This server can be used to receive external messages or expose monitoring and control interfaces. The server also receives messages from the [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby.md) mode. | | `ACTOR_WEB_SERVER_URL` | Unique public URL for accessing the Actor run web server from the outside world. | | `APIFY_API_PUBLIC_BASE_URL` | Public URL of the Apify API. May be used to interact with the platform programmatically. Typically set to `api.apify.com`. | | `APIFY_DEDICATED_CPUS` | Number of CPU cores reserved for the actor, based on allocated memory. | | `APIFY_DISABLE_OUTDATED_WARNING` | Controls the display of outdated version warnings. Set to `1` to suppress notifications about updates. | | `APIFY_WORKFLOW_KEY` | Identifier used for grouping related runs and API calls together. | | `APIFY_META_ORIGIN` | Specifies how an Actor run was started. Possible values are in [Runs and builds](https://docs.apify.com/platform/actors/running/runs-and-builds.md#origin) documentation. | | `APIFY_SDK_LATEST_VERSION` | Specifies the most recent release version of the Apify SDK for JavaScript. Used for checking for updates. | | `APIFY_INPUT_SECRETS_KEY_FILE` | Path to the secret key used to decrypt [Secret inputs](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/secret-input.md). | | `APIFY_INPUT_SECRETS_KEY_PASSPHRASE` | Passphrase for the input secret key specified in `APIFY_INPUT_SECRETS_KEY_FILE`. | Date format All date-related variables use the UTC timezone and are in [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format (e.g., *2022-07-13T14:23:37.281Z*). ## Set up environment variables in `actor.json`[​](#set-up-environment-variables-in-actorjson "Direct link to set-up-environment-variables-in-actorjson") Actor owners can define custom environment variables in `.actor/actor.json`. All keys from `environmentVariables` will be set as environment variables into the Apify platform after you push Actor to Apify. 
``` { "actorSpecification": 1, "name": "dataset-to-mysql", "version": "0.1", "buildTag": "latest", "environmentVariables": { "MYSQL_USER": "my_username", } } ``` Git-workflow with actor.json Be aware that if you define `environmentVariables` in `.actor/actor.json`, it only works with [Apify CLI](https://docs.apify.com/cli). If you use a Git workflow for Actor development, the environment variables will not be set from `.actor/actor.json` and you need to define them in Apify Console. ## Set up environment variables in Apify Console[​](#set-up-environment-variables-in-apify-console "Direct link to Set up environment variables in Apify Console") Actor owners can define custom environment variables to pass additional configuration to their Actors. To set custom variables: 1. Go to your Actor's **Source** page in the Apify Console 2. Navigate to the **Environment variables** section. 3. Add your custom variables. For sensitive data like API keys or passwords, enable the **Secret** option. This will encrypt the value and redact it from logs to prevent accidental exposure. Build-time variables Once you start a build, you cannot change its environment variables. To use different variables, you must create a new build. Learn more in [Builds](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md). ## Access environment variables[​](#access-environment-variables "Direct link to Access environment variables") You can access environment variables in your code as follows: * JavaScript * Python In Node.js, use the `process.env` object: ``` import { Actor } from 'apify'; await Actor.init(); // get MYSQL_USER const mysql_user = process.env.MYSQL_USER // print MYSQL_USER to console console.log(mysql_user); await Actor.exit(); ``` In Python, use the `os.environ` dictionary: ``` import os print(os.environ['MYSQL_USER']) from apify import Actor async def main(): async with Actor: # get MYSQL_USER mysql_user = os.environ['MYSQL_USER'] # print MYSQL_USER to console print(mysql_user) ``` ## Use the `Configuration` class[​](#use-the-configuration-class "Direct link to use-the-configuration-class") For more convenient access to Actor configuration, use the [`Configuration`](https://docs.apify.com/sdk/js/reference/class/Configuration) class * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // get current token const token = Actor.config.get('token'); // use different token Actor.config.set('token', 's0m3n3wt0k3n'); await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: old_token = Actor.config.token Actor.log.info(f'old_token = {old_token}') # use different token Actor.config.token = 's0m3n3wt0k3n' new_token = Actor.config.token Actor.log.info(f'new_token = {new_token}') ``` ## Build-time environment variables[​](#build-time-environment-variables "Direct link to Build-time environment variables") You can also use environment variables during the Actor's build process. In this case, they function as Docker build arguments. To use them in your Dockerfile, include `ARG` instruction: ``` ARG MY_BUILD_VARIABLE RUN echo $MY_BUILD_VARIABLE ``` Variables set during the build Build-time environment variables are not suitable for secrets, as they are not encrypted. By leveraging environment variables effectively, you can create more flexible and configurable Actors that adapt to different execution contexts and user requirements. 
--- # Metamorph **The metamorph operation transforms an Actor run into the run of another Actor with a new input.** *** ## Transform Actor runs[​](#transform-actor-runs "Direct link to Transform Actor runs") Metamorph is a powerful operation that transforms an Actor run into the run of another Actor with a new input. This feature enables you to leverage existing Actors and create more efficient workflows. ## Understand metamorph[​](#understand-metamorph "Direct link to Understand metamorph") The metamorph process involves several key steps. It stops the current Actor's Docker container, then starts a new container using a different Docker image. During this transition, all default storages are preserved. The new input is stored under the *INPUT-METAMORPH-1* key in the default key-value store, ensuring seamless data transfer between Actor runs. ## Benefits of metamorph[​](#benefits-of-metamorph "Direct link to Benefits of metamorph") Metamorph offers several benefits for developers: * Seamless transition between Actors without starting a new run * Building new Actors on top of existing ones * Providing users with an improved input structure and interface * Maintaining transparency for end-users These benefits make metamorph a valuable tool for creating complex, efficient workflows. ## Implementation guidelines[​](#implementation-guidelines "Direct link to Implementation guidelines") To make your Actor compatible with metamorph, use `Actor.getInput()` instead of `Actor.getValue('INPUT')`. This method fetches the input using the correct key (*INPUT-METAMORPH-1*) for metamorphed runs, ensuring proper data retrieval in transformed Actor runs. Runtime limits: There's a limit to how many times you can metamorph a single run. Refer to the [Actor runtime limits](https://docs.apify.com/platform/limits.md#actor-limits) for more details. ## Example[​](#example "Direct link to Example") Let's walk through an example of using metamorph to create a hotel review scraper: 1. Create an Actor that accepts a hotel URL as input. 2. Use the [apify/web-scraper](https://apify.com/apify/web-scraper) Actor to scrape reviews. 3. Use the metamorph operation to transform into a run of apify/web-scraper. * JavaScript * Python Here's the JavaScript code to achieve this: ``` import { Actor } from 'apify'; await Actor.init(); // Get input of your Actor. const { hotelUrl } = await Actor.getInput(); // Create input for apify/web-scraper const newInput = { startUrls: [{ url: hotelUrl }], pageFunction: () => { // Here you pass the page function that // scrapes all the reviews ... }, // ... and here would be all the additional // input parameters. }; // Transform the Actor run to apify/web-scraper // with the new input. await Actor.metamorph('apify/web-scraper', newInput); // The line here will never be reached, because the // Actor run will be interrupted. await Actor.exit(); ``` Here's the Python code to achieve this: ``` from apify import Actor async def main(): async with Actor: # Get input of your Actor actor_input = await Actor.get_input() or {} # Create input for apify/web-scraper new_input = { 'startUrls': [{'url': actor_input['url']}], 'pageFunction': """ # Here you pass the page function that # scrapes all the reviews ... """, # ... 
and here would be all the additional input parameters } # Transform the Actor run to apify/web-scraper with the new input await Actor.metamorph('apify/web-scraper', new_input) # The line here will never be reached, because the Actor run will be interrupted Actor.log.info('This should not be printed') ``` By following these steps, you can create a powerful hotel review scraper that leverages the capabilities of existing Actors through the metamorph operation. --- # Standby mode **Use Actors as an API server for fast response times.** *** Traditional Actors are designed to run a single task and then stop. They're mostly intended for batch jobs, such as when you need to perform a large scrape or data processing task. However, in some applications, waiting for an Actor to start is not an option. Actor Standby mode solves this problem by letting you have the Actor ready in the background, waiting for the incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or standard API server. ## Developing Actors using Standby mode[​](#developing-actors-using-standby-mode "Direct link to Developing Actors using Standby mode") The best way to start developing Standby Actors is to use the predefined templates in the [Console UI](https://console.apify.com/actors/templates) or in [CLI](https://docs.apify.com/cli/) via `apify create`. The templates contain minimal code to get you up to speed for development in JavaScript, TypeScript or Python. Standby mode will automatically be enabled with default settings. If you already have an existing Actor, or you just want to tweak the configuration of Standby mode, you can head to the Settings tab of your Actor, where the Actor Standby settings are located. ![Standby for creators](/assets/images/standby-creators-a4633d8d11b7d7b016ddb197cd838b53.png) Actors using Standby mode must run a HTTP server listening on a specific port. The user requests will then be proxied to the HTTP server. You can use any of the existing [HTTP request methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) like GET, POST, PUT, DELETE, etc. You can pass the input via [HTTP request query string](https://en.wikipedia.org/wiki/Query_string) or via [HTTP request body](https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages#body). Sometimes, you want the HTTP server to listen on a specific port and cannot change it yourself. You can use `ACTOR_WEB_SERVER_PORT` environment variable to override the port so that Actor Standby will work with your code. You can get the port using the Actor configuration available in Apify SDK. See example below with a simple Actor using Standby mode. * JavaScript * Python ``` import http from 'http'; import { Actor } from 'apify'; await Actor.init(); const server = http.createServer((req, res) => { res.writeHead(200, { 'Content-Type': 'text/plain' }); res.end('Hello from Actor Standby!\n'); }); server.listen(Actor.config.get('containerPort')); ``` ``` from http.server import HTTPServer, SimpleHTTPRequestHandler from apify import Actor class GetHandler(SimpleHTTPRequestHandler): def do_GET(self): self.send_response(200) self.end_headers() self.wfile.write(b'Hello from Actor Standby!') async def main() -> None: async with Actor: with HTTPServer(('', Actor.config.web_server_port), GetHandler) as http_server: http_server.serve_forever() ``` Please make sure to describe your Actors, their endpoints, and the schema for their inputs and outputs in your README. 
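On the caller's side, invoking a Standby Actor is an ordinary HTTP request. Below is a minimal sketch for illustration; the URL is a placeholder for your own Standby Actor's URL, and depending on the Actor's settings you may also need to authenticate the request (for example with your Apify API token):

```
// Placeholder URL – use the URL of your own Standby Actor.
const standbyUrl = 'https://my-standby-actor.example.com';

// Pass input via the query string ...
const res1 = await fetch(`${standbyUrl}/?query=hello`);
console.log(await res1.text());

// ... or via the request body.
const res2 = await fetch(standbyUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query: 'hello' }),
});
console.log(await res2.text());
```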
### Readiness probe[​](#readiness-probe "Direct link to Readiness probe") Before Actor standby runs are ready to serve requests, the Apify platform checks the web server's readiness using a readiness probe. The platform sends a GET request to the path `/` with a header `x-apify-container-server-readiness-probe`. If the header is present in the request, you can perform an early return with a simple response to prevent wasting resources. Return a response You must return a response; otherwise, the Actor run will never be marked as ready and won't process requests. See example code below that distinguishes between "normal" and "readiness probe" requests. * JavaScript * Python ``` import http from 'http'; import { Actor } from 'apify'; await Actor.init(); const server = http.createServer((req, res) => { res.writeHead(200, { 'Content-Type': 'text/plain' }); if (req.headers['x-apify-container-server-readiness-probe']) { console.log('Readiness probe'); res.end('Hello, readiness probe!\n'); } else { console.log('Normal request'); res.end('Hello from Actor Standby!\n'); } }); server.listen(Actor.config.get('standbyPort')); ``` ``` from http.server import HTTPServer, SimpleHTTPRequestHandler from apify import Actor class GetHandler(SimpleHTTPRequestHandler): def do_GET(self) -> None: self.send_response(200) self.end_headers() if self.headers['x-apify-container-server-readiness-probe']: print('Readiness probe') self.wfile.write(b'Hello, readiness probe!') else: print('Normal request') self.wfile.write(b'Hello, normal request!') async def main() -> None: async with Actor: with HTTPServer(('', Actor.config.standby_port), GetHandler) as http_server: http_server.serve_forever() ``` ## Determining an Actor is started in Standby[​](#determining-an-actor-is-started-in-standby "Direct link to Determining an Actor is started in Standby") Actors that support Actor Standby can still be started in standard mode, for example from the Console or via the API. To find out in which mode was the Actor started, you can read the `metaOrigin` option in `Actor.config`, or the `APIFY_META_ORIGIN` environment variable in case you're not using the Apify SDK. If it is equal to `STANDBY`, the Actor was started in Standby mode, otherwise it was started in standard mode. * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); if (Actor.config.get('metaOrigin') === 'STANDBY') { // Start your Standby server here } else { // Perform the standard Actor operations here } ``` ``` from apify import Actor async def main() -> None: async with Actor: if Actor.config.meta_origin == 'STANDBY': # Start your Standby server here else: # Perform the standard Actor operations here ``` ## Timeouts[​](#timeouts "Direct link to Timeouts") When you send a request to an Actor in Standby mode, the total timeout for receiving the first response is *5 minutes*. Before the platform forwards the request to a specific Actor run, it performs a *run selection* process to determine the specific Actor run that will handle it. This process has internal timeout of *2 minutes*. ## Getting the URL of the Standby Actor[​](#getting-the-url-of-the-standby-actor "Direct link to Getting the URL of the Standby Actor") The URL is exposed as an environment variable `ACTOR_STANDBY_URL`. You can also use `Actor.config`, where the `standbyUrl` option is available. 
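For example, a minimal sketch of reading the URL from inside the Actor with the JavaScript SDK:

```
import { Actor } from 'apify';

await Actor.init();

// Both of these point to the public URL of the Standby Actor.
console.log(`From config: ${Actor.config.get('standbyUrl')}`);
console.log(`From env var: ${process.env.ACTOR_STANDBY_URL}`);

await Actor.exit();
```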
## Monetization of Actors with Standby mode[​](#monetization-of-actors-with-the-standby-mode "Direct link to Monetization of Actors with Standby mode") You can monetize Standby Actors just like any other Actor. For best results with Standby workflows, use the pay-per-event monetization model. When monetizing your Actor in Standby mode using the pay-per-event model, you are not responsible for covering the platform usage costs of your users' runs. Users will need to cover both the platform usage costs (paid to Apify) and event costs (paid to you). --- # Status messages **Learn how to use custom status messages to inform users about an Actor's progress.** *** Each Actor run has a status, represented by the `status` field. The following table describes the possible values: | Status | Type | Description | | ------------ | ------------ | ------------------------------------------- | | `READY` | initial | Started but not allocated to any worker yet | | `RUNNING` | transitional | Executing on a worker | | `SUCCEEDED` | terminal | Finished successfully | | `FAILED` | terminal | Run failed | | `TIMING-OUT` | transitional | Timing out now | | `TIMED-OUT` | terminal | Timed out | | `ABORTING` | transitional | Being aborted by user | | `ABORTED` | terminal | Aborted by user | ## Status messages[​](#status-messages "Direct link to Status messages") In addition to the status, each Actor run has a status message (the `statusMessage` field). This message informs users about the Actor's current activity, enhancing the user experience. ![Status message](/assets/images/status-message-5a087d1952b6d9050f089ca946bffba5.png) ## Exit status message[​](#exit-status-message "Direct link to Exit status message") When an Actor exits, the status message is set to either: * A default text (e.g., *Actor finished with exit code 1*) * A custom message (see the [exit](https://docs.apify.com/platform/actors/development/programming-interface/basic-commands.md#exit-actor) method for details) ## Update status message[​](#update-status-message "Direct link to Update status message") To keep users informed during the Actor's execution, update the status message periodically. Use the following code to set a status message: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... await Actor.setStatusMessage('Crawled 45 of 100 pages'); await Actor.exit(); ``` Update frequency You can call the `setStatusMessage` function as often as needed. The SDK only invokes the API if the status message changes, simplifying usage. ``` from apify import Actor async def main(): async with Actor: await Actor.set_status_message('Crawled 45 of 100 pages') # INFO [Status message]: Crawled 45 of 100 pages ``` --- # System events in Apify Actors **Learn about system events sent to your Actor and how to benefit from them.** *** ## Understand system events[​](#understand-system-events "Direct link to Understand system events") Apify's system notifies Actors about various events, such as: * Migration to another server * Abort operations triggered by another Actor * CPU overload These events help you manage your Actor's behavior and resources effectively.
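One common pattern is persisting the Actor's working state whenever the platform emits the `persistState` event (described in the table below), so that a migrated or resurrected run can resume where it left off. A minimal JavaScript sketch; the key name and state shape are illustrative only:

```
import { Actor } from 'apify';

await Actor.init();

// Illustrative state object - in a real Actor this would be your own progress data.
const state = { processedUrls: 0 };

// Save the state to the default key-value store whenever the platform asks for it.
Actor.on('persistState', async () => {
    await Actor.setValue('CRAWLING_STATE', state);
});

// ... the Actor's main work would update `state` here ...

await Actor.exit();
```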
## System events[​](#system-events "Direct link to System events") The following table outlines the system events available: | Event name | Payload | Description | | -------------- | ------------------------------ | ----------------------------------------------------------------------------------------------------------- | | `cpuInfo` | `{ isCpuOverloaded: Boolean }` | Emitted approximately every second, indicating whether the Actor is using maximum available CPU resources. | | `migrating` | `{ timeRemainingSecs: Float }` | Signals that the Actor will soon migrate to another worker server on the Apify platform. | | `aborting` | N/A | Triggered when a user initiates a graceful abort of an Actor run, allowing time for cleanup. | | `persistState` | `{ isMigrating: Boolean }` | Emitted at regular intervals (default: *60 seconds*) to notify Apify SDK components to persist their state. | ## How system events work[​](#how-system-events-work "Direct link to How system events work") Actors receive system events through a WebSocket connection. The address is specified by the `ACTOR_EVENTS_WEBSOCKET_URL` environment variable. Messages are sent in JSON format with the following structure: ``` { // Event name name: String, // Time when the event was created, in ISO format createdAt: String, // Optional object with payload data: Object, } ``` Virtual events Some events, such as `persistState`, are generated virtually at the Actor SDK level rather than sent via WebSocket. ## Handle system events[​](#handle-system-events "Direct link to Handle system events") To work with system events in your Actor, use the following methods: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // An example named handler function const handler = (data) => console.log(data); // Add event handler Actor.on('cpuInfo', (data) => { if (data.isCpuOverloaded) console.log('Oh no, we need to slow down!'); }); // Remove all handlers for a specific event Actor.off('systemInfo'); // Remove a specific event handler Actor.off('systemInfo', handler); await Actor.exit(); ``` ``` from apify import Actor from apify_shared.consts import ActorEventTypes def handler_foo(arg: dict): Actor.log.info(f'handler_foo: arg = {arg}') def handler_boo(arg: dict): pass async def main(): async with Actor: # Add event handler Actor.on(ActorEventTypes.ABORTING, handler_foo) # Remove all handlers for a specific event Actor.off('systemInfo') # Remove a specific event handler Actor.off('systemInfo', handler_boo) ``` By utilizing these system events, you can create more robust and efficient Actors that respond dynamically to changes in their environment. --- # Quick start **Create your first Actor using the Apify Web IDE or locally in your IDE.** *** Before you build Before you start building your own Actor, try out a couple of existing Actors from [Apify Store](https://apify.com/store). See the [Running Actors](https://docs.apify.com/platform/actors/running.md) section for more information on running existing Actors. ## Technology stack[​](#technology-stack "Direct link to Technology stack") Any code that can run inside a Docker container can be turned into an Apify [Actor](https://docs.apify.com/platform/actors.md). This gives you freedom in choosing your technical stack, including programming language and technologies. But to fully benefit from running on top of the Apify platform, we recommend you choose either JavaScript/Node.js or Python, where Apify provides first-level support regarding its SDK, API clients, and learning materials.
For these languages, you can also [choose from many code templates](https://apify.com/templates) that help you to kickstart your project quickly. ![Actor Templates](/assets/images/templates-ac1ea37affbfcdf353be5f9049a5ea9e.png) ## Development paths[​](#development-paths "Direct link to Development paths") You can develop an Actor in two ways: ### Local development[​](#local-development "Direct link to Local development") Develop your Actor locally in your IDE and only deploy to the Apify platform when it is production-ready. This way, you benefit from your local setup for a better development and debugging experience. After you are done with the development, you can [deploy](https://docs.apify.com/platform/actors/development/deployment.md) your Actor to the Apify platform. #### [Start Locally in Your IDE](https://docs.apify.com/platform/actors/development/quick-start/locally.md) ### Web IDE[​](#web-ide "Direct link to Web IDE") Using the Web IDE in [Apify Console](https://console.apify.com). This is the fastest way to kick-start your Actor development and try out the Apify platform. #### [Start in Apify Web IDE](https://docs.apify.com/platform/actors/development/quick-start/web-ide.md) ## Build with AI[​](#build-with-ai "Direct link to Build with AI") Use the Apify toolset to build with AI: #### [Build with AI](https://docs.apify.com/platform/actors/development/quick-start/build-with-ai.md) --- # Build with AI **Set up your environment, choose tools, and build workflows for effective AI development.** *** This guide explains how to enhance your development workflow by leveraging `llms.txt` and `llms-full.txt` with an AI assistant. You will learn how to provide foundational context and further refine the AI's behavior with specific rules and instruction files. ## `llms.txt` and `llms-full.txt`[​](#llmstxt-and-llms-fulltxt "Direct link to llmstxt-and-llms-fulltxt") Search engines weren't built for Large Language Models (LLMs), but LLMs still need context. That's why we've created [`llms.txt`](https://docs.apify.com/llms.txt) and [`llms-full.txt`](https://docs.apify.com/llms-full.txt) for our documentation. These files follow the [growing standard](https://llmstxt.org/) for LLM consumption. | File | Purpose | | --------------- | ------------------------------------------------------------------------------------ | | `llms.txt` | Contains an index of the documentation in Markdown, with links to all subpages in Markdown. | | `llms-full.txt` | Contains a full dump of the documentation in Markdown. | ### Use llms.txt and llms-full.txt[​](#use-llmstxt-and-llms-fulltxt "Direct link to Use llms.txt and llms-full.txt") LLMs don't automatically discover `llms.txt` files, so you need to add the link manually. Some tools like [Cursor](https://www.cursor.com/) provide settings for this. #### Cursor[​](#cursor "Direct link to Cursor") Go to: **Settings -> Cursor Settings -> Indexing & Docs -> Add Doc**. Now, you can just provide the link to Apify's `llms-full.txt`: ``` https://docs.apify.com/llms-full.txt ``` ![Add llms-full.txt to Cursor](/assets/images/cursor-433657f7a74e093dbd76252df3dd3d8d.png) #### Windsurf[​](#windsurf "Direct link to Windsurf") Open Windsurf Cascade, and add context via `@web`: ![Add llms-full.txt to Windsurf](/assets/images/windsurf-2f0762edec815904009aad77b4bff8b4.png) Windsurf @docs Windsurf provides the `@docs` command, but you cannot customize it. This means that you cannot add your own documentation.
#### GitHub Copilot[​](#github-copilot "Direct link to GitHub Copilot") Open Copilot Chat mode, and add context via `#fetch`: ![Add llms.txt to Copilot](/assets/images/github-copilot-93549aba461bdd01cb70eba9137b7a53.png) GitHub Copilot and documentation Similar to Windsurf, GitHub Copilot does not provide an option for adding your own documentation. #### Ask AI[​](#ask-ai "Direct link to Ask AI") New to Apify? Ask questions and provide the `llms.txt` link. Popular AI models can search the web. With the right context, you get better answers: ![Ask about Apify](/assets/images/claude-148a7a0ab7be8a7a91432db054951ac5.png) ## View pages as markdown[​](#view-pages-as-markdown "Direct link to View pages as markdown") Every page in the Apify documentation is available in markdown format. Simply add `.md` to any documentation URL to access the raw markdown content. This is useful when you want to: * Copy documentation content for AI prompts * Reference exact markdown syntax * Access content programmatically ## Add rules[​](#add-rules "Direct link to Add rules") To get the most from AI Code Editors, add rules or instructions. See how to set up rules for your AI Code Editors: * [Cursor Rules](https://docs.cursor.com/en/context/rules) * [Windsurf Rules](https://docs.windsurf.com/windsurf/cascade/memories#rules) * [GitHub Copilot instructions](https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions) ### Apify rules and instructions[​](#apify-rules-and-instructions "Direct link to Apify rules and instructions") The best rules and instructions are specific to individual use cases, but we provide a pre-built prompt to help you get started with creating rules and instructions faster. Keep your rules and instructions up to date As the project evolves, you might introduce new things. Always make sure that your rules and instructions are up to date. ## Best practices[​](#best-practices "Direct link to Best practices") * *Small tasks*: Don't ask AI for many tasks at once. Break complex problems into smaller pieces. Solve them step by step. * *Iterative approach*: Work iteratively with clear steps. Start with a basic implementation and gradually add complexity. * *Versioning*: Version your changes often using git. This lets you track changes, roll back if needed, and maintain a clear history. * *Security*: Don't expose API keys, secrets, or sensitive information in your code or conversations with LLM assistants. --- # Local development **Create your first Actor locally on your machine, deploy it to the Apify platform, and run it in the cloud.** *** Use the pre-built prompt to get started faster. ## What you'll learn[​](#what-youll-learn "Direct link to What you'll learn") This guide walks you through the full lifecycle of an Actor. You'll start by creating and running it locally with the Apify CLI, then learn to configure its input and data storage. Finally, you will deploy the Actor to the Apify platform, making it ready to run in the cloud. ### Prerequisites[​](#prerequisites "Direct link to Prerequisites") * [Node.js](https://nodejs.org/en/) version 16 or higher with `npm` installed on your computer. * The [Apify CLI](https://docs.apify.com/cli/docs/installation) installed (see the install example below). * Optional: To deploy your Actor, [sign in](https://console.apify.com/sign-in).
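If you don't have the Apify CLI installed yet, one way to install it globally is via npm (the same command appears later in this document; see the CLI installation docs for other options):

```
npm -g install apify-cli
```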
### Step 1: Create your Actor[​](#step-1-create-your-actor "Direct link to Step 1: Create your Actor") Use the Apify CLI to create a new Actor: ``` apify create ``` The CLI will ask you to: 1. Name your Actor (e.g., `your-actor-name`) 2. Choose a programming language (`JavaScript`, `TypeScript`, or `Python`) 3. Select a development template Explore Actor templates Browse the [full list of templates](https://apify.com/templates) to find the best fit for your Actor. The CLI will: * Create a `your-actor-name` directory with boilerplate code * Install all project dependencies Now, you can navigate to your new Actor directory: ``` cd your-actor-name ``` ### Step 2: Run your Actor[​](#step-2-run-your-actor "Direct link to Step 2: Run your Actor") Run your Actor with: ``` apify run ``` Clear data with --purge During development, use `apify run --purge`. This clears all results from previous runs, so it's as if you're running the Actor for the first time. You'll see output similar to this in your terminal: ``` INFO System info {"apifyVersion":"3.4.3","apifyClientVersion":"2.12.6","crawleeVersion":"3.13.10","osType":"Darwin","nodeVersion":"v22.17.0"} Extracted heading { level: 'h1', text: 'Your full‑stack platform for web scraping' } Extracted heading { level: 'h3', text: 'TikTok Scraper' } Extracted heading { level: 'h3', text: 'Google Maps Scraper' } Extracted heading { level: 'h3', text: 'Instagram Scraper' } ``` As you can see in the logs, the Actor extracts text from a web page. The main logic lives in `src/main.js`. Depending on your template, this file may be `src/main.ts` (TypeScript) or `src/main.py` (Python). In the next step, we'll explore the results in more detail. ### Step 3: Explore the Actor[​](#step-3-explore-the-actor "Direct link to Step 3: Explore the Actor") Let's explore the Actor structure. #### The `.actor` folder[​](#the-actor-folder "Direct link to the-actor-folder") The `.actor` folder contains the Actor configuration. The `actor.json` file defines the Actor's name, description, and other settings. Find more info in the [actor.json](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) definition. #### Actor's `input`[​](#actors-input "Direct link to actors-input") Each Actor accepts an `input object` that tells it what to do. The object uses JSON format and lives in `storage/key_value_stores/default/INPUT.json`. Edit the schema to change input To change the `INPUT.json`, edit the `input_schema.json` in the `.actor` folder first. This JSON Schema validates input automatically (no error handling needed), powers the Actor's user interface, generates API docs, and enables smart integration with tools like Zapier or Make by auto-linking input fields. Find more info in the [Input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema.md) documentation. #### Actor's `storage`[​](#actors-storage "Direct link to actors-storage") The Actor system provides two storage types for files and results: the [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) and the [dataset](https://docs.apify.com/platform/storage/dataset.md). ##### Key-value store[​](#key-value-store "Direct link to Key-value store") The key-value store saves and reads files or data records. Key-value stores work well for screenshots, PDFs, or persisting Actor state as JSON files. ##### Dataset[​](#dataset "Direct link to Dataset") The dataset stores a series of data objects from web scraping, crawling, or data processing jobs.
You can export datasets to JSON, CSV, XML, RSS, Excel, or HTML formats. #### Actor's `output`[​](#actors-output "Direct link to actors-output") You define the Actor output using the Output schema files: * [Dataset Schema Specification](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema.md) * [Key-value Store Schema Specification](https://docs.apify.com/platform/actors/development/actor-definition/key-value-store-schema.md) The system uses this to generate an immutable JSON file that tells users where to find the Actor's results. ### Step 4: Deploy your Actor[​](#step-4-deploy-your-actor "Direct link to Step 4: Deploy your Actor") Let's now deploy your Actor to the Apify platform, where you can run the Actor on a scheduled basis, or you can make the Actor public for other users. 1. Log in first: ``` apify login ``` Your Apify token location After you successfully log in, your Apify token is stored in `~/.apify/auth.json`, or `C:\Users\\.apify` based on your system. 2. Push your Actor to the Apify platform: ``` apify push ``` ### Step 5: It's Time to Iterate\![​](#step-5-its-time-to-iterate "Direct link to Step 5: It's Time to Iterate!") Good job! 🎉 You're ready to develop your Actor. You can make changes to your Actor and implement your use case. ## Next steps[​](#next-steps "Direct link to Next steps") * Visit the [Apify Academy](https://docs.apify.com/academy.md) to access a comprehensive collection of tutorials, documentation, and learning resources. * To understand Actors in detail, read the [Actor Whitepaper](https://whitepaper.actor/). * Check [Continuous integration](https://docs.apify.com/platform/actors/development/deployment/continuous-integration.md) documentation to automate your Actor development process. * After you finish building your first Actor, you can [share it with other users and even monetize it](https://docs.apify.com/platform/actors/publishing.md). --- # Web IDE **Create your first Actor using the web IDE in Apify Console.** *** ## What you'll learn[​](#what-youll-learn "Direct link to What you'll learn") This guide walks you through the full lifecycle of an Actor in Apify Console. You'll create an Actor from a template, build and run it in the cloud, and learn how to pull its source code to your local machine for further development. ### Prerequisites[​](#prerequisites "Direct link to Prerequisites") * An Apify account. [Sign up for a free account](https://console.apify.com/sign-up) on the Apify website. ### Step 1: Create your Actor[​](#step-1-create-your-actor "Direct link to Step 1: Create your Actor") Log in to [Apify Console](https://console.apify.com), navigate to [**My Actors**](https://console.apify.com/actors/development/my-actors), then click the **Develop new** button. ![Create Actor](/assets/images/create-actor-69b32bf8ad2b2173628df60685cb9969.png) You'll see Actor development templates for `JavaScript`, `TypeScript`, and `Python`. These templates provide boilerplate code and a preconfigured environment. Choose the template that best suits your needs. For the following demo, we'll proceed with **Crawlee + Puppeteer + Chrome**. Explore Actor templates Browse the [full list of templates](https://apify.com/templates) to find the best fit for your Actor. ![Templates](/assets/images/actor-templates-b19999bcbcd98ba04d9c66f73632c38f.png) After choosing the template, your Actor will be automatically named and you'll be redirected to its page.
### Step 2: Explore the Actor[​](#step-2-explore-the-actor "Direct link to Step 2: Explore the Actor") The provided boilerplate code utilizes the [Apify SDK](https://docs.apify.com/sdk/js/) combined with [Crawlee](https://crawlee.dev/), Apify's popular open-source Node.js web scraping library. By default, the code crawls the [apify.com](https://apify.com) website, but you can change it to any website. Crawlee [Crawlee](https://crawlee.dev/) is an open-source Node.js library designed for web scraping and browser automation. It helps you build reliable crawlers quickly and efficiently. ### Step 3: Build the Actor[​](#step-3-build-the-actor "Direct link to Step 3: Build the Actor") To run your Actor, build it first. Click the **Build** button below the source code. ![Actor source code](/assets/images/actor-source-code-270416bb696b5630433cfb3a5405cef7.png) Once the build starts, the UI transitions to the **Last build** tab, showing build progress and Docker build logs. ![Actor build](/assets/images/actor-build-b15fc0543e1cdf15b2f97ab8aa983ebb.png) Actor creation flow The UI includes four tabs: * **Code** * **Last build** * **Input** * **Last Run** This represents the Actor creation flow, where you first build the Actor from the source code. Once the build is successful, you can provide input parameters and initiate an Actor run. ### Step 4: Run the Actor[​](#step-4-run-the-actor "Direct link to Step 4: Run the Actor") Once the Actor is built, you can look at its input, which consists of one field - **Start URL**, the URL where the crawling starts. Below the input, you can adjust the **Run options**: * **Build** * **Timeout** * **Memory limit** ![Actor input](/assets/images/actor-input-60fb9eef613c689fd1d9427d6749cb97.png) To initiate an Actor run, click the **Start** button at the bottom of the page. Once the run is created, you can monitor its progress and view the log in real-time. The **Output** tab will display the results of the Actor's execution, which will be populated as the run progresses. You can abort the run at any time using the **Abort** button. ![Actor run](/assets/images/actor-run-f17d17bfc7366c2a827219ce5be64f53.png) ### Step 5: Pull the Actor[​](#step-5-pull-the-actor "Direct link to Step 5: Pull the Actor") To continue development locally, pull the Actor's source code to your machine. Prerequisites Install `apify-cli` : * macOS/Linux * Other platforms ``` brew install apify-cli ``` ``` npm -g install apify-cli ``` To pull your Actor: 1. Log in to the Apify platform ``` apify login ``` 2. Pull your Actor: ``` apify pull your-actor-name ``` Or with a specific version: ``` apify pull your-actor-name --version [version_number] ``` As `your-actor-name`, you can use either: * The unique name of the Actor (e.g., `apify/hello-world`) * The ID of the Actor (e.g., `E2jjCZBezvAZnX8Rb`) You can find both by clicking on the Actor title at the top of the page, which will open a new window containing the Actor's unique name and ID. ### Step 6: It's time to iterate\![​](#step-6-its-time-to-iterate "Direct link to Step 6: It's time to iterate!") After pulling the Actor's source code to your local machine, you can modify and customize it to match your specific requirements. Leverage your preferred code editor or development environment to make the necessary changes and enhancements. Once you've made the desired changes, you can push the updated code back to the Apify platform for deployment & execution, leveraging the platform's scalability and reliability. 
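For example, pushing the updated source code back to the platform uses the same CLI command shown in the local development guide:

```
apify push
```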
## Next steps[​](#next-steps "Direct link to Next steps") * Visit the [Apify Academy](https://docs.apify.com/academy.md) to access a comprehensive collection of tutorials, documentation, and learning resources. * To understand Actors in detail, read the [Actor Whitepaper](https://whitepaper.actor/). * Check [Continuous integration](https://docs.apify.com/platform/actors/development/deployment/continuous-integration.md) documentation to automate your Actor development process. * After you finish building your first Actor, you can [share it with other users and even monetize it](https://docs.apify.com/platform/actors/publishing.md). --- # Publishing and monetization **Apify provides a platform for developing, publishing, and monetizing web automation solutions called Actors. This guide covers the key stages involved in publishing and monetizing your Actors on the Apify platform.** *** > Sharing is caring, but you can also make money from your Actors. Check out our [blog post](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) for more context. ## Publish process[​](#publish-process "Direct link to Publish process") To build & publish an Actor on [Apify Store](https://apify.com/store), you'll go through the following main stages: 1. [Development](https://docs.apify.com/platform/actors/development.md). 2. [Publication](https://docs.apify.com/platform/actors/publishing/publish.md) and setup of [monetization](https://docs.apify.com/platform/actors/publishing/monetize.md). 3. [Testing](https://docs.apify.com/platform/actors/development/automated-tests.md). 4. [Promotion](https://apify.notion.site/3fdc9fd4c8164649a2024c9ca7a2d0da?v=6d262c0b026d49bfa45771cd71f8c9ab). ## Benefits of publishing and monetization[​](#benefits-of-publishing-and-monetization "Direct link to Benefits of publishing and monetization") Publishing your Actor on the Apify Store transforms your code into a product, eliminating traditional SaaS development overhead. When you publish your Actor, you gain access to: * Streamlined product delivery * Dedicated landing page for your Actor * Built-in documentation hosting through seamless README integration * Direct exposure to organic user traffic through Apify Store * Reduced operational burden * No need for separate domains or websites * Built-in cloud infrastructure with automatic scaling * Multiple payment options with automated billing & transactions * Minimal marketing efforts required due to marketplace presence Packaging your software as an Actor allows you to launch a new SaaS product faster and earn income through various monetization models that match your Actor's value proposition, such as: * Fixed rental fee for continuous access * Pay-per-result for usage-based pricing * Pay-per-event for specific operations To learn more, visit our [Actors in Store](https://docs.apify.com/platform/actors/running/actors-in-store#pricing-models) page. ## Maintain public Actors[​](#maintain-public-actors "Direct link to Maintain public Actors") While private Actors don't necessarily require ongoing maintenance, public Actors demand a higher level of responsibility. Since public Actors are available on the [Apify Store](https://apify.com/store) and may be used by other users, an unmaintained public Actor could negatively impact those who depend on it.
To ensure long-term quality and improve your chances of successfully monetizing your Actors, we recommend reserving approximately 2 hours per week for maintenance tasks, such as: * Addressing bug reports & issues * Implementing updates & improvements * Responding to user inquiries & feedback ![](/assets/images/apify-store-1d47e1201c104ab3c2bb25aeab35b714.webp "Apify Store") If you decide to make your Actor's code publicly available on [GitHub](https://github.com), code quality becomes even more crucial, as your Actor may be the first experience some users have with Apify. ### Handling breaking changes[​](#handling-breaking-changes "Direct link to Handling breaking changes") While refactoring and updating your Actor's code is encouraged, be cautious of making changes that could break the Actor for existing users. If you plan to introduce a breaking change, please contact us beforehand, and we'll assist you in communicating the change to your users. ### Documentation and testing[​](#documentation-and-testing "Direct link to Documentation and testing") Pay special attention to your Actor's documentation ([README](https://apify.notion.site/How-to-create-an-Actor-README-759a1614daa54bee834ee39fe4d98bc2)). It should be clear, detailed, concise, and readable, using simple language and avoiding technical jargon whenever possible, as your users may not be developers. Ensure periodic testing, either manually or by [setting up automatic testing](https://docs.apify.com/platform/actors/development/automated-tests.md) and [monitoring](https://apify.com/apify/monitoring). This can help prevent users from encountering issues with your Actor. ## Inspiration for new Actors[​](#inspiration-for-new-actors "Direct link to Inspiration for new Actors") To find ideas for new Actors, consider the following sources: * Your own experiences with friends, colleagues, and customers * SEO tools to identify search terms and websites related to web scraping, web automation, or web integrations (see the [SEO article](https://apify.notion.site/SEO-990259fe88a84fd0a85ce6d3b394d8c1) for more details) * The [Actor ideas page](https://apify.com/ideas) to find Actors in demand by the Apify community * Our [Discord community](https://discord.com/invite/jyEM2PRvMU), especially the [#hire-freelancers](https://discord.com/channels/801163717915574323/1022804760484659210) channel, can offer great insights Additionally, you can refer to our [blog](https://blog.apify.com/) for examples of how we write about and present Actors, such as: * [Content Checker article](https://blog.apify.com/set-up-alert-when-webpage-changes/) * [Kickstarter scraper article](https://blog.apify.com/kickstarter-search-actor-create-your-own-kickstarter-api/) * [Google Sheets Actor article](https://blog.apify.com/google-sheets-import-data/)
However, after a trial period, they need to pay a flat monthly fee to the developer to continue using the Actor. 3. **Pay per result (PPR)**: Users don't pay for the platform usage costs. Instead, they pay the developer based on the number of results produced by the Actor. 4. **Pay per event (PPE)**: Users don't pay for the platform usage cost the Actor generates. Instead, they pay based on specific events that are programmatically triggered from the Actor's source code. These events are defined by the developer and can include actions such as generating a single result or starting an Actor. For a detailed comparison of pricing models from the perspective of your users, refer to [Actors in Store](https://docs.apify.com/platform/actors/running/actors-in-store.md) page. ### Rental pricing model[​](#rental-pricing-model "Direct link to Rental pricing model") With the rental model, you can specify a free trial period and a monthly rental price. After the trial, users with an [Apify paid plan](https://apify.com/pricing) can continue using your Actor by paying the monthly fee. You can receive 80% of the total rental fees collected each month. Example - rental pricing model You make your Actor rental with 7-day free trial and then $30/month. During the first calendar month, three users start to use your Actor: 1. First user, on the Apify paid plan, starts the free trial on the 15th 2. Second user, on the Apify paid plan, starts the free trial on the 25th 3. Third user, on the Apify free plan, starts the free trial on the 20th The first user pays their first rent 7 days after the free trial, i.e., on the 22nd of the month. The second user only starts paying the rent next month. The third user is on the Apify free plan, so after the free trial ends on the 27th of the month, they are not charged and cannot use the Actor further until they get a paid plan. Your profit is computed only from the first user. They were charged $30, so 80% of this goes to you, i.e., *0.8 \* 30 = $24*. Rental Actors in AI workflows The rental model, while easy to set up, is less profitable because its pricing doesn't scale with usage. It is also impractical for AI agent applications and MCP servers, because of the upfront commitment of rental Actors. For Actors intended for AI workflows, use the [pay-per-result](https://docs.apify.com/platform/actors/publishing/monetize.md#pay-per-result-ppr-pricing-model) or [pay-per-event](https://docs.apify.com/platform/actors/publishing/monetize.md#pay-per-event-ppe-pricing-model) models instead. ### Pay-per-result (PPR) pricing model[​](#pay-per-result-ppr-pricing-model "Direct link to Pay-per-result (PPR) pricing model") In this model, you set a price per 1,000 results. Users are charged based on the number of results your Actor produces and stores in the run's default dataset. Your profit is calculated as 80% of the revenue minus platform usage costs. The formula is: `(0.8 * revenue) - costs = profit` Only revenue and cost for Apify customers on paid plans are taken into consideration when computing your profit. Users on free plans are not reflected there. The details on how your cost is computed can be found in [Computing your costs for PPE and PPR Actors](#computing-your-costs-for-ppe-and-ppr-actors). Example - pay-per-result pricing model You make your Actor pay-per-result and set the price to be $1/1,000 results. During the first month, two users on Apify paid plans use your Actor to get 50,000 and 20,000 results, costing them $50 and $20, respectively. 
Let's say the underlying platform usage for the first user is $5 and for the second $2. A third user, this time on an Apify free plan, uses the Actor to get 5,000 results, with underlying platform usage of $0.5. Your profit is computed only from the first two users, since they are on Apify paid plans. The revenue for the first user is $50 and for the second $20, i.e., total revenue is $70. The total underlying cost is *$5 + $2 = $7*. Since your profit is 80% of the revenue minus the cost, it would be *0.8 \* 70 - 7 = $49*. #### Best practices for PPR Actors[​](#best-practices-for-ppr-actors "Direct link to Best practices for PPR Actors") To ensure profitable operation: * Set memory limits in your [`actor.json`](https://docs.apify.com/platform/actors/development/actor-definition/actor-json) file to control platform usage costs * Implement the `ACTOR_MAX_PAID_DATASET_ITEMS` check to prevent excess result generation. You can copy [this simple solution](https://github.com/metalwarrior665/max-paid-items-example/blob/master/src/push-data.ts). * Test your Actor with various result volumes to determine optimal pricing * Push at least one "error item" to the dataset for invalid input or if the search query didn't generate any results. This prevents users from running your Actor for free. ### Pay-per-event (PPE) pricing model[​](#pay-per-event-ppe-pricing-model "Direct link to Pay-per-event (PPE) pricing model") The pay-per-event pricing model offers a flexible monetization option for Actors on Apify Store. Unlike pay-per-result, PPE allows you to charge users based on specific events triggered programmatically by your Actor's code. #### PPE vs. PPR[​](#ppe-vs-ppr "Direct link to PPE vs. PPR") Unlike PPR, which charges based on the number of results produced, PPE lets you define pricing for individual events. You can charge for specific events directly from your Actor by calling the PPE charging API. Common events include Actor start, dataset item creation, and external API calls. #### How is profit computed[​](#how-is-profit-computed "Direct link to How is profit computed") Your profit is calculated as follows: `profit = (0.8 * revenue) - platform costs` where: * *Revenue*: The amount charged for events via the PPE [API](https://docs.apify.com/api/v2/post-charge-run.md) or through the [JS](https://docs.apify.com/sdk/js/reference/class/Actor#charge)/[Python](https://docs.apify.com/sdk/python/reference/class/Actor#charge) SDK. You receive 80% of this revenue. * *Platform costs*: The underlying platform usage costs for running the Actor, calculated in the same way as for PPR. For more details, visit the [Computing your costs for PPE and PPR Actors](#computing-your-costs-for-ppe-and-ppr-actors) section. Only paid user activity is included in profit calculations. #### How to set pricing for PPE[​](#how-to-set-pricing-for-ppe "Direct link to How to set pricing for PPE") 1. *Understand your costs*: Analyze resource usage (e.g., CPU, memory, proxies, external APIs) and identify cost drivers 2. *Define clear events*: Break your Actor's functionality into measurable, chargeable events. 3. *Common use cases*: 1. *For scraping*: Combine Actor start and dataset items pricing to reflect setup and per-result cost. 2. *Beyond scraping*: Account for integrations with external systems or external API calls. 4. *External API costs*: Account for additional processing costs. 5. *Test your pricing*: Run your Actor and analyze cost-effectiveness using a special dataset. 6.
*Communicate value*: Ensure pricing reflects the value provided and is competitive. #### Best practices for PPE Actors[​](#best-practices-for-ppe-actors "Direct link to Best practices for PPE Actors") * Set memory limits in your [`actor.json`](https://docs.apify.com/platform/actors/development/actor-definition/actor-json.md) file to control platform usage costs * Charge either for "Actor start" or for invalid input or empty search to prevent users from running your Actor for free. * Don't charge the "Actor start" event again after Actor migration (check the `ChargingManager` state). * Finish the Actor run once charging reaches the user-configured `Maximum cost per run`. Apify SDKs ([JS](https://docs.apify.com/sdk/js/reference/class/Actor#charge) and [Python](https://docs.apify.com/sdk/python/reference/class/Actor#charge)) return a `ChargeResult` that helps determine when to finish. * If you are not using the SDKs, use idempotency keys in API calls to prevent double charges. * Try to limit the number of events. Fewer events make it easier for users to understand your pricing. * Try to make your events produce tangible artifacts that users can see and understand (this might not be possible when using external APIs), e.g.: * Get a record * Save it to a dataset * etc. `ChargingManager` usage Use our SDKs ([JS](https://docs.apify.com/sdk/js/reference/class/ChargingManager) and [Python](https://docs.apify.com/sdk/python/reference/class/ChargingManager)), or [`apify actor charge`](https://docs.apify.com/cli/docs/next/reference#apify-actor-charge-eventname) when using the Apify CLI, to simplify the PPE implementation in your Actor. This tooling can handle pricing, usage tracking, idempotency keys, API errors, and event charging via an API. You can also choose not to use it, but then you must handle API integration and possible edge cases manually. You can use the `ChargingManager` code as a reference. ### How to attract larger customers of PPE and PPR Actors[​](#how-to-attract-larger-customers-of-ppe-and-ppr-actors "Direct link to How to attract larger customers of PPE and PPR Actors") Each user running your PPE or PPR Actor belongs to a discount tier: * *FREE* * *BRONZE* * *SILVER* * *GOLD* You can define different prices for different tiers. While optional, we recommend offering progressively lower prices for higher discount tiers. This approach can significantly improve the attractiveness of your Actor to large enterprise customers who may spend thousands or tens of thousands of dollars on it. Your platform costs are also lower for these higher tiers, which helps maintain healthy profit margins. This is further detailed in the [Computing your costs for PPE and PPR Actors](#computing-your-costs-for-ppe-and-ppr-actors) section. In general, we advise against setting excessively high prices for *FREE* tier users, as this can limit the ability to evaluate your Actor thoroughly. However, in certain situations, such as protecting your Actor from fraudulent activity or excessive use of your internal APIs, a higher price for *FREE* tier users might be justified. During an Actor run, you can identify the user's discount tier through Actor run environment variables or by querying user data via the Apify API. This capability allows you to offer premium features or differentiated service levels to users in higher discount tiers. In addition to the standard tiers, Apify provides further tiers specifically for enterprise customers, including *PLATINUM* and *DIAMOND* tiers.
If you are interested in offering enterprise-level services and attracting major clients, please contact us. ### Computing your costs for PPE and PPR Actors[​](#computing-your-costs-for-ppe-and-ppr-actors "Direct link to Computing your costs for PPE and PPR Actors") For both PPE and PPR Actors, profit is computed using the formula `(0.8 * revenue) - costs`. In this section, we'll explain how the `costs` component is calculated. When paying users run your Actor, it generates platform usage in the form of compute units, data traffic, API operations etc. This usage determines the `costs` in the profit formula above. *FREE* tier usage Platform usage by *FREE* tier users is covered by Apify and does not contribute to your costs. To calculate your costs in dollars for a specific run by paying user, multiply the unit cost of each service by the quantity consumed. For example, if a *BRONZE* tier user run uses 10 compute units (CUs) at $0.4/CU, your cost would be $4. As highlighted in the [How to attract larger customers of PPE and PPR Actors](#how-to-attract-larger-customers-of-ppe-and-ppr-actors) section, if your Actor uses tiered pricing, the user's discount tier determines the unit costs applied to their runs. Your costs are lower for higher tiers, enabling you to offer more competitive pricing to these customers, while sustaining healthy profit margins. The following table summarizes the platform unit costs used for your cost computation across different discount tiers. | Service | Price is per | *FREE* | *BRONZE* | *SILVER* | *GOLD* | | ------------------------ | ------------ | ------- | -------- | -------- | -------- | | Compute unit | CU | $0.4 | $0.4 | $0.3 | $0.25 | | Residential proxies | GB | $8 | $8 | $7.5 | $7 | | SERPs proxy | 1,000 SERPs | $2.5 | $2.5 | $2 | $1.7 | | Data transfer - external | GB | $0.2 | $0.2 | $0.19 | $0.18 | | Data transfer - internal | GB | $0.05 | $0.05 | $0.045 | $0.04 | | Dataset - reads | 1,000 reads | $0.0004 | $0.0004 | $0.00036 | $0.00032 | | Dataset - writes | 1,000 writes | $0.005 | $0.005 | $0.0045 | $0.004 | | Key-value store - reads | 1,000 reads | $0.005 | $0.005 | $0.0045 | $0.004 | | Key-value store - writes | 1,000 writes | $0.05 | $0.05 | $0.045 | $0.04 | | Key-value store - lists | 1,000 lists | $0.05 | $0.05 | $0.045 | $0.04 | | Request queue - reads | 1,000 reads | $0.004 | $0.004 | $0.0036 | $0.0032 | | Request queue - writes | 1,000 writes | $0.02 | $0.02 | $0.018 | $0.016 | If you decide not to offer tiered discounts on your Actor, the unit prices for *FREE* tier apply. To offer enterprise level services and unlock even cheaper unit prices for enterprise customers, please reach out to us. Cost of PPE Actors in Standby mode When you monetize your Actor in Standby mode using pay per event mode only, you are not responsible for covering platform usage costs of your users' runs. ## Setting up monetization[​](#setting-up-monetization "Direct link to Setting up monetization") Navigate to your [Actor page](https://console.apify.com/actors?tab=my) in Apify Console, choose the Actor that you want to monetize, and select the Publication tab. ![Monetization section](/assets/images/monetization-section-5ea234343a91208580100eb37c1b9e7f.png) Open the Monetization section and complete your billing and payment details. ![Set up monetization](/assets/images/monetize_actor_set_up_monetization-7612e44589223f7e92b8adcd006bc1bb.png) Choose the pricing model for your Actor. 
![Monetization wizard](/assets/images/monetization_wizard-2bdc169343175223d7b989363138e583.png) Follow the monetization wizard to configure your pricing model. * Rental * Pay-per-result * Pay-per-event ![rental monetization wizard](/assets/images/rental-wizard-b62550403aad597d7527027000fc508c.png) ![ppr monetization wizard](/assets/images/ppr-wizard-3bb4ff175381b3f1c750b7fc3260691e.png) ![ppe monetization wizard](/assets/images/ppe-wizard-c1c3b24286f48d3ce8b2104c31260f55.png) ## Changing monetization[​](#changing-monetization "Direct link to Changing monetization") You can change the monetization setting of your Actor by using the same wizard as for the setup in the **Monetization** section of your Actor's **Publication** tab. Any changes made to an already published Actor will take *14 days* to come into effect, so that the users of your Actor have time to prepare. Frequency of monetization adjustments Be aware that you can change the monetization setting of each Actor only once per month. For further information and guidelines, please refer to our [Terms & Conditions](https://apify.com/store-terms-and-conditions). ## Monthly payouts and analytics[​](#monthly-payouts-and-analytics "Direct link to Monthly payouts and analytics") Payout invoices are automatically generated on the 11th of each month, summarizing the profits from all your Actors for the previous month. In accordance with our [Terms & Conditions](https://apify.com/store-terms-and-conditions), only funds from legitimate users who have already paid are included in the payout invoice. How negative profits are handled If your PPR or PPE Actor's price doesn't cover its monthly platform usage costs, it will have a negative profit. When this occurs, we automatically set that Actor's profit to $0 for the month. This ensures a single Actor's loss never reduces your total payout. You have 3 days to review your payout invoice in the **Development > Insights > Payout** section. During this period, you can either approve the invoice or request a revision, which we will process promptly. If no action is taken, the payout will be automatically approved on the 14th, with funds disbursed shortly after. Payouts require meeting minimum thresholds of either: * $20 for PayPal * $100 for other payout methods If the monthly profit does not meet these thresholds, as per our [Terms & Conditions](https://apify.com/store-terms-and-conditions), the funds will roll over to the next month until the threshold is reached. ## Actor analytics[​](#actor-analytics "Direct link to Actor analytics") Monitor your Actors' performance through the [Actor Analytics](https://console.apify.com/actors/insights/analytics) dashboard under **Development > Insights > Analytics**. The analytics dashboard allows you to select specific Actors and view key metrics aggregated across all user runs: * Revenue, costs and profit trends over time * User growth metrics (both paid and free users) * Cost per 1,000 results to optimize pricing * Run success rate statistics * User acquisition funnel analytics * Shared debug runs from users All metrics can be exported as JSON for custom analysis and reporting. ## Promoting your Actor[​](#promoting-your-actor "Direct link to Promoting your Actor") Create search-engine-optimized descriptions and README files to improve search engine visibility.
Share your Actor on multiple channels: * Post on Reddit, Quora, and social media platforms * Create tutorial videos demonstrating key features * Publish articles about your Actor on relevant websites * Consider creating a product showcase on platforms like Product Hunt Remember to tag Apify in your social media posts for additional exposure. Effective promotion can significantly impact your Actor's success, often making the difference between Actors with many paid users and those with few to none. Learn more about promoting your Actor in [Apify's marketing playbook](https://docs.apify.com/academy/actor-marketing-playbook.md). --- # Publish your Actor **Prepare your Actor for Apify Store with a description and README file, and learn how to make your Actor available to the public.** *** Before making your Actor public, it's important to ensure your Actor has a clear **Description** and a comprehensive **README** section. This will help users understand your Actor's purpose, how to configure its inputs, and the type of output it generates. In this guide, we'll review the essential fields you must complete before publishing your Actor. For more detailed information on [SEO & promotion](https://apify.notion.site/SEO-990259fe88a84fd0a85ce6d3b394d8c1) and [how to write a comprehensive README](https://apify.notion.site/How-to-create-an-Actor-README-759a1614daa54bee834ee39fe4d98bc2), refer to the guides available in the [Apify Marketing Playbook](https://apify.notion.site/3fdc9fd4c8164649a2024c9ca7a2d0da?v=6d262c0b026d49bfa45771cd71f8c9ab). ## Make your Actor public[​](#make-your-actor-public "Direct link to Make your Actor public") Once you've finished coding and testing your Actor, it's time to publish it. Follow these steps: 1. From your Actor's page in the Apify Console, go to **Publication** > **Display information** 2. Fill in all the relevant fields for your Actor (e.g., **Icon**, **Actor name**, **Description**, **Categories**) 3. Save your changes ![Actor settings](/assets/images/actor-display-information-edadffb4c2b700ad9bdbefaf1a403d64.webp) After filling in all the required fields, the **Publish to Store** button will turn green. Click on it to make your Actor available to the public on the Apify Store.
*(Screenshot: Publish your Actor)*
To verify that your Actor has been published successfully, go to [Apify Store](https://apify.com/store) and search for your Actor's name. Click on your Actor's card to view its dedicated page. This is the page where users will likely have their first interaction with your Actor, so carefully review it and ensure everything is set up correctly. ![Apify Store](/assets/images/apify-store-1d47e1201c104ab3c2bb25aeab35b714.webp) ![Actor page](/assets/images/actor-page-7fe90bdf7cc14faa1ee5a232c0595dbb.webp) ### Logo[​](#logo "Direct link to Logo") We strongly recommend adding a unique image to your Actor that visually represents the service it provides. This helps users quickly understand its purpose. However, do not use official logos or branded images from the sites you're scraping, as this can lead to copyright or trademark issues. ### Description[​](#description "Direct link to Description") The Actor's description is a short paragraph that explains its purpose. It will be displayed on the Actor's page, right below its title. ![Actor title and description](/assets/images/actor-title-description-f03a3ed71321e82d7bb72f44747e26d2.webp) When writing your Actor's description, you also have the option to provide an SEO title & description. These will be used in search engine result pages instead of the Actor's name & description.
Effective SEO titles & descriptions should: * Utilize popular keywords related to your Actor's functionality * Summarize the Actor's purpose concisely * Be between *40* and *50* characters for the title and *140* and *156* characters for the description ![SEO title and description](/assets/images/actor-SEO-cf81e09f17cef87c058d377065e5f453.webp) ### README[​](#readme "Direct link to README") The next step is to include a comprehensive **README** detailing your Actor's features, reasons for scraping the target website, and instructions on how to use the Actor effectively. Remember that the Actor's README is generated from your `README.md` file, and you can apply the same [SEO principles](https://apify.notion.site/SEO-990259fe88a84fd0a85ce6d3b394d8c1) mentioned earlier to optimize your README for search engines. To save time when writing your Actor's README, you can use the following template as a starting point: Note that the complexity of your README should match the complexity of your Actor. Feel free to adapt the template to fit your Actor's specific requirements. --- # Actor status badge The Actor status badge can be embedded in the README or documentation to show users the current status and usage of your Actor on the Apify platform. *** This is the badge generated for the [Apify's Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor: [![Website Content Crawler Actor](https://apify.com/actor-badge?actor=apify/website-content-crawler)](https://apify.com/apify/website-content-crawler) This is how such a badge looks in a GitHub repository README: ![Actor badge in GitHub README](/assets/images/github-badge-screenshot-23af8e9a39a94a7f9b3222cd3e45f2ad.png) ### How to embed the badge[​](#how-to-embed-the-badge "Direct link to How to embed the badge") The badge is a dynamic SVG image loaded from the Apify platform. It is served from the following URL template: ``` https://apify.com/actor-badge?actor=<username>/<actor-name> ``` To embed the badge in HTML documentation, use it as an image wrapped in a link, as shown in the example below. Don't forget to substitute the `username` and `actor-name` of your Actor. #### Example[​](#example "Direct link to Example") * HTML * Markdown ``` <a href="https://apify.com/apify/website-content-crawler"><img src="https://apify.com/actor-badge?actor=apify/website-content-crawler" alt="Website Content Crawler Actor"></a> ``` ``` [![Website Content Crawler Actor](https://apify.com/actor-badge?actor=apify/website-content-crawler)](https://apify.com/apify/website-content-crawler) ``` ### Supported Actor states[​](#supported-actor-states "Direct link to Supported Actor states") The badge indicates the state of the Actor in the Apify platform as the result of the [automated testing](https://docs.apify.com/platform/actors/development/automated-tests.md).
#### Actor OK[​](#actor-ok "Direct link to Actor OK") ![Actor badge OK](data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB3aWR0aD0iMTg4IiBoZWlnaHQ9IjIwIiByb2xlPSJpbWciIGFyaWEtbGFiZWw9ImFwaWZ5OiBydW5zOiAxNjVNLCB1c2VyczogNjkuOGsiPjx0aXRsZT5hcGlmeTogcnVuczogMTY1TSwgdXNlcnM6IDY5LjhrPC90aXRsZT48bGluZWFyR3JhZGllbnQgaWQ9InMiIHgyPSIwIiB5Mj0iMTAwJSI+PHN0b3Agb2Zmc2V0PSIwIiBzdG9wLWNvbG9yPSIjYmJiIiBzdG9wLW9wYWNpdHk9Ii4xIi8+PHN0b3Agb2Zmc2V0PSIxIiBzdG9wLW9wYWNpdHk9Ii4xIi8+PC9saW5lYXJHcmFkaWVudD48Y2xpcFBhdGggaWQ9InIiPjxyZWN0IHdpZHRoPSIxODgiIGhlaWdodD0iMjAiIHJ4PSIzIiBmaWxsPSIjZmZmIi8+PC9jbGlwUGF0aD48ZyBjbGlwLXBhdGg9InVybCgjcikiPjxyZWN0IHdpZHRoPSIzNyIgaGVpZ2h0PSIyMCIgZmlsbD0iIzU1NSIvPjxyZWN0IHg9IjM3IiB3aWR0aD0iMTUxIiBoZWlnaHQ9IjIwIiBmaWxsPSIjNGMxIi8+PHJlY3Qgd2lkdGg9IjE4OCIgaGVpZ2h0PSIyMCIgZmlsbD0idXJsKCNzKSIvPjwvZz48ZyBmaWxsPSIjZmZmIiB0ZXh0LWFuY2hvcj0ibWlkZGxlIiBmb250LWZhbWlseT0iVmVyZGFuYSxHZW5ldmEsRGVqYVZ1IFNhbnMsc2Fucy1zZXJpZiIgdGV4dC1yZW5kZXJpbmc9Imdlb21ldHJpY1ByZWNpc2lvbiIgZm9udC1zaXplPSIxMTAiPjx0ZXh0IGFyaWEtaGlkZGVuPSJ0cnVlIiB4PSIxOTUiIHk9IjE1MCIgZmlsbD0iIzAxMDEwMSIgZmlsbC1vcGFjaXR5PSIuMyIgdHJhbnNmb3JtPSJzY2FsZSguMSkiIHRleHRMZW5ndGg9IjI3MCI+YXBpZnk8L3RleHQ+PHRleHQgeD0iMTk1IiB5PSIxNDAiIHRyYW5zZm9ybT0ic2NhbGUoLjEpIiBmaWxsPSIjZmZmIiB0ZXh0TGVuZ3RoPSIyNzAiPmFwaWZ5PC90ZXh0Pjx0ZXh0IGFyaWEtaGlkZGVuPSJ0cnVlIiB4PSIxMTE1IiB5PSIxNTAiIGZpbGw9IiMwMTAxMDEiIGZpbGwtb3BhY2l0eT0iLjMiIHRyYW5zZm9ybT0ic2NhbGUoLjEpIiB0ZXh0TGVuZ3RoPSIxNDEwIj5ydW5zOiAxNjVNLCB1c2VyczogNjkuOGs8L3RleHQ+PHRleHQgeD0iMTExNSIgeT0iMTQwIiB0cmFuc2Zvcm09InNjYWxlKC4xKSIgZmlsbD0iI2ZmZiIgdGV4dExlbmd0aD0iMTQxMCI+cnVuczogMTY1TSwgdXNlcnM6IDY5LjhrPC90ZXh0PjwvZz48L3N2Zz4=) #### Actor under maintenance[​](#actor-under-maintenance "Direct link to Actor under maintenance") ![Actor badge under maintenance](data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB3aWR0aD0iMTQ4IiBoZWlnaHQ9IjIwIiByb2xlPSJpbWciIGFyaWEtbGFiZWw9ImFwaWZ5OiBydW5zOiA3MywgdXNlcnM6IDIiPjx0aXRsZT5hcGlmeTogcnVuczogNzMsIHVzZXJzOiAyPC90aXRsZT48bGluZWFyR3JhZGllbnQgaWQ9InMiIHgyPSIwIiB5Mj0iMTAwJSI+PHN0b3Agb2Zmc2V0PSIwIiBzdG9wLWNvbG9yPSIjYmJiIiBzdG9wLW9wYWNpdHk9Ii4xIi8+PHN0b3Agb2Zmc2V0PSIxIiBzdG9wLW9wYWNpdHk9Ii4xIi8+PC9saW5lYXJHcmFkaWVudD48Y2xpcFBhdGggaWQ9InIiPjxyZWN0IHdpZHRoPSIxNDgiIGhlaWdodD0iMjAiIHJ4PSIzIiBmaWxsPSIjZmZmIi8+PC9jbGlwUGF0aD48ZyBjbGlwLXBhdGg9InVybCgjcikiPjxyZWN0IHdpZHRoPSIzNyIgaGVpZ2h0PSIyMCIgZmlsbD0iIzU1NSIvPjxyZWN0IHg9IjM3IiB3aWR0aD0iMTExIiBoZWlnaHQ9IjIwIiBmaWxsPSIjZTA1ZDQ0Ii8+PHJlY3Qgd2lkdGg9IjE0OCIgaGVpZ2h0PSIyMCIgZmlsbD0idXJsKCNzKSIvPjwvZz48ZyBmaWxsPSIjZmZmIiB0ZXh0LWFuY2hvcj0ibWlkZGxlIiBmb250LWZhbWlseT0iVmVyZGFuYSxHZW5ldmEsRGVqYVZ1IFNhbnMsc2Fucy1zZXJpZiIgdGV4dC1yZW5kZXJpbmc9Imdlb21ldHJpY1ByZWNpc2lvbiIgZm9udC1zaXplPSIxMTAiPjx0ZXh0IGFyaWEtaGlkZGVuPSJ0cnVlIiB4PSIxOTUiIHk9IjE1MCIgZmlsbD0iIzAxMDEwMSIgZmlsbC1vcGFjaXR5PSIuMyIgdHJhbnNmb3JtPSJzY2FsZSguMSkiIHRleHRMZW5ndGg9IjI3MCI+YXBpZnk8L3RleHQ+PHRleHQgeD0iMTk1IiB5PSIxNDAiIHRyYW5zZm9ybT0ic2NhbGUoLjEpIiBmaWxsPSIjZmZmIiB0ZXh0TGVuZ3RoPSIyNzAiPmFwaWZ5PC90ZXh0Pjx0ZXh0IGFyaWEtaGlkZGVuPSJ0cnVlIiB4PSI5MTUiIHk9IjE1MCIgZmlsbD0iIzAxMDEwMSIgZmlsbC1vcGFjaXR5PSIuMyIgdHJhbnNmb3JtPSJzY2FsZSguMSkiIHRleHRMZW5ndGg9IjEwMTAiPnJ1bnM6IDczLCB1c2VyczogMjwvdGV4dD48dGV4dCB4PSI5MTUiIHk9IjE0MCIgdHJhbnNmb3JtPSJzY2FsZSguMSkiIGZpbGw9IiNmZmYiIHRleHRMZW5ndGg9IjEwMTAiPnJ1bnM6IDczLCB1c2VyczogMjwvdGV4dD48L2c+PC9zdmc+) #### Actor deprecated[​](#actor-deprecated "Direct link to Actor deprecated") ![Actor badge 
deprecated](data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB3aWR0aD0iMTYyIiBoZWlnaHQ9IjIwIiByb2xlPSJpbWciIGFyaWEtbGFiZWw9ImFwaWZ5OiBydW5zOiA3MzMsIHVzZXJzOiA0MCI+PHRpdGxlPmFwaWZ5OiBydW5zOiA3MzMsIHVzZXJzOiA0MDwvdGl0bGU+PGxpbmVhckdyYWRpZW50IGlkPSJzIiB4Mj0iMCIgeTI9IjEwMCUiPjxzdG9wIG9mZnNldD0iMCIgc3RvcC1jb2xvcj0iI2JiYiIgc3RvcC1vcGFjaXR5PSIuMSIvPjxzdG9wIG9mZnNldD0iMSIgc3RvcC1vcGFjaXR5PSIuMSIvPjwvbGluZWFyR3JhZGllbnQ+PGNsaXBQYXRoIGlkPSJyIj48cmVjdCB3aWR0aD0iMTYyIiBoZWlnaHQ9IjIwIiByeD0iMyIgZmlsbD0iI2ZmZiIvPjwvY2xpcFBhdGg+PGcgY2xpcC1wYXRoPSJ1cmwoI3IpIj48cmVjdCB3aWR0aD0iMzciIGhlaWdodD0iMjAiIGZpbGw9IiM1NTUiLz48cmVjdCB4PSIzNyIgd2lkdGg9IjEyNSIgaGVpZ2h0PSIyMCIgZmlsbD0ic2lsdmVyIi8+PHJlY3Qgd2lkdGg9IjE2MiIgaGVpZ2h0PSIyMCIgZmlsbD0idXJsKCNzKSIvPjwvZz48ZyBmaWxsPSIjZmZmIiB0ZXh0LWFuY2hvcj0ibWlkZGxlIiBmb250LWZhbWlseT0iVmVyZGFuYSxHZW5ldmEsRGVqYVZ1IFNhbnMsc2Fucy1zZXJpZiIgdGV4dC1yZW5kZXJpbmc9Imdlb21ldHJpY1ByZWNpc2lvbiIgZm9udC1zaXplPSIxMTAiPjx0ZXh0IGFyaWEtaGlkZGVuPSJ0cnVlIiB4PSIxOTUiIHk9IjE1MCIgZmlsbD0iIzAxMDEwMSIgZmlsbC1vcGFjaXR5PSIuMyIgdHJhbnNmb3JtPSJzY2FsZSguMSkiIHRleHRMZW5ndGg9IjI3MCI+YXBpZnk8L3RleHQ+PHRleHQgeD0iMTk1IiB5PSIxNDAiIHRyYW5zZm9ybT0ic2NhbGUoLjEpIiBmaWxsPSIjZmZmIiB0ZXh0TGVuZ3RoPSIyNzAiPmFwaWZ5PC90ZXh0Pjx0ZXh0IGFyaWEtaGlkZGVuPSJ0cnVlIiB4PSI5ODUiIHk9IjE1MCIgZmlsbD0iI2NjYyIgZmlsbC1vcGFjaXR5PSIuMyIgdHJhbnNmb3JtPSJzY2FsZSguMSkiIHRleHRMZW5ndGg9IjExNTAiPnJ1bnM6IDczMywgdXNlcnM6IDQwPC90ZXh0Pjx0ZXh0IHg9Ijk4NSIgeT0iMTQwIiB0cmFuc2Zvcm09InNjYWxlKC4xKSIgZmlsbD0iIzMzMyIgdGV4dExlbmd0aD0iMTE1MCI+cnVuczogNzMzLCB1c2VyczogNDA8L3RleHQ+PC9nPjwvc3ZnPg==) #### Actor not found[​](#actor-not-found "Direct link to Actor not found") ![Actor badge not found](data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hsaW5rIiB3aWR0aD0iMTMyIiBoZWlnaHQ9IjIwIiByb2xlPSJpbWciIGFyaWEtbGFiZWw9ImFwaWZ5OiBhY3RvciBub3QgZm91bmQiPjx0aXRsZT5hcGlmeTogYWN0b3Igbm90IGZvdW5kPC90aXRsZT48bGluZWFyR3JhZGllbnQgaWQ9InMiIHgyPSIwIiB5Mj0iMTAwJSI+PHN0b3Agb2Zmc2V0PSIwIiBzdG9wLWNvbG9yPSIjYmJiIiBzdG9wLW9wYWNpdHk9Ii4xIi8+PHN0b3Agb2Zmc2V0PSIxIiBzdG9wLW9wYWNpdHk9Ii4xIi8+PC9saW5lYXJHcmFkaWVudD48Y2xpcFBhdGggaWQ9InIiPjxyZWN0IHdpZHRoPSIxMzIiIGhlaWdodD0iMjAiIHJ4PSIzIiBmaWxsPSIjZmZmIi8+PC9jbGlwUGF0aD48ZyBjbGlwLXBhdGg9InVybCgjcikiPjxyZWN0IHdpZHRoPSIzNyIgaGVpZ2h0PSIyMCIgZmlsbD0iIzU1NSIvPjxyZWN0IHg9IjM3IiB3aWR0aD0iOTUiIGhlaWdodD0iMjAiIGZpbGw9IiNlMDVkNDQiLz48cmVjdCB3aWR0aD0iMTMyIiBoZWlnaHQ9IjIwIiBmaWxsPSJ1cmwoI3MpIi8+PC9nPjxnIGZpbGw9IiNmZmYiIHRleHQtYW5jaG9yPSJtaWRkbGUiIGZvbnQtZmFtaWx5PSJWZXJkYW5hLEdlbmV2YSxEZWphVnUgU2FucyxzYW5zLXNlcmlmIiB0ZXh0LXJlbmRlcmluZz0iZ2VvbWV0cmljUHJlY2lzaW9uIiBmb250LXNpemU9IjExMCI+PHRleHQgYXJpYS1oaWRkZW49InRydWUiIHg9IjE5NSIgeT0iMTUwIiBmaWxsPSIjMDEwMTAxIiBmaWxsLW9wYWNpdHk9Ii4zIiB0cmFuc2Zvcm09InNjYWxlKC4xKSIgdGV4dExlbmd0aD0iMjcwIj5hcGlmeTwvdGV4dD48dGV4dCB4PSIxOTUiIHk9IjE0MCIgdHJhbnNmb3JtPSJzY2FsZSguMSkiIGZpbGw9IiNmZmYiIHRleHRMZW5ndGg9IjI3MCI+YXBpZnk8L3RleHQ+PHRleHQgYXJpYS1oaWRkZW49InRydWUiIHg9IjgzNSIgeT0iMTUwIiBmaWxsPSIjMDEwMTAxIiBmaWxsLW9wYWNpdHk9Ii4zIiB0cmFuc2Zvcm09InNjYWxlKC4xKSIgdGV4dExlbmd0aD0iODUwIj5hY3RvciBub3QgZm91bmQ8L3RleHQ+PHRleHQgeD0iODM1IiB5PSIxNDAiIHRyYW5zZm9ybT0ic2NhbGUoLjEpIiBmaWxsPSIjZmZmIiB0ZXh0TGVuZ3RoPSI4NTAiPmFjdG9yIG5vdCBmb3VuZDwvdGV4dD48L2c+PC9zdmc+) --- # Automated testing **Apify has a QA system that regularly runs automated tests to ensure that all Actors in the store are functional.** *** ### Why we test[​](#why-we-test "Direct link to Why we test") We want to make sure 
that all Actors in Apify Store are top-notch, or at least as top-notch as they can be. Since there are many of them, we have an automated testing procedure in place that tests all Actors daily. This helps us flag Actors that temporarily don't work as expected as `under maintenance`, and automatically `deprecate` Actors that have been broken for more than a month. ### How we test[​](#how-we-test "Direct link to How we test") The test runs the Actor with its default input (defined by the [**prefill**](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1#prefill-vs-default-vs-required) option in the input schema file) and expects it to finish with a **Succeeded** status and a non-empty default dataset within 5 minutes of the beginning of the run. ![Actor page](/assets/images/actor-test-06ec218b357a5e86a8c3ea01b1050ac7.webp) If the Actor fails to complete successful runs for three consecutive days, the developer will be notified, and the Actor will be labeled `under maintenance` until it is fixed. After another 14 days of failing runs, you will receive another notification. Finally, if the runs continue to fail after yet another 14 days, the Actor will be `deprecated`. ### How can I make my Actor healthy again?[​](#how-can-i-make-my-actor-healthy-again "Direct link to How can I make my Actor healthy again?") The best course of action is to fix the Actor and rebuild it. The automatic testing system will pick this up within 24 hours and mark it as healthy. In some cases, your Actor might break because of issues with the target website. In such a case, if your Actor passes the majority of test runs in the next 7 days, it will be marked as healthy automatically. ## What if my Actor cannot comply with the test logic?[​](#what-if-my-actor-cannot-comply-with-the-test-logic "Direct link to What if my Actor cannot comply with the test logic?") Actors that require some sort of authentication will always fail the tests despite being fully functional. Also, some Actors inherently run for longer than 5 minutes. If that's the case with your Actor, please contact support and explain the specific use case that justifies why the Actor should be excluded from the automated tests. ## Advanced Actor testing[​](#advanced-actor-testing "Direct link to Advanced Actor testing") You can easily implement your own tests and customize them to fit your Actor's particularities by using our public [Actor Testing](https://apify.com/pocesar/actor-testing) tool available in Apify Store. For more information, see the [automated testing](https://docs.apify.com/platform/actors/development/automated-tests.md) section. --- # Running Actors **In this section, you learn how to run Apify Actors using Apify Console or programmatically. You will learn about their configuration, versioning, data retention, usage, and pricing.** *** ## Run your first Apify Actor[​](#run-your-first-apify-actor "Direct link to Run your first Apify Actor") Before you can run an Actor, you have to either choose one of the existing ones from [Apify Store](https://apify.com/store) or [build your own](https://docs.apify.com/platform/actors/development.md). To get started, we recommend trying out an Actor from the [Store](https://apify.com/store). Once you have chosen an Actor, you can start it in a number of ways. > **You will need an Apify account to complete this tutorial. If you don't have one, [complete the sign-up process](https://console.apify.com/sign-up) first.
Don't worry about the price - it's free.** ### 1. Choose your Actor[​](#1-choose-your-actor "Direct link to 1. Choose your Actor") After you sign-in to Apify Console, navigate to [Apify Store](https://console.apify.com/store). We'll pick the [Website Content Crawler](https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/information/version-0/readme): ![Apify Store](/assets/images/store-5b5e59758034626dd92a45735c138c20.png) ### 2. Configure it[​](#2-configure-it "Direct link to 2. Configure it") On the Actor's page, head over to the **Input** tab. Don't be put off by all the boxes - the Actor is pre-configured to run without any extra input. Just click the **Start** button in the bottom-left corner. Alternatively, you can play around with the settings to make the results more interesting for you. ![Actor input](/assets/images/apify-input-eeec3989b5a1ed4bb84e06982e6b3068.png) ### 3. Wait for the results[​](#3-wait-for-the-results "Direct link to 3. Wait for the results") The Actor might take a while to gather its first results and finish its run. Meanwhile, let's take some time to explore the platform options: * Note the other tabs, which provide you with information about the Actor run. For example, you can access the run **Log** and **Storage**. * At the top right, you can click on the API button to explore the related API endpoints ![Run](/assets/images/actor-run-bcbc9356dd02906cacd7a09cd6f18528.png) ### 4. Get the results[​](#4-get-the-results "Direct link to 4. Get the results") Shortly you will see the first results popping up: ![Actor results](/assets/images/actor-results-6fc04e56f4a4032e667613502a151137.png) And you can use the export button at the bottom left to export the data in multiple formats: ![Export results](/assets/images/export-results-b04ca04e8fbe111ee6b091c9f2b5e973.png) And that's it! Now you can get back to the Actor's input, play with it, and try out more of the [Apify Actors](https://apify.com/store) or [build your own](https://docs.apify.com/platform/actors/development.md). ## Running via Apify API[​](#running-via-apify-api "Direct link to Running via Apify API") Actors can also be invoked using the Apify API by sending an HTTP POST request to the [Run Actor](https://docs.apify.com/api/v2.md#/reference/actors/run-collection/run-actor) endpoint, such as: ``` https://api.apify.com/v2/acts/compass~crawler-google-places/runs?token= ``` An Actor's input and its content type can be passed as a payload of the POST request, and additional options can be specified using URL query parameters. For more details, see the [Run Actor](https://docs.apify.com/api/v2.md#/reference/actors/run-collection/run-actor) section in the API reference. > To learn more about this, read the [Run an Actor or task and retrieve data via API](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md) tutorial. ## Running programmatically[​](#running-programmatically "Direct link to Running programmatically") Actors can also be invoked programmatically from your own applications or from other Actors. To start an Actor from your own application, we recommend using our API client libraries for [JavaScript](https://docs.apify.com/api/client/js/reference/class/ActorClient#call) or [Python](https://docs.apify.com/api/client/python/reference/class/ActorClient#call). * JavaScript * Python ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-API-TOKEN', }); // Start the Google Maps Scraper Actor and wait for it to finish. 
const actorRun = await client.actor('compass/crawler-google-places').call({ queries: 'apify', }); // Fetch scraped results from the Actor's dataset. const { items } = await client.dataset(actorRun.defaultDatasetId).listItems(); console.dir(items); ``` ``` from apify_client import ApifyClient apify_client = ApifyClient('MY-API-TOKEN') # Start the Google Maps Scraper Actor and wait for it to finish. actor_run = apify_client.actor('compass/crawler-google-places').call( run_input={ 'queries': 'apify' } ) # Fetch scraped results from the Actor's dataset. dataset_items = apify_client.dataset(actor_run['defaultDatasetId']).list_items().items print(dataset_items) ``` The newly started Actor runs under the account associated with the provided `token`, and therefore all resources consumed are charged to this user account. Internally, the `call()` function invokes the [Run Actor](https://docs.apify.com/api/v2.md#/reference/actors/run-collection/run-actor) API endpoint, waits for the Actor to finish, and reads its output using the [Get items](https://docs.apify.com/api/v2.md#/reference/datasets/item-collection/get-items) API endpoint. --- # Actors in Store **[Apify Store](https://apify.com/store) is home to thousands of public Actors available to the Apify community. It's the easiest way for you to start with Apify.** *** Publishing and monetizing Actors Anyone is welcome to [publish Actors](https://docs.apify.com/platform/actors/publishing.md) in the store, and you can even [monetize your Actors](https://docs.apify.com/platform/actors/publishing/monetize.md). For more information about how to monetize your Actor, best practices, SEO, and promotion tips and tricks, head over to the [Getting the most of your public Actors in Apify Store](https://docs.apify.com/academy/actor-marketing-playbook.md) section of the Apify Developers Academy. ## Pricing models[​](#pricing-models "Direct link to Pricing models") All Actors in [Apify Store](https://apify.com/store) fall into one of the four pricing models: 1. [**Rental**](#rental-actors) - to continue using the Actor after the trial period, you must rent the Actor from the developer and pay a flat monthly fee in addition to the costs associated with the platform usage that the Actor generates. 2. [**Pay per result**](#pay-per-result) - you do not pay for platform usage the Actor generates and instead just pay for the results it produces. 3. [**Pay per event**](#pay-per-event) - you can run the Actor and you do not pay for platform usage the Actor generates. Instead you pay for the specific events the Actor creator defines, such as for generating a single result or starting the Actor. 4. [**Pay per usage**](#pay-per-usage) - you can run the Actor and you pay for the platform usage the Actor generates. ### Rental Actors[​](#rental-actors "Direct link to Rental Actors") Rental Actors are Actors for which you have to pay a recurring fee to the developer after your trial period ends. This empowers the developer to dedicate more time and effort to their Actors, thus ensuring they are of the *highest quality* and receive *ongoing maintenance*. ![Rental Actor example](/assets/images/rental-actor-example-3ad82ec435c67977703b598636eaa2c5.png) Most rental Actors have a *free trial* period. The length of the trial is displayed on each Actor's page. 
![Rental Actors in Apify Store](/assets/images/rental-actor-trial-example-0ce81cb643d211da7b80975d191a1169.png) After a trial period, a flat monthly *Actor rental* fee is automatically subtracted from your prepaid platform usage in advance for the following month. Most of this fee goes directly to the developer and is paid on top of the platform usage generated by the Actor. You can read more about our motivation for releasing rental Actors in [this blog post](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) from Apify's CEO Jan Čurn. #### Rental Actors - Frequently Asked Questions[​](#rental-actors---frequently-asked-questions "Direct link to Rental Actors - Frequently Asked Questions") ##### Can I run rental Actors via API or the Apify client?[​](#can-i-run-rental-actors-via-api-or-the-apify-client "Direct link to Can I run rental Actors via API or the Apify client?") Yes, when you are renting an Actor, you can run it using either our [API](https://docs.apify.com/api/v2.md), [JavaScript](https://docs.apify.com/api/client/js) or [Python](https://docs.apify.com/api/client/python) clients as you would do with private or free public Actors. ##### Do I pay platform costs for running rental Actors?[​](#do-i-pay-platform-costs-for-running-rental-actors "Direct link to Do I pay platform costs for running rental Actors?") Yes, you will pay normal [platform usage costs](https://apify.com/pricing) on top of the monthly Actor rental fee. The platform costs work exactly the same way as for free public Actors or your private Actors. You should find estimates of the cost of usage in each individual rental Actor's README ([see an example](https://apify.com/compass/crawler-google-places#how-much-will-it-cost)). ##### Do I need an Apify paid plan to use rental Actors?[​](#do-i-need-an-apify-paid-plan-to-use-rental-actors "Direct link to Do I need an Apify paid plan to use rental Actors?") You don't need a paid plan to start a rental Actor's free trial. Just activate the trial, and you are good to go. After that, you will need to subscribe to one of [Apify's paid plans](https://apify.com/pricing) in order to keep renting the Actor and continue using it. ##### When will I be charged for the Actor rental?[​](#when-will-i-be-charged-for-the-actor-rental "Direct link to When will I be charged for the Actor rental?") You always prepay the Actor rental for the following month. The first payment happens when the trial expires, and then recurs monthly. When you open the Actor in the Apify Console, you will see when the next rental payment is due, and you will also receive a notification when it happens. *Example*: You activate a 7-day trial of an Actor at *noon of April 1, 2021*. If you don't turn off auto-renewal, you will be charged at *noon on April 8, 2021*, then *May 8, 2021*. ##### How am I charged for Actor rental?[​](#how-am-i-charged-for-actor-rental "Direct link to How am I charged for Actor rental?") The rental fee for an Actor is automatically subtracted from your prepaid platform usage, similarly to, e.g. [compute units](https://docs.apify.com/platform/actors/running/usage-and-resources.md). If you don't have enough usage prepaid, you will need to cover any overage in the next invoice. 
##### Will I be automatically charged at the end of the free trial?[​](#will-i-be-automatically-charged-at-the-end-of-the-free-trial "Direct link to Will I be automatically charged at the end of the free trial?") If you have an [Apify paid plan](https://apify.com/pricing), the monthly rental fee will be automatically subtracted from your plan's prepaid usage at the end of your free trial, and you will be able to run the Actor for another month. If you are not subscribed to any of [Apify's paid plans](https://apify.com/pricing), you will need to subscribe to one in order to continue using the Actor after the trial has ended. ##### Can I cancel my Actor rental?[​](#can-i-cancel-my-actor-rental "Direct link to Can I cancel my Actor rental?") *You can cancel the Actor rental* during your trial or any time after that so you don't get charged when your current Actor rental period expires. You can always turn it back on later if you want. ##### Where can I see how much I have paid for Actor rental?[​](#where-can-i-see-how-much-i-have-paid-for-actor-rental "Direct link to Where can I see how much I have paid for Actor rental?") Since Actor rental fees are paid from prepaid platform usage, these fees conceptually belong under platform usage. You can find the breakdown of how much you have been charged for rental Actors in the **Actors** tab, which you will find within the **Current period** tab in the [Billing](https://console.apify.com/billing) section. ![Rental Actors billing in Apify Console](/assets/images/billing-paid-actors-333edff195608ead302706f5401c94ca.png) ### Pay per result[​](#pay-per-result "Direct link to Pay per result") When you run an Actor that is *paid per result*, you pay for the successful results that an Actor returns when you run it, and you are not charged for the underlying platform usage. Estimation simplified This makes it transparent and easy to estimate upfront costs. If you have any feedback or would like to ask something, please join our [Discord](https://discord.gg/qkMS6pU4cF) community and let us know! #### Pay per result Actors - Frequently Asked Questions[​](#pay-per-result-actors---frequently-asked-questions "Direct link to Pay per result Actors - Frequently Asked Questions") ##### How do I know an Actor is paid per result?[​](#how-do-i-know-an-actor-is-paid-per-result "Direct link to How do I know an Actor is paid per result?") When you try the Actor on the platform, you will see that the Actor is paid per result next to the Actor name. ![Actor paid per result in Console](/assets/images/console_pay_per_result_tag-b80c8c3f7b3d6cd738bb4adb6fc190c7.png) ##### Do I need to pay a monthly rental fee to run the Actor?[​](#do-i-need-to-pay-a-monthly-rental-fee-to-run-the-actor "Direct link to Do I need to pay a monthly rental fee to run the Actor?") No, the Actor is free to run. You only pay for the results. ##### What happens when I interact with the dataset after the run finishes?[​](#what-happens-when-i-interact-with-the-dataset-after-the-run-finishes "Direct link to What happens when I interact with the dataset after the run finishes?") Under the **pay per result** model, all platform costs generated *during the run of an Actor* are not charged towards your account; you pay for the results instead. After the run finishes, any interactions with the default dataset storing the results, such as reading the results or writing additional data, will incur the standard platform usage costs. 
But don't worry: in the vast majority of cases, you only want to read the results from the dataset, and that costs next to nothing. ##### Do I pay for the storage of results on the Apify platform?[​](#do-i-pay-for-the-storage-of-results-on-the-apify-platform "Direct link to Do I pay for the storage of results on the Apify platform?") You will still be charged for the timed storage of the data in the same fashion as with any other Actor. You can always decide to delete the dataset to reduce your costs after you export the data from the platform. By default, any unnamed dataset will be automatically removed after your data retention period, so usually, this is nothing to worry about. ##### Can I set a cap on how many results an Actor should return?[​](#can-i-set-a-cap-on-how-many-results-an-actor-should-return "Direct link to Can I set a cap on how many results an Actor should return?") You can set a limit on how many items an Actor should return, and thus on the amount you will be charged, in the **Options** section on the Actor detail page, below the Actor input. ![Max items for pay-per-result](/assets/images/max-items-for-pay-per-result-eb410b914957a2078cb787a13f16b186.png) ##### Can I publish an Actor that is paid per result?[​](#can-i-publish-an-actor-that-is-paid-per-result "Direct link to Can I publish an Actor that is paid per result?") Yes, you can publish an Actor that is paid per result. ##### Where do I see how much I was charged for the pay per result Actors?[​](#where-do-i-see-how-much-i-was-charged-for-the-pay-per-result-actors "Direct link to Where do I see how much I was charged for the pay per result Actors?") You can see an overview of how much you have been charged for Actors paid per result on your invoices and in the [Usage tab](https://console.apify.com/billing) of the Billing section in Console. It will be shown there as a separate service. ![Statistics in the billing section](/assets/images/pay_per_result_billing_usage_section-b0a70147a64e4e06243be8ece25ca296.png) On top of that, you can see how much you have been charged for a specific run in the detail of that run and also in the overview table showing all runs. ![Run cost shown on the run detail](/assets/images/pay_per_result_run_detail-fe4dadc512669d16bfb74a63c9112fb2.png) ![Run cost shown on the overview of all runs](/assets/images/pay_per_result_run_list-6b7b0f9ed43b1774242dbb01d021ffde.png) If you wish to see how much you have been charged for a particular Actor, you will find this information at the bottom of the [Usage tab](https://console.apify.com/billing). ![Actor pay-per-result cost](/assets/images/pay_per_result_actor_items-7f76640cf886d8d683517418a042a32e.png) ### Pay per event[​](#pay-per-event "Direct link to Pay per event") The pay per event pricing model is very similar to the pay per result model. You still do not pay for the underlying platform usage. Instead of results, you pay for specific events defined by the creator of the Actor. These events will vary between Actors, and will always be described, together with their pricing, on each Actor's page. Example events might be producing a single result, doing a unit piece of work (e.g. uploading a file), or starting an Actor.
#### Pay per event Actors - Frequently Asked Questions[​](#pay-per-event-actors---frequently-asked-questions "Direct link to Pay per event Actors - Frequently Asked Questions") ##### How do I know an Actor is paid per event?[​](#how-do-i-know-actor-is-paid-per-events "Direct link to How do I know an Actor is paid per event?") You will see that the Actor is paid per event next to the Actor name. ![Example pay per event Actor](/assets/images/pay_per_event_example_actor-548327b71a5877bd4e7324e517ade11c.png) ##### Do I need to pay a monthly rental fee to run the Actor?[​](#do-i-need-to-pay-a-monthly-rental-fee-to-run-the-actor-1 "Direct link to Do I need to pay a monthly rental fee to run the Actor?") No, you only pay for the events. ##### What happens when I interact with the dataset after the run finishes?[​](#what-happens-when-i-interact-with-the-dataset-after-the-run-finishes-1 "Direct link to What happens when I interact with the dataset after the run finishes?") You would still pay for all interactions after the Actor run finishes, same as for pay per result Actors. ##### Do I pay for the storage of results on the Apify platform?[​](#do-i-pay-for-the-storage-of-results-on-the-apify-platform-1 "Direct link to Do I pay for the storage of results on the Apify platform?") You would still pay for the long-term storage of results, same as for pay per result Actors. ##### Some Actors declare that I still need to pay for usage, how come?[​](#some-actors-declare-that-i-still-need-to-pay-for-usage-how-come "Direct link to Some Actors declare that I still need to pay for usage, how come?") When an Actor operates in [Standby mode](https://docs.apify.com/platform/actors/running/standby), you control how the background runs scale and how efficiently they are utilized. In this case, you are responsible for paying the platform usage costs of these runs in addition to the event charges. Some selected Standby Actors may have this usage component waived, so always check the pricing page of the specific Actor to determine whether you will be charged for usage or only for events. ##### Where do I see how much I was charged for the pay per event Actors?[​](#where-do-i-see-how-much-i-was-charged-for-the-pay-per-result-actors-1 "Direct link to Where do I see how much I was charged for the pay per event Actors?") Similarly to pay per result Actors, you can see how much you have been charged on your invoices, and on the [Usage tab](https://console.apify.com/billing) of the Billing section in the Console. ![Pay per event Actor - historical usage tab](/assets/images/pay_per_event_historical_usage_tab-66fd1c0812dc70b5529a003f0003cd29.png) You can also see the cost of each run on the run detail itself. ![Pay per event Actor - run detail](/assets/images/pay_per_event_price_on_run_detail-9bc0644998cb9c66b60f1b5f5be18d75.png) ##### Can I put a cap on the cost of a single Actor run?[​](#can-i-put-a-cap-on-a-cost-of-a-single-actor-run "Direct link to Can I put a cap on the cost of a single Actor run?") Yes, when starting an Actor run, you can define a maximum limit on the cost of that run. When the Actor reaches the defined limit, it should terminate gracefully. Even if it doesn't, for any reason, and keeps producing results, we always make sure you are never charged more than your defined limit.
![Pay per event Actor - max charge per run](/assets/images/pay_per_event_price_on_run_detail-9bc0644998cb9c66b60f1b5f5be18d75.png) ##### How do I raise a dispute if the charges for an Actor seem off?[​](#how-do-i-raise-a-dispute-if-the-charges-for-an-actor-seem-off "Direct link to How do I raise a dispute if the charges for an Actor seem off?") In such a case, please do not hesitate to contact the Actor author or our support team. If you suspect a bug in the Actor, you can also always create an issue on the Actor's detail page in Apify Console. ### Pay per usage[​](#pay-per-usage "Direct link to Pay per usage") When you use a pay per usage Actor, you are only charged for the platform usage that the runs of this Actor generate. [Platform usage](https://docs.apify.com/platform/actors/running/usage-and-resources.md) includes components such as compute units, operations on [storages](https://docs.apify.com/platform/storage.md), and usage of [residential proxies](https://docs.apify.com/platform/proxy/residential-proxy.md) or [SERPs](https://docs.apify.com/platform/proxy/google-serp-proxy.md). ![Pay for usage Actor example](/assets/images/pay-per-usage-actor-example-bb2556d588b1b57227d30bc7362f1e2e.png) Estimating Actor usage cost With this model, it's very easy to see how many platform resources each Actor run consumed, but it is quite difficult to estimate the usage beforehand. The best way to find the costs of free Actors upfront is to try out the Actor on a limited scope (for example, on a small number of pages) and evaluate the consumption. You can easily do that using our [free plan](https://apify.com/pricing). *For more information on platform usage cost, see the [usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources.md) page.* ## Reporting issues with Actors[​](#reporting-issues-with-actors "Direct link to Reporting issues with Actors") Each Actor has an **Issues** tab in Apify Console. There, you can open an issue (ticket) and chat with the Actor's author, platform admins, and other users of this Actor. Please feel free to use the tab to ask any questions, request new features, or give feedback. Alternatively, you can always write to our support team. ![Paid Actors\' issues tab](/assets/images/paid-actors-issues-tab-934e13f39d178b1c8b368944afa089e2.png) ## Apify Store discounts[​](#apify-store-discounts "Direct link to Apify Store discounts") Each Apify subscription plan includes a discount tier (*BRONZE*, *SILVER*, *GOLD*) that provides access to increasingly lower prices on selected Actors. Discount participation Discount offers are optional and determined by Actor owners. Not all Actors participate in the discount program. Additional discounts are available for Enterprise customers. To check an Actor's pricing and available discounts, visit the Pricing section on the Actor's detail page in the Apify Store. ![Apify Store discounts](/assets/images/apify_store_discounts_web-b62495f120b256aa2bb504623c61d4fb.png) In the Apify Console, you can find information about pricing and available discounts in the Actor's header section. ![Apify Store discounts](/assets/images/apify_store_discounts_console-6c53137e2ff29b82b236cdca97af0f6a.png) ![Apify Store discounts full table](/assets/images/apify_store_discounts_full_table-0cc75054fa93d3c85dfcacea2e5df250.png) --- # Input and output **Configure your Actor's input parameters using Apify Console, locally or via API.
Access parameters in key-value stores from your Actor's code.** *** ## Input[​](#input "Direct link to Input") Each Actor accepts input, which tells it what to do. You can run an Actor using the [Apify Console](https://console.apify.com) UI, then configure the input using the autogenerated UI: ![Input UI](/assets/images/apify-input-eeec3989b5a1ed4bb84e06982e6b3068.png) When running an Actor using the [API](https://docs.apify.com/api/v2) you can pass the same input as the JSON object. In this case, the corresponding JSON input looks as follows: ``` { "maxRequestsPerCrawl": 10, "proxy": { "useApifyProxy": true }, "startUrl": "https://apify.com" } ``` ### Options - Build, Timeout, and Memory[​](#options---build-timeout-and-memory "Direct link to Options - Build, Timeout, and Memory") As part of the input, you can also specify run options such as [Build](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md), Timeout, and [Memory](https://docs.apify.com/platform/actors/running/usage-and-resources.md) for your Actor run. ![Run options](/assets/images/actor-options-730f220a8f7037928bf45207b1dc9876.png) | Option | Description | | ------- | --------------------------------------------------------------------------- | | Build | Tag or number of the build to run (e.g. **latest** or **1.2.34**). | | Timeout | Timeout for the Actor run in seconds. Zero value means there is no timeout. | | Memory | Amount of memory allocated for the Actor run, in megabytes. | ## Output[​](#output "Direct link to Output") While the input object provides a way to instruct Actors, an Actor can also generate an output, usually stored in its default [Dataset](https://docs.apify.com/platform/storage/dataset.md), but some additional files might be stored in its [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md). Always read the Actor's README to learn more about its output. For more details about storages, visit the [Storage](https://docs.apify.com/platform/storage.md) section. You can quickly access the Actor's output from the run detail page: ![Actor output](/assets/images/actor-output-38f68053e9dbc8cd70ce87cb436f68e7.png) And to access all the data associated with the run, see the **Storage** tab, where you can explore the Actor's default [Dataset](https://docs.apify.com/platform/storage/dataset.md), [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md), and [Request queue](https://docs.apify.com/platform/storage/request-queue.md): ![Actor output](/assets/images/actor-storage-1282f5f5988729d3cd25e82d1f7a7925.png) You can also use [API](https://docs.apify.com/api/v2) to retrieve the output. To learn more about this, read the [Run an Actor or task and retrieve data via API](https://docs.apify.com/academy/api/run-actor-and-retrieve-data-via-api.md) tutorial. --- # Runs and builds **Learn about Actor builds and runs, their lifecycle, sharing, and data retention policy.** *** ## Builds[​](#builds "Direct link to Builds") An Actor is a combination of source code and various settings in a Docker container. To run, it needs to be built. An Actor build consists of the source code built as a Docker image, making the Actor ready to run on the Apify platform. What is Docker image? A Docker image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. For more information visit Docker's [site](https://www.docker.com/resources/what-container/). 
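When you start an Actor programmatically, you can pin a specific build in the same way, alongside the **Timeout** and **Memory** run options described earlier. Below is a minimal sketch using the JavaScript API client; the Actor ID, token, and option values are illustrative placeholders, not recommended defaults:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-API-TOKEN' });

// Run a specific tagged build with a custom timeout and memory allocation.
const run = await client.actor('username/my-actor').call(
    { startUrl: 'https://apify.com' }, // Actor input
    {
        build: 'latest', // build tag or number, e.g. '1.2.34'
        timeout: 300, // run timeout in seconds; 0 means no timeout
        memory: 1024, // memory allocation in megabytes
    },
);

console.log(`Run ${run.id} finished with status ${run.status}`);
```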
With every new version of an Actor, a new build is created. Each Actor build has its number (for example, **1.2.34**), and some builds are tagged for easier use (for example, *latest* or *beta*). When running an Actor, you can choose what build you want to run by selecting a tag or number in the run options. ![Actor run options](/assets/images/actor-run-options-100b9aca271d49f64e9ddc3b42c98c5e.png) Each build may have different features, input, or output. By fixing the build to an exact version, you can ensure that you won't be affected by a breaking change in a new Actor version. However, you will lose updates. ## Runs[​](#runs "Direct link to Runs") When you start an Actor, an Actor run is created. An Actor run is a Docker container created from the build's Docker image with dedicated resources (CPU, memory, disk space). For more on this topic, see [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources.md). Each run has its own (default) [storages](https://docs.apify.com/platform/storage.md) assigned, which it may but not necessarily need to use: * [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md) containing the input and enabling Actor to store other files. * [Dataset](https://docs.apify.com/platform/storage/dataset.md) enabling Actor to store the results. * [Request queue](https://docs.apify.com/platform/storage/request-queue.md) to maintain a queue of URLs to be processed. What's happening inside of an Actor is visible on the Actor run log in the Actor run detail: ![Actor run](/assets/images/actor-run-detail-4756ea99ce6c72a2253155e56a5b9f65.png) ### Origin[​](#origin "Direct link to Origin") Both **Actor runs** and **builds** have the **Origin** field indicating how the Actor run or build was invoked, respectively. The origin is displayed in Apify Console and available via [API](https://docs.apify.com/api/v2/actor-run-get) in the `meta.origin` field. | Name | Origin | | ------------- | ------------------------------------------------------------------------------- | | `DEVELOPMENT` | Manually from Apify Console in the Development mode (own Actor) | | `WEB` | Manually from Apify Console in "normal" mode (someone else's Actor or task) | | `API` | From [Apify API](https://docs.apify.com/api) | | `CLI` | From [Apify CLI](https://docs.apify.com/cli/) | | `SCHEDULER` | Using a schedule | | `WEBHOOK` | Using a webhook | | `ACTOR` | From another Actor run | | `STANDBY` | From [Actor Standby](https://docs.apify.com/platform/actors/running/standby.md) | ## Lifecycle[​](#lifecycle "Direct link to Lifecycle") Each run and build starts with the initial status **READY** and goes through one or more transitional statuses to one of the terminal statuses. *** | Status | Type | Description | | ---------- | ------------ | ------------------------------------------- | | READY | initial | Started but not allocated to any worker yet | | RUNNING | transitional | Executing on a worker machine | | SUCCEEDED | terminal | Finished successfully | | FAILED | terminal | Run failed | | TIMING-OUT | transitional | Timing out now | | TIMED-OUT | terminal | Timed out | | ABORTING | transitional | Being aborted by the user | | ABORTED | terminal | Aborted by the user | ### Aborting runs[​](#aborting-runs "Direct link to Aborting runs") You can abort runs with the statuses **READY**, **RUNNING**, or **TIMING-OUT** in two ways: * *Immediately* - this is the default option. The Actor process is killed immediately with no grace period. 
* *Gracefully* - the Actor run receives a signal about aborting via the `aborting` event and is granted a 30-second window to finish in-progress tasks before getting aborted. This is helpful in cases where you plan to resurrect the run later because it gives the Actor a chance to persist its state. When resurrected, the Actor can restart where it left off. You can abort a run in Apify Console using the **Abort** button or via API using the [Abort run](https://docs.apify.com/api/v2/actor-run-abort-post.md) endpoint. ### Resurrection of finished run[​](#resurrection-of-finished-run "Direct link to Resurrection of finished run") Any Actor run in a terminal state, i.e., a run with the status **SUCCEEDED**, **FAILED**, **ABORTED**, or **TIMED-OUT**, might be resurrected back to a **RUNNING** state. This is helpful in many cases, for example, when the timeout for an Actor run was too low or in case of an unexpected error. The whole process of resurrection looks as follows: * Run status will be updated to **RUNNING**, and its container will be restarted with the same storage (the same behavior as when the run gets migrated to a new server). * Updated duration will not include the time when the Actor was not running. * Timeout will be counted from the point when this Actor run was resurrected. Resurrection can be performed in Apify Console using the **resurrect** button or via API using the [Resurrect run](https://docs.apify.com/api/v2/act-run-resurrect-post.md) API endpoint. Settings adjustments You can also adjust the timeout and memory or change the Actor build before the resurrection. This is especially helpful in case of an error in the Actor's source code as it enables you to: 1. Abort a broken run 2. Update the Actor's code and build a new version 3. Resurrect the run using the new build ### Data retention[​](#data-retention "Direct link to Data retention") Apify securely stores your ten most recent runs indefinitely, ensuring your records are always accessible. All **Actor runs** beyond the latest ten are deleted along with their default storages (Key-value store, Dataset, Request queue) after the data retention period based on your [subscription plan](https://apify.com/pricing). **Actor builds** are deleted only when they are *not tagged* and have not been used for over 90 days. ## Sharing[​](#sharing "Direct link to Sharing") Share your Actor runs with other Apify users via the [access rights](https://docs.apify.com/platform/collaboration.md) system. --- # Standby mode **Use Actors in lightweight Standby mode for fast API responses.** *** Traditional Actors are designed to run a single job and then stop. They're mostly intended for batch jobs, such as when you need to perform a large scrape or data processing task. However, in some applications, waiting for an Actor to start is not an option. Actor Standby mode solves this problem by letting you have the Actor ready in the background, waiting for incoming HTTP requests. In a sense, the Actor behaves like a real-time web server or a standard API server. ## How do I know if Standby mode is enabled[​](#how-do-i-know-if-standby-mode-is-enabled "Direct link to How do I know if Standby mode is enabled") You will know that the Actor is enabled for Standby mode if you see the **Standby** tab on the Actor's detail page. In the tab, you will find the hostname of the server and, in the Actor README, a description of the Actor's endpoints, the parameters they accept, and what they return.
To use the Actor in Standby mode, you don't need to click a start button or do anything else. Simply use the provided hostname and endpoint in your application, hit the API endpoint, and get results. ![Standby tab](/assets/images/standby-tab-be2a89c92ef176b75d93f573b51e4b03.png) ## How do I pass input to Actors in Standby mode[​](#how-do-i-pass-input-to-actors-in-standby-mode "Direct link to How do I pass input to Actors in Standby mode") If you're using an Actor built by someone else, see its Information tab to find out how the input should be passed. Generally speaking, Actors in Standby mode behave as standard HTTP servers. You can use any of the existing [HTTP request methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) like GET, POST, PUT, DELETE, etc. You can pass the input via [HTTP request query string](https://en.wikipedia.org/wiki/Query_string) or via [HTTP request body](https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages#body). ## How do I authenticate my requests[​](#how-do-i-authenticate-my-requests "Direct link to How do I authenticate my requests") To authenticate requests to Actor Standby, follow the same process as [authenticating requests to the Apify API](https://docs.apify.com/platform/integrations/api.md). You can provide your [API token](https://docs.apify.com/platform/integrations/api.md#api-token) in one of two ways: 1. *Recommended*: Include the token in the `Authorization` header of your request as `Bearer <token>`. This approach is recommended because it prevents your token from being logged in server logs. ``` curl -H "Authorization: Bearer my_apify_token" \ https://rag-web-browser.apify.actor/search?query=apify ``` 2. Append the token as a query parameter named `token` to the request URL. This approach can be useful if you cannot modify the request headers. ``` https://rag-web-browser.apify.actor/search?query=apify&token=my_apify_token ``` tip You can use [scoped tokens](https://docs.apify.com/platform/integrations/api.md#limited-permissions) to send standby requests. This is useful for allowing third-party services to interact with your Actor without granting access to your entire account. However, [restricting what an Actor can access](https://docs.apify.com/platform/integrations/api.md#restricted-access-restrict-what-actors-can-access-using-the-scope-of-this-actor) using a scoped token is not supported when running in Standby mode. ## Can I still run the Actor in normal mode[​](#can-i-still-run-the-actor-in-normal-mode "Direct link to Can I still run the Actor in normal mode") Yes, you can still modify the input and click the Start button to run the Actor in normal mode. However, note that the Standby Actor might not support this mode; the run might fail or return empty results. The normal mode is always supported in Standby Beta, even for Actors that don't handle it well. Please head to the Actor README to learn more about the capabilities of your chosen Actor. ## Is there any scaling to accommodate the incoming requests[​](#is-there-any-scaling-to-accommodate-the-incoming-requests "Direct link to Is there any scaling to accommodate the incoming requests") When you use the Actor in Standby mode, the system automatically scales the Actor to accommodate the incoming requests. Under the hood, the system starts new Actor runs, which you will see in the Actor runs tab, with the origin set to Standby.
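To tie the above together, here is a minimal sketch of calling a Standby Actor from JavaScript. It reuses the endpoint from the authentication example, passes the input as query-string parameters, and sends the token in the `Authorization` header; it assumes this particular endpoint responds with JSON, which may differ for other Actors:

```
// Minimal sketch: call a Standby Actor endpoint with fetch.
const url = new URL('https://rag-web-browser.apify.actor/search');
url.searchParams.set('query', 'apify'); // input passed via the query string

const response = await fetch(url, {
    headers: { Authorization: 'Bearer my_apify_token' }, // recommended auth method
});

// The response format depends on the Actor; this endpoint is assumed to return JSON.
const results = await response.json();
console.log(results);
```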
## What is the timeout for incoming requests[​](#what-is-the-timeout-for-incoming-requests "Direct link to What is the timeout for incoming requests") For requests sent to an Actor in Standby mode, the maximum time allowed until receiving the first response is *5 minutes*. This represents the overall timeout for the operation. ## What is the rate limit for incoming requests[​](#what-is-the-rate-limit-for-incoming-requests "Direct link to What is the rate limit for incoming requests") The rate limit for incoming requests to a Standby Actor is *2000 requests per second* per user account. ## How do I customize Standby configuration[​](#how-do-i-customize-standby-configuration "Direct link to How do I customize Standby configuration") The Standby configuration currently consists of the following properties: * **Max requests per run** - The maximum number of concurrent HTTP requests a single Standby Actor run can accept. If this limit is exceeded, the system starts a new Actor run to handle the request, which may take a few seconds. * **Desired requests per run** - The number of concurrent HTTP requests a single Standby Actor run is configured to handle. If this limit is exceeded, the system preemptively starts a new Actor run to handle the additional requests. * **Memory (MB)** - The amount of memory (RAM) allocated for the Actor in Standby mode, in megabytes. With more memory, the Actor can typically handle more requests in parallel, but this also increases the number of compute units consumed and the associated cost. * **Idle timeout (seconds)** - If a Standby Actor run doesn’t receive any HTTP requests within this time, the system will terminate the run. When a new request arrives, the system might need to start a new Standby Actor run to handle it, which can take a few seconds. A higher idle timeout improves responsiveness but increases costs, as the Actor remains active for a longer period. * **Build** - The Actor build that the runs of the Standby Actor will use. Can be either a build tag (e.g. `latest.`), or a build number (e.g. `0.1.2`). You can see these in the Standby tab of the Actor detail page. However, note that these properties are not configurable at the Actor level. If you wish to use the Actor-level hostname, this will always use the default configuration. To override this configuration, just create a new Task from the Actor. You can then head to the Standby tab of the created Task and modify the configuration as needed. Note that the task has a specific hostname, so make sure to use that in your application if you wish to use the custom configuration. ## Are the Standby runs billed differently[​](#are-the-standby-runs-billed-differently "Direct link to Are the Standby runs billed differently") No, the Standby runs are billed in the same fashion as the normal runs. However, running Actors in Standby mode might have unexpected costs, as the Actors run in the background and consume resources even when no requests are being sent until they are terminated after the idle timeout period. ## Are the Standby runs shared among users[​](#are-the-standby-runs-shared-among-users "Direct link to Are the Standby runs shared among users") No, even if you use the Actor-level hostname with the default configuration, the background Actor runs for your requests are not shared with other users. 
## How can I develop Actors using Standby mode[​](#how-can-i-develop-actors-using-standby-mode "Direct link to How can I develop Actors using Standby mode") See the [Actor Standby development section](https://docs.apify.com/platform/actors/development/programming-interface/standby.md). --- # Actor tasks **Create and save reusable configurations of Apify Actors tailored to specific use cases.** *** Actor tasks let you create multiple reusable configurations of a single Actor, adapted for specific use cases. For example, you can create one [*Web Scraper*](https://apify.com/apify/web-scraper) configuration (task) that scrapes the latest reviews from imdb.com, another that scrapes nike.com for the latest sneakers, and a third that scrapes your competitor's e-shop. You can then use and reuse these configurations directly from [Apify Console](https://console.apify.com/actors/tasks), [Schedules](https://docs.apify.com/platform/schedules.md), or [API](https://docs.apify.com/api/v2/actor-task-runs-post.md). You can find all your tasks in the [Apify Console](https://console.apify.com/actors/tasks). ## Create[​](#create "Direct link to Create") To create a task, open any Actor from [Apify Store](https://console.apify.com/store) or your list of [Actors](https://console.apify.com/actors) in Apify Console. At the top-right section of the page, click the **Create task** button. ![Create a new Apify task](/assets/images/tasks-create-task-fe2022d6fab46890d47ca528749cd4c1.png) ## Configure[​](#configure "Direct link to Configure") You can set up your task's input under the **Input** tab. A task's input configuration works just like an Actor's. After all, it's just a copy of an Actor you can pre-configure for a specific scenario. You can use either JSON or the visual input UI. ![Apify task configuration](/assets/images/tasks-create-configure-c3a0cc4d2e00baeee1d9e29fd1ac2ec1.png) An Actor's input fields may vary depending on its purpose, but they all follow the same principle: *you provide an Actor with the information it needs so it can do what you want it to do.* You can set run options such as timeout and [memory](https://docs.apify.com/platform/actors/running/usage-and-resources.md) in the **Run options** tab of the task's input configuration. ### Naming[​](#naming "Direct link to Naming") To make a task easier to identify, you can give it a name, title, and description by clicking its caption on the detail page. A task's name should be at least `3` characters long, with a limit of `63` characters. ## Run[​](#run "Direct link to Run") Once you've configured your task, you can run it using the **Start** button on the top-right side of the screen. ![Run an Apify task](/assets/images/tasks-start-button-10c64e3fbc13d906e0498c44c0857e12.png) Or use the **Start** button positioned below the input configuration. ![Run an Apify task v2](/assets/images/tasks-start-after-configuration-22843067b3a7207ec59002fa909985af.png) You can also run tasks using: * [Schedules](https://docs.apify.com/platform/schedules.md). * Directly via the [Apify API](https://docs.apify.com/api/v2/actor-task-runs-post.md). * The [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/TaskClient). * The [Python API client](https://docs.apify.com/api/client/python/reference/class/TaskClient) (a minimal example is sketched at the end of this page). ## Share[​](#share "Direct link to Share") Like any other resource, you can share your Actor tasks with other Apify users via the [access rights](https://docs.apify.com/platform/collaboration.md) system.
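As a rough illustration of running a saved task programmatically, here is a minimal sketch using the Python API client mentioned in the Run section above. The task name `username~my-task` and the token value are placeholders, and the result fields depend on the Actor behind the task.

```python
from apify_client import ApifyClient

# Authenticate with your API token (available in Apify Console under Integrations).
client = ApifyClient("MY_APIFY_TOKEN")

# "username~my-task" is a placeholder - replace it with your task's ID or unique name.
# call() starts the task and waits for the run to finish.
run = client.task("username~my-task").call()

# Read the results from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```

If you only want to trigger the run without waiting for it to finish, the client also offers a `start()` method on the same task object.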
--- # Usage and resources **Learn about your Actors' memory and processing power requirements, their relationship with Docker resources, minimum requirements for different use cases, and their impact on cost.** *** ## Resources[​](#resources "Direct link to Resources") [Actors](https://docs.apify.com/platform/actors.md) run in [Docker containers](https://www.docker.com/resources/what-container/), which have a [limited amount of resources](https://phoenixnap.com/kb/docker-memory-and-cpu-limit) (memory, CPU, disk size, etc.). When starting, the Actor needs to be allocated a certain share of those resources, such as the CPU capacity necessary for the Actor to run. ![Setting an Actor\'s memory](/assets/images/memory-settings-6bb15c0d4061ce772fb90e677fa29b04.png) Assigning an Actor a specific **Memory** capacity also determines the allocated CPU power and its disk size. Check out the [Limits](https://docs.apify.com/platform/limits.md) page for detailed information on Actor memory, CPU limits, disk size and other limits. ### Memory[​](#memory "Direct link to Memory") When invoking an Actor, the caller must specify the memory allocation for the Actor run. The memory allocation must follow these requirements: * It must be a power of 2. * The minimum allowed value is `128MB` * The maximum allowed value is `32768MB` * Acceptable values include: `128MB`, `256MB`, `512MB`, `1024MB`, `2048MB`, `4096MB`, `8192MB`, `16384MB`, and `32768MB` Additionally, each user has a certain total limit of memory for running Actors. The sum of memory allocated for all running Actors and builds needs to be within this limit, otherwise the user cannot start a new Actor. For more details, see [limits](https://docs.apify.com/platform/limits.md). ### CPU[​](#cpu "Direct link to CPU") The CPU allocation for an Actor is automatically computed based on the assigned memory, following these rules: * For every `4096MB` of memory, the Actor receives one full CPU core * If the memory allocation is not a multiple of `4096MB`, the CPU core allocation is calculated proportionally * Examples: * `512MB` = 1/8 of a CPU core * `1024MB` = 1/4 of a CPU core * `8192MB` = 2 CPU cores #### CPU usage spikes[​](#cpu-usage-spikes "Direct link to CPU usage spikes") ![A usage spike on an Actor\'s start-up](/assets/images/memory-cpu-usage-spike-4fcc3b31cdda7f7c59cc4e7ea4492d86.png) Sometimes, you see the Actor's CPU use go over 100%. This is not unusual. To help an Actor start up faster, it is allocated a free CPU boost. For example, if an Actor is assigned 1GB (25% of a core), it will temporarily be allowed to use 100% of the core, so it gets started quicker. ### Disk[​](#disk "Direct link to Disk") The Actor has hard disk space limited by twice the amount of memory. For example, an Actor with `1024MB` of memory will have `2048MB` of disk available. ## Requirements[​](#requirements "Direct link to Requirements") Actors built with [Crawlee](https://crawlee.dev/) use autoscaling. This means that they will always run as efficiently as they can based on the allocated memory. If you double the allocated memory, the run should be twice as fast and consume the same amount of compute units (1 \* 1 = 0.5 \* 2). A good middle ground is `4096MB`. If you need the results faster, increase the memory (bear in mind the [next point](#maximum-memory), though). You can also try decreasing it to lower the pressure on the target site. Autoscaling only applies to solutions that run multiple tasks (URLs) for at least 30 seconds.
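As a quick numerical reference, the allocation and billing rules described in the Resources section above (one CPU core per `4096MB` of memory, disk capped at twice the memory, and compute units equal to memory in gigabytes multiplied by run duration in hours) can be sketched in a few lines of Python. This is just the arithmetic already stated on this page, not an official API.

```python
def cpu_cores(memory_mb: int) -> float:
    """One full CPU core per 4096 MB of memory, allocated proportionally."""
    return memory_mb / 4096


def disk_mb(memory_mb: int) -> int:
    """Disk space is limited to twice the allocated memory."""
    return 2 * memory_mb


def compute_units(memory_mb: int, duration_hours: float) -> float:
    """CU = memory in gigabytes multiplied by run duration in hours."""
    return (memory_mb / 1024) * duration_hours


print(cpu_cores(512))             # 0.125 -> 1/8 of a CPU core
print(disk_mb(1024))              # 2048  -> MB of disk available
print(compute_units(1024, 1))     # 1.0   -> 1 CU for 1024MB running 1 hour
print(compute_units(4096, 0.25))  # 1.0   -> a 15-minute run at 4096MB
```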
If you need to scrape just one URL or use Actors like [Google Sheets](https://apify.com/lukaskrivka/google-sheets) that do just a single isolated job, we recommend you lower the memory. If the Actor doesn't provide this information, or if you are using your own solution, just run it the way you intend to use it long term. Let's say that you want to scrape the data **every hour for the whole month**. You set up a reasonable memory allocation like `4096MB`, and the whole run takes 15 minutes. That should consume 1 CU (4 \* 0.25 = 1). Now, you just need to multiply that by the number of hours in the day and by the number of days in the month, and you get an estimated usage of 720 (1 \* 24 \* 30) CUs monthly. Estimating usage Check out our article on [estimating consumption](https://help.apify.com/en/articles/3470975-how-to-estimate-compute-unit-usage-for-your-project) for more details. ### Memory requirements[​](#memory-requirements "Direct link to Memory requirements") Each use case has its own memory requirements. The larger and more complex your project, the more memory/CPU power it will require. Some examples of minimum requirements are: * Actors using [Puppeteer](https://pptr.dev/) or [Playwright](https://playwright.dev/) for real web browser rendering require at least `1024MB` of memory. * Large and complex sites like [Google Maps](https://apify.com/compass/crawler-google-places) require at least `4096MB` for optimal speed and [concurrency](https://crawlee.dev/api/core/class/AutoscaledPool#minConcurrency). * Projects involving a large amount of data in memory. ### Maximum memory[​](#maximum-memory "Direct link to Maximum memory") Apify Actors are most commonly written in [Node.js](https://nodejs.org/en/), which uses a [single thread process](https://dev.to/arealesramirez/is-node-js-single-threaded-or-multi-threaded-and-why-ab1). Unless you use external binaries such as the Chrome browser, Puppeteer, Playwright, or other multi-threaded libraries, you will not gain more CPU power from assigning your Actor more than `4096MB` of memory because Node.js cannot use more than 1 core. In other words, giving a [Cheerio-based crawler](https://apify.com/apify/cheerio-scraper) `16384MB` of memory (4 CPU cores) will not improve its performance, because these crawlers cannot use more than 1 CPU core. Multi-threaded Node.js configuration It's possible to [use multiple threads in a Node.js-based Actor](https://dev.to/reevranj/multiple-threads-in-nodejs-how-and-what-s-new-b23) with some configuration. This can be useful if you need to offload a part of your workload. ## Usage[​](#usage "Direct link to Usage") When you run an Actor, it generates platform usage that's charged to the user account. Platform usage comprises four main parts: * **Compute units**: CPU and memory resources consumed by the Actor. * **Data transfer**: The amount of data transferred between the web, Apify platform, and other external systems. * **Proxy costs**: Residential or SERP proxy usage. * **Storage operations**: Read, write, and other operations performed on the Key-value store, Dataset, and Request queue. The platform usage can be represented either in raw units (e.g. gigabytes for data transfer, or number of writes for dataset operations), or in their dollar equivalents. To view the usage of an Actor run, navigate to the **Runs** section and check out the **Usage** column.
![Runs usage](/assets/images/usage-and-resources-runs-usage-740cbbdcca94e0fc566fee6a881f7f64.png) For a more detailed breakdown, click on the specific run you want to examine and then on the **?** icon next to the **Usage** label. ![Actors run usage details](/assets/images/usage-and-resources-runs-usage-details-ff74e9247c2ad3cedc01c4405bc2dbde.png) Usage billing elements For technical reasons, when viewing the usage in dollars for a specific historical Actor run or build in the API or Apify Console, your current service pricing is used to compute the dollar amount. This should be used for informational purposes only. For detailed information, FAQ, and pricing, check out the [platform pricing page](https://apify.com/pricing). ### What is a compute unit[​](#what-is-a-compute-unit "Direct link to What is a compute unit") A compute unit (CU) is the unit of measurement for the resources consumed by Actor runs and builds. You are charged for using Actors based on CU consumption. For example, running an Actor with `1024MB` of allocated memory for 1 hour will consume 1 CU. The cost of this CU depends on your subscription plan. You can check each Actor run's exact CU usage in the run's details. ![An Actor run\'s platform usage](/assets/images/actor-usage-3702e474a20ec5c0deeb289a954c6060.png) You can [find a summary of your overall platform and CU usage](https://console.apify.com/billing) in the **Billing** section of Apify Console. #### Compute unit calculation[​](#compute-unit-calculation "Direct link to Compute unit calculation") CUs are calculated by multiplying two factors: * **Memory** (MB) - The size of the allocated server for your Actor or task run. * **Duration** (hours) - The duration for which the server is used (Actor or task run). For example, if your run took 6 minutes, you would use 0.1 (hours) as the second number to calculate CUs. The minimum granularity is a second. Example: *1024MB memory x 1 hour = 1 CU* ### What determines consumption[​](#what-determines-consumption "Direct link to What determines consumption") The factors that influence resource consumption, in order of importance, are: * *Browser vs. Plain HTTP*: Launching a browser (e.g., [Puppeteer](https://pptr.dev/)/[Playwright](https://playwright.dev/)) is resource-intensive and slower compared to working with plain HTML ([Cheerio](https://cheerio.js.org/)). Using Cheerio can be up to *20 times* faster. * *Run size and frequency*: Large runs can use full resource scaling and are not subjected to repeated Actor start-ups (as opposed to many short runs). Whenever possible, opt for larger batches. * *Page type*: Heavy pages, such as Amazon or Facebook, will take more time to load regardless of whether you use a browser or Cheerio. Large pages can take up to *3 times* more resources to load and parse than average pages. You can check out our [article on estimating consumption](https://help.apify.com/en/articles/3470975-how-to-estimate-compute-unit-usage-for-your-project) for more details on what determines consumption. --- # Collaboration **Learn how to collaborate with other users and manage permissions for organizations or private resources such as Actors, Actor runs, and storages.** *** By default, each system resource (Actor, key-value store, run, ...) you create is only available to you, the owner.
You can grant access to your resources in three ways: | | | | ------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | **[Access rights](https://docs.apify.com/platform/collaboration/access-rights.md)** | Enables you to grant access to another user for a certain resource you own. This way, you can share results with your client, or two engineers can collaborate on developing one Actor. | | **[Organization account](https://docs.apify.com/platform/collaboration/organization-account.md)** | Apify's organization account allows multiple engineers to collaborate on team projects with role-specific access permissions. | | **[Publishing in Apify Store](https://docs.apify.com/platform/actors/publishing.md)** | Another way to share your Actor with other users is to publish it in [Apify Store](https://apify.com/store). When publishing your Actor, you can make it a Paid Actor and get paid by the users benefiting from your tool. For more information, read the [publishing and monetization](https://docs.apify.com/platform/actors/publishing.md) section. | --- # Access rights **Manage permissions for your private resources such as Actors, Actor runs, and storages. Allow other users to read, run, modify, or build new versions.** *** You can easily and securely share your own resources - Actors, tasks, key-value stores, datasets, and request queues - with other users by using a [granular](https://www.google.com/search?q=define+granular+permissions) permissions system. This enables you, for example, to let your colleague run an [Actor](https://docs.apify.com/platform/actors.md) or view a [dataset](https://docs.apify.com/platform/storage/dataset.md) but not modify it. You can also grant permission to update an Actor and build a new version. [Storages](https://docs.apify.com/platform/storage.md) (key-value stores, request queues, and datasets) are sharable in the same way, with either a **read** permission or a combination of both **read** and **write** permissions. To share an Actor, task, key-value store, request queue, or dataset, click on the **Actions** button in the top-right corner and select **Share**. You can add a collaborator by using their **user ID**, **email**, or **username**. Once you have added the collaborator, you can configure their permissions. ![Access rights configuration](/assets/images/access-rights-76dfb5c1630309ea547eeb6fbde3b509.png) --- # List of permissions **Learn about the access rights you can grant to other users. See a list of all access options for Apify resources such as Actors, Actor runs/tasks, and storage.** *** This document contains all the access options that can be granted to resources on the Apify platform. ## Actors[​](#actors "Direct link to Actors") Click [here](https://docs.apify.com/platform/actors.md) to learn about Apify Actors. ### Actor[​](#actor "Direct link to Actor") | Permission | Description | | -------------------- | ---------------------------------------------------------- | | Read | View Actor settings, source code and builds. | | Write | Edit Actor settings and source code, and delete the Actor. | | Run | Run any of an Actor's builds.
| | View runs | View a list of Actor runs and their details. | | Manage access rights | Manage Actor access rights. | ### Actor task[​](#actor-task "Direct link to Actor task") | Permission | Description | | -------------------- | ---------------------------------------------------------- | | Read | View task configuration. | | Write | Edit task configuration and settings, and delete the task. | | View runs | View a list of Actor task runs and their details. | | Manage access rights | Manage Actor task access rights. | To learn about Actor tasks, see the [documentation](https://docs.apify.com/platform/actors/running/tasks.md). ## Storage[​](#storage "Direct link to Storage") For more information about Storage, see its [documentation](https://docs.apify.com/platform/storage.md). ### Dataset[​](#dataset "Direct link to Dataset") | Permission | Description | | -------------------- | --------------------------------------------------------------- | | Read | View dataset information and its data. | | Write | Edit dataset settings, push data to it, and remove the dataset. | | Manage access rights | Manage dataset access rights. | To learn about dataset storage, see its [documentation](https://docs.apify.com/platform/storage/dataset.md). ### Key-value-store[​](#key-value-store "Direct link to Key-value-store") | Permission | Description | | -------------------- | ------------------------------------------------------------------------------------------------- | | Read | View key-value store details and records. | | Write | Edit key-value store settings, add, update or remove its records, and delete the key-value store. | | Manage access rights | Manage key-value store access rights. | To learn about key-value stores, see the [documentation](https://docs.apify.com/platform/storage/key-value-store.md). ### Request queue[​](#request-queue "Direct link to Request queue") | Permission | Description | | -------------------- | ---------------------------------------------------------------------------------------------- | | Read | View request queue details and records. | | Write | Edit request queue settings, add, update, or remove its records, and delete the request queue. | | Manage access rights | Manage request queue access rights. | To learn about request queue storage, see the [documentation](https://docs.apify.com/platform/storage/request-queue.md). ## Proxy[​](#proxy "Direct link to Proxy") | Permission | Description | | ---------- | ------------------------- | | Proxy | Allow to use Apify Proxy. | To learn about Apify Proxy, see its [documentation](https://docs.apify.com/platform/proxy.md). ## User permissions[​](#user-permissions "Direct link to User permissions") Permissions that can be granted to members of organizations. To learn about the organization account, see its [documentation](https://docs.apify.com/platform/collaboration/organization-account.md). | Permission | Description | | ------------------- | --------------------------------------------------------------------- | | Manage access keys | Manage account access keys, i.e. API token and proxy password. | | Update subscription | Update the type of subscription, billing details and payment methods. | | Update profile | Make changes in profile information. | | Update email | Update the contact email for the account. | | Reset password | Reset the account's password. | | View invoices | See the account's invoices. | | Manage organization | Change the organization's settings. 
| --- # Organization account **Create a specialized account for your organization to encourage collaboration and manage permissions. Convert an existing account, or create one from scratch.** *** Organization accounts allow groups to collaborate on projects. They enable you to manage your team members' [permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md) and to centralize your billing without having to share the credentials of a single personal account. You can [switch](https://docs.apify.com/platform/collaboration/organization-account/how-to-use.md) between your personal and organization accounts in just two clicks: in [Apify Console](https://console.apify.com), click the account button in the top-left corner, then select the organization. You can set up an organization in two ways. * [Create a new organization](#create-a-new-organization). If you don't have integrations set up yet, or if they are easy to change, you can create a new organization, preserving your personal account. * [Convert an existing account](#convert-an-existing-account) into an organization. If your Actors and [integrations](https://docs.apify.com/platform/integrations.md) are set up in a personal account, it is probably best to convert that account into an organization. This will preserve all your integrations but means you will have a new personal account created for you. > Prefer video to reading? [See our video tutorial](https://www.youtube.com/watch?v=BIL6HqtnvKk) for organization accounts. ## Availability and pricing[​](#availability-and-pricing "Direct link to Availability and pricing") The organization account is available on all our plans. [Visit our pricing page](https://apify.com/pricing) for more information. ## Create a new organization[​](#create-a-new-organization "Direct link to Create a new organization") You can create a new organization by clicking the **Create new organization** button under the **Organizations** tab in your [account](https://console.apify.com/account#/myorganizations). If you want the organization to have a separate email address (used for notifications), enter it here. Otherwise, leave the **email** field empty and the owner's email will be used for notifications. ![Create a new organization](/assets/images/create-new-org-2a391adfe3ab3f5d83355bcd406c4690.png) **You can own up to 5 organizations and be a member of as many as you need.** ## Convert an existing account[​](#convert-an-existing-account "Direct link to Convert an existing account") > **When you convert an existing user account into an organization,** > > * **You will no longer be able to sign in to the converted user account.** > * **An organization cannot be converted back to a personal account.** > * **During conversion, a new account (with the same login credentials) will be created for you. You can then use that account to [set up](https://docs.apify.com/platform/collaboration/organization-account/setup.md) the organization.** Before converting your personal account into an organization, make sure it has a **username**. An organization can't be a member of other organizations. If you want to convert your account to one, you'll first need to **leave all the organizations you are a part of**. Then, under the **Organizations** [tab](https://console.apify.com/account#/myorganizations), click the **Convert this user account to an organization** button.
![Convert your account to an organization](/assets/images/convert-to-organization-21647ca9838fde34a47fe7a088b7e0f7.png) Next, set the organization's name and click **Convert**. And that's it! Your personal account becomes the organization, and you will be logged out automatically. You can now log into your new personal account with the same credentials as you are currently logged in with. This applies to both **password** and **OAuth** methods. For information on [adding members and assigning roles](https://docs.apify.com/platform/collaboration/organization-account/setup.md), see the Setup page. ## Billing[​](#billing "Direct link to Billing") Actor and task runs are billed to the account they are started from. **Always make sure you start your runs from the correct account** to avoid having an organization's runs billed to your personal account. To find out about organization pricing, get in touch at [support@apify.com](mailto:support@apify.com?subject=Organization%20account%20pricing) or [visit the pricing page](https://apify.com/pricing). --- # Using the organization account **Learn to use and manage your organization account using the Apify Console or API. View the organizations you are in and manage your memberships.** *** Once an account becomes an organization, you can no longer log into it. Instead, you can switch into the organization from one of its member accounts to manage account information, memberships, and Actor runs. While you can't manage an organization account via [API](https://docs.apify.com/api/v2.md), you can still manage its runs and resources via API like you would with any other account. **[See our video tutorial](https://www.youtube.com/watch?v=BIL6HqtnvKk) on organization accounts.** ## In the Apify Console[​](#in-the-apify-console "Direct link to In the Apify Console") You can switch into **Organization account** view using the account button in the top-left corner. ![Switch to organization account](/assets/images/switch-to-organization-bc78bf615e803f6a5521c31f86bb66f0.png) In the menu, the account you are currently using is displayed at the top, with all the accounts you can switch to displayed below. When you need to get back to your personal account, you can just switch right back to it—no need to log in and out. The resources you can access and account details you can edit will depend on your [permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md) in the organization. > When switching between accounts, beware which account you start an Actor run in. If you accidentally start an organization's Actor run in your personal account, the run will be billed to your account (and vice versa). ### Manage your organizations[​](#manage-your-organizations "Direct link to Manage your organizations") You can view and manage the organizations you are a member of from the **Organizations** tab on your [account page](https://console.apify.com/account#/myorganization). If you want to leave an organization you own, you must first transfer ownership to someone else. ![My organizations](/assets/images/my-organizations-0b632703aa9f336c2de4beb03e228888.png) ### Transfer ownership[​](#transfer-ownership "Direct link to Transfer ownership") The organization, its Actors, and its integrations will keep running as they are. The original owner will either leave the organization or become a member with permissions defined by the new owner. Only the new owner will have complete access to the organization. 
## Via API[​](#via-api "Direct link to Via API") While you cannot manage an organization account's settings and members via API, you can access its Actor and task runs, webhooks, schedules, and storages just as you would with any other account. As a member of an organization, you are assigned an [API token](https://docs.apify.com/platform/integrations.md) (under the **Integrations** tab) and proxy password (click the **Proxy** button in the left menu) for accessing the Apify platform via REST API. ![Integration tokens](/assets/images/integrations-4dfa366351c8338af2f1f909b138a5a3.png) The API tokens' functionality reflects your account's permissions in the organization, so if you only have the **read** and **run** permissions for Actors, you will only be able to view and run Actors via API. Only you can view your API token and password. > Do not share your API token or password with untrusted parties. If you have the **manage organization access keys** permission, you are able to view and use organization-wide API tokens. These are shared across the organization, so everyone with the **manage organization access keys** permission can use them for organization-wide integrations. For a [detailed breakdown of all Apify API endpoints](https://docs.apify.com/api/v2.md) and help on using them, visit the API documentation. --- # Setup **Configure your organization account by inviting new members and assigning their roles. Manage team members' access permissions to the organization's resources.** *** After creating your organization, you can configure its settings. The **Account** tab allows you to: * Set the organization's email address * Change the username * Configure security settings * Delete the account The **Members** tab lets you update your organization's members and set its owner. In the **Account** tab's **Security** section, you can set security requirements for organization members. These include: * Maximum session lifespan * Two-factor authentication requirement **[See our video tutorial](https://www.youtube.com/watch?v=BIL6HqtnvKk) on organization accounts.** ## Add users to your organization[​](#add-users-to-your-organization "Direct link to Add users to your organization") You can add members to your organization in the **Members** tab. You can use their **User ID**, **username**, or **email**. When adding a member to the organization, you must assign them a **Role** so their permissions are known right away. ![Organization members](/assets/images/members-b430f7bb69da8dedebfa600a8a6be3c6.png) ## Define roles and permissions[​](#define-roles-and-permissions "Direct link to Define roles and permissions") Roles allow you to define permissions for your organization's resources by group. Every new organization comes with three pre-defined roles, which you can customize or remove. To edit the permissions for each role, click on the **Configure permissions** button in the top-right corner. ![Organization roles](/assets/images/roles-17d3d989136ea8f7066723685e2e9d24.png) > Each member can only have one role to avoid conflicting permissions. You can configure individual permissions for each resource type such as Actors, Actor tasks or storage. Bear in mind that if a user has the **read** permission for [storages](https://docs.apify.com/platform/storage.md), you cannot prevent them from accessing a particular storage (e.g. a certain [dataset](https://docs.apify.com/platform/storage.md)) - they will have access to all of the organization's storages. **Some permissions have dependencies**.
For example, if someone has the **Actor run** permission, it is likely they will also need the **storage write** permission, so they can store the results from their Actor runs. ![Configure permissions](/assets/images/configure-permissions-b8534ba955f18ce7ae5481e9e7457a5c.png) [See the full list of permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md) that can be granted to Apify resources. --- # Apify Console **Learn about Apify Console's easy account creation and user-friendly homepage for efficient web scraping management.** *** ## Sign-up[​](#sign-up "Direct link to Sign-up") To use Apify Console, you first need to create an account. To create it please go to the [sign-up page](https://console.apify.com/sign-up). At the moment we support 3 different methods of creating an account: ![Apify Console sign-up form](/assets/images/console-sign-up-form-46fd3325ba8e89ec2cf584d97323d97a.png) ### Email and password[​](#email-and-password "Direct link to Email and password") This is the most common way of creating an account. You just need to provide your email address and a password. The password needs to be at least 8 characters, should not be your email address, and should either contain special characters or be long enough to be secure. The password field has a strength indicator below it that will tell you if your password is strong enough. After you click the **Sign up** button, we will send you a verification email. The email contains a link that you need to click on or copy to your browser to proceed to automated email verification. After we verify your email, you will proceed to Apify Console. CAPTCHA We are using Google reCaptcha to prevent spam accounts. Usually, you will not see it, but if Google evaluates your browser as suspicious, they will ask you to solve a reCaptcha before we create your account and send you the verification email. If you did not receive the email, you can visit the [sign-in page](https://console.apify.com/sign-in). There, you will either proceed to our verification page right away, or you can sign in and will be redirected afterward. On the verification page, you can click on the **Resend verification email** button to send the email again. ![Apify Console email verification page](/assets/images/console-email-verification-page-cd52c88d6687d43ae7540a0576cafff4.png) ### Google or GitHub[​](#google-or-github "Direct link to Google or GitHub") If you do not want to create a new password for Apify Console, you can also use your Google or GitHub account to sign up. To do that, click the **Sign up with Google** or **Sign up with GitHub** buttons. You will proceed to the corresponding authentication page, where you must sign in and authorize Apify to access your account. Then, when you come back to Apify Console, we will create your account. If you used Google to sign up, there is no verification step after this, and you can start using Apify Console right away. On the other hand, if you used GitHub to create your account and your GitHub account does not have a verified email, we will ask you to verify your email address before you can start using Apify Console. The process works the same as the process for email and password authentication. If you already have an account in Apify Console connected to Google or GitHub, clicking on these buttons on the sign-up page will not create a new account but will directly sign you in. 
## Sign-in[​](#sign-in "Direct link to Sign-in") To sign in to your account, please go to the [sign-in page](https://console.apify.com/sign-in). There, you can use any of the authentication methods you have set up for your account. If you have multiple authentication methods, you can choose which one you want to use by using the corresponding button or form. If you sign in through email and password and have two-factor authentication enabled on your account, you will proceed to the two-factor authentication page after you enter your password. There, you will need to enter the code from your authenticator app to sign in. If you do not have two-factor authentication enabled, you will be directly redirected to Apify Console. ![Apify Console sign-in form](/assets/images/console-sign-in-form-9fac8fee2e4165187c32e2d03adabb8a.png) ### Forgotten password[​](#forgotten-password "Direct link to Forgotten password") In case you forgot your password, you can click on the **Forgot your password?** link, which will redirect you to the [forgot password page](https://console.apify.com/forgot-password). There, you will need to enter your email address and click on the **Reset password** button. We will then send an email to the address connected to your account with a link to the password reset page, which will allow you to change your password. ![Apify Console forgotten password page](/assets/images/console-forgotten-password-page-80595b60da993709cdae8a77443fb2fb.png) ## Adding different authentication methods[​](#adding-different-authentication-methods "Direct link to Adding different authentication methods") After you create your account, you might still want to use the other authentication methods. To do that, go to your [account settings](https://console.apify.com/account/settings) and scroll down to the **Login** section. There, you will see all available authentication methods and their configuration. ![Apify Console sign-in methods section on account page](/assets/images/console-sign-in-methods-section-52ce5a19f1238ec7a063b5c053f36493.png) ## Resetting your password[​](#resetting-your-password "Direct link to Resetting your password") This section also allows you to reset your password if you ever forget it. To do that, click the **Send email to reset password** button. We will then send an email to the address connected to your account with a link to the password reset page. After you click on the link (or copy it to your browser), you will proceed to a page where you can set up a new password. ![Apify Console reset password page](/assets/images/console-reset-password-page-579e9825d19418a1a3ed84ce15358033.png) ## Homepage overview[​](#homepage-overview "Direct link to Homepage overview") ![apify-console-homepage](/assets/images/apify-console-homepage-08db339d7d06767feced10bbdcd33ad0.png) The Apify Console homepage provides an overview of your account setup. The header displays your account name and current plan level. The homepage features several sections: * **Recently Viewed**: This section displays Actors you have recently accessed. If you haven't used any Actors yet, you will see suggestions instead. * **Suggested Actors for You**: Based on your and other users' recent activities, this section recommends Actors that might interest you. * **Actor Runs**: This section is divided into two tabs: * **Recent**: View your latest Actor runs. * **Scheduled**: Check your upcoming scheduled runs and tasks. Use the side menu to navigate other parts of Apify Console easily. 
#### Keyboard shortcuts[​](#keyboard-shortcuts "Direct link to Keyboard shortcuts") You can also navigate Apify Console via keyboard shortcuts. Keyboard Shortcuts | Action | Shortcut | | -------------- | ------ | | Show shortcuts | Shift? | | Home | GH | | Store | GO | | Actors | GA | | Development | GD | | Saved tasks | GT | | Runs | GR | | Integrations | GI | | Schedules | GU | | Storage | GE | | Proxy | GP | | Settings | GS | | Billing | GB | | Tab name | Description | | ------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [Apify Store](https://docs.apify.com/platform/console/store.md) | Search for Actors that suit your web-scraping needs. | | [Actors](https://docs.apify.com/platform/actors.md) | View recent & bookmarked Actors. | | [Runs](https://docs.apify.com/platform/actors/running/runs-and-builds.md) | View your recent runs. | | [Saved tasks](https://docs.apify.com/platform/actors/running/tasks.md) | View your saved tasks. | | [Schedules](https://docs.apify.com/platform/schedules.md) | Schedule Actor runs & tasks to run at a specified time. | | [Integrations](https://docs.apify.com/platform/integrations.md) | View your integrations. | | [Development](https://docs.apify.com/platform/actors/development.md) | • My Actors - See Actors developed by you. • Insights - See analytics for your Actors. • Messaging - Check on issues reported in your Actors or send emails to users of your Actors. | | [Proxy](https://docs.apify.com/platform/proxy.md) | View your proxy usage & credentials. | | [Storage](https://docs.apify.com/platform/storage.md) | View stored results of your runs in various data formats. | | [Billing](https://docs.apify.com/platform/console/billing.md) | Billing information, statistics and invoices. | | [Settings](https://docs.apify.com/platform/console/settings.md) | Settings of your account. | --- # Billing **The Billing page is the central place for all information regarding your invoices, usage in the current billing cycle, historical usage, subscriptions & limits.** *** ## Current period[​](#current-period "Direct link to Current period") The **Current period** tab is a comprehensive resource for understanding your platform usage during the ongoing billing cycle. It provides an overview of your total usage and indicates the start and end dates of the current billing period. In addition to this, the tab features a **Plan Consumption** graph. It shows how much of your free or paid plan has been utilized up to this point. The graph offers a broad perspective of your overall usage. Apart from the platform usage breakdown graph, this tab provides more detailed information on the usage of different platform services. You can explore the sections under the graph. You can access detailed statistics related to **Actors**, **Data transfer**, **Proxy**, and **Storage**. ![Apify Console current period view](/assets/images/console-billing-current-period-f7ddf37de35ec4984e7adc94d53253c4.png) ## Historical usage[​](#historical-usage "Direct link to Historical usage") The **Historical usage** tab provides a detailed view of your monthly platform usage, excluding any free Actor compute units or discounts from your subscription plan.
You can navigate through the months by selecting from a dropdown list or using arrows to move one month backward or forward. The tab features an adjustable bar chart. This chart can be customized to display statistics either on a monthly or daily basis. Additionally, you can view these statistics as absolute or cumulative numbers, providing flexibility in how you analyze your usage data. Monthly usage data Since billing cycles can shift, the data in the **Historical usage** tab is shown for calendar months. ![Apify Console historical usage view](/assets/images/console-billing-historical-usage-edb29c979c540b2855b0675151b35b07.png) Below the bar chart, there is a table titled **Usage by Actors**. This table presents a detailed breakdown of the compute units used per Actor and the associated costs. It clearly explains how each Actor contributes to your overall platform usage and expenses. ![Apify Console historical usage by Actor view](/assets/images/console-billing-historical-usage-by-actors-8b4651fe15d6fd802a67a7fb9eb8510e.png) ## Subscription[​](#subscription "Direct link to Subscription") The **Subscriptions** tab offers a central location to manage various aspects of your subscription plan. Here, you can access details about your current plan and explore options to upgrade to a different one. Unused credits Unused prepaid usage does not roll over; it expires at the end of the billing cycle. This tab also allows you to manage your payment methods. Additionally, you can adjust your billing details to ensure they remain up-to-date. Another feature of this tab is the visibility of any special offers currently applied to your account. Plan add-ons *This option is available only if you are on a subscription plan*. You can extend your subscription plan with add-ons, like extra proxies, Actor memory, and more. Navigate to the [Subscription](https://console.apify.com/billing/subscription) section in Apify Console, and click the **Buy add-ons** button to explore the available options. ![Apify Console subscription view](/assets/images/console-billing-subscription-19a02e019587c65ed35a361401aa53af.png) ## Pricing[​](#pricing "Direct link to Pricing") The **Pricing** tab offers a way to quickly check all unit pricing for various platform services related to Apify usage for your account. ![Apify Console pricing view](/assets/images/console-billing-pricing-12b7a203be3cb84dd3ea9a401f453553.png) ## Invoices[​](#invoices "Direct link to Invoices") The **Invoices** tab is where you can find your current and previous invoices for Apify platform usage. This is your source of truth for any questions regarding previous charges and subscriptions. ![Apify Console invoices view](/assets/images/console-billing-invoices-9e5a3c2fb15ec83fd1bdbc5e0ac00100.png) ## Limits[​](#limits "Direct link to Limits") The **Limits** tab displays the usage limits for the Apify platform based on your current subscription plan. As you approach the defined limits, you will receive a notification about potential service suspension or overage charges. If your usage exceeds the specified limits, Apify platform services will be suspended to prevent incurring charges beyond your subscription plan. You can adjust the platform usage limits and enable overage, allowing you to continue using the platform beyond your subscription plan on a *pay-as-you-go* basis. Any usage beyond your plan will be charged as overage to your account.
Immediate overage invoicing If your overage platform usage reaches *$200* before your next billing cycle, you will be invoiced for the overage charges immediately. Otherwise, the overage charges will be added to your next subscription invoice. ![Apify Console limits view](/assets/images/console-billing-limits-704e7245ae3d1878b814415aa29b89f7.png) --- # Account settings **Learn how to manage your Apify account, configure integrations, create and manage organizations, and set notification preferences in the Settings tab.** *** ## Account[​](#account "Direct link to Account") By clicking the **Settings** tab on the side menu, you will be presented with an Account page where you can view & edit various settings regarding your account, such as: * account email * username * profile information * theme * login information * session information * account deletion ### Session Information[​](#session-information "Direct link to Session Information") In the **Session Information** section, you can adjust the session configuration. You can modify the default session lifespan of 90 days; this customization helps ensure compliance with organization security policies. ## Integrations[​](#integrations "Direct link to Integrations") The **Integrations** tab provides essential tools for enhancing your interaction with our platform. Here, you can access your **Personal API Tokens**, which are necessary for using our [REST API](https://docs.apify.com/api/v2). This page also facilitates the integration of your Slack workspace and lists your **Actor Integration Accounts**. This section represents any third-party integrations added by you or your team. For detailed guidance on utilizing these integrations, refer to our [Integrations documentation](https://docs.apify.com/platform/integrations). ## Organization[​](#organization "Direct link to Organization") The **Organizations** tab is where you can view your account's current organizations, create new organizations, or convert your user account into an organization account. For more information on how to set up an organization, check out this [article](https://help.apify.com/en/articles/8698948-how-to-set-up-an-organization-account). ## Notifications[​](#notifications "Direct link to Notifications") The **Notifications** tab allows you to customize your notification preferences. Here, you can specify the types of updates you wish to receive and select the methods by which you receive them. --- # Apify Store **Explore Apify Store, browse and select Actors, search by criteria, sort by relevance, and adjust settings for immediate or future runs.** *** ![apify-console-store](/assets/images/console-store-5eae82ff847e58ffe5fe90cc74f7f4db.png) Apify Store is a place where you can explore a variety of Actors, both created and maintained by Apify or our community members. Use the search box at the top of the page to find Actors by service names, such as TikTok, Google, Facebook, or by their authors. Alternatively, you can explore Actors grouped under predefined categories below the search box. You can also organize the results from the store by different criteria, including: * Category * Pricing model * Developers * Relevance Once you select an Actor from the store, you'll be directed to its specific page. Here, you can configure the settings for your future Actor run, save these configurations for later use, or run the Actor immediately.
For more information on Actors in Apify Store, visit our [Apify Store documentation](https://docs.apify.com/platform/actors/running/actors-in-store.md). --- # Two-factor authentication setup **Learn about Apify Console's account two-factor authentication process and how to set it up.** *** If you use your email and password to sign in to Apify Console, you can enable two-factor authentication for your account. This will add an extra layer of security to your account and prevent anyone who gains access to your password from signing in to your account. Some organizations might require two-factor authentication (2FA) to access their resources. Members of such an organization must enable 2FA on their account in order to continue accessing shared resources and maintain compliance with their security policies. ## Setting up two-factor authentication[​](#setting-up-two-factor-authentication "Direct link to Setting up two-factor authentication") To set up two-factor authentication, go to your [account settings](https://console.apify.com/account/settings) and scroll down to the **Two-factor authentication** section. Currently, there is only one option, which is the **Authenticator app**. If you have two-factor authentication already enabled, there will be a label **enabled** next to it. ![Apify Console two-factor authentication section](/assets/images/console-account-two-factor-disabled-a0eeb17bbf6e57c9770e887a4b2afe4d.png) If it's not enabled, click on the **Enable** button. You should see the two-factor setup view. ### Authenticator app setup[​](#authenticator-app-setup "Direct link to Authenticator app setup") ![Apify Console setup two-factor authentication - app](/assets/images/console-two-factor-app-setup-4ed8557cadb66d1a57336ff84224ae04.png) In this view, you can use your favorite authenticator app to scan the QR code. We recommend using Google Authenticator ([Google Play Store](https://play.google.com/store/apps/details?id=com.google.android.apps.authenticator2\&hl=en_US)/[Apple App Store](https://apps.apple.com/us/app/google-authenticator/id388497605)) or [Authy](https://authy.com/) ([Google Play Store](https://play.google.com/store/apps/details?id=com.authy.authy)/[Apple App Store](https://apps.apple.com/us/app/twilio-authy/id494168017)), but any other authenticator app should work as well. You can also set up your app/browser extension manually without the QR code. To do that, click on the **Setup key** link below the QR code. This view with the key will pop up: ![Apify Console setup two-factor authentication - key](/assets/images/console-setup-two-factor-auth-key-6485a999dce1adfdc10aaeb90f53c711.png) In this pop-up window, you can copy the two-factor `secret` key, which you can use in your authenticator app to set up the account. You can also use this key to set up your authenticator app on multiple devices. After you scan the QR code or set up your app manually, the app will generate a code that you need to enter into the **Verify the code from the app** field. After you enter the code, click on the **Continue** button to get to the next step of the setup process. ### Recovery settings[​](#recovery-settings "Direct link to Recovery settings") ![Apify Console setup two-factor authentication - recovery codes](/assets/images/console-two-factor-recovery-setup-2f346602352bfde2622eebf5f33df3dd.png) In this step, you will see 16 recovery codes. If you ever lose access to your authenticator app, you will be able to use these codes to access the Apify Console.
We recommend saving these codes in a safe place; ideally, you should store them in a secure password manager or print them out and keep them separate from your device. Under the recovery codes, you will find two fields for your recovery information. These two fields are what the support team will ask you to provide in case you lose access to your authenticator app and also to your recovery codes. We will never use the phone number for anything other than verifying your identity and helping you regain access to your account, and only as a last resort. Ideally, the personal information you provide will be enough to verify your identity. Always fill in both the kind of personal information you are providing and the information itself. Personal information What kind of personal information you provide is completely up to you. It does not even have to be personal, as long as it's secure and easy to remember. For example, it can be the name of your pet, the name of your favorite book, some secret code, or anything else. Keep in mind who has access to that information. While you can use the name of your pet, if you share information about your pet on public social media, it's not a good choice because anyone on the internet can access it. The same goes for any other information you provide. You will not be able to enable the two-factor authentication until you click on the **Download** / **Copy** buttons or copy the codes manually. After you do that, the **Continue** button will light up, and you can click on it to enable the two-factor authentication. The setup process will then enable two-factor authentication for your account and show a confirmation. ### Confirmation[​](#confirmation "Direct link to Confirmation") When you close the setup process, you should see that your two-factor authentication is enabled in the account settings. ![Apify Console two-factor authentication section enabled](/assets/images/console-account-two-factor-enabled-a6dd933650f9d494c188507baa40578a.png) ## Verification after sign-in[​](#verification-after-sign-in "Direct link to Verification after sign-in") After you enable two-factor authentication, the next time you attempt to sign in, you'll need to enter a code before you can get into the Apify Console. To do that, open your authenticator app and enter the code for your Apify account into the **Code** field. After you enter the code, click on the **Verify** button, and if the provided code is correct, you will proceed to Apify Console. ![Apify Console two-factor authentication form](/assets/images/console-two-factor-authentication-e896de75a52f44f0dbb61a55f7b651c7.png) ## Using recovery codes[​](#using-recovery-codes "Direct link to Using recovery codes") In case you lose access to your authenticator app, you can use the recovery codes to sign in to your account. To do that, click on the **recovery code or begin 2FA account recovery** link below the **Verify** button. This will redirect you to a view similar to the current one, but instead of the code from the authenticator app, you will need to enter one of the 16 recovery codes you received during the setup process. If the provided recovery code is correct, you will proceed to Apify Console, the same as if you provided the code from the authenticator app. After gaining access to Apify Console, we recommend going to the [account settings](https://console.apify.com/account/settings), disabling the two-factor authentication there, and then enabling it again with the new authenticator app.
Removal of recovery codes When you successfully use a recovery code, we remove the code from the original list as it's no longer possible to use it again. If you use all of your recovery codes, you will not be able to sign in to your account with them anymore, and you will need to either use your authenticator app or contact our support to help you regain access to your account. ![Apify Console two-factor authentication with recovery code form](/assets/images/console-two-factor-use-recovery-code-4a533fb19d82d4e735bea54c9ab510ff.png) ## Disabling two-factor authentication[​](#disabling-two-factor-authentication "Direct link to Disabling two-factor authentication") If you no longer want to use the two-factor authentication or lose access to your authenticator app, you can disable the two-factor authentication in your [account settings](https://console.apify.com/account/settings). To do that, scroll down to the **Two-factor authentication** section and click on the **Disable** button. We will ask you to enter either your verification code from the authenticator app or, if you do not have access to it anymore, you can use one of your recovery codes. After entering the code, click on the **Remove app** button to verify the provided code. If it's valid, it will disable the two-factor authentication and remove the configuration from your account. After you disable the two-factor authentication you will be able to sign in to your account without providing the verification code. ![Apify Console disable two-factor authentication modal](/assets/images/console-two-factor-auth-disable-aa84be25b1d5b2df1184eb07ce9d3993.png) ## What to do when you get locked out[​](#what-to-do-when-you-get-locked-out "Direct link to What to do when you get locked out") If you lose access to your authenticator app and do not have any recovery codes left, or you lost them as well, you will not be able to sign in to your account. In this case, you will need to contact our support. To do that, you can either send us an email to [support@apify.com](mailto:support@apify.com?subject='Locked%20out%20of%20account%20with%202FA%20enabled') or you can go to the [sign-in page](https://console.apify.com/sign-in) and sign in with your email and password. Then, on the two-factor authentication page, click on the **recovery code or begin 2FA account recovery** link. On the two-factor recovery page, click on the **Contact our support** link. This link will open up our online chat, and our support team can help you from there. For our support team to help you recover your account, you will need to provide them with the personal information you have configured during the two-factor authentication setup. If you provide the correct information, the support team will help you regain access to your account. caution The support team will not give you any clues about the information you provided; they will only verify if it is correct. You can always check what information you provided by going to your [account settings](https://console.apify.com/account/settings), scrolling down to the **Two-factor authentication** section, and clicking on the **Recovery settings** button, then you should see a view like this: ![Apify Console recovery settings modal request](/assets/images/console-two-factor-recovery-settings-request-45b9ae8f90544b0c93650a3627ae813f.png) After you enter a verification code from your authenticator app, you will see the recovery settings you provided during the two-factor authentication setup. 
![Apify Console recovery settings modal response](/assets/images/console-two-factor-recovery-settings-revealed-8cb197f36d728b676b7ae42a243b3a74.png) --- # Integrations **Learn how to integrate the Apify platform with other services, your systems, data pipelines, and other web automation workflows.** *** > The whole is greater than the sum of its parts. > > 👴 *Aristotle* Integrations allow you to combine separate applications and take advantage of their combined capabilities. Automation of these online processes increases your productivity. That's why we made Apify in a way that allows you to connect it with practically any cloud service or web app and make it part of your larger projects. If you are building a service and your users could benefit from integrating with Apify, or vice versa, read the [integrate with Apify](https://docs.apify.com/platform/integrations/integrate.md) guide. ## Built-in integrations[​](#built-in-integrations "Direct link to Built-in integrations") Apify's RESTful API allows you to interact with the platform programmatically. HTTP webhooks notify you and your services when important events happen. By using the API, you can start Actors, retrieve their results, or basically do anything you can do in the platform's UI. #### [API](https://docs.apify.com/platform/integrations/api.md) [Control the Apify platform programmatically from your code.](https://docs.apify.com/platform/integrations/api.md) #### [Actors and tasks](https://docs.apify.com/platform/integrations/actors.md) [Trigger other Actors or tasks when your Actor run fails or succeeds.](https://docs.apify.com/platform/integrations/actors.md) #### [Webhooks](https://docs.apify.com/platform/integrations/webhooks.md) [Trigger external events when your Actor run fails or succeeds.](https://docs.apify.com/platform/integrations/webhooks.md) Apify offers easy-to-set-up solutions for common scenarios, like uploading your datasets to Google Drive when the run succeeds or creating an issue on GitHub when it fails. [![](/img/platform/integrations/slack.svg)](https://docs.apify.com/platform/integrations/slack.md) #### [Slack](https://docs.apify.com/platform/integrations/slack.md) [![](/img/platform/integrations/drive.svg)](https://docs.apify.com/platform/integrations/drive.md) #### [Google Drive](https://docs.apify.com/platform/integrations/drive.md) [![](/img/platform/integrations/gmail.svg)](https://docs.apify.com/platform/integrations/gmail.md) #### [Gmail](https://docs.apify.com/platform/integrations/gmail.md) [![](/img/platform/integrations/github-white.svg)](https://docs.apify.com/platform/integrations/github.md) #### [GitHub](https://docs.apify.com/platform/integrations/github.md) [![](/img/platform/integrations/airtable.svg)](https://docs.apify.com/platform/integrations/airtable.md) #### [Airtable](https://docs.apify.com/platform/integrations/airtable.md) ## Integration platforms[​](#integration-platforms "Direct link to Integration platforms") If you use one of the main integration platforms, Apify's support is here for you. The main advantage of these platforms is that you can integrate Apify into very complex workflows with the choice of thousands of supported services. 
[![](/img/platform/integrations/make.svg)](https://docs.apify.com/platform/integrations/make.md) #### [Make](https://docs.apify.com/platform/integrations/make.md) [![](/img/platform/integrations/zapier.svg)](https://docs.apify.com/platform/integrations/zapier.md) #### [Zapier](https://docs.apify.com/platform/integrations/zapier.md) [![](/img/platform/integrations/telegram.svg)](https://docs.apify.com/platform/integrations/telegram.md) #### [Telegram (via Zapier)](https://docs.apify.com/platform/integrations/telegram.md) [![](/img/platform/integrations/n8n.svg)](https://docs.apify.com/platform/integrations/n8n.md) #### [n8n](https://docs.apify.com/platform/integrations/n8n.md) [![](/img/platform/integrations/ifttt.svg)](https://docs.apify.com/platform/integrations/ifttt.md) #### [IFTTT](https://docs.apify.com/platform/integrations/ifttt.md) ## Data pipelines, ETLs, and AI/LLM tools[​](#data-pipelines-etls-and-aillm-tools "Direct link to Data pipelines, ETLs, and AI/LLM tools") The Apify platform integrates with popular ETL and data pipeline services, enabling you to integrate Apify Actors directly into your data integration processes. [![](/img/platform/integrations/keboola.svg)](https://docs.apify.com/platform/integrations/keboola.md) #### [Keboola](https://docs.apify.com/platform/integrations/keboola.md) [![](/img/platform/integrations/airbyte.svg)](https://docs.airbyte.com/integrations/sources/apify-dataset) #### [Airbyte](https://docs.airbyte.com/integrations/sources/apify-dataset) If you are working on AI/LLM-related applications, we recommend looking into the many integrations with popular AI/LLM ecosystems. These integrations allow you to use Apify Actors as tools and data sources. [![](/img/platform/integrations/crewai.png)](https://docs.apify.com/platform/integrations/crewai.md) #### [CrewAI](https://docs.apify.com/platform/integrations/crewai.md) [![](/img/platform/integrations/langgraph.png)](https://docs.apify.com/platform/integrations/langgraph.md) #### [LangGraph](https://docs.apify.com/platform/integrations/langgraph.md) [![](/img/platform/integrations/mastra.png)](https://docs.apify.com/platform/integrations/mastra.md) #### [Mastra](https://docs.apify.com/platform/integrations/mastra.md) [![](/img/platform/integrations/lindy.png)](https://docs.apify.com/platform/integrations/lindy.md) #### [Lindy](https://docs.apify.com/platform/integrations/lindy.md) [![](/img/platform/integrations/langflow.png)](https://docs.apify.com/platform/integrations/langflow.md) #### [Langflow](https://docs.apify.com/platform/integrations/langflow.md) [![](/img/platform/integrations/flowise.png)](https://docs.apify.com/platform/integrations/flowise.md) #### [Flowise](https://docs.apify.com/platform/integrations/flowise.md) [![](/img/platform/integrations/langchain.png)](https://docs.apify.com/platform/integrations/langchain.md) #### [LangChain](https://docs.apify.com/platform/integrations/langchain.md) [![](/img/platform/integrations/llamaindex.jpeg)](https://docs.apify.com/platform/integrations/llama.md) #### [LlamaIndex](https://docs.apify.com/platform/integrations/llama.md) [![](/img/platform/integrations/haystack.png)](https://docs.apify.com/platform/integrations/haystack.md) #### [Haystack](https://docs.apify.com/platform/integrations/haystack.md) [![](/img/platform/integrations/pinecone.svg)](https://docs.apify.com/platform/integrations/pinecone.md) #### [Pinecone](https://docs.apify.com/platform/integrations/pinecone.md) 
[![](/img/platform/integrations/qdrant.svg)](https://docs.apify.com/platform/integrations/qdrant.md) #### [Qdrant](https://docs.apify.com/platform/integrations/qdrant.md) [![](/img/platform/integrations/milvus.svg)](https://docs.apify.com/platform/integrations/milvus.md) #### [Milvus](https://docs.apify.com/platform/integrations/milvus.md) [![](/img/platform/integrations/mcp-logo-white.png)](https://docs.apify.com/platform/integrations/mcp.md) #### [MCP](https://docs.apify.com/platform/integrations/mcp.md) [![](/img/platform/integrations/aws-bedrock.png)](https://docs.apify.com/platform/integrations/aws_bedrock.md) #### [AWS Bedrock](https://docs.apify.com/platform/integrations/aws_bedrock.md) [![](/img/platform/integrations/openai-white.svg)](https://docs.apify.com/platform/integrations/openai-assistants.md) #### [OpenAI Assistants](https://docs.apify.com/platform/integrations/openai-assistants.md) ## Other Actors[​](#other-actors "Direct link to Other Actors") Explore [Apify Store](https://apify.com/store) for Actors that may help you with integrations, for example, [MongoDB Import](https://apify.com/drobnikj/mongodb-import) or [MySQL Insert](https://apify.com/petr_cermak/mysql-insert). ![Apify Store](/assets/images/apify-store-1d47e1201c104ab3c2bb25aeab35b714.webp) --- # What are Actor integrations? **Learn how to integrate with other Actors and tasks.** *** Integration Actors You can check out a catalogue of our Integration Actors within [Apify Store](https://apify.com/store/categories/integrations). Actor integrations provide a way to connect your Actors with other Actors or tasks easily. They provide a new level of flexibility, as adding a new integration simply means creating [integration-ready Actors](https://docs.apify.com/platform/integrations/actors/integration-ready-actors.md). Thus, new integrations can be created by the community itself. [YouTube video player](https://www.youtube-nocookie.com/embed/zExnYbvFoBM) ## How to integrate an Actor with other Actors?[​](#how-to-integrate-an-actor-with-other-actors "Direct link to How to integrate an Actor with other Actors?") To integrate one Actor with another: 1. Navigate to the **Integrations** tab in the Actor's detail page. 2. Select `Apify (Connect Actor or Task)`. ![Add integration](/assets/images/integrations_add-5cb13f76b223e9b3054d860ed51d0b54.png) 3. Find the Actor or task you want to integrate with and click `Connect`. This leads you to a setup screen, where you can provide: * **Triggers**: Events that will trigger the integrated Actor. These are the same as webhook [event types](https://docs.apify.com/platform/integrations/webhooks/events.md) (*run succeeded*, *build failed*, etc.) ![Integration trigger select](/assets/images/integration_triggers-45a318a9797429975b5a4f9803affc4a.png) * **Input for the integrated Actor**: Typically, the input has two parts. The information that is independent of the run triggering it and information that is specific for that run. The "independent" information (e.g. connection string to database or table name) can be added to the input as is. The information specific to the run (e.g. dataset ID) is either obtained from the implicit `payload` field (this is the case for most Actors that are integration-ready), or they can be provided using variables. * **Available variables** are the same ones as in webhooks. 
The one that you probably are going to need the most is `{{resource}}`, which is the Run object in the same shape you get from the [API](https://docs.apify.com/api/v2/actor-run-get.md) (for build event types, it will be the Build object). The variables can make use of dot notation, so you will most likely just need `{{resource.defaultDatasetId}}` or `{{resource.defaultKeyValueStoreId}}`. ## Testing your integration[​](#testing-your-integration "Direct link to Testing your integration") When adding a new integration, you can test it using a past run or build as a trigger. This will trigger a run of your target Actor or task as if your desired trigger event just occurred. The only difference between a test run and a regular run is that the trigger's event type will be set to 'TEST'. The test run will still consume compute units. To test your integration, first set your desired input and options and save. You can then select one of the options from the menu. If the source of your integration is a task, you can test it using a past run. For Actors, you can use a past run or build. Alternatively, if the source of your integration has neither, you can test your integration with a random joke in the webhook's payload. ![Test integration options](/assets/images/integrations_test_options-03965bdc9b1468f3981dbdbd7c2c53b2.png) When testing with a custom run or build, you will need to enter its ID. You can find it on the run's or build's detail page. Ensure that the run or build belongs to the **source** Actor, since that is where the trigger will be coming from. ## Implementation details[​](#implementation-details "Direct link to Implementation details") Under the hood, the Actor integrations use regular [HTTP POST webhooks](https://www.redhat.com/en/topics/automation/what-is-a-webhook) and target the Apify API, for which this feature provides a friendlier UI. The UI allows you to fill the payload template using the Actor input UI rather than plain text and constructs the URL to start your Actor with the given options. The UI ensures that the variables are enclosed in strings, meaning that even the payload template is valid JSON, not just the resulting interpolation. It also automatically adds the `payload` field that contains the default webhook payload. Thanks to this, when using Actors that are meant to be used as integrations, users don't have to fill in the variables: the Actor takes the data from this field by itself. ## Blog tutorial[​](#blog-tutorial "Direct link to Blog tutorial") You can read a complete example of integrating two Actors in [this tutorial](https://blog.apify.com/connecting-scrapers-apify-integration/). --- # Integrating Actors via API **Learn how to integrate with other Actors and tasks using the Apify API.** *** You can integrate Actors via API using the [Create webhook](https://docs.apify.com/api/v2/webhooks-post.md) endpoint. It's the same as any other webhook, but for it to appear as an Actor integration in Apify Console, you need to ensure a few things. * The `requestUrl` field needs to point to the **Run Actor** or **Run task** endpoints and needs to use their IDs as identifiers (i.e. not their technical names). * The `payloadTemplate` field should be valid JSON - i.e. it should only use variables enclosed in strings. You will also need to make sure that it contains a `payload` field. * The `shouldInterpolateStrings` field needs to be set to `true`, otherwise the variables won't work. * Add the `isApifyIntegration` field with the value `true`. 
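To illustrate the above, here is a minimal sketch of creating such a webhook by calling the Create webhook endpoint directly from Python over plain HTTP. The Actor IDs and token are hypothetical placeholders, and any HTTP client or one of the official API clients could be used in the same way:

```
# Minimal sketch: create an Actor-integration webhook via the Apify API.
# SOURCE_ACTOR_ID, TARGET_ACTOR_ID, and the token are placeholders.
import json
import requests

APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"
SOURCE_ACTOR_ID = "SOURCE_ACTOR_ID"  # Actor whose events trigger the webhook
TARGET_ACTOR_ID = "TARGET_ACTOR_ID"  # Actor that the webhook should run

webhook = {
    # Run Actor endpoint of the target Actor, identified by its ID
    "requestUrl": f"https://api.apify.com/v2/acts/{TARGET_ACTOR_ID}/runs",
    "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
    "condition": {"actorId": SOURCE_ACTOR_ID},
    "shouldInterpolateStrings": True,
    "isApifyIntegration": True,
    # The payload template is a string containing valid JSON, with variables
    # enclosed in strings and an explicit "payload" field.
    "payloadTemplate": json.dumps({"field": "value", "payload": {"resource": "{{resource}}"}}),
}

response = requests.post(
    "https://api.apify.com/v2/webhooks",
    headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
    json=webhook,
)
response.raise_for_status()
print("Created webhook:", response.json()["data"]["id"])
```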
The `isApifyIntegration` field is a helper that turns on the Actor integration UI when the above conditions are met. Not meeting the conditions does not mean that the webhook won't work; it will just be displayed as a regular HTTP webhook in Apify Console. The webhook should look something like this: ``` { "requestUrl": "https://api.apify.com/v2/acts//runs", "eventTypes": ["ACTOR.RUN.SUCCEEDED"], "condition": { "actorId": "" }, "shouldInterpolateStrings": true, "isApifyIntegration": true, "payloadTemplate": "{\"field\":\"value\",\"payload\":{\"resource\":\"{{resource}}\"}}" } ``` It's usually enough to just include the `resource` field in the payload template, but some Actors might also need other fields. Keep in mind that the `payloadTemplate` is a string, not an object. --- # Integration-ready Actors **Learn how to create Actors that are ready to be integrated with other Actors and tasks.** *** Any Actor can be used in integrations. In order to provide a smooth experience for its users, there are a few things to keep in mind. ## General guidelines[​](#general-guidelines "Direct link to General guidelines") If your Actor is supposed to be used as an integration, it will most likely have an input that can be described as two groups of fields. The first group is the "static" part of the input - the fields that have the same value whenever the integration is triggered. The second, "dynamic", group consists of fields that are specific to the triggering event - information from the run or build that triggered the integration. The Actor should ideally try to hide its complexity from users and take all the "dynamic" fields from the implicit `payload` field - it is attached automatically. This way, users don't have to take care of passing in variables on their own and only need to take care of the static part of the input. An important thing to remember is that only the **dataset ID** is passed to the Actor as input, not the **dataset contents**. This means that the Actor needs to take care of getting the actual contents of the dataset. And, ideally, it should not load the full dataset while doing so, as it might be too large to fit into memory, but rather process it in batches. ## Example[​](#example "Direct link to Example") To illustrate the above, here is a simplified example of an Actor that uploads a dataset to a table/collection in some database. We would start with an input that looks something like this: * `datasetId: string` - ID of the dataset that should be uploaded * `connectionString: string` - Credentials for the database connection * `tableName: string` - Name of the table / collection With this input schema, users have to provide an input that looks like this: ``` { "datasetId": "{{resource.defaultDatasetId}}", "connectionString": "****", "tableName": "results" } ``` And in the Actor code, we'd use this to get the values: ``` const { datasetId, connectionString, tableName } = await Actor.getInput(); ``` To make the integration process smoother, it's possible to define an input that's going to be prefilled when your Actor is being used as an integration. You can do that in the Actor's **Settings** tab, on the **Integrations** form. In our example, we'd use: ``` { "datasetId": "{{resource.defaultDatasetId}}" } ``` This means that users will see that the `defaultDatasetId` of the triggering run is going to be used right away. Explicitly stating the expected input when the Actor is used as an integration is the preferred approach. 
However, if the Actor is **only** supposed to be used as an integration, we can use a different input schema: * `connectionString: string` - Credentials for the database connection * `tableName: string` - Name of the table / collection In this case, users only need to provide the "static" part of the input: ``` { "connectionString": "****", "tableName": "results" } ``` In the Actor's code, the `datasetId` (the dynamic part) would be obtained from the `payload` field: ``` const { payload, connectionString, tableName } = await Actor.getInput(); const datasetId = payload.resource.defaultDatasetId; ``` It's also possible to combine both approaches, which is useful for development purposes or advanced usage. It would mean keeping the `datasetId` in the input, only hidden under an "Advanced options" section, and using it like this: ``` const { payload, datasetId } = await Actor.getInput(); const datasetIdToProcess = datasetId || payload?.resource?.defaultDatasetId; ``` In the above example, we're focusing on accessing a run's default dataset, but the approach would be similar for any other field. ## Making your Actor available to other users[​](#making-your-actor-available-to-other-users "Direct link to Making your Actor available to other users") To allow other users to use your Actor as an integration, all you need to do is [publish it in Apify Store](https://docs.apify.com/platform/actors/publishing.md), so users can then integrate it using the **Connect Actor or task** button on the **Integrations** tab of any Actor. While publishing the Actor is enough, there are two ways to make it more visible to users. For Actors that are generic enough to be used with most other Actors, it's possible to have them listed under **Generic integrations** in the **Integrations** tab. This includes (but is not limited to) Actors that upload datasets to databases, send notifications through various messaging systems, create issues in ticketing systems, etc. To have your Actor listed under our generic integrations, [contact support](mailto:support@apify.com?subject=Actor%20generic%20integration). Some Actors can only be integrated with a few or even just one other Actor. Let's say that you have an Actor that's capable of scraping profiles from a social network. It makes sense to show it for Actors that produce usernames from the social network but not for Actors that produce lists of products. In this case, it's possible to have the Actor listed as **Specific to this Actor** under the Actor's **Integrations** tab. To have your Actor listed as specific to another Actor, [contact support](mailto:support@apify.com?subject=Actor%20specific%20integration). ![Specific vs generic integrations](/assets/images/specific_vs_generic_integrations-2dc8cec0eef8c497fba1b662692138dd.png) --- # Agno Integration **Integrate Apify with Agno to power AI agents with web scraping, automation, and data insights.** *** ## What is Agno?[​](#what-is-agno "Direct link to What is Agno?") [Agno](https://docs.agno.com/) is an open-source framework for building intelligent AI agents. It provides a flexible architecture to create agents with custom tools, enabling seamless integration with external services like Apify for tasks such as web scraping, data extraction, and automation. Agno documentation Check out the [Agno documentation](https://docs.agno.com/introduction) for more details on building AI agents. 
## How to use Apify with Agno[​](#how-to-use-apify-with-agno "Direct link to How to use Apify with Agno") This guide shows how to integrate Apify Actors with Agno to empower your AI agents with real-time web data. We'll use the [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor to fetch web content and the [Google Places Crawler](https://apify.com/compass/crawler-google-places) Actor to extract location-based data. It is very easy to use with any other Actor by just passing the name of the Actor. See and choose from thousands of Actors in the [Apify Store](https://apify.com/store). ### Prerequisites[​](#prerequisites "Direct link to Prerequisites") * *Apify API token*: Obtain your API token from the [Apify console](https://console.apify.com/account/integrations). * *OpenAI API key*: Get your API key from the [OpenAI platform](https://platform.openai.com/account/api-keys). Alternative LLM providers While our examples use OpenAI, Agno supports other LLM providers as well. You'll need to adjust the environment variables and configuration according to your chosen provider. Check out the [Agno models documentation](https://docs.agno.com/models/introduction) for details on supported providers and configuration. * *Python environment*: Ensure Python is installed (version 3.8+ recommended). * *Required packages*: Install the following dependencies in your terminal: ``` pip install agno apify-client ``` ## Basic integration example[​](#basic-integration-example "Direct link to Basic integration example") Start by setting up an Agno agent with Apify tools. This example uses the RAG Web Browser Actor to extract content from a specific URL. ``` import os from agno.agent import Agent from agno.tools.apify import ApifyTools os.environ["APIFY_API_TOKEN"] = "YOUR_APIFY_API_TOKEN" # Replace with your Apify API token os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY" # Replace with your OpenAI API key # Initialize the agent with Apify tools agent = Agent( tools=[ApifyTools( actors=["apify/rag-web-browser"])], show_tool_calls=True, markdown=True ) # Fetch and display web content agent.print_response("Extract key details from https://docs.agno.com/introduction", markdown=True) ``` Running this code will scrape the specified URL and return formatted content your agent can use. ### Advanced scenario: Travel planning agent[​](#advanced-scenario-travel-planning-agent "Direct link to Advanced scenario: Travel planning agent") Combine multiple Apify Actors to create a powerful travel planning agent. This example uses the RAG Web Browser and Google Places Crawler to gather travel insights and local business data. ``` import os from agno.agent import Agent from agno.tools.apify import ApifyTools os.environ["APIFY_API_TOKEN"] = "YOUR_APIFY_API_TOKEN" # Replace with your Apify API token os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY" # Replace with your OpenAI API key # Create a travel planning agent agent = Agent( name="Travel Planner", instructions=[ "You are a travel planning assistant. Use web data and location insights to provide detailed travel recommendations." ], tools=[ ApifyTools( actors=[ "apify/rag-web-browser", # For general web research "compass/crawler-google-places" # For location-based data ] ) ], show_tool_calls=True, markdown=True ) # Plan a trip to Tokyo agent.print_response( """ I'm traveling to Tokyo next month. 1. Research the best time to visit and top attractions. 2. Find a highly rated sushi restaurant near Shinjuku. Compile a travel guide with this information. 
""", markdown=True ) ``` This agent will fetch travel-related data and restaurant recommendations, providing a comprehensive travel guide: 1. Use the RAG Web Browser to research Tokyo travel details. 2. Use the Google Places Crawler to find a top sushi restaurant. 3. Combine the results into a comprehensive guide. Apify Store Browse the [Apify Store](https://apify.com/store) to find additional Actors for tasks like social media scraping, e-commerce data extraction, or news aggregation. ### Available Apify tools[​](#available-apify-tools "Direct link to Available Apify tools") Agno supports any Apify Actor via the ApifyTools class. You can specify a single Actor ID or a list of Actor IDs to register multiple tools for your agent at once. ## Configuration options[​](#configuration-options "Direct link to Configuration options") `apify_api_token` (string, default: `None`) : Apify API token (or set via APIFY\_API\_TOKEN environment variable) `actors` (string or List\[string], default: `None`) : Single Actor ID or list of Actor IDs to register ## Resources[​](#resources "Direct link to Resources") * [How to build an AI Agent](https://blog.apify.com/how-to-build-an-ai-agent/) * [Agno Framework Documentation](https://docs.agno.com) * [Apify Platform Documentation](https://docs.apify.com) * [Apify Actor Documentation](https://docs.apify.com/actors) * [Apify Store - Browse available Actors](https://apify.com/store) * [Agno Apify Toolkit Documentation](https://docs.agno.com/tools/toolkits/others/apify#apify) --- # Airbyte integration **Learn how to integrate your Apify datasets with Airbyte.** *** Airbyte is an open-source data integration platform that allows you to move your data between different sources and destinations using pre-built connectors, which are maintained either by Airbyte itself or by its community. One of these connectors is the Apify Dataset connector, which makes it simple to move data from Apify datasets to any supported destination. To use Airbyte's Apify connector you need to: * Have an Apify account. * Have an Airbyte account. ## Set up Apify connector in Airbyte[​](#set-up-apify-connector-in-airbyte "Direct link to Set up Apify connector in Airbyte") Once you have all the necessary accounts set up, you need to set up the Apify connector. To do so, you will need to navigate to **Sources** tab in Airbyte and select **Apify Dataset** ![Airbyte sources tab](/assets/images/airbyte-sources-7915e8b8c9b5959862c7c52c1505067f.png) You will need to provide a **dataset ID** and your Apify API Token. You can find both of these in [Apify Console](https://console.apify.com). ![Airbyte source setup](/assets/images/airbyte-source-setup-8c9f9311148dad47f6c80bdbfe9cf3f1.png) To find your **dataset ID**, you need to navigate to the **Storage** tab in Apify Console. Copy it and paste it in Airbyte. ![Datasets in app](/assets/images/datasets-app-2249b1a36efd9e35b15c68ae64f99ac7.png) To find your Apify API token, you need to navigate to the **Settings** tab and select **Integrations**. Copy it and paste it in the relevant field in Airbyte. ![Integrations token](/assets/images/apify-integrations-token-a480c4034e9658f9989b7c661ee0fad5.png) And that's it! You now have Apify datasets set up as a Source, and you can use Airbyte to transfer your datasets to one of the available destinations. 
To learn more about how to set up a connection, visit [Airbyte's documentation](https://docs.airbyte.com/using-airbyte/getting-started/set-up-a-connection). --- # Airtable integration **Learn how to integrate your Apify Actors with Airtable. This article shows you how to automatically upload results to your Airtable when an Actor run succeeds.** *** [Airtable](https://www.airtable.com/) is a cloud-based platform for organizing, managing, and collaborating on data. With the Apify integration for Airtable, you can automatically upload Actor run results to Airtable after a successful run. This integration uses OAuth 2.0, a secure authorization protocol, to connect your Airtable account to Apify and manage data transfers. ## Connect Apify with Airtable[​](#connect-apify-with-airtable "Direct link to Connect Apify with Airtable") To use the Apify integration for Airtable, ensure you have: * An [Apify account](https://console.apify.com/) * An [Airtable account](https://www.airtable.com/) ### Set up connection within Apify Console[​](#set-up-connection-within-apify-console "Direct link to Set up connection within Apify Console") 1. In Apify Console, go to the [API & Integrations tab](https://console.apify.com/settings/integrations) in the **Settings** section. 2. Under **Account-level integrations**, click **Add account**. ![Add account button in Account-level integrations section of the settings](/assets/images/connect-account-1-2de30ae482ac3cdab1d6ce383f05997a.png) 3. Select **Airtable** from the list of available services. ![Connect with Airtable button among other buttons for connection of other available services](/assets/images/connect-account-2-98ffadd65cee09452a20ad2c701dfad8.png) 4. Follow the OAuth 2.0 authorization flow to securely connect your Airtable account. 5. Grant Apify access to the workspaces and bases you want to use. ### Set up Airtable integration[​](#set-up-airtable-integration "Direct link to Set up Airtable integration") 1. [Choose an Actor or Task](https://console.apify.com/actors) to integrate with Airtable. 2. Go to the **Integrations** tab and click **Upload data to Airtable**. ![Airtable integration option among other available integrations](/assets/images/set-up-integration-1-954d977e0859413f5650f97719d57bf2.png) 3. Select the upload mode: * **CREATE**: A new table is created for each run of this integration. * **APPEND**: New records are added to the specified table. If the table does not yet exist, a new one is created. * **OVERWRITE**: All records in the specified table are replaced with new data. If the table does not yet exist, a new one is created. 4. Select a connected Airtable account and choose the base where the Actor run results will be uploaded. 5. Enter a table name or select an existing one. To ensure uniqueness when using CREATE mode, use dynamic variables. If a table with the same name already exists in CREATE mode, a random token will be appended. ![Airtable integration configuration form](/assets/images/set-up-integration-2-0f26b7d4299afb66dcde8ef55abfed0a.png) 6. Save the integration. Once your Actor runs, you'll see its results uploaded to Airtable. ![Airtable table filled with data](/assets/images/set-up-integration-3-2fd7f1377c3a55106b5412ab5c0b506a.png) --- # API integration **Learn how to integrate with Apify using the REST API.** *** All aspects of the Apify platform can be controlled via a REST API, which is described in detail in the [**API Reference**](https://docs.apify.com/api/v2.md). 
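For example, a minimal sketch of a raw API call from Python might look like this; it authenticates with your API token (see the following sections) via the `Authorization` header and fetches your own user object:

```
# Minimal sketch: call the Apify API over plain HTTP with an API token.
import requests

APIFY_TOKEN = "YOUR_APIFY_API_TOKEN"  # placeholder; see the API token section below

response = requests.get(
    "https://api.apify.com/v2/users/me",
    headers={"Authorization": f"Bearer {APIFY_TOKEN}"},
)
response.raise_for_status()
print(response.json()["data"]["username"])
```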
If you want to use the Apify API from JavaScript/Node.js or Python, we strongly recommend using one of our API clients: * [**apify-client**](https://docs.apify.com/api/client/js) `npm` package for JavaScript, supporting both browser and server * [**apify-client**](https://docs.apify.com/api/client/python) PyPI package for Python. You are not required to use these packages—the REST API works with any HTTP client—but the official API clients implement best practices such as exponential backoff and rate limiting. ## API token[​](#api-token "Direct link to API token") To access the Apify API in your integrations, you need to authenticate using your secret API token. You can find it on the [Integrations](https://console.apify.com/settings/integrations) page in Apify Console. Give your token a reasonable description, and never use one token for several services, much like you shouldn't use the same password for different accounts. ![Integrations page in Apify Console](/assets/images/api-token-c1ace583e58bda1c69d76a93297349f8.png) caution Do not share the API token with untrusted parties, or use it directly from client-side code, unless you fully understand the consequences! You can also consider [limiting the permission scope](#limited-permissions) of the token, so that it can only access what it really needs. ## Authentication[​](#authentication "Direct link to Authentication") You can authenticate with the Apify API in two ways: you can either pass the token via the `Authorization` HTTP header or via the URL `token` query parameter. We always recommend authenticating via the HTTP header, as this method is more secure. Note that some API endpoints, such as [Get list of keys](https://docs.apify.com/api/v2/key-value-store-keys-get.md), do not require an authentication token because they contain a hard-to-guess identifier that effectively serves as an authentication key. ## Expiration[​](#expiration "Direct link to Expiration") API tokens include security features to protect your account and data. You can set an expiration date for your API tokens, ensuring they become invalid after a specified period. This is particularly useful for temporary access or when working with third-party services. ![Creating token with expiration date in Apify Console](/assets/images/api-token-expiration-date-422b4f0433871dbeee59a23d06010f21.png) ## Rotation[​](#rotation "Direct link to Rotation") If you suspect that a token has been compromised or accidentally exposed, you can rotate it through the Apify Console. When rotating a token, you have the option to keep the old token active for 24 hours, allowing you to update your applications with the new token before the old one becomes invalid. After the rotation period, the token will be regenerated, and any applications connected to the old token will need to be updated with the new token to continue functioning. ![Rotate token in Apify Console](/assets/images/api-token-rotate-6c21e1e4fc3f5507882868822a71b6a4.png) For better security awareness, the UI marks tokens identified as compromised, making it easy to identify and take action on them. ![Leaked token in Apify Console](/assets/images/api-token-leaked-6ce9b9ea14aa76a1109fab863259da85.png) ## Organization accounts[​](#organization-accounts "Direct link to Organization accounts") When working under an organization account, you will see two types of API tokens on the Integrations page. 
![Integrations page in the Apify Console in organization mode](/assets/images/api-token-organization-e3655a759e0010b2015b3cb4f9a2b2b2.png) The Personal API tokens listed here are different from the Personal API tokens of your own account mentioned above. If you use this token in an integration, it will have the same permissions that you have within the organization, and all the operations you use it for will be ascribed to you. On the other hand, the Organization API tokens (visible only if you are the owner or have the Manage access tokens permission) have full permissions and are not tied to a specific member of the organization. ## API tokens with limited permissions[​](#limited-permissions "Direct link to API tokens with limited permissions") By default, tokens can access all data in your account. If that is not desirable, you can choose to limit the permissions of your token, so that it can only access data needed for the particular use case. We call these tokens **scoped**. **A scoped token can access only those resources that you'll explicitly allow it to.** info We do not allow scoped tokens to create or modify Actors. If you do need to create or modify Actors through the Apify API, use an unscoped token. ### How to create a scoped token[​](#how-to-create-a-scoped-token "Direct link to How to create a scoped token") Scoped tokens behave like standard API tokens and are managed through the [Integrations](https://console.apify.com/settings/integrations) page in Apify Console. When creating a token (or updating an existing one), simply toggle "Limit token permissions" to make the token scoped. Once the token is scoped, you can specify the token's permissions. ![Toggle \"Limit token permissions\" to make a token scoped](/assets/images/api-token-scoped-1924cb0bcb243160c2c0e3db9d7e31b8.png) ### Account-level vs resource-specific permissions[​](#account-level-vs-resource-specific-permissions "Direct link to Account-level vs resource-specific permissions") We support two different types of permissions for tokens: * **Account-level permissions**: These will apply to all resources in the entire account. For example, you can use these to allow the token to run *all* your Actors. * **Resource-specific permissions**: These will apply only to specific, existing resources. For example, you can use these to allow the token to read from a particular dataset. tip A single token can combine both types. You can create a token that can *read* any data storage, but *write* only to one specific key-value store. ![An example scoped token that combines account-level permissions and resource-specific permissions](/assets/images/api-token-scoped-with-combining-permissions-2a215b511b0705b7617fe154df80ed15.png) ### Allowing tokens to create resources[​](#allowing-tokens-to-create-resources "Direct link to Allowing tokens to create resources") If you need to create new resources with the token (for example, a new task or storage), you need to explicitly allow that as well. Once you create a new resource with the token, *the token will gain full access to that resource*, regardless of other permissions. It is not possible to create a token that can create a dataset, but not write to it. tip This is useful if you want to, for example, create a token that can dynamically create and populate datasets without needing access to other datasets in your account. ### Permission dependencies[​](#permission-dependencies "Direct link to Permission dependencies") Some permissions require other permissions to be granted alongside them. 
These are called *permission dependencies*. #### Automatic dependencies[​](#automatic-dependencies "Direct link to Automatic dependencies") The form enforces certain dependencies automatically. For example, when you grant the **Write** permission for a dataset, the **Read** permission is automatically selected. This ensures that when you can write to a dataset you can also read from it. ![The Write permission depends on Read for a dataset](/assets/images/api-token-scoped-dependencies-0af67b93c684001175ddc3f273eff652.png) #### Manual dependencies[​](#manual-dependencies "Direct link to Manual dependencies") Other dependencies are more complicated, so it is up to you to ensure that the token is configured correctly. Specifically: * To create or update a Schedule, the token needs access not only to the Schedule itself, but also to the Actor (the **Run** permission) or task (the **Read** permission) that is being scheduled. * Similarly, to create, update or run a task, the token needs the **Run** permission on the task's Actor itself. tip Let's say that you have an Actor and you want to programmatically create schedules for that Actor. Then you can create a token that has the account level **Create** permission on schedules, but only the resource-specific **Run** permission on the Actor. Such a token has exactly the permissions it needs, and nothing more. ### Actor execution[​](#actor-execution "Direct link to Actor execution") When you run an Actor, Apify creates a new, short-lived run API token, and injects it into the Actor environment. This applies to scoped tokens as well, so when you run an Actor with a scoped token, **the Actor is executed with a different token with a different scope.** In the scoped token configuration you can choose what scope the run API token gets, effectively determining what the Actor can access during its run. Apify currently supports two modes: * **Full access**: Allow Actors to access all your account's data. * **Restricted access**: Restrict what Actors can access using the scope of this Actor. ![Choose permission mode for running Actors with a scoped token](/assets/images/api-token-scoped-run-modes-7d0d9163f379ea085781f96e34295844.png) #### Full access: Allow Actors to access all your account's data[​](#full-access-allow-actors-to-access-all-your-accounts-data "Direct link to Full access: Allow Actors to access all your account's data") When you run an Actor with a scoped token in this mode, Apify will inject an *unscoped* token to the Actor runtime. This way you can be sure that once you give a token the permission to run an Actor, it will just work, and you don't have to worry about the exact permissions the Actor might need. However, this also means that you need to trust the Actor. tip Use this mode if you want to integrate with a 3rd-party service to run your Actors. Create a scoped token that can only run the Actor you need, and share it with the service. Even if the token is leaked, it can't be used to access your other data. #### Restricted access: Restrict what Actors can access using the scope of this Actor[​](#restricted-access-restrict-what-actors-can-access-using-the-scope-of-this-actor "Direct link to Restricted access: Restrict what Actors can access using the scope of this Actor") When you run an Actor with a scoped token in this mode, Apify will inject a token with the same scope as the scope of the original token. This way you can be sure that Actors won't accidentally—or intentionally—access any data they shouldn't. 
However, Actors might not function properly if the scope is not sufficient. caution Restricted access mode is not supported for Actors running in [Standby mode](https://docs.apify.com/platform/actors/running/standby.md). While you can send standby requests using a scoped token configured with restricted access, functionality is not guaranteed. tip This restriction is *transitive*, which means that if the Actor runs another Actor, its access will be restricted as well. #### Default run storages[​](#default-run-storages "Direct link to Default run storages") When Apify [runs an Actor](https://docs.apify.com/platform/actors/running/runs-and-builds.md#runs), it automatically creates a set of default storages (a dataset, a key-value store, and a request queue) that the Actor can use at runtime. You can configure whether the scoped token you are going to use to run the Actor should get **Write** access to these default storages. ![Configure whether the trigger token gets write access to the run default storages.](/assets/images/api-token-scoped-default-storage-access-52b8b41a1c8928e1d159e4c0a75bdb44.png) tip Let's say your Actor produces a lot of data that you want to delete just after the Actor finishes. If you enable this toggle, your scoped token will be allowed to do that. caution Even if you disable this option, **the default storages can still be accessed anonymously using just their ID** (which can be obtained via the [run object](https://docs.apify.com/api/v2#tag/Actor-runsRun-object-and-its-storages)). Moreover, if a scoped token can run an Actor, it can also list all its runs, including their storage IDs, ultimately exposing their content as well. If this is not desirable, change your Actor to output data into an existing named storage, or have it create a new storage. ### Schedules[​](#schedules "Direct link to Schedules") You can use scoped tokens to schedule Actors and tasks. Each schedule invocation triggers a new Actor run, creating and injecting a new run API token into the Actor. However, **this token is always unscoped, which means that the scheduled Actor has access to all your account data**, regardless of the scope of the token that scheduled it. ### Webhooks configuration[​](#webhooks-configuration "Direct link to Webhooks configuration") If you allow a token to run an Actor, it'll also be able to manage the Actor's webhooks (similarly for tasks). If you set up a webhook pointing to the Apify API, the Apify platform will automatically inject a token when dispatching that webhook. However, if you set up such a webhook with a scoped token, **that webhook will be dispatched with a token with the same limited permissions**. Therefore, you need to make sure the token has sufficient permissions not only to set up the webhook, but also to perform the actual operation. tip Let's say you want to create a webhook that pushes an item to a dataset every time an Actor successfully finishes. Then such a scoped token needs to be allowed to both run the Actor (to create the webhook) and write to that dataset. ### Troubleshooting[​](#troubleshooting "Direct link to Troubleshooting") #### How do I allow a token to run a task?[​](#how-do-i-allow-a-token-to-run-a-task "Direct link to How do I allow a token to run a task?") Tasks don't have a dedicated **Run** permission. 
Instead, you should configure the token with the following permissions: * **Run** on the Actor that the task is executing * **Read** on the task See the following example: ![Scoped token configured to run a task](/assets/images/api-token-scoped-run-tasks-ef961ddca69d18ff3797ee47ccafe44e.png) Refer to [this section](#permission-dependencies) to understand how permission dependencies work. #### My run failed and I can see `insufficient permissions` in the logs[​](#my-run-failed-and-i-can-see-insufficient-permissions-in-the-logs "Direct link to my-run-failed-and-i-can-see-insufficient-permissions-in-the-logs") When a run fails with insufficient permissions in the logs, it typically means the Actor is using a scoped token with **Restricted access** configured. ![Scoped token with Restricted access](/assets/images/api-token-scoped-restricted-access-active-bdd6f312a8bf9ea7e8dfbe2e31c3df60.png) What is happening is that the Actor is trying to access a resource (such as a dataset or a key-value store) or perform an operation that it does not have sufficient permissions for. If you know what it is, you can add the permission to the scope of your token. If you don't, you can switch the permission mode on the token to **Full access**. This means that the Actor will be able to access all your account data. Refer to the [Actor execution](#actor-execution) section to understand how executing Actors with scoped tokens works. --- # Amazon Bedrock integrations **Learn how to integrate Apify with Amazon Bedrock Agents to provide web data for AI agents.** *** [Amazon Bedrock](https://aws.amazon.com/bedrock/) is a fully managed service that provides access to large language models (LLMs), allowing users to create and manage retrieval-augmented generation (RAG) pipelines, and create AI agents to plan and perform actions. AWS Bedrock supports a wide range of models from providers such as AI21 Labs, Anthropic, Cohere, Meta, and Mistral AI. These models are designed to handle complex, multistep tasks across systems, knowledge bases, and APIs, making them versatile for various use cases. In this tutorial, we’ll demonstrate how to create and use an AWS Bedrock AI agent and integrate it with Apify Actors. The AI agent will be configured to either answer questions from its internal LLM knowledge or to leverage the [RAG Web Browser](https://apify.com/apify/rag-web-browser) to perform internet searches for relevant information. This approach enables the agent to provide more comprehensive and accurate responses by combining internal knowledge with real-time data from the web. ## AWS Bedrock AI agents[​](#aws-bedrock-ai-agents "Direct link to AWS Bedrock AI agents") Amazon Bedrock allows you to create AI agents powered by large language models that analyze user input, determine the required data sources, and execute the actions needed to fulfill user requests. Before getting started, ensure you have: * An active AWS Account. * An Apify account and an [API token](https://docs.apify.com/platform/integrations/api#api-token). * Granted access to any Large Language Model from Amazon Bedrock. To add access to an LLM, follow this [guide](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html). We'll use **Anthropic Claude 3.5 Sonnet** in this example. The overall process for creating an agent includes the following [steps](https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html): * Provide a name and description for the agent. * Specify instructions for the agent. 
* Set up actions in action groups by defining function details or using an OpenAPI schema. * Configure a Lambda function to implement the defined function or OpenAPI schema. * Select a model for the agent to orchestrate with. * (Optional) Add knowledge bases for the agent. * (Optional) Configure advanced prompts for the agent for better control. The following image illustrates the key components of an AWS Bedrock AI agent: ![AWS-Bedrock-AI-Agent](/assets/images/aws-bedrock-ai-agent-1122d56e1c1d39a22a99fe0febe7bad5.png) ### Building an Agent[​](#building-an-agent "Direct link to Building an Agent") To begin, open the Amazon Bedrock console and select agents from the left navigation panel. On the next screen, click Create agent to start building your agent. ![aws-bedrock-menu.png](/assets/images/aws-bedrock-menu-1d4b2f7fa5e1fa1923fd99ab1c197208.png) #### Step 1: Provide agent details[​](#step-1-provide-agent-details "Direct link to Step 1: Provide agent details") Enter a name and description for your agent. This will create a new agent and open the Agent Builder page. On this page, you can: * Select the LLM model to use. * Add knowledge bases. * Set up actions. * Provide instructions for the agent. For example, you can use the following instructions: ``` You are a smart and helpful assistant. Answer question based on the search results. Use an expert, friendly, and informative tone Always use RAG Web Browser if you need to retrieve the latest search results and answer questions. ``` ![AWS-Bedrock-Create-Agent](/assets/images/aws-bedrock-create-agent-19992787dcaa17536564c2c06a80d272.png) #### Step 2: Create actions[​](#step-2-create-actions "Direct link to Step 2: Create actions") After saving the agent, preparing it, and testing it in the embedded chat window, you’ll notice that it cannot yet retrieve real-time search data from the internet. To enable this functionality, you need to create actions that integrate the agent with Apify and provide the necessary search data. Navigate to the **Actions Groups**, where you can define a set of actions. Actions can be created by either: * Providing function details, or * Using an OpenAPI schema, which will be implemented via a Lambda function. For this example, use the following settings: * Action type - Define with API schema * Action group invocation - Create a new Lambda function * Action group schema - Define via in-line editor ![AWS-Bedrock-Agent-Builder-Create-Action](/assets/images/aws-bedrock-agent-builder-create-action-cdca260e1099e97663d98613fb0b6c0d.png) In the in-line editor, paste the OpenAPI schema of the [RAG Web Browser](https://raw.githubusercontent.com/apify/rag-web-browser/refs/heads/master/docs/standby-openapi-3.0.0.json). At this point, the Agent is almost ready to integrate with Apify. Save the action to create a new, empty Lambda function. The final step is to update the Lambda function to implement the OpenAPI schema and enable real-time web search capabilities. #### Step 3: Lambda function[​](#step-3-lambda-function "Direct link to Step 3: Lambda function") 1. Open the Lambda function you created and copy-paste the [Python lambda function](https://raw.githubusercontent.com/apify/rag-web-browser/refs/heads/master/docs/aws-lambda-call-rag-web-browser.py). 2. Replace `APIFY_API_TOKEN` in the code with your Apify API token. Alternatively, store the token as an environment variable: * Go to the Configuration tab. * Select Environment Variables. * Add a new variable by specifying a key and value. 3. 
Configure the Lambda function: * Set the memory allocation to 128 MB and timeout duration to 60 seconds. 4. Save the Lambda function and deploy it. #### Step 4: Test the agent[​](#step-4-test-the-agent "Direct link to Step 4: Test the agent") 1. Return to the **AWS Bedrock console** and prepare the agent for testing in the embedded chat window. 2. Test the agent by entering a query, for example: "What is the latest news about AWS Bedrock" 3. Based on your query, the agent will determine the appropriate action to take. You can view the agent's reasoning in the **Trace Step view**. For instance: ``` { "rationale": { "text": "To answer this question about the latest news for AWS Bedrock, I'll need to use the RAG Web Browser function to search for and retrieve the most recent information. I'll craft a search query that specifically targets AWS Bedrock news.", "traceId": "845d524a-b82c-445b-9e36-66d887b3b25e-0" } } ``` The agent will use the RAG Web Browser to gather relevant information and provide a response to the user query. ![aws-bedrock-agent-chat.png](/assets/images/aws-bedrock-agent-chat-0cab59d11394bd7c88a284cc5b7f59fc.png) ### Limitations and debugging[​](#limitations-and-debugging "Direct link to Limitations and debugging") When integrating agent in AWS Bedrock, you may encounter some limitations and issues. * **Model access**: Ensure that you have access to the model you want to use in the agent. If you don't have access, you can request it from the model provider. * **Lambda function**: Ensure that the Lambda function is correctly implemented and deployed. Check the function logs for any errors. * **Environment variables**: Ensure that the `APIFY_API_TOKEN` is correctly set in the Lambda function. * **Memory and timeout**: Ensure that the Lambda function has enough memory and timeout to call the RAG Web Browser. * **Agent trace**: Use the agent trace view to debug the agent's reasoning and actions. * **Response Size**: AWS Bedrock enforces a 25KB limit on response bodies. Limit results and truncate text as needed to stay within this constraint. ## Resources[​](#resources "Direct link to Resources") * [Amazon Bedrock Agents](https://aws.amazon.com/bedrock/agents/) * [Amazon Bedrock Agents Quickstart](https://github.com/build-on-aws/amazon-bedrock-agents-quickstart) * [Apify - RAG Web Browser](https://apify.com/apify/rag-web-browser) --- # 🤖🚀 CrewAI integration **Learn how to build AI Agents with Apify and CrewAI.** *** ## What is CrewAI[​](#what-is-crewai "Direct link to What is CrewAI") [CrewAI](https://www.crewai.com/) is an open-source Python framework designed to orchestrate autonomous, role-playing AI agents that collaborate as a "crew" to tackle complex tasks. It enables developers to define agents with specific roles, assign tasks, and integrate tools—like Apify Actors—for real-world data retrieval and automation. Explore CrewAI For more in-depth details on CrewAI, check out its [official documentation](https://docs.crewai.com/). ## How to use Apify with CrewAI[​](#how-to-use-apify-with-crewai "Direct link to How to use Apify with CrewAI") This guide demonstrates how to integrate Apify Actors with CrewAI by building a crew of agents that uses the [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor to search Google for TikTok profiles and the [TikTok Data Extractor](https://apify.com/clockworks/free-tiktok-scraper) Actor to extract and analyze data from the TikTok profiles. 
### Prerequisites[​](#prerequisites "Direct link to Prerequisites") * **Apify API token**: To use Apify Actors in CrewAI, you need an Apify API token. Learn how to obtain it in the [Apify documentation](https://docs.apify.com/platform/integrations/api). * **OpenAI API key**: To power the agents in CrewAI, you need an OpenAI API key. Get one from the [OpenAI platform](https://platform.openai.com/account/api-keys). * **Python packages**: Install the following Python packages: ``` pip install 'crewai[tools]' langchain-apify langchain-openai ``` ### Building the TikTok profile search and analysis crew[​](#building-the-tiktok-profile-search-and-analysis-crew "Direct link to Building the TikTok profile search and analysis crew") First, import all required packages: ``` import os from crewai import Agent, Task, Crew from crewai_tools import ApifyActorsTool from langchain_openai import ChatOpenAI ``` Next, set the environment variables for the Apify API token and OpenAI API key: ``` os.environ["OPENAI_API_KEY"] = "Your OpenAI API key" os.environ["APIFY_API_TOKEN"] = "Your Apify API token" ``` Instantiate the LLM and Apify Actors tools: ``` llm = ChatOpenAI(model="gpt-4o-mini") browser_tool = ApifyActorsTool(actor_name="apify/rag-web-browser") tiktok_tool = ApifyActorsTool(actor_name="clockworks/free-tiktok-scraper") ``` Define the agents with roles, goals, and tools: ``` search_agent = Agent( role="Web Search Specialist", goal="Find the TikTok profile URL on the web", backstory="Expert in web searching and data retrieval", tools=[browser_tool], llm=llm, verbose=True ) analysis_agent = Agent( role="TikTok Profile Analyst", goal="Extract and analyze data from the TikTok profile", backstory="Skilled in social media data extraction and analysis", tools=[tiktok_tool], llm=llm, verbose=True ) ``` Define the tasks for the agents: ``` search_task = Task( description="Search the web for the OpenAI TikTok profile URL.", agent=search_agent, expected_output="A URL linking to the OpenAI TikTok profile." ) analysis_task = Task( description="Extract data from the OpenAI TikTok profile URL and provide a profile summary and details about the latest post.", agent=analysis_agent, context=[search_task], expected_output="A summary of the OpenAI TikTok profile including followers and likes, plus details about their most recent post." ) ``` Create and run the crew: ``` crew = Crew( agents=[search_agent, analysis_agent], tasks=[search_task, analysis_task], process="sequential" ) result = crew.kickoff() print(result) ``` Search and analysis may take some time The agent tasks may take some time as they search the web for the OpenAI TikTok profile and extract data from it. You will see the crew’s output in the console, showing the results of the search and analysis. ``` Profile Summary: - Username: OpenAI - Profile URL: [OpenAI TikTok Profile](https://www.tiktok.com/@openai) - Followers: 605,000 - Likes: 3,400,000 - Number of Videos: 152 - Verified: Yes - Signature: low key research previews - Bio Link: [OpenAI Website](https://openai.com/) Latest Post Details: - Post ID: 7474019216346287406 - Post Text: "@Adeline Mai is a photographer..." 
- Creation Time: February 21, 2025 - Number of Likes: 863 - Number of Shares: 26 - Number of Comments: 33 - Number of Plays: 20,400 - Number of Collects: 88 - Music Used: Original Sound by OpenAI - Web Video URL: [Watch Here](https://www.tiktok.com/@openai/video/7474019216346287406) ``` If you want to test the whole example, create a new file, `crewai_integration.py`, and copy the full code into it: ``` import os from crewai import Agent, Task, Crew from crewai_tools import ApifyActorsTool from langchain_openai import ChatOpenAI os.environ["OPENAI_API_KEY"] = "Your OpenAI API key" os.environ["APIFY_API_TOKEN"] = "Your Apify API token" llm = ChatOpenAI(model="gpt-4o-mini") browser_tool = ApifyActorsTool(actor_name="apify/rag-web-browser") tiktok_tool = ApifyActorsTool(actor_name="clockworks/free-tiktok-scraper") search_agent = Agent( role="Web Search Specialist", goal="Find the TikTok profile URL on the web", backstory="Expert in web searching and data retrieval", tools=[browser_tool], llm=llm, verbose=True ) analysis_agent = Agent( role="TikTok Profile Analyst", goal="Extract and analyze data from the TikTok profile", backstory="Skilled in social media data extraction and analysis", tools=[tiktok_tool], llm=llm, verbose=True ) search_task = Task( description="Search the web for the OpenAI TikTok profile URL.", agent=search_agent, expected_output="A URL linking to the OpenAI TikTok profile." ) analysis_task = Task( description="Extract data from the OpenAI TikTok profile URL and provide a profile summary and details about the latest post.", agent=analysis_agent, context=[search_task], expected_output="A summary of the OpenAI TikTok profile including followers and likes, plus details about their most recent post." ) crew = Crew( agents=[search_agent, analysis_agent], tasks=[search_task, analysis_task], process="sequential" ) result = crew.kickoff() print(result) ``` ## Resources[​](#resources "Direct link to Resources") * [Apify Actors](https://docs.apify.com/platform/actors) * [CrewAI Documentation](https://docs.crewai.com/) * [What are AI agents?](https://blog.apify.com/what-are-ai-agents/) * [How to build an AI agent](https://blog.apify.com/how-to-build-an-ai-agent/) --- # Google Drive integration **Learn how to integrate your Apify Actors with Google Drive. This article shows you how to automatically save results to your drive when an Actor run succeeds.** *** ## Get started[​](#get-started "Direct link to Get started") To use the Apify integration for Google Drive, you will need: * An [Apify account](https://console.apify.com/). * A Google account * A saved Actor Task ## Set up Google Drive integration[​](#set-up-google-drive-integration "Direct link to Set up Google Drive integration") 1. Head over to **Integrations** tab in your saved task and click on the **Upload file** integration. ![Google Drive integration](/assets/images/google-integrations-add-7548d11ef631c30f63f9f06733f5c34e.png) 2. Click on **Connect with Google** button and select the account with which you want to use the integration. ![Google Drive integration](/assets/images/google-integrations-connect-drive-836e2e2e4618baefb146659112e6bb4a.png) 3. Set up the integration details. You can choose the **Filename** and **Format** , which can make use of available variables. The file will be uploaded to your Google Drive account to `Apify Uploads` folder. By default, the integration is triggered by successful runs only. 
![Google Drive integration](/assets/images/google-integrations-details-drive-a5ac7880e4d742e2cefe11efaa3e247f.png) 4. Click on **Save** & enable the integration. Once this is done, run your Actor to test whether the integration is working. You can manage your connected accounts at **[Settings > API & Integrations](https://console.apify.com/settings/integrations)**. ![Google Drive integration](/assets/images/google-integrations-accounts-95c33e6e7c658a29a5b87f4a4c65a653.png) --- # Flowise integration **Learn how to integrate Apify with Flowise.** *** ## What is Flowise?[​](#what-is-flowise "Direct link to What is Flowise?") Flowise is an open-source visual UI tool for building customized LLM flows with LangChain. ## How to use Apify with Flowise[​](#how-to-use-apify-with-flowise "Direct link to How to use Apify with Flowise") ### Installation[​](#installation "Direct link to Installation") To use Flowise, you have to download and run it locally. The quickest way to do so is to use the following commands: 1. To install Flowise globally on your device: ``` npm install -g flowise ``` 2. To start Flowise locally: ``` npx flowise start ``` It will be available at `http://localhost:3000`. Other methods of installing and running Flowise can be found in its [documentation](https://docs.flowiseai.com/getting-started#quick-start). ### Building your flow[​](#building-your-flow "Direct link to Building your flow") After running Flowise, you can start building your flow with Apify. The first step is to create a new flow in the web UI. In the left menu, find the Apify Website Content Crawler under Document Loaders. ![Flowise add Apify Crawler](/assets/images/flowise-apify-be24e3ad72927eabe8324296606fbc9e.png) Now you need to configure the crawler. You can find more information about it on the [Website Content Crawler page](https://apify.com/apify/website-content-crawler). ![Flowise and Apify](/assets/images/flowise-6aaa0f5e5f9f12324d65667d091b43ea.png) In the configuration, provide your Apify API token, which you can find in your [Apify account](https://console.apify.com/settings/integrations). ![Apify API token screen](/assets/images/flowise-apify-api-f22034c2739a7ec01b6459b0f630b4a6.png) You can add more loaders, or you can add some processors to process the data. In our case, we create a flow that loads data from the Apify docs using the Website Content Crawler and saves it into an in-memory vector database. Then we connect ChatOpenAI, the OpenAI embeddings, and a QA retrieval chain to the chatbot. The final flow can answer questions about the Apify docs. ![Flowise and Apify](/assets/images/flowise-2-8a54cc439fcc38ba74a1551c6e45bf29.png) For more information, visit the Flowise [documentation](https://flowiseai.com/). ## Resources[​](#resources "Direct link to Resources") * [Flowise](https://flowiseai.com/) * [Flowise documentation](https://github.com/FlowiseAI/Flowise#quick-start) --- # GitHub integration **Learn how to integrate your Apify Actors with GitHub. This article shows you several possible integrations.** *** ## Get started[​](#get-started "Direct link to Get started") To use the Apify integration for GitHub, you will need: * An [Apify account](https://console.apify.com/). * A GitHub repository. ### Create an Actor from a GitHub repository[​](#create-an-actor-from-a-github-repository "Direct link to Create an Actor from a GitHub repository") Learn how to create an Actor from a GitHub repository. This is useful if you want to automatically build and deploy your code whenever you push to your GitHub repository.
![New Actor page](/assets/images/apify-git-repository-f8cf775a8670bd3f54aa4f145e8a84fb.png) Selecting *Link Git repository* will open a modal where you can choose the provider to use. Selecting *GitHub* will open a new window where you can authenticate with GitHub and select a GitHub repository. ![Git provider select modal](/assets/images/apify-git-provider-select-6439e11e794233d74e19462a5edf3e3a.png) To link an account, click on *Add GitHub account* and follow the instructions on github.com. You can select specific organizations or users. ![Git repository account select](/assets/images/apify-git-repository-add-cea31f962e8e7937f5e4cc246e41ea82.png) You can switch among all authorized users and organizations. ![Git repository account select](/assets/images/apify-git-repository-account-63de9314b2ac9363d85ab8c6c4b0a2bc.png) If the required repository is missing, try finding it with *Search*. ![apify-git-repository-search.png](/assets/images/apify-git-repository-search-007955d7c791366f32c2d745266eb776.png) An Actor is created immediately from the selected repository. ### Create an issue when a run fails[​](#create-an-issue-when-a-run-fails "Direct link to Create an issue when a run fails") [YouTube video player](https://www.youtube-nocookie.com/embed/jZUp-rRbayc) --- # Gmail integration **Learn how to integrate your Apify Actors with Gmail. This article shows you how to automatically send an email with results when an Actor run succeeds.** *** ## Get started[​](#get-started "Direct link to Get started") To use the Apify integration for Gmail, you will need: * An [Apify account](https://console.apify.com/). * A Google account. * A saved Actor Task. ## Set up Gmail integration[​](#set-up-gmail-integration "Direct link to Set up Gmail integration") 1. Head over to the **Integrations** tab in your task and click on the **Send email** generic integration. ![Google Drive integration](/assets/images/google-integrations-add-7548d11ef631c30f63f9f06733f5c34e.png) 2. Click on the **Connect with Google** button and select the account with which you want to use the integration. ![Google Drive integration](/assets/images/google-integrations-connect-gmail-b9c2b4101708434ac2200a43413b1d8f.png) 3. Set up the integration details. The **Subject** and **Body** fields can make use of available variables. The dataset can be attached in several formats. By default, the integration is triggered by successful runs only. ![Google Drive integration](/assets/images/google-integrations-details-gmail-2d2c83f54f203bf3b889636c6c3a3931.png) 4. Click on **Save** & enable the integration. Once this is done, run your Actor to test whether the integration is working. You can manage your connected accounts at **[Settings > API & Integrations](https://console.apify.com/settings/integrations)**. ![Google Drive integration](/assets/images/google-integrations-accounts-95c33e6e7c658a29a5b87f4a4c65a653.png) --- # Haystack integration **Learn how to integrate Apify with Haystack to work with web data in the Haystack ecosystem.** *** [Haystack](https://haystack.deepset.ai/) is an open-source framework for building production-ready LLM applications, agents, advanced retrieval-augmented generative pipelines, and state-of-the-art search systems that work intelligently over large document collections. For more information on Haystack, visit its [documentation](https://docs.haystack.deepset.ai/docs/intro).
In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor, which can deeply crawl websites such as documentation sites, knowledge bases, or blogs, and extract text content from the web pages. Then, we'll use the `OpenAIDocumentEmbedder` to compute text embeddings and the `InMemoryDocumentStore` to store documents in a temporary in-memory database. The last step will be to retrieve the most similar documents. This example uses the Apify-Haystack Python integration published on [PyPI](https://pypi.org/project/apify-haystack/). Before we start with the integration, we need to install all dependencies: ``` pip install apify-haystack haystack-ai ``` Import all required packages: ``` from haystack import Document, Pipeline from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever from haystack.components.writers import DocumentWriter from haystack.document_stores.in_memory import InMemoryDocumentStore from haystack.utils.auth import Secret from apify_haystack import ApifyDatasetFromActorCall ``` Find your [Apify API token](https://console.apify.com/account/integrations) and [OpenAI API key](https://platform.openai.com/account/api-keys) and set them as environment variables: ``` import os os.environ["APIFY_API_TOKEN"] = "YOUR-APIFY-API-TOKEN" os.environ["OPENAI_API_KEY"] = "YOUR-OPENAI-API-KEY" ``` First, you need to create a document loader that will crawl the Haystack website using the Website Content Crawler: ``` document_loader = ApifyDatasetFromActorCall( actor_id="apify/website-content-crawler", run_input={ "maxCrawlPages": 3, # limit the number of pages to crawl "startUrls": [{"url": "https://haystack.deepset.ai/"}], }, dataset_mapping_function=lambda item: Document(content=item["text"] or "", meta={"url": item["url"]}), ) ``` You can learn more about input parameters on the [Website Content Crawler inputs page](https://apify.com/apify/website-content-crawler/input-schema). The dataset mapping function is described in more detail in the [Retrieval augmented generation example](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/apify_haystack_rag.ipynb). Next, you can utilize the [Haystack pipeline](https://docs.haystack.deepset.ai/docs/pipelines), which helps you connect several processing components together. In this example, we connect the document loader with the document splitter, document embedder, and document writer components. ``` document_store = InMemoryDocumentStore() document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50) document_embedder = OpenAIDocumentEmbedder() document_writer = DocumentWriter(document_store) pipe = Pipeline() pipe.add_component("document_loader", document_loader) pipe.add_component("document_splitter", document_splitter) pipe.add_component("document_embedder", document_embedder) pipe.add_component("document_writer", document_writer) pipe.connect("document_loader", "document_splitter") pipe.connect("document_splitter", "document_embedder") pipe.connect("document_embedder", "document_writer") ``` Run all the components in the pipeline: ``` pipe.run({}) ``` Crawling may take some time The Actor call may take some time as it crawls the Haystack website.
After running the pipeline code, you can print the results ``` print(f"Added {document_store.count_documents()} to vector from Website Content Crawler") print("Retrieving documents from the document store using BM25") print("query='Haystack'") bm25_retriever = InMemoryBM25Retriever(document_store) for doc in bm25_retriever.run("Haystack", top_k=1)["documents"]: print(doc.content) ``` If you want to test the whole example, you can simply create a new file, `apify_integration.py`, and copy the whole code into it. ``` import os from haystack import Document, Pipeline from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder from haystack.components.preprocessors import DocumentSplitter from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever from haystack.components.writers import DocumentWriter from haystack.document_stores.in_memory import InMemoryDocumentStore from apify_haystack import ApifyDatasetFromActorCall os.environ["APIFY_API_TOKEN"] = "YOUR-APIFY-API-TOKEN" os.environ["OPENAI_API_KEY"] = "YOUR-OPENAI-API-KEY" document_loader = ApifyDatasetFromActorCall( actor_id="apify/website-content-crawler", run_input={ "maxCrawlPages": 3, # limit the number of pages to crawl "startUrls": [{"url": "https://haystack.deepset.ai/"}], }, dataset_mapping_function=lambda item: Document(content=item["text"] or "", meta={"url": item["url"]}), ) document_store = InMemoryDocumentStore() print(f"Initialized InMemoryDocumentStore with {document_store.count_documents()} documents") document_splitter = DocumentSplitter(split_by="word", split_length=150, split_overlap=50) document_embedder = OpenAIDocumentEmbedder() document_writer = DocumentWriter(document_store) pipe = Pipeline() pipe.add_component("document_loader", document_loader) pipe.add_component("document_splitter", document_splitter) pipe.add_component("document_embedder", document_embedder) pipe.add_component("document_writer", document_writer) pipe.connect("document_loader", "document_splitter") pipe.connect("document_splitter", "document_embedder") pipe.connect("document_embedder", "document_writer") print("\nCrawling will take some time ...") print("You can visit https://console.apify.com/actors/runs to monitor the progress\n") pipe.run({}) print(f"Added {document_store.count_documents()} to vector from Website Content Crawler") print("\n ### Retrieving documents from the document store using BM25 ###\n") print("query='Haystack'\n") bm25_retriever = InMemoryBM25Retriever(document_store) for doc in bm25_retriever.run("Haystack", top_k=1)["documents"]: print(doc.content) print("\n ### Retrieving documents from the document store using vector similarity ###\n") retrieval_pipe = Pipeline() retrieval_pipe.add_component("embedder", OpenAITextEmbedder()) retrieval_pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=1)) retrieval_pipe.connect("embedder.embedding", "retriever.query_embedding") results = retrieval_pipe.run({"embedder": {"text": "What is Haystack?"}}) for doc in results["retriever"]["documents"]: print(doc.content) ``` To run it, you can use the following command: `python apify_integration.py` ## Resources[​](#resources "Direct link to Resources") * [Apify-haystack integration documentation](https://haystack.deepset.ai/integrations/apify) * [Apify-haystack integration source code](https://github.com/apify/apify-haystack) * [Example: RAG - Extract and use website content for question answering](https://haystack.deepset.ai/cookbook/apify_haystack_rag) 
* [Example: RAG: Web Search and Analysis with Apify and Haystack](https://haystack.deepset.ai/cookbook/apify_haystack_rag_web_browser) * [Example: Analyze Your Instagram Comments’ Vibe](https://haystack.deepset.ai/cookbook/apify_haystack_instagram_comments_analysis) --- # IFTTT integration **Connect Apify Actors with IFTTT to automate workflows using Actor run events, data queries, and task actions.** *** [IFTTT](https://ifttt.com) is a service that helps you create automated workflows called Applets. With the [Apify integration for IFTTT](https://ifttt.com/apify), you can connect your Apify Actors to hundreds of services like Twitter, Gmail, Google Sheets, Slack, and more. This guide shows you how to integrate Apify Actors with IFTTT to build automated workflows. You'll learn how to create IFTTT Applets that can be triggered by Apify events or that can execute Apify tasks. An IFTTT Applet consists of three key parts: * A *trigger* that starts the workflow * Optional *queries* that retrieve data * One or more *actions* that execute when the Applet runs The Apify integration lets you trigger workflows when an Actor or task run finishes, start Actor or task runs from other triggers, or retrieve data from datasets and key-value stores. ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") Before using the Apify integration with IFTTT, you need: * An [Apify account](https://console.apify.com/) * An [IFTTT account](https://ifttt.com/) ## Connect Apify with IFTTT[​](#connect-apify-with-ifttt "Direct link to Connect Apify with IFTTT") To connect your Apify account to IFTTT: 1. Visit the [Apify service page](https://ifttt.com/apify) on IFTTT. 2. Click the **Connect** button. 3. When redirected to the Apify login page, sign in to your Apify account. ![Apify Connect Service](/assets/images/ifttt-connect-service-31994f8f219342d0c6481bf22e128901.png) ## Create an Applet with Apify[​](#create-an-applet-with-apify "Direct link to Create an Applet with Apify") ### Create an Applet[​](#create-an-applet "Direct link to Create an Applet") To create an Applet that starts when Apify event occurs: 1. Go to the [Explore](https://ifttt.com/explore) section on IFTTT. 2. Click the **Create** button. 3. In the **If this** section, click **Add**. 4. Search for and select **Apify** in the service list. ![Apify Choose Service](/assets/images/ifttt-choose-service-c39b0667014a7c55329c8828174143f9.png) 1. Select a trigger from the available options: * **Actor Run Finished**: Triggers when a selected Actor run completes * **Task Run Finished**: Triggers when a selected Actor task run completes ![Apify Choose Trigger](/assets/images/ifttt-choose-trigger-216126271840adfba2cfd3b8e52d3631.png) 1. Configure the trigger by selecting the specific Actor or task. 2. Click **Create trigger** to continue. In case you didn't connect to the Apify account, you will be prompted to do so when you select a trigger. ### Set up an Apify Action[​](#set-up-an-apify-action "Direct link to Set up an Apify Action") To use Apify as an action in your Applet: 1. In the **Then That** section of your Applet, click **Add**. 2. Search for and select **Apify**. 3. Choose an action: * **Run Actor**: Starts an Actor run * **Run Task**: Starts an Actor Task run ![Apify Choose Action](/assets/images/ifttt-choose-action-336bd3d53b384f90739460cb649e5d6b.png) 1. Select the Actor or task you want to use from the dropdown menu. note IFTTT displays up to 50 recent items in a dropdown. 
If your Actor or task isn't visible, try using it at least once via the API or in the Apify Console to make it appear in the list. ![Apify Actor Customization](/assets/images/ifttt-actor-config-39f6dcb2910c64be84f7e46194594fbe.png) 1. Configure the action parameters: | Parameter | Description | Example Values | | --- | --- | --- | | **Wait until run finishes** | Defines how the Actor should be executed. | `yes`, `no` | | **Input overrides** | JSON input that overrides the Actor's default input. | `{"key": "value"}` | | **Build** | Specifies the Actor build to run. Can be a build tag or build number. See [Builds](https://docs.apify.com/platform/actors/running/runs-and-builds.md#builds) for more information. | `0.2.10`, `version-0` | | **Memory** | Memory limit for the run in megabytes. See [Memory](https://docs.apify.com/platform/actors/running/usage-and-resources.md#memory) for more information. | `256` | 2. Click **Create action** to finish setting up the action. ![Apify Actor Customization](/assets/images/ifttt-applet-overview-1cd4f1dca50421b3a8b5dc96b17134c8.png) 1. Give your Applet a name and click **Finish** to save it. ## Monitor your Applet[​](#monitor-your-applet "Direct link to Monitor your Applet") To check if your Applet is working properly: 1. Go to your Applet's detail page. 2. Click the **View activity** button to see the execution history. ![Apify Actor Customization](/assets/images/ifttt-applet-inspect-0308bcd12438a7e8dc24e7b78bc1eebc.png) ## Available triggers, actions, and queries[​](#available-triggers-actions-and-queries "Direct link to Available triggers, actions, and queries") ### Triggers[​](#triggers "Direct link to Triggers") * **Actor Run Finished**: Activates when a selected Actor run completes * **Task Run Finished**: Activates when a selected Actor task run completes ### Actions[​](#actions "Direct link to Actions") * **Run Actor**: Starts a specified Actor with customizable parameters * **Run Task**: Executes a specified Actor task ### Queries[​](#queries "Direct link to Queries") * **Get Dataset Items**: Retrieves items from a [dataset](https://docs.apify.com/platform/storage/dataset.md) * **Scrape Single URL**: Runs a scraper for a specified website and returns its content * **Get Key-Value Store Record**: Retrieves a value from a [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) ## Troubleshooting[​](#troubleshooting "Direct link to Troubleshooting") * Ensure your JSON inputs in the "Input overrides" field are valid and properly formatted. * If an Applet fails to trigger, check your Apify API token permissions. If you have any questions or need help, feel free to reach out to us on our [developer community on Discord](https://discord.com/invite/jyEM2PRvMU). --- # Integrate with Apify If you are building a service and your users could benefit from integrating with Apify or vice versa, we would love to hear from you! Contact us at to discuss potential collaboration. We are always looking for ways to make our platform more useful and powerful for our users. ## Why integrate with Apify[​](#why-integrate-with-apify "Direct link to Why integrate with Apify") Apify is the leading platform for web scraping, AI agents, and automation tools.
By integrating Apify into your platform, you enable users to incorporate real-time, structured data from the web with zero scraping infrastructure on your side. [Apify Store](https://apify.com/store) contains thousands of pre-built Actors, ready-made tools for web scraping and automation. ## Integration types[​](#integration-types "Direct link to Integration types") An Apify integration can be *general*, allowing users to integrate any Actor from Apify Store into their workflows (or their own Actors), or *Actor-specific*, enabling targeted automation like integrating [Instagram Scraper](https://apify.com/apify/instagram-scraper) for use cases like social media monitoring. ### General integrations[​](#general-integrations "Direct link to General integrations") General integrations allow users to integrate Actors into their workflows by connecting Apify with other platforms. Examples include: * [Zapier](https://docs.apify.com/platform/integrations/zapier.md) integration allows Zapier users to enrich their automation workflows with data from the web or to add additional Actions performed by [Apify Actors](https://apify.com/store). * [Keboola](https://docs.apify.com/platform/integrations/keboola.md) integration enables Keboola users to easily pull data crawled from the web into their data pipelines. ### Actor-specific integrations[​](#actor-specific-integrations "Direct link to Actor-specific integrations") Actor-specific integrations are designed for targeted use cases. While they work similarly to general integrations, they help users find the right Apify tools more easily and provide a better experience. Examples include: * [Make Instagram Scraper integration](https://www.make.com/en/integrations/apify-instagram-scraper) * [Lindy.ai Instagram integration](https://www.lindy.ai/integrations/instagram) For more examples both general and Actor-specific, check [integrations](https://docs.apify.com/platform/integrations.md). ## Integrating with Apify[​](#integrating-with-apify "Direct link to Integrating with Apify") To integrate your service with Apify, you have two options: * Build an external integration using the [Apify API](https://docs.apify.com/api/v2) * Build an [Apify Actor](https://docs.apify.com/platform/actors) that will be used as integration within [Apify Console](https://console.apify.com) ![Integration-ready Actors](/assets/images/integration-ready-actors-3f9c1f9b61abf5dd4157f050cf2cb3d8.png) ### Building an integration Actor[​](#building-an-integration-actor "Direct link to Building an integration Actor") One way to reach out to Apify users is directly within [Apify Console](https://console.apify.com). To do that, you need to build an integrable Actor that can be piped into other Actors to upload existing data into a database. This can then be easily configured within Apify Console. Follow the [guide on building integration-ready Actors](https://docs.apify.com/platform/integrations/actors/integration-ready-actors.md). ### Building an external integration[​](#building-an-external-integration "Direct link to Building an external integration") An alternative way is to let your users manage the connection directly on your side using [Apify API](https://docs.apify.com/api/v2) and our API clients for [JavaScript](https://docs.apify.com/api/client/js) or [Python](https://docs.apify.com/api/client/python). This way, users can manage the connection directly from your service. 
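For illustration, here is a minimal sketch of what such a server-side integration typically does with the official [Python client](https://docs.apify.com/api/client/python): run an Actor on the user's behalf with their API token and fetch the resulting dataset items. The Actor ID, input, and token below are placeholders, not part of any specific integration.

```python
from apify_client import ApifyClient

# The user's personal API token, stored securely by your service (placeholder value).
client = ApifyClient(token="USER_APIFY_API_TOKEN")

# Start an Actor run and wait for it to finish (Actor ID and input are illustrative).
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}], "maxCrawlPages": 5},
)

# Fetch items from the run's default dataset and hand them to your platform's workflow.
items = client.dataset(run["defaultDatasetId"]).list_items(limit=100).items
for item in items:
    print(item.get("url"))
```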
![Airbyte sources tab](/assets/images/airbyte-sources-web-120a4cf11b196f4dbfb01659d156f0a9.png) ### Authentication methods[​](#authentication-methods "Direct link to Authentication methods") Apify supports two main authentication methods for secure API access. *OAuth 2.0* - Use OAuth 2.0 to allow users to authorize your integration without sharing their credentials. *API token* - Apify user generates personal API token from Apify account settings page. For more information, see [API Token documentation](https://docs.apify.com/platform/integrations/api#api-token). ### API implementation[​](#api-implementation "Direct link to API implementation") To build an integration, core API endpoints can be mapped as **actions and triggers** inside your platform. #### Action endpoints[​](#action-endpoints "Direct link to Action endpoints") ##### Run an Actor[​](#run-an-actor "Direct link to Run an Actor") Triggers the execution of any Apify Actor by ID, allowing users to start custom or public web scraping and automation Actors with specified input parameters. Recommended features: * Select Actor: The Actor list will be pre-populated with Actors that the user created or used, using the [List of Actors API](https://docs.apify.com/api/v2/acts-get) and enriched with Actors from the store, which the user has not run already using [API](https://docs.apify.com/api/v2/store-get). * Synchronous vs. asynchronous run: flow will wait until the run/task finishes (consider a timeout on your platform side) * Input UI: upon selecting an Actor, dynamically display specific Actor input and preload default example values based on the Actor Input schema. Alternatively, allow users to insert a JSON input for the Actor. * Additionally, it should include the option to choose [Actor build](https://docs.apify.com/platform/actors/running/runs-and-builds), [memory](https://docs.apify.com/platform/actors/running/usage-and-resources#memory), and [timeout](https://docs.apify.com/platform/actors/running/usage-and-resources#memory). * Field mapping: allowing users to map fields to data acquired in previous steps of the workflow. ##### Run a task[​](#run-a-task "Direct link to Run a task") Starts a predefined task (a saved Actor configuration), making it easy for users to run recurring or templated workflows without redefining inputs each time. Recommended features: * *Select task*: The task list will be pre-populated with tasks that the user created, using the [List of tasks](https://docs.apify.com/api/v2/actor-tasks-get) API. * *Synchronous vs. asynchronous run*: the flow will wait until the run/task finishes (considering timeout on your platform side) * *JSON input field*: possibility to add a JSON input to override the task input. ##### Get dataset items[​](#get-dataset-items "Direct link to Get dataset items") Fetches structured results (JSON, CSV, etc.) generated by a previously run Actor or task, which can be used as input for further workflow steps. Recommended features: * *Dataset*: Dropdown (user's datasets) or ID/String input. Populated via [Datasets API](https://docs.apify.com/api/v2/datasets-get). * *Limit (optional)*: The maximum number of dataset items to fetch. If empty, the default limit will be used. * *Offset (optional)*: The offset in the dataset from where to start fetching the items. If empty, it will be from the beginning. 
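As a rough sketch of how the Get dataset items action can be implemented on your platform's side, the call below uses the public `GET /v2/datasets/{datasetId}/items` endpoint with the limit and offset parameters described above. The token, dataset ID, and platform name are placeholders; the identification header is explained under Monitoring and tracking below.

```python
import requests

API_TOKEN = "USER_APIFY_API_TOKEN"  # placeholder
DATASET_ID = "DATASET_ID_OR_NAME"   # placeholder

# Fetch a page of dataset items in JSON format.
response = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "json", "limit": 100, "offset": 0},
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        # Identify your platform to Apify (placeholder value).
        "x-apify-integration-platform": "your-platform-name",
    },
    timeout=30,
)
response.raise_for_status()
items = response.json()  # a JSON array of dataset items
print(f"Fetched {len(items)} items")
```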
##### Get a key-value store item[​](#get-a-key-value-store-item "Direct link to Get a key-value store item") Retrieves a specific item from a key-value store, commonly used to access metadata, snapshots, logs, or one-off results generated during Actor execution. Recommended features: * *Key-value store*: Dropdown (user's KV stores) or ID/String input. Populated via [Key-value Stores API](https://docs.apify.com/api/v2/key-value-stores-get). * *Record key*: value (string) ##### Scrape a single URL[​](#scrape-a-single-url "Direct link to Scrape a single URL") Runs Apify's [Website Content Crawler](https://apify.com/apify/website-content-crawler) in synchronous mode to extract structured data from a single web page - ideal for on-demand URL scraping inside agents or automation flows. Recommended features: * *URL*: that you intend to scrape (string) * *Crawler type*: Dropdown menu, allowing users to choose from the following options: * *Headless web browser* - Useful for websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. * *Stealthy web browser (default)* - Another headless web browser with anti-blocking measures enabled. Try this if you encounter anti-bot protections while scraping. * *Raw HTTP client* - High-performance crawling mode that uses raw HTTP requests to fetch pages. It's faster and cheaper, but might not work on all websites. ##### Universal API call[​](#universal-api-call "Direct link to Universal API call") A node to send API requests to Apify, allowing advanced users to configure or query Actors, tasks, datasets, or other API endpoints programmatically. #### Trigger endpoints[​](#trigger-endpoints "Direct link to Trigger endpoints") ##### Watch Actor runs[​](#watch-actor-runs "Direct link to Watch Actor runs") Monitors the status of an Actor run by ID, useful for triggering follow-up steps once a job has completed. Triggered when a specific Actor run reaches terminal status (succeeded, failed, timed out, aborted). Recommended features: * *Select Actor runs to watch*: Dropdown (list of user's Actors). Populated via [List Actors API](https://docs.apify.com/api/v2/acts-get) ##### Watch task runs[​](#watch-task-runs "Direct link to Watch task runs") Similar to watching Actor runs, this tracks the progress and completion status of a specific task run to allow event-driven actions in a workflow. Recommended features: * *Select Actor tasks to watch*: Dropdown (list of user's tasks). Populated via [List tasks API](https://docs.apify.com/api/v2/actor-tasks-get). ### Pricing options[​](#pricing-options "Direct link to Pricing options") Choose between two pricing models based on your integration setup. #### Direct user billing[​](#direct-user-billing "Direct link to Direct user billing") Users create their own Apify accounts and are billed directly by Apify for their usage. This model gives users full control over their Apify usage and billing. #### Whitelabel access[​](#whitelabel-access "Direct link to Whitelabel access") Users access Apify through your platform without needing an Apify account. Apify bills you based on consumption, and you factor costs into your pricing. ### Monitoring and tracking[​](#monitoring-and-tracking "Direct link to Monitoring and tracking") To help Apify monitor and support your integration, every API request should identify your platform. 
You can do this in one of two ways: * Preferred: * Use the `x-apify-integration-platform` header with your platform name (e.g., make.com, zapier). * If your platform has multiple Apify apps, also include the `x-apify-integration-app-id` header with the unique app ID. * Alternative: * Set a custom `User-Agent` header that identifies your platform. These identifiers enable better analytics and support for your integration. ## Technical resources[​](#technical-resources "Direct link to Technical resources") ### Apify API[​](#apify-api "Direct link to Apify API") [Apify API](https://docs.apify.com/api) provides an extensive REST API that covers all the features of the Apify platform. You can download the complete OpenAPI schema of Apify API in the [YAML](https://docs.apify.com/api/openapi.yaml) or [JSON](https://docs.apify.com/api/openapi.json) formats. Apify provides official libraries for JavaScript and Python to access API. * [Apify API Reference](https://docs.apify.com/api/v2) * Client libraries * [JavaScript/TypeScript/Node.js](https://docs.apify.com/api/client/js/) * [Python](https://docs.apify.com/api/client/python/) ### Reference implementations[​](#reference-implementations "Direct link to Reference implementations") For inspiration, check out the public repositories of Apify's existing external integrations: * Zapier * [Zapier integration documentation](https://docs.apify.com/platform/integrations/zapier) * [Source code on Github](https://github.com/apify/apify-zapier-integration) * Make.com * [Make.com integration documentation](https://docs.apify.com/platform/integrations/make) * Keboola * [Keboola integration documentation](https://docs.apify.com/platform/integrations/keboola) * [Source code on GitHub](https://github.com/apify/keboola-ex-apify/) (JavaScript) * [Google Maps Reviews Scraper integration](https://github.com/apify/keboola-gmrs/) (Actor-specific) * Airbyte * [Source code on GitHub](https://github.com/airbytehq/airbyte/tree/master/airbyte-integrations/connectors/source-apify-dataset) (Python) * Pipedream * [Source code on GitHub](https://github.com/PipedreamHQ/pipedream/tree/65e79d1d66cf0f2fca5ad20a18acd001f5eea069/components/apify) For technical support, please contact us at . --- # Keboola integration **Integrate your Apify Actors with Keboola, a cloud-based data integration platform that consolidates data from various sources into a centralized storage.** *** With Apify integration for [Keboola](https://www.keboola.com/), you can extract data from various sources using your Apify Actors and load it into Keboola for further processing, transformation, and integration with other platforms. The Keboola integration allows you to run your Actors, fetch items from datasets, and retrieve results, all within the Keboola platform. ## Connect Apify with Keboola[​](#connect-apify-with-keboola "Direct link to Connect Apify with Keboola") To use the Apify integration on Keboola, you will need to: * Have an [Apify account](https://console.apify.com/). * Have a [Keboola account](https://www.keboola.com/). ### Step 1: Create a new Data Source in Keboola[​](#step-1-create-a-new-data-source-in-keboola "Direct link to Step 1: Create a new Data Source in Keboola") Once your Keboola account is ready and you are logged in, navigate to the **Components** section in the top menu and click the **Add Component** button. 
![Keboola dashboard](/assets/images/keboola-dashboard-9c1f255d99ade9239aa323f485051ae3.png) In the list of available Components, find and select **Apify** from Data Sources and click on the **Add Component** button. ![Keboola component search](/assets/images/keboola-components-a99719fae68b238d65c9358e3084b17e.png) In the new window, click the **Connect to my data** button. ![Keboola Connector](/assets/images/keboola-connector-c0360a0af4f42d2a0da2bc697be8c091.png) Provide a name and description for your configuration, then click the **Create Configuration** button. ![Keboola configuration setup](/assets/images/keboola-create-configuration-0368046440f454d6c0c3f06de64da89f.png) ### Step 2: Configure the Apify Data Source[​](#step-2-configure-the-apify-data-source "Direct link to Step 2: Configure the Apify Data Source") With the new configuration created, you can now configure the data source to retrieve the needed data. Click on the **Configure Component** button to begin the setup process. ![Keboola component configuration](/assets/images/keboola-configure-component-5837e3c680c9e2afbfc913d22b178a38.png) #### Choose an action[​](#choose-an-action "Direct link to Choose an action") In the next step, you can choose the action you want to perform: * **Run Actor**: This action runs the selected Actor, waits until it finishes, and then pushes all items from the default dataset to Keboola Storage. * **Retrieve dataset items from the last Actor run**: This action takes the dataset of a specific Actor's last run. * **Run Task**: This action runs the selected task, waits until it finishes, and then pushes all items from the default dataset to Keboola Storage. * **Retrieve items from the last task run**: This action takes the dataset of a specific task's last run. * **Retrieve items from Dataset**: This action takes the dataset ID or dataset name and retrieves all items from that dataset. ![Keboola component action setup](/assets/images/keboola-component-setup-c75d4f496be06dcb7056fc84c181183e.png) #### Authentication[​](#authentication "Direct link to Authentication") After selecting the action, you will need to provide your Apify API credentials. You can find them in your Apify account by navigating to **Settings > Integrations**; copy them into the provided form. ![Keboola API authentication](/assets/images/keboola-setup-api-token-2fa67938e05c98a15521b914d8c08d34.png) #### Specifications[​](#specifications "Direct link to Specifications") In the specifications step, you can set up various options for your Actor run: * **Actor**: Select the Actor you want to run from your Apify account. * **Input Table**: Choose a table from the Keboola platform to be sent to the Actor as input data. * **Output field**: Comma-separated list of fields to be picked from the dataset. * **Memory**: Adjust the memory settings if needed (the default values can be kept). * **Build**: The tag or number of the Actor build to run. Adjust this if you want to run a specific build. * **Actor Input**: Pass any JSON data as input to the Actor. Once you have filled in all the necessary options, click the **Save** button to save your configuration.
![Keboola component specification setup](/assets/images/keboola-setup-specification-2bff78a68a69f8c899b29632806362c0.png) ### Step 3: Run the configured Data Source[​](#step-3-run-the-configured-data-source "Direct link to Step 3: Run the configured Data Source") After your data source has been configured, you can run it by clicking the **Run** button in the upper-right corner of your configuration. ![Keboola run configured component](/assets/images/keboola-run-component-a9fe7035e198fe3ce1ec3c8d73eea1a4.png) You can monitor the progress of your run in the job detail section on the right-hand side of the page. Once the run finishes successfully, you can find the results by following the link in the Storage Stats section of the job detail page. ## Next steps[​](#next-steps "Direct link to Next steps") With your data now in Keboola, you can integrate it with dozens of other services that Keboola supports. Check out the [complete list of integrations](https://www.keboola.com/product/integrations) to explore your options. You can set up a writer for a selected service using Keboola Writer or create [orchestrations](https://help.keboola.com/orchestrator/) to transform, merge, or split your data. Keboola Helper In Apify Store, you'll find the [Keboola Input Mapping Actor](https://apify.com/drobnikj/keboola-input-mapping), designed to streamline the integration between Apify and Keboola. This helper Actor parses the input table received from the Apify Keboola Data Source and maps the data into the required input format for another task or Actor within the Keboola platform. If you have any questions or need assistance, feel free to contact us at , through our live chat, or in our [developer community on Discord](https://discord.com/invite/jyEM2PRvMU). --- # 🦜🔗 LangChain integration **Learn how to integrate Apify with LangChain, in order to feed vector databases and LLMs with data crawled from the web.** *** > For more information on LangChain visit its [documentation](https://python.langchain.com/docs/). In this example, we'll use the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor, which can deeply crawl websites such as documentation, knowledge bases, help centers, or blogs and extract text content from the web pages. Then we feed the documents into a vector index and answer questions from it. This example demonstrates how to integrate Apify with LangChain using the Python language. If you prefer to use JavaScript, you can follow the [JavaScript LangChain documentation](https://js.langchain.com/docs/integrations/document_loaders/web_loaders/apify_dataset/). Before we start with the integration, we need to install all dependencies: `pip install langchain langchain-openai langchain-apify` After successful installation of all dependencies, we can start writing code. 
First, import all required packages: ``` import os from langchain.indexes import VectorstoreIndexCreator from langchain_apify import ApifyWrapper from langchain_core.documents import Document from langchain_core.vectorstores import InMemoryVectorStore from langchain_openai import ChatOpenAI from langchain_openai.embeddings import OpenAIEmbeddings ``` Find your [Apify API token](https://console.apify.com/account/integrations) and [OpenAI API key](https://platform.openai.com/account/api-keys) and initialize these into environment variable: ``` os.environ["OPENAI_API_KEY"] = "Your OpenAI API key" os.environ["APIFY_API_TOKEN"] = "Your Apify API token" ``` Run the Actor, wait for it to finish, and fetch its results from the Apify dataset into a LangChain document loader. Note that if you already have some results in an Apify dataset, you can load them directly using `ApifyDatasetLoader`, as shown in [this notebook](https://github.com/langchain-ai/langchain/blob/fe1eb8ca5f57fcd7c566adfc01fa1266349b72f3/docs/modules/indexes/document_loaders/examples/apify_dataset.ipynb). In that notebook, you'll also find the explanation of the `dataset_mapping_function`, which is used to map fields from the Apify dataset records to LangChain `Document` fields. ``` apify = ApifyWrapper() llm = ChatOpenAI(model="gpt-4o-mini") loader = apify.call_actor( actor_id="apify/website-content-crawler", run_input={"startUrls": [{"url": "https://python.langchain.com/docs/get_started/introduction"}], "maxCrawlPages": 10, "crawlerType": "cheerio"}, dataset_mapping_function=lambda item: Document( page_content=item["text"] or "", metadata={"source": item["url"]} ), ) ``` Crawling may take some time The Actor call may take some time as it crawls the LangChain documentation website. Initialize the vector index from the crawled documents: ``` index = VectorstoreIndexCreator( vectorstore_cls=InMemoryVectorStore, embedding=OpenAIEmbeddings() ).from_loaders([loader]) ``` And finally, query the vector index: ``` query = "What is LangChain?" result = index.query_with_sources(query, llm=llm) print("answer:", result["answer"]) print("source:", result["sources"]) ``` If you want to test the whole example, you can simply create a new file, `langchain_integration.py`, and copy the whole code into it. ``` import os from langchain.indexes import VectorstoreIndexCreator from langchain_apify import ApifyWrapper from langchain_core.documents import Document from langchain_core.vectorstores import InMemoryVectorStore from langchain_openai import ChatOpenAI from langchain_openai.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "Your OpenAI API key" os.environ["APIFY_API_TOKEN"] = "Your Apify API token" apify = ApifyWrapper() llm = ChatOpenAI(model="gpt-4o-mini") print("Call website content crawler ...") loader = apify.call_actor( actor_id="apify/website-content-crawler", run_input={"startUrls": [{"url": "https://python.langchain.com/docs/get_started/introduction"}], "maxCrawlPages": 10, "crawlerType": "cheerio"}, dataset_mapping_function=lambda item: Document(page_content=item["text"] or "", metadata={"source": item["url"]}), ) print("Compute embeddings...") index = VectorstoreIndexCreator( vectorstore_cls=InMemoryVectorStore, embedding=OpenAIEmbeddings() ).from_loaders([loader]) query = "What is LangChain?" 
result = index.query_with_sources(query, llm=llm) print("answer:", result["answer"]) print("source:", result["sources"]) ``` To run it, you can use the following command: `python langchain_integration.py` After running the code, you should see the following output: ``` answer: LangChain is a framework designed for developing applications powered by large language models (LLMs). It simplifies the entire application lifecycle, from development to productionization and deployment. LangChain provides open-source components and integrates with various third-party tools, making it easier to build and optimize applications using language models. source: https://python.langchain.com/docs/get_started/introduction ``` LangChain is a standard interface through which you can interact with a variety of large language models (LLMs). It provides modules you can use to build language model applications as well as chains and agents with memory capabilities. You can use all of Apify’s Actors as document loaders in LangChain. For example, to incorporate web browsing functionality, you can use the [RAG-Web-Browser Actor](https://apify.com/apify/rag-web-browser). This allows you to either crawl and scrape top pages from Google Search results or directly scrape text content from a URL and return it as Markdown. To set this up, change the `actor_id` to `apify/rag-web-browser` and specify the `run_input`. ``` loader = apify.call_actor( actor_id="apify/rag-web-browser", run_input={"query": "apify langchain web browser", "maxResults": 3}, dataset_mapping_function=lambda item: Document(page_content=item["text"] or "", metadata={"source": item["metadata"]["url"]}), ) print("Documents:", loader.load()) ``` Similarly, you can use other Apify Actors to load data into LangChain and query the vector index. ## Resources[​](#resources "Direct link to Resources") * [LangChain introduction](https://python.langchain.com/docs/get_started/introduction) * [Apify Dataset loader](https://python.langchain.com/docs/integrations/document_loaders/apify_dataset) * [LangChain Apify Provider](https://python.langchain.com/docs/integrations/providers/apify) --- # Langflow integration **Learn how to integrate Apify with Langflow to run complex AI agent workflows.** *** ## What is Langflow[​](#what-is-langflow "Direct link to What is Langflow") [Langflow](https://langflow.org/) is a low-code, visual tool that enables developers to build powerful AI agents and workflows that can use any API, model, or database. Explore Langflow For more information on Langflow, visit its [documentation](https://docs.langflow.org/). ## How to use Apify with Langflow[​](#how-to-use-apify-with-langflow "Direct link to How to use Apify with Langflow") This guide will demonstrate two different ways to use Apify Actors with Langflow: * **Calling Apify Actors in Langflow**: We will use the [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor to search Google for a query and extract the search results. * **Building a flow to search for a company's social media profiles**: We will use the [Google Search Results Scraper](https://apify.com/apify/google-search-scraper) Actor to search the web for social media profiles of a given company. Then, we will use the [TikTok Data Extractor](https://apify.com/clockworks/free-tiktok-scraper) Actor to extract data from the TikTok profiles. ### Prerequisites[​](#prerequisites "Direct link to Prerequisites") * **Apify API token**: To use Apify Actors in Langflow, you need an Apify API token.
If you don't have one, you can learn how to get it in the [Apify documentation](https://docs.apify.com/platform/integrations/api). * **OpenAI API key**: To work with agents in Langflow, you need an OpenAI API key. If you don't have one, you can get it from the [OpenAI platform](https://platform.openai.com/account/api-keys). #### Langflow[​](#langflow "Direct link to Langflow") Cloud vs local setup Langflow can either be installed locally or used in the cloud. The cloud version is available on the [Langflow](http://langflow.org/) website. If you are using the cloud version, you can skip the installation step, and go straight to [Creating a new flow](#creating-a-new-flow) First, install the Langflow platform using Python package and project manager [uv](https://docs.astral.sh/uv/): ``` uv pip install langflow ``` After installing Langflow, you can start the platform: ``` uv run langflow run ``` When the platform is started, open the Langflow UI using `http://127.0.0.1:7860` in your browser. > Other installation methods can be found in the [Langflow documentation](https://docs.langflow.org/get-started-installation). ### Creating a new flow[​](#creating-a-new-flow "Direct link to Creating a new flow") On the Langflow welcome screen, click the **New Flow** button and then create **Blank Flow**: ![New Flow screen - Blank Flow](/assets/images/new_blank_flow-8c5272acc3b2bf2b7779caff60c4726b.png) Now, you can start building your flow. ### Calling Apify Actors in Langflow[​](#calling-apify-actors-in-langflow "Direct link to Calling Apify Actors in Langflow") To call Apify Actors in Langflow, you need to add the **Apify Actors** component to the flow. From the bundle menu, add **Apify Actors** component: ![Flow - Add Apify Actors](/assets/images/bundles_apify-b72b75511bcd261c86b0b998951b77f4.png) Next, configure the Apify Actors components. First, input your API token (learn how to get it at [Integrations](https://docs.apify.com/platform/integrations/api)). Then, set the Actor ID of the component to `apify/rag-web-browser` to use the [RAG Web Browser](https://apify.com/apify/rag-web-browser). Set the **Run input** field to pass arguments to the Actor run, allowing it to search Google with the query `"what is monero?"` (full Actor input schema can be found in the [RAG Web Browser input schema](https://apify.com/apify/rag-web-browser/input-schema)): ``` {"query": "what is monero?", "maxResults": 3} ``` Click **Run**. ![Flow - Apify Actors Run](/assets/images/apify_actors_run-27164252e3d8b180c516959224db031f.png) After the run finishes, click **Output** to view the results. ![Flow - Apify Actors Output](/assets/images/apify_actors_output-66dc9d6763d338a42b6a17db1043e526.png) The output should look similar to this: ![Flow - Apify Actors Output Data](/assets/images/apify_actors_output_data-beaa74c6c9f876eca0034296d667f489.png) To filter only the `metadata` and `markdown` fields, set **Output fields** to `metadata,markdown`. Additionally, enable **Flatten output** by setting it to `true`. This will output only the metadata and text content from the search results. > Flattening is necessary when you need to access nested dictionary fields in the output data object; they cannot be accessed directly otherwise in the Data object. 
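Conceptually, flattening lifts nested fields to top-level keys so they can be referenced directly. The snippet below only illustrates the idea; the dot-separated key naming is an assumption for the example, not the component's exact implementation.

```python
def flatten(data: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Illustrative helper: lift nested dictionary fields to top-level keys."""
    flat = {}
    for key, value in data.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# A nested search result like this (illustrative data)...
result = {"markdown": "# Monero...", "metadata": {"url": "https://example.com", "title": "Monero"}}

# ...becomes addressable through flat keys such as "metadata.url" and "metadata.title".
print(flatten(result))
```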
![Flow - Apify Actors Output Filter](/assets/images/apify_actors_output_filter-670268c7cd24482d2f8f62ae7de8e5ea.png) When you run the component again, the output contains only the `markdown` and flattened `metadata` fields: ![Flow - Apify Actors Output Filtered](/assets/images/apify_actors_output_data_filtered-09d37d0f2511870d96745f8d772ff8d1.png) Now that you understand how to call Apify Actors, let's build a practical example where you search for a company's social media profiles and extract data from them. ### Building a flow to search for a company's social media profiles[​](#building-a-flow-to-search-for-a-companys-social-media-profiles "Direct link to Building a flow to search for a company's social media profiles") Create a new flow and add two **Apify Actors** components from the menu. Input your API token (learn how to get it in the [Integrations documentation](https://docs.apify.com/platform/integrations/api)) and set the Actor ID of the first component to `apify/google-search-scraper` and the second one to `clockworks/free-tiktok-scraper`: ![Flow - Actors configuration](/assets/images/apify_actors_configuration-9024e10771b3242b24afb7a9bfc57687.png) Add the **Agent** component from the menu and set your OpenAI API key (get it from the [OpenAI API keys page](https://platform.openai.com/account/api-keys)): Optimize Agent results For better results, switch the model to `gpt-4o` instead of `gpt-4o-mini` in the Agent configuration ![Flow - Agent configuration](/assets/images/agent_configuration-56902bb56e84143deff4c3b733823ae4.png) To be able to interact with the agent, add **Chat Input** and **Chat Output** components from the menu and connect them to the Agent component **Input** and **Response**. Then connect both Apify Actor components **Tool** outputs to the Agent component **Tools** input so that the agent can call the Apify Actors. The final flow that can search the web for a company's social media profiles and extract data from them should look like this: ![Flow - Final](/assets/images/flow-c1424ee0b06e6cc8178ce6aa33a550ed.png) Click the **Playground** button and chat with the agent to test the flow: ![Flow - Playground](/assets/images/playground-e56729ea661b2a8a089eb3f5084bc0c1.png) Here is an example agent output for the following query: ``` find tiktok profile of company openai using google search and then show me the profile bio and their latest video ``` ![Flow - agent output](/assets/images/agent_output-5a20a0f099edca9bafadcbda946ea412.png) --- # 🦜🔘➡️ LangGraph integration **Learn how to build AI Agents with Apify and LangGraph.** *** ## What is LangGraph[​](#what-is-langgraph "Direct link to What is LangGraph") [LangGraph](https://www.langchain.com/langgraph) is a framework designed for constructing stateful, multi-agent applications with Large Language Models (LLMs), allowing developers to build complex AI agent workflows that can leverage tools, APIs, and databases. Explore LangGraph For more in-depth details on LangGraph, check out its [official documentation](https://langchain-ai.github.io/langgraph/). 
## How to use Apify with LangGraph[​](#how-to-use-apify-with-langgraph "Direct link to How to use Apify with LangGraph") This guide will demonstrate how to use Apify Actors with LangGraph by building a ReAct agent that will use the [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor to search Google for TikTok profiles and [TikTok Data Extractor](https://apify.com/clockworks/free-tiktok-scraper) Actor to extract data from the TikTok profiles to analyze the profiles. ### Prerequisites[​](#prerequisites "Direct link to Prerequisites") * **Apify API token**: To use Apify Actors in LangGraph, you need an Apify API token. If you don't have one, you can learn how to obtain it in the [Apify documentation](https://docs.apify.com/platform/integrations/api). * **OpenAI API key**: In order to work with agents in LangGraph, you need an OpenAI API key. If you don't have one, you can get it from the [OpenAI platform](https://platform.openai.com/account/api-keys). * **Python packages**: You need to install the following Python packages: ``` pip install langgraph langchain-apify langchain-openai ``` ### Building the TikTok profile search and analysis agent[​](#building-the-tiktok-profile-search-and-analysis-agent "Direct link to Building the TikTok profile search and analysis agent") First, import all required packages: ``` import os from langchain_apify import ApifyActorsTool from langchain_core.messages import HumanMessage from langchain_openai import ChatOpenAI from langgraph.prebuilt import create_react_agent ``` Next, set the environment variables for the Apify API token and OpenAI API key: ``` os.environ["OPENAI_API_KEY"] = "Your OpenAI API key" os.environ["APIFY_API_TOKEN"] = "Your Apify API token" ``` Instantiate LLM and Apify Actors tools: ``` llm = ChatOpenAI(model="gpt-4o-mini") browser = ApifyActorsTool("apify/rag-web-browser") tiktok = ApifyActorsTool("clockworks/free-tiktok-scraper") ``` Create the ReAct agent with the LLM and Apify Actors tools: ``` tools = [browser, tiktok] agent_executor = create_react_agent(llm, tools) ``` Finally, run the agent and stream the messages: ``` for state in agent_executor.stream( stream_mode="values", input={ "messages": [ HumanMessage(content="Search the web for OpenAI TikTok profile and analyze their profile.") ] }): state["messages"][-1].pretty_print() ``` Search and analysis may take some time The agent tool call may take some time as it searches the web for OpenAI TikTok profiles and analyzes them. You will see the agent's messages in the console, which will show each step of the agent's workflow. ``` ================================ Human Message ================================= Search the web for OpenAI TikTok profile and analyze their profile. ================================== AI Message ================================== Tool Calls: apify_actor_apify_rag-web-browser (call_y2rbmQ6gYJYC2lHzWJAoKDaq) Call ID: call_y2rbmQ6gYJYC2lHzWJAoKDaq Args: run_input: {"query":"OpenAI TikTok profile","maxResults":1} ... ================================== AI Message ================================== The OpenAI TikTok profile is titled "OpenAI (@openai) Official." Here are some key details about the profile: - **Followers**: 592.3K - **Likes**: 3.3M - **Description**: The profile features "low key research previews" and includes videos that showcase their various projects and research developments. 
### Profile Overview: - **Profile URL**: [OpenAI TikTok Profile](https://www.tiktok.com/@openai?lang=en) - **Content Focus**: The posts primarily involve previews of OpenAI's research and various AI-related innovations. ... ``` If you want to test the whole example, you can simply create a new file, `langgraph_integration.py`, and copy the whole code into it. ``` import os from langchain_apify import ApifyActorsTool from langchain_core.messages import HumanMessage from langchain_openai import ChatOpenAI from langgraph.prebuilt import create_react_agent os.environ["OPENAI_API_KEY"] = "Your OpenAI API key" os.environ["APIFY_API_TOKEN"] = "Your Apify API token" llm = ChatOpenAI(model="gpt-4o-mini") browser = ApifyActorsTool("apify/rag-web-browser") tiktok = ApifyActorsTool("clockworks/free-tiktok-scraper") tools = [browser, tiktok] agent_executor = create_react_agent(llm, tools) for state in agent_executor.stream( stream_mode="values", input={ "messages": [ HumanMessage(content="Search the web for OpenAI TikTok profile and analyze their profile.") ] }): state["messages"][-1].pretty_print() ``` ## Resources[​](#resources "Direct link to Resources") * [Apify Actors](https://docs.apify.com/platform/actors) * [LangGraph - How to Create a ReAct Agent](https://langchain-ai.github.io/langgraph/how-tos/create-react-agent/) --- # Lindy integration **Learn how to integrate your Apify Actors with Lindy.** *** [Lindy](https://www.lindy.ai/) is an AI-powered automation platform that lets you create intelligent workflows and automate complex tasks. By integrating Apify with Lindy, you can leverage Apify's web scraping capabilities within Lindy's AI-driven automation workflows to extract data, monitor websites, and trigger actions based on scraped information. ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") To use the Apify integration with Lindy, you need: * A Lindy account with access to premium actions (required for certain integrations or higher usage limits). ## How to Run an Apify Actor from Lindy[​](#how-to-run-an-apify-actor-from-lindy "Direct link to How to Run an Apify Actor from Lindy") This section demonstrates how to integrate Apify's data extraction capabilities into Lindy's AI automation. 1. Start a new Lindy workflow by clicking the **+ New Lindy** button. ![Lindy dashboard with new Lindy button highlighted](/assets/images/lindy-new-button-455860dd5a4537f85d9ffcfd59434492.png) Select **Start from scratch** to build a custom workflow. ![Lindy workflow creation options, with \"Start from scratch\" selected](/assets/images/lindy-scratch-c8420dba2bf3586f860ddc538c612815.png) 2. Choose a trigger that will initiate your automation. For this demonstration, we will select **Chat with Lindy/Message received**. This allows you to trigger the Apify Actor simply by sending a message to Lindy. ![Lindy workflow editor, showing trigger selection, with \"Select Trigger\" highlighted](/assets/images/lindy-trigger-7c76b60f84ca086a502e157bc3b65a50.png) ![Lindy workflow editor, showing trigger selection, with \"Chat with Lindy/Message received\" chosen](/assets/images/lindy-received-d7214e022c2e00d51664bce3c804bb17.png) 3. After setting the trigger, select **Perform an Action**. ![Lindy workflow editor, showing the option to \"Perform an Action\"](/assets/images/lindy-action-33de046c5cd6e51ae9644373a9dd44a9.png) In the action search box, search for "Apify" or navigate to the **Scrapers** category and choose **Run Actor**. 
![Lindy action search box with \"Apify\" typed, showing \"Run Actor\" option, or \"Scrapers\" category with \"Run Actor\" highlighted.](/assets/images/lindy-run-actor-4628894db97dd6e69cf1e9b479deeca8.png) 4. Configure the Apify "Run Actor" module. In the module's configuration, choose the Actor you want to execute. For example, select the **Instagram profile scraper**. ![Apify \"Run Actor\" module configuration in Lindy, showing a dropdown or search for Actors, with \"Instagram profile scraper\" selected.](/assets/images/lindy-instagram-actor-755247f6a8302bc2e9e77a791d458024.png) Actor Availability You have access to thousands of Actors available on the [Apify Store](https://apify.com/store). Please note that Actors using the *rental pricing model* are not available for use with this integration. For details on Actor pricing models, refer to our [Pricing Documentation](https://docs.apify.com/platform/actors/publishing/monetize.md#rental-pricing-model). This establishes the fundamental workflow: *Chatting with Lindy can now trigger the Apify Instagram Profile Scraper.* ### Extending Your Workflow[​](#extending-your-workflow "Direct link to Extending Your Workflow") Lindy offers different triggers (e.g., *email received*, *Slack message received*, etc.) and actions beyond running an Actor. After the Apify Actor run is initiated, you can define what happens next, depending on your needs: * **When Actor Run Starts:** * You might want to send a notification. * Log the start time. * Run a pre-processing step. * **After Results Are Available:** Once the Apify Actor completes and its results are ready, you can: * Retrieve the Actor's output data from its dataset. * Pass the extracted data to Lindy's AI for summarization, analysis, content generation, or other AI-driven tasks. * Route the data to other services (e.g., Google Sheets, databases, email notifications) using Lindy's action modules. ## Available Actions in Lindy for Apify[​](#available-actions-in-lindy-for-apify "Direct link to Available Actions in Lindy for Apify") While Lindy's specific module names may evolve, the core Apify functionalities typically exposed are: * **Run Actor:** Initiates a specific Apify Actor and can optionally wait for its completion. --- # LlamaIndex integration **Learn how to integrate Apify with LlamaIndex to feed vector databases and LLMs with data crawled from the web.** *** > For more information on LlamaIndex, visit its [documentation](https://docs.llamaindex.ai/en/stable/). ## What is LlamaIndex?[​](#what-is-llamaindex "Direct link to What is LlamaIndex?") LlamaIndex is a data framework for connecting custom data sources to large language models (LLMs). ## How to integrate Apify with LlamaIndex?[​](#how-to-integrate-apify-with-llamaindex "Direct link to How to integrate Apify with LlamaIndex?") You can integrate an Apify dataset or an Apify Actor with LlamaIndex. Before we start with the integration, we need to install all dependencies: `pip install apify-client llama-index-core llama-index-readers-apify` After successfully installing all dependencies, we can start writing Python code. ### Apify Actor[​](#apify-actor "Direct link to Apify Actor") To use the Apify Actor, import `ApifyActor` and `Document`, and set your [Apify API token](https://docs.apify.com/platform/integrations/api#api-token) in the code. The following example uses the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to crawl an entire website, which will extract text content from the web pages.
The extracted text is formatted as a llama\_index `Document` and can be fed to a vector store or language model like GPT. ``` from llama_index.core import Document from llama_index.readers.apify import ApifyActor reader = ApifyActor("") documents = reader.load_data( actor_id="apify/website-content-crawler", run_input={ "startUrls": [{"url": "https://docs.llamaindex.ai/en/latest/"}] }, dataset_mapping_function=lambda item: Document( text=item.get("text"), metadata={ "url": item.get("url"), }, ), ) ``` ### Apify Dataset[​](#apify-dataset "Direct link to Apify Dataset") To download Apify Dataset, import `ApifyDataset` and `Document` and load the dataset using a dataset ID. ``` from llama_index.core import Document from llama_index.readers.apify import ApifyDataset reader = ApifyDataset("") documents = reader.load_data( dataset_id="my_dataset_id", dataset_mapping_function=lambda item: Document( text=item.get("text"), metadata={ "url": item.get("url"), }, ), ) ``` ## Resources[​](#resources "Direct link to Resources") * [Apify loaders](https://llamahub.ai/l/readers/llama-index-readers-apify) * [LlamaIndex documentation](https://docs.llamaindex.ai/en/stable/) --- # Make integration **Learn how to integrate your Apify Actors with Make.** *** [Make](https://www.make.com/) *(formerly Integromat)* allows you to create scenarios where you can integrate various services (modules) to automate and centralize jobs. Apify has its own module you can use to run Apify Actors, get notified about run statuses, and receive Actor results directly in your Make scenario. ## Connect Apify to Make[​](#connect-apify-to-make "Direct link to Connect Apify to Make") To use the Apify integration on Make, you will need: * An [Apify account](https://console.apify.com/). * A Make account (and a [scenario](https://www.make.com/en/help/scenarios/creating-a-scenario)). ### Add the Apify module to scenario[​](#add-the-apify-module-to-scenario "Direct link to Add the Apify module to scenario") Add the Apify module to your scenario. You can find this module by searching for "Apify" in the module search bar. Next, select one of the available options under Triggers, Actions and Searches, then click on the Apify module to open its configuration window. ![Apify module](/assets/images/apify-module-0c1255d3ce1dfa02ab5c155acb632e25.png) ### Create a connection to Apify[​](#create-a-connection-to-apify "Direct link to Create a connection to Apify") In the Connection configuration window, you'll authorize the connection between Make and Apify. The recommended method is to use an OAuth connection. Alternatively, you can choose to connect using Apify API token: 1. You will need to provide your Apify API token in the designated field. ![API token](/assets/images/apify-token-6f9beb3af8bba7899b601d4aa8b6e77c.png) 2. You can find this token in the Apify Console by navigating to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** ![Integrations token](/assets/images/apify-integrations-token-a480c4034e9658f9989b7c661ee0fad5.png) 3. Finally, copy your API token from Apify, paste it into the Make module, and save to create the connection. Congratulations! You have successfully connected the Apify app and can now use it in your scenarios. 
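If you want to double-check that the token you pasted into Make is valid, you can optionally call the Apify API yourself. The following is a minimal sketch using the `apify-client` Python package (`pip install apify-client`); it is not part of the Make setup, just a quick sanity check.

```python
from apify_client import ApifyClient

# The same token you copied from Apify Console > Settings > API & Integrations.
client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Fetches the account the token belongs to; a successful response means the token works.
me = client.user().get()
print(me["username"])
```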
## Run an Actor or task with Output[​](#run-an-actor-or-task-with-output "Direct link to Run an Actor or task with Output") There are two ways to run an Actor or task and retrieve its data in Make.com, depending on your needs and the complexity of the Actor: * **Synchronous run using the action module** * **Asynchronous run using the trigger module** info Make.com imposes a hard timeout for synchronous runs; the timeout varies based on your plan. If the Actor or task takes longer than the timeout to complete, the data will not be fully returned. If you anticipate that the Actor run will exceed the timeout, use the asynchronous method with a trigger module instead. The primary difference between the two methods is that the synchronous run waits for the Actor or task to finish and retrieves its output using the "Get Dataset Items" module. By contrast, the asynchronous run watches for the run of an Actor or task (which could have been triggered from another scenario, manually from the Apify Console, or elsewhere) and gets its output once it finishes. ### Synchronous run using the action module[​](#synchronous-run-using-the-action-module "Direct link to Synchronous run using the action module") In this example, we will demonstrate how to run an Actor synchronously and export the output to Google Sheets. The same principle applies to the module that runs a task. #### Step 1: Add the Apify "Run an Actor" Module[​](#step-1-add-the-apify-run-an-actor-module "Direct link to Step 1: Add the Apify \"Run an Actor\" Module") First, ensure that you have [connected your Apify account to Make.com](#create-a-connection-to-apify). Next, add the Apify module called "Run an Actor" to your scenario and configure it. For this example, we will use the "Google Maps Review Scraper" Actor. Make sure to set the "Run synchronously" option to "Yes," so the module waits for the Actor run to finish. ![make-com-sync-2.png](/assets/images/make-com-sync-2-c431fd46874e293055d844712ce5b45f.png) #### Step 2: Add the Apify "Get Dataset Items" module[​](#step-2-add-the-apify-get-dataset-items-module "Direct link to Step 2: Add the Apify \"Get Dataset Items\" module") In the next step, add the "Get Dataset Items" module to your scenario, which is responsible for retrieving the output data from the Actor run. In the "Dataset ID" field, provide the default dataset ID from the Actor run. You can find this dataset ID in the variables generated by the previous "Run an Actor" module. If the variables do not appear, run the scenario first, then check again. ![make-com-sync-3.png](/assets/images/make-com-sync-3-c7ab14a5b1725a7a414dd15f57471922.png) #### Step 3: Add the Google Sheets "Create Spreadsheet Rows" module[​](#step-3-add-the-google-sheets-create-spreadsheet-rows-module "Direct link to Step 3: Add the Google Sheets \"Create Spreadsheet Rows\" module") Finally, add the Google Sheets "Bulk Add Rows" module to your scenario. This module will automatically create new rows in a Google Sheets file to store the Actor's output. In the "Spreadsheet ID" field, provide the ID of the target Google Sheets file, which you can find in its URL. Configure the column range (e.g., "A-Z") and map the data retrieved from the "Get Dataset Items" module to the row values. ![make-com-sync-4.png](/assets/images/make-com-sync-4-6cfd606dbad43eb915e401e13a9bf2c7.png) You’re all set! Once the scenario is started, it will run the Actor synchronously and export its output to your Google Sheets file.
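If it helps to see what these two modules correspond to programmatically, here is a rough sketch of the synchronous pattern using the `apify-client` Python package: `.call()` starts the Actor and waits for it to finish (like "Run an Actor" with "Run synchronously" set to "Yes"), and listing the run's default dataset corresponds to "Get Dataset Items". The Actor ID and input below are placeholders, not the exact configuration from the screenshots.

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Placeholder Actor ID and input - replace with the Actor and input you actually use.
run = client.actor("<username>/<actor-name>").call(
    run_input={"startUrls": [{"url": "https://example.com"}]},
)

# The finished run object carries the default dataset ID that "Get Dataset Items" reads from.
items = client.dataset(run["defaultDatasetId"]).list_items().items
for item in items:
    print(item)
```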
### Asynchronous run using the trigger module[​](#asynchronous-run-using-the-trigger-module "Direct link to Asynchronous run using the trigger module") In this example, we will demonstrate how to run an Actor asynchronously and export its output to Google Sheets. Before starting, decide where you want to initiate the Actor run. You can do this manually via the Apify console, on a schedule, or from a separate Make.com scenario. #### Step 1: Add the Apify "Watch Actor Runs" Module[​](#step-1-add-the-apify-watch-actor-runs-module "Direct link to Step 1: Add the Apify \"Watch Actor Runs\" Module") First, ensure that you have [connected your Apify account to Make.com](#create-a-connection-to-apify). Next, add the Apify module called "Watch Actor Runs" to your scenario. This module will set up a webhook to listen for the finished runs of the selected Actor. For this example, we will use the "Google Maps Review Scraper" Actor. ![make-com-async-1.png](/assets/images/make-com-async-1-97d812db1bd9bd4f58e7bdef0ee47b76.png) #### Step 2: Add the Apify "Get Dataset Items" module[​](#step-2-add-the-apify-get-dataset-items-module-1 "Direct link to Step 2: Add the Apify \"Get Dataset Items\" module") Add the "Get Dataset Items" module to your scenario to retrieve the output of the Actor run. In the "Dataset ID" field, provide the default dataset ID from the Actor run. You can find the dataset ID in the variables generated by the "Watch Actor Runs" module. ![make-com-async-2.png](/assets/images/make-com-async-2-053080fc056eb8207b3150b691c98a0a.png) #### Step 3: Add the Google Sheets "Create Spreadsheet Rows" module[​](#step-3-add-the-google-sheets-create-spreadsheet-rows-module-1 "Direct link to Step 3: Add the Google Sheets \"Create Spreadsheet Rows\" module") Finally, add the Google Sheets "Bulk Add Rows" module to your scenario, which will create new rows in the specified Google Sheets file to store the Actor's output. In the "Spreadsheet ID" field, enter the ID of the target Google Sheets file, which you can find in its URL. Configure the column range (e.g., "A-Z") and map the data retrieved from the "Get Dataset Items" module to the row values. ![make-com-async-3.png](/assets/images/make-com-async-3-42faeb79aded20b704b101eb9dc071b4.png) That’s it! Once the Actor run is complete, its data will be exported to the Google Sheets file. You can initiate the Actor run via the Apify console, a scheduler, or from another Make.com scenario. ## Available modules and triggers[​](#available-modules-and-triggers "Direct link to Available modules and triggers") ### Triggers[​](#triggers "Direct link to Triggers") * **Watch Actor Runs:** Triggers when a selected Actor run is finished. * **Watch Task Runs:** Triggers when a selected task run is finished. ### Actions[​](#actions "Direct link to Actions") * **Run a Task:** Runs a selected Actor task. * **Run an Actor:** Runs a selected Actor. * **Scrape Single URL:** Runs a scraper for the website and returns its content as text, markdown and HTML. * **Make an API Call:** Makes an arbitrary authorized API call. ### Searches[​](#searches "Direct link to Searches") * **Get Dataset Items:** Retrieves items from a [dataset](https://docs.apify.com/platform/storage/dataset.md). 
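For comparison, the asynchronous pattern roughly corresponds to starting a run without waiting for it and reading its dataset only after the run has finished; in Make this waiting is handled for you by the "Watch Actor Runs" webhook. Below is a hedged sketch with the `apify-client` Python package, again with a placeholder Actor ID.

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_API_TOKEN")

# Start the run and return immediately (placeholder Actor ID and input).
run = client.actor("<username>/<actor-name>").start(
    run_input={"startUrls": [{"url": "https://example.com"}]},
)

# Later, for example after a webhook notification, wait for the run to finish
# and read its default dataset - the counterpart of "Get Dataset Items".
finished_run = client.run(run["id"]).wait_for_finish()
items = client.dataset(finished_run["defaultDatasetId"]).list_items().items
print(f"Retrieved {len(items)} items")
```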
--- # Make - AI crawling Actor integration ## Apify Scraper for AI Crawling[​](#apify-scraper-for-ai-crawling "Direct link to Apify Scraper for AI Crawling") Apify Scraper for AI Crawling from [Apify](https://apify.com/) lets you extract text content from websites to feed AI models, LLM applications, vector databases, or Retrieval Augmented Generation (RAG) pipelines. It supports rich formatting using Markdown, cleans the HTML of irrelevant elements, downloads linked files, and integrates with AI ecosystems like LangChain, LlamaIndex, and other LLM frameworks. To use these modules, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate content extraction at scale and incorporate the results into your AI workflows. ## Connect Apify Scraper for AI Crawling[​](#connect-apify-scraper-for-ai-crawling "Direct link to Connect Apify Scraper for AI Crawling") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/wcc-signup-3466817a3b1d42a890677c24908da9e1.png) 2. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. ![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 3. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 4. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 5. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 6. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Make API token]() Once connected, you can build workflows to automate website extraction and integrate results into your AI applications. ## Apify Scraper for Website Content modules[​](#apify-scraper-for-website-content-modules "Direct link to Apify Scraper for Website Content modules") After connecting the app, you can use one of the two modules as native scrapers to extract website content. ### Standard Settings Module[​](#standard-settings-module "Direct link to Standard Settings Module") The Standard Settings module is a streamlined component of the Website Content Crawler that allows you to quickly extract content from websites using optimized default settings. This module is perfect for extracting content from blogs, documentation sites, knowledge bases, or any text-rich website to feed into AI models. #### How it works[​](#how-it-works "Direct link to How it works") The crawler starts with one or more **Start URLs** you provide, typically the top-level URL of a documentation site, blog, or knowledge base. 
It then: * Crawls these start URLs * Finds links to other pages on the site * Recursively crawls those pages as long as their URL is under the start URL * Respects URL patterns for inclusion/exclusion * Automatically skips duplicate pages with the same canonical URL * Provides various settings to customize crawling behavior (crawler type, max pages, depth, concurrency, etc.) Once a web page is loaded, the Actor processes its HTML to ensure quality content extraction: * Waits for dynamic content to load if using a headless browser * Can scroll to a certain height to ensure all page content is loaded * Can expand clickable elements to reveal hidden content * Removes DOM nodes matching specific CSS selectors (like navigation, headers, footers) * Optionally keeps only content matching specific CSS selectors * Removes cookie warnings using browser extensions * Transforms the page using the selected HTML transformer to extract the main content #### Output data[​](#output-data "Direct link to Output data") For each crawled web page, you'll receive: * *Page metadata*: URL, title, description, canonical URL * *Cleaned text content*: The main article content with irrelevant elements removed * *Markdown formatting*: Structured content with headers, lists, links, and other formatting preserved * *Crawl information*: Loaded URL, referrer URL, timestamp, HTTP status * *Optional file downloads*: PDFs, DOCs, and other linked documents Sample output (shortened) ``` { "url": "https://docs.apify.com/academy/web-scraping-for-beginners", "crawl": { "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners", "loadedTime": "2025-04-22T14:33:20.514Z", "referrerUrl": "https://docs.apify.com/academy", "depth": 1, "httpStatusCode": 200 }, "metadata": { "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners", "title": "Web scraping for beginners | Apify Documentation", "description": "Learn the basics of web scraping with a step-by-step tutorial and practical exercises.", "languageCode": "en", "markdown": "# Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\n## What is web scraping?\n\nWeb scraping is the process of extracting data from websites. It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\n## Why learn web scraping?\n\n- **Data collection**: Gather information for research, analysis, or business intelligence\n- **Automation**: Save time by automating repetitive data collection tasks\n- **Integration**: Connect web data with your applications or databases\n- **Monitoring**: Track changes on websites automatically\n\n## Getting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n...", "text": "Web scraping for beginners\n\nWelcome to our comprehensive web scraping tutorial for beginners. This guide will take you through the fundamentals of extracting data from websites, with practical examples and exercises.\n\nWhat is web scraping?\n\nWeb scraping is the process of extracting data from websites. 
It involves making HTTP requests to web servers, downloading HTML pages, and parsing them to extract the desired information.\n\nWhy learn web scraping?\n\n- Data collection: Gather information for research, analysis, or business intelligence\n- Automation: Save time by automating repetitive data collection tasks\n- Integration: Connect web data with your applications or databases\n- Monitoring: Track changes on websites automatically\n\nGetting started\n\nTo begin web scraping, you'll need to understand the basics of HTML, CSS selectors, and HTTP. This tutorial will guide you through these concepts step by step.\n\n..." } } ``` ### Advanced Settings Module[​](#advanced-settings-module "Direct link to Advanced Settings Module") The Advanced Settings module provides complete control over the content extraction process, allowing you to fine-tune every aspect of the crawling and transformation pipeline. This module is ideal for complex websites, JavaScript-heavy applications, or when you need precise control over content extraction. #### Key features[​](#key-features "Direct link to Key features") * *Multiple Crawler Options*: Choose between headless browsers (Playwright) or faster HTTP clients (Cheerio) * *Custom Content Selection*: Specify exactly which elements to keep or remove * *Advanced Navigation Control*: Set crawling depth, scope, and URL patterns * *Dynamic Content Handling*: Wait for JavaScript-rendered content to load * *Interactive Element Support*: Click expandable sections to reveal hidden content * *Multiple Output Formats*: Save content as Markdown, HTML, or plain text * *Proxy Configuration*: Use proxies to handle geo-restrictions or avoid IP blocks * *Content Transformation Options*: Multiple algorithms for optimal content extraction #### How it works[​](#how-it-works-1 "Direct link to How it works") The Advanced Settings module provides granular control over the entire crawling process: 1. *Crawler Selection*: Choose from Playwright (Firefox/Chrome), or Cheerio based on website complexity 2. *URL Management*: Define precise scoping with include/exclude URL patterns 3. *DOM Manipulation*: Control which HTML elements to keep or remove 4. *Content Transformation*: Apply specialized algorithms for content extraction 5. *Output Formatting*: Select from multiple formats for AI model compatibility #### Configuration options[​](#configuration-options "Direct link to Configuration options") Advanced Settings offers numerous configuration options, including: * *Crawler Type*: Select the rendering engine (browser or HTTP client) * *Content Extraction Algorithm*: Choose from multiple HTML transformers * *Element Selectors*: Specify which elements to keep, remove, or click * *URL Patterns*: Define URL inclusion/exclusion patterns with glob syntax * *Crawling Parameters*: Set concurrency, depth, timeouts, and retries * *Proxy Configuration*: Configure proxy settings for robust crawling * *Output Options*: Select content formats and storage options #### Output data[​](#output-data-1 "Direct link to Output data") In addition to the standard output fields, Advanced Settings provides: * *Multiple Format Options*: Content in Markdown, HTML, or plain text * *Debug Information*: Detailed extraction diagnostics and snapshots * *HTML Transformations*: Results from different content extraction algorithms * *File Storage Options*: Flexible storage for HTML, screenshots, or downloaded files Looking for more than just AI crawling? 
You can use other native Make apps powered by Apify: * [Instagram Data](https://docs.apify.com/platform/integrations/make/instagram.md) * [TikTok Data](https://docs.apify.com/platform/integrations/make/tiktok.md) * [Google Search](https://docs.apify.com/platform/integrations/make/search.md) * [Google Maps Emails Data](https://docs.apify.com/platform/integrations/make/maps.md) * [YouTube Data](https://docs.apify.com/platform/integrations/make/youtube.md) * [Amazon](https://docs.apify.com/platform/integrations/make/amazon.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). --- # Make - Amazon Actor integration ## Apify Scraper for Amazon Data[​](#apify-scraper-for-amazon-data "Direct link to Apify Scraper for Amazon Data") The Amazon Scraper module from [Apify](https://apify.com) allows you to extract product, search, or category data from Amazon. To use the module, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token), which you can find in the Apify Console under **Settings > Integrations**. After connecting, you can automate data extraction and incorporate the results into your workflows. ## Connect Apify Scraper for Amazon Data modules to Make[​](#connect-apify-scraper-for-amazon-data-modules-tomake "Direct link to Connect Apify Scraper for Amazon Data modules to Make") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/image-dfc813ec1133c63789d9ee81a0e21a5f.png) 2. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. ![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 3. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 4. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 5. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 6. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Make API token](/assets/images/image1-e02e5cbfa882177befb386f0341fe3d3.png) Once connected, you can build workflows to automate Amazon data extraction and integrate results into your applications. ## Apify Scraper for Amazon Data module[​](#apify-scraper-for-amazon-data-module "Direct link to Apify Scraper for Amazon Data module") After connecting the app, you can use the Search module as a native scraper to extract public Amazon data. Here’s what you get: ### Extract Amazon data[​](#extract--amazon-data "Direct link to Extract Amazon data") Get data via [Apify's Amazon Scraper](https://apify.com/junglee/free-amazon-product-scraper). Fill in the URLs of products, searches, or categories you want to gather information about. 
For Amazon URLs, you can extract: * *Product title* * *Amazon ASIN number* * *Brand name* * *Number of reviews* * *Image* * *Description* * *Price value and currency* Example ``` [ { "title": "Logitech M185 Wireless Mouse, 2.4GHz with USB Mini Receiver, 12-Month Battery Life, 1000 DPI Optical Tracking, Ambidextrous PC/Mac/Laptop - Swift Grey", "asin": "B004YAVF8I", "brand": "Logitech", "stars": 4.5, "reviewsCount": 37418, "thumbnailImage": "https://m.media-amazon.com/images/I/5181UFuvoBL.__AC_SX300_SY300_QL70_FMwebp_.jpg", "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice", "description": "Logitech Wireless Mouse M185. A simple, reliable mouse with plug-and-play wireless, a 1-year battery life and 3-year limited hardware warranty.(Battery life may vary based on user and computing conditions.) System Requirements: Windows Vista Windows 7 Windows 8 Windows 10|Mac OS X 10.5 or later|Chrome OS|Linux kernel 2.6+|USB port", "price": { "value": 13.97, "currency": "$" }, "url": "https://www.amazon.com/dp/B004YAVF8I" }, { "title": "Logitech MX Master 3S - Wireless Performance Mouse with Ultra-fast Scrolling, Ergo, 8K DPI, Track on Glass, Quiet Clicks, USB-C, Bluetooth, Windows, Linux, Chrome - Graphite", "asin": "B09HM94VDS", "brand": "Logitech", "stars": 4.5, "reviewsCount": 9333, "thumbnailImage": "https://m.media-amazon.com/images/I/41+eEANAv3L._AC_SY300_SX300_.jpg", "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice", "description": "Logitech MX Master 3S Performance Wireless Mouse Introducing Logitech MX Master 3S – an iconic mouse remastered. Now with Quiet Clicks(2) and 8K DPI any-surface tracking for more feel and performance than ever before. Product details: Weight: 4.97 oz (141 g) Dimensions: 2 x 3.3 x 4.9 in (51 x 84.3 x 124.9 mm) Compatible with Windows, macOS, Linux, Chrome OS, iPadOS, Android operating systems (8) Rechargeable Li-Po (500 mAh) battery Sensor technology: Darkfield high precision Buttons: 7 buttons (Left/Right-click, Back/Forward, App-Switch, Wheel mode-shift, Middle click), Scroll Wheel, Thumbwheel, Gesture button Wireless operating distance: 33 ft (10 m) (9)Footnotes: (1) 4 mm minimum glass thickness (2) Compared to MX Master 3, MX Master 3S has 90% less Sound Power Level left and right click, measured at 1m (3) Compared to regular Logitech mouse without an electromagnetic scroll wheel (4) Compared to Logitech Master 2S mouse with Logitech Options installed and Smooth scrolling enabled (5) Requires Logi Options+ software, available for Windows and macOS (6) Not compatible with Logitech Unifying technology (7) Battery life may vary based on user and computing conditions. 
(8) Device basic functions will be supported without software for operating systems other than Windows and macOS (9) Wireless range may vary depending on operating environment and computer setup", "price": { "value": 89.99, "currency": "$" }, "url": "https://www.amazon.com/dp/B09HM94VDS" }, { "title": "Apple Magic Mouse - White Multi-Touch Surface ", "asin": "B0DL72PK1P", "brand": "Apple", "stars": 4.6, "reviewsCount": 18594, "thumbnailImage": "", "breadCrumbs": "", "description": null, "price": { "value": 78.99, "currency": "$" }, "url": "https://www.amazon.com/dp/B0DL72PK1P" } ] ``` Search data sample ``` [ { "title": "Logitech MK270 Wireless Keyboard And Mouse Combo For Windows, 2.4 GHz Wireless, Compact Mouse, 8 Multimedia And Shortcut Keys, For PC, Laptop - Black", "asin": "B079JLY5M5", "brand": "Logitech", "stars": 4.5, "reviewsCount": 107637, "thumbnailImage": "https://m.media-amazon.com/images/I/61gSpxZTZZL.__AC_SX300_SY300_QL70_ML2_.jpg", "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Keyboard & Mouse Combos", "description": "The stylish Logitech MK270 Wireless Keyboard and Mouse Combo is perfect for the home office or workplace. Ditch the touchpad for this full size keyboard and mouse. Easily connect using Logitech's plug and forget receiver—just plug it into the USB port, and you're ready to work. There's no lengthy installation procedure to slow you down. When you're on the move, the receiver stores comfortably inside the mouse. Both the keyboard and mouse included in the MK270 combo use wireless 2.4GHz connectivity to provide seamless, interruption free use. Use the keyboard within a 10 m range without keyboard lag. Work for longer with the MK270's long battery life. The keyboard can be used for up to 24 months, and the mouse for 12 months, without replacing batteries. 
The Logitech MK270 keyboard includes 8 hotkeys that are programmable to your most used applications to boost your productivity.", "price": { "value": 21.98, "currency": "$" }, "url": "https://www.amazon.com/dp/B079JLY5M5" }, { "title": "Wireless Keyboard and Mouse Combo - Round Keycaps, Full-Size Retro Typewriter Keyboard with Detachable Wrist Rest, Sleep Mode & Tilt Legs, 2.4GHz Cordless Connection for Mac/Windows/PC (Hot Pink)", "asin": "B0CQJV4BW3", "brand": "SABLUTE", "stars": 4.3, "reviewsCount": 928, "thumbnailImage": "https://m.media-amazon.com/images/I/61NOammUF2L.__AC_SY300_SX300_QL70_FMwebp_.jpg", "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Keyboard & Mouse Combos", "description": null, "price": { "value": 39.99, "currency": "$" }, "url": "https://www.amazon.com/dp/B0CQJV4BW3" }, { "title": "Redragon S101 Gaming Keyboard, M601 Mouse, RGB Backlit Gaming Keyboard, Programmable Backlit Gaming Mouse, Value Combo Set [New Version]", "asin": "B00NLZUM36", "brand": "Redragon", "stars": 4.6, "reviewsCount": 46346, "thumbnailImage": "https://m.media-amazon.com/images/I/71QDJHG1PqL.__AC_SX300_SY300_QL70_FMwebp_.jpg", "breadCrumbs": "Video Games›PC›Accessories›Gaming Keyboards", "description": null, "price": { "value": 39.99, "currency": "$" }, "url": "https://www.amazon.com/dp/B00NLZUM36" } ] ``` Product data sample ``` [ { "title": "Amazon Basics Wired Keyboard, Full-Sized, QWERTY Layout, Black", "asin": "B07WJ5D3H4", "brand": "Amazon Basics", "stars": 4.5, "reviewsCount": 7606, "thumbnailImage": "https://m.media-amazon.com/images/I/71ehwfAM4-L.__AC_SY300_SX300_QL70_FMwebp_.jpg", "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Keyboards", "description": "Product DescriptionFeaturing a standard US QWERTY layout, the keyboard provides comfort and familiarity, while the sleek black design complements any tech setup or decor seamlessly. This wired keyboard and mouse set is great for those seeking a wired keyboard and mouse for home or office use. The mouse and keyboard combo offers a practical plug-and-play solution, compatible with multiple versions of Windows, including 2000, XP, Vista, 7, 8, and 10/11.From the ManufacturerAmazon Basics", "price": { "value": 18.04, "currency": "$" }, "url": "https://www.amazon.com/dp/B07WJ5D3H4" } ] ``` Category data sample ``` [ { "title": "Logitech M185 Wireless Mouse, 2.4GHz with USB Mini Receiver, 12-Month Battery Life, 1000 DPI Optical Tracking, Ambidextrous PC/Mac/Laptop - Swift Grey", "asin": "B004YAVF8I", "brand": "Logitech", "stars": 4.5, "reviewsCount": 37418, "thumbnailImage": "https://m.media-amazon.com/images/I/5181UFuvoBL.__AC_SX300_SY300_QL70_FMwebp_.jpg", "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice", "description": "Logitech Wireless Mouse M185. A simple, reliable mouse with plug-and-play wireless, a 1-year battery life and 3-year limited hardware warranty.(Battery life may vary based on user and computing conditions.) 
System Requirements: Windows Vista Windows 7 Windows 8 Windows 10|Mac OS X 10.5 or later|Chrome OS|Linux kernel 2.6+|USB port", "price": { "value": 13.97, "currency": "$" }, "url": "https://www.amazon.com/dp/B004YAVF8I" }, { "title": "Logitech MX Master 3S - Wireless Performance Mouse with Ultra-fast Scrolling, Ergo, 8K DPI, Track on Glass, Quiet Clicks, USB-C, Bluetooth, Windows, Linux, Chrome - Graphite", "asin": "B09HM94VDS", "brand": "Logitech", "stars": 4.5, "reviewsCount": 9333, "thumbnailImage": "https://m.media-amazon.com/images/I/41+eEANAv3L._AC_SY300_SX300_.jpg", "breadCrumbs": "Electronics›Computers & Accessories›Computer Accessories & Peripherals›Keyboards, Mice & Accessories›Mice", "description": "Logitech MX Master 3S Performance Wireless Mouse Introducing Logitech MX Master 3S – an iconic mouse remastered. Now with Quiet Clicks(2) and 8K DPI any-surface tracking for more feel and performance than ever before. Product details: Weight: 4.97 oz (141 g) Dimensions: 2 x 3.3 x 4.9 in (51 x 84.3 x 124.9 mm) Compatible with Windows, macOS, Linux, Chrome OS, iPadOS, Android operating systems (8) Rechargeable Li-Po (500 mAh) battery Sensor technology: Darkfield high precision Buttons: 7 buttons (Left/Right-click, Back/Forward, App-Switch, Wheel mode-shift, Middle click), Scroll Wheel, Thumbwheel, Gesture button Wireless operating distance: 33 ft (10 m) (9)Footnotes: (1) 4 mm minimum glass thickness (2) Compared to MX Master 3, MX Master 3S has 90% less Sound Power Level left and right click, measured at 1m (3) Compared to regular Logitech mouse without an electromagnetic scroll wheel (4) Compared to Logitech Master 2S mouse with Logitech Options installed and Smooth scrolling enabled (5) Requires Logi Options+ software, available for Windows and macOS (6) Not compatible with Logitech Unifying technology (7) Battery life may vary based on user and computing conditions. (8) Device basic functions will be supported without software for operating systems other than Windows and macOS (9) Wireless range may vary depending on operating environment and computer setup", "price": { "value": 89.99, "currency": "$" }, "url": "https://www.amazon.com/dp/B09HM94VDS" }, { "title": "Apple Magic Mouse - White Multi-Touch Surface ", "asin": "B0DL72PK1P", "brand": "Apple", "stars": 4.6, "reviewsCount": 18594, "thumbnailImage": "https://m.media-amazon.com/images/I/41U6Q0T5toL.__AC_SY445_SX342_QL70_FMwebp_.jpg", "breadCrumbs": "", "description": null, "price": { "value": 78.99, "currency": "$" }, "url": "https://www.amazon.com/dp/B0DL72PK1P" } ] ``` ## Other scrapers available[​](#other-scrapers-available "Direct link to Other scrapers available") There are other native Make Apps powered by Apify. You can check out Apify Scraper for: * [Instagram Data](https://docs.apify.com/platform/integrations/make/instagram.md) * [TikTok Data](https://docs.apify.com/platform/integrations/make/tiktok.md) * [Google Search](https://docs.apify.com/platform/integrations/make/search.md) * [Google Maps Emails Data](https://docs.apify.com/platform/integrations/make/maps.md) * [YouTube Data](https://docs.apify.com/platform/integrations/make/youtube.md) * [AI crawling](https://docs.apify.com/platform/integrations/make/ai-crawling.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). 
--- # Make - Facebook Actor integration ## Apify Scraper for Facebook Data[​](#apify-scraper-for-facebook-data "Direct link to Apify Scraper for Facebook Data") The Facebook Scraper modules from [Apify](https://apify.com/) allow you to extract posts, comments, and profile data from Facebook. To use these modules, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate data extraction and incorporate the results into your workflows. ## Connect Apify Scraper for Facebook Data modules to Make[​](#connect-apify-scraper-for-facebook-data-modules-to-make "Direct link to Connect Apify Scraper for Facebook Data modules to Make") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/signup-9708d31d8ee3eee598de711064f649f3.png) 2. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. ![Apify Actor rental](/assets/images/actor-rental-f376d02ecb872d8399a7f664c8be2278.png) 3. Review the trial details. You won't be charged during the trial unless you actively switch to a paid plan. Click **Rent Actor** to activate your trial. ![Start Actor rental](/assets/images/start-rental-67a43c22c84cb3cb1d71c199909dc354.png) 4. To connect your Apify account with Make, you need to get the Apify API token. In the Apify Console, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)**. ![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 5. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 6. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 7. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 8. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Make API token](/assets/images/make-api-token-0fa647a34cecf9027b22ab2ad6db7c27.png) Once connected, you can build workflows to automate Facebook data extraction and integrate results into your applications. ## Apify Scraper for Facebook Data modules[​](#apify-scraper-for-facebook-data-modules "Direct link to Apify Scraper for Facebook Data modules") After connecting the app, you can use one of the three existing Search modules as native scrapers to extract public data from Facebook. ### Extract Facebook groups[​](#extract-facebook-groups "Direct link to Extract Facebook groups") Get data via Apify's Facebook Groups Scraper. Just add one or multiple URLs of public groups you want to extract data from, then indicate the number of posts, and optionally choose a sorting order and date filter.
For each given Facebook group URL, you will extract: * *Post details*: post ID, legacy ID, Facebook group URL, direct post URL, post text, timestamp, and Facebook feedback ID. * *Engagement metrics*: likes, shares, comments, top reactions, and breakdown by type (like, love, wow, care, haha). * *User (post author)*: user ID and name. * *Attachments*: media set URL, image thumbnail, full image URL, dimensions, OCR text (if any), media ID, and owner ID. * *Top comments*: comment ID, comment URL, timestamp, text, feedback ID, commenter ID and name, profile picture, likes count, and threading depth. Profile data, shortened sample ``` [ { "facebookUrl": "https://www.facebook.com/groups/WeirdSecondhandFinds", "url": "https://www.facebook.com/groups/WeirdSecondhandFinds/permalink/3348022435381946/", "time": "2025-04-09T15:34:31.000Z", "user": { "name": "Author name" }, "text": "4/9/2025 - This glass fish was found at a friend's yard sale and for some reason it had to come home with me. Any ideas on how to display it?", "reactionLikeCount": 704, "reactionLoveCount": 185, "reactionWowCount": 10, "reactionCareCount": 6, "reactionHahaCount": 3, "attachments": [ { "url": "https://www.facebook.com/media/set/?set=pcb.3348022435381946&type=1", "thumbnail": "https://scontent.fcgh33-1.fna.fbcdn.net/v/t39.30808-6/490077910_10228674979643758_5977579619381197326_n.jpg?stp=dst-jpg_s600x600_tt6" } ], "likesCount": 908, "sharesCount": 3, "commentsCount": 852, "topComments": [ { "commentUrl": "https://www.facebook.com/groups/WeirdSecondhandFinds/permalink/3348022435381946/?comment_id=3348201365364053", "text": "Would this work okay? Water and floating candle?", "profileName": "Bonnie FireUrchin Lambourn", "likesCount": 2 } ], "facebookId": "650812835102933", "groupTitle": "Weird (and Wonderful) Secondhand Finds That Just Need To Be Shared" } ] ``` ### Extract Facebook comments[​](#extract-facebook-comments "Direct link to Extract Facebook comments") Use the Facebook Comment Scraper to collect comments from Facebook posts. Add the post URLs, set the number of comments you want, and optionally choose comment order and whether to include replies. You’ll get: * *Text*: Comment text * *Timestamp*: Date and time of the comment * *Like count*: Number of likes on the comment * *Commenter info*: Username, profile picture, profile URL, user ID * *Number of replies*: Number of replies to the comment (not included in this example) * *Post URL*: Link to the post the comment is associated with * *Nested replies*: Nested replies to the comment (not included in this example) Free plan limitations Features like *replies* and *comment sorting* are limited for users on Apify's Free Plan. Consider upgrading to a [paid subscription](https://apify.com/pricing). Example (shortened) ``` [ { "facebookUrl": "https://www.facebook.com/NASAJSC/posts/pfbid0ohxEG5cJnm3JNFodkvsehRUY3yfLx5Vis8cude7xRdmrXV9EMDxsuScPaSCtX9KNl?locale=cs_CZ", "commentUrl": "https://www.facebook.com/NASAJSC/posts/pfbid0ohxEG5cJnm3JNFodkvsehRUY3yfLx5Vis8cude7xRdmrXV9EMDxsuScPaSCtX9KNl?comment_id=2386082985122451", "id": "Y29tbWVudDoxMDU1NDAzMDgzMzY4Mzk1XzIzODYwODI5ODUxMjI0NTE=", "feedbackId": "ZmVlZGJhY2s6MTA1NTQwMzA4MzM2ODM5NV8yMzg2MDgyOTg1MTIyNDUx", "date": "2025-04-09T18:39:23.000Z", "text": "Green is my favorite color. The beach my peaceful place. When I visited I was amazed at all the green and to see the beach area. 
Very cool", "profileUrl": "https://www.facebook.com/people/Elizabeth-Grindrod/pfbid022LryhRGvvGeZrrHq6SeS95doHdjDg7WHfaJHErzcEiNF8KPHiTx3drT9pw3oKMKTl/", "profilePicture": "https://scontent-bkk1-1.xx.fbcdn.net/v/t39.30808-1/489953042_122145581006424177_4615090019565194474_n.jpg?stp=cp0_dst-jpg_s32x32_tt6&_nc_cat=109&ccb=1-7&_nc_sid=e99d92&_nc_ohc=fJU9pA6IZpkQ7kNvwFulSHc&_nc_oc=AdldBxtJX_EilisOewldRrGT1dHWEFd690Wt6nWFTEVLY9-rlYNGHFTlMjgjB5bDsAM&_nc_zt=24&_nc_ht=scontent-bkk1-1.xx&_nc_gid=Kbf_nt_NCH2lzg1SIjTdHg&oh=00_AfGKLaCo8R4odY5OLT4esFDzvURJ46R6dxwCE0fD8jJR2A&oe=67FCA025", "profileId": "pfbid022LryhRGvvGeZrrHq6SeS95doHdjDg7WHfaJHErzcEiNF8KPHiTx3drT9pw3oKMKTl", "profileName": "Elizabeth Grindrod", "likesCount": "2", "threadingDepth": 0, "facebookId": "1055403083368395", ] ``` ### Extract Facebook posts[​](#extract-facebook-posts "Direct link to Extract Facebook posts") Use the Facebook Post Scraper to get post data by adding one or multiple page URLs and the amount of posts you want to scrape. You’ll get: * *Post URL*: Link to the post * *Shortcode*: Unique identifier for the post * *Timestamp*: Date and time of the post * *Content type*: Whether it’s an image, video, or carousel * *Caption*: Text content of the post * *Hashtags*: List of hashtags used in the post * *Mentions*: Usernames of mentioned accounts * *Likes*: Number of likes on the post * *Comments*: Number of comments on the post * *Shares*: Number of times the post has been shared * *Media info*: * *URLs*: Links to media files * *Type*: Whether it's an image or video * *Dimensions*: Size of the media * *Owner info*: * *Username*: Account name of the post owner * *User ID*: Unique identifier for the owner * *Full name*: Full name of the account holder * *Tags*: Hashtags used in the post * *Location*: Geographic location tagged in the post (if available) Example (shortened) ``` [ { "facebookUrl": "https://www.facebook.com/nasa", "postId": "1215784396583601", "pageName": "NASA", "url": "https://www.facebook.com/NASA/posts/pfbid029aLb3sDGnXuYA5P7DK5uRT7Upf39X5fwCBFcRz9C3M4EMShwJWNwLLaXA5RdYeyKl", "time": "2025-04-07T19:09:00.000Z", "user": { "id": "100044561550831", "name": "NASA - National Aeronautics and Space Administration", "profileUrl": "https://www.facebook.com/NASA", "profilePic": "https://scontent.fbog3-2.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk" }, "text": "It’s your time to shine! 
This Citizen Science Month, contribute to a NASA Citizen Science project that will help improve life on Earth and solve cosmic mysteries.", "link": "https://science.nasa.gov/citizen-science/", "likes": 2016, "comments": 171, "shares": 217, "media": [ { "thumbnail": "https://scontent.fbog3-3.fna.fbcdn.net/v/t39.30808-6/489419147_1215784366583604_2492050236576327908_n.jpg?stp=dst-jpg_s720x720_tt6&_nc_cat=110&ccb=1-7&_nc_sid=127cfc&_nc_ohc=YI6mnyIKJmwQ7kNvwGVLR7C&_nc_oc=AdklMZgJuQZ-r924q5F9ikY0F5E_LF2gbzNnepx75qTmtJ-jDnq6Ve-VkIQ1hcaCDhA" } ] }, { "facebookUrl": "https://www.facebook.com/nasa", "postId": "1215717559923618", "pageName": "NASA", "url": "https://www.facebook.com/NASA/posts/pfbid01SDwDikd344679WW4Er1F1UAB3cfpBH4Ud54RJEaTtD1Fih2xSzjtsCsYXgbh93Ll", "time": "2025-04-07T17:04:00.000Z", "user": { "id": "100044561550831", "name": "NASA - National Aeronautics and Space Administration", "profileUrl": "https://www.facebook.com/NASA", "profilePic": "https://scontent.fbog3-2.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk" }, "text": "NASA's Hubble Space Telescope has studied Uranus for more than 20 years and is still learning more about its gas.", "link": "https://go.nasa.gov/3RIapAw", "likes": 1878, "comments": 144, "shares": 215, "media": [ { "thumbnail": "https://scontent.fbog3-1.fna.fbcdn.net/v/t39.30808-6/489532065_1215717536590287_873488674466633974_n.jpg?stp=dst-jpg_p180x540_tt6&_nc_cat=109&ccb=1-7&_nc_sid=127cfc&_nc_ohc=kAiP3avgomkQ7kNvwGOb-YS&_nc_oc=Adn31Ca9oiQ5ieTtUtFqcr45R4jdJdVxei1kMR1kj-RLDehS-fyEVJD1fY2-5IItLe0" } ] }, { "facebookUrl": "https://www.facebook.com/nasa", "postId": "1212614090233965", "pageName": "NASA", "url": "https://www.facebook.com/NASA/videos/958890849561531/", "time": "2025-04-03T18:06:29.000Z", "user": { "id": "100044561550831", "name": "NASA - National Aeronautics and Space Administration", "profileUrl": "https://www.facebook.com/NASA", "profilePic": "https://scontent.fssz1-1.fna.fbcdn.net/v/t39.30808-1/243095782_416661036495945_3843362260429099279_n.png?stp=cp0_dst-png_s40x40&_nc_cat=1&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=pGNKYYiG82gQ7kNvwGLgqmB&_nc_oc=AdmpIOT7GNKe9qxJgFM-EEuF78UvDx97YygzhxiRXW5nXDyZmQScZzHnWAFlGmn8VBk" }, "text": "Rocket? Stacking. Crew training? Underway. Mission patch? Ready to go.", "link": "https://go.nasa.gov/41ZErWJ", "likes": 1813, "comments": 190, "shares": 456, "media": [ { "thumbnail": "https://scontent.fssz1-1.fna.fbcdn.net/v/t15.5256-10/488073346_1027101039315356_6805938007276905855_n.jpg?_nc_cat=109&ccb=1-7&_nc_sid=7965db&_nc_ohc=M4hIzfAIbdAQ7kNvwFnbXVw&_nc_oc=AdmJODt8am5l58TuwIbYLbEMK_w9IFb6uaUqiq7SCtNI9ouf4Xd_nZcifKpRLWSsclg" } ] } ] ``` ## Other scrapers available[​](#other-scrapers-available "Direct link to Other scrapers available") Looking for more than just Facebook? 
You can use other native Make apps powered by Apify: * [Instagram Data](https://docs.apify.com/platform/integrations/make/instagram.md) * [TikTok Data](https://docs.apify.com/platform/integrations/make/tiktok.md) * [Google Search](https://docs.apify.com/platform/integrations/make/search.md) * [Google Maps Emails Data](https://docs.apify.com/platform/integrations/make/maps.md) * [YouTube Data](https://docs.apify.com/platform/integrations/make/youtube.md) * [AI crawling](https://docs.apify.com/platform/integrations/make/ai-crawling.md) * [Amazon](https://docs.apify.com/platform/integrations/make/amazon.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). --- # Make - Instagram Actor integration **Learn about Instagram scraper modules. Extract posts, comments, and profile data.** *** ## Apify Scraper for Instagram Data[​](#apify-scraper-for-instagram-data "Direct link to Apify Scraper for Instagram Data") The Instagram Scraper modules from [Apify](https://apify.com) allow you to extract posts, comments, and profile data from Instagram. To use these modules, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate data extraction and incorporate the results into your workflows. ## Connect Apify Scraper for Instagram Data modules to Make[​](#connect-apify-scraper-for-instagram-data-modules-tomake "Direct link to Connect Apify Scraper for Instagram Data modules to Make") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/Apify_Make_Sign_up_page-61b1eb99291e67a5f9aa005b1477a8eb.png) 2. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. ![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 3. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 4. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 5. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 6. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Apify Token for modules on Make.png](/assets/images/Apify_Token_for_modules_on_Make-bfbbc21f67c9db5bf2abf963e2b7bfc8.png) Once connected, you can build workflows to automate Instagram data extraction and integrate results into your applications. ## Apify Scraper for Instagram Data modules[​](#apify-scraper-for-instagram-data-modules "Direct link to Apify Scraper for Instagram Data modules") After connecting, you can use one of the three existing Search modules as native scrapers to extract public data from Instagram. 
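The same personal API token you use for the Make connection also works with Apify's REST API and client libraries, which can be handy for prototyping a module's output outside Make. Below is a minimal sketch using the `apify-client` JavaScript package to run the Instagram Profiles Scraper described in the next section; the `usernames` input field is an assumption based on the module description, so verify it against the Actor's input schema in Apify Console before relying on it.

```
// npm install apify-client
import { ApifyClient } from 'apify-client';

// Use the same personal API token you pasted into the Make connection dialog.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run the Instagram Profiles Scraper and wait for the run to finish.
// The `usernames` field is assumed from the module description; check the Actor's input schema.
const run = await client.actor('apify/instagram-profile-scraper').call({
    usernames: ['nasa', 'natgeo'],
});

// Scraped results are stored as items in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} profiles`, items);
```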
### Extract Instagram profiles[​](#extract-instagram-profiles "Direct link to Extract Instagram profiles") Get profile details via [Apify's Instagram Profiles Scraper](https://apify.com/apify/instagram-profile-scraper). All you need to set up is usernames or URLs you want to extract data from. For each Instagram profile, you will extract: * *Basic profile details*: username, full name, biography, and profile URL. * *Account status*: verification status, whether the account is private or public, and if it's a business account. * *Follower and engagement metrics*: number of followers and accounts followed. * *Profile pictures*: standard and HD profile picture URLs. * *External links*: website URL (if provided). * *Content information*: number of IGTV videos and highlight reels. * *Related profiles*: suggested accounts, including their username, full name, profile picture URL, and verification status. Profile data, shortened sample ``` [ { "fullName": "NASA", "profilePicUrl": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/29090066_159271188110124_1152068159029641216_n.jpg?stp=dst-jpg_e0_s150x150_tt6&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=1&_nc_oc=Q6cZ2AHA8W2z8Q8c-m6E-NgP2su5m59iPYiKVnQlhOBQsfJhVbSzs1AMZMECpvJPB_LanuQ&_nc_ohc=1vXGhkEoh-oQ7kNvgErK0o3&_nc_gid=f2687478a00141a1b759031727c45f9c&edm=AOQ1c0wBAAAA&ccb=7-5&oh=00_AYBBdI58EkpbDvplnxGKsqMUNzd7IYS1GzG-F3fb493okg&oe=67B983E9&_nc_sid=8b3546", "username": "nasa", "postsCount": 4381, "followersCount": 96718778, "followsCount": 81, "private": false, "verified": true, "isBusinessAccount": true, "biography": "🚀 🌎 Exploring the universe and our home planet. Verification: nasa.gov/socialmedia" }, { "fullName": "National Geographic", "profilePicUrl": "https://scontent-lga3-3.cdninstagram.com/v/t51.2885-19/476424694_3911047592506753_8995392926225146489_n.jpg?stp=dst-jpg_e0_s150x150_tt6&_nc_ht=scontent-lga3-3.cdninstagram.com&_nc_cat=1&_nc_oc=Q6cZ2AHN-FkuOj4TjoJuFvCdSEHSAA2nPN9hSjUK1b1phJY5bDOJjsQVtvzHguB7UvZVA78&_nc_ohc=Nw5ra__Z3vEQ7kNvgG9nmDU&_nc_gid=dfd92e92eeda46f99367b1eaa093ff3a&edm=AOQ1c0wBAAAA&ccb=7-5&oh=00_AYDUu2rnJb1CKhHODZr7GGg142G24F_Sxb0cVh7gRqUn1g&oe=67B97B88&_nc_sid=8b3546", "username": "natgeo", "postsCount": 30103, "followersCount": 279573878, "followsCount": 167, "private": false, "verified": true, "isBusinessAccount": true, "biography": "Inspiring the explorer in everyone 🌎" } ] ``` ### Extract Instagram comments[​](#extract-instagram-comments "Direct link to Extract Instagram comments") Retrieve comments from posts by calling [Apify's Instagram Comments Scraper](https://apify.com/apify/instagram-comment-scraper). To set up this module, you will need to add Instagram posts or reels to extract the comments from, the desired number of comments, and optionally, the order of comments, and replies. For each Instagram post, you will extract: * *Comment details*: comment text, timestamp, and number of likes. * *Commenter profile*: username, full name, profile picture URL, and account status (private or public). * *Engagement data*: number of replies and whether the commenter is verified. * *Post association*: URL of the Instagram post the comment belongs to. * *Replies (if any)*: nested replies under the main comment. Free plan limitations Features like *replies* and *newest comments first* are limited for users on Apify's Free Plan. To access the newest comment sorting or more than 20 replies per comment, consider upgrading to a [paid subscription](https://apify.com/pricing). 
Comment data, shortened sample ``` [ { "text": "So beautiful 🥲🥹✨", "timestamp": "2024-10-24T10:16:00.000Z", "ownerUsername": "im_sanaz3", "ownerProfilePicUrl": "https://scontent-ber1-1.cdninstagram.com/v/t51.2885-19/475976048_1321670355521757_8632924050781709835_n.jpg?stp=dst-jpg_e0_s150x150_tt6&_nc_ht=scontent-ber1-1.cdninstagram.com&_nc_cat=109&_nc_oc=Q6cZ2AHRZYgJDKz3fcI9QKX0mLmjyXeZOpQxGcUhRqH71pVWJMe0YOr2d7BqTu5-kLCdJsU&_nc_ohc=Z8izKMKm5QAQ7kNvgGSfNLs&_nc_gid=f12dbe88e285431d800ffc93637264f1&edm=AId3EpQBAAAA&ccb=7-5&oh=00_AYAacAhAy2-oAy8D-_z_MP_2sI59yqf9t5tdz5uvrlH3NA&oe=67B9A2A2&_nc_sid=f5838a", "postUrl": "https://www.instagram.com/p/DBea8-8Jn2z/" }, "text": "So something gonna hit earth? Since we see stories all over internet. Please give us the details 😂", "timestamp": "2025-02-11T19:01:03.000Z", "ownerUsername": "isabellain_", "ownerProfilePicUrl": "https://scontent-ber1-1.cdninstagram.com/v/t51.2885-19/477089999_1404980843702640_3169514283121086597_n.jpg?stp=dst-jpg_e0_s150x150_tt6&_nc_ht=scontent-ber1-1.cdninstagram.com&_nc_cat=100&_nc_oc=Q6cZ2AHRZYgJDKz3fcI9QKX0mLmjyXeZOpQxGcUhRqH71pVWJMe0YOr2d7BqTu5-kLCdJsU&_nc_ohc=f3WerXJOT3IQ7kNvgGSWaW3&_nc_gid=f12dbe88e285431d800ffc93637264f1&edm=AId3EpQBAAAA&ccb=7-5&oh=00_AYAXqkREDSM9YNfa14dKLPp8uuHQgwAIb_zKvYA4W_I_Pg&oe=67B98B5E&_nc_sid=f5838a", "postUrl": "https://www.instagram.com/p/DBea8-8Jn2z/" }, { "text": "Please archive ALL of your articles and research with and about women making history with NASA before you remove the data from your websites. And while you’re at it, remove the word men and stick to just names, fair is fair and there won’t be any sex called out in any of your articles.", "timestamp": "2025-02-11T15:40:44.000Z", "ownerUsername": "hippiesoulmo", "ownerProfilePicUrl": "https://scontent-ber1-1.cdninstagram.com/v/t51.2885-19/471553535_3637106739845033_7912985502669751019_n.jpg?stp=dst-jpg_e0_s150x150_tt6&_nc_ht=scontent-ber1-1.cdninstagram.com&_nc_cat=105&_nc_oc=Q6cZ2AHRZYgJDKz3fcI9QKX0mLmjyXeZOpQxGcUhRqH71pVWJMe0YOr2d7BqTu5-kLCdJsU&_nc_ohc=2NKox-3InPkQ7kNvgHMdSEH&_nc_gid=f12dbe88e285431d800ffc93637264f1&edm=AId3EpQBAAAA&ccb=7-5&oh=00_AYD1tLwbEVW58ey9hxlvkO6nFKVr-VmIgzbZFPnF3mL83w&oe=67B9942E&_nc_sid=f5838a", "postUrl": "https://www.instagram.com/p/DBea8-8Jn2z/" } ] ``` ### Extract Instagram posts[​](#extract-instagram-posts "Direct link to Extract Instagram posts") Gather post data with [Apify's Instagram Post Scraper](https://apify.com/apify/instagram-post-scraper). To set up this module, you will need to add Instagram usernames to extract the post from, the desired number of posts, and the timeframe of the posts. For each Instagram post, you will extract: * *Post details*: post URL, content type (image, video, carousel), shortcode, post ID, and timestamp. * *Text content*: caption, hashtags, and mentions. * *Engagement metrics*: number of likes and comments, including a few latest comments. * *Media details*: image dimensions (height and width), display URL, and alternative text (if available). * *User information*: owner’s username, full name (if available), and user ID. * *Additional data*: tagged users, child posts (for carousel posts), and location details (if available). Post data, shortened sample ``` [ { "caption": "A supernova glowing in the dark 🌟⁣\n ⁣\nWhen supernova remnant SN 1006 first appeared in the sky in 1006 C.E., it was far brighter than Venus and visible during the daytime for weeks. 
From that moment on, it occupied the hearts of astronomers all over the world; it has been studied from the ground and from space many times.⁣\n ⁣\nIn this image, visible, radio, and X-ray data combine to give us that blue (and red) view of the remnant’s full shell – the debris field that was created when a white dwarf star exploded and sent material hurtling into space.⁣\n ⁣\nScientists believe SN 1006 is a Type Ia supernova. This class of supernova is caused when a white dwarf never lets another star go: either it pulls too much mass from a companion star and explodes, or it merges with another white dwarf and explodes. Understanding Type Ia supernovas is especially important because astronomers use observations of these explosions in distant galaxies as mileposts to mark the expansion of the universe.⁣\n ⁣\nImage description: This supernova remnant looks like a bubble filled with blue and red clouds of dust and gas, floating amid a million stars. These stars are visible all around the bubble and even can be seen peeking through it.⁣\n ⁣\nCredit: NASA, ESA, and Z. Levay (STScI)⁣\n ⁣\n#NASA #Supernova #Stars #IVE #Astronomy #Hubble #Chandra #Clouds #아이브 #SupernovaLove #DavidGuetta", "ownerFullName": "NASA", "ownerUsername": "nasa", "url": "https://www.instagram.com/p/DCHmqs1NoaJ/", "commentsCount": 3565, "firstComment": "🔥🙌❤️👏", "likesCount": 1214485, "timestamp": "2024-11-08T17:30:07.000Z" }, { "caption": "Take a deep breath...\n\nX-ray images from our Chandra X-ray Observatory helped astronomers confirm that most of the oxygen in the universe is synthesized in massive stars. So, everybody say \"thank you\" to supernova remnants (SNRs) like this one, which has enough oxygen for thousands of solar systems.\n\nSupernova remnants are, naturally, the remains of exploded stars. They're extremely important for understanding our galaxy. If it weren't for SNRs, there would be no Earth, no plants, animals, or people. This is because all the elements heavier than iron were made in a supernova explosion, so the only reason we find these elements on Earth or in our solar system — or any other extrasolar planetary system — is because those elements were formed during a supernova.\n\n@nasachandraxray's data is represented in this image by blue and purple, while optical data from @nasahubble and the Very Large Telescope in Chile are in red and green.\n\nImage description: The darkness of space is almost covered by the array of objects in this image. Stars of different sizes are strewn about, while a blue and red bubble of gas is at the center. An area of pink and green covers the bottom-right corner.\n\nCredit: X-ray (NASA/CXC/ESO/F.Vogt et al); Optical (ESO/VLT/MUSE), Optical (NASA/STScI)\n\n#NASA #Supernova #Space #Universe #Astronomy #Astrophotography #Telescope #Xray", "ownerFullName": "NASA", "ownerUsername": "nasa", "url": "https://www.instagram.com/p/DBKBByizDHZ/", "commentsCount": 2050, "firstComment": "👍", "likesCount": 1020495, "timestamp": "2024-10-15T19:27:29.000Z" }, { "caption": "It’s giving rainbows and unicorns, like a middle school binder 🦄🌈 ⁣⁣\n⁣⁣\nMeet NGC 602, a young star cluster in the Small Magellanic Cloud (one of our satellite galaxies), where astronomers using @NASAWebb have found candidates for the first brown dwarfs outside of our galaxy. This star cluster has a similar environment to the kinds of star-forming regions that would have existed in the early universe—with very low amounts of elements heavier than hydrogen and helium. 
It’s drastically different from our own solar neighborhood and close enough to study in detail. ⁣⁣\n ⁣⁣\nBrown dwarfs are… not quite stars, but also not quite gas giant planets either. Typically they range from about 13 to 75 Jupiter masses. They are sometimes free-floating and not gravitationally bound to a star, like a planet would be. But they do share some characteristics with exoplanets, like storm patterns and atmospheric composition. ⁣⁣\n\n@NASAHubble showed us that NGC 602 harbors some very young low-mass stars; Webb is showing us how significant and extensive objects like brown dwarfs are in this cluster. Scientists are excited to better be able to understand how they form, particularly in an environment similar to the harsh conditions of the early universe.⁣⁣\n ⁣⁣\nRead more at the link in @ESAWebb’s bio. ⁣⁣\n ⁣⁣\nImage description: A two image swipe-through of a star cluster is shown inside a large nebula of many-coloured gas and dust. The material forms dark ridges and peaks of gas and dust surrounding the cluster, lit on the inner side, while layers of diffuse, translucent clouds blanket over them. Around and within the gas, a huge number of distant galaxies can be seen, some quite large, as well as a few stars nearer to us which are very large and bright.⁣⁣\n ⁣⁣\nImage Credit: ESA/Webb, NASA & CSA, P. Zeidler, E. Sabbi, A. Nota, M. Zamani (ESA/Webb)⁣⁣\n ⁣⁣\n#JWST #Webb #JamesWebbSpaceTelescope #NGC602 #browndwarf #space #NASA #ESA", "ownerFullName": "NASA", "ownerUsername": "nasa", "url": "https://www.instagram.com/p/DBea8-8Jn2z/", "commentsCount": 3356, "firstComment": "🔥🌍", "likesCount": 1092162, "timestamp": "2024-10-23T17:38:49.000Z" } ] ``` ## Other scrapers available[​](#other-scrapers-available "Direct link to Other scrapers available") There are other native Make Apps powered by Apify. You can check out Apify Scraper for: * [TikTok Data](https://docs.apify.com/platform/integrations/make/tiktok.md) * [Google Search](https://docs.apify.com/platform/integrations/make/search.md) * [Google Maps Emails Data](https://docs.apify.com/platform/integrations/make/maps.md) * [YouTube Data](https://docs.apify.com/platform/integrations/make/youtube.md) * [AI crawling](https://docs.apify.com/platform/integrations/make/ai-crawling.md) * [Amazon](https://docs.apify.com/platform/integrations/make/amazon.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). --- # Make - Google Maps Leads Actor integration ## Apify Scraper for Google Maps Leads[​](#apify-scraper-for-google-maps-leads "Direct link to Apify Scraper for Google Maps Leads") The Google Maps Leads Scraper modules from [apify.com](http://apify.com/) allow you to extract valuable business lead data from Google Maps, including contact information, email addresses, social media profiles, business websites, phone numbers, and detailed location data. Perfect for sales teams, marketers, and business developers looking to build targeted lead lists, marketers or other commercial teams looking to data mine reviews or assess sentiment analysis wide geographies. To use these modules, you need an [Apify account](https://console.apify.com/sign-up) and an [API token](https://docs.apify.com/platform/integrations/api#api-token), which you can find under **Settings > Integrations** in Apify Console. After connecting, you can automate lead generation at scale and incorporate the results into your sales and marketing workflows. 
For more details, follow the tutorial below. ## Connect Apify Scraper for Google Maps Leads[​](#connect-apify-scraper-for-google-maps-leads "Direct link to Connect Apify Scraper for Google Maps Leads") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/maps-signup-9df016e55569910774bd905b375e44e6.png) 2. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. ![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 3. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 4. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 5. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 6. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Make API token](/assets/images/Screenshot_2025-04-22_at_12.45.33-c123a18a0dcf92ea643d982cae6db262.png) Once connected, you can build workflows to automate Google Maps extraction and integrate results into your applications. ## Apify Scraper for Google Maps modules[​](#apify-scraper-for-google-maps-modules "Direct link to Apify Scraper for Google Maps modules") After connecting the app, you can use one of the four existing Search modules as native scrapers to extract Google Maps Data. ### Search with Categories[​](#search-with-categories "Direct link to Search with Categories") The Search with Categories module is a component of the Google Maps Leads Scraper that allows you to discover and extract business leads by filtering places based on both search terms and categories. #### Key Features[​](#key-features "Direct link to Key Features") * *Category-Based Filtering*: Filter businesses using Google Maps' extensive category system with over 4,000 available options * *Location Targeting*: Define your target market using simple location queries (city + country format) * *Customizable Result Limit*: Control exactly how many leads to extract per search term or category * *Language Selection*: View business information in your preferred language * *Quality Filters*: Focus on high-quality leads by setting minimum star ratings * *Website Availability Filter*: Target only businesses with (or without) websites * *Exact Name Matching*: Find businesses with exact or partial name matches * *Operational Status Filter*: Exclude temporarily or permanently closed businesses #### How It Works[​](#how-it-works "Direct link to How It Works") The module allows you to combine category filtering with location parameters to discover relevant business leads, data mine reviews, or extract relevant Google Maps information. You can use categories alone or with specific search terms to create precisely targeted lead lists. Categories can be general (e.g., "restaurant") which includes all variations like "Asian restaurant" or "family restaurant," or they can be specific (e.g., "restaurant terrace"). 
Using broader categories with a few specific search terms typically yields the best results without excluding potentially valuable leads. * *Business identification*: name, category, place ID for unique identification. * *Full contact details*: phone number, website URL, email addresses (with email enrichment). * *Complete address information*: street, city, state, postal code, country for direct mail campaigns. * *Geographic data*: precise coordinates, neighborhood, plus codes for territory planning. * *Business quality indicators*: ratings, number of reviews, price level to qualify leads. * *Operational insights*: opening hours, popular times, serviceability to better target outreach. * *Competitive intelligence*: nearby businesses, related places. * *Additional lead qualification data (optional)*: customer reviews, business photos, social media profiles. Business lead data, shortened sample ``` { "searchString": "Restaurant in Staten Island", "rank": 3, "title": "Kim's Island", "placeId": "ChIJJaKM4pyKwokRCZ8XaBNj_Gw", "categoryName": "Chinese restaurant", "price": "$10–20", "rating": 4.6, "reviewsCount": 182, "featuredInLists": ["Best Chinese Food", "Top Rated Restaurants"], // Complete address information for targeted outreach "address": "175 Main St, Staten Island, NY 10307", "neighborhood": "Tottenville", "street": "175 Main St", "city": "Staten Island", "postalCode": "10307", "state": "New York", "countryCode": "US", "plusCode": "GQ62+8M Staten Island, New York", // Multiple contact channels "website": "http://kimsislandsi.com/", "phone": "(718) 356-5168", "phoneUnformatted": "+17183565168", "email": "info@kimsislandsi.com", // From website enrichment // Business qualification data "yearsInBusiness": 12, "claimThisBusiness": false, // Verified listing "popular": true, "temporarilyClosed": false, // Precise location for territory planning "location": { "lat": 40.5107736, "lng": -74.2482624 }, // Operational insights for scheduling outreach "openingHours": { "Monday": "11:00 AM - 10:00 PM", "Tuesday": "11:00 AM - 10:00 PM", "Wednesday": "11:00 AM - 10:00 PM", "Thursday": "11:00 AM - 10:00 PM", "Friday": "11:00 AM - 11:00 PM", "Saturday": "11:00 AM - 11:00 PM", "Sunday": "12:00 PM - 9:30 PM" } } ``` ### Search with Search Terms Module[​](#search-with-search-terms-module "Direct link to Search with Search Terms Module") The Search Terms module is a component of the Google Maps Leads Scraper designed to discover and extract business leads by using specific search queries, similar to how you'd search on Google Maps directly. #### Key Features[​](#key-features-1 "Direct link to Key Features") * *Keyword-Based Discovery*: Find businesses using the same search terms you'd enter in Google Maps * *Location Targeting*: Define your target market using simple location queries (city + country format) * *Customizable Result Limit*: Control exactly how many leads to extract per search term * *Language Selection*: View business information in your preferred language * *Quality Filters*: Focus on high-quality leads by setting minimum star ratings * *Website Availability Filter*: Target only businesses with (or without) websites * *Exact Name Matching*: Find businesses with exact or partial name matches * *Operational Status Filter*: Exclude temporarily or permanently closed businesses #### How It Works[​](#how-it-works-1 "Direct link to How It Works") This module allows you to enter search terms that match what you would typically type into the Google Maps search bar. 
You can search for general business types (like "coffee shop"), specific services ("dog grooming"), or product offerings ("organic produce"). The search results can be further refined using optional category filters, which help ensure you're capturing precisely the type of businesses you're targeting. For maximum efficiency, you can combine broader search terms with strategic category filters to capture the most relevant leads without excluding valuable prospects. ### Advanced and Custom Search Module - Google Maps Leads Scraper[​](#advanced-and-custom-search-module---google-maps-leads-scraper "Direct link to Advanced and Custom Search Module - Google Maps Leads Scraper") The Advanced and Custom Search module is the most powerful component of the Google Maps Leads Scraper, designed for sophisticated lead generation campaigns that require precise geographic targeting and advanced search capabilities. This module gives you complete control over your lead discovery process with multiple location definition methods and advanced filtering options. #### Key Features[​](#key-features-2 "Direct link to Key Features") * *Multiple Location Definition Methods*: Define target areas using free-text location queries, country/state/city selections, postal codes, or custom polygon coordinates * *Custom Geographic Targeting*: Draw precise search areas using longitude/latitude coordinates for highly targeted campaigns * *Direct URL Importing*: Extract leads from specific Google Maps search URLs, CID links, or shortened map links * *Keyword-Based Discovery*: Find businesses using search terms, just like in Google Maps * *Category Filtering*: Further refine results with optional category filters * *Comprehensive Lead Filtering*: Apply multiple quality filters simultaneously for precise lead targeting #### How It Works[​](#how-it-works-2 "Direct link to How It Works") This module provides the most flexible options for defining where and how to search for business leads: ### Geographic Targeting Options[​](#geographic-targeting-options "Direct link to Geographic Targeting Options") * *Simple Location Query*: Use natural language location inputs like "New York, USA" * *Structured Location Components*: Build precise locations using country, state, city, or county parameters * *Postal Code Targeting*: Target specific postal/ZIP code areas for direct mail campaigns * *Custom Polygon Areas*: Define exact geographic boundaries using coordinate pairs for ultra-precise targeting ### Search and Filter Capabilities[​](#search-and-filter-capabilities "Direct link to Search and Filter Capabilities") * *Keyword-Based Search*: Discover businesses using industry, service, or product terms * *Category-Based Filtering*: Apply Google's category system to refine results * *Quality Filters*: Target businesses with specific ratings, website presence, and operational status Advances output data, shortened sample ``` { "searchString": "coffee shop", "rank": 9, "searchPageUrl": "https://www.google.com/maps/search/coffee%20shop/@40.748508724216016,-74.0186770781978,17z?hl=en", "searchPageLoadedUrl": "https://www.google.com/maps/search/coffee%20shop/@40.748508724216016,-74.0186770781978,17z?hl=en", "isAdvertisement": false, "title": "Bluestone Lane Chelsea Piers Café", "price": "$20–30", "categoryName": "Coffee shop", // Address and location data "address": "62 Chelsea Piers Pier 62, New York, NY 10011", "neighborhood": "Manhattan", "street": "62 Chelsea Piers Pier 62", "city": "New York", "postalCode": "10011", "state": "New York", 
"countryCode": "US", "location": { "lat": 40.7485378, "lng": -74.0087457 }, "plusCode": "GQ62+8M Staten Island, New York", // Contact information "website": "https://bluestonelane.com/?y_source=1_MjMwNjk1NDAtNzE1LWxvY2F0aW9uLndlYnNpdGU%3D", "phone": "(718) 374-6858", "phoneUnformatted": "+17183746858", // Rating and reviews "totalScore": 4.3, "reviewsCount": 425, "imagesCount": 659, // Business identifiers "claimThisBusiness": false, "permanentlyClosed": false, "temporarilyClosed": false, "placeId": "ChIJDTUgz1dZwokRtsQ97Tbf0cA", "categories": ["Coffee shop", "Cafe"], "fid": "0x89c25957cf20350d:0xc0d1df36ed3dc4b6", "cid": "13894131752416167094", // Operating hours "openingHours": [ {"day": "Monday", "hours": "7 AM to 6 PM"}, {"day": "Tuesday", "hours": "7 AM to 6 PM"}, {"day": "Wednesday", "hours": "7 AM to 6 PM"}, {"day": "Thursday", "hours": "7 AM to 6 PM"}, {"day": "Friday", "hours": "7 AM to 6 PM"}, {"day": "Saturday", "hours": "7 AM to 6 PM"}, {"day": "Sunday", "hours": "7 AM to 6 PM"} ], // Business attributes and amenities "additionalInfo": { "Service options": [ {"Outdoor seating": true}, {"Curbside pickup": true}, {"No-contact delivery": true}, {"Delivery": true}, {"Onsite services": true}, {"Takeout": true}, {"Dine-in": true} ], "Highlights": [ {"Great coffee": true}, {"Great tea selection": true}, {"Live music": true}, {"Live performances": true}, {"Rooftop seating": true} ], "Popular for": [ {"Breakfast": true}, {"Lunch": true}, {"Solo dining": true}, {"Good for working on laptop": true} ], "Accessibility": [ {"Wheelchair accessible entrance": true}, {"Wheelchair accessible parking lot": true}, {"Wheelchair accessible restroom": true}, {"Wheelchair accessible seating": true} ], "Offerings": [ {"Coffee": true}, {"Comfort food": true}, {"Organic dishes": true}, {"Prepared foods": true}, {"Quick bite": true}, {"Small plates": true}, {"Vegetarian options": true} ], "Dining options": [ {"Breakfast": true}, {"Brunch": true}, {"Lunch": true}, {"Catering": true}, {"Dessert": true}, {"Seating": true} ], "Amenities": [ {"Restroom": true}, {"Wi-Fi": true}, {"Free Wi-Fi": true} ], "Atmosphere": [ {"Casual": true}, {"Cozy": true}, {"Trendy": true} ], "Crowd": [ {"Family-friendly": true}, {"LGBTQ+ friendly": true}, {"Transgender safespace": true} ], "Planning": [ {"Accepts reservations": true} ], "Payments": [ {"Credit cards": true}, {"Debit cards": true}, {"NFC mobile payments": true} ], "Children": [ {"Good for kids": true}, {"High chairs": true} ] }, // Image and metadata "imageUrl": "https://lh3.googleusercontent.com/p/AF1QipMl6-SnuqYEeE3mD54M0q5D5nysRUZQj1BB0g8=w408-h272-k-no", "kgmid": "/g/11ph8zh6sg", "url": "https://www.google.com/maps/search/?api=1&query=Bluestone%20Lane%20Chelsea%20Piers%20Caf%C3%A9&query_place_id=ChIJDTUgz1dZwokRtsQ97Tbf0cA", "scrapedAt": "2025-04-22T14:23:34.961Z" } ``` ## Best Practices[​](#best-practices "Direct link to Best Practices") 1. *Choose the right location method* for your campaign: * Free-text location queries for quick, general area targeting * Country/State/City combinations for administrative boundary targeting * Postal codes for direct mail campaign areas * Custom polygons for precise neighborhood or business district targeting 2. *Layer search parameters effectively*: * Start with broader geographic targeting * Apply search terms to identify relevant business types * Use category filters to further refine results * Apply quality filters (ratings, website presence) as the final step 3. 
*Consider URL-based extraction* for specific scenarios: * When you have existing Google Maps searches with desired filters * For capturing specific business types Google has already grouped * When working with curated Google Maps lists 4. *Optimize polygon definitions* for complex areas: * Use 4-8 coordinate pairs for most areas * Ensure coordinates form a closed shape * Test with smaller areas before scaling to large regions ## Advanced Features[​](#advanced-features "Direct link to Advanced Features") * *Multi-Location Campaigns*: Configure separate runs for each territory and combine results * *Direct Place ID Targeting*: Extract data from specific businesses using place IDs * *Custom Boundary Definitions*: Use longitude/latitude coordinates to define precise areas like neighborhoods, business districts, or sales territories * *URL Parameter Extraction*: Capture lead data from complex Google Maps search URLs with multiple parameters ## Important Notes[​](#important-notes "Direct link to Important Notes") * Different location methods should not be combined (use either free-text location OR country/state/city parameters) * Custom polygon areas take precedence over other location methods when defined * Always verify location coverage before running large-scale extractions * Direct URL imports are limited to approximately 300 results per URL * For complex geographic areas, breaking into multiple targeted searches yields better results There are other native Make Apps powered by Apify. You can check out Apify Scraper for: * [Instagram Data](https://docs.apify.com/platform/integrations/make/instagram.md) * [TikTok](https://docs.apify.com/platform/integrations/make/tiktok.md) * [Google Search](https://docs.apify.com/platform/integrations/make/search.md) * [YouTube Data](https://docs.apify.com/platform/integrations/make/youtube.md) * [AI crawling](https://docs.apify.com/platform/integrations/make/ai-crawling.md) * [Amazon](https://docs.apify.com/platform/integrations/make/amazon.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). --- # Make - Google Search Actor integration ## Apify Scraper for Google Search[​](#apify-scraper-for-google-search "Direct link to Apify Scraper for Google Search") The Google search modules from [Apify](https://apify.com) allows you to crawl Google Search Results Pages (SERPs) and extract data from those web pages in structured format such as JSON, XML, CSV, or Excel. To use the module, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token), which you can find in the Apify Console under **Settings > Integrations**. After connecting, you can automate data extraction and incorporate the results into your workflows. ## Connect Apify Scraper for Google Search modules to Make[​](#connect-apify-scraper-for-google-search--modules-tomake "Direct link to Connect Apify Scraper for Google Search modules to Make") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/search-signup-695d73caa9c0d37a5e13ba0e6e7cb181.png) 1. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. 
![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 1. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 2. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 3. On Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 4. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Make API token]() Once connected, you can build workflows to automate Google Search extraction and integrate results into your applications. ## Apify Scraper for Google Search Data modules[​](#apify-scraper-for-google-search-data-modules "Direct link to Apify Scraper for Google Search Data modules") After connecting the app, you can use one of the primary modules as native scrapers to extract public Google Search data: ### Extract Google Search Results[​](#extract-google-search-results "Direct link to Extract Google Search Results") Get comprehensive search results via [Apify’s Google Search Scraper](https://apify.com/apify/google-search-scraper). Enter your search terms or Google Search URLs, one per line. For each Google Search query, you will extract: * *Organic results*: comprehensive listings with titles, descriptions, URLs, and position data. * *Paid results*: sponsored listings and advertisements that appear in search results. * *AI Overviews*: Google’s AI-generated summaries that appear at the top of results. * *People Also Ask*: related questions and their expandable answers. * *Related queries*: suggested search terms related to your original query. * *Featured snippets*: highlighted content that directly answers search queries. * *Additional data*: prices, review ratings, product information, and more where available. Search results data, shortened sample ``` { "searchQuery": { "term": "javascript", "page": 1, "type": "SEARCH", "countryCode": "us", "languageCode": "en", "locationUule": null, "device": "DESKTOP" }, "url": "https://www.google.com/search?q=javascript&hl=en&gl=us&num=10", "hasNextPage": true, "resultsCount": 13600000000, "organicResults": [ { "title": "JavaScript Tutorial", "url": "https://www.w3schools.com/js/", "displayedUrl": "https://www.w3schools.com › js", "description": "JavaScript is the world's most popular programming language. JavaScript is the programming language of the Web. JavaScript is easy to learn.", "position": 1, "emphasizedKeywords": ["JavaScript", "JavaScript", "JavaScript", "JavaScript"], "siteLinks": [] } ], "paidResults": [ { "title": "JavaScript Online Course - Start Learning JavaScript", "url": "https://www.example-ad.com/javascript", "displayedUrl": "https://www.example-ad.com", "description": "Learn JavaScript from scratch with our comprehensive online course. 
Start your coding journey today!", "position": 1, "type": "SHOPPING" } ], "peopleAlsoAsk": [ { "question": "What is JavaScript used for?", "answer": "JavaScript is used for creating interactive elements on websites, browser games, frontend of web applications, mobile applications, and server applications...", "url": "https://www.example.com/javascript-uses" } ] } ``` #### Advanced Search Capabilities[​](#advanced-search-capabilities "Direct link to Advanced Search Capabilities") Customize your searches with powerful filtering options: * *Regional targeting*: Select specific countries and languages for localized results * *Device simulation*: Get results as they would appear on desktop or mobile devices * *Site filtering*: Restrict results to specific domains with `site:example.com` * *Exclusion operators*: Remove unwanted sources with `site:reddit.com` * *Exact phrase matching*: Search for precise phrases with quotation marks * *Date filtering*: Limit results to specific time periods * *File type filtering*: Target specific document formats like PDF, DOC, or XLSX * *Content location targeting*: Find keywords in specific parts of pages with `intext:`, `intitle:`, and `inurl:` * *UULE parameters*: Target searches to exact geographic locations The scraper exports data in various formats including JSON, CSV, Excel, and XML, enabling integration with your workflows and applications. There are other native Make Apps powered by Apify. You can check out Apify Scraper for: * [Instagram Data](https://docs.apify.com/platform/integrations/make/instagram.md) * [TikTok Data](https://docs.apify.com/platform/integrations/make/tiktok.md) * [Google Maps Emails Data](https://docs.apify.com/platform/integrations/make/maps.md) * [YouTube Data](https://docs.apify.com/platform/integrations/make/youtube.md) * [AI crawling](https://docs.apify.com/platform/integrations/make/ai-crawling.md) * [Amazon Data](https://docs.apify.com/platform/integrations/make/amazon.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). --- # Make - TikTok Actor integration ## Apify Scraper for TikTok Data[​](#apify-scraper-for-tiktok-data "Direct link to Apify Scraper for TikTok Data") The TikTok Scraper modules from [Apify](https://apify.com) allow you to extract hashtag, comments, and profile data from TikTok. To use these modules, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token in the [Apify Console](https://console.apify.com/) under **Settings > Integrations**. After connecting, you can automate data extraction and incorporate the results into your workflows. ## Connect Apify Scraper for TikTok Data modules to Make[​](#connect-apify-scraper-for-tiktok-data-modules-tomake "Direct link to Connect Apify Scraper for TikTok Data modules to Make") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/image-faa0832d8cb43a46c5e88988f22a4552.png) 2. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. ![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 3. 
Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 4. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 5. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 6. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Apify Token for modules on Make.png](/assets/images/image1-a6ed2fb36a6d01b499bdfa0a1d91d126.png) Once connected, you can build workflows to automate TikTok data extraction and integrate results into your applications. ## Apify Scraper for TikTok Data modules[​](#apify-scraper-for-tiktok-data-modules "Direct link to Apify Scraper for TikTok Data modules") After connecting the app, you can use one of the three existing Search modules as native scrapers to extract public TikTok data: ### Extract TikTok profiles[​](#extract-tiktok-profiles "Direct link to Extract TikTok profiles") Get profile details via [Apify's TikTok Profile Scraper](https://apify.com/clockworks/tiktok-profile-scraper). To use this module, fill in the profile names you want to gather information about. For each TikTok profile, you will extract: * *Basic profile details*: name, nickname, bio, ID, and profile URL. * *Account status*: whether the account is verified or not, and if it's a business and seller account. * *Follower and engagement metrics*: number of followers and accounts followed. * *Profile avatar*: avatar URLs. * *Content information*: number of videos, fans, hearts, friends, and likes. Profile data, shortened sample ``` [ { "authorMeta": { "id": "6987048613642159109", "name": "nasaofficial", "profileUrl": "https://www.tiktok.com/@nasaofficial", "nickName": "NASA", "verified": false, "signature": "National Aeronautics Space Association", "bioLink": null, "originalAvatarUrl": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/6f0cf6a7e7d410e3a624f0af8fa4d314~tplv-tiktokx-cropcenter:720:720.jpeg?dr=10399&nonce=84125&refresh_token=05118aa7a7b44a43f792d1a09d7bfecf&x-expires=1740060000&x-signature=NKl%2Fc2Ma6bNAhN2pHpCRWflSejQ%3D&idc=no1a&ps=13740610&shcp=81f88b70&shp=a5d48078&t=4d5b0474", "avatar": "https://p16-sign-va.tiktokcdn.com/tos-maliva-avt-0068/6f0cf6a7e7d410e3a624f0af8fa4d314~tplv-tiktokx-cropcenter:720:720.jpeg?dr=10399&nonce=84125&refresh_token=05118aa7a7b44a43f792d1a09d7bfecf&x-expires=1740060000&x-signature=NKl%2Fc2Ma6bNAhN2pHpCRWflSejQ%3D&idc=no1a&ps=13740610&shcp=81f88b70&shp=a5d48078&t=4d5b0474", "commerceUserInfo": { "commerceUser": true, "category": "Education & Training", "categoryButton": false }, "privateAccount": false, "region": "US", "roomId": "", "ttSeller": false, "following": 4, "friends": 0, "fans": 2049, "heart": 135, "video": 0, "digg": 0 }, "input": "https://www.tiktok.com/@nasaofficial", } ] ``` ### Extract TikTok comments[​](#extract-tiktok-comments "Direct link to Extract TikTok comments") Retrieve comments from videos by calling [Apify's TikTok Comments Scraper](https://apify.com/clockworks/tiktok-comments-scraper). To set up this module, you will need to add TikTok video URLs to extract the comments from, the desired number of comments, and optionally, the maximum number of replies per comment. 
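The same setup can also be expressed programmatically. Here is a minimal sketch using `apify-client`; the input field names for the video URLs, comment count, and replies per comment are assumptions based on the module description and should be verified against the Actor's input schema.

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start Apify's TikTok Comments Scraper and wait for the run to finish.
// Field names below are illustrative only; confirm them in the Actor's input schema.
const run = await client.actor('clockworks/tiktok-comments-scraper').call({
    postURLs: ['https://www.tiktok.com/@apifyoffice/video/7338085038258457889'],
    commentsPerPost: 20,
    maxRepliesPerComment: 5,
});

// Extracted comments are available as dataset items.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```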
For each TikTok video, you will extract: * *Comment details*: comment text, timestamp, and number of likes. * *Commenter profile*: username, ID, and avatar URL. * *Engagement data*: number of replies. * *Post association*: URL of the TikTok video the comment belongs to. Comment data, shortened sample ``` [ { "text": "Free lunches??!!!", "diggCount": 1, "replyCommentTotal": 1, "createTimeISO": "2024-02-21T16:10:50.000Z", "uniqueId": "abdmohimnhareth99", "videoWebUrl": "https://www.tiktok.com/@apifyoffice/video/7338085038258457889", "uid": "7114813797776491525", "cid": "7338088354673640225", "avatarThumbnail": "https://p77-sign-va.tiktokcdn.com/tos-maliva-avt-0068/e678ece1460eac51f1c4ed95db9a8e31~tplv-tiktokx-cropcenter:100:100.jpg?dr=10399&nonce=21560&refresh_token=3d45927e8ec8daaf4c27956e2fdaa849&x-expires=1739973600&x-signature=aFYfAqAMHdHdad9pNzOgThjcgds%3D&idc=no1a&ps=13740610&shcp=ff37627b&shp=30310797&t=4d5b0474" }, { "text": "Every day🤭", "diggCount": 0, "replyCommentTotal": null, "createTimeISO": "2024-02-21T16:24:09.000Z", "uniqueId": "apifyoffice", "videoWebUrl": "https://www.tiktok.com/@apifyoffice/video/7338085038258457889", "uid": "7095709566285480965", "cid": "7338091744464978720", "avatarThumbnail": "https://p16-sign-useast2a.tiktokcdn.com/tos-useast2a-avt-0068-euttp/2c511269b14f70cca0c11c3285ddc668~tplv-tiktokx-cropcenter:100:100.jpg?dr=10399&nonce=11659&refresh_token=c2a577eebaa68fc73aac11e9b99fefcb&x-expires=1739973600&x-signature=LUTudhynytGwrfL9MKFHKO8v7EA%3D&idc=no1a&ps=13740610&shcp=ff37627b&shp=30310797&t=4d5b0474" }, ] ``` ### Extract TikTok hashtags[​](#extract-tiktok-hashtags "Direct link to Extract TikTok hashtags") Gather post data with [Apify's TikTok Hashtag Scraper](https://apify.com/clockworks/tiktok-hashtag-scraper). To set up this module, you will need to add the TikTok hashtags from which you want to extract videos and the desired number of videos per hashtag. For each TikTok hashtag, you will extract: * *All TikToks posted with chosen hashtags*: caption, video URL, number of plays, hearts, comments, shares, country of creation, timestamp, paid status, video and music metadata. * *Basic creator info from TikToks posted with chosen hashtags*: name, ID, avatar, bio, account status, total followers/following numbers, given/received likes count, etc. * *Total number of views for a chosen hashtag* Hashtag data, shortened sample ``` [ { "videoMeta.coverUrl": "https://p77-sign-va.tiktokcdn.com/obj/tos-maliva-p-0068/1824f891fd0e48e7bf46513f27383e20_1727638068?lk3s=b59d6b55&x-expires=1740060000&x-signature=PNotHaeJ5nqiyt6zbbZqi4RljzA%3D&shp=b59d6b55&shcp=-", "text": "y como es tu hijo?🥰#trendslab #CapCut #hijo #bebe #capcutamor #amordemivida #parati ", "diggCount": 56500, "shareCount": 5968, "playCount": 5500000, "commentCount": 0, "videoMeta.duration": 9, "isAd": false, "isMuted": false, "hashtags": [ { "id": "1662966768289798", "name": "trendslab", "title": "", "cover": "" }, { "id": "1663935709411330", "name": "capcut", "title": "CapCut is a new, easy-to-use video editing tool designed for mobile platforms. CapCut provides users with a wide range of video editing functions, filters, audio & visual effects, video templates, while keeping it free of charge and ads-free. Everyone can be a creator by using CapCut. 
\n\nStart creating your cool videos today: \nhttps://capcut.onelink.me/XKqI/228cad85", "cover": "" }, ] ``` ## Other scrapers available[​](#other-scrapers-available "Direct link to Other scrapers available") There are other native Make Apps powered by Apify. You can check out Apify Scraper for: * [Instagram Data](https://docs.apify.com/platform/integrations/make/instagram.md) * [Google Search](https://docs.apify.com/platform/integrations/make/search.md) * [Google Maps Emails Data](https://docs.apify.com/platform/integrations/make/maps.md) * [YouTube Data](https://docs.apify.com/platform/integrations/make/youtube.md) * [AI crawling](https://docs.apify.com/platform/integrations/make/ai-crawling.md) * [Amazon](https://docs.apify.com/platform/integrations/make/amazon.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). --- # Make - YouTube Actor integration ## Apify Scraper for YouTube Data[​](#apify-scraper-for-youtube-data "Direct link to Apify Scraper for YouTube Data") The YouTube Scraper module from [apify.com](https://apify.com) allows you to extract channel, video, streams, shorts, and search data from YouTube. To use this module, you need an [Apify account](https://console.apify.com) and an [API token](https://docs.apify.com/platform/integrations/api#api-token), which you can find in the Apify Console under **Settings > Integrations**. After connecting, you can automate data extraction and incorporate the results into your workflows. For more details, follow the tutorial below. ## Connect Apify Scraper for YouTube Data modules to Make[​](#connect-apify-scraper-for-youtube-data-modules-tomake "Direct link to Connect Apify Scraper for YouTube Data modules to Make") 1. Create an account at [Apify](https://console.apify.com/). You can sign up using your email, Gmail, or GitHub account. ![Sign up page](/assets/images/image-b4a29a36809c9d87da504217f8854791.png) 2. To connect your Apify account to Make, you can use an OAuth connection (recommended) or an Apify API token. To get the Apify API token, navigate to **[Settings > API & Integrations](https://console.apify.com/settings/integrations)** in the Apify Console. ![Apify Console token for Make.png](/assets/images/Apify_Console_token_for_Make-b9bb781ff306d44dcde51b1189d82e08.png) 3. Find your token under **Personal API tokens** section. You can also create a new API token with multiple customizable permissions by clicking on **+ Create a new token**. 4. Click the **Copy** icon next to your API token to copy it to your clipboard. Then, return to your Make scenario interface. ![Apify token on Make.png](/assets/images/Apify_token_on_Make-78f67b559503d92cffb17e5abffd18d2.png) 5. In Make, click **Add** to open the **Create a connection** dialog of the chosen Apify Scraper module. 6. In the **API token** field, paste the API token you copied from Apify. Provide a clear **Connection name**, and click **Save**. ![Make API token](/assets/images/image1-d89f6e0125cc13274da296fdec3262f0.png) Once connected, you can build workflows to automate YouTube data extraction and integrate results into your applications. ## Apify Scraper for YouTube Data module[​](#apify-scraper-for-youtube-data-module "Direct link to Apify Scraper for YouTube Data module") After connecting the app, you can use the Search module as a native scraper to extract public YouTube data. 
Here’s what you get: ### Extract YouTube data[​](#extract-youtube-data "Direct link to Extract YouTube data") Get data via [Apify's YouTube Scraper](https://apify.com/streamers/youtube-scraper). To do so, simply fill in the URLs of videos, streams, shorts, searches, or channels you want to gather information about. For YouTube URLs, you can extract: * *Basic channel details*: name, number of subscribers, total videos, location, social media links * *Video and search details*: author name, number of likes, comments, views, title, URL, subtitles, duration, release date Channel data sample ``` { "id": "HV6OlMPn5sI", "title": "Raimu - The Spirit Within 🍃 [lofi hip hop/relaxing beats]", "duration": "29:54", "channelName": "Lofi Girl", "channelUrl": "https://www.youtube.com/channel/UCSJ4gkVC6NrvII8umztf0Ow", "date": "10 months ago", "url": "https://www.youtube.com/watch?v=HV6OlMPn5sI", "viewCount": 410458, "fromYTUrl": "https://www.youtube.com/@LofiGirl/videos", "channelDescription": "\"That girl studying by the window non-stop\"\n\n🎧 | Listen on Spotify, Apple music and more\n→ https://bit.ly/lofigirl-playlists\n\n💬 | Join the Lofi Girl community \n→ https://bit.ly/lofigirl-discord\n→ https://bit.ly/lofigirl-reddit\n\n🌎 | Lofi Girl on all social media\n→ https://bit.ly/lofigirl-sociaI", "channelDescriptionLinks": [ { "text": "Discord", "url": "https://discord.com/invite/hUKvJnw" }, { "text": "Tiktok", "url": "https://www.tiktok.com/@lofigirl/" }, { "text": "Instagram", "url": "https://www.instagram.com/lofigirl/" }, { "text": "Twitter", "url": "https://twitter.com/lofigirl" }, { "text": "Spotify", "url": "https://open.spotify.com/playlist/0vvXsWCC9xrXsKd4FyS8kM" }, { "text": "Apple music", "url": "https://music.apple.com/fr/playlist/lofi-hip-hop-music-beats-to-relax-study-to/pl.u-2aoq8mqiGo7J6A0" }, { "text": "Merch", "url": "https://lofigirlshop.com/" } ], "channelJoinedDate": "Mar 18, 2015", "channelLocation": "France", "channelTotalVideos": 409, "channelTotalViews": "1,710,167,563", "numberOfSubscribers": 13100000, "isMonetized": true, "inputChannelUrl": "https://www.youtube.com/@LofiGirl/about" } ``` Video data sample ``` { "title": "Stromae - Santé (Live From The Tonight Show Starring Jimmy Fallon)", "id": "CW7gfrTlr0Y", "url": "https://www.youtube.com/watch?v=CW7gfrTlr0Y", "thumbnailUrl": "https://i.ytimg.com/vi/CW7gfrTlr0Y/maxresdefault.jpg", "viewCount": 35582192, "date": "2021-12-21", "likes": 512238, "location": null, "channelName": "StromaeVEVO", "channelUrl": "http://www.youtube.com/@StromaeVEVO", "numberOfSubscribers": 6930000, "duration": "00:03:17", "commentsCount": 14, "text": "Stromae - Santé (Live From The Tonight Show Starring Jimmy Fallon on NBC)\nListen to \"La solassitude\" here: https://stromae.lnk.to/la-solassitude\nOrder my new album \"Multitude\" here: https://stromae.lnk.to/multitudeID\n--\nhttps://www.stromae.com/fr/\nhttps://www.tiktok.com/@stromae\nhttps://www.facebook.com/stromae\nhttps://www.instagram.com/stromae\nhttps://twitter.com/stromae\n / @stromae \n--\nMosaert\nPaul Van Haver (Stromae) : creative direction\nCoralie Barbier : creative direction and fashion design\nLuc Van Haver : creative direction\nGaëlle Birenbaum : communication & project manager\nEvence Guinet-Dannonay : executive assistant\nGaëlle Cools : content & community manager\nRoxane Hauzeur : textile product manager\nDiego Mitrugno : office manager\n\nPartizan\nProducer : Auguste Bas\nLine Producer : Zélie Deletrain \nProduction coordinator : Lou Bardou-Jacquet \nProduction assistant : Hugo 
Dao\nProduction assistant : Adrien Bossa\nProduction assistant : Basile Jan\n\nDirector : Julien Soulier \n1st assistant director : Mathieu Perez \n2nd assistant director : Leila Gentet \n\nDirector of Photography : Kaname Onoyama \n1st assistant operator : Micaela albanese\n2nd assistant operator : Florian Rey \nDoP Mantee : Zhaopeng Zhong\nMaking of : Adryen Barreyat\n\nHead Gaffer : Sophie Delorme \nElectrician : Sacha Brauman\nElectrician: Tom Devianne\nLighting designer : Aurélien Dayot\nPrelight electrician : Emmanuel Malherbe\n\nHead Grip : Dioclès Desrieux \nBest Boy grip : Eloi Perrin \nPrelight Grip : Vladimir Duranovic \n\nLocation manager : Léo Rodriguez \nLocation manager assistant : Grégoire Décatoire \nLocation manager assistant : Mathieu Barazer \n\nStylist : Sandra Gonzalez \nStylist assistant : Sarah Bernard\n\nMake Up and Hair Artist : Camille Roche \nMake up Artist : Carla Lange \nMake Up and Hair Artist : Victoria Pinto \n\nSound Engineer : Lionel Capouillez \nBackliner : Nicolas Fradet \n\nProduction Designer : Penelope Hemon \n\nChoreographer : Marion Motin \nChoreographer assistant : Jeanne Michel \n\nPost production : Royal Post\nPost-Production Director : Cindy Durand Paucsik\nEditor : Marco Novoa\nEditor assistant : Térence Nury \nGrader : Vincent Amor\nVFX Supervisor : Julien Laudicina\nGraphic designer : Quentin Mesureux \nGraphic designer : Lucas Ponçon \nFilm Lab Assistant : Hadrian Kalmbach\n\nMusicians:\nFlorian Rossi \nManoli Avgoustinatos\nSimon Schoovaerts \nYoshi Masuda \n\nDancers: \nJuliana Casas\nLydie Alberto \nRobinson Cassarino\nYohann Hebi daher\nChris Fargeot \nAudrey Hurtis \nElodie Hilsum\nDaya jones \nThéophile Bensusan \nBrandon Masele \nJean Michel Premier \nKevin Bago\nAchraf Bouzefour\nPauline Journe \nCaroline Bouquet \nManon Bouquet\nAshley Biscette \nJocelyn Laurent \nOumrata Konan\nKylian Toto\nEnzo Lesne \nSalomon Mpondo-Dicka\nSandrine Monar \nKarl-Ruben Noel\n\n#Stromae #Sante #JimmyFallon", "descriptionLinks": [ { "url": "https://stromae.lnk.to/la-solassitude", "text": "https://stromae.lnk.to/la-solassitude" }, { "url": "https://stromae.lnk.to/multitudeID", "text": "https://stromae.lnk.to/multitudeID" }, { "url": "https://www.stromae.com/fr/", "text": "https://www.stromae.com/fr/" }, { "url": "https://www.tiktok.com/@stromae", "text": "https://www.tiktok.com/@stromae" }, { "url": "https://www.facebook.com/stromae", "text": "https://www.facebook.com/stromae" }, { "url": "https://www.instagram.com/stromae", "text": "https://www.instagram.com/stromae" }, { "url": "https://twitter.com/stromae", "text": "https://twitter.com/stromae" }, { "url": "https://www.youtube.com/channel/UCXF0YCBWewAj3RytJUAivGA", "text": " / @stromae " }, { "url": "https://www.youtube.com/hashtag/stromae", "text": "#Stromae" }, { "url": "https://www.youtube.com/hashtag/sante", "text": "#Sante" }, { "url": "https://www.youtube.com/hashtag/jimmyfallon", "text": "#JimmyFallon" } ], "subtitles": null, "comments": null, "isMonetized": true, "commentsTurnedOff": false } ``` Search results data sample ``` { "id": "CwRMBKk8St0", "title": "LET'S ARGUE: Beyoncé Fails the Bechdel Test!", "duration": "13:48", "channelName": "fantano", "channelUrl": "https://www.youtube.com/@fantano", "date": "5 years ago", "url": "https://www.youtube.com/watch?v=CwRMBKk8St0", "viewCount": 635379, "fromYTUrl": "https://www.youtube.com/results?search_query=bechdel+test" }, { "id": "k86TWcdjQuM", "title": "This scene is the anti-Bechdel test", "duration": "1:00", "channelName": "Newbie Star 
Trek", "channelUrl": "https://www.youtube.com/@NewbieStarTrek", "date": "2 months ago", "url": "https://www.youtube.com/shorts/k86TWcdjQuM", "viewCount": 6907, "fromYTUrl": "https://www.youtube.com/results?search_query=bechdel+test" }, { "id": "vKlakrW50QU", "title": "Inside Job passes the bechdel test", "duration": "0:15", "channelName": "Random Daydreamer", "channelUrl": "https://www.youtube.com/@randomdaydreamer9970", "date": "1 year ago", "url": "https://www.youtube.com/watch?v=vKlakrW50QU", "viewCount": 121824, "fromYTUrl": "https://www.youtube.com/results?search_query=bechdel+test" }, { "id": "mL5hgNu4y3A", "title": "Jessica Chastain Describes the Importance of the 'Bechdel Test'", "duration": "3:05", "channelName": "theoffcamerashow", "channelUrl": "https://www.youtube.com/@theoffcamerashow", "date": "2 years ago", "url": "https://www.youtube.com/watch?v=mL5hgNu4y3A", "viewCount": 24145, "fromYTUrl": "https://www.youtube.com/results?search_query=bechdel+test" } ``` ## Other scrapers available[​](#other-scrapers-available "Direct link to Other scrapers available") There are other native Make Apps powered by Apify. You can check out Apify Scraper for: * [Instagram Data](https://docs.apify.com/platform/integrations/make/instagram.md) * [TikTok Data](https://docs.apify.com/platform/integrations/make/tiktok.md) * [Google Search](https://docs.apify.com/platform/integrations/make/search.md) * [Google Maps Emails Data](https://docs.apify.com/platform/integrations/make/maps.md) * [AI crawling](https://docs.apify.com/platform/integrations/make/ai-crawling.md) * [Amazon](https://docs.apify.com/platform/integrations/make/amazon.md) And more! Because you can access any of our 4,500+ scrapers on Apify Store by using the [general Apify connections](https://www.make.com/en/integrations/apify). --- # Mastra MCP integration **Learn how to build AI agents with Mastra and Apify Actors MCP Server.** *** ## What is Mastra[​](#what-is-mastra "Direct link to What is Mastra") [Mastra](https://mastra.ai) is an open-source TypeScript framework for building AI applications efficiently. It provides essential tools like agents, workflows, retrieval-augmented generation (RAG), integrations, and evaluations. Supporting any LLM (e.g., GPT-4, Claude, Gemini). You can run it locally or deploy it to a serverless cloud like [Apify](https://apify.com). Explore Mastra Check out the [Mastra docs](https://mastra.ai/docs) for more information. ## What is MCP server[​](#what-is-mcp-server "Direct link to What is MCP server") A [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server exposes specific data sources or tools to agents via a standardized protocol. It acts as a bridge, connecting large language models (LLMs) to external systems like databases, APIs, or local filesystems. Built on a client-server architecture, MCP servers enable secure, real-time interaction, allowing agents to fetch context or execute actions without custom integrations. Think of it as a modular plugin system for agents, simplifying how they access and process data. Apify provides [Actors MCP Server](https://apify.com/apify/actors-mcp-server) to expose [Apify Actors](https://docs.apify.com/platform/actors) from the [Apify Store](https://apify.com/store) as tools via the MCP protocol. 
## How to use Apify with Mastra via MCP[​](#how-to-use-apify-with-mastra-via-mcp "Direct link to How to use Apify with Mastra via MCP") This guide demonstrates how to integrate Apify Actors with Mastra by building an agent that uses the [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor to search Google for TikTok profiles and the [TikTok Data Extractor](https://apify.com/clockworks/free-tiktok-scraper) Actor to extract and analyze data from the TikTok profiles via MCP. ### Prerequisites[​](#prerequisites "Direct link to Prerequisites") * *Apify API token*: To use Apify Actors, you need an Apify API token. Learn how to obtain it in the [Apify documentation](https://docs.apify.com/platform/integrations/api). * *LLM provider API key*: To power the agents, you need an LLM provider API key. For example, get one from the [OpenAI](https://platform.openai.com/account/api-keys) or [Anthropic](https://console.anthropic.com/settings/keys). * *Node.js*: Ensure you have Node.js installed. * *Packages*: Install the following packages: ``` npm install @mastra/core @mastra/mcp @ai-sdk/openai ``` ### Building the TikTok profile search and analysis agent[​](#building-the-tiktok-profile-search-and-analysis-agent "Direct link to Building the TikTok profile search and analysis agent") First, import all required packages: ``` import { Agent } from '@mastra/core/agent'; import { MastraMCPClient } from '@mastra/mcp'; import { openai } from '@ai-sdk/openai'; // For Anthropic use // import { anthropic } from '@ai-sdk/anthropic'; ``` Next, set the environment variables for the Apify API token and OpenAI API key: ``` process.env.APIFY_TOKEN = "your-apify-token"; process.env.OPENAI_API_KEY = "your-openai-api-key"; // For Anthropic use // process.env.ANTHROPIC_API_KEY = "your-anthropic-api-key"; ``` Instantiate the Mastra MCP client: ``` const mcpClient = new MastraMCPClient({ name: 'apify-client', server: { url: new URL('https://mcp.apify.com/sse'), requestInit: { headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` } }, // The EventSource package augments EventSourceInit with a "fetch" parameter. // You can use this to set additional headers on the outgoing request. // Based on this example: https://github.com/modelcontextprotocol/typescript-sdk/issues/118 eventSourceInit: { async fetch(input: Request | URL | string, init?: RequestInit) { const headers = new Headers(init?.headers || {}); headers.set('authorization', `Bearer ${process.env.APIFY_TOKEN}`); return fetch(input, { ...init, headers }); } } }, timeout: 300_000, // 5 minutes tool call timeout }); ``` Connect to the MCP server and fetch the tools: ``` console.log('Connecting to Mastra MCP server...'); await mcpClient.connect(); console.log('Fetching tools...'); const tools = await mcpClient.tools(); ``` Instantiate the agent with the OpenAI model: ``` const agent = new Agent({ name: 'Social Media Agent', instructions: 'You’re a social media data extractor. 
Find TikTok URLs and analyze profiles with precision.', // You can swap to any other AI-SDK LLM provider model: openai('gpt-4o-mini') }); ``` Generate a response using the agent and the Apify tools: ``` const prompt = 'Search the web for the OpenAI TikTok profile URL, then extract and summarize its data.'; console.log(`Generating response for prompt: ${prompt}`); const response = await agent.generate(prompt, { toolsets: { apify: tools } }); ``` Print the response and disconnect from the MCP server: ``` console.log(response.text); await mcpClient.disconnect(); ``` Use any Apify Actor Since it uses the [Apify MCP Server](https://mcp.apify.com), swap in any Apify Actor from the [Apify Store](https://apify.com/store) by updating the startup request’s `actors` parameter. No other changes are needed in the agent code. Run the agent: ``` npx tsx mastra-agent.ts ``` Search and analysis may take some time The agent's execution may take some time as it searches the web for the OpenAI TikTok profile and extracts data from it. You will see the agent’s output in the console, showing the results of the search and analysis. ``` Connecting to Mastra MCP server... Fetching tools... Generating response for prompt: Search the web for the OpenAI TikTok profile URL, then extract and summarize its data. ### OpenAI TikTok Profile Summary - **Profile URL**: [OpenAI on TikTok](https://www.tiktok.com/@openai?lang=en) - **Followers**: 608,100 - **Likes**: 3.4 million - **Videos Posted**: 156 - **Bio**: "low key research previews" ... ``` If you want to test the whole example, create a new file, `mastra-agent.ts`, and copy the full code into it: ``` import { Agent } from '@mastra/core/agent'; import { MastraMCPClient } from '@mastra/mcp'; import { openai } from '@ai-sdk/openai'; // For Anthropic use // import { anthropic } from '@ai-sdk/anthropic'; process.env.APIFY_TOKEN = "your-apify-token"; process.env.OPENAI_API_KEY = "your-openai-api-key"; // For Anthropic use // process.env.ANTHROPIC_API_KEY = "your-anthropic-api-key"; const mcpClient = new MastraMCPClient({ name: 'apify-client', server: { url: new URL('https://mcp.apify.com/sse'), requestInit: { headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` } }, // The EventSource package augments EventSourceInit with a "fetch" parameter. // You can use this to set additional headers on the outgoing request. // Based on this example: https://github.com/modelcontextprotocol/typescript-sdk/issues/118 eventSourceInit: { async fetch(input: Request | URL | string, init?: RequestInit) { const headers = new Headers(init?.headers || {}); headers.set('authorization', `Bearer ${process.env.APIFY_TOKEN}`); return fetch(input, { ...init, headers }); } } }, timeout: 300_000, // 5 minutes tool call timeout }); console.log('Connecting to Mastra MCP server...'); await mcpClient.connect(); console.log('Fetching tools...'); const tools = await mcpClient.tools(); const agent = new Agent({ name: 'Social Media Agent', instructions: 'You’re a social media data extractor. 
Find TikTok URLs and analyze profiles with precision.', // You can swap to any other AI-SDK LLM provider model: openai('gpt-4o-mini') }); const prompt = 'Search the web for the OpenAI TikTok profile URL, then extract and summarize its data.'; console.log(`Generating response for prompt: ${prompt}`); const response = await agent.generate(prompt, { toolsets: { apify: tools } }); console.log(response.text); await mcpClient.disconnect(); ``` ## Resources[​](#resources "Direct link to Resources") * [Apify Actors](https://docs.apify.com/platform/actors) * [Mastra Documentation](https://mastra.ai/docs) * [Apify MCP Server](https://mcp.apify.com) * [How to use MCP with Apify Actors](https://blog.apify.com/how-to-use-mcp/) * [Apify Store](https://apify.com/store) * [What are AI Agents?](https://blog.apify.com/what-are-ai-agents/) * [How to Build an AI Agent](https://blog.apify.com/how-to-build-an-ai-agent/) --- # Apify MCP server The *Apify Model Context Protocol (MCP) Server* allows AI applications to connect to Apify’s extensive library of Actors as tools to perform web scraping, data extraction, or other automation tasks in real time. ![Apify MCP Server](/assets/images/apify_mcp_server-d7b7369162651886da809d991d23e26e.png) ## Quickstart[​](#quickstart "Direct link to Quickstart") You can use the Apify MCP Server in two ways: * *HTTPS Endpoint* `mcp.apify.com`: Connect your MCP client through OAuth or by including an `Authorization: Bearer <token>` header in your requests. * `https://mcp.apify.com` for streamable transport (recommended) * `https://mcp.apify.com/sse` for SSE transport (legacy) * *Standard Input/Output (stdio)*: Ideal for local integrations and command-line tools such as the Claude for Desktop client. * Set the MCP client server command to `npx @apify/actors-mcp-server` and the environment variable `APIFY_TOKEN` to your Apify API token * See `npx @apify/actors-mcp-server --help` for more options You can also use the legacy option of running the [Apify Actors MCP Server](https://apify.com/apify/actors-mcp-server) as an Actor. Run instantly ⚡ To install the Apify MCP Server in [Claude for Desktop](https://claude.ai/download) with one click, download and run the latest [Apify MCP Server DXT file](https://github.com/apify/actors-mcp-server/releases/latest/download/actors-mcp-server.dxt). ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") Before you start, make sure you have the following: 1. *An Apify account:* Sign up for a free Apify account if you don’t have one. 2. *Apify API Token:* Get your personal API token from the **Integrations** section in [Apify Console](https://console.apify.com/account#/integrations). This token will be used to authorize the MCP server to run Actors on your behalf. 3. *MCP client:* An AI agent or client that supports MCP. This could be Anthropic Claude for Desktop, a VS Code extension with MCP support, Apify’s web-based Tester MCP Client, or any custom client implementation. See supported MCP clients in the [official documentation](https://modelcontextprotocol.io/clients). ## Example usage (Streamable HTTP with OAuth)[​](#example-usage-streamable-http-with-oauth "Direct link to Example usage (Streamable HTTP with OAuth)") We recommend connecting through OAuth for a secure and simple authentication process. During setup, provide the server URL `https://mcp.apify.com`. You will then be redirected to your browser to sign in to your Apify account and approve the connection. The configuration steps may vary slightly depending on your MCP client. 
``` { "mcpServers": { "apify": { "url": "https://mcp.apify.com" } } } ``` ## Example usage (local stdio with Claude for Desktop)[​](#example-usage-local-stdio-with-claude-for-desktop "Direct link to Example usage (local stdio with Claude for Desktop)") Let’s walk through an example of using Claude for Desktop with the Apify MCP Server: 1. *Configure Claude for Desktop:* Enable MCP servers via the **Developer settings**. You need to add an entry for the Apify MCP server. For instance, in Claude’s config file, under `mcpServers`, add an entry like: ``` { "mcpServers": { "actors-mcp-server": { "command": "npx", "args": ["-y", "@apify/actors-mcp-server"], "env": { "APIFY_TOKEN": "YOUR_APIFY_TOKEN" } } } } ``` This tells Claude to spawn the Apify MCP Server (via the [Actors MCP Server](https://www.npmjs.com/package/@apify/actors-mcp-server) NPM package with your API token (on the first run, it will download the package automatically). 2. *Launch Claude and connect:* After updating the config, restart Claude for Desktop. If successful, Claude will show a “plugin” (often indicated by a plug icon 🔌) signifying it connected to the Apify Actors MCP server. 3. *Use the Actors in conversation:* You can chat with Claude and ask it to use Apify Actors. For example: *“What Apify Actors can I use?”* Claude will list available tools via the MCP server. If none are pre-loaded, it may show defaults or guide you to find more. Web-based alternative If you prefer not to set up Claude desktop, you can achieve a similar result using [Apify’s Tester MCP Client](https://apify.com/jiri.spilka/tester-mcp-client), which provides a web UI to test the MCP server.) ![Apify Tester MCP Client](/assets/images/chat-ui-467ad792ee3008b47cd960befc786e7e.webp) ## Interact with the MCP server over SSE[​](#interact-with-the-mcp-server-over-sse "Direct link to Interact with the MCP server over SSE") You can interact with the server through Server-Sent Events (SSE) to send messages and receive responses. In the client settings, you need to provide server configuration: ``` { "mcpServers": { "apify": { "type": "sse", "url": "https://mcp.apify.com/sse", "headers": { "Authorization": "Bearer your-apify-token" } } } } ``` ## Adding Multiple Actors[​](#adding-multiple-actors "Direct link to Adding Multiple Actors") By default, the main Actors MCP Server starts with a single default [RAG Web Browser Actor](https://apify.com/apify/rag-web-browser). However, you can fully customize which Actors are available: * *Dynamic adding during a session:* If your client supports it, the agent itself can add Actors dynamically by name (using the `add-actor` operation) at runtime. For example, after using `search-actors` to find an Actor’s name, calling `add-actor` with that name will load it. *Tools for adding and removing Actors are enabled by default.* You can disable these tools by setting the parameter `?enableAddingActors=false` in the MCP Server URL, or with the CLI flag `--enable-adding-actors=false` (can also be set in Claude for Desktop config args as `--enable-adding-actors=false`). Not all MCP client frameworks allow dynamic tool addition at runtime, but Apify’s own tester client does, if adding Actors is enabled. 
* *Via URL:* If you are using the Streamable HTTP or SSE protocol, you can add the `actors` query parameter with Actor names separated by commas: ``` { "mcpServers": { "Apify": { "url": "https://mcp.apify.com/?actors=lukaskrivka/google-maps-with-contact-details,apify/instagram-scraper" } } } ``` * *Via config file:* For a local stdio connection, you can specify which Actors should be immediately available in your JSON configuration. Add the Actors as a comma-separated list in the `--actors` parameter, as shown in the example below. This pre-loads your selected tools without requiring discovery during conversations, which is ideal for workflows with predictable tool needs. ``` { "mcpServers": { "actors-mcp-server": { "command": "npx", "args": [ "-y", "@apify/actors-mcp-server", "--actors", "lukaskrivka/google-maps-with-contact-details,apify/instagram-scraper" ], "env": { "APIFY_TOKEN": "YOUR_APIFY_TOKEN" } } } } ``` In summary, you can start with a broad set (everything open and discoverable) or a narrow set (just what you need) and even expand tools on the fly, giving your agent a lot of flexibility without overwhelming it initially. ## Configure tools for the MCP server[​](#configure-tools-for-the-mcp-server "Direct link to Configure tools for the MCP server") You can customize the MCP server’s available tools by adding query parameters to the server URL or by passing arguments to the CLI. This allows you to enable or disable specific tool categories and control which tools are available. The following tool categories are available: * *Actor discovery and management* (default, always enabled): Search for [Actors](https://docs.apify.com/platform/actors) (`search-actors`), view details (`get-actor-details`), and dynamically add them (`add-actor`). * *docs* (default, can be disabled): Search Apify documentation (`search-apify-docs`) and fetch specific documents (`fetch-apify-docs`). * *runs* (optional): Get a list of your [Actor runs](https://docs.apify.com/platform/actors/running/runs-and-builds#runs) (`get-actor-run-list`), specific run details (`get-actor-run`), and logs from a specific Actor run (`get-actor-log`). * *storage* (optional): Access [datasets](https://docs.apify.com/platform/storage/dataset) and [key-value stores](https://docs.apify.com/platform/storage/key-value-store), including their records (`get-dataset`, `get-dataset-items`, `get-dataset-list`, `get-key-value-store`, `get-key-value-store-keys`, `get-key-value-store-record`, `get-key-value-store-records`). * *preview* (optional): Experimental tools in preview mode. Call any Actor using the API (`call-actor`). The *Actor discovery and management* tools are always present and cannot be disabled. The *docs* tools are enabled by default but can be switched off using the `tools` parameter. ### Configure mcp.apify.com using query parameters[​](#configure-mcpapifycom-using-query-parameters "Direct link to Configure mcp.apify.com using query parameters") Use the `tools` query parameter to enable or disable specific tool categories. For example, to enable only the `runs` and `storage` tools, you can use: ``` https://mcp.apify.com/?tools=runs,storage ``` The server will expose all *Actor discovery and management* tools, as well as `runs` and `storage`. The list of tools you can enable/disable is as follows: `docs`, `runs`, `storage`, and `preview`. 
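If your MCP client is configured through a JSON file rather than a raw URL, the same query parameter can simply be appended to the server URL in that configuration. A minimal sketch (the `apify` server name is just illustrative):

```
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com/?tools=runs,storage"
    }
  }
}
```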
### Configure stdio server using CLI arguments[​](#configure-stdio-server-using-cli-arguments "Direct link to Configure stdio server using CLI arguments") When running the MCP server via the command line, you can specify the tools using the `--tools` parameter. For example, to enable only the `runs` and `storage` tools, you can run: ``` npx @apify/actors-mcp-server --tools runs,storage ``` ## Dynamic discovery of Actors[​](#dynamic-discovery-of-actors "Direct link to Dynamic discovery of Actors") One of the powerful features of MCP with Apify is **dynamic Actor tooling** – the ability for an AI agent to find new tools (Actors) as needed and incorporate them. Supported dynamic tool operations (enabled by default): * `search-actors`: Find available Actors by keyword or category. * `get-actor-details`: View details and usage information for a specific Actor. * `add-actor`: Dynamically add an Actor as a tool for the current session, making it available for use. These operations allow your agent to expand its toolset on demand, without requiring a server restart or manual configuration. Dynamic tool addition can be disabled using the `?enableAddingActors=false` parameter. Not all MCP clients support dynamic tool addition. Check your client’s documentation or settings to confirm this feature is available. ## Rate limits[​](#rate-limits "Direct link to Rate limits") The Apify MCP server has a rate limit of *30 requests per second* per user. If you exceed this limit, you will receive a `429 Too Many Requests` response. ## Troubleshooting[​](#troubleshooting "Direct link to Troubleshooting") * *Authorization (API Token):* If the MCP server isn’t executing Actors, ensure you provided a correct Apify API token. Without a valid `APIFY_TOKEN`, the server cannot start Actor runs. Always set the `APIFY_TOKEN` environment variable when running locally. * *Ensure latest version:* If running via NPM, always use the latest version of `@apify/actors-mcp-server` for the newest features and fixes. You can append `@latest` when installing or in your config args to ensure this. * *Node.js environment:* If running the server locally, make sure Node.js is installed and up to date (`node -v`). The MCP server requires Node.js v18+. * *No response or long delay:* Keep in mind that when an Actor tool is called, it may take some time to complete (depending on the task). If nothing is coming back, check the Actor’s logs in the Apify Console — the Actor might be waiting on a long operation or input. ## Learn more[​](#learn-more "Direct link to Learn more") * [Model Context Protocol (MCP)](https://modelcontextprotocol.io/introduction): Learn about the open standard on the official MCP website – understanding the protocol can help you build custom agents. * [Apify Actors MCP Server](https://apify.com/apify/actors-mcp-server): The README for the Apify MCP Server Actor (available on Apify Store as `apify/actors-mcp-server`) provides technical details on implementation and advanced usage. * [Apify Tester MCP Client](https://apify.com/jiri.spilka/tester-mcp-client): A specialized client Actor (`jiri.spilka/tester-mcp-client`) that you can run to simulate an AI agent in your browser. Useful for testing your setup with a chat UI. * [How to use MCP with Apify Actors](https://blog.apify.com/how-to-use-mcp/): Learn how to expose over 5,000 Apify Actors to AI agents with Claude and LangGraph, and configure MCP clients and servers. 
* [Apify MCP Server Tutorial](https://www.youtube.com/watch?v=BKu8H91uCTg): Integrate thousands of Apify Actors and Agents with Claude. --- # Milvus integration **Learn how to integrate Apify with Milvus (Zilliz) to save data scraped from websites into the Milvus vector database.** *** [Milvus](https://milvus.io/) is an open-source vector database optimized for performing similarity searches on large datasets of high-dimensional vectors. Its focus on efficient vector similarity search allows for the creation of powerful and scalable retrieval systems. The Apify integration for Milvus allows exporting results from Apify Actors and Dataset items into a Milvus collection. It can also be connected to a managed Milvus instance on [Zilliz Cloud](https://cloud.zilliz.com). ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") Before you begin, ensure that you have the following: * A Milvus/Zilliz database uniform resource identifier (URI) and token to set up the client. Optionally, you can use a username and password in the URI. You can run Milvus on Docker or Kubernetes, but in this example, we'll use the hosted Milvus service at [Zilliz Cloud](https://cloud.zilliz.com). * An [OpenAI API key](https://openai.com/index/openai-api/) to compute text embeddings. * An [Apify API token](https://docs.apify.com/platform/integrations/api#api-token) to access [Apify Actors](https://apify.com/store). ### How to set up Milvus/Zilliz database[​](#how-to-set-up-milvuszilliz-database "Direct link to How to set up Milvus/Zilliz database") 1. Sign up or log in to your Zilliz account and create a new cluster. 2. Find the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#cluster-details) in Zilliz Cloud. Note that the collection does not need to exist beforehand. It will be automatically created when data is uploaded to the database. Once the cluster is ready and you have the `URI` and `Token`, you can set up the integration with Apify. ### Integration Methods[​](#integration-methods "Direct link to Integration Methods") You can integrate Apify with Milvus using either the Apify Console or the Apify Python SDK. Website Content Crawler usage These examples use the Website Content Crawler Actor, which performs deep website crawling, cleans HTML by removing modals and navigation elements, and converts the content into Markdown. #### Apify Console[​](#apify-console "Direct link to Apify Console") 1. Set up the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor in the [Apify Console](https://console.apify.com). Refer to this guide on how to set up [website content crawl for your project](https://blog.apify.com/talk-to-your-website-with-large-language-models/). 2. After setting up the crawler, go to the **integration** section, select **Connect Actor or Task**, and search for the Milvus integration. 3. Select when to trigger this integration (typically when a run succeeds) and fill in all the required fields. If you haven't created a collection, it will be created automatically. You can learn more about the input parameters at the [Milvus integration input schema](https://apify.com/apify/milvus-integration/input-schema). * For a detailed explanation of the input parameters, including dataset settings, incremental updates, and examples, see the [Milvus integration description](https://apify.com/apify/milvus-integration). 
* For an explanation on how to combine Actors to accomplish more complex tasks, refer to the guide on [Actor-to-Actor](https://blog.apify.com/connecting-scrapers-apify-integration/) integrations. #### Python[​](#python "Direct link to Python") Another way to interact with Milvus is through the [Apify Python SDK](https://docs.apify.com/sdk/python/). 1. Install the Apify Python SDK by running the following command: ``` pip install apify-client ``` 2. Create a Python script and import all the necessary modules: ``` from apify_client import ApifyClient APIFY_API_TOKEN = "YOUR-APIFY-TOKEN" OPENAI_API_KEY = "YOUR-OPENAI-API-KEY" MILVUS_COLLECTION_NAME = "YOUR-MILVUS-COLLECTION-NAME" MILVUS_URI = "YOUR-MILVUS-URI" MILVUS_TOKEN = "YOUR-MILVUS-TOKEN" client = ApifyClient(APIFY_API_TOKEN) ``` 3. Call the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to crawl the Milvus documentation and Zilliz website and extract text content from the web pages: ``` actor_call = client.actor("apify/website-content-crawler").call( run_input={"maxCrawlPages": 10, "startUrls": [{"url": "https://milvus.io/"}, {"url": "https://zilliz.com/"}]} ) ``` 4. Call Apify's Milvus integration and store all data in the Milvus Vector Database: ``` milvus_integration_inputs = { "milvusUri": MILVUS_URI, "milvusToken": MILVUS_TOKEN, "milvusCollectionName": MILVUS_COLLECTION_NAME, "datasetFields": ["text"], "datasetId": actor_call["defaultDatasetId"], "deltaUpdatesPrimaryDatasetFields": ["url"], "expiredObjectDeletionPeriodDays": 30, "embeddingsApiKey": OPENAI_API_KEY, "embeddingsProvider": "OpenAI", } actor_call = client.actor("apify/milvus-integration").call(run_input=milvus_integration_inputs) ``` Congratulations! You've successfully integrated Apify with Milvus, and the scraped data is now stored in your Milvus database. For a complete example of Retrieval-Augmented Generation (RAG), check out the Additional Resources below. ## Additional Resources[​](#additional-resources "Direct link to Additional Resources") * [Apify Milvus Integration](https://apify.com/apify/milvus-integration) * [Milvus documentation](https://milvus.io/docs) * [Retrieval-Augmented Generation: Crawling Websites with Apify and Saving Data to Milvus for Question Answering](https://milvus.io/docs/apify_milvus_rag.md) --- # n8n integration **Connect Apify with n8n to automate workflows by running Actors, extracting structured data, and responding to Actor or task events.** *** [n8n](https://n8n.io/) is an open source, fair-code licensed tool for workflow automation. With the [Apify integration for n8n](https://github.com/apify/n8n-nodes-apify), you can connect Apify Actors and storage to hundreds of services. You can run scrapers, extract data, and trigger workflows based on Actor or task events. In this guide, you'll learn how to install the Apify node, set up authentication, and incorporate it into your n8n workflows as either a trigger or an action. ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") Before you begin, make sure you have: * An [Apify account](https://console.apify.com/) * An [n8n instance](https://docs.n8n.io/getting-started/) (self‑hosted or cloud) ## Install the Apify Node (self-hosted)[​](#install-the-apify-node-self-hosted "Direct link to Install the Apify Node (self-hosted)") If you're running a self-hosted n8n instance, you can install the Apify community node directly from the editor. This process adds the node to your available tools, enabling Apify operations in workflows. 1. 
Open your n8n instance. 2. Go to **Settings > Community Nodes**. 3. Select **Install**. 4. Enter the npm package name: `@apify/n8n-nodes-apify` (for the latest version). To install a specific [version](https://www.npmjs.com/package/@apify/n8n-nodes-apify?activeTab=versions), enter e.g. `@apify/n8n-nodes-apify@0.4.4`. 5. Agree to the [risks](https://docs.n8n.io/integrations/community-nodes/risks/) of using community nodes and select **Install**. 6. You can now use the node in your workflows. ![Apify Install Node](/assets/images/n8n-install-node-self-hosted-b2015cc6380ce3461e1b212390e654e7.png) ## Install the Apify Node (n8n Cloud)[​](#install-the-apify-node-n8n-cloud "Direct link to Install the Apify Node (n8n Cloud)") For n8n Cloud users, installation is even simpler and doesn't require manual package entry. Just search for and add the node from the canvas. 1. Go to the **Canvas** and open the **nodes panel** 2. Search for **Apify** in the community node registry 3. Click **Install node** to add the Apify node to your instance ![Apify Install Node](/assets/images/n8n-install-node-cloud-13764ea6e0d983273c4c67226f1dcecc.png) Verified community nodes visibility On n8n Cloud, instance owners can toggle visibility of verified community nodes in the Cloud Admin Panel. Ensure this setting is enabled to install the Apify node. Once installed, the next step is authentication. ## Authentication[​](#authentication "Direct link to Authentication") The Apify node offers two authentication methods to securely connect to your Apify account. Choose based on your setup: the API key works for both self-hosted and cloud instances, while OAuth2 is cloud-only. ### API Key (cloud & self-hosted instance)[​](#api-key-cloud--self-hosted-instance "Direct link to API Key (cloud & self-hosted instance)") 1. In the n8n Editor UI, click on **Create Credential**. 2. Search for Apify API and click **Continue**. 3. Enter your Apify API token (you can find it in the [Apify Console](https://console.apify.com/settings/integrations)). 4. Click **Save**. ![Apify Auth](/assets/images/n8n-api-auth-22c246a45b472d054d686e3934ae0fb0.png) ### OAuth2 (cloud instance only)[​](#oauth2-cloud-instance-only "Direct link to OAuth2 (cloud instance only)") 1. In n8n Cloud, select **Create Credential**. 2. Search for Apify OAuth2 API and select **Continue**. 3. Select **Connect my account** and authorize with your Apify account. 4. n8n automatically retrieves and stores the OAuth2 tokens. ![Apify Auth](/assets/images/n8n-oauth-50be70f1893c39c34c87f81275eccc85.png) note For simplicity on n8n Cloud, use the API key method if you prefer manual control over credentials. With authentication set up, you can now create workflows that incorporate the Apify node. ## Create a Workflow with the Apify Node[​](#create-a-workflow-with-the-apify-node "Direct link to Create a Workflow with the Apify Node") Start by building a basic workflow in n8n, then add the Apify node to handle tasks like running Actors or fetching data. 1. Create a new workflow in n8n. 2. Select **Add Node**, search for **Apify**, and select it. 3. Choose the desired **Resource** and **Operation**. 4. In the node's **Credentials** dropdown, choose the Apify credential you configured earlier. If you haven't configured any credentials, you can do so in this step. The process will be the same. 5. You can now use the Apify node as a trigger or action in your workflow. 
![Apify Node](/assets/images/n8n-list-of-operations-58381c03d012ba42a6487d0d05c39ed9.png) ## Use Apify node as trigger[​](#use-apify-node-as-trigger "Direct link to Use Apify node as trigger") Triggers let your workflow respond automatically to events in Apify, such as when an Actor run finishes. This is ideal for real-time automation, like processing scraped data as soon as it's available. 1. Create a new workflow. 2. Click **Add Node**, search for **Apify**, and select it. 3. Select **On new Apify Event** trigger. 4. Configure the trigger: * **Actor or Actor task**: select the Actor or task to listen for terminal events. * **Event Type**: the status of the Actor or task run that should trigger the workflow. 5. Add subsequent nodes (e.g., HTTP Request, Google Sheets) to process or store the output. 6. Save and execute the workflow. ![Apify Node](/assets/images/n8n-trigger-example-34337d7c7a7659e12338ad9d21d51f43.png) ## Use Apify node as an action[​](#use-apify-node-as-an-action "Direct link to Use Apify node as an action") Actions allow you to perform operations like running an Actor within a workflow. For instance, you could trigger a scraper and then retrieve its results. 1. Create a new workflow. 2. Click **Add Node**, search for **Apify**, and select it. 3. Select any operation. In this example we will use **Run Actor**. 4. Configure it: * **Custom input**: JSON input for the Actor run, which you can find on the Actor input page in Apify Console. See [Inputs](https://docs.apify.com/platform/actors/running/input-and-output.md#input) for more information. If empty, the run uses the input specified in the default run configuration * **Timeout**: Timeout for the Actor run in seconds. Zero value means there is no timeout * **Memory**: Amount of memory allocated for the Actor run, in megabytes * **Build Tag**: Specifies the Actor build tag to run. By default, the run uses the build specified in the default run configuration for the Actor (typically `latest`) * **Wait for finish**: Whether to wait for the run to finish before continuing. If true, the node will wait for the run to complete (successfully or not) before moving to the next node ![Apify Node](/assets/images/n8n-run-actor-example-95adc2a943084b8ecadcf38d7d550440.png) 5. Add another Apify operation called **Get Dataset Items**. * Set **Dataset ID** parameter as **defaultDatasetId** value received from the previous **Run Actor** node. This will give you the output of the Actor run ![Apify Node](/assets/images/n8n-get-dataset-items-example-d17a8dff8bb90274157a8ae71ca0d36f.png) 6. Add any subsequent nodes (e.g. Google Sheets) to process or store the output 7. Save and execute the workflow ![Apify Node](/assets/images/n8n-workflow-example-d0cb41c9d3ac5c6260bfbb4b32fca42a.png) ## Available Operations[​](#available-operations "Direct link to Available Operations") The Apify node provides a range of operations for managing Actors, tasks, runs, and storage. These can be used as actions in your workflows. For triggers, focus on event-based activations to start workflows automatically. ### Actors[​](#actors "Direct link to Actors") Run and manage Actors directly. * **Run Actor**: Starts a specified Actor with customizable parameters * **Scrape Single URL**: Runs a scraper for a specified website and returns its content * **Get Last Run**: Retrieve metadata for the most recent run of an Actor ### Actor Tasks[​](#actor-tasks "Direct link to Actor Tasks") Execute predefined tasks efficiently. 
* **Run Task**: Executes a specified Actor task ### Actor Runs[​](#actor-runs "Direct link to Actor Runs") Retrieve run details. * **Get User Runs List**: Retrieve a list of all runs for a user * **Get Run**: Retrieve detailed information for a specific run ID * **Get Runs**: Retrieve all runs for a specific Actor ### Storage[​](#storage "Direct link to Storage") Pull data from Apify storage. #### Datasets[​](#datasets "Direct link to Datasets") * **Get Items**: Retrieves items from a [dataset](https://docs.apify.com/platform/storage/dataset.md) #### Key-Value Stores[​](#key-value-stores "Direct link to Key-Value Stores") * **Get Record**: Retrieves a value from a [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) ### Triggers[​](#triggers "Direct link to Triggers") Automatically start an n8n workflow when an Actor or task run finishes: * **Actor Run Finished**: Activates when a selected Actor run completes * **Task Run Finished**: Activates when a selected Actor task run completes ## Resources[​](#resources "Direct link to Resources") * [n8n Community Nodes Documentation](https://docs.n8n.io/integrations/community-nodes/) * [Apify API Documentation](https://docs.apify.com) * [n8n Documentation](https://docs.n8n.io) ## Troubleshooting[​](#troubleshooting "Direct link to Troubleshooting") If you encounter issues, start by double-checking basics. * **Authentication errors**: Verify your API token or OAuth2 settings in **Credentials**. * **Operation failures**: Check input parameters, JSON syntax, and resource IDs in your Apify account. Feel free to explore other resources and contribute to the integration on [GitHub](https://github.com/apify/n8n-nodes-apify). --- # OpenAI Assistants integration **Learn how to integrate Apify with OpenAI Assistants to provide real-time search data and to save them into OpenAI Vector Store.** *** [OpenAI Assistants API](https://platform.openai.com/docs/assistants/overview) allows you to build your own AI applications such as chatbots, virtual assistants, and more. The OpenAI Assistants can access OpenAI knowledge base ([vector store](https://platform.openai.com/docs/api-reference/vector-stores)) via file search and use function calling for dynamic interaction and data retrieval. Unlike Custom GPT, OpenAI Assistants are available via API, enabling integration with Apify to automatically update assistant data and deliver real-time information, improving the quality of answers. In this tutorial, we’ll start by demonstrating how to create an assistant and integrate real-time data using function calling with the [RAG Web Browser](https://apify.com/apify/rag-web-browser). Next, we’ll show how to save data from Apify Actors into the OpenAI Vector Store for easy retrieval through [file-search](https://platform.openai.com/docs/assistants/tools/file-search). ## Real-time search data for OpenAI Assistant[​](#real-time-search-data-for-openai-assistant "Direct link to Real-time search data for OpenAI Assistant") We'll use the [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor to fetch the latest information from the web and provide it to the OpenAI Assistant through [function calling](https://platform.openai.com/docs/assistants/tools/function-calling?context=without-streaming). To begin, we need to create an OpenAI Assistant with the appropriate instructions. After that, we can initiate a conversation with the assistant by creating a thread, adding messages, and running the assistant to receive responses. 
The image below provides an overview of the entire process: ![Apify-OpenAI Assistant integration](/assets/images/openai-assistant-rag-web-browser-d9f1ec0a46f28033d253192377d249e4.png) Before we start creating the assistant, we need to install all dependencies: ``` pip install apify-client openai ``` Import all required packages: ``` import json import time from apify_client import ApifyClient from openai import OpenAI, Stream from openai.types.beta.threads.run_submit_tool_outputs_params import ToolOutput ``` Find your [Apify API token](https://console.apify.com/account/integrations) and [OpenAI API key](https://platform.openai.com/account/api-keys) and initialize OpenAI and Apify clients: ``` client = OpenAI(api_key="YOUR OPENAI API KEY") apify_client = ApifyClient("YOUR APIFY API TOKEN") ``` First, let us specify assistant's instructions. Here, we ask the assistant to always provide answers based on the latest information from the internet and include relevant sources whenever possible. In a real-world scenario, you can customize the instructions based on your requirements. ``` INSTRUCTIONS = """ You are a smart and helpful assistant. Maintain an expert, friendly, and informative tone in your responses. Your task is to answer questions based on information from the internet. Always call call_rag_web_browser function to retrieve the latest and most relevant online results. Never provide answers based solely on your own knowledge. For each answer, always include relevant sources whenever possible. """ ``` Next, we define a function description with two parameters, search query (`query`) and number of results we need to retrieve (`maxResults`). RAG Web Browser can be called with more parameters, check the [Actor input schema](https://apify.com/apify/rag-web-browser/input-schema) for details. ``` rag_web_browser_function = { "type": "function", "function": { "name": "call_rag_web_browser", "description": "Query Google search, scrape the top N pages from the results, and returns their cleaned content as markdown", "parameters": { "type": "object", "properties": { "query": { "type": "string", "description": "Use regular search words or enter Google Search URLs. "}, "maxResults": {"type": "integer", "description": "The number of top organic search results to return and scrape text from"} }, "required": ["query"] } } } ``` We also need to implement the `call_rag_web_browser` function, which will be used to retrieve the search data. ``` def call_rag_web_browser(query: str, max_results: int) -> list[dict]: """ Query Google search, scrape the top N pages from the results, and returns their cleaned content as markdown. First start the Actor and wait for it to finish. Then fetch results from the Actor run's default dataset. """ actor_call = apify_client.actor("apify/rag-web-browser").call(run_input={"query": query, "maxResults": max_results}) return apify_client.dataset(actor_call["defaultDatasetId"]).list_items().items ``` Now, we can create an assistant with the specified instructions and function description: ``` my_assistant = client.beta.assistants.create( instructions=INSTRUCTIONS, name="OpenAI Assistant with Web Browser", tools=[rag_web_browser_function], model="gpt-4o-mini", ) ``` Once the assistant is created, we can initiate a conversation. Start by creating a thread and adding messages to it, and then calling the run method. Since runs are asynchronous, we need to continuously poll the `Run` object until it reaches a terminal status. 
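For reference, polling a run manually (without the convenience helper introduced below) would look roughly like the following sketch. It is illustrative only and reuses the `client` and `my_assistant` objects created above; `queued` and `in_progress` are the standard non-terminal run statuses of the Assistants API.

```
# Illustrative sketch: create a thread, start a run, and poll its status manually.
thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user", content="What are the latest LLM news?")
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=my_assistant.id)
while run.status in ("queued", "in_progress"):
    time.sleep(1)  # wait a moment before re-checking the run status
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
```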
To simplify this, we use the `create_and_poll` convenience function, which both initiates the run and polls it until completion. ``` thread = client.beta.threads.create() message = client.beta.threads.messages.create( thread_id=thread.id, role="user", content="What are the latest LLM news?" ) run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=my_assistant.id) ``` Finally, we need to check the run status to determine if the assistant requires any action to retrieve the search data. If it does, we must submit the results using the `submit_tool_outputs` function. This function will trigger RAG Web Browser to fetch the search data and submit it to the assistant for processing. Let's implement the `submit_tool_outputs` function: ``` def submit_tool_outputs(run_): """ Submit tool outputs to continue the run """ tool_output = [] for tool in run_.required_action.submit_tool_outputs.tool_calls: if tool.function.name == "call_rag_web_browser": d = json.loads(tool.function.arguments) output = call_rag_web_browser(query=d["query"], max_results=d["maxResults"]) tool_output.append(ToolOutput(tool_call_id=tool.id, output=json.dumps(output))) print("RAG Web Browser added as a tool output.") return client.beta.threads.runs.submit_tool_outputs_and_poll(thread_id=run_.thread_id, run_id=run_.id, tool_outputs=tool_output) ``` Now, we can check the run status and submit the tool outputs if required: ``` if run.status == "requires_action": run = submit_tool_outputs(run) ``` The `submit_tool_outputs` function also polls the run until it reaches a terminal status. After the run is completed, we can print the assistant's response: ``` print("Assistant response:") for m in client.beta.threads.messages.list(thread_id=run.thread_id): print(m.content[0].text.value) ``` For the question "What are the latest LLM news?" the assistant's response might look like this: ``` Assistant response: The latest news on LLM is as follows: - [OpenAI](https://openai.com) has released a new version of GPT-4. - [Hugging Face](https://huggingface.co) has updated their Transformers library. - [Apify](https://apify.com) has released a new RAG Web Browser. ``` Complete example of real-time search data for OpenAI Assistant ``` import json from apify_client import ApifyClient from openai import OpenAI, Stream from openai.types.beta.threads.run_submit_tool_outputs_params import ToolOutput client = OpenAI(api_key="YOUR-OPENAI-API-KEY") apify_client = ApifyClient("YOUR-APIFY-API-TOKEN") INSTRUCTIONS = """ You are a smart and helpful assistant. Maintain an expert, friendly, and informative tone in your responses. Your task is to answer questions based on information from the internet. Always call call_rag_web_browser function to retrieve the latest and most relevant online results. Never provide answers based solely on your own knowledge. For each answer, always include relevant sources whenever possible. """ rag_web_browser_function = { "type": "function", "function": { "name": "call_rag_web_browser", "description": "Query Google search, scrape the top N pages from the results, and returns their cleaned content as markdown", "parameters": { "type": "object", "properties": { "query": {"type": "string", "description": "Use regular search words or enter Google Search URLs. 
"}, "maxResults": {"type": "integer", "description": "The number of top organic search results to return and scrape text from"} }, "required": ["query"] } } } def call_rag_web_browser(query: str, max_results: int) -> list[dict]: """ Query Google search, scrape the top N pages from the results, and returns their cleaned content as markdown. First start the Actor and wait for it to finish. Then fetch results from the Actor run's default dataset. """ actor_call = apify_client.actor("apify/rag-web-browser").call(run_input={"query": query, "maxResults": max_results}) return apify_client.dataset(actor_call["defaultDatasetId"]).list_items().items def submit_tool_outputs(run_): """ Submit tool outputs to continue the run """ tool_output = [] for tool in run_.required_action.submit_tool_outputs.tool_calls: if tool.function.name == "call_rag_web_browser": d = json.loads(tool.function.arguments) output = call_rag_web_browser(query=d["query"], max_results=d["maxResults"]) tool_output.append(ToolOutput(tool_call_id=tool.id, output=json.dumps(output))) print("RAG Web Browser added as a tool output.") return client.beta.threads.runs.submit_tool_outputs_and_poll(thread_id=run_.thread_id, run_id=run_.id, tool_outputs=tool_output) # Runs are asynchronous, which means you'll want to monitor their status by polling the Run object until a terminal status is reached. thread = client.beta.threads.create() message = client.beta.threads.messages.create( thread_id=thread.id, role="user", content="What are the latest LLM news?" ) # Run with assistant and poll for the results run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=my_assistant.id) if run.status == "requires_action": run = submit_tool_outputs(run) print("Assistant response:") for m in client.beta.threads.messages.list(thread_id=run.thread_id): print(m.content[0].text.value) ``` ## Save data into OpenAI Vector Store and use it in the assistant[​](#save-data-into-openai-vector-store-and-use-it-in-the-assistant "Direct link to Save data into OpenAI Vector Store and use it in the assistant") To provide real-time or proprietary data, OpenAI Assistants can access the [OpenAI Vector Store](https://platform.openai.com/docs/assistants/tools/file-search/vector-stores) to retrieve information for their answers. With the [Apify OpenAI Vector Store Integration](https://apify.com/jiri.spilka/openai-vector-store-integration), data saving and updating the OpenAI Vector Store can be fully automated. The following image illustrates the Apify-OpenAI Vector Store integration: ![Apify-OpenAI Vector Store integration](/assets/images/openai-vector-store-integration-181c764fa3677f2767f586accfe7adbe.png) In this example, we'll demonstrate how to save data into the OpenAI Vector Store and use it in the assistant. For more information on automating this process, check out the blog post [How we built an AI salesperson with the OpenAI Assistants API](https://blog.apify.com/enterprise-support-openai-assistant/). 
Before we start, we need to install all dependencies: ``` pip install apify-client openai ``` Find your [Apify API token](https://console.apify.com/account/integrations) and [OpenAI API key](https://platform.openai.com/account/api-keys) and initialize OpenAI and Apify clients: ``` from apify_client import ApifyClient from openai import OpenAI client = OpenAI(api_key="YOUR OPENAI API KEY") apify_client = ApifyClient("YOUR APIFY API TOKEN") ``` Create an assistant with the instructions and `file-search` tool: ``` my_assistant = client.beta.assistants.create( instructions="As a customer support agent at Apify, your role is to assist customers", name="Support assistant", tools=[{"type": "file_search"}], model="gpt-4o-mini", ) ``` Next, create a vector store and attach it to the assistant: ``` vector_store = client.beta.vector_stores.create(name="Support assistant vector store") assistant = client.beta.assistants.update( assistant_id=my_assistant.id, tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}, ) ``` Now, use [Website Content Crawler](https://apify.com/apify/website-content-crawler) to crawl the web and save the data into Apify's dataset: ``` run_input = {"startUrls": [{"url": "https://docs.apify.com/platform"}], "maxCrawlPages": 10, "crawlerType": "cheerio"} actor_call_website_crawler = apify_client.actor("apify/website-content-crawler").call(run_input=run_input) dataset_id = actor_call_website_crawler["defaultDatasetId"] ``` Finally, save the data into the OpenAI Vector Store using [OpenAI Vector Store Integration](https://apify.com/jiri.spilka/openai-vector-store-integration) ``` run_input_vs = { "datasetId": dataset_id, "assistantId": my_assistant.id, "datasetFields": ["text", "url"], "openaiApiKey": "YOUR-OPENAI-API-KEY", "vectorStoreId": vector_store.id, } apify_client.actor("jiri.spilka/openai-vector-store-integration").call(run_input=run_input_vs) ``` Now, the assistant can access the data stored in the OpenAI Vector Store and use it in its responses. Start by creating a thread and adding messages to it. Then, initiate a run and poll for the results. Once the run is completed, you can print the assistant's response. ``` thread = client.beta.threads.create() message = client.beta.threads.messages.create( thread_id=thread.id, role="user", content="How can I scrape a website using Apify?" ) run = client.beta.threads.runs.create_and_poll( thread_id=thread.id, assistant_id=assistant.id, tool_choice={"type": "file_search"} ) print("Assistant response:") for m in client.beta.threads.messages.list(thread_id=run.thread_id): print(m.content[0].text.value) ``` For the question "How can I scrape a website using Apify?" the assistant's response might look like this: ``` Assistant response: You can scrape a website using Apify by following these steps: 1. Visit the [Apify website](https://apify.com) and create an account. 2. Go to the [Apify Store](https://apify.com/store) and choose a web scraper. 3. Configure the web scraper with the URL of the website you want to scrape. 4. Run the web scraper and download the data. 
``` Complete example of saving data into OpenAI Vector Store and using it in the assistant ``` from apify_client import ApifyClient from openai import OpenAI client = OpenAI(api_key="YOUR-OPENAI-API-KEY") apify_client = ApifyClient("YOUR-APIFY-API-TOKEN") my_assistant = client.beta.assistants.create( instructions="As a customer support agent at Apify, your role is to assist customers", name="Support assistant", tools=[{"type": "file_search"}], model="gpt-4o-mini", ) # Create a vector store vector_store = client.beta.vector_stores.create(name="Support assistant vector store") # Update the assistant to use the new Vector Store assistant = client.beta.assistants.update( assistant_id=my_assistant.id, tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}, ) run_input = {"startUrls": [{"url": "https://docs.apify.com/platform"}], "maxCrawlPages": 10, "crawlerType": "cheerio"} actor_call_website_crawler = apify_client.actor("apify/website-content-crawler").call(run_input=run_input) dataset_id = actor_call_website_crawler["defaultDatasetId"] run_input_vs = { "datasetId": dataset_id, "assistantId": my_assistant.id, "datasetFields": ["text", "url"], "openaiApiKey": "YOUR-OPENAI-API-KEY", "vectorStoreId": vector_store.id, } apify_client.actor("jiri.spilka/openai-vector-store-integration").call(run_input=run_input_vs) # Create a thread and a message thread = client.beta.threads.create() message = client.beta.threads.messages.create( thread_id=thread.id, role="user", content="How can I scrape a website using Apify?" ) # Run with assistant and poll for the results run = client.beta.threads.runs.create_and_poll( thread_id=thread.id, assistant_id=assistant.id, tool_choice={"type": "file_search"} ) print("Assistant response:") for m in client.beta.threads.messages.list(thread_id=run.thread_id): print(m.content[0].text.value) ``` ## Resources[​](#resources "Direct link to Resources") * [OpenAI Assistants](https://platform.openai.com/docs/assistants/overview) * [OpenAI function calling](https://platform.openai.com/docs/assistants/tools/function-calling) * [RAG Web Browser](https://apify.com/apify/rag-web-browser) Actor * [OpenAI Vector Store Integration](https://apify.com/jiri.spilka/openai-vector-store-integration) Actor --- # Pinecone integration **Learn how to integrate Apify with Pinecone to feed data crawled from the web into the Pinecone vector database.** *** [Pinecone](https://www.pinecone.io) is a managed vector database that allows users to store and query dense vectors for AI applications such as recommendation systems, semantic search, and retrieval augmented generation (RAG). The Apify integration for Pinecone enables you to export results from Apify Actors and Dataset items into a specific Pinecone vector index. ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") Before you begin, ensure that you have the following: * A [Pinecone database](https://www.pinecone.io/) and index set up. * A Pinecone index created & Pinecone API token obtained. * An [OpenAI API key](https://openai.com/index/openai-api/) to compute text embeddings. * An [Apify API token](https://docs.apify.com/platform/integrations/api#api-token) to access [Apify Actors](https://apify.com/store). ### How to setup Pinecone database and create an index[​](#how-to-setup-pinecone-database-and-create-an-index "Direct link to How to setup Pinecone database and create an index") 1. Sign up or log in to your Pinecone account and click on **Create Index**. 2. 
Specify the following details: index name, vector dimension, vector distance metric, deployment type (serverless or pod), and cloud provider. ![Pinecone index configuration](/assets/images/pinecone-create-index-dc7488389754a28dbec882c2847fd6f4.png) Once the index is created and ready, you can proceed with integrating Apify. ### Integration Methods[​](#integration-methods "Direct link to Integration Methods") You can integrate Apify with Pinecone using either the Apify Console or the Apify Python SDK. Website Content Crawler usage The examples utilize the Website Content Crawler Actor, which deeply crawls websites, cleans HTML by removing modals and navigation elements, and converts HTML to Markdown for training AI models or providing web content to LLMs and generative AI applications. #### Apify Console[​](#apify-console "Direct link to Apify Console") 1. Set up the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor in the [Apify Console](https://console.apify.com). Refer to this guide on how to set up [website content crawl for your project](https://blog.apify.com/talk-to-your-website-with-large-language-models/). 2. Once you have the crawler ready, navigate to the integration section and add Apify’s Pinecone integration. ![Website Content Crawler with Pinecone integration](/assets/images/pinecone-wcc-integration-d5b8e8b5f86645e4a32ac9e1a3f3732e.png) 3. Select when to trigger this integration (typically when a run succeeds) and fill in all the required fields for the Pinecone integration. You can learn more about the input parameters at the [Pinecone integration input schema](https://apify.com/apify/pinecone-integration/input-schema). ![Pinecone integration configuration](/assets/images/pinecone-integration-setup-f054e98ba2a9bf2f31afa32ee5151d51.png) Pinecone index configuration You need to ensure that your embedding model in the Pinecone index configuration matches the Actor settings. For example, the `text-embedding-3-small` model from OpenAI generates vectors of size `1536`, so your Pinecone index should be configured for vectors of the same size. * For a detailed explanation of the input parameters, including dataset settings, incremental updates, and examples, see the [Pinecone integration description](https://apify.com/apify/pinecone-integration). * For an explanation on how to combine Actors to accomplish more complex tasks, refer to the guide on [Actor-to-Actor](https://blog.apify.com/connecting-scrapers-apify-integration/) integrations. #### Python[​](#python "Direct link to Python") Another way to interact with Pinecone is through the [Apify Python SDK](https://docs.apify.com/sdk/python/). 1. Install the Apify Python SDK by running the following command: `pip install apify-client` 2. Create a Python script and import all the necessary modules: ``` from apify_client import ApifyClient APIFY_API_TOKEN = "YOUR-APIFY-TOKEN" OPENAI_API_KEY = "YOUR-OPENAI-API-KEY" PINECONE_API_KEY = "YOUR-PINECONE-API-KEY" PINECONE_INDEX_NAME = "YOUR-PINECONE-INDEX-NAME" client = ApifyClient(APIFY_API_TOKEN) ``` 3. Call the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to crawl the Pinecone documentation and extract text content from the web pages: ``` actor_call = client.actor("apify/website-content-crawler").call( run_input={"startUrls": [{"url": "https://docs.pinecone.io/home"}]} ) print("Website Content Crawler Actor has finished") print(actor_call) ``` 4. 
Use Apify's [Pinecone integration](https://apify.com/apify/pinecone-integration) to store all the selected data from the dataset (provided by `datasetId` from the Actor call) into the Pinecone vector database. ``` pinecone_integration_inputs = { "pineconeApiKey": PINECONE_API_KEY, "pineconeIndexName": PINECONE_INDEX_NAME, "datasetFields": ["text"], "datasetId": actor_call["defaultDatasetId"], "enableDeltaUpdates": True, "deltaUpdatesPrimaryDatasetFields": ["url"], "deleteExpiredObjects": True, "expiredObjectDeletionPeriodDays": 30, "embeddingsApiKey": OPENAI_API_KEY, "embeddingsProvider": "OpenAI", "performChunking": True, "chunkSize": 1000, "chunkOverlap": 0, } actor_call = client.actor("apify/pinecone-integration").call(run_input=pinecone_integration_inputs) print("Apify's Pinecone Integration has finished") print(actor_call) ``` You have successfully integrated Apify with Pinecone and the data is now stored in the Pinecone vector database. ## Additional Resources[​](#additional-resources "Direct link to Additional Resources") * [Apify Pinecone integration](https://apify.com/apify/pinecone-integration) * [What is Pinecone and why use it with your LLMs?](https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/) * [Pinecone documentation](https://docs.pinecone.io/) --- # Qdrant integration **Learn how to integrate Apify with Qdrant to transfer crawled data into the Qdrant vector database.** *** [Qdrant](https://qdrant.tech) is a high-performance managed vector database that allows users to store and query dense vectors for next-generation AI applications such as recommendation systems, semantic search, and retrieval-augmented generation (RAG). The Apify integration for Qdrant enables you to export results from Apify Actors and Dataset items into a specific Qdrant collection. ## Prerequisites[​](#prerequisites "Direct link to Prerequisites") Before you begin, ensure that you have the following: * A [Qdrant cluster](https://qdrant.tech) set up. * The URL of your Qdrant database and a Qdrant API token. * An [OpenAI API key](https://openai.com/index/openai-api/) to compute text embeddings. * An [Apify API token](https://docs.apify.com/platform/integrations/api#api-token) to access [Apify Actors](https://apify.com/store). ### How to set up a Qdrant database and create a cluster[​](#how-to-setup-qdrant-database-and-create-a-cluster "Direct link to How to set up a Qdrant database and create a cluster") 1. Sign up or log in to your Qdrant account and create a new cluster. 2. Specify the following details: provider, region, and name. 3. Set up an API key for the cluster once it is created and its status is healthy. With the cluster ready and its URL and API key in hand, you can proceed with integrating Apify. ### Integration Methods[​](#integration-methods "Direct link to Integration Methods") You can integrate Apify with Qdrant using either the Apify Console or the Apify Python SDK. Website Content Crawler usage The examples utilize the Website Content Crawler Actor, which deeply crawls websites, cleans HTML by removing modals and navigation elements, and converts HTML to Markdown for training AI models or providing web content to LLMs and generative AI applications. #### Apify Console[​](#apify-console "Direct link to Apify Console") 1. Set up the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor in the [Apify Console](https://console.apify.com).
Refer to this guide on how to set up [website content crawl for your project](https://blog.apify.com/talk-to-your-website-with-large-language-models/). 2. Once you have the crawler ready, navigate to the integration section and add Apify's Qdrant integration. ![Website Content Crawler with Qdrant integration](/assets/images/qdrant-wcc-integration-1fde951a0aa495bdffcafba80a6a0e29.png) 3. Select when to trigger this integration (typically when a run succeeds) and fill in all the required fields for the Qdrant integration. If you haven't created a collection, it can be created automatically with the specified model. You can learn more about the input parameters at the [Qdrant integration input schema](https://apify.com/apify/qdrant-integration). ![Qdrant integration configuration](/assets/images/qdrant-integration-setup-15d4bade6cb47bc451d6968e66668a88.png) * For a detailed explanation of the input parameters, including dataset settings, incremental updates, and examples, see the [Qdrant integration description](https://apify.com/apify/qdrant-integration). * For an explanation on how to combine Actors to accomplish more complex tasks, refer to the guide on [Actor-to-Actor](https://blog.apify.com/connecting-scrapers-apify-integration/) integrations. #### Python[​](#python "Direct link to Python") Another way to interact with Qdrant is through the [Apify Python SDK](https://docs.apify.com/sdk/python/). 1. Install the Apify Python SDK by running the following command: ``` pip install apify-client ``` 2. Create a Python script and import all the necessary modules: ``` from apify_client import ApifyClient APIFY_API_TOKEN = "YOUR-APIFY-TOKEN" OPENAI_API_KEY = "YOUR-OPENAI-API-KEY" QDRANT_URL = "YOUR-QDRANT-URL" QDRANT_API_KEY = "YOUR-QDRANT-API-KEY" QDRANT_COLLECTION_NAME = "YOUR-QDRANT-COLLECTION-NAME" client = ApifyClient(APIFY_API_TOKEN) ``` 3. Call the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to crawl the Qdrant documentation and extract text content from the web pages: ``` actor_call = client.actor("apify/website-content-crawler").call( run_input={"startUrls": [{"url": "https://qdrant.tech/documentation/"}]} ) ``` 4. Call Apify's Qdrant integration and store all data in the Qdrant Vector Database: ``` qdrant_integration_inputs = { "qdrantUrl": QDRANT_URL, "qdrantApiKey": QDRANT_API_KEY, "qdrantCollectionName": QDRANT_COLLECTION_NAME, "qdrantAutoCreateCollection": True, "datasetId": actor_call["defaultDatasetId"], "datasetFields": ["text"], "enableDeltaUpdates": True, "deltaUpdatesPrimaryDatasetFields": ["url"], "deleteExpiredObjects": True, "expiredObjectDeletionPeriodDays": 30, "embeddingsProvider": "OpenAI", "embeddingsApiKey": OPENAI_API_KEY, "performChunking": True, "chunkSize": 1000, "chunkOverlap": 0, } actor_call = client.actor("apify/qdrant-integration").call(run_input=qdrant_integration_inputs) ``` You have successfully integrated Apify with Qdrant and the data is now stored in the Qdrant vector database. ## Additional Resources[​](#additional-resources "Direct link to Additional Resources") * [Apify Qdrant integration](https://apify.com/apify/qdrant-integration) * [Qdrant documentation](https://qdrant.tech/documentation/) --- # Slack integration **Learn how to integrate your Apify Actors with Slack. This article guides you from installation through to automating your whole workflow in Slack.** A tutorial can be found [here](https://help.apify.com/en/articles/6454058-apify-integration-for-slack). 
*** > Explore the [integration for Slack tutorial](https://help.apify.com/en/articles/6454058-apify-integration-for-slack). [Slack](https://slack.com/) allows you to install various services in your workspace in order to automate and centralize jobs. Apify is one of these services, and it allows you to run your Apify Actors, get notified about their run statuses, and receive your results, all without opening your browser. ## Get started[​](#get-started "Direct link to Get started") To use the Apify integration for Slack, you will need: * An [Apify account](https://console.apify.com/). * A Slack account (and workspace). ## Step 1: Set up the integration for Slack[​](#step-one "Direct link to Step 1: Set up the integration for Slack") You can find all integrations on an Actor's or task's **Integrations** tab. For example, you can try using the [Google Shopping Scraper](https://console.apify.com/actors/aLTexEuCetoJNL9bL). Find the integration for Slack, then click the **Configure** button. You will be prompted to log in with your Slack account and select your workspace in the **Settings > Integrations** window. ![Integrations tab](/assets/images/integrations-tab-ccd1902979bfea9812a6de7046ec6f04.png) Then, head back to your task to finish the setup. Select what type of events you would like to be notified of (e.g., when a run is created, when a run succeeds, when a run fails, etc.), your workspace, and the channel you want to receive the notifications in (you can set up an ad-hoc channel for this test). In the **Message** field, you can see how the notification will look, or you can craft a new custom one. ![Integration setup](/assets/images/slack-integration-setup-0b413d14c705608f5d6a73e0ee5b5e05.png) Once you are done, click the **Save** button. ## Step 2: Give the Apify integration a trial run\![​](#step-two "Direct link to Step 2: Give the Apify integration a trial run!") Click the **Start** button and head to the Slack channel you selected to see your first Apify integration notifications. ## Step 3: Start your run directly from Slack[​](#step-three "Direct link to Step 3: Start your run directly from Slack") You can now run the same Actor or task directly from Slack by typing `/apify call [Actor or task ID]` into the Slack message box. ![Use Apify from Slack](/assets/images/slack-apify-message-6c772c8d007770c873bfdc0f4201e80e.png) When an Actor doesn't require you to fill in any input fields, you can run it by simply typing `/apify call [Actor or task ID]`. You're all set! If you have any questions or need help, feel free to reach out to us on our [Discord channel](https://discord.com/invite/jyEM2PRvMU). --- # Telegram integration through Zapier **Learn how to integrate your Apify Actors with Telegram through Zapier.** *** With [Apify integration for Zapier](https://zapier.com/apps/apify/integrations), you can connect your Apify Actors to Slack, Trello, Google Sheets, Dropbox, Salesforce, and loads more. Your Zapier workflows can start Apify Actors or tasks, fetch items from a dataset, set and get records from key-value stores, or find Actor or task runs. You can use the Zapier integration to trigger a workflow whenever an Actor or a task finishes. Complementary to the following guide we've created a detailed video, that will guide you through the process of setting up your Telegram integration through Zapier. 
[YouTube video player](https://www.youtube.com/embed/XldEuQleq3c?si=86qbdrzWpVLoY_fr) ## Connect Apify with Zapier[​](#connect-apify-with-zapier "Direct link to Connect Apify with Zapier") To use the Apify integration on Zapier, you will need to: * Have an [Apify account](https://console.apify.com/). * Have a [Zapier account](https://zapier.com/). ### Step 1: Create Zap and find Apify on Zapier[​](#step-1-create-zap-and-find-apify-on-zapier "Direct link to Step 1: Create Zap and find Apify on Zapier") Once you have your Zapier account ready and you are successfully logged in, you can create your first Zap. Go to the Zaps section and find the **Create Zap** button. In the create Zap form, you can choose whether you want to use Apify as the trigger or action for the Zap. Click on Trigger and find Apify using the search box. ![Zapier choosing source of trigger](/assets/images/zapier-trigger-9cfba21b5331f81b0ef284b610a7756e.png) You have two possible triggers that you can choose while setting up your Telegram integration * Finished Actor Run - triggers upon the completion of a selected Actor run. * Finished Task Run - triggers upon the completion of a selected Actor task run. ![Zapier choosing trigger event](/assets/images/zapier-zap-choose-event-256ebd8df51a8482927c35e55376c2e0.png) Available Actors & Tasks Please note that only Actors or Actor tasks that you previously run will be available to choose from. ### Step 2: Create a connection to your Apify account[​](#step-2-create-a-connection-to-your-apify-account "Direct link to Step 2: Create a connection to your Apify account") The next step is to connect your Apify account. Click on the "Sign in" button next to the Connect to Apify title. Or you can select an account if you already have one connected. The connection configuration options open in a new tab in your browser or in a modal window. ![Apify Zapier connection](/assets/images/zapier-new-connection-84a5dda50ec779c44f543918580c98f0.png) In connection configuration, you need to provide your Apify API Token. You can find the token in Apify Console by navigating to **Settings > Integrations**. ![Integrations token](/assets/images/apify-integrations-token-a480c4034e9658f9989b7c661ee0fad5.png) Copy the token and paste it into the configuration form and continue with the "Yes, Continue to Apify" button. The connection is now created and the configuration form closed. ## Connect Telegram bot with Zapier[​](#connect-telegram-bot-with-zapier "Direct link to Connect Telegram bot with Zapier") ### Step 1: Create & connect new bot on Telegram[​](#step-1-create--connect-new-bot-on-telegram "Direct link to Step 1: Create & connect new bot on Telegram") After setting up Apify as your trigger within Zapier, it's time to set up Telegram as the action that will occur based on the trigger. ![Choosing action source within Zapier](/assets/images/zapier-zap-choose-action-2cab03ba13fee2ae5e70cb4f06350e04.png) You have two possible actions that you can choose while setting up your Telegram integration * Send Message - sends a message from your bot when trigger activates. * Send Poll - sends a poll from your bot when trigger activates. ![Zapier choosing action event](/assets/images/zapier-zap-choose-event-2-4f8cffd5c5cee82720b2b11ca91ff168.png) After you choose your event that will be happening after trigger, you need to connect your Telegram bot that will be responsible for sending the message or sending polls. 
![Telegram bot HTTP token](/assets/images/zapier-zap-telegram-access-f8700284981895e9d44f6d416c16beb1.png) The best way to do this is to: 1. Start a conversation with Telegram's BotFather, a bot that manages bots on Telegram. ![BotFather conversation](/assets/images/botfather-start-f6dbf707c00fcdd5bc3cd664e1125338.png) 2. Issue the `/newbot` command in the conversation and follow the instructions until you get your HTTP API token. ![BotFather conversation with new HTTP API token](/assets/images/botfather-HTTP-API-blurred-d6dbee971c3b8bfa2b734a07baa68db3.png) ### Step 2: Create action for your new Telegram bot[​](#step-2-create-action-for-your-new-telegram-bot "Direct link to Step 2: Create action for your new Telegram bot") Once you've set up your new bot within Zapier, it's time to set up an action. Start a new conversation with your bot, copy the **Chat-Id**, and input it within Zapier. Select the **Chat-Id**, **Text Format**, and a **Message Text** that suits your needs, for example: ![Zapier choosing Action event](/assets/images/zapier-zap-action-39f63d9f8090d28c3a9526a64f270e91.png) Once you fill in all the required fields, you can test your integration. If everything works, hit **Publish** and you are done! --- # Webhook integration **Learn how to integrate multiple Apify Actors or external systems with your Actor or task run. Send alerts when your Actor run succeeds or fails.** *** Webhooks allow you to configure the Apify platform to perform an action when a certain system event occurs. For example, you can use them to start another Actor when the current run finishes or fails. You can find webhooks under the **Integrations** tab on an Actor's page in [Apify Console](https://console.apify.com/actors). ![Integrations tab in Apify Console](/assets/images/integrations-tab-ccd1902979bfea9812a6de7046ec6f04.png) To define a webhook, select a system **event** that triggers the webhook. Then, provide the **action** to execute after the event. When the event occurs, the system executes the action. Current webhook limitations Currently, the only available action is to send a POST HTTP request to a URL specified in the webhook. * [**Events**](https://docs.apify.com/platform/integrations/webhooks/events.md) * [**Actions**](https://docs.apify.com/platform/integrations/webhooks/actions.md) * [**Ad-hoc webhooks**](https://docs.apify.com/platform/integrations/webhooks/ad-hoc-webhooks.md) --- # Webhook actions **Send notifications when specific events occur in your Actor/task run or build. Dynamically add data to the notification payload.** *** ## Send HTTP request[​](#send-http-request "Direct link to Send HTTP request") To send a notification, you can use the HTTP request action, which sends an HTTP POST request to a specified URL with a JSON payload. The payload is defined using a payload template, which is a JSON-like syntax that allows you to include variables enclosed in double curly braces `{{variable}}`. This enables the dynamic injection of data into the payload when the webhook is triggered. Webhook delivery Webhooks are sent from servers that use static IP addresses. If your webhook destination is secured by a firewall, you can add the following IP addresses to your allow list to ensure Apify webhooks are delivered successfully.
IP list * `3.215.64.207` * `13.216.80.7` * `13.216.180.86` * `34.224.107.31` * `34.236.208.85` * `44.198.219.104` * `44.207.71.44` * `44.207.141.205` * `52.4.20.206` * `52.203.255.236` ### Response management[​](#response-management "Direct link to Response management") The response to the POST request must have an HTTP status code in the `2XX` range. If the response has a different status code, it is considered an error, and the request will be retried periodically with an exponential back-off: * First retry: after approximately *1 minute* * Second retry: after *2 minutes* * Third retry: after *4 minutes* * ... * Eleventh retry: after approximately *32 hours* If the request fails after *11 retries*, the system stops retrying. ### Security considerations[​](#security-considerations "Direct link to Security considerations") For security reasons, include a secret token in the webhook URL to ensure that only Apify can invoke it. You can use the **Test** button in the user interface to test your endpoint. Headers template You can also use [Headers template](https://docs.apify.com/platform/integrations/webhooks/actions.md#headers-template) for this purpose. Note that webhook HTTP requests have a timeout of *30 seconds*. If your endpoint performs a time-consuming operation, respond to the request immediately to prevent timeouts before Apify receives the response. To ensure reliable completion of the time-consuming operation, consider using a message queue internally to retry the operation on internal failure. In rare cases, the webhook might be invoked more than once. Design your code to be idempotent to handle duplicate calls. Apify requests: auto-added tokens If the URL of your request points toward Apify, you don't need to add a token, since it will be added automatically. ## Payload template[​](#payload-template "Direct link to Payload template") The payload template is a JSON-like string that allows you to define a custom payload structure and inject dynamic data known only at the time of the webhook's invocation. Apart from the variables, the string must be a valid JSON. Variables must be enclosed in double curly braces and can only use the pre-defined variables listed in the [Available variables](#available-variables) section. Using any other variable will result in a validation error. The syntax of a variable is: `{{oneOfAvailableVariables}}`. Variables support accessing nested properties using dot notation: `{{variable.property}}`. ### Default payload template[​](#default-payload-template "Direct link to Default payload template") ``` { "userId": {{userId}}, "createdAt": {{createdAt}}, "eventType": {{eventType}}, "eventData": {{eventData}}, "resource": {{resource}} } ``` ### Default payload example[​](#default-payload-example "Direct link to Default payload example") ``` { "userId": "abf6vtB2nvQZ4nJzo", "createdAt": "2019-01-09T15:59:56.408Z", "eventType": "ACTOR.RUN.SUCCEEDED", "eventData": { "actorId": "fW4MyDhgwtMLrB987", "actorRunId": "uPBN9qaKd2iLs5naZ" }, "resource": { "id": "uPBN9qaKd2iLs5naZ", "actId": "fW4MyDhgwtMLrB987", "userId": "abf6vtB2nvQZ4nJzo", "startedAt": "2019-01-09T15:59:40.750Z", "finishedAt": "2019-01-09T15:59:56.408Z", "status": "SUCCEEDED", // ... } } ``` #### String interpolation[​](#string-interpolation "Direct link to String interpolation") The payload template is *not* a valid JSON by default, but the resulting payload is. To use templates that provide the same functionality and are valid JSON at the same time, you can use string interpolation. 
With string interpolation, the default payload template looks like this: ``` { "userId": "{{userId}}", "createdAt": "{{createdAt}}", "eventType": "{{eventType}}", "eventData": "{{eventData}}", "resource": "{{resource}}" } ``` If the string being interpolated contains only the variable, the actual variable value is used in the payload. For example, `"{{eventData}}"` results in an object. If the string contains more than just the variable, the string value of the variable will appear in the payload: ``` { "text": "My user id is {{userId}}" } { "text": "My user id is abf6vtB2nvQZ4nJzo" } ``` To enable string interpolation, use the **Interpolate variables in string fields** switch in Apify Console. In the JavaScript API client, it's called `shouldInterpolateStrings`. This field is always `true` when integrating Actors or tasks. ### Payload template example[​](#payload-template-example "Direct link to Payload template example") This example shows how to use payload template variables to send a custom object that displays the status of a run, its ID, and a custom property: ``` { "runId": {{resource.id}}, "runStatus": {{resource.status}}, "myProp": "hello world" } ``` Note that the `eventData` and `resource` properties contain redundant data for backward compatibility. You can use either `eventData` or `resource` in your templates, depending on your use case. ## Headers template[​](#headers-template "Direct link to Headers template") The headers template is a JSON-like text where you can add additional information to the default HTTP header of the webhook request. You can pass the variables in the same way as in the [payload template](#payload-template), including the use of string interpolation and the available variables. The resulting interpolated text needs to be a valid JSON object, and values can be strings only. Note that the following HTTP headers are always set by the system and your changes will always be overwritten: | Variable | Value | | ----------------------------- | ------------------------- | | `Host` | Request URL | | `Content-Type` | `application/json` | | `X-Apify-Webhook` | Apify internal value | | `X-Apify-Webhook-Dispatch-Id` | Apify webhook dispatch ID | | `X-Apify-Request-Origin` | Apify origin | ## Description[​](#description "Direct link to Description") The description is an optional string that you can add to the webhook. It serves for your information and is not sent with the HTTP request when the webhook is dispatched. ## Available variables[​](#available-variables "Direct link to Available variables") | Variable | Type | Description | | ----------- | ------ | ----------- | | `userId` | string | ID of the Apify user who owns the webhook. | | `createdAt` | string | ISO string date of the webhook's trigger event. | | `eventType` | string | Type of the trigger event, see [Events](https://docs.apify.com/platform/integrations/webhooks/events.md). | | `eventData` | Object | Data associated with the trigger event, see [Events](https://docs.apify.com/platform/integrations/webhooks/events.md). | | `resource` | Object | The resource that caused the trigger event. | | `globals` | Object | Data available in global context.
Contains `dateISO` (date of webhook's trigger event in ISO 8601 format) and `dateUnix` (date of trigger event in Unix time in seconds) | ### Resource[​](#resource "Direct link to Resource") The `resource` variable represents the triggering system resource. For example, when using the `ACTOR.RUN.SUCCEEDED` event, the resource is the Actor run. The variable will be replaced by the `Object` that you would receive as a response from the relevant API at the moment when the webhook is triggered. For the Actor run resource, it would be the response of the [Get Actor run](https://docs.apify.com/api/v2/actor-run-get.md) API endpoint. In addition to Actor runs, webhooks also support various events related to Actor builds. In such cases, the resource object will look like the response of the [Get Actor build](https://docs.apify.com/api/v2/actor-build-get.md) API endpoint. --- # Ad-hoc webhooks **Set up one-time webhooks for Actor runs initiated through the Apify API or from the Actor's code. Trigger events when the run reaches a specific state.** *** An ad-hoc webhook is a single-use webhook created for a specific Actor run when starting the run using the [Apify API](https://docs.apify.com/api/v2.md). The webhook triggers once when the run transitions to the specified state. Define ad-hoc webhooks using the `webhooks` URL parameter added to the API endpoint that starts an Actor or Actor task: ``` https://api.apify.com/v2/acts/[ACTOR_ID]/runs?token=[YOUR_API_TOKEN]&webhooks=[AD_HOC_WEBHOOKS] ``` replace `AD_HOC_WEBHOOKS` with a base64 encoded stringified JSON array of webhook definitions: ``` [ { eventTypes: ['ACTOR.RUN.FAILED'], requestUrl: 'https://example.com/run-failed', }, { eventTypes: ['ACTOR.RUN.SUCCEEDED'], requestUrl: 'https://example.com/run-succeeded', payloadTemplate: '{"hello": "world", "resource":{{resource}}}', }, ]; ``` ## Create an ad-hoc webhook dynamically[​](#create-an-ad-hoc-webhook-dynamically "Direct link to Create an ad-hoc webhook dynamically") You can also create a webhook dynamically from your Actor's code using the Actor's add webhook method: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... await Actor.addWebhook({ eventTypes: ['ACTOR.RUN.FAILED'], requestUrl: 'https://example.com/run-failed', }); // ... await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: await Actor.add_webhook( event_types=['ACTOR.RUN.FAILED'], request_url='https://example.com/run-failed', ) # ... ``` For more information, check out the [JavaScript SDK documentation](https://docs.apify.com/sdk/js/reference/class/Actor#addWebhook) or the [Python SDK documentation](https://docs.apify.com/sdk/python/reference/class/Actor#add_webhook). To prevent duplicate ad-hoc webhooks in case of Actor restart, use the idempotency key parameter. The idempotency key must be unique across all user webhooks to ensure only one webhook is created for a given value. For example, use the Actor run ID as an idempotency key: * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); // ... await Actor.addWebhook({ eventTypes: ['ACTOR.RUN.FAILED'], requestUrl: 'https://example.com/run-failed', idempotencyKey: process.env.APIFY_ACTOR_RUN_ID, }); // ... await Actor.exit(); ``` ``` import os from apify import Actor async def main(): async with Actor: await Actor.add_webhook( event_types=['ACTOR.RUN.FAILED'], request_url='https://example.com/run-failed', idempotency_key=os.environ['APIFY_ACTOR_RUN_ID'], ) # ... 
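        # Note: the APIFY_ACTOR_RUN_ID environment variable is set automatically
        # by the Apify platform for each run, which is what makes it a convenient
        # unique idempotency key across restarts of the same run.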
``` --- # Events types for webhooks **Specify the types of events that trigger a webhook in an Actor or task run. Trigger an action on Actor or task run creation, success, failure, termination or timeout.** *** You can configure webhooks to trigger actions based on specific events that occur during Actor runs or builds. ## Actor run events[​](#actor-run-events "Direct link to Actor run events") Actor run events are triggered when an Actor run is created or transitions to a new state. You can define webhooks for all runs of an Actor on its detail page or for a specific Actor task on its detail page. In the latter case, the webhook is invoked only for runs started for that task. ### Event types[​](#event-types "Direct link to Event types") * `ACTOR.RUN.CREATED` - A new Actor run has been created. * `ACTOR.RUN.SUCCEEDED` - An Actor run finished with status `SUCCEEDED`. * `ACTOR.RUN.FAILED` - An Actor run finished with status `FAILED`. * `ACTOR.RUN.ABORTED` - An Actor run finished with status `ABORTED`. * `ACTOR.RUN.TIMED_OUT` - An Actor run finished with status `TIMED-OUT`. * `ACTOR.RUN.RESURRECTED` - An Actor run has been resurrected. ### Event data[​](#event-data "Direct link to Event data") The following data is provided for Actor run events: ``` { "actorId": "ID of the triggering Actor.", "actorTaskId": "If task was used, its ID.", "actorRunId": "ID of the triggering Actor run.", } ``` To fetch the results from the Actor run, you can use the `actorRunId` event property and call one of the [Run object and its storages](https://docs.apify.com/api/v2/actor-runs.md) API endpoints. For example: ``` https://api.apify.com/v2/actor-runs/[ACTOR_RUN_ID]/dataset/items?token=[TOKEN] ``` Apart from the event data, actions also have the `resource` object available, which can provide more details about the object that triggered the event. For more information about the `resource` objects, see the [Webhooks Actions page](https://docs.apify.com/platform/integrations/webhooks/actions.md#resource) ## Actor build events[​](#actor-build-events "Direct link to Actor build events") Actor build events are triggered when an Actor build is created or transitions into a new state. You can define webhooks for all builds on its detail page. ### Event types[​](#event-types-1 "Direct link to Event types") * `ACTOR.BUILD.CREATED` - A new Actor build has been created. * `ACTOR.BUILD.SUCCEEDED` - An Actor build finished with the status `SUCCEEDED`. * `ACTOR.BUILD.FAILED` - An Actor build finished with the status `FAILED`. * `ACTOR.BUILD.ABORTED` - An Actor build finished with the status `ABORTED`. * `ACTOR.BUILD.TIMED_OUT` - An Actor build finished with the status `TIMED-OUT`. ### Event Data[​](#event-data-1 "Direct link to Event Data") The following data is provided for Actor build events: ``` { "actorId": "ID of the triggering Actor.", "actorBuildId": "ID of the triggering Actor build.", } ``` --- # Zapier integration **Learn how to integrate your Apify Actors with Zapier.** *** With [Apify integration for Zapier](https://zapier.com/apps/apify/integrations), you can connect your Apify Actors to Slack, Trello, Google Sheets, Dropbox, Salesforce, and loads more. Your Zapier workflows can start Apify Actors or tasks, fetch items from a dataset, set and get records from key-value stores, or find Actor or task runs. You can use the Zapier integration to trigger a workflow whenever an Actor or a task finishes. 
## Connect Apify with Zapier[​](#connect-apify-with-zapier "Direct link to Connect Apify with Zapier") To use the Apify integration on Zapier, you will need to: * Have an [Apify account](https://console.apify.com/). * Have a [Zapier account](https://zapier.com/). ### Step 1: Create Zap and find Apify on Zapier[​](#step-1-create-zap-and-find-apify-on-zapier "Direct link to Step 1: Create Zap and find Apify on Zapier") Once you have your Zapier account ready and you are successfully logged in, you can create your first Zap. Go to the Zaps section and find the "Create Zap" button. In the create Zap form, you can choose whether you want to use Apify as the trigger or action for the Zap. Click on Trigger and find Apify using the search box. ![Apify Zapier Trigger 2](/assets/images/zapier-trigger-9cfba21b5331f81b0ef284b610a7756e.png) Then select which trigger you want to use. ![Apify Zapier Trigger 2](/assets/images/zapier-trigger-2-cf99ab20a4576f75c004f7bc55712a33.png) ### Step 2: Create a connection to your Apify account[​](#step-2-create-a-connection-to-your-apify-account "Direct link to Step 2: Create a connection to your Apify account") Next, connect your Apify account. Click the **Select** button next to the **Connect to Apify** title. Or you can select an account if you already have one connected. The connection configuration options open in a new tab in your browser. ![Apify Zapier connection](/assets/images/zapier-choose-account-5c8166e1c16112618f9508af4af975d7.png) Choose the account that you want to connect with Zapier. A new window will appear displaying the information that will be shared between Zapier and Apify. ![Integrations OAuth](/assets/images/zapier-allow-access-2d640b6d1bc4b88044177b6f632e6cc9.png) Continue by clicking **Allow Access** button. The connection is now created and the configuration form closes. You can continue with the Zap configuration and select the Actor you want to use as trigger. The last step is to test the connection. The **Test Trigger** button lets you test the trigger. The test should prefill data from existing Actor runs you have in Apify. If you don't have any runs, you can create one in Apify Console and then test the connection again. After a successful test, you can continue with the Zap configuration and set up the action for the Zap. ### Step 3: Set up your Apify action in Zapier[​](#step-3-set-up-your-apify-action-in-zapier "Direct link to Step 3: Set up your Apify action in Zapier") You are able to use any action to follow your Apify trigger. For example, you can use Gmail to send an email about a finished Actor run. In this guide we'll show you how to use Apify as an action to start an Actor run. After you select Apify as an action, you need to select the action you want to use. Let's use the "Run Actor" action for this example. ![Zapier Action](/assets/images/zapier-action-1-d802c992ccd8d7a6d8db6d4fe7fe6e56.png) You need to select the connection you want to use. If you don't have any connections, you can create a new one by clicking on the "Sign in" button and follow the steps in Step 2. In the next step, you need to select the Actor you want to use. You can use the search box to find the Actor. We will use the Web Scraper Actor in this example. ![Zapier Action](/assets/images/zapier-action-2-abd92c7a5a0b16cdb5be06c492460be4.png) You need to fill the input for the Actor and use the **Continue** button to advance to the next step. > You can choose to run Actor synchronously or asynchronously. 
If you choose to run the Actor synchronously, the Zap will wait until the Actor finishes, and the Actor output and data will be available in the next steps. Beware that the hard timeout for the run is 30 seconds. If the Actor doesn't finish in 30 seconds, the run will be terminated and the Actor output will not be available in the next steps. In the next step, you can test the action and check that everything is working using the **Test step** button. This button runs the Actor on Apify, and you can see the data in Zapier. ![Zapier Action](/assets/images/zapier-action-3-cdd47e458ca684ee04df4f9d83552952.png) Once you are happy with the test, you can publish the Zap. When it is turned on, it will run the Actor every time the trigger is fired. ## Triggers[​](#triggers "Direct link to Triggers") ### Finished Actor Run[​](#finished-actor-run "Direct link to Finished Actor Run") > Triggers when a selected Actor run is finished. ### Finished Task Run[​](#finished-task-run "Direct link to Finished Task Run") > Triggers when a selected Actor task run is finished. ## Actions[​](#actions "Direct link to Actions") ### Run Actor[​](#run-actor "Direct link to Run Actor") > Runs a selected Actor. ### Run Task[​](#run-task "Direct link to Run Task") > Runs a selected Actor task. ### Scrape Single URL[​](#scrape-single-url "Direct link to Scrape Single URL") > Runs a scraper for the website and returns its content as text, markdown, and HTML. This action is for getting the content of a single page to use, for example, in LLM flows. ### Set Key-Value Store Record[​](#set-key-value-store-record "Direct link to Set Key-Value Store Record") > Sets a value to a [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md). ## Searches[​](#searches "Direct link to Searches") ### Fetch Dataset Items[​](#fetch-dataset-items "Direct link to Fetch Dataset Items") > Retrieves items from a [dataset](https://docs.apify.com/platform/storage/dataset.md). ### Find Last Actor Run[​](#find-last-actor-run "Direct link to Find Last Actor Run") > Finds the most recent Actor run. ### Find Last Task Run[​](#find-last-task-run "Direct link to Find Last Task Run") > Finds the most recent Actor task run. ### Get Key-Value Store Record[​](#get-key-value-store-record "Direct link to Get Key-Value Store Record") > Retrieves a value from a [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md). If you have any questions or need help, feel free to reach out to us on our [developer community on Discord](https://discord.com/invite/jyEM2PRvMU). --- # Limits **Learn the Apify platform's resource capabilities and limitations, such as maximum memory, disk size, and number of Actors and tasks per user.** *** The tables below show the Apify platform's default resource limits. For API limits such as rate limits and max payload size, see the [API documentation](https://docs.apify.com/api/v2.md#rate-limiting). > If needed, the limits shown below can be increased on paid accounts. For details, contact us at **** or use the chat in [Apify Console](https://console.apify.com/) under the "Help & Resources → Contact Support".
## Actor runtime limits[​](#actor-runtime-limits "Direct link to Actor runtime limits") | Description | Limit for plan | | | | | ------------------------------------------- | --------------------- | --------- | ---------- | -------- | | | Free | Starter | Scale | Business | | Build memory size | 4,096 MB | | | | | Run minimum memory | 128 MB | 128 MB | | | | Run maximum memory | 4,096 MB | 32,768 MB | | | | Maximum combined memory of all running jobs | 4,096 MB | 32,768 MB | 131,072 MB | | | Build timeout | 1800 secs | | | | | Build/run disk size | 2× job memory limit | | | | | Memory per CPU core | 4,096 MB | | | | | Maximum log size | 10,485,760 characters | | | | | Maximum number of metamorphs | 10 metamorphs per run | | | | ## Apify platform limits[​](#apify-platform-limits "Direct link to Apify platform limits") | Description | Limit for plan | | | | | ---------------------------------------------------------------------- | -------------- | ------- | ----- | -------- | | | Free | Starter | Scale | Business | | Maximum number of dataset columns for tabular formats (XLSX, CSV, ...) | 2000 columns | | | | | Maximum size of Actor input schema | 500 kB | | | | | Maximum number of Actors per user | 100 | | | | | Maximum number of tasks per user | 1000 | | | | | Maximum number of schedules per user | 100 | | | | | Maximum number of webhooks per user | 100 | | | | | Maximum number of Actors per schedule | 10 | | | | | Maximum number of tasks per schedule | 10 | | | | | Maximum number of concurrent Actor runs per user | 25 | 32 | 128 | 256 | ## Usage limit[​](#usage-limit "Direct link to Usage limit") The Apify platform also introduces usage limits based on the billing plan to protect users from accidental overspending. To learn more about usage limits, head over to the [Limits](https://docs.apify.com/platform/console/billing.md#limits) section of our docs. View these limits and adjust your maximum usage limit in [Apify Console](https://console.apify.com/billing#/limits): ![](/assets/images/usage-limits-2b0ebb13462f1d8122148611409b965a.png "Apify Security Whitepaper") --- # Monitoring **Learn how to continuously make sure that your Actors and tasks perform as expected and retrieve correct results. Receive alerts when your jobs or their metrics are not as you expect.** *** The web is continuously evolving, and so are the websites you interact with. If you implement Apify Actors or the data they provide into your daily workflows, you need to make sure that everything runs as expected. > Monitoring allows you to track and observe how the software works. It enables you to measure and compare your programs' performance over time and to be notified when something goes wrong. Also, you can use the data you gain from monitoring to optimize your software and maximize its potential. ## Built-in monitoring[​](#built-in-monitoring "Direct link to Built-in monitoring") Monitoring is an option you can find on any Actor or saved task in Apify Console. It allows you to display metric statistics about your solution's runs and set up alerts for when your solution behaves differently than you expect. The monitoring system is free for all users. You can use it to monitor as many Actors and tasks as you want, and it does not use any additional resources on top of your usage when running them. ![Monitoring](/assets/images/monitoring-d9eb8c25cd2bfdf9432a0505f8fdeabc.png) ### Features[​](#features "Direct link to Features") Currently, the monitoring option offers the following features: 1. 
Chart showing **statuses** of runs of the Actor or saved task over the last 30 days. ![Daily run statuses](/assets/images/daily-run-statuses-8f9e27cf20c91230ef60f420e6c62d91.png) 2. Chart displaying **metrics** of the last 200 runs of the Actor or saved task. ![Run statistics](/assets/images/run-statistics-chart-51e2921b6fec3a9bde25171ef0bde7f2.png) 3. Option to set up **alerts** with notifications based on the run metrics. ![Alerts](/assets/images/alerts-9dbb69f1af9c6c56ea6737a4633fa8e7.png) > Both charts can also be added to your Apify Console home page so you can quickly see if there are any issues every time you open Apify Console. ### Alert configuration[​](#alert-configuration "Direct link to Alert configuration") When you set up an alert, you have four choices for how you want the metrics to be evaluated, and depending on your choice, the alerting system will behave differently: 1. **Alert, when the metric is lower than** - This type of alert is checked after the run finishes. If the metric is lower than the value you set, the alert will be triggered and you will receive a notification. 2. **Alert, when the metric is higher than** - This type of alert is checked both during the run and after the run finishes. During the run, we do periodic checks (approximately every 5 minutes) so that we can notify you as soon as possible if the metric is higher than the value you set. After the run finishes, we do a final check to make sure that the metric does not go over the limit in the last few minutes of the run. 3. **Alert, when run status is one of following** - This type of alert is checked only after the run finishes. It makes it possible to track the status of your finished runs and send an alert if the run finishes in a state you do not expect. If your Actor runs very often and suddenly starts failing, you will receive a single alert within 1 minute of the first failed run, and then an aggregated alert every 15 minutes. 4. **Alert for dataset field statistics** - If you have a [dataset schema](https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema/validation.md) set up, then you can use the field statistics to set up an alert. For example, you can use field statistics to track whether some field is filled in all records, whether a numeric value is too low or too high (for example, when tracking the price of a product over multiple sources), or whether the number of items in an array is too low or too high (for example, alerting on an Instagram Actor if a post has a lot of comments), and many other similar tasks. important Available dataset fields are taken from the last successful build of the monitored Actor. If different versions have different fields, currently the solution will always display only those from the default version. ![Metric condition configuration](/assets/images/metric-options-17a021316d5b262a6fd030f060de3d0b.png) You can get notified by email, Slack, or in Apify Console. If you use Slack, we suggest using Slack notifications instead of email because they are more reliable, and you can also get notified more quickly. 1. **Email** - You can set up one or more emails to receive alert notifications. To do that, you just have to separate them by commas. You can also disable email notifications if you don't want to receive them. 2. **Slack** - To set up Slack notifications, you first need to connect your Slack workspace to Apify. To do that, go to your [account integration settings](https://console.apify.com/account/integrations) and click on the **+ Add** button in the Slack section.
Once you have your workspace connected, you can choose the workspace when setting up alert notifications and then pick a channel to which you want the notifications to be delivered. 3. **In Console** - You can also get notified in Apify Console. This is useful if you access Apify Console often, and you do not need to be notified as soon as possible. ![Notifications configurations](/assets/images/notifications-fb7866065c7158dd7f9ecc88ed03ca88.png) ### Alert notification[​](#alert-notification "Direct link to Alert notification") The email and Slack alert notifications both contain the same information. You will receive a notification with the following information: 1. **Alert name** 2. **Condition** - The condition that was violated. 3. **Value** - The value of the metric violating the condition and triggering the alert. 4. **Run ID** - The ID of the run that triggered the alert, which links directly to the run detail in Apify Console. 5. **Actor** - The full name of the Actor that triggered the alert, which links to the Actor detail in Apify Console. 6. **Task** - If the monitoring alert was set up for a task, then this field will contain the name of the task, which links to the task detail in Apify Console. ![Email notification](/assets/images/email-notification-5b04c48ae46552b82a0849cc2f3ba834.png) While the in-app notification will contain less information, it will point you directly to the Actor or task that triggered the alert: ![](/assets/images/in-app-notification-0fb9d2298241d2a8533df11f9fa46beb.png "In app notification") ## Other[​](#other "Direct link to Other") ### What should I monitor when scraping?[​](#what-should-i-monitor-when-scraping "Direct link to What should I monitor when scraping?") You might want to monitor various metrics when you're scraping the web. Here are some examples: **Data quality**: 1. **Number of results** returned by your solution. This is useful if you are scraping a list of products, for example. You can set up an alert to notify you if the number of results is lower than expected, which indicates that something has changed on the website you are scraping. 2. **Number of fields** returned. This also indicates a change on the website, for example, the manufacturer name moving to another place. **Performance**: 1. **Duration** of the run. If your solution is taking longer than usual to finish, you can set up an alert to notify you. This will help you prevent your solution from being stuck and from wasting resources. **Usage and cost**: 1. **Usage cost** may change when the target website implements a bot-blocking solution. An increase in cost may indicate that many URLs are being retried. 2. **Proxy usage.** Seeing how your solution uses a proxy and if there are any changes can help you optimize your usage and prevent increased costs if your solution starts behaving differently than expected. These are just a few examples of what you can monitor. It's always recommended to start small, iterate, and get more complex over time based on your experience. ### Alternative solutions[​](#alternative-solutions "Direct link to Alternative solutions") For more complex monitoring, you can use the [monitoring suite](https://apify.com/apify/monitoring), which is a collection of [Apify Actors](https://docs.apify.com/platform/actors.md) that allows you to automate the monitoring of jobs you have running on the [Apify platform](https://apify.com). The monitoring suite offers some features that are not **currently** available in Apify Console, such as: 1.
Schema validation of the output 2. Duplicate checks in the output 3. Dashboards with data grouping 4. Daily/weekly/monthly monitoring instead of after every run > Please note that this solution is more complex and requires more time to set up. Also, it uses schedules, Actors, tasks, and webhooks, so using it will increase your overall usage on the Apify platform. --- # Proxy **Learn to anonymously access websites in scraping/automation jobs. Improve data outputs and efficiency of bots, and access websites from various geographies.** *** > [Apify Proxy](https://apify.com/proxy) allows you to change your IP address when web scraping to reduce the chance of being [blocked](https://docs.apify.com/academy/anti-scraping/techniques.md) because of your geographical location. You can use proxies in your [Actors](https://docs.apify.com/platform/actors.md) or any other application that supports HTTP proxies. Apify Proxy monitors the health of your IP pool and intelligently rotates addresses to prevent IP address-based blocking. You can view your proxy settings and password on the [Proxy](https://console.apify.com/proxy) page in Apify Console. For pricing information, visit [apify.com/pricing](https://apify.com/pricing). ## Quickstart[​](#quickstart "Direct link to Quickstart") Usage of Apify Proxy means just a couple of lines of code, thanks to our [SDKs](https://docs.apify.com/sdk.md): * JavaScript SDK with PuppeteerCrawler * Python SDK with requests ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); const proxyConfiguration = await Actor.createProxyConfiguration(); const crawler = new PuppeteerCrawler({ proxyConfiguration, async requestHandler({ page }) { console.log(await page.content()); }, }); await crawler.run(['https://proxy.apify.com/?format=json']); await Actor.exit(); ``` ``` import requests, asyncio from apify import Actor async def main(): async with Actor: proxy_configuration = await Actor.create_proxy_configuration() proxy_url = await proxy_configuration.new_url() proxies = { 'http': proxy_url, 'https': proxy_url, } response = requests.get('https://api.apify.com/v2/browser-info', proxies=proxies) print(response.text) if __name__ == '__main__': asyncio.run(main()) ``` ## Proxy types[​](#proxy-types "Direct link to Proxy types") Several types of proxy servers exist, each offering distinct advantages, disadvantages, and varying pricing structures. You can use them to access websites from various geographies and with different levels of anonymity. #### [Datacenter proxy](https://docs.apify.com/platform/proxy/datacenter-proxy.md) [The fastest and cheapest option. It uses datacenters to change your IP address. Note that there is a chance of being blocked because of the activity of other users.](https://docs.apify.com/platform/proxy/datacenter-proxy.md) #### [Residential proxy](https://docs.apify.com/platform/proxy/residential-proxy.md) [IP addresses located in homes and offices around the world. These IPs are the least likely to be blocked.](https://docs.apify.com/platform/proxy/residential-proxy.md) #### [Google SERP proxy](https://docs.apify.com/platform/proxy/google-serp-proxy.md) [Download and extract data from Google Search Engine Result Pages (SERPs). You can select country and language to get localized results.](https://docs.apify.com/platform/proxy/google-serp-proxy.md) --- # Datacenter proxy **Learn how to reduce blocking when web scraping using IP address rotation. 
See proxy parameters and learn to implement Apify Proxy in an application.** *** Datacenter proxies are a cheap, fast and stable way to mask your identity online. When you access a website using a datacenter proxy, the site can only see the proxy center's credentials, not yours. Datacenter proxies allow you to mask and [rotate](https://docs.apify.com/platform/proxy/usage.md#ip-address-rotation) your IP address during web scraping and automation jobs, reducing the possibility of them being [blocked](https://docs.apify.com/academy/anti-scraping/techniques.md#access-denied). For each [HTTP/S request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods), the proxy takes the list of all available IP addresses and selects the one used the longest time ago for the specific hostname. You can refer to our [blog post](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for tips on how to make the most out of datacenter proxies. ## Features[​](#features "Direct link to Features") * Periodic health checks of proxies in the pool so requests are not forwarded via dead proxies. * Intelligent rotation of IP addresses so target hosts are accessed via proxies that have accessed them the longest time ago, to reduce the chance of blocking. * Periodically checks whether proxies are banned by selected target websites. If they are, stops forwarding traffic to them to get the proxies unbanned as soon as possible. * Ensures proxies are located in specific countries using IP geolocation. * Allows selection of groups of proxy servers with specific characteristics. * Supports persistent sessions that enable you to keep the same IP address for certain parts of your crawls. * Measures statistics of traffic for specific users and hostnames. * Allows selection of proxy servers by country. ## Datacenter proxy types[​](#datacenter-proxy-types "Direct link to Datacenter proxy types") When using Apify's datacenter proxies, you can either select a proxy group, or the `auto` mode. [Apify Proxy](https://apify.com/proxy) offers either proxy groups that are shared across multiple customers or dedicated ones. ### Shared proxy groups[​](#shared-proxy-groups "Direct link to Shared proxy groups") Each user has access to a selected number of proxy servers from a shared pool. These servers are spread into groups (called proxy groups). Each group shares a common feature (location, provider, speed, etc.). For a full list of plans and number of allocated proxy servers for each plan, see our [pricing](https://apify.com/pricing). To get access to more servers, you can upgrade your plan in the [subscription settings](https://console.apify.com/billing/subscription); ### Dedicated proxy groups[​](#dedicated-proxy-groups "Direct link to Dedicated proxy groups") When you purchase access to dedicated proxy groups, they are assigned to you, and only you can use them. You gain access to a range of static IP addresses from these groups. This feature is also useful if you have your own pool of proxy servers and still want to benefit from the features of Apify Proxy (like [IP address rotation](https://docs.apify.com/platform/proxy/usage.md#ip-address-rotation), [persistent sessions](#session-persistence), and health checking). If you do not have your own pool, the [customer support](https://apify.com/contact) team can set up a dedicated group for you based on your needs and requirements. Prices for dedicated proxy servers are mainly based on the number of proxy servers, their type, and location. 
[Contact us](https://apify.com/contact) for more information. ## Connecting to datacenter proxies[​](#connecting-to-datacenter-proxies "Direct link to Connecting to datacenter proxies") By default, each proxied HTTP request is potentially sent via a different target proxy server, which adds overhead and could be potentially problematic for websites which save cookies based on IP address. If you want to pick an IP address and pass all subsequent connections via that same IP address, you can use the `session` [parameter](https://docs.apify.com/platform/proxy/usage.md#sessions). ### Username parameters[​](#username-parameters "Direct link to Username parameters") The `username` field enables you to pass various [parameters](https://docs.apify.com/platform/proxy/usage.md#connection-settings), such as groups, session and country, for your proxy connection. **This parameter is optional**. By default, the proxy uses all available proxy servers from all groups you have access to. If you do not want to specify either `groups` or `session` parameters and therefore use the default behavior for both, set the username to `auto`. ### Examples[​](#examples "Direct link to Examples") * PuppeteerCrawler * CheerioCrawler * Python SDK with requests * gotScraping() ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); const proxyConfiguration = await Actor.createProxyConfiguration(); const crawler = new PuppeteerCrawler({ proxyConfiguration, async requestHandler({ page }) { console.log(await page.content()); }, }); await crawler.run(['https://proxy.apify.com/?format=json']); await Actor.exit(); ``` ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); const proxyConfiguration = await Actor.createProxyConfiguration(); const crawler = new CheerioCrawler({ proxyConfiguration, async requestHandler({ body }) { // ... console.log(body); }, }); await crawler.run(['https://proxy.apify.com']); await Actor.exit(); ``` ``` from apify import Actor import requests, asyncio async def main(): async with Actor: proxy_configuration = await Actor.create_proxy_configuration() proxy_url = await proxy_configuration.new_url() proxies = { 'http': proxy_url, 'https': proxy_url, } for _ in range(10): response = requests.get('https://api.apify.com/v2/browser-info', proxies=proxies) print(response.text) if __name__ == '__main__': asyncio.run(main()) ``` ``` import { Actor } from 'apify'; import { gotScraping } from 'got-scraping'; await Actor.init(); const proxyConfiguration = await Actor.createProxyConfiguration(); const proxyUrl = await proxyConfiguration.newUrl(); const url = 'https://api.apify.com/v2/browser-info'; const response1 = await gotScraping({ url, proxyUrl, responseType: 'json', }); const response2 = await gotScraping({ url, proxyUrl, responseType: 'json', }); console.log(response1.body.clientIp); console.log('Should be different than'); console.log(response2.body.clientIp); await Actor.exit(); ``` ## Session persistence[​](#session-persistence "Direct link to Session persistence") When you use datacenter proxy with the `session` [parameter](https://docs.apify.com/platform/proxy/usage.md#sessions) set in the `username` [field](#username-parameters), a single IP is assigned to the `session ID` provided after you make the first request. **Session IDs represent IP addresses. 
Therefore, you can manage the IP addresses you use by managing sessions.** \[[More info](https://docs.apify.com/platform/proxy/usage.md#sessions)]

This IP/session ID combination is persisted and expires 26 hours later. Each additional request resets the expiration time to 26 hours. So if you use the session at least once a day, it will never expire, with two possible exceptions:

* The proxy server stops responding and is marked as dead during a health check.
* The proxy server is part of a proxy group that is refreshed monthly and is rotated out.

If the session is discarded due to the reasons above, it is assigned a new IP address.

To learn more about [sessions](https://docs.apify.com/platform/proxy/usage.md#sessions) and [IP address rotation](https://docs.apify.com/platform/proxy/usage.md#ip-address-rotation), see the [proxy overview page](https://docs.apify.com/platform/proxy.md).

### Examples using sessions[​](#examples-using-sessions "Direct link to Examples using sessions")

* PuppeteerCrawler
* CheerioCrawler
* Python SDK with requests
* gotScraping()

```
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    sessionPoolOptions: { maxPoolSize: 1 },
    async requestHandler({ page }) {
        console.log(await page.content());
    },
});

await crawler.run([
    'https://proxy.apify.com/?format=json',
    'https://proxy.apify.com',
]);

await Actor.exit();
```

```
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    sessionPoolOptions: { maxPoolSize: 1 },
    async requestHandler({ json }) {
        // ...
        console.log(json);
    },
});

await crawler.run([
    'https://api.apify.com/v2/browser-info',
    'https://proxy.apify.com/?format=json',
]);

await Actor.exit();
```

```
from apify import Actor
import requests, asyncio

async def main():
    async with Actor:
        proxy_configuration = await Actor.create_proxy_configuration()
        proxy_url = await proxy_configuration.new_url('my_session')
        proxies = {
            'http': proxy_url,
            'https': proxy_url,
        }

        # each request uses the same IP address
        for _ in range(10):
            response = requests.get('https://api.apify.com/v2/browser-info', proxies=proxies)
            print(response.text)

if __name__ == '__main__':
    asyncio.run(main())
```

```
import { Actor } from 'apify';
import { gotScraping } from 'got-scraping';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl('my_session');

const response1 = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    proxyUrl,
    responseType: 'json',
});

const response2 = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    proxyUrl,
    responseType: 'json',
});

console.log(response1.body.clientIp);
console.log('Should be the same as');
console.log(response2.body.clientIp);

await Actor.exit();
```

## Examples using standard libraries and languages[​](#examples-using-standard-libraries-and-languages "Direct link to Examples using standard libraries and languages")

You can find your proxy password on the [Proxy page](https://console.apify.com/proxy) of the Apify Console.

> The `username` field is **not** your Apify username. Instead, you specify proxy settings (e.g. `groups-BUYPROXIES94952`, `session-123`). Use `auto` for default settings.
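For instance, here is a minimal sketch of building a proxy URL that combines the `groups` and `session` parameters described in the note above. The group name `BUYPROXIES94952` and the session name are only illustrative; substitute a group you actually have access to and your own proxy password:

```
// Minimal sketch: compose a datacenter proxy URL with a group and a session.
// Assumes your proxy password is exported as APIFY_PROXY_PASSWORD.
const password = process.env.APIFY_PROXY_PASSWORD;
const username = 'groups-BUYPROXIES94952,session-my_session_1';
const proxyUrl = `http://${username}:${password}@proxy.apify.com:8000`;

console.log(proxyUrl);
```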
For examples using [PHP](https://www.php.net/), you need to have the [cURL](https://www.php.net/manual/en/book.curl.php) extension enabled in your PHP installation. See [installation instructions](https://www.php.net/manual/en/curl.installation.php) for more information.

Examples in [Python 2](https://www.python.org/download/releases/2.0/) use the [six](https://pypi.org/project/six/) library. Run `pip install six` to enable it.

* Node.js (axios)
* Python 3
* Python 2
* PHP
* PHP (Guzzle)

```
import axios from 'axios';

const proxy = {
    protocol: 'http',
    host: 'proxy.apify.com',
    port: 8000,
    // Replace the empty password below with your password
    // found at https://console.apify.com/proxy
    auth: { username: 'auto', password: '' },
};

const url = 'http://proxy.apify.com/?format=json';

const { data } = await axios.get(url, { proxy });

console.log(data);
```

```
import urllib.request as request
import ssl

# Replace the empty password below with your password
# found at https://console.apify.com/proxy
password = ""
proxy_url = f"http://auto:{password}@proxy.apify.com:8000"
proxy_handler = request.ProxyHandler({
    "http": proxy_url,
    "https": proxy_url,
})

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
httpHandler = request.HTTPSHandler(context=ctx)

opener = request.build_opener(httpHandler, proxy_handler)
print(opener.open("http://proxy.apify.com/?format=json").read())
```

```
import six
from six.moves.urllib import request

# Replace the empty password below with your password
# found at https://console.apify.com/proxy
password = ""
proxy_url = (
    "http://auto:%s@proxy.apify.com:8000" % (password)
)
proxy_handler = request.ProxyHandler({
    "http": proxy_url,
    "https": proxy_url,
})

opener = request.build_opener(proxy_handler)
print(opener.open("http://proxy.apify.com/?format=json").read())
```

```
<?php
$curl = curl_init('http://proxy.apify.com/?format=json');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_PROXY, 'http://proxy.apify.com:8000');
// Replace <password> below with your password
// found at https://console.apify.com/proxy
curl_setopt($curl, CURLOPT_PROXYUSERPWD, "auto:<password>");
$response = curl_exec($curl);
curl_close($curl);
if ($response) echo $response;
?>
```

```
<?php
require 'vendor/autoload.php';

$client = new \GuzzleHttp\Client([
    // Replace <password> below with your password
    // found at https://console.apify.com/proxy
    'proxy' => 'http://auto:<password>@proxy.apify.com:8000'
]);

$response = $client->get("http://proxy.apify.com/?format=json");
echo $response->getBody();
```

---

# Google SERP proxy

**Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany.**

***

Google SERP proxy allows you to extract search results from Google Search-powered services. It allows you to search in [various countries](#country-selection) and to dynamically switch between country domains.

Our Google SERP proxy currently supports the below services.

* Google Search (`http://www.google.<country TLD>/search`).
* Google Shopping (`http://www.google.<country TLD>/shopping/product/`).
* Google Shopping Search (`http://www.google.<country TLD>/search?tbm=shop`).

> Google SERP proxy can **only** be used for Google Search and Shopping. It cannot be used to access other websites.

When using the proxy, **pricing is based on the number of requests made**.

## Connecting to Google SERP proxy[​](#connecting-to-google-serp-proxy "Direct link to Connecting to Google SERP proxy")

Requests made through the proxy are automatically routed through a proxy server from the selected country and the pure **HTML code of the search result page is returned**.

**Important:** Only HTTP requests are allowed, and the Google hostname needs to start with the `www.` prefix.
For code examples on how to connect to Google SERP proxies, see the [examples](#examples-using-the-apify-sdk) section.

### Username parameters[​](#username-parameters "Direct link to Username parameters")

The `username` field enables you to pass various [parameters](https://docs.apify.com/platform/proxy/usage.md#username-parameters), such as groups and country, for your proxy connection.

When using Google SERP proxy, the username should always be:

```
groups-GOOGLE_SERP
```

Unlike [datacenter](https://docs.apify.com/platform/proxy/datacenter-proxy.md) or [residential](https://docs.apify.com/platform/proxy/residential-proxy.md) proxies, there is no [session](https://docs.apify.com/platform/proxy/usage.md#sessions) parameter.

If you use the `country` [parameter](https://docs.apify.com/platform/proxy/usage.md), the Google proxy location is used if you access a website whose hostname (stripped of `www.`) starts with **google**.

## Country selection[​](#country-selection "Direct link to Country selection")

You must use the correct Google domain to get results for your desired country code. For example:

* Search results from the USA: `http://www.google.com/search?q=`
* Shopping results from Great Britain: `http://www.google.co.uk/search?tbm=shop&q=`

See a [full list](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/List_of_Google_domains.html) of available domain names for specific countries. When using them, remember to prepend the domain name with the `www.` prefix.

## Examples[​](#examples "Direct link to Examples")

### Using the Apify SDK[​](#examples-using-the-apify-sdk "Direct link to Using the Apify SDK")

If you are developing your own Apify [Actor](https://docs.apify.com/platform/actors.md) using the [Apify SDK](https://docs.apify.com/sdk.md) and [Crawlee](https://crawlee.dev/), the most efficient way to use Google SERP proxy is [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler). This is because Google SERP proxy [only returns a page's HTML](https://docs.apify.com/platform/proxy.md). Alternatively, you can use the [got-scraping](https://github.com/apify/got-scraping) [NPM package](https://www.npmjs.com/package/got-scraping) by specifying the proxy URL in the options. For Python, you can leverage the [`requests`](https://pypi.org/project/requests/) library along with the Apify SDK.

The following examples get a list of search results for the keyword **wikipedia** from the USA (`google.com`).

* CheerioCrawler
* Python SDK with requests
* gotScraping()

```
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['GOOGLE_SERP'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ body }) {
        // ...
        console.log(body);
    },
});

await crawler.run(['http://www.google.com/search?q=wikipedia']);

await Actor.exit();
```

```
from apify import Actor
import requests, asyncio

async def main():
    async with Actor:
        proxy_configuration = await Actor.create_proxy_configuration(groups=['GOOGLE_SERP'])
        proxy_url = await proxy_configuration.new_url()
        proxies = {
            'http': proxy_url,
            'https': proxy_url,
        }

        response = requests.get('http://www.google.com/search?q=wikipedia', proxies=proxies)
        print(response.text)

if __name__ == '__main__':
    asyncio.run(main())
```

```
import { Actor } from 'apify';
import { gotScraping } from 'got-scraping';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['GOOGLE_SERP'],
});
const proxyUrl = await proxyConfiguration.newUrl();

const { body } = await gotScraping({
    url: 'http://www.google.com/search?q=wikipedia',
    proxyUrl,
});

console.log(body);

await Actor.exit();
```

### Using standard libraries and languages[​](#using-standard-libraries-and-languages "Direct link to Using standard libraries and languages")

You can find your proxy password on the [Proxy page](https://console.apify.com/proxy/access) of Apify Console.

> The `username` field is **not** your Apify username. Instead, you specify proxy settings (e.g. `groups-GOOGLE_SERP`). Use `groups-GOOGLE_SERP` to use proxies from all available countries.

For examples using [PHP](https://www.php.net/), you need to have the [cURL](https://www.php.net/manual/en/book.curl.php) extension enabled in your PHP installation. See [installation instructions](https://www.php.net/manual/en/curl.installation.php) for more information.

Examples in [Python 2](https://www.python.org/download/releases/2.0/) use the [six](https://pypi.org/project/six/) library. Run `pip install six` to enable it.

The following examples get the HTML of search results for the keyword **wikipedia** from the USA (**google.com**). Select this option by setting the `username` parameter to `groups-GOOGLE_SERP`. Add the item you want to search to the `query` parameter.
* Node.js (axios)
* Python 3
* Python 2
* PHP
* PHP (Guzzle)

```
import axios from 'axios';

const proxy = {
    protocol: 'http',
    host: 'proxy.apify.com',
    port: 8000,
    // Replace the empty password below with your password
    // found at https://console.apify.com/proxy
    auth: { username: 'groups-GOOGLE_SERP', password: '' },
};

const url = 'http://www.google.com/search';
const params = { q: 'wikipedia' };

const { data } = await axios.get(url, { proxy, params });

console.log(data);
```

```
import urllib.request as request
import urllib.parse as parse

# Replace the empty password below with your password
# found at https://console.apify.com/proxy
password = ''
proxy_url = f"http://groups-GOOGLE_SERP:{password}@proxy.apify.com:8000"
proxy_handler = request.ProxyHandler({
    'http': proxy_url,
})

opener = request.build_opener(proxy_handler)

query = parse.urlencode({ 'q': 'wikipedia' })
print(opener.open(f"http://www.google.com/search?{query}").read())
```

```
import six
from six.moves.urllib import request, parse

# Replace the empty password below with your password
# found at https://console.apify.com/proxy
password = ''
proxy_url = (
    'http://groups-GOOGLE_SERP:%s@proxy.apify.com:8000' % (password)
)
proxy_handler = request.ProxyHandler({
    'http': proxy_url,
})

opener = request.build_opener(proxy_handler)

query = parse.urlencode({ 'q': 'wikipedia' })
url = (
    'http://www.google.com/search?%s' % (query)
)
print(opener.open(url).read())
```

```
<?php
$query = http_build_query(['q' => 'wikipedia']);
$curl = curl_init('http://www.google.com/search?' . $query);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_PROXY, 'http://proxy.apify.com:8000');
// Replace <password> below with your password
// found at https://console.apify.com/proxy
curl_setopt($curl, CURLOPT_PROXYUSERPWD, 'groups-GOOGLE_SERP:<password>');
$response = curl_exec($curl);
curl_close($curl);
echo $response;
?>
```

```
<?php
require 'vendor/autoload.php';

$client = new \GuzzleHttp\Client([
    // Replace <password> below with your password
    // found at https://console.apify.com/proxy
    'proxy' => 'http://groups-GOOGLE_SERP:<password>@proxy.apify.com:8000'
]);

$response = $client->get("http://www.google.com/search", [
    'query' => ['q' => 'wikipedia']
]);
echo $response->getBody();
```

---

# Residential proxy

**Achieve a higher level of anonymity using IP addresses from human users. Access a wider pool of proxies and reduce blocking by websites' anti-scraping measures.**

***

Residential proxies use IP addresses assigned by Internet Service Providers to the homes and offices of actual users. Unlike [datacenter proxies](https://docs.apify.com/platform/proxy/datacenter-proxy.md), traffic from residential proxies is indistinguishable from that of legitimate users.

This solution allows you to access a larger pool of servers than datacenter proxies. This makes it a better option in cases when you need a large number of different IP addresses.

Residential proxies support [IP address rotation](https://docs.apify.com/platform/proxy/usage.md#ip-address-rotation) and [sessions](#session-persistence).

**Pricing is based on data traffic**. It is measured for each connection made and displayed on your [proxy usage dashboard](https://console.apify.com/proxy/usage) in the Apify Console.

## Connecting to residential proxy[​](#connecting-to-residential-proxy "Direct link to Connecting to residential proxy")

Connecting to residential proxy works the same way as connecting to [datacenter proxy](https://docs.apify.com/platform/proxy/datacenter-proxy.md), with two differences.

1. The `groups` [username parameter](https://docs.apify.com/platform/proxy/usage.md#username-parameters) should always specify `RESIDENTIAL`.
2. You can specify the country in which you want your proxies to be.
### How to set a proxy group[​](#how-to-set-a-proxy-group "Direct link to How to set a proxy group")

When using [standard libraries and languages](https://docs.apify.com/platform/proxy/datacenter-proxy.md), specify the `groups` parameter in the [username](https://docs.apify.com/platform/proxy/usage.md#username-parameters) as `groups-RESIDENTIAL`.

For example, your **proxy URL** when using the [got-scraping](https://www.npmjs.com/package/got-scraping) JavaScript library will look like this:

```
const proxyUrl = 'http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000';
```

In the [Apify SDK](https://docs.apify.com/sdk.md), you set the **groups** in your proxy configuration:

* JavaScript
* Python

```
import { Actor } from 'apify';

await Actor.init();
// ...
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});
// ...
await Actor.exit();
```

```
from apify import Actor

async def main():
    async with Actor:
        # ...
        proxy_configuration = await Actor.create_proxy_configuration(groups=['RESIDENTIAL'])
        # ...
```

### How to set a proxy country[​](#how-to-set-a-proxy-country "Direct link to How to set a proxy country")

When using [standard libraries and languages](https://docs.apify.com/platform/proxy/datacenter-proxy.md), specify the `country` parameter in the [username](https://docs.apify.com/platform/proxy/usage.md#username-parameters) as `country-COUNTRY-CODE`.

For example, your `username` parameter when using [Python 3](https://docs.python.org/3/) will look like this:

```
username = "groups-RESIDENTIAL,country-JP"
```

In the [Apify SDK](https://docs.apify.com/sdk.md), you set the country in your proxy configuration using two-letter [country codes](https://laendercode.net/en/2-letter-list.html). Specify the groups as `RESIDENTIAL`, then add a `countryCode`/`country_code` parameter:

* JavaScript
* Python

```
import { Actor } from 'apify';

await Actor.init();
// ...
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'FR',
});
// ...
await Actor.exit();
```

```
from apify import Actor

async def main():
    async with Actor:
        # ...
        proxy_configuration = await Actor.create_proxy_configuration(
            groups=['RESIDENTIAL'],
            country_code='FR',
        )
        # ...
```

## Session persistence[​](#session-persistence "Direct link to Session persistence")

When using residential proxy with the `session` [parameter](https://docs.apify.com/platform/proxy/usage.md#sessions) set in the [username](https://docs.apify.com/platform/proxy/usage.md#username-parameters), a single IP address is assigned to the **session ID** provided after you make the first request.

**Session IDs represent IP addresses. Therefore, you can manage the IP addresses you use by managing sessions.** \[[More info](https://docs.apify.com/platform/proxy/usage.md#sessions)]

This IP/session ID combination is persisted for 1 minute. Each subsequent request resets the expiration time to 1 minute. If the proxy server becomes unresponsive or the session expires, a new IP address is selected for the next request.

> If you really need to persist the same session, you can try sending some data using that session (e.g. every 20 seconds) to keep it alive, as shown in the sketch below. Provided the connection is not interrupted, this will let you keep the IP address for longer.

To learn more about [sessions](https://docs.apify.com/platform/proxy/usage.md#sessions) and [IP address rotation](https://docs.apify.com/platform/proxy/usage.md#ip-address-rotation), see the proxy [overview page](https://docs.apify.com/platform/proxy.md).
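To illustrate the keep-alive tip above, here is a minimal sketch using [got-scraping](https://github.com/apify/got-scraping). It assumes a `RESIDENTIAL` session with the illustrative name `keep_alive_session` and simply sends a small request every 20 seconds so that the session's IP address is not released while your main logic runs:

```
import { Actor } from 'apify';
import { gotScraping } from 'got-scraping';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});
// 'keep_alive_session' is an illustrative session name.
const proxyUrl = await proxyConfiguration.newUrl('keep_alive_session');

// Ping a lightweight endpoint every ~20 seconds to keep the session alive.
const keepAlive = setInterval(async () => {
    try {
        await gotScraping({ url: 'https://api.apify.com/v2/browser-info', proxyUrl });
    } catch (err) {
        console.warn('Keep-alive request failed; the session may rotate:', err.message);
    }
}, 20_000);

// ... run the work that relies on keeping the same IP address here ...

clearInterval(keepAlive);
await Actor.exit();
```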
## Tips to keep in mind[​](#tips-to-keep-in-mind "Direct link to Tips to keep in mind")

[Residential](https://docs.apify.com/platform/proxy.md) proxies are less predictable than [datacenter](https://docs.apify.com/platform/proxy/datacenter-proxy.md) proxies and are priced differently (by the amount of traffic used rather than by the number of IPs). Because of this, there are some important things to consider before using residential proxy in your solutions.

### Control traffic used by automated browsers[​](#control-traffic-used-by-automated-browsers "Direct link to Control traffic used by automated browsers")

Residential proxy is priced by the data traffic used, so it is easy to quickly use up all your prepaid traffic, particularly when accessing websites with large files loaded on every page.

To reduce your traffic use, we recommend using the `blockRequests()` function of [`playwrightUtils`](https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests)/[`puppeteerUtils`](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#blockRequests) (depending on the library used).

### Connected proxy speed variation[​](#connected-proxy-speed-variation "Direct link to Connected proxy speed variation")

Each host on the residential proxy network uses a different device. They have different network speeds and different latencies. This means that requests made with one [session](https://docs.apify.com/platform/proxy/usage.md#sessions) can be extremely fast, while another request with a different session can be extremely slow. The difference can range from a few milliseconds to a few seconds.

If your solution requires quickly loaded content, the best option is to set a [session](https://docs.apify.com/platform/proxy/usage.md#sessions), try a small request and see if the response time is acceptable. If it is, you can use this session for other requests. Otherwise, repeat the attempt with a different session.

### Connection interruptions[​](#connection-interruptions "Direct link to Connection interruptions")

While sessions are persistent, they can be destroyed at any time if the host devices are turned off or disconnected.

There is no easy solution to this problem. One option is to avoid using residential proxy for larger requests (and use [datacenter](https://docs.apify.com/platform/proxy/datacenter-proxy.md) proxy instead). If you have no other choice, expect that interruptions might happen and write your solution with this in mind.

---

# Proxy usage

**Learn how to configure and use Apify Proxy. See the required parameters such as the correct username and password.**

***

## Connection settings[​](#connection-settings "Direct link to Connection settings")

To connect to Apify Proxy, you use the [HTTP proxy protocol](https://en.wikipedia.org/wiki/Proxy_server#Web_proxy_servers). This means that you need to configure your HTTP client to use the proxy server at the Apify Proxy hostname and provide it with your Apify Proxy password and the other parameters described below.

The full connection string has the following format:

```
http://<username>:<password>@<proxy hostname>:<proxy port>
```

caution

All usage of Apify Proxy with your password is charged towards your account. Do not share the password with untrusted parties or use it from insecure networks, as **the password is sent unencrypted** due to the HTTP protocol's [limitations](https://www.guru99.com/difference-http-vs-https.html).
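As a minimal sketch of this connection string in practice, the snippet below configures [got-scraping](https://github.com/apify/got-scraping) with a manually assembled proxy URL. The `auto` username and the environment variable holding the password are assumptions made for the example; any HTTP client that supports an HTTP proxy can be configured the same way:

```
import { gotScraping } from 'got-scraping';

// Assumes your Apify Proxy password is exported as APIFY_PROXY_PASSWORD.
const password = process.env.APIFY_PROXY_PASSWORD;
const proxyUrl = `http://auto:${password}@proxy.apify.com:8000`;

const { body } = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    proxyUrl,
    responseType: 'json',
});

console.log(body.clientIp);
```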
### External connection[​](#external-connection "Direct link to External connection")

If you want to connect to Apify Proxy from outside of the Apify Platform, you need to have a paid Apify plan (to prevent abuse). If you need to test Apify Proxy before you subscribe, please [contact our support](https://apify.com/contact).

| Parameter | Value / explanation |
| --------- | ------------------- |
| Hostname | `proxy.apify.com` |
| Port | `8000` |
| Username | Specifies the proxy parameters such as groups, [session](#sessions) and location. See [username parameters](#username-parameters) below for details. **Note**: this is not your Apify username. |
| Password | Apify Proxy password. Your password is displayed on the [Proxy](https://console.apify.com/proxy/groups) page in Apify Console. **Note**: this is not your Apify account password. |

caution

If you use these connection parameters for connecting to Apify Proxy from your Actors running on the Apify Platform, the connection will still be considered external: it will not work on the Free plan, and on paid plans you will be charged for external data transfer. Please use the connection parameters from the [Connection from Actors](#connection-from-actors) section when using Apify Proxy from Actors.

Example connection string for external connections:

```
http://auto:apify_proxy_EaAFg6CFhc4eKk54Q1HbGDEiUTrk480uZv03@proxy.apify.com:8000
```

### Connection from Actors[​](#connection-from-actors "Direct link to Connection from Actors")

If you want to connect to Apify Proxy from Actors running on the Apify Platform, the recommended way is to use the built-in proxy configuration tools in the [Apify SDK JavaScript](https://docs.apify.com/sdk/js/docs/guides/proxy-management) or [Apify SDK Python](https://docs.apify.com/sdk/python/docs/concepts/proxy-management).

If you don't want to use these helpers and want to connect to Apify Proxy manually, you can find the right configuration values in the [environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md) provided to the Actor. By using this configuration, you ensure that you connect to Apify Proxy directly through the Apify infrastructure, bypassing any external connection via the Internet, thereby improving the connection speed and ensuring you don't pay for external data transfer.

| Parameter | Source / explanation |
| --------- | -------------------- |
| Hostname | `APIFY_PROXY_HOSTNAME` environment variable |
| Port | `APIFY_PROXY_PORT` environment variable |
| Username | Specifies the proxy parameters such as groups, [session](#sessions) and location. See [username parameters](#username-parameters) below for details. **Note**: this is not your Apify username. |
| Password | `APIFY_PROXY_PASSWORD` environment variable |

Example connection string creation:

```
const { APIFY_PROXY_HOSTNAME, APIFY_PROXY_PORT, APIFY_PROXY_PASSWORD } = process.env;
const connectionString = `http://auto:${APIFY_PROXY_PASSWORD}@${APIFY_PROXY_HOSTNAME}:${APIFY_PROXY_PORT}`;
```

### Username parameters[​](#username-parameters "Direct link to Username parameters")

The `username` field enables you to pass parameters like **[groups](#proxy-groups)**, **[session ID](#sessions)** and **country** for your proxy connection.

For example, if you're using [datacenter proxies](https://docs.apify.com/platform/proxy/datacenter-proxy.md) and want to use the `new_job_123` session using the `SHADER` group, the username will be:

```
groups-SHADER,session-new_job_123
```

The table below describes the available parameters.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `groups` | Required | Set proxied requests to use servers from the selected groups: `groups-[group name]` or `auto` when using datacenter proxies; `groups-RESIDENTIAL` when using residential proxies; `groups-GOOGLE_SERP` when using Google SERP proxies. |
| `session` | Optional | If specified to `session-new_job_123`, for example, all proxied requests with the same session identifier are routed through the same IP address. If not specified, each proxied request is assigned a randomly picked least used IP address. The session string can only contain numbers (0–9), letters (a-z or A-Z), dot (.), underscore (\_), and tilde (\~). The maximum length is 50 characters. Session management may work differently for residential and SERP proxies. Check the relevant documentation for more details. |
| `country` | Optional | If specified, all proxied requests will use proxy servers from the selected country. Note that if there are no proxy servers from the specified country, the connection will fail. For example, `groups-SHADER,country-US` uses proxies from the `SHADER` group located in the USA. By default, the proxy uses all available proxy servers from all countries. |

If you want to specify one parameter and not the others, just provide that parameter and omit the others. To use the default behavior (not specifying either `groups`, `session`, or `country`), set the username to `auto`, which serves as a default placeholder, because the proxy username cannot be empty.

## Code examples[​](#code-examples "Direct link to Code examples")

We have code examples for connecting to our proxy using the [Apify SDK](https://docs.apify.com/sdk.md) and [Crawlee](https://crawlee.dev/), as well as examples using other libraries and languages, including PHP.
* [Datacenter proxy](https://docs.apify.com/platform/proxy/datacenter-proxy.md#examples)
* [Residential proxy](https://docs.apify.com/platform/proxy/residential-proxy.md#connecting-to-residential-proxy)
* [Google SERP proxy](https://docs.apify.com/platform/proxy/google-serp-proxy.md#examples)

For code examples related to proxy management in Apify SDK and Crawlee, see:

* [Apify SDK JavaScript](https://docs.apify.com/sdk/js/docs/guides/proxy-management)
* [Apify SDK Python](https://docs.apify.com/sdk/python/docs/concepts/proxy-management)
* [Crawlee](https://crawlee.dev/docs/guides/proxy-management)

## IP address rotation[​](#ip-address-rotation "Direct link to IP address rotation")

Web scrapers can rotate the IP addresses they use to access websites. They assign each request a different IP address, which makes it appear like they are all coming from different users. This greatly enhances performance and data throughput.

Depending on whether you use a [browser](https://apify.com/apify/web-scraper) or [HTTP requests](https://apify.com/apify/cheerio-scraper) for your scraping jobs, IP address rotation works differently.

* Browser - a different IP address is used for each browser.
* HTTP request - a different IP address is used for each request.

Use [sessions](#sessions) to control how you rotate IP addresses.

See our guide [Anti-scraping techniques](https://docs.apify.com/academy/anti-scraping/techniques.md) to learn more about IP address rotation and our findings on how blocking works.

## Sessions[​](#sessions "Direct link to Sessions")

Sessions allow you to use the same IP address for multiple connections. In cases where you need to keep the same session (e.g. when you need to log in to a website), it is best to keep the same proxy, and thus the same IP address. On the other hand, by switching the IP address, you can avoid being blocked by the website.

To set a new session, pass the `session` parameter in your [username](https://docs.apify.com/platform/proxy/usage.md#username-parameters) field when connecting to a proxy. This will serve as the session's ID, and an IP address will be assigned to it. To [use that IP address in other requests](https://docs.apify.com/platform/proxy/datacenter-proxy.md#connecting-to-datacenter-proxies), pass that same session ID in the username field.

We recommend using the [SessionPool](https://crawlee.dev/api/core/class/SessionPool) abstraction when managing sessions. The created session will then store information such as cookies and can be used to generate [browser fingerprints](https://docs.apify.com/academy/anti-scraping/mitigation/generating-fingerprints.md). You can also assign custom user data such as authorization tokens and specific headers.

Sessions are available for [datacenter](https://docs.apify.com/platform/proxy/datacenter-proxy.md) and [residential](https://docs.apify.com/platform/proxy/residential-proxy.md#session-persistence) proxies.

For datacenter proxies, a session persists for **26 hours** ([more info](https://docs.apify.com/platform/proxy/datacenter-proxy.md)). For residential proxies, it persists for **1 minute** ([more info](https://docs.apify.com/platform/proxy/residential-proxy.md#session-persistence)), but you can prolong its lifetime by regularly using the session. Google SERP proxies do not support sessions.

## Proxy groups[​](#proxy-groups "Direct link to Proxy groups")

You can see which proxy groups you have access to on the [Proxy page](https://console.apify.com/proxy/groups) in the Apify Console.
To use a specific proxy group (or multiple groups), specify it in the `username` parameter.

## Proxy IP addresses[​](#proxy-ip-addresses "Direct link to Proxy IP addresses")

If you need to allow communication to `proxy.apify.com`, add the following IP addresses to your firewall rule or whitelist:

* `18.208.102.16`
* `35.171.134.41`

## Troubleshooting[​](#troubleshooting "Direct link to Troubleshooting")

To view your connection status to [Apify Proxy](https://apify.com/proxy), open the proxy status page at `http://proxy.apify.com/` in the browser using the proxy. If the proxy connection is working, the page should look something like this:

![Apify proxy status page](/assets/images/proxy-status-6ea6ff237ab297936618fcb2c52a58e4.png)

To test that your requests are proxied and IP addresses are being [rotated](https://docs.apify.com/academy/anti-scraping/techniques.md) correctly, open the `https://api.apify.com/v2/browser-info` API endpoint via the proxy. It shows information about the client IP address.

### A different approach to `502 Bad Gateway`[​](#a-different-approach-to-502-bad-gateway "Direct link to a-different-approach-to-502-bad-gateway")

Sometimes the `502` status code is not informative enough. Therefore, our proxy server responds with `590-599` codes instead to provide more insight:

* `590 Non Successful`: upstream responded with a non-200 status code.
* `591 RESERVED`: *this status code is reserved for further use.*
* `592 Status Code Out Of Range`: upstream responded with a status code different than 100–999.
* `593 Not Found`: DNS lookup failed, indicating either [`EAI_NODATA`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L82) or [`EAI_NONAME`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L83).
* `594 Connection Refused`: upstream refused the connection.
* `595 Connection Reset`: connection reset due to loss of connection or timeout.
* `596 Broken Pipe`: trying to write on a closed socket.
* `597 Auth Failed`: incorrect upstream credentials.
* `598 RESERVED`: *this status code is reserved for further use.*
* `599 Upstream Error`: generic upstream error.

The typical issues behind these codes are:

* `590` and `592` indicate an issue on the upstream side.
* `593` indicates an incorrect `proxy-chain` configuration.
* `594`, `595` and `596` may occur due to connection loss.
* `597` indicates incorrect upstream credentials.
* `599` is a generic error, where the above is not applicable.

Note that Apify Proxy is based on the [proxy-chain](https://github.com/apify/proxy-chain) open-source `npm` package developed and maintained by Apify. You can find the details of the above errors and their implementation there.

---

# Using your own proxies

**Learn how to use your own proxies while using the Apify platform.**

***

In addition to our proxies, you can use your own proxies, both in Apify Console and in the SDK.

## Custom proxies in console[​](#console "Direct link to Custom proxies in console")

To use your own proxies with Apify Console, in your Actor's **Input and options** tab, scroll down and open the **Proxy and browser configuration** section. Enter your proxy URLs, and you're good to go.
*Image: Using custom proxy in Apify Console*
to3o13loxUwn/fBpYTcHC8qBqwzse0vBsB/qHH+uaPZSFVisPy9uvlnK1xsJAt5ZcNXNfTbYjiYD3e1fKkYL8y/0qFNMxN//bcM947lbdn6fG4V16eWNYibroe7dNqvJQfOkM/7NYtSbndVMq5ju8tbWsJ8ZDpv0F0d6sLxKRpfz82NawaksDy+ZQ4iYi8KW3sGBYqcnpoh1J4thU/udS8wdlJ4VIRd3osbhVFlgy8id8tjrlcw0OA3T/prrBd66wFSOQEW6jD9tf/9KS9Cbg+Fd6dl9Ev4JQcWMUhuwZuQdChaM1JvgwJXO7sOE8nCpBPR9kHaHr44f7ZX2L/YhHT9LcG1+TtoPfTbYyzkdfbI9oz1w33JFMyn5b+kUQ3ZXygVo+H3Mq/OLgU+K6WNaheu7facVEt54Z/ybxTTD0liSguavl7e3/sG+/f0kFcZxHP84+O4cDm4hink4SaZYoFgoqxQb88ZyaBuZQNPhRUuv2nLrprX++DgIHPXxMA7HH8/G53URD1vU4+ZnOuV9vCHy9AK3XulCykA5LhI7AdDaFlmYc//XmBh1tK0aEud7mnRbjCdnX/4xG4MzgTbTxPpfwF1NWyFpwmUn0WG2nyuO56Cq1Tqv73yWFpLXX5y04Zo5Um5Tu/pkAh3TC9LYmX3bvUHNwW0Xc0zv5Sop49rFazlcMS/PcJOtfLTeM8/UE8w4N/+C3b/87Vey4bIsXFrqHRa7l7IX+S4ZjRfT5S6mL17pfPG4G+5iQv8Txa161cHopAx/O7FtEwF5iyGt3d9i8in02JWXwOEu7kiicorH9mIe/vY2JzCiL/sgrfHd/kRcDFEPF0MUBBdDFAQXQxQEF0MUhK6LYec/QC1/jIco/TOlnFL6ewfSajHs/H0VVkRkpXDvpX8pJjK5r5b+vQN/76/lYtj5KzaNj6cRSd33Yt4bG/PT36Sklv79A+m0GHb+fp1/JQUgfRCy9HdvZ3zP+pf+RbEAyLla+ncPbMr0Wgw7f7Xz90QlFbb0P5SDT2fS8i39TyQBQBpq6e8dSKPFsPP37/zh/EnnQpb++LoBzK0v+5b+CVm3zT1Jq6W/dyCNFsPO37/zN38YiwhX+rvXH1z6IyViGGcNtfT3DqTRYtj5+3f+ryQLhCz9HSkPLv2BVjX1a2FFLf29A+m4GHb+Nzv/qvwEELL0x5tzwCzmfS+VWU0AS/JZLf37B9JwMez8lc5/WRrNZjPv5vhhSv9ZqZa2JO9b+h9NNlq7r9OOUvp7B9JwMez8lc5/Slzv3Bw/TOmP+qQYxQGl/0VcZPu5Wvr3DxRek53/g3X+4Ut/0xpc+pvRhFr6ewcKb23NZOevV+fP0l9n7cGw89es82fpr7H2YPhuf6IAg8nUuRii4QfzO8PFEA09mH8ZfldGFGQwTS6GaOjBuA9cDFGAwfBrDNGQg+FPl4kCDCb0T5ezH0A0BpprJuqZ7kM0O/JirAgnQ2PB9B6iEWvkxcDKRojGS9bCcIshIiIiov/twYEAAAAAgKD9qRepAAAAAAAAAAAAAAAAAAAAAAAAAAAAoAaASd1peKBpPAAAAABJRU5ErkJggg==) ## Custom proxies in SDK[​](#SDK "Direct link to Custom proxies in SDK") In the Apify SDK, use the `proxyConfiguration.newUrl(sessionId)` (JavaScript) or `proxy_configuration.new_url(session_id)` (Python) command to add your custom proxy URLs to the proxy configuration. See the [JavaScript](https://docs.apify.com/sdk/js/api/apify/class/ProxyConfiguration#newUrl) or [Python](https://docs.apify.com/sdk/python/reference/class/ProxyConfiguration#new_url) SDK docs for more details. --- # Schedules **Learn how to automatically start your Actor and task runs and the basics of cron expressions. Set up and manage your schedules from Apify Console or via API.** *** Schedules allow you to run your Actors and tasks at specific times. You schedule the run frequency using [cron expressions](#cron-expressions). Timezone & Daylight Savings Time Schedules allow timezone settings and support daylight saving time shifts (DST). You can set up and manage your Schedules using: * [Apify Console](https://console.apify.com/schedules) * [Apify API](https://docs.apify.com/api/v2/schedules.md) * [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/ScheduleClient) * [Python API client](https://docs.apify.com/api/client/python/reference/class/ScheduleClient) When scheduling a new Actor or task run, you can override its input settings using a JSON object similarly to when invoking an Actor or task using the [Apify REST API](https://docs.apify.com/api/v2/schedules.md). Events Startup Variability In most cases, scheduled events are fired within one second of their scheduled time.However, runs can be delayed because of a system overload or a server shutting down. Each schedule can be associated with a maximum of *10* Actors and *10* Actor tasks. 
## Setting up a new schedule[​](#setting-up-a-new-schedule "Direct link to Setting up a new schedule")

Before setting up a new schedule, you should have the [Actor](https://docs.apify.com/platform/actors.md) or [task](https://docs.apify.com/platform/actors/running/tasks.md) you want to schedule prepared and tested.

To schedule an Actor, you need to have run it at least once before. To run the Actor, navigate to the Actor's page through [Apify Console](https://console.apify.com/store), where you can configure and initiate the Actor's run with your preferred settings by clicking the **Start** button. After this initial run, you can then use Schedules to automate future runs.

Name Length

Your schedule's name should be 3–63 characters long.

### Apify Console[​](#apify-console "Direct link to Apify Console")

In [Apify Console](https://console.apify.com/schedules), click on **Schedules** in the navigation menu, then click the **Create new** button.

Click on the name (by default, it is **My Schedule**); there you can change its name, add a description, and check its *Unique name* or *ID*.

You can adjust how often your Actor or task runs using the [schedule setup tool](#schedule-setup). You can find it by clicking on the **Schedule setup** card.

![New schedule](/assets/images/schedules-overview-4fc6e3773b82c99ce27f199649b95a83.png)

Next, you'll need to give the schedule something to run. This is where the Actor or task you prepared earlier comes in. Click on the **Add** dropdown and select whether you want to schedule an Actor or task.

If you're scheduling an Actor run, you'll be able to specify the Actor's [input](https://docs.apify.com/platform/actors/running/input-and-output.md) and running options like [build](https://docs.apify.com/platform/actors/development/builds-and-runs/builds.md), timeout, and [memory](https://docs.apify.com/platform/actors/running/usage-and-resources.md). The **timeout** value is specified in seconds; a value of *0* means there is no timeout, and the Actor runs until it finishes. If you don't provide an input, the Actor's default input is used. If you provide an input with some fields missing, the missing fields are filled in with values from the default input. If input options are not provided, the default option values are used.

![Add Actor to schedule](/assets/images/schedules-actor-input-6d44fb0b56f4bb415555e9100fa83b7e.png)

If you're scheduling a task, just select the task you prepared earlier using the drop-down. If you need to override the task's input, you can pass it as a JSON object in the **Input JSON overrides** field.

![Add task to schedule](/assets/images/schedules-task-input-d6416881175c4cba28d8735ba4a76663.png)

To add more Actors or tasks, just repeat the process. Now, all you need to do is click **Save & activate** and let the scheduler take care of running your jobs on time.

For integrations, you can also add a [webhook](https://docs.apify.com/platform/integrations/webhooks.md) to your tasks, which will notify you (or perform an action of your choice) every time the task runs.

### Apify API[​](#apify-api "Direct link to Apify API")

To create a new [schedule](https://docs.apify.com/api/v2/schedules.md) using the Apify API, send a `POST` request to the [create schedule](https://docs.apify.com/api/v2/schedules-post.md) endpoint.

You can find your [secret API token](https://docs.apify.com/platform/integrations.md) under the [Integrations](https://console.apify.com/account?tab=integrations) tab of your Apify account settings.
API authentication recommendations

When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL ([more info](https://docs.apify.com/api/v2.md#authentication)).

The `POST` request's payload should be a JSON object specifying the schedule's name, your [user ID](https://console.apify.com/account#/integrations), and the schedule's *actions*.

The following JSON object creates a schedule which runs an SEO audit of the Apify domain once a month.

```
{
    "name": "apify-domain-monthly-seo-audit",
    "userId": "7AxwNO4kCDZxsMHip",
    "isEnabled": true,
    "isExclusive": true,
    "cronExpression": "@monthly",
    "timezone": "UTC",
    "description": "A monthly audit of the Apify domain's SEO",
    "actions": [
        {
            "type": "RUN_ACTOR_TASK",
            "actorTaskId": "6rHoK2zjYJkmYhSug",
            "input": {
                "startUrl": "https://apify.com"
            }
        }
    ]
}
```

If the request is successful, you will receive a `201` [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) and a JSON object in the response body containing the details of your new schedule. If you receive an error (a `4**` code), you will need to check your API token, user ID, or `POST` request body.

You can add multiple Actor and task runs to a schedule with a single `POST` request. Simply add another object with the run's details to the **actions** array in your `POST` request's payload object.

For more information, refer to the [schedules](https://docs.apify.com/api/v2/schedule-get.md) section in our API documentation.

## Schedule setup[​](#schedule-setup "Direct link to Schedule setup")

The schedule setup tool uses [cron expressions](https://en.wikipedia.org/wiki/Cron#CRON_expression) to specify run times. If you're familiar with how to use them and need a specific run schedule, you can dive right in. If not, don't worry - the setup tool has a visual custom schedule builder that provides a similar level of control as cron expressions, though it's not quite as powerful.

![Schedule setup tool](/assets/images/schedules-setup-tool-55422d2b611433faca739c1d024217dd.png)

The **Next runs** section shows when the next run will be; if you click on the **Show more** button, it will expand and show you the next five runs. You can use this live feedback to experiment until you find the correct configuration.

You can find more information and examples of cron expressions on [crontab.guru](https://crontab.guru/). For additional and non-standard characters, see [this](https://en.wikipedia.org/wiki/Cron#CRON_expression) Wikipedia article.

### Notifications[​](#notifications "Direct link to Notifications")

The schedule setup tool allows you to control the schedule's notifications. All schedules have notifications enabled by default. When one of the scheduled Actors or tasks fails to start due to being configured incorrectly, you will receive an email notification.

If you want to manage the notifications for your schedules in bulk, you can do this from the [Notifications](https://console.apify.com/settings/notifications) settings tab. As long as you have 15 schedules or fewer, you can manage their notifications all at once by clicking the **Manage notifications for specific schedules** button.
![Schedule bulk notifications](/assets/images/schedules-bulk-notifications-72b807c829fc03c361a9d8360516223c.png)

## Cron expressions[​](#cron-expressions "Direct link to Cron expressions")

A cron expression has the following structure:

| Position | Field | Values | Wildcards | Optional |
| -------- | ------------ | ------------------------ | --------- | -------- |
| 1 | second | 0 - 59 | , - \* / | yes |
| 2 | minute | 0 - 59 | , - \* / | no |
| 3 | hour | 0 - 23 | , - \* / | no |
| 4 | day of month | 1 - 31 | , - \* / | no |
| 5 | month | 1 - 12 | , - \* / | no |
| 6 | day of week | 0 - 7 (0 or 7 is Sunday) | , - \* / | no |

For example, the expression `30 5 16 * * 1` will start an Actor at 16:05:30 every Monday.

The minimum interval between runs is 10 seconds; if your next run is scheduled sooner than 10 seconds after the previous run, the next run will be skipped.

### Examples of cron expressions[​](#examples-of-cron-expressions "Direct link to Examples of cron expressions")

* `0 8 * * *` - every day at 8 AM.
* `0 0 * * 0` - every 7 days (at 00:00 on Sunday).
* `*/3 * * * *` - every 3rd minute.
* `0 0 1 */2 *` - every other month (at 00:00 on the first day of the month, every 2nd month).

Additionally, you can use the following shortcut expressions:

* `@yearly` = `0 0 1 1 *` - once a year, on Jan 1st at midnight.
* `@monthly` = `0 0 1 * *` - once a month, on the 1st at midnight.
* `@weekly` = `0 0 * * 0` - once a week, on Sunday at midnight.
* `@daily` = `0 0 * * *` - once a day, at midnight.
* `@hourly` = `0 * * * *` - on the hour, every hour.

---

# Security

**Learn more about Apify's security practices and data protection measures that are used to protect your Actors, their data, and the Apify platform in general.**

***

## SOC 2 type II compliance[​](#soc-2-type-ii-compliance "Direct link to SOC 2 type II compliance")

The Apify platform is SOC 2 Type II compliant. This means that we have undergone an independent audit to ensure that our information security practices, policies, procedures, and operations comply with SOC 2 standards for security, availability, and confidentiality of customer data.

[![](/assets/images/soc2-logo-ef4ccbc67876671e3b226d5d42fb62b6.png "Apify Security Whitepaper")](https://www.aicpa-cima.com/topic/audit-assurance/audit-and-assurance-greater-than-soc-2 "AICPA SOC 2")

To learn more, read the [announcement](https://blog.apify.com/apify-soc2/) and visit our [Trust Center](https://trust.apify.com) for additional information or to request a copy of our SOC 2 Type II report.

## Trust center[​](#trust-center "Direct link to Trust center")

To learn more about Apify's security practices, data protection measures, and compliance certifications, please visit our [Trust Center](https://trust.apify.com). The Trust Center includes:

* List of our security certifications and compliance reports
* Information about Apify's data protection controls
* List of Apify's data subprocessors
* An AI chatbot to answer your security-related questions

## Security Whitepaper[​](#security-whitepaper "Direct link to Security Whitepaper")

At Apify, security is our top priority every day. Security best practices are reflected in our development, deployment, monitoring, and project management processes.
Read the Apify Security Whitepaper for a comprehensive description of Apify's security measures and commitments: [![](/assets/images/whitepaper-cover-e66741eb1a1ffef137ff2d308cd808e1.png "Apify Security Whitepaper")](https://apify.com/security-whitepaper.pdf "Apify Security Whitepaper") ## Vulnerability disclosure policy[​](#vulnerability-disclosure-policy "Direct link to Vulnerability disclosure policy") We invite security researchers, ethical hackers, and the broader community to help us keep Apify safe by reporting any potential security vulnerabilities or weaknesses. Your responsible disclosure helps protect our users and strengthen the Apify platform. *Scope:* The following Apify services and domains are eligible for security research and responsible reporting: * [apify.com](https://apify.com) * [console.apify.com](https://console.apify.com) * [api.apify.com](https://api.apify.com) * [console-backend.apify.com](https://console-backend.apify.com) Please use your personal account for research purposes. Free accounts are sufficient for most testing. *Out-of-scope:* * Issues with third-party systems * Clickjacking on non-sensitive pages * SPF/DKIM/DMARC or other email configuration issues * Best practices or informational findings without impact * Denial of Service (DoS), brute-force attacks, and resource exhaustion * Social engineering, phishing, or physical attacks * Attacks requiring MITM or stolen credentials *We are especially interested in reports that demonstrate:* * Unauthorized access to data * Elevation of privileges * Server-side vulnerabilities (e.g., SSRF, RCE) * Cross-site scripting (XSS) and injection attacks * Logic flaws impacting account integrity or billing * Authentication/authorization issues * Data leaks due to misconfiguration ### Reporting process[​](#reporting-process "Direct link to Reporting process") If you notice or suspect a potential security issue, please report it to our security team at with as much detail as possible, including the following: * Clear description of the issue * Step-by-step reproduction instructions * PoC (screenshots or code snippets) * Impact analysis * Affected URL or endpoint Voluntary disclosures Thank you for helping us keep Apify secure! Please note that we don’t offer financial or other rewards for vulnerability reports. Participation in our VDP is entirely voluntary, and we sincerely appreciate your contribution to the safety of the platform and the community. ### Rules of engagement[​](#rules-of-engagement "Direct link to Rules of engagement") * Only target accounts or data you control (test accounts) * Never disrupt our services or other users * Avoid privacy violations and do not destroy or alter data * Automated scanners are not permitted without prior approval * No spam, DoS, or social engineering * Submit one vulnerability per report (unless chaining is required) If you follow these guidelines and act in good faith, we will not take legal action against you for responsibly reporting a security issue. Crucial rules and legal obligations Please adhere strictly to the following rules. Failure to do so may result in legal action: * *Do not publicly disclose vulnerabilities until resolved.* This ensures that the issue can be properly evaluated and mitigated before being exposed to potential exploitation. * *Treat all related information as confidential.* Any details about a vulnerability you are reporting are considered confidential information and cannot be disclosed unless explicitly approved by Apify in writing. 
* *Comply with all legal terms.* As per our [Terms of Service](https://docs.apify.com/legal), you must not take any action that might cause an overload, disruption, or denial of service, result in unauthorized access to another user's data, or have a similar adverse effect on Apify's services or other users. ## Securing your data[​](#securing-your-data "Direct link to Securing your data") The Apify platform provides you with multiple ways to secure your data, including [encrypted environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables.md) for storing your configuration secrets and [encrypted input](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/secret-input.md) for securing the input parameters of your Actors. --- # Storage **Store anything from images and key-value pairs to structured output data. Learn how to access and manage your stored data on the Apify Console or via the API.** *** The Apify platform provides three types of storage accessible both within our [Apify Console](https://console.apify.com/storage) and externally through our [REST API](https://docs.apify.com/api/v2.md) [Apify API Clients](https://docs.apify.com/api.md) or [SDKs](https://docs.apify.com/sdk.md). #### [Dataset](https://docs.apify.com/platform/storage/dataset.md) [Stores results from web scraping and data processing, with each Actor run getting a unique dataset. Features include table-like data visualization and multiple export formats like JSON and Excel.](https://docs.apify.com/platform/storage/dataset.md) #### [Key-value store](https://docs.apify.com/platform/storage/key-value-store.md) [Stores various data types like JSON, HTML, images, and strings. Accessible via Apify Console or API, it's ideal for diverse data storage needs.​](https://docs.apify.com/platform/storage/key-value-store.md) #### [Request queue](https://docs.apify.com/platform/storage/request-queue.md) [Manages URL processing for web crawling and other tasks. Supports different crawling orders and allows for querying and updating URLs, accessible via Apify Console or API​.](https://docs.apify.com/platform/storage/request-queue.md) --- # Dataset **Store and export web scraping, crawling or data processing job results. Learn how to access and manage datasets in Apify Console or via API.** *** Dataset storage enables you to sequentially save and retrieve data. A unique dataset is automatically created and assigned to each Actor run when the first item is stored. Typically, datasets comprise results from web scraping, crawling, and data processing jobs. You can visualize this data in a table, where each object is forming a row and its attributes are represented as columns. You have the option to export data in various formats, including JSON, CSV, XML, Excel, HTML Table, RSS or JSONL. > Named datasets are retained indefinitely. Unnamed datasets expire after 7 days unless otherwise specified. [Learn more](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages) Dataset storage is *append-only* - data can only be added and cannot be modified or deleted once stored. ## Basic usage[​](#basic-usage "Direct link to Basic usage") You can access your datasets in several ways: * [Apify Console](https://console.apify.com) - provides an easy-to-understand interface. * [Apify API](https://docs.apify.com/api/v2.md) - to access your datasets programmatically. 
* [Apify API client](https://docs.apify.com/api.md) - to access your datasets from any Node.js/Python application.
* [Apify SDKs](https://docs.apify.com/sdk.md) - when building your own JavaScript/Python Actor.

### Apify Console[​](#apify-console "Direct link to Apify Console")

In [Apify Console](https://console.apify.com), you can view your datasets in the [Storage](https://console.apify.com/storage) section under the [Datasets](https://console.apify.com/storage?tab=datasets) tab.

![Datasets in app](/assets/images/datasets-app-7f95b1edcb4e2cd28d7885c648820bf0.png)

To view or download a dataset:

1. Click on its **Dataset ID**.
2. Select the format & configure other options if desired in the **Export dataset** section.
3. Click **Download**.

Utilize the **Actions** menu to modify the dataset's name, which also affects its [retention period](https://docs.apify.com/platform/storage/usage.md#data-retention), and to adjust [access rights](https://docs.apify.com/platform/collaboration.md). The **API** button allows you to explore and test the dataset's [API endpoints](https://docs.apify.com/api/v2/storage-datasets.md).

![Datasets detail view](/assets/images/datasets-detail-eae867b18e248d8dc218a9e626c9e858.png)

### Apify API[​](#apify-api "Direct link to Apify API")

The [Apify API](https://docs.apify.com/api/v2/storage-datasets.md) gives you programmatic access to your datasets using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods).

If you are accessing your datasets using the `username~store-name` [store ID format](https://docs.apify.com/platform/storage.md), you will need to use your secret API token. You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) tab of the **Settings** page of your Apify account.

> When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](https://docs.apify.com/platform/integrations/api.md#authentication)).

To retrieve a list of your datasets, send a GET request to the [Get list of datasets](https://docs.apify.com/api/v2/datasets-get.md) endpoint.

```
https://api.apify.com/v2/datasets
```

To get information about a dataset, such as its creation time and item count, send a GET request to the [Get dataset](https://docs.apify.com/api/v2/dataset-get.md) endpoint.

```
https://api.apify.com/v2/datasets/{DATASET_ID}
```

To view a dataset's data, send a GET request to the [Get dataset items](https://docs.apify.com/api/v2/dataset-items-get.md) Apify API endpoint.

```
https://api.apify.com/v2/datasets/{DATASET_ID}/items
```

Control the data export by appending a comma-separated list of fields to the `fields` query parameter. Likewise, you can also omit certain fields using the `omit` parameter.

> If you fill both the `omit` and `fields` parameters with the same value, the `omit` parameter takes precedence and the field is excluded from the results.

In addition, you can set the format in which you retrieve the data using the `?format=` parameter. The available formats are `json`, `jsonl`, `csv`, `html`, `xlsx`, `xml` and `rss`. The default value is `json`.

To retrieve only the `hotel` and `cafe` fields, you would send your GET request to the URL below.

```
https://api.apify.com/v2/datasets/{DATASET_ID}/items?format=json&fields=hotel%2Ccafe
```

> Use `%2C` instead of commas, as `%2C` represents a URL-encoded comma.
For more on URL encoding check out [this page](https://www.url-encode-decode.com) To add data to a dataset, issue a POST request to the [Put items](https://docs.apify.com/api/v2/dataset-items-post.md) endpoint with the data as a JSON object payload. ``` https://api.apify.com/v2/datasets/{DATASET_ID}/items ``` > API data push to a dataset is capped at *200 requests per second* to avoid overloading our servers. Example payload: ``` [ { "foo": "bar" }, { "foo": "hotel" }, { "foo": "cafe" } ] ``` For further details and a breakdown of each storage API endpoint, refer to the [API documentation](https://docs.apify.com/api/v2/storage-datasets.md). ### Apify API Clients[​](#apify-api-clients "Direct link to Apify API Clients") #### JavaScript API client[​](#javascript-api-client "Direct link to JavaScript API client") The [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/DatasetClient) (`apify-client`) enables you access to your datasets from any Node.js application, whether hosted on the Apify platform or externally. After importing and initiating the client, you can save each dataset to a variable for easier access. ``` const myDatasetClient = apifyClient.dataset('jane-doe/my-dataset'); ``` You can then use that variable to [access the dataset's items and manage it](https://docs.apify.com/api/client/js/reference/class/DatasetClient). > When using the [`.listItems()`](https://docs.apify.com/api/client/js/reference/class/DatasetClient#listItems) method, if you fill both `omit` and `field` parameters with the same value, then `omit` parameter will take precedence and the field is excluded from the results. Check out the [JavaScript API client documentation](https://docs.apify.com/api/client/js/reference/class/DatasetClient) for [help with setup](https://docs.apify.com/api/client/js/docs) and more details. #### Python API client[​](#python-api-client "Direct link to Python API client") The [Python API client](https://docs.apify.com/api/client/python/reference/class/DatasetClient) (`apify-client`) enables you access to your datasets from any Python application, whether it is running on the Apify platform or externally. After importing and initiating the client, you can save each dataset to a variable for easier access. ``` my_dataset_client = apify_client.dataset('jane-doe/my-dataset') ``` You can then use that variable to [access the dataset's items and manage it](https://docs.apify.com/api/client/python/reference/class/DatasetClient). > When using the [`.list_items()`](https://docs.apify.com/api/client/python/reference/class/DatasetClient#list_items) method, if you fill both `omit` and `field` parameters with the same value, then `omit` parameter will take precedence and the field is excluded from the results. Check out the [Python API client documentation](https://docs.apify.com/api/client/python/reference/class/DatasetClient) for [help with setup](https://docs.apify.com/api/client/python/docs/overview/introduction) and more details. ### Apify SDKs[​](#apify-sdks "Direct link to Apify SDKs") #### JavaScript SDK[​](#javascript-sdk "Direct link to JavaScript SDK") When working with a JavaScript [Actor](https://docs.apify.com/platform/actors.md), the [JavaScript SDK](https://docs.apify.com/sdk/js/docs/guides/result-storage#dataset) is an essential tool, especially for dataset management. It simplifies the tasks of storing and retrieving data, seamlessly integrating with the Actor's workflow. 
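To tie the JavaScript client calls above together, the following sketch pushes a few items to a named dataset and then reads them back, keeping only selected fields. The token and dataset name are placeholder values, and the `fields` option mirrors the `fields` query parameter described in the API section above.

```
import { ApifyClient } from 'apify-client';

// Initialize the client with your API token (placeholder value).
const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

// Open a named dataset owned by your account (placeholder name).
const datasetClient = client.dataset('jane-doe/my-dataset');

// Append a few items to the dataset.
await datasetClient.pushItems([
    { hotel: 'Grand Hotel', cafe: 'Cafe Milano', note: 'extra field' },
    { hotel: 'Sea View', cafe: 'Espresso Bar', note: 'extra field' },
]);

// Read the items back, keeping only the 'hotel' and 'cafe' fields.
const { items } = await datasetClient.listItems({ fields: ['hotel', 'cafe'] });
console.log(items);
```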
Key features of the SDK include the ability to append data, retrieve what is stored, and manage dataset properties effectively. Central to this functionality is the [`Dataset`](https://docs.apify.com/sdk/js/reference/class/Dataset) class. This class allows you to determine where your data is stored - locally or in the Apify cloud. To add data to your chosen datasets, use the [`pushData()`](https://docs.apify.com/sdk/js/reference/class/Dataset#pushData) method. Additionally the SDK offers other methods like [`getData()`](https://docs.apify.com/sdk/js/reference/class/Dataset#getData), [`map()`](https://docs.apify.com/sdk/js/reference/class/Dataset#map), and [`reduce()`](https://docs.apify.com/sdk/js/reference/class/Dataset#reduce). For practical applications of these methods, refer to the [example](https://docs.apify.com/sdk/js/docs/examples/map-and-reduce) section. If you have chosen to store your dataset locally, you can find it in the location below. ``` {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` `DATASET_ID` refers to the dataset's *name* or *ID*. The default dataset will be stored in the *default* directory. To add data to the default dataset, you can use the example below: ``` // Import the JavaScript SDK into your project import { Actor } from 'apify'; await Actor.init(); // ... // Add one item to the default dataset await Actor.pushData({ foo: 'bar' }); // Add multiple items to the default dataset await Actor.pushData([{ foo: 'hotel' }, { foo: 'cafe' }]); // ... await Actor.exit(); ``` > It's crucial to use the `await` keyword when calling `pushData()`, to ensure data storage completes before the Actor process terminates. If you want to use something other than the default dataset, e.g. a dataset that you share between Actors or between Actor runs, you can use the [`Actor.openDataset()`](https://docs.apify.com/sdk/js/reference/class/Actor#openDataset) method. ``` import { Actor } from 'apify'; await Actor.init(); // ... // Save a named dataset to a variable const dataset = await Actor.openDataset('some-name'); // Add data to the named dataset await dataset.pushData({ foo: 'bar' }); // ... await Actor.exit(); ``` Utilize the `fields` option in the [`getData()`](https://docs.apify.com/sdk/js/reference/class/Dataset#getData) method to specify which data fields to retrieve. This option accepts an array of fields names (string) to include in your results. ``` import { Actor } from 'apify'; await Actor.init(); // ... const dataset = await Actor.openDataset(); // Only get the 'hotel' and 'cafe' fields const hotelAndCafeData = await dataset.getData({ fields: ['hotel', 'cafe'], }); // ... await Actor.exit(); ``` Check out the [JavaScript SDK documentation](https://docs.apify.com/sdk/js/docs/guides/result-storage#dataset) and the `Dataset` class's [API reference](https://docs.apify.com/sdk/js/reference/class/Dataset) for details on managing datasets with the JavaScript SDK. #### Python SDK[​](#python-sdk "Direct link to Python SDK") For Python [Actors](https://docs.apify.com/platform/actors.md), the [Python SDK](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-datasets) is essential. The dataset is represented by a [`Dataset`](https://docs.apify.com/sdk/python/reference/class/Dataset) class. You can use this class to specify whether your data is stored locally or in the Apify cloud and push data to the datasets of your choice using the [`push_data()`](https://docs.apify.com/sdk/python/reference/class/Dataset#push_data) method. 
For further data manipulation you could also use other methods such as [`get_data()`](https://docs.apify.com/sdk/python/reference/class/Dataset#get_data), [`map()`](https://docs.apify.com/sdk/python/reference/class/Dataset#map) and [`reduce()`](https://docs.apify.com/sdk/python/reference/class/Dataset#reduce). For datasets stored locally, the data is located at the following path: ``` {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` The `DATASET_ID` refers to the dataset's *name* or *ID*. The default dataset will be stored in the *default* directory. To add data to the default dataset, you can use the example below: ``` from apify import Actor async def main(): async with Actor: # Add one item to the default dataset await Actor.push_data({'foo': 'bar'}) # Add multiple items to the default dataset await Actor.push_data([{'foo': 'hotel'}, {'foo': 'cafe'}]) ``` If you want to use something other than the default dataset, e.g. a dataset that you share between Actors or between Actor runs, you can use the [`Actor.open_dataset()`](https://docs.apify.com/sdk/python/reference/class/Actor#open_dataset) method. ``` from apify import Actor async def main(): async with Actor: # Save a named dataset to a variable dataset = await Actor.open_dataset(name='some-name') # Add data to the named dataset await dataset.push_data({'foo': 'bar'}) ``` Utilize the `fields` option in the [`get_data()`](https://docs.apify.com/sdk/python/reference/class/Dataset#get_data) method to specify which data fields to retrieve. This option accepts an array of fields names (string) to include in your results. ``` from apify import Actor async def main(): async with Actor: dataset = await Actor.open_dataset() # Only get the 'hotel' and 'cafe' fields hotel_and_cafe_data = await dataset.get_data(fields=['hotel', 'cafe']) ``` For more information, visit our [Python SDK documentation](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-datasets) and the `Dataset` class's [API reference](https://docs.apify.com/sdk/python/reference/class/Dataset) for details on managing datasets with the Python SDK. ## Hidden fields[​](#hidden-fields "Direct link to Hidden fields") Fields in a dataset that begin with a `#` are treated as hidden. You can exclude these fields when downloading data by using either `skipHidden=1` or `clean=1` in your query parameters. This feature is useful for excluding debug information from the final dataset output. The following example demonstrates a dataset record with hidden fields, including HTTP response and error details. ``` { "url": "https://example.com", "title": "Example page", "data": { "foo": "bar" }, "#error": null, "#response": { "statusCode": 201 } } ``` Data excluding hidden fields, termed as "clean" data, can be downloaded from the [Apify Console](https://console.apify.com/storage?tab=datasets) using the **Clean items** option. Alternatively, you can download it via API by applying `clean=true` or `clean=1` as [URL parameters](https://docs.apify.com/api/v2/dataset-items-get.md). ## XML format extension[​](#xml-format-extension "Direct link to XML format extension") In `XML` and `RSS` export formats, object property name are converted into XML tags, and their corresponding values are represented as children of these tags. 
For example, the JavaScript object:

```
{
    name: 'Rashida Jones',
    address: [
        {
            type: 'home',
            street: '21st',
            city: 'Chicago',
        },
        {
            type: 'office',
            street: null,
            city: null,
        },
    ],
}
```

becomes the following XML snippet:

```
<name>Rashida Jones</name>
<address>
    <type>home</type>
    <street>21st</street>
    <city>Chicago</city>
</address>
<address>
    <type>office</type>
    <street/>
    <city/>
</address>
```

In a JavaScript object, if a property is named `@`, its sub-properties are exported as attributes of the corresponding parent XML element. Additionally, when the parent XML element lacks child elements, its value is sourced from a property named `#` in the JavaScript object.

For example, the following JavaScript object:

```
{
    address: [
        {
            '@': {
                type: 'home',
            },
            street: '21st',
            city: 'Chicago',
        },
        {
            '@': {
                type: 'office',
            },
            '#': 'unknown',
        },
    ],
}
```

will be transformed to the following XML snippet:

```
<address type="home">
    <street>21st</street>
    <city>Chicago</city>
</address>
<address type="office">unknown</address>
```

This feature is also useful when customizing your RSS feeds generated for various websites.

By default, the whole result is wrapped in an `<items>` element, while each page object is contained in an `<item>` element. You can change this using the `xmlRoot` and `xmlRow` URL parameters when retrieving your data with a GET request.

## Sharing[​](#sharing "Direct link to Sharing")

You can grant [access rights](https://docs.apify.com/platform/collaboration.md) to your dataset through the **Share** button under the **Actions** menu. For more details, check the [full list of permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md).

### Sharing datasets between runs[​](#sharing-datasets-between-runs "Direct link to Sharing datasets between runs")

You can access a dataset from any [Actor](https://docs.apify.com/platform/actors.md) or [task](https://docs.apify.com/platform/actors/running/tasks.md) run as long as you know its *name* or *ID*. To access a dataset from another run using the [Apify SDK](https://docs.apify.com/sdk.md), open it using the same method as you would with any other dataset.

* JavaScript
* Python

```
import { Actor } from 'apify';

await Actor.init();

const otherDataset = await Actor.openDataset('old-dataset');
// ...

await Actor.exit();
```

```
from apify import Actor

async def main():
    async with Actor:
        other_dataset = await Actor.open_dataset(name='old-dataset')
        # ...
```

In the [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/DatasetClient) as well as in the [Python API client](https://docs.apify.com/api/client/python/reference/class/DatasetClient), you can access a dataset using its client. Once you've opened the dataset, you can read its contents and add new data in the same manner as you would for a dataset from your current run.

* JavaScript
* Python

```
const otherDatasetClient = apifyClient.dataset('jane-doe/old-dataset');
```

```
other_dataset_client = apify_client.dataset('jane-doe/old-dataset')
```

The same applies for the [Apify API](#apify-api) - you can use [the same endpoints](#apify-api) as you would normally do.

See the [Storage overview](https://docs.apify.com/platform/storage/usage.md#sharing-storages-between-runs) for details on sharing storages between runs.

## Limits[​](#limits "Direct link to Limits")

* Data storage formats that use tabulation (like HTML, CSV, and Excel) are limited to a maximum of *3000* columns. Data exceeding this limit will not be retrieved.
* The `pushData()` method is constrained by the receiving API's size limit. It accepts objects with a JSON size under *9MB*. While individual objects within an array must not exceed *9MB*, the overall size has no restriction.
* The maximum length for dataset names is 63 characters. ### Rate limiting[​](#rate-limiting "Direct link to Rate limiting") The rate limit for pushing data to a dataset through the [API](https://docs.apify.com/api/v2/dataset-items-post.md) is capped at *200 requests per second* for each dataset, a measure to prevent overloading Apify servers. For all other dataset [API endpoints](https://docs.apify.com/api/v2/storage-datasets.md) , the rate limit is *30 requests per second* for each dataset. Check out the [API documentation](https://docs.apify.com/api/v2.md#rate-limiting) for more information and guidance on actions to take if you exceed these rate limits. --- # Key-value store **Store anything from Actor or task run results, JSON documents, or images. Learn how to access and manage key-value stores from Apify Console or via API.** *** The key-value store is simple storage that can be used for storing any kind of data. It can be JSON or HTML documents, zip files, images, or strings. The data are stored along with their [MIME content type](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Common_types). Each Actor run is assigned its own key-value store when it is created. The store contains the Actor's input, and, if necessary, other data such as its output. Key-value stores are mutable–you can both add entries and delete them. > Named key-value stores are retained indefinitely.Unnamed key-value stores expire after 7 days unless otherwise specified.> [Learn more](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages) ## Basic usage[​](#basic-usage "Direct link to Basic usage") You can access key-value stores through several methods * [Apify Console](https://console.apify.com) - provides an easy-to-understand interface. * [Apify API](https://docs.apify.com/api/v2.md) - for accessing your key-value stores programmatically. * [Apify API clients](https://docs.apify.com/api.md) - to access your key-value stores from any Node.js/Python application. * [Apify SDKs](https://docs.apify.com/sdk.md) - when building your own JavaScript/Python Actor. ### Apify Console[​](#apify-console "Direct link to Apify Console") In [Apify Console](https://console.apify.com), you can view your key-value stores in the [Storage](https://console.apify.com/storage) section under the [Key-value stores](https://console.apify.com/storage?tab=keyValueStores) tab. ![Key-value stores in app](/assets/images/key-value-stores-app-e32c8eb13addf4990370a0b02b7f3919.png) To view a key-value store's content, click on its **Store ID**. Under the **Actions** menu, you can rename your store (and, in turn extend its [retention period](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages)) and grant [access rights](https://docs.apify.com/platform/collaboration.md) using the **Share** button. Click on the **API** button to view and test a store's [API endpoints](https://docs.apify.com/api/v2/storage-key-value-stores.md). ![Key-value stores detail](/assets/images/key-value-stores-detail-8685afcf3febe86a16fb5a842a7071af.png) ### Apify API[​](#apify-api "Direct link to Apify API") The [Apify API](https://docs.apify.com/api/v2/storage-key-value-stores.md) enables you programmatic access to your key-value stores using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). If you are accessing your datasets using the `username~store-name` [store ID format](https://docs.apify.com/platform/storage.md), you will need to use your secret API token. 
You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) tab of **Settings** page of your Apify account. > When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](https://docs.apify.com/platform/integrations/api.md#authentication)). To retrieve a list of your key-value stores, send a GET request to the [Get list of key-value stores](https://docs.apify.com/api/v2/key-value-stores-get.md) endpoint. ``` https://api.apify.com/v2/key-value-stores ``` To get information about a key-value store such as its creation time and item count, send a GET request to the [Get store](https://docs.apify.com/api/v2/key-value-store-get.md) endpoint. ``` https://api.apify.com/v2/key-value-stores/{STORE_ID} ``` To get a record (its value) from a key-value store, send a GET request to the [Get record](https://docs.apify.com/api/v2/key-value-store-record-get.md) endpoint. ``` https://api.apify.com/v2/key-value-stores/{STORE_ID}/records/{KEY_ID} ``` To add a record with a specific key in a key-value store, send a PUT request to the [Store record](https://docs.apify.com/api/v2/key-value-store-record-put.md) endpoint. ``` https://api.apify.com/v2/key-value-stores/{STORE_ID}/records/{KEY_ID} ``` Example payload: ``` { "foo": "bar", "fos": "baz" } ``` To delete a record, send a DELETE request specifying the key from a key-value store to the [Delete record](https://docs.apify.com/api/v2/key-value-store-record-delete.md) endpoint. ``` https://api.apify.com/v2/key-value-stores/{STORE_ID}/records/{KEY_ID} ``` For further details and a breakdown of each storage API endpoint, refer to the [API documentation](https://docs.apify.com/api/v2/storage-key-value-stores.md). ### Apify API Clients[​](#apify-api-clients "Direct link to Apify API Clients") #### JavaScript API client[​](#javascript-api-client "Direct link to JavaScript API client") The Apify [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient) (`apify-client`) enables you to access your key-value stores from any Node.js application, whether hosted on the Apify platform or externally. After importing and initiating the client, you can save each key-value store to a variable for easier access. ``` const myKeyValStoreClient = apifyClient.keyValueStore( 'jane-doe/my-key-val-store', ); ``` You can then use that variable to [access the key-value store's items and manage it](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient). Check out the [JavaScript API client documentation](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient) for [help with setup](https://docs.apify.com/api/client/js/docs) and more details. #### Python API client[​](#python-api-client "Direct link to Python API client") The Apify [Python API client](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClient) (`apify-client`) allows you to access your key-value stores from any Python application, whether it is running on the Apify platform or externally. After importing and initiating the client, you can save each key-value store to a variable for easier access. ``` my_key_val_store_client = apify_client.key_value_store('jane-doe/my-key-val-store') ``` You can then use that variable to [access the key-value store's items and manage it](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClient). 
Check out the [Python API client documentation](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClient) for [help with setup](https://docs.apify.com/api/client/python/docs/overview/introduction) and more details. ### Apify SDKs[​](#apify-sdks "Direct link to Apify SDKs") #### JavaScript SDK[​](#javascript-sdk "Direct link to JavaScript SDK") When working with a JavaScript [Actor](https://docs.apify.com/platform/actors.md), the [JavaScript SDK](https://docs.apify.com/sdk/js/docs/guides/result-storage#key-value-store) is an essential tool, especially for key-value store management. The primary class for this purpose is the [`KeyValueStore`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore). This class allows you to decide whether your data will be stored locally or in the Apify cloud. For data manipulation, it offers the [`getValue()`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore#getValue) and [`setValue()`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore#setValue) methods to retrieve and assign values, respectively. Additionally, you can iterate over the keys in your store using the [`forEachKey()`](https://docs.apify.com/sdk/js/reference/class/KeyValueStore#forEachKey) method. Every Actor run is linked to a default key-value store that is automatically created for that specific run. If you're running your Actors and opt to store data locally, you can easily supply the [input](https://docs.apify.com/platform/actors/running/input-and-output.md) by placing an *INPUT.json* file in the corresponding directory of the default key-value store. This method ensures that you Actor has all the necessary data readily available for its execution. You can find *INPUT.json* and other key-value store files in the location below. ``` {APIFY_LOCAL_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT} ``` The default key-value store's ID is *default*. The `{KEY}` is the record's *key* and `{EXT}` corresponds to the record value's MIME content type. To manage your key-value stores, you can use the following methods. See the `KeyValueStore` class's [API reference](https://docs.apify.com/sdk/js/reference/class/KeyValueStore) for the full list. ``` import { Actor } from 'apify'; await Actor.init(); // ... // Get the default input const input = await Actor.getInput(); // Open a named key-value store const exampleStore = await Actor.openKeyValueStore('my-store'); // Read a record in the exampleStore storage const value = await exampleStore.getValue('some-key'); // Write a record to exampleStore await exampleStore.setValue('some-key', { foo: 'bar' }); // Delete a record from exampleStore await exampleStore.setValue('some-key', null); // ... await Actor.exit(); ``` > Note that JSON is automatically parsed to a JavaScript object, text data returned as a string and other data is returned as binary buffer. ``` import { Actor } from 'apify'; await Actor.init(); // ... // Get input of your Actor const input = await Actor.getInput(); const value = await Actor.getValue('my-key'); // ... await Actor.setValue('OUTPUT', imageBuffer, { contentType: 'image/jpeg' }); // ... await Actor.exit(); ``` The `Actor.getInput()` method is not only a shortcut to `Actor.getValue('INPUT')`; it is also compatible with [`Actor.metamorph()`](https://docs.apify.com/platform/actors/development/programming-interface/metamorph.md). This is because a metamorphed Actor run's input is stored in the *INPUT-METAMORPH-1* key instead of *INPUT*, which hosts the original input. 
Check out the [JavaScript SDK documentation](https://docs.apify.com/sdk/js/docs/guides/result-storage#key-value-store) and the `KeyValueStore` class's [API reference](https://docs.apify.com/sdk/js/reference/class/KeyValueStore) for details on managing your key-value stores with the JavaScript SDK. #### Python SDK[​](#python-sdk "Direct link to Python SDK") For Python [Actor](https://docs.apify.com/platform/actors.md), the [Python SDK](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-key-value-stores) is essential. The key-value store is represented by a [`KeyValueStore`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) class. You can use this class to specify whether your data is stored locally or in the Apify cloud. For further data manipulation it offers [`get_value()`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#get_value) and [`set_value()`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#set_value) methods to retrieve and assign values, respectively. Every Actor run is linked to a default key-value store that is automatically created for that specific run. If you're running your Actors and opt to store data locally, you can easily supply the [input](https://docs.apify.com/platform/actors/running/input-and-output.md) by placing an *INPUT.json* file in the corresponding directory of the default key-value store. This method ensures that you Actor has all the necessary data readily available for its execution. You can find *INPUT.json* and other key-value store files in the location below. ``` {APIFY_LOCAL_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT} ``` The default key-value store's ID is *default*. The {KEY} is the record's *key* and {EXT} corresponds to the record value's MIME content type. To manage your key-value stores, you can use the following methods. See the `KeyValueStore` class [documentation](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) for the full list. ``` from apify import Actor from apify.storages import KeyValueStore async def main(): async with Actor: # Open a named key-value store example_store: KeyValueStore = await Actor.open_key_value_store(name='my-store') # Read a record in the example_store storage value = await example_store.get_value('some-key') # Write a record to example_store await example_store.set_value('some-key', {'foo': 'bar'}) # Delete a record from example_store await example_store.set_value('some-key', None) ``` > Note that JSON is automatically parsed to a Python dictionary, text data returned as a string and other data is returned as binary buffer. ``` from apify import Actor async def main(): async with Actor: value = await Actor.get_value('my-key') # ... image_buffer = ... # Get image data await Actor.set_value(key='OUTPUT', value=image_buffer, content_type='image/jpeg') ``` The `Actor.get_input()` method is not only a shortcut to `Actor.get_value('INPUT')`; it is also compatible with [`Actor.metamorph()`](https://docs.apify.com/platform/actors/development/programming-interface/metamorph.md). This is because a metamorphed Actor run's input is stored in the *INPUT-METAMORPH-1* key instead of *INPUT*, which hosts the original input. Check out the [Python SDK documentation](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-key-value-stores) and the `KeyValueStore` class's [API reference](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) for details on managing your key-value stores with the Python SDK. 
## Compression[​](#compression "Direct link to Compression") Previously, when using the [Store record](https://docs.apify.com/api/v2/key-value-store-record-put.md) endpoint, every record was automatically compressed with Gzip before being uploaded. However, this process has been updated. *Now, records are stored exactly as you upload them.* This change means that it is up to you whether the record is stored compressed or uncompressed. You can compress a record and use the [Content-Encoding request header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding) to let our platform know which compression it uses. We recommend compressing large key-value records to save storage space and network traffic. *Using the [JavaScript SDK](https://docs.apify.com/sdk/js/reference/class/KeyValueStore#setValue) or our [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient#setRecord) automatically compresses your files.* We advise utilizing the JavaScript API client for data compression prior to server upload and decompression upon retrieval, minimizing storage costs. ## Sharing[​](#sharing "Direct link to Sharing") You can grant [access rights](https://docs.apify.com/platform/collaboration.md) to your key-value store through the **Share** button under the **Actions** menu. For more details check the [full list of permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md). ### Sharing key-value stores between runs[​](#sharing-key-value-stores-between-runs "Direct link to Sharing key-value stores between runs") You can access a key-value store from any [Actor](https://docs.apify.com/platform/actors.md) or [task](https://docs.apify.com/platform/actors/running/tasks.md) run as long as you know its *name* or *ID*. To access a key-value store from another run using the [Apify SDK](https://docs.apify.com/sdk.md), open it using the same method as you would do with any other store. * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); const otherStore = await Actor.openKeyValueStore('old-store'); // ... await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: other_store = await Actor.open_key_value_store(name='old-store') # ... ``` In the [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient) as well as in [Python API client](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClient), you can access a store using its client. Once you've opened a store, read and manage its contents like you would do with a key-value store from your current run. * JavaScript * Python ``` const otherStoreClient = apifyClient.keyValueStore('jane-doe/old-store'); ``` ``` other_store_client = apify_client.key_value_store('jane-doe/old-store') ``` The same applies for the [Apify API](#apify-api) - you can use [the same endpoints](#apify-api) as you would normally do. Check out the [Storage overview](https://docs.apify.com/platform/storage/usage.md#sharing-storages-between-runs) for details on sharing storages between runs. ## Data consistency[​](#data-consistency "Direct link to Data consistency") Key-value storage uses the [AWS S3](https://aws.amazon.com/s3/) service. According to the [S3 documentation](https://aws.amazon.com/s3/consistency/), it provides *strong read-after-write* consistency. ## Limits[​](#limits "Direct link to Limits") * The maximum length for key of key-value store is 63 characters. 
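To illustrate the Compression section above, here is a minimal sketch of uploading a gzip-compressed record to the Store record endpoint from Node.js, announcing the encoding via the `Content-Encoding` header. The store ID, record key, and token are placeholder values.

```
import { gzipSync } from 'node:zlib';

// Placeholders - replace with your own values.
const STORE_ID = 'MY-STORE-ID';
const KEY = 'my-record';
const TOKEN = 'MY-APIFY-TOKEN';

// Compress the JSON record before uploading it.
const body = gzipSync(JSON.stringify({ foo: 'bar', fos: 'baz' }));

const response = await fetch(
    `https://api.apify.com/v2/key-value-stores/${STORE_ID}/records/${KEY}`,
    {
        method: 'PUT',
        headers: {
            Authorization: `Bearer ${TOKEN}`,
            'Content-Type': 'application/json',
            // Tell the platform the payload is gzip-compressed.
            'Content-Encoding': 'gzip',
        },
        body,
    },
);
console.log(response.status); // Expect a 2xx status when the record is stored.
```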
--- # Request queue **Queue URLs for an Actor to visit in its run. Learn how to share your queues between Actor runs. Access and manage request queues from Apify Console or via API.** *** Request queues enable you to enqueue and retrieve requests such as URLs with an [HTTP method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and other parameters. They prove essential not only in web crawling scenarios but also in any situation requiring the management of a large number of URLs and the addition of new links. The storage system for request queues accommodates both breadth-first and depth-first crawling strategies, along with the inclusion of custom data attributes. This system enables you to check if certain URLs have already been encountered, add new URLs to the queue, and retrieve the next set of URLs for processing. > Named request queues are retained indefinitely.Unnamed request queues expire after 7 days unless otherwise specified.> [Learn more](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages) ## Basic usage[​](#basic-usage "Direct link to Basic usage") You can access your request queues in several ways: * [Apify Console](https://console.apify.com) - provides an easy-to-understand interface. * [Apify API](https://docs.apify.com/api/v2.md) - for accessing your request queues programmatically. * [Apify API clients](https://docs.apify.com/api.md) - to access your request queues from any Node.js application. * [Apify SDK](https://docs.apify.com/sdk.md) - when building your own JavaScript Actor. ### Apify Console[​](#apify-console "Direct link to Apify Console") In the [Apify Console](https://console.apify.com), you can view your request queues in the [Storage](https://console.apify.com/storage) section under the [Request queues](https://console.apify.com/storage?tab=requestQueues) tab. ![Request queues in app](/assets/images/request-queue-app-894d0f685329bf3b5e6b80c55d315473.png) To view a request queue, click on its **Queue ID**. Under the **Actions** menu, you can rename your queue's name (and, in turn, its [retention period](https://docs.apify.com/platform/storage/usage.md#named-and-unnamed-storages)) and [access rights](https://docs.apify.com/platform/collaboration.md) using the **Share** button. Click on the **API** button to view and test a queue's [API endpoints](https://docs.apify.com/api/v2/storage-request-queues.md). ![Request queues detail](/assets/images/request-queue-detail-1f70f5c5b1915f0de208bbe5fcdafe56.png) ### Apify API[​](#apify-api "Direct link to Apify API") The [Apify API](https://docs.apify.com/api/v2/storage-request-queues.md) allows you programmatic access to your request queues using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods). If you are accessing your datasets using the `username~store-name` [store ID format](https://docs.apify.com/platform/storage.md), you will need to use your secret API token. You can find the token (and your user ID) on the [Integrations](https://console.apify.com/account#/integrations) page of your Apify account. > When providing your API authentication token, we recommend using the request's `Authorization` header, rather than the URL. ([More info](https://docs.apify.com/platform/integrations/api.md#authentication)). To get a list of your request queues, send a GET request to the [Get list of request queues](https://docs.apify.com/api/v2/request-queues-get.md) endpoint. 
``` https://api.apify.com/v2/request-queues ``` To get information about a request queue such as its creation time and item count, send a GET request to the [Get request queue](https://docs.apify.com/api/v2/request-queue-get.md) endpoint. ``` https://api.apify.com/v2/request-queues/{QUEUE_ID} ``` To get a request from a queue, send a GET request to the [Get request](https://docs.apify.com/api/v2/request-queue-request-get.md) endpoint. ``` https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests/{REQUEST_ID} ``` To add a request to a queue, send a POST request with the request to be added as a JSON object in the request's payload to the [Add request](https://docs.apify.com/api/v2/request-queue-requests-post.md) endpoint. ``` https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests ``` Example payload: ``` { "uniqueKey": "http://example.com", "url": "http://example.com", "method": "GET" } ``` To update a request in a queue, send a PUT request with the request to update as a JSON object in the request's payload to the [Update request](https://docs.apify.com/api/v2/request-queue-request-put.md) endpoint. In the payload, specify the request's ID and add the information you want to update. ``` https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests/{REQUEST_ID} ``` Example payload: ``` { "id": "dnjkDMKLmdlkmlkmld", "uniqueKey": "http://example.com", "url": "http://example.com", "method": "GET" } ``` > When adding or updating requests, you can optionally provide a `clientKey` parameter to your request. It must be a string between 1 and 32 characters in length. This identifier is used to determine whether the queue was accessed by [multiple clients](#sharing). If `clientKey` is not provided, the system considers this API call to come from a new client. See the `hadMultipleClients` field returned by the [`Get head`](https://docs.apify.com/api/v2/request-queue-head-get.md) operation for details. > > Example: `client-abc` For further details and a breakdown of each storage API endpoint, refer to the [API documentation](https://docs.apify.com/api/v2/storage-key-value-stores.md). ### Apify API Clients[​](#apify-api-clients "Direct link to Apify API Clients") #### JavaScript API client[​](#javascript-api-client "Direct link to JavaScript API client") The Apify [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient) (`apify-client`) enables you to access your request queues from any Node.js application, whether it is running on the Apify platform or externally. After importing and initiating the client, you can save each request queue to a variable for easier access. ``` const myQueueClient = apifyClient.requestQueue('jane-doe/my-request-queue'); ``` You can then use that variable to [access the request queue's items and manage it](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient). Check out the [JavaScript API client documentation](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient) for [help with setup](https://docs.apify.com/api/client/js/docs) and more details. #### Python API client[​](#python-api-client "Direct link to Python API client") The Apify [Python API client](https://docs.apify.com/api/client/python) (`apify-client`) allows you to access your request queues from any Python application, whether it's running on the Apify platform or externally. After importing and initiating the client, you can save each request queue to a variable for easier access. 
``` my_queue_client = apify_client.request_queue('jane-doe/my-request-queue') ``` You can then use that variable to [access the request queue's items and manage it](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient). Check out the [Python API client documentation](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient) for [help with setup](https://docs.apify.com/api/client/python/docs/overview/introduction) and more details. ### Apify SDKs[​](#apify-sdks "Direct link to Apify SDKs") #### JavaScript SDK[​](#javascript-sdk "Direct link to JavaScript SDK") When working with a JavaScript [Actor](https://docs.apify.com/platform/actors.md), the [JavaScript SDK](https://docs.apify.com/sdk/js/docs/guides/request-storage#request-queue) is an essential tool, especially for request queue management. The primary class for this purpose is the [`RequestQueue`](https://docs.apify.com/sdk/js/reference/class/RequestQueue) class. Use this class to decide whether your data is stored locally or in the Apify cloud. If you are building a JavaScript [Actor](https://docs.apify.com/platform/actors.md), you will be using the [JavaScript SDK](https://docs.apify.com/sdk/js/docs/guides/request-storage#request-queue). The request queue is represented by a [`RequestQueue`](https://docs.apify.com/sdk/js/reference/class/RequestQueue) class. You can use the class to specify whether your data is stored locally or in the Apify cloud and [enqueue new URLs](https://docs.apify.com/sdk/js/reference/class/RequestQueue#addRequests). Every Actor run is automatically linked with a default request queue, initiated upon adding the first request. This queue is primarily utilized for storing URLs to be crawled during the particular Actor run, though its use is not mandatory. For enhanced flexibility, you can establish named queues. These named queues offer the advantage of being shareable across different Actors or various Actor runs, facilitating a more interconnected and efficient process. If you are storing your data locally, you can find your request queue at the following location. ``` {APIFY_LOCAL_STORAGE_DIR}/request_queues/{QUEUE_ID}/{ID}.json ``` The default request queue's ID is *default*. Each request in the queue is stored as a separate JSON file, where `{ID}` is a request ID. To open a request queue, use the [`Actor.openRequestQueue()`](https://docs.apify.com/sdk/js/reference/class/Actor#openRequestQueue) method. ``` // Import the JavaScript SDK into your project import { Actor } from 'apify'; await Actor.init(); // ... // Open the default request queue associated with // the Actor run const queue = await Actor.openRequestQueue(); // Open the 'my-queue' request queue const queueWithName = await Actor.openRequestQueue('my-queue'); // ... await Actor.exit(); ``` Once a queue is open, you can manage it using the following methods. Check out the `RequestQueue` class's [API reference](https://docs.apify.com/sdk/js/reference/class/RequestQueue) for the full list. ``` // Import the JavaScript SDK into your project import { Actor } from 'apify'; await Actor.init(); // ... 
const queue = await Actor.openRequestQueue();

// Enqueue requests
await queue.addRequests([{ url: 'http://example.com/aaa' }]);
await queue.addRequests(['http://example.com/foo', 'http://example.com/bar'], {
    forefront: true,
});

// Get the next request from queue
const request1 = await queue.fetchNextRequest();
const request2 = await queue.fetchNextRequest();

// Get a specific request
const specificRequest = await queue.getRequest('shi6Nh3bfs3');

// Reclaim a failed request back to the queue
// and process it again
await queue.reclaimRequest(request2);

// Remove a queue
await queue.drop();

// ...
await Actor.exit();
```

Check out the [JavaScript SDK documentation](https://docs.apify.com/sdk/js/docs/guides/request-storage#request-queue) and the `RequestQueue` class's [API reference](https://docs.apify.com/sdk/js/reference/class/RequestQueue) for details on managing your request queues with the JavaScript SDK.

#### Python SDK[​](#python-sdk "Direct link to Python SDK")

For Python [Actor](https://docs.apify.com/platform/actors.md) development, the [Python SDK](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-request-queues) is essential. The request queue is represented by the [`RequestQueue`](https://docs.apify.com/sdk/python/reference/class/RequestQueue) class. Utilize this class to determine whether your data is stored locally or in the Apify cloud. For managing your data, it provides the capability to [enqueue new URLs](https://docs.apify.com/sdk/python/reference/class/RequestQueue#add_requests), facilitating seamless integration and operation within your Actor.

Every Actor run is automatically connected to a default request queue, established specifically for that run upon the addition of the first request. If you're operating your Actors and choose to utilize this queue, it typically serves to store URLs for crawling in the respective Actor run, though its use is not mandatory. To extend functionality, you have the option to create named queues, which offer the flexibility to be shared among different Actors or across multiple Actor runs.

If you are storing your data locally, you can find your request queue at the following location.

```
{APIFY_LOCAL_STORAGE_DIR}/request_queues/{QUEUE_ID}/{ID}.json
```

The default request queue's ID is *default*. Each request in the queue is stored as a separate JSON file, where `{ID}` is a request ID.

To *open a request queue*, use the [`Actor.open_request_queue()`](https://docs.apify.com/sdk/python/reference/class/Actor#open_request_queue) method.

```
from apify import Actor

async def main():
    async with Actor:
        # Open the default request queue associated with the Actor run
        queue = await Actor.open_request_queue()

        # Open the 'my-queue' request queue
        queue_with_name = await Actor.open_request_queue(name='my-queue')

        # ...
```

Once a queue is open, you can manage it using the following methods. See the `RequestQueue` class's [API reference](https://docs.apify.com/sdk/python/reference/class/RequestQueue) for the full list.
```
from apify import Actor
from apify.storages import RequestQueue

async def main():
    async with Actor:
        queue: RequestQueue = await Actor.open_request_queue()

        # Enqueue requests
        await queue.add_request(request={'url': 'http://example.com/aaa'})
        await queue.add_request(request={'url': 'http://example.com/foo'})
        await queue.add_request(request={'url': 'http://example.com/bar'}, forefront=True)

        # Get the next requests from queue
        request1 = await queue.fetch_next_request()
        request2 = await queue.fetch_next_request()

        # Get a specific request
        specific_request = await queue.get_request('shi6Nh3bfs3')

        # Reclaim a failed request back to the queue and process it again
        await queue.reclaim_request(request2)

        # Remove a queue
        await queue.drop()
```

Check out the [Python SDK documentation](https://docs.apify.com/sdk/python/docs/concepts/storages#working-with-request-queues) and the `RequestQueue` class's [API reference](https://docs.apify.com/sdk/python/reference/class/RequestQueue) for details on managing your request queues with the Python SDK.

## Features[​](#features "Direct link to Features")

Request queue is a storage type built with scraping in mind, enabling developers to write scraping logic efficiently and scalably. The Apify tooling, including [Crawlee](https://crawlee.dev/), [Apify SDK for JavaScript](https://docs.apify.com/sdk/js/), and [Apify SDK for Python](https://docs.apify.com/sdk/python/), incorporates all these features, enabling users to leverage them effortlessly without extra configuration.

In the following section, we will discuss each of the main features in depth.

### Persistence and retention[​](#persistence-and-retention "Direct link to Persistence and retention")

Request queues prioritize persistence, ensuring indefinite retention of your requests in named request queues, and retention for the data retention period of your subscription in unnamed request queues. This capability facilitates incremental crawling, where you can append new URLs to the queue and resume from where you stopped in subsequent Actor runs.

Consider the scenario of scraping an e-commerce website with thousands of products. Incremental scraping allows you to scrape only the products added since the last product discovery.

In the following code example, we demonstrate how to use the Apify SDK and Crawlee to create an incremental crawler that saves the title of each newly found page in the Apify Docs to a dataset. By running this Actor multiple times, you can incrementally crawl the source website and save only pages added since the last crawl, as reusing a single request queue ensures that only URLs not yet visited are processed.

```
// Basic example of incremental crawling with Crawlee.
import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

interface Input {
    startUrls: string[];
    persistRequestQueueName: string;
}

await Actor.init();

// Structure of input is defined in input_schema.json
const {
    startUrls = ['https://docs.apify.com/'],
    persistRequestQueueName = 'persist-request-queue',
} = (await Actor.getInput()) ?? ({} as Input);

// Open or create the request queue for the incremental scrape.
// By opening the same request queue, the crawler continues where it left off and skips already visited URLs.
const requestQueue = await Actor.openRequestQueue(persistRequestQueueName);

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue, // Pass incremental request queue to the crawler.
requestHandler: async ({ enqueueLinks, request, $, log }) => { log.info('enqueueing new URLs'); await enqueueLinks(); // Extract the title from the page. const title = $('title').text(); log.info(`New page with title ${title}`, { url: request.loadedUrl }); // Save the URL and title of the loaded page to the output dataset. await Dataset.pushData({ url: request.loadedUrl, title }); }, }); await crawler.run(startUrls); await Actor.exit(); ``` ### Batch operations[​](#batch-operations "Direct link to Batch operations") Request queues support batch operations on requests to enqueue or retrieve multiple requests in bulk, to cut down on network latency and enable easier parallel processing of requests. You can find the batch operations in the [Apify API](https://docs.apify.com/api/v2/storage-request-queues.md), as well as in the Apify API clients for [JavaScript](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient#batchAddRequests) and [Python](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient#batch_add_requests). * JavaScript * Python ``` const { ApifyClient } = require('apify-client'); const client = new ApifyClient({ token: 'MY-APIFY-TOKEN', }); const requestQueueClient = client.requestQueue('my-queue-id'); // Add multiple requests to the queue await requestQueueClient.batchAddRequests([ { url: 'http://example.com/foo', uniqueKey: 'http://example.com/foo', method: 'GET', }, { url: 'http://example.com/bar', uniqueKey: 'http://example.com/bar', method: 'GET', }, ]); // Remove multiple requests from the queue await requestQueueClient.batchDeleteRequests([ { uniqueKey: 'http://example.com/foo' }, { uniqueKey: 'http://example.com/bar' }, ]); ``` ``` from apify_client import ApifyClient apify_client = ApifyClient('MY-APIFY-TOKEN') request_queue_client = apify_client.request_queue('my-queue-id') # Add multiple requests to the queue request_queue_client.batch_add_requests([ {'url': 'http://example.com/foo', 'uniqueKey': 'http://example.com/foo', 'method': 'GET'}, {'url': 'http://example.com/bar', 'uniqueKey': 'http://example.com/bar', 'method': 'GET'}, ]) # Remove multiple requests from the queue request_queue_client.batch_delete_requests([ {'uniqueKey': 'http://example.com/foo'}, {'uniqueKey': 'http://example.com/bar'}, ]) ``` ### Distributivity[​](#distributivity "Direct link to Distributivity") The request queue includes a locking mechanism to avoid concurrent processing of one request by multiple clients (for example, Actor runs). You can lock a request so that no other clients receive it when they fetch the queue head, with an expiration period on the lock so that requests which fail processing are eventually unlocked and retried. This feature is seamlessly integrated into Crawlee, requiring minimal extra setup. By default, requests are locked for the same duration as the timeout for processing requests in the crawler ([`requestHandlerTimeoutSecs`](https://crawlee.dev/api/next/basic-crawler/interface/BasicCrawlerOptions#requestHandlerTimeoutSecs)). If the Actor processing the request fails, the lock expires, and the request is processed again eventually. For more details, refer to the [Crawlee documentation](https://crawlee.dev/docs/next/experiments/experiments-request-locking). In the following example, we demonstrate how you can use the locking mechanism to avoid concurrent processing of the same request across multiple Actor runs. info The lock mechanism works on the client level, as well as the run level, when running the Actor on the Apify platform.
This means you can unlock or prolong the lock of a locked request only if: * You are using the same client key, or * The operation is being called from the same Actor run. - Actor 1 - Actor 2 ``` import { Actor, ApifyClient } from 'apify'; await Actor.init(); const client = new ApifyClient({ token: 'MY-APIFY-TOKEN', }); // Creates a new request queue. const requestQueue = await client.requestQueues().getOrCreate('example-queue'); // Creates a client for the queue with a specific client key. const requestQueueClient = client.requestQueue(requestQueue.id, { clientKey: 'requestqueueone', }); // Adds multiple requests to the queue. await requestQueueClient.batchAddRequests([ { url: 'http://example.com/foo', uniqueKey: 'http://example.com/foo', method: 'GET', }, { url: 'http://example.com/bar', uniqueKey: 'http://example.com/bar', method: 'GET', }, { url: 'http://example.com/baz', uniqueKey: 'http://example.com/baz', method: 'GET', }, { url: 'http://example.com/qux', uniqueKey: 'http://example.com/qux', method: 'GET', }, ]); // Locks the first two requests at the head of the queue. const processingRequestsClientOne = await requestQueueClient.listAndLockHead( { limit: 2, lockSecs: 120, }, ); // Checks when the lock will expire. The locked request will have a lockExpiresAt attribute. const lockedRequest = processingRequestsClientOne.items[0]; const lockedRequestDetail = await requestQueueClient.getRequest( lockedRequest.id, ); console.log(`Request locked until ${lockedRequestDetail?.lockExpiresAt}`); // Prolongs the lock of the first request or unlocks it. await requestQueueClient.prolongRequestLock( lockedRequest.id, { lockSecs: 120 }, ); await requestQueueClient.deleteRequestLock( lockedRequest.id, ); await Actor.exit(); ``` ``` import { Actor, ApifyClient } from 'apify'; await Actor.init(); const client = new ApifyClient({ token: 'MY-APIFY-TOKEN', }); // Waits for the first Actor to lock the requests. await new Promise((resolve) => setTimeout(resolve, 5000)); // Gets the same request queue in a different Actor run and with a different client key. const requestQueue = await client.requestQueues().getOrCreate('example-queue'); const requestQueueClient = client.requestQueue(requestQueue.id, { clientKey: 'requestqueuetwo', }); // Gets all requests from the queue and finds one locked by the first Actor. const requests = await requestQueueClient.listRequests(); const requestsLockedByAnotherRun = requests.items.filter((request) => request.lockByClient === 'requestqueueone'); const requestLockedByAnotherRunDetail = await requestQueueClient.getRequest( requestsLockedByAnotherRun[0].id, ); // Other clients cannot list and lock these requests; the listAndLockHead call returns other requests from the queue. const processingRequestsClientTwo = await requestQueueClient.listAndLockHead( { limit: 10, lockSecs: 60, }, ); const wasBothRunsLockedSameRequest = !!processingRequestsClientTwo.items.find( (request) => request.id === requestLockedByAnotherRunDetail.id, ); console.log(`Was the request locked by the first run also locked by the second run? ${wasBothRunsLockedSameRequest}`); console.log(`Request locked until ${requestLockedByAnotherRunDetail?.lockExpiresAt}`); // Other clients cannot modify the lock; attempting to do so will throw an error. try { await requestQueueClient.prolongRequestLock( requestLockedByAnotherRunDetail.id, { lockSecs: 60 }, ); } catch (err) { // This will throw an error. } // Cleans up the queue.
await requestQueueClient.delete(); await Actor.exit(); ``` A detailed tutorial on how to process one request queue with multiple Actor runs can be found in [Academy tutorials](https://docs.apify.com/academy/node-js/multiple-runs-scrape). ## Sharing[​](#sharing "Direct link to Sharing") You can grant [access rights](https://docs.apify.com/platform/collaboration.md) to your request queue through the **Share** button under the **Actions** menu. For more details, check the [full list of permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md). ### Sharing request queues between runs[​](#sharing-request-queues-between-runs "Direct link to Sharing request queues between runs") You can access a request queue from any [Actor](https://docs.apify.com/platform/actors.md) or [task](https://docs.apify.com/platform/actors/running/tasks.md) run as long as you know its *name* or *ID*. To access a request queue from another run using the [Apify SDK](https://docs.apify.com/sdk.md), open it the same way you would open any other request queue. * JavaScript * Python ``` import { Actor } from 'apify'; await Actor.init(); const otherQueue = await Actor.openRequestQueue('old-queue'); // ... await Actor.exit(); ``` ``` from apify import Actor async def main(): async with Actor: other_queue = await Actor.open_request_queue(name='old-queue') # ... ``` In the [JavaScript API client](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient) as well as in the [Python API client](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient), you can access a request queue using its respective client. Once you've opened the request queue, you can use it in your crawler or add new requests just as you would with a queue from your current run. * JavaScript * Python ``` const otherQueueClient = apifyClient.requestQueue('jane-doe/old-queue'); ``` ``` other_queue_client = apify_client.request_queue('jane-doe/old-queue') ``` The same applies to the [Apify API](#apify-api) - you can use [the same endpoints](#apify-api) as you normally would. Check out the [Storage overview](https://docs.apify.com/platform/storage/usage.md#sharing-storages-between-runs) for details on sharing storages between runs. ## Limits[​](#limits "Direct link to Limits") * The maximum length of a request queue name is 63 characters. ### Rate limiting[​](#rate-limiting "Direct link to Rate limiting") When managing request queues via the [API](https://docs.apify.com/api/v2/storage-request-queues-requests.md), CRUD ([add](https://docs.apify.com/api/v2/request-queue-requests-post.md), [get](https://docs.apify.com/api/v2/request-queue-request-get.md), [update](https://docs.apify.com/api/v2/request-queue-request-put.md), [delete](https://docs.apify.com/api/v2/request-queue-request-delete.md)) operation requests are limited to *200 requests per second* per request queue. This helps protect Apify servers from being overloaded. All other request queue API [endpoints](https://docs.apify.com/api/v2/storage-request-queues.md) are limited to *30 requests per second* per request queue. Check out the [API documentation](https://docs.apify.com/api/v2.md#rate-limiting) for more information and guidance on actions to take if you exceed these rate limits. --- # Storage usage **Learn how to effectively use Apify's storage options.
Understand key aspects of data retention, rate limiting, and secure sharing.** *** ## Dataset[​](#dataset "Direct link to Dataset") [Dataset](https://docs.apify.com/platform/storage/dataset.md) storage allows you to store a series of data objects, such as results from web scraping, crawling, or data processing jobs. You can export your datasets in JSON, CSV, XML, RSS, Excel, or HTML formats. ![Dataset graphic](/assets/images/datasets-overview-f866e92e0de64d2a0475085016fcbc37.png) ## Key-value store[​](#key-value-store "Direct link to Key-value store") The [key-value store](https://docs.apify.com/platform/storage/key-value-store.md) is ideal for saving data records such as files, screenshots of web pages, and PDFs, or for persisting your Actor's state. The records are accessible under a unique name and can be written and read quickly. ![Key-value store graphic](/assets/images/key-value-overview-cc78407f157ecfa2b9ccb942e59d54b6.svg) ## Request queue[​](#request-queue "Direct link to Request queue") [Request queues](https://docs.apify.com/platform/storage/request-queue.md) allow you to dynamically maintain a queue of URLs of web pages. You can use this when recursively crawling websites: you start from initial URLs and add new links as they are found while skipping duplicates. ![Request queue graphic](/assets/images/request-queue-overview-6ebf1f808632078b7bbbd5e68696a472.svg) ## Basic usage[​](#basic-usage "Direct link to Basic usage") You can access your storage in several ways: * [Apify Console](https://console.apify.com/storage) - provides an easy-to-use interface. * [Apify API](https://docs.apify.com/api/v2/storage-key-value-stores.md) - to access your storages programmatically. * [API clients](https://docs.apify.com/api.md) - to access your storages from any Node.js/Python application. * [Apify SDKs](https://docs.apify.com/sdk.md) - when building your own JavaScript/Python Actor. ### Apify Console[​](#apify-console "Direct link to Apify Console") To access your storages via Apify Console, navigate to the [**Storage**](https://console.apify.com/storage) section in the left-side menu. From there, you can click through the tabs to view your key-value stores, datasets, and request queues, and you can click on the **API** button in the top right corner to view related API endpoints. To view a storage, click its **ID**. ![Storages in app](/assets/images/datasets-app-7f95b1edcb4e2cd28d7885c648820bf0.png) > Use the **Include unnamed storages** checkbox to either display or hide unnamed storages. By default, Apify Console displays them. You can edit your store's name by clicking on the **Actions** menu and selecting **Rename**. Additionally, you can quickly share the contents and details of your storage by selecting **Share** under the **Actions** menu and providing an email, username, or user ID. ![Storage API](/assets/images/overview-api-a844fd68dfc603eb2c8a64daa74126a4.png) These URLs link to API *endpoints*—the places where your data is stored. Endpoints that allow you to *read* stored information do not require an [authentication token](https://docs.apify.com/api/v2.md#authentication). Calls are authenticated using a hard-to-guess ID, allowing for secure sharing. However, operations such as *update* or *delete* require the authentication token. > Never share a URL containing your authentication token, to avoid compromising your account's security. If the data you want to share requires a token, first download the data, then share it as a file.
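To illustrate the difference, here is a minimal sketch using Node.js 18+ `fetch` with a hypothetical dataset ID and a token placeholder (neither is a value from this guide): the read call works with just the hard-to-guess storage ID, while the rename call must send the API token in the `Authorization` header.

```
// Minimal sketch: reading items needs only the storage ID, writes need the token.
const datasetId = 'WkzbQMuFYuamGv3YF'; // hypothetical dataset ID
const token = process.env.APIFY_TOKEN; // keep the token out of any URL you share

// Read endpoint: no authentication token required.
const items = await fetch(
    `https://api.apify.com/v2/datasets/${datasetId}/items?format=json`,
).then((res) => res.json());
console.log(`Fetched ${items.length} items`);

// Update (rename) endpoint: requires the Authorization header.
await fetch(`https://api.apify.com/v2/datasets/${datasetId}`, {
    method: 'PUT',
    headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ name: 'shared-results' }),
});
```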
### Apify API[​](#apify-api "Direct link to Apify API") The [Apify API](https://docs.apify.com/api/v2/storage-key-value-stores.md) allows you to access your storages programmatically using [HTTP requests](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) and easily share your crawling results. In most cases, when accessing your storages via API, you will need to provide a `store ID`, which you can do in the following formats: * `WkzbQMuFYuamGv3YF` - the store's alphanumerical ID if the store is unnamed. * `~store-name` - the store's name prefixed with a tilde (`~`) character if the store is named (e.g. `~ecommerce-scraping-results`). * `username~store-name` - the username and the store's name separated by a tilde (`~`) character if the store is named and belongs to a different account (e.g. `janedoe~ecommerce-scraping-results`). Note that in this case, the store's owner needs to grant you access first. For read (GET) requests, it is enough to use a store's alphanumerical ID, since the ID is hard to guess and effectively serves as an authentication key. With other request types and when using the `username~store-name` format, however, you will need to provide your secret API token in your request's [`Authorization`](https://docs.apify.com/api/v2.md#authentication) header or as a query parameter. You can find your token on the [Integrations](https://console.apify.com/account?tab=integrations) page of your Apify account. For further details and a breakdown of each storage API endpoint, refer to the [API documentation](https://docs.apify.com/api/v2/storage-datasets.md). ### Apify API Clients[​](#apify-api-clients "Direct link to Apify API Clients") The Apify API Clients allow you to access your storages from any Node.js or Python application, whether it's running on the Apify platform or externally. Visit the [API Clients](https://docs.apify.com/api.md) documentation for more information. ### Apify SDKs[​](#apify-sdks "Direct link to Apify SDKs") The Apify SDKs are libraries in JavaScript or Python that provide tools for building your own Actors. * The JavaScript SDK requires [Node.js](https://nodejs.org/en/) 16 or later. * The Python SDK requires [Python](https://www.python.org/downloads/release/python-380/) 3.8 or above. ## Rate limiting[​](#rate-limiting "Direct link to Rate limiting") All API endpoints limit their rate of requests to protect Apify servers from being overloaded. The default rate limit for storage objects is *30 requests per second*. However, there are exceptions limited to *200 requests per second* per storage object, including: * [Push items](https://docs.apify.com/api/v2/dataset-items-post.md) to a dataset. * CRUD ([add](https://docs.apify.com/api/v2/request-queue-requests-post.md), [get](https://docs.apify.com/api/v2/request-queue-request-get.md), [update](https://docs.apify.com/api/v2/request-queue-request-put.md), [delete](https://docs.apify.com/api/v2/request-queue-request-delete.md)) operations of *request queue* requests. If a client exceeds this limit, the API endpoints respond with the HTTP status code `429 Too Many Requests` and the following body: ``` { "error": { "type": "rate-limit-exceeded", "message": "You have exceeded the rate limit of ... requests per second" } } ``` Go to the [API documentation](https://docs.apify.com/api/v2.md#rate-limiting) for details and to learn what to do if you exceed the rate limit.
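If you call these endpoints over raw HTTP rather than through the Apify SDKs or API clients (which retry rate-limited requests automatically), a small exponential-backoff helper along these lines can deal with `429` responses. This is only an illustrative sketch, not part of any Apify library.

```
// Minimal backoff sketch for direct API calls: retry while the API returns 429.
async function fetchWithBackoff(url, options = {}, maxRetries = 5) {
    for (let attempt = 0; ; attempt++) {
        const response = await fetch(url, options);
        // Return the response unless it is rate-limited and retries remain.
        if (response.status !== 429 || attempt >= maxRetries) return response;
        const delayMillis = 500 * 2 ** attempt; // ~0.5s, 1s, 2s, ...
        await new Promise((resolve) => setTimeout(resolve, delayMillis));
    }
}
```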
## Data retention[​](#data-retention "Direct link to Data retention") Apify securely stores your ten most recent runs indefinitely, ensuring your records are always accessible. Unnamed datasets and runs beyond the latest ten will be automatically deleted after 7 days unless otherwise specified. Named datasets are retained indefinitely. ### Preserving your storages[​](#preserving-storages "Direct link to Preserving your storages") To ensure indefinite retention of your storages, assign them a name. This can be done via Apify Console or through our API. First, you'll need your store's ID. You can find it in the details of the run that created it. In Apify Console, head over to your run's details and select the **Dataset**, **Key-value store**, or **Request queue** tab as appropriate. Check that store's details, and you will find its ID among them. ![Finding your store's ID](/assets/images/find-store-id-0c95342b8b520433938455a67069f81e.png) Find and open your storage by clicking the ID, click on the **Actions** menu, choose **Rename**, and enter its new name in the field. Your storage will now be preserved indefinitely. To name your storage via API, get its ID from the run that generated it using the [Get run](https://docs.apify.com/api/v2/actor-run-get.md) endpoint. You can then give it a new name using the `Update [storage]` endpoint. For example, [Update dataset](https://docs.apify.com/api/v2/dataset-put.md). Our SDKs and clients each have unique naming conventions for storages. For more information, check out the documentation: * [SDKs](https://docs.apify.com/sdk.md) * [API Clients](https://docs.apify.com/api.md) ## Named and unnamed storages[​](#named-and-unnamed-storages "Direct link to Named and unnamed storages") The default storages for an Actor run are unnamed, identified only by an *ID*. This allows them to expire after 7 days (or longer on paid plans), conserving your storage space. If you want to preserve a storage, [assign it a name](#preserving-storages), and it will be retained indefinitely. > Storages' names can be up to 63 characters long. Named and unnamed storages are identical in all aspects except for their retention period. The key advantage of named storages is their ease in identifying and verifying the correct store. For example, storage names `janedoe~my-storage-1` and `janedoe~web-scrape-results` are easier to tell apart than the alphanumerical IDs `cAbcYOfuXemTPwnIB` and `CAbcsuZbp7JHzkw1B`. ## Sharing[​](#sharing "Direct link to Sharing") You can grant [access rights](https://docs.apify.com/platform/collaboration.md) to other Apify users to view or modify your storages. Check the [full list of permissions](https://docs.apify.com/platform/collaboration/list-of-permissions.md). ### Sharing storages between runs[​](#sharing-storages-between-runs "Direct link to Sharing storages between runs") Storage can be accessed from any [Actor](https://docs.apify.com/platform/actors.md) or [task](https://docs.apify.com/platform/actors/running/tasks.md) run, provided you have its *name* or *ID*. You can access and manage storages from other runs using the same methods or endpoints as with storages from your current run. [Datasets](https://docs.apify.com/platform/storage/dataset.md) and [key-value stores](https://docs.apify.com/platform/storage/key-value-store.md) support concurrent use by multiple Actors. Thus, several Actors or tasks can simultaneously write data to a single dataset or key-value store.
Similarly, multiple runs can read data from datasets and key-value stores at the same time. [Request queues](https://docs.apify.com/platform/storage/request-queue.md), on the other hand, only allow multiple runs to add new data. A request queue can only be processed by one Actor or task run at any one time. > When multiple runs try to write data to a storage simultaneously, the order of data writing cannot be controlled. Data is written as each request is processed. A similar principle applies to key-value stores and request queues: if a delete request for a record precedes a read request for the same record, the read request will fail. ## Deleting storages[​](#deleting-storages "Direct link to Deleting storages") Named storages are only removed upon your request. You can delete storages in the following ways: * [Apify Console](https://console.apify.com/storage) - using the **Actions** button in the store's detail page. * [JavaScript SDK](https://docs.apify.com/sdk/js) - using the `.drop()` method of the [Dataset](https://docs.apify.com/sdk/js/api/apify/class/Dataset#drop), [Key-value store](https://docs.apify.com/sdk/js/api/apify/class/KeyValueStore#drop), or [Request queue](https://docs.apify.com/sdk/js/api/apify/class/RequestQueue#drop) class. * [Python SDK](https://docs.apify.com/sdk/python) - using the `.drop()` method of the [Dataset](https://docs.apify.com/sdk/python/reference/class/Dataset#drop), [Key-value store](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#drop), or [Request queue](https://docs.apify.com/sdk/python/reference/class/RequestQueue#drop) class. * [JavaScript API client](https://docs.apify.com/api/client/js) - using the `.delete()` method in the [dataset](https://docs.apify.com/api/client/js/reference/class/DatasetClient), [key-value store](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient), or [request queue](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient) clients. * [Python API client](https://docs.apify.com/api/client/python) - using the `.delete()` method in the [dataset](https://docs.apify.com/api/client/python#datasetclient), [key-value store](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClient), or [request queue](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient) clients. * [API](https://docs.apify.com/api/v2/key-value-store-delete.md) - using the `Delete [store]` endpoint, where `[store]` is the type of storage you want to delete. --- # API client for JavaScript | Apify Documentation ## api - [Search the documentation](https://docs.apify.com/api/client/js/search.md) - [Apify API client for JavaScript](https://docs.apify.com/api/client/js/docs.md): apify-client is the official library to access the Apify REST API from your JavaScript/TypeScript applications. It runs both in Node.js and browser and provides useful features like automatic retries and convenience functions that improve the experience of using the Apify API. All requests and responses (including errors) are encoded in JSON format with UTF-8 encoding. - [Changelog](https://docs.apify.com/api/client/js/docs/changelog.md): It seems that the changelog is not available.
- [Code examples](https://docs.apify.com/api/client/js/docs/examples.md): Passing an input to the Actor - [apify-client](https://docs.apify.com/api/client/js/reference.md) - [ActorClient](https://docs.apify.com/api/client/js/reference/class/ActorClient.md) - [ActorCollectionClient](https://docs.apify.com/api/client/js/reference/class/ActorCollectionClient.md) - [ApifyApiError](https://docs.apify.com/api/client/js/reference/class/ApifyApiError.md): An `ApifyApiError` is thrown for successful HTTP requests that reach the API, but the API responds with an error response. Typically, those are rate limit errors and internal errors, which are automatically retried, or validation errors, which are thrown immediately, because a correction by the user is needed. - [ApifyClient](https://docs.apify.com/api/client/js/reference/class/ApifyClient.md): ApifyClient is the official library to access [Apify API](https://docs.apify.com/api/v2) from your JavaScript applications. It runs both in Node.js and browser. - [BuildClient](https://docs.apify.com/api/client/js/reference/class/BuildClient.md) - [BuildCollectionClient](https://docs.apify.com/api/client/js/reference/class/BuildCollectionClient.md) - [DatasetClient ](https://docs.apify.com/api/client/js/reference/class/DatasetClient.md) - [DatasetCollectionClient](https://docs.apify.com/api/client/js/reference/class/DatasetCollectionClient.md) - [InvalidResponseBodyError](https://docs.apify.com/api/client/js/reference/class/InvalidResponseBodyError.md): This error exists for the quite common situation, where only a partial JSON response is received and an attempt to parse the JSON throws an error. In most cases this can be resolved by retrying the request. We do that by identifying this error in HttpClient. The properties mimic AxiosError for easier integration in HttpClient error handling. 
- [KeyValueStoreClient](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreClient.md) - [KeyValueStoreCollectionClient](https://docs.apify.com/api/client/js/reference/class/KeyValueStoreCollectionClient.md) - [LogClient](https://docs.apify.com/api/client/js/reference/class/LogClient.md) - [RequestQueueClient](https://docs.apify.com/api/client/js/reference/class/RequestQueueClient.md) - [RequestQueueCollectionClient](https://docs.apify.com/api/client/js/reference/class/RequestQueueCollectionClient.md) - [RunClient](https://docs.apify.com/api/client/js/reference/class/RunClient.md) - [RunCollectionClient](https://docs.apify.com/api/client/js/reference/class/RunCollectionClient.md) - [ScheduleClient](https://docs.apify.com/api/client/js/reference/class/ScheduleClient.md) - [ScheduleCollectionClient](https://docs.apify.com/api/client/js/reference/class/ScheduleCollectionClient.md) - [StoreCollectionClient](https://docs.apify.com/api/client/js/reference/class/StoreCollectionClient.md) - [TaskClient](https://docs.apify.com/api/client/js/reference/class/TaskClient.md) - [TaskCollectionClient](https://docs.apify.com/api/client/js/reference/class/TaskCollectionClient.md) - [UserClient](https://docs.apify.com/api/client/js/reference/class/UserClient.md) - [WebhookClient](https://docs.apify.com/api/client/js/reference/class/WebhookClient.md) - [WebhookCollectionClient](https://docs.apify.com/api/client/js/reference/class/WebhookCollectionClient.md) - [WebhookDispatchClient](https://docs.apify.com/api/client/js/reference/class/WebhookDispatchClient.md) - [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/js/reference/class/WebhookDispatchCollectionClient.md) - [ActorListSortBy](https://docs.apify.com/api/client/js/reference/enum/ActorListSortBy.md) - [DownloadItemsFormat](https://docs.apify.com/api/client/js/reference/enum/DownloadItemsFormat.md) - [PlatformFeature](https://docs.apify.com/api/client/js/reference/enum/PlatformFeature.md) - [ScheduleActions](https://docs.apify.com/api/client/js/reference/enum/ScheduleActions.md) - [WebhookDispatchStatus](https://docs.apify.com/api/client/js/reference/enum/WebhookDispatchStatus.md) - [AccountAndUsageLimits](https://docs.apify.com/api/client/js/reference/interface/AccountAndUsageLimits.md) - [Actor](https://docs.apify.com/api/client/js/reference/interface/Actor.md) - [ActorBuildOptions](https://docs.apify.com/api/client/js/reference/interface/ActorBuildOptions.md) - [ActorCallOptions](https://docs.apify.com/api/client/js/reference/interface/ActorCallOptions.md) - [ActorChargeEvent](https://docs.apify.com/api/client/js/reference/interface/ActorChargeEvent.md) - [ActorCollectionCreateOptions](https://docs.apify.com/api/client/js/reference/interface/ActorCollectionCreateOptions.md) - [ActorCollectionListItem](https://docs.apify.com/api/client/js/reference/interface/ActorCollectionListItem.md) - [ActorCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/ActorCollectionListOptions.md) - [ActorDefaultRunOptions](https://docs.apify.com/api/client/js/reference/interface/ActorDefaultRunOptions.md) - [ActorDefinition](https://docs.apify.com/api/client/js/reference/interface/ActorDefinition.md) - [ActorExampleRunInput](https://docs.apify.com/api/client/js/reference/interface/ActorExampleRunInput.md) - [ActorLastRunOptions](https://docs.apify.com/api/client/js/reference/interface/ActorLastRunOptions.md) - [ActorRun](https://docs.apify.com/api/client/js/reference/interface/ActorRun.md) - 
[ActorRunListItem](https://docs.apify.com/api/client/js/reference/interface/ActorRunListItem.md) - [ActorRunMeta](https://docs.apify.com/api/client/js/reference/interface/ActorRunMeta.md) - [ActorRunOptions](https://docs.apify.com/api/client/js/reference/interface/ActorRunOptions.md) - [ActorRunStats](https://docs.apify.com/api/client/js/reference/interface/ActorRunStats.md) - [ActorRunUsage](https://docs.apify.com/api/client/js/reference/interface/ActorRunUsage.md) - [ActorStandby](https://docs.apify.com/api/client/js/reference/interface/ActorStandby.md) - [ActorStartOptions](https://docs.apify.com/api/client/js/reference/interface/ActorStartOptions.md) - [ActorStats](https://docs.apify.com/api/client/js/reference/interface/ActorStats.md) - [ActorStoreList](https://docs.apify.com/api/client/js/reference/interface/ActorStoreList.md) - [ActorTaggedBuild](https://docs.apify.com/api/client/js/reference/interface/ActorTaggedBuild.md) - [ApifyClientOptions](https://docs.apify.com/api/client/js/reference/interface/ApifyClientOptions.md) - [Build](https://docs.apify.com/api/client/js/reference/interface/Build.md) - [BuildClientGetOptions](https://docs.apify.com/api/client/js/reference/interface/BuildClientGetOptions.md) - [BuildClientWaitForFinishOptions](https://docs.apify.com/api/client/js/reference/interface/BuildClientWaitForFinishOptions.md) - [BuildCollectionClientListOptions](https://docs.apify.com/api/client/js/reference/interface/BuildCollectionClientListOptions.md) - [BuildMeta](https://docs.apify.com/api/client/js/reference/interface/BuildMeta.md) - [BuildOptions](https://docs.apify.com/api/client/js/reference/interface/BuildOptions.md) - [BuildStats](https://docs.apify.com/api/client/js/reference/interface/BuildStats.md) - [BuildUsage](https://docs.apify.com/api/client/js/reference/interface/BuildUsage.md) - [Current](https://docs.apify.com/api/client/js/reference/interface/Current.md) - [Dataset](https://docs.apify.com/api/client/js/reference/interface/Dataset.md) - [DatasetClientDownloadItemsOptions](https://docs.apify.com/api/client/js/reference/interface/DatasetClientDownloadItemsOptions.md) - [DatasetClientListItemOptions](https://docs.apify.com/api/client/js/reference/interface/DatasetClientListItemOptions.md) - [DatasetClientUpdateOptions](https://docs.apify.com/api/client/js/reference/interface/DatasetClientUpdateOptions.md) - [DatasetCollectionClientGetOrCreateOptions](https://docs.apify.com/api/client/js/reference/interface/DatasetCollectionClientGetOrCreateOptions.md) - [DatasetCollectionClientListOptions](https://docs.apify.com/api/client/js/reference/interface/DatasetCollectionClientListOptions.md) - [DatasetStatistics](https://docs.apify.com/api/client/js/reference/interface/DatasetStatistics.md) - [DatasetStats](https://docs.apify.com/api/client/js/reference/interface/DatasetStats.md) - [FieldStatistics](https://docs.apify.com/api/client/js/reference/interface/FieldStatistics.md) - [FlatPricePerMonthActorPricingInfo](https://docs.apify.com/api/client/js/reference/interface/FlatPricePerMonthActorPricingInfo.md) - [FreeActorPricingInfo](https://docs.apify.com/api/client/js/reference/interface/FreeActorPricingInfo.md) - [KeyValueClientGetRecordOptions](https://docs.apify.com/api/client/js/reference/interface/KeyValueClientGetRecordOptions.md) - [KeyValueClientListKeysOptions](https://docs.apify.com/api/client/js/reference/interface/KeyValueClientListKeysOptions.md) - 
[KeyValueClientListKeysResult](https://docs.apify.com/api/client/js/reference/interface/KeyValueClientListKeysResult.md) - [KeyValueClientUpdateOptions](https://docs.apify.com/api/client/js/reference/interface/KeyValueClientUpdateOptions.md) - [KeyValueListItem](https://docs.apify.com/api/client/js/reference/interface/KeyValueListItem.md) - [KeyValueStore](https://docs.apify.com/api/client/js/reference/interface/KeyValueStore.md) - [KeyValueStoreCollectionClientGetOrCreateOptions](https://docs.apify.com/api/client/js/reference/interface/KeyValueStoreCollectionClientGetOrCreateOptions.md) - [KeyValueStoreCollectionClientListOptions](https://docs.apify.com/api/client/js/reference/interface/KeyValueStoreCollectionClientListOptions.md) - [KeyValueStoreRecord ](https://docs.apify.com/api/client/js/reference/interface/KeyValueStoreRecord.md) - [KeyValueStoreRecordOptions](https://docs.apify.com/api/client/js/reference/interface/KeyValueStoreRecordOptions.md) - [KeyValueStoreStats](https://docs.apify.com/api/client/js/reference/interface/KeyValueStoreStats.md) - [Limits](https://docs.apify.com/api/client/js/reference/interface/Limits.md) - [MonthlyUsage](https://docs.apify.com/api/client/js/reference/interface/MonthlyUsage.md) - [MonthlyUsageCycle](https://docs.apify.com/api/client/js/reference/interface/MonthlyUsageCycle.md) - [OpenApiDefinition](https://docs.apify.com/api/client/js/reference/interface/OpenApiDefinition.md) - [PaginatedList ](https://docs.apify.com/api/client/js/reference/interface/PaginatedList.md) - [PricePerDatasetItemActorPricingInfo](https://docs.apify.com/api/client/js/reference/interface/PricePerDatasetItemActorPricingInfo.md) - [PricePerEventActorPricingInfo](https://docs.apify.com/api/client/js/reference/interface/PricePerEventActorPricingInfo.md) - [PricingInfo](https://docs.apify.com/api/client/js/reference/interface/PricingInfo.md) - [ProxyGroup](https://docs.apify.com/api/client/js/reference/interface/ProxyGroup.md) - [RequestQueue](https://docs.apify.com/api/client/js/reference/interface/RequestQueue.md) - [RequestQueueClientAddRequestOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientAddRequestOptions.md) - [RequestQueueClientAddRequestResult](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientAddRequestResult.md) - [RequestQueueClientBatchAddRequestWithRetriesOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientBatchAddRequestWithRetriesOptions.md) - [RequestQueueClientBatchRequestsOperationResult](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientBatchRequestsOperationResult.md) - [RequestQueueClientDeleteRequestLockOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientDeleteRequestLockOptions.md) - [RequestQueueClientListAndLockHeadOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientListAndLockHeadOptions.md) - [RequestQueueClientListAndLockHeadResult](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientListAndLockHeadResult.md) - [RequestQueueClientListHeadOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientListHeadOptions.md) - [RequestQueueClientListHeadResult](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientListHeadResult.md) - [RequestQueueClientListItem](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientListItem.md) - 
[RequestQueueClientListRequestsOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientListRequestsOptions.md) - [RequestQueueClientListRequestsResult](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientListRequestsResult.md) - [RequestQueueClientPaginateRequestsOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientPaginateRequestsOptions.md) - [RequestQueueClientProlongRequestLockOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientProlongRequestLockOptions.md) - [RequestQueueClientProlongRequestLockResult](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientProlongRequestLockResult.md) - [RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientRequestSchema.md) - [RequestQueueClientUnlockRequestsResult](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientUnlockRequestsResult.md) - [RequestQueueClientUpdateOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueClientUpdateOptions.md) - [RequestQueueCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueCollectionListOptions.md) - [RequestQueueStats](https://docs.apify.com/api/client/js/reference/interface/RequestQueueStats.md) - [RequestQueueUserOptions](https://docs.apify.com/api/client/js/reference/interface/RequestQueueUserOptions.md) - [RunAbortOptions](https://docs.apify.com/api/client/js/reference/interface/RunAbortOptions.md) - [RunChargeOptions](https://docs.apify.com/api/client/js/reference/interface/RunChargeOptions.md) - [RunCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/RunCollectionListOptions.md) - [RunGetOptions](https://docs.apify.com/api/client/js/reference/interface/RunGetOptions.md) - [RunMetamorphOptions](https://docs.apify.com/api/client/js/reference/interface/RunMetamorphOptions.md) - [RunResurrectOptions](https://docs.apify.com/api/client/js/reference/interface/RunResurrectOptions.md) - [RunUpdateOptions](https://docs.apify.com/api/client/js/reference/interface/RunUpdateOptions.md) - [RunWaitForFinishOptions](https://docs.apify.com/api/client/js/reference/interface/RunWaitForFinishOptions.md) - [Schedule](https://docs.apify.com/api/client/js/reference/interface/Schedule.md) - [ScheduleActionRunActor](https://docs.apify.com/api/client/js/reference/interface/ScheduleActionRunActor.md) - [ScheduleActionRunActorTask](https://docs.apify.com/api/client/js/reference/interface/ScheduleActionRunActorTask.md) - [ScheduleCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/ScheduleCollectionListOptions.md) - [ScheduledActorRunInput](https://docs.apify.com/api/client/js/reference/interface/ScheduledActorRunInput.md) - [ScheduledActorRunOptions](https://docs.apify.com/api/client/js/reference/interface/ScheduledActorRunOptions.md) - [StoreCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/StoreCollectionListOptions.md) - [Task](https://docs.apify.com/api/client/js/reference/interface/Task.md) - [TaskCallOptions](https://docs.apify.com/api/client/js/reference/interface/TaskCallOptions.md) - [TaskCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/TaskCollectionListOptions.md) - [TaskCreateData](https://docs.apify.com/api/client/js/reference/interface/TaskCreateData.md) - 
[TaskLastRunOptions](https://docs.apify.com/api/client/js/reference/interface/TaskLastRunOptions.md) - [TaskOptions](https://docs.apify.com/api/client/js/reference/interface/TaskOptions.md) - [TaskStats](https://docs.apify.com/api/client/js/reference/interface/TaskStats.md) - [UsageCycle](https://docs.apify.com/api/client/js/reference/interface/UsageCycle.md) - [User](https://docs.apify.com/api/client/js/reference/interface/User.md) - [UserPlan](https://docs.apify.com/api/client/js/reference/interface/UserPlan.md) - [UserProxy](https://docs.apify.com/api/client/js/reference/interface/UserProxy.md) - [Webhook](https://docs.apify.com/api/client/js/reference/interface/Webhook.md) - [WebhookAnyRunOfActorCondition](https://docs.apify.com/api/client/js/reference/interface/WebhookAnyRunOfActorCondition.md) - [WebhookAnyRunOfActorTaskCondition](https://docs.apify.com/api/client/js/reference/interface/WebhookAnyRunOfActorTaskCondition.md) - [WebhookCertainRunCondition](https://docs.apify.com/api/client/js/reference/interface/WebhookCertainRunCondition.md) - [WebhookCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/WebhookCollectionListOptions.md) - [WebhookDispatch](https://docs.apify.com/api/client/js/reference/interface/WebhookDispatch.md) - [WebhookDispatchCall](https://docs.apify.com/api/client/js/reference/interface/WebhookDispatchCall.md) - [WebhookDispatchCollectionListOptions](https://docs.apify.com/api/client/js/reference/interface/WebhookDispatchCollectionListOptions.md) - [WebhookIdempotencyKey](https://docs.apify.com/api/client/js/reference/interface/WebhookIdempotencyKey.md) - [WebhookStats](https://docs.apify.com/api/client/js/reference/interface/WebhookStats.md) - [Apify API client for JavaScript](https://docs.apify.com/api/client/js/index.md) --- # Full Documentation Content
--- # Apify API client for JavaScript `apify-client` is the official library to access the [Apify REST API](https://docs.apify.com/api/v2) from your JavaScript/TypeScript applications. It runs both in Node.js and browser and provides useful features like automatic retries and convenience functions that improve the experience of using the Apify API. All requests and responses (including errors) are encoded in JSON format with UTF-8 encoding. ## Pre-requisites[​](#pre-requisites "Direct link to heading") `apify-client` requires Node.js version 16 or higher. Node.js is available for download on the [official website](https://nodejs.org/). Check your current Node.js version by running: ``` node -v ``` ## Installation[​](#installation "Direct link to heading") You can install the client via [NPM](https://www.npmjs.com/) or use any other package manager of your choice. * NPM * Yarn * PNPM * Bun ``` npm i apify-client ``` ``` yarn add apify-client ``` ``` pnpm add apify-client ``` ``` bun add apify-client ``` ## Authentication and Initialization[​](#authentication-and-initialization "Direct link to heading") To use the client, you need an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token under the [Integrations](https://console.apify.com/account/integrations) tab in Apify Console. Copy the token and initialize the client by providing the token (`MY-APIFY-TOKEN`) as a parameter to the `ApifyClient` constructor. ``` // import Apify client import { ApifyClient } from 'apify-client'; // Client initialization with the API token const client = new ApifyClient({ token: 'MY-APIFY-TOKEN', }); ``` Secure access The API token is used to authorize your requests to the Apify API. You can be charged for the usage of the underlying services, so do not share your API token with untrusted parties or expose it on the client side of your applications. ## Quick start[​](#quick-start "Direct link to heading") One of the most common use cases is starting [Actors](https://docs.apify.com/platform/actors) (serverless programs running in the [Apify cloud](https://docs.apify.com/platform)) and getting results from their [datasets](https://docs.apify.com/platform/storage/dataset) (storage) after they finish the job (usually scraping, automation processes, or data processing).
``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // Starts an Actor and waits for it to finish const { defaultDatasetId } = await client.actor('username/actor-name').call(); // Lists items from the Actor's dataset const { items } = await client.dataset(defaultDatasetId).listItems(); ``` ### Running Actors[​](#running-actors "Direct link to heading") To start an Actor, you can use the [ActorClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorClient.md) (`client.actor()`) and pass the Actor's ID (e.g. `john-doe/my-cool-actor`) to define which Actor you want to run. The Actor's ID is a combination of the Actor owner's username and the Actor's name. You can run both your own Actors and [Actors from Apify Store](https://docs.apify.com/platform/actors/running/actors-in-store). #### Passing input to the Actor[​](#passing-input-to-the-actor "Direct link to heading") To define the Actor's input, you can pass an object to the [`call()`](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorClient.md#call) method. The input object can be any JSON object that the Actor expects (respecting the Actor's [input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema)). The input object is used to pass configuration to the Actor, such as URLs to scrape, search terms, or any other data. ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // Runs an Actor with an input and waits for it to finish. const { defaultDatasetId } = await client.actor('username/actor-name').call({ some: 'input', }); ``` ### Getting results from the dataset[​](#getting-results-from-the-dataset "Direct link to heading") To get the results from the dataset, you can use the [DatasetClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetClient.md) (`client.dataset()`) and the [`listItems()`](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetClient.md#listItems) method. You need to pass the dataset ID to define which dataset you want to access. You can get the dataset ID from the Actor's run object (represented by `defaultDatasetId`). ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // Lists items from the Actor's dataset. const { items } = await client.dataset('dataset-id').listItems(); ``` Dataset access Running an Actor might take time, depending on the Actor's complexity and the amount of data it processes. If you only want to get data and have an immediate response, you should access the existing dataset of the finished [Actor run](https://docs.apify.com/platform/actors/running/runs-and-builds#runs). ## Usage concepts[​](#usage-concepts "Direct link to heading") The `ApifyClient` interface follows a generic pattern that applies to all of its components. By calling individual methods of `ApifyClient`, specific clients that target individual API resources are created.
There are two types of those clients: * [`actorClient`](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorClient.md): a client for the management of a single resource * [`actorCollectionClient`](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorCollectionClient.md): a client for the collection of resources ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // Collection clients do not require a parameter. const actorCollectionClient = client.actors(); // Creates an Actor with the name: my-actor-name. const myActor = await actorCollectionClient.create({ name: 'my-actor-name' }); // Lists all Actors you have used (both your own and those from Apify Store). const { items } = await actorCollectionClient.list(); ``` Resource identification The resource ID can be either the `id` of the resource or the `username/resource-name` combination. ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // Resource clients accept an ID of the resource. const actorClient = client.actor('username/actor-name'); // Fetches the john-doe/my-actor object from the API. const myActor = await actorClient.get(); // Starts the run of john-doe/my-actor and returns the Run object. const myActorRun = await actorClient.start(); ``` ### Nested clients[​](#nested-clients "Direct link to heading") Sometimes clients return other clients. That's to simplify working with nested collections, such as runs of a given Actor. ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); const actorClient = client.actor('username/actor-name'); const runsClient = actorClient.runs(); // Lists the last 10 runs of your Actor. const { items } = await runsClient.list({ limit: 10, desc: true, }); // Select the last run of your Actor that finished // with a SUCCEEDED status. const lastSucceededRunClient = actorClient.lastRun({ status: 'SUCCEEDED' }); // Fetches items from the last succeeded run's dataset. const { items: lastRunItems } = await lastSucceededRunClient.dataset().listItems(); ``` Quick access to the `dataset` and other storages directly from the run client is available via the [`lastRun()`](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorClient.md#lastRun) method. ## Features[​](#features "Direct link to heading") Based on the endpoint, the client automatically extracts the relevant data and returns it in the expected format. Date strings are automatically converted to `Date` objects. For exceptions, the client throws an [`ApifyApiError`](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyApiError.md), which wraps the plain JSON errors returned by the API and enriches them with additional context for easier debugging. ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); try { const { items } = await client.dataset('non-existing-dataset-id').listItems(); } catch (error) { // The error is an instance of ApifyApiError const { message, type, statusCode, clientMethod, path } = error; // Log the error for easier debugging console.log({ message, statusCode, clientMethod, type }); } ``` ### Retries with exponential backoff[​](#retries-with-exponential-backoff "Direct link to heading") Network communication sometimes fails. That's a given.
The client will automatically retry requests that failed due to a network error, an internal error of the Apify API (HTTP 500+), or a rate limit error (HTTP 429). By default, it will retry up to 8 times. The first retry will be attempted after ~500ms, the second after ~1000ms, and so on. You can configure those parameters using the `maxRetries` and `minDelayBetweenRetriesMillis` options of the `ApifyClient` constructor. ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN', maxRetries: 8, minDelayBetweenRetriesMillis: 500, // 0.5s timeoutSecs: 360, // 6 mins }); ``` ### Convenience functions and options[​](#convenience-functions-and-options "Direct link to heading") Some actions can't be performed by the API itself, such as indefinite waiting for an Actor run to finish (because of network timeouts). The client provides convenient `call()` and `waitForFinish()` functions that do that. If the wait limit is reached, the returned promise is resolved to a run object that will have status `READY` or `RUNNING`, and it will not contain the Actor run output. [Key-value store](https://docs.apify.com/platform/storage/key-value-store) records can be retrieved as objects, buffers, or streams via the respective options, and dataset items can be fetched as individual objects or as serialized data. ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // Starts an Actor and waits for it to finish. const finishedActorRun = await client.actor('username/actor-name').call(); // Starts an Actor and waits a maximum of 60s for it to finish const { status } = await client.actor('username/actor-name').start({ waitForFinish: 60, // 1 minute }); ``` ### Pagination[​](#pagination "Direct link to heading") Most methods named `list` or `listSomething` return a `Promise` resolving to a [`PaginatedList`](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md). There are some exceptions, though, such as `listKeys` or `listHead`, which paginate differently. The results you're looking for are always stored under `items`, and you can use the `limit` property to get only a subset of results. Other properties are also available, depending on the method. ``` import { ApifyClient } from 'apify-client'; const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // Resource clients accept an ID of the resource.
const datasetClient = client.dataset('dataset-id'); // Number of items per page const limit = 1000; // Initial offset let offset = 0; // Array to store all items let allItems = []; while (true) { const { items, total } = await datasetClient.listItems({ limit, offset }); console.log(`Fetched ${items.length} items`); // Merge new items with other already loaded items allItems.push(...items); // If there are no more items to fetch, exit the loading if (offset + limit >= total) { break; } offset += limit; } console.log(`Overall fetched ${allItems.length} items`); ``` --- # Changelog ### [2.14.0](https://github.com/apify/apify-client-js/releases/tag/v2.14.0)[​](#2140 "Direct link to heading") ##### [2.14.0](https://github.com/apify/apify-client-js/releases/tag/v2.14.0) (2025-08-11)[​](#2140-2025-08-11 "Direct link to heading") ###### 🚀 Features[​](#-features "Direct link to heading") * Add keyValueStore.getRecordPublicUrl ([#725](https://github.com/apify/apify-client-js/pull/725)) ([d84a03a](https://github.com/apify/apify-client-js/commit/d84a03afe6fd49e38d4ca9a6821681e852c73a2a)) by [@danpoletaev](https://github.com/danpoletaev) ### [2.13.0](https://github.com/apify/apify-client-js/releases/tag/v2.13.0)[​](#2130 "Direct link to heading") ##### [2.13.0](https://github.com/apify/apify-client-js/releases/tag/v2.13.0) (2025-08-06)[​](#2130-2025-08-06 "Direct link to heading") ###### 🚀 Features[​](#-features-1 "Direct link to heading") * Add new methods Dataset.createItemsPublicUrl & KeyValueStore.createKeysPublicUrl ([#720](https://github.com/apify/apify-client-js/pull/720)) ([62554e4](https://github.com/apify/apify-client-js/commit/62554e48a8bf6bf1853f356ac84f046fed5945c1)) by [@danpoletaev](https://github.com/danpoletaev) ###### 🐛 Bug Fixes[​](#-bug-fixes "Direct link to heading") * Add `eventData` to `WebhookDispatch` type ([#714](https://github.com/apify/apify-client-js/pull/714)) ([351f11f](https://github.com/apify/apify-client-js/commit/351f11f268a54532c7003ab099bc0d7d8d9c9ad7)) by [@valekjo](https://github.com/valekjo) * KV store createKeysPublicUrl wrong URL ([#724](https://github.com/apify/apify-client-js/pull/724)) ([a48ec58](https://github.com/apify/apify-client-js/commit/a48ec58e16a36cc8aa188524e4a738c40f5b74e9)) by [@danpoletaev](https://github.com/danpoletaev) ### [2.12.6](https://github.com/apify/apify-client-js/releases/tag/v2.12.6)[​](#2126 "Direct link to heading") ##### [2.12.6](https://github.com/apify/apify-client-js/releases/tag/v2.12.6) (2025-06-30)[​](#2126-2025-06-30 "Direct link to heading") ###### 🚀 Features[​](#-features-2 "Direct link to heading") * Allow sorting of Actors collection ([#708](https://github.com/apify/apify-client-js/pull/708)) ([562a193](https://github.com/apify/apify-client-js/commit/562a193b90ce4f2b05bf166da8fe2dddaa87eb6b)) by [@protoss70](https://github.com/protoss70) ###### 🐛 Bug Fixes[​](#-bug-fixes-1 "Direct link to heading") * Use appropriate timeouts ([#704](https://github.com/apify/apify-client-js/pull/704)) ([b896bf2](https://github.com/apify/apify-client-js/commit/b896bf2e653e0766ef297f29a35304c1a5f27598)) by [@janbuchar](https://github.com/janbuchar) * Rename option for new sortBy parameter ([#711](https://github.com/apify/apify-client-js/pull/711)) ([f45dd03](https://github.com/apify/apify-client-js/commit/f45dd037c581a6c0e27fd8c036033b99cec1ba89)) by [@protoss70](https://github.com/protoss70) ### [2.12.5](https://github.com/apify/apify-client-js/releases/tag/v2.12.5)[​](#2125 "Direct link to heading") ##### 
[2.12.5](https://github.com/apify/apify-client-js/releases/tag/v2.12.5) (2025-05-28)[​](#2125-2025-05-28 "Direct link to heading") ###### 🚀 Features[​](#-features-3 "Direct link to heading") * List kv store keys by collection of prefix ([#688](https://github.com/apify/apify-client-js/pull/688)) ([be25137](https://github.com/apify/apify-client-js/commit/be25137575435547aaf2c3849fc772daf0537450)) by [@MFori](https://github.com/MFori) * Add unlockRequests endpoint to RequestQueue client ([#700](https://github.com/apify/apify-client-js/pull/700)) ([7c52c64](https://github.com/apify/apify-client-js/commit/7c52c645e2eb66ad97c8daa9791b080bfc747288)) by [@drobnikj](https://github.com/drobnikj) ###### 🐛 Bug Fixes[​](#-bug-fixes-2 "Direct link to heading") * Add missing 'effectivePlatformFeatures', 'createdAt', 'isPaying' to User interface ([#691](https://github.com/apify/apify-client-js/pull/691)) ([e138093](https://github.com/apify/apify-client-js/commit/e1380933476e5336469e5da083d2017147518f88)) by [@metalwarrior665](https://github.com/metalwarrior665) * Move prettier into `devDependencies` ([#695](https://github.com/apify/apify-client-js/pull/695)) ([1ba903a](https://github.com/apify/apify-client-js/commit/1ba903a1bfa7a95a8c54ef53951db502dfa4b276)) by [@hudson-worden](https://github.com/hudson-worden) ### [2.12.4](https://github.com/apify/apify-client-js/releases/tag/v2.12.4)[​](#2124 "Direct link to heading") ##### [2.12.4](https://github.com/apify/apify-client-js/releases/tag/v2.12.4) (2025-05-13)[​](#2124-2025-05-13 "Direct link to heading") ###### 🚀 Features[​](#-features-4 "Direct link to heading") * Allow overriding timeout of `KVS.setRecord` calls ([#692](https://github.com/apify/apify-client-js/pull/692)) ([105bd68](https://github.com/apify/apify-client-js/commit/105bd6888117a6c64b21a725c536d4992dff099c)) by [@B4nan](https://github.com/B4nan) ###### 🐛 Bug Fixes[​](#-bug-fixes-3 "Direct link to heading") * Fix `RunCollectionListOptions` status type ([#681](https://github.com/apify/apify-client-js/pull/681)) ([8fbcf82](https://github.com/apify/apify-client-js/commit/8fbcf82bfaca57d087719cf079fc850c6d31daa5)) by [@MatousMarik](https://github.com/MatousMarik) * **actor:** Add missing 'pricingInfos' field to Actor object ([#683](https://github.com/apify/apify-client-js/pull/683)) ([4bd4853](https://github.com/apify/apify-client-js/commit/4bd485369ac42d0b72597638c0316a6ca60f9847)) by [@metalwarrior665](https://github.com/metalwarrior665) ### [2.12.3](https://github.com/apify/apify-client-js/releases/tag/v2.12.3)[​](#2123 "Direct link to heading") ##### [2.12.3](https://github.com/apify/apify-client-js/releases/tag/v2.12.3) (2025-04-24)[​](#2123-2025-04-24 "Direct link to heading") ###### 🐛 Bug Fixes[​](#-bug-fixes-4 "Direct link to heading") * DefaultBuild() returns BuildClient ([#677](https://github.com/apify/apify-client-js/pull/677)) ([8ce72a4](https://github.com/apify/apify-client-js/commit/8ce72a4c90aac421281d14ad0ff25fdecba1d094)) by [@danpoletaev](https://github.com/danpoletaev) ### [2.12.2](https://github.com/apify/apify-client-js/releases/tag/v2.12.2)[​](#2122 "Direct link to heading") ##### [2.12.2](https://github.com/apify/apify-client-js/releases/tag/v2.12.2) (2025-04-14)[​](#2122-2025-04-14 "Direct link to heading") ###### 🚀 Features[​](#-features-5 "Direct link to heading") * Add support for general resource access ([#669](https://github.com/apify/apify-client-js/pull/669)) ([7deba52](https://github.com/apify/apify-client-js/commit/7deba52a5ff96c990254687d6b965fc1a5bf3467)) by 
[@tobice](https://github.com/tobice) * Add defaultBuild method ([#668](https://github.com/apify/apify-client-js/pull/668)) ([c494b3b](https://github.com/apify/apify-client-js/commit/c494b3b8b664a88620e9f41c902acba533d636cf)) by [@danpoletaev](https://github.com/danpoletaev) ### [2.12.1](https://github.com/apify/apify-client-js/releases/tag/v2.12.1)[​](#2121 "Direct link to heading") ##### [2.12.1](https://github.com/apify/apify-client-js/releases/tag/v2.12.1) (2025-03-11)[​](#2121-2025-03-11 "Direct link to heading") ###### 🚀 Features[​](#-features-6 "Direct link to heading") * Add maxItems and maxTotalChargeUsd to resurrect ([#652](https://github.com/apify/apify-client-js/pull/652)) ([5fb9c9a](https://github.com/apify/apify-client-js/commit/5fb9c9a35d6ccb7313c5cbbd7d09b19a64d70d8e)) by [@novotnyj](https://github.com/novotnyj) ### [2.11.2](https://github.com/apify/apify-client-js/releases/tag/v2.11.2)[​](#2112 "Direct link to heading") ##### [2.11.2](https://github.com/apify/apify-client-js/releases/tag/v2.11.2) (2025-02-03)[​](#2112-2025-02-03 "Direct link to heading") ###### 🚀 Features[​](#-features-7 "Direct link to heading") * Add dataset.statistics ([#621](https://github.com/apify/apify-client-js/pull/621)) ([6aeb2b7](https://github.com/apify/apify-client-js/commit/6aeb2b7fae041468d125a0c8bbb00804e290143a)) by [@MFori](https://github.com/MFori) * Added getOpenApiSpecification() to BuildClient ([#626](https://github.com/apify/apify-client-js/pull/626)) ([6248b28](https://github.com/apify/apify-client-js/commit/6248b2844796f93e22404ddea85ee77c1a5b7d50)) by [@danpoletaev](https://github.com/danpoletaev) ### [2.11.1](https://github.com/apify/apify-client-js/releases/tag/v2.11.1)[​](#2111 "Direct link to heading") ##### [2.11.1](https://github.com/apify/apify-client-js/releases/tag/v2.11.1) (2025-01-10)[​](#2111-2025-01-10 "Direct link to heading") ###### 🐛 Bug Fixes[​](#-bug-fixes-5 "Direct link to heading") * Change type `Build.actorDefinitions` to `Build.actorDefinition` ([#624](https://github.com/apify/apify-client-js/pull/624)) ([611f313](https://github.com/apify/apify-client-js/commit/611f31365727e70f58d899009ff5a05c6b888253)) by [@jirispilka](https://github.com/jirispilka) * Add ActorRunPricingInfo type ([#623](https://github.com/apify/apify-client-js/pull/623)) ([8880295](https://github.com/apify/apify-client-js/commit/8880295f13c1664ab6ae0b8b3f171025317ea011)) by [@janbuchar](https://github.com/janbuchar) ### [2.11.0](https://github.com/apify/apify-client-js/releases/tag/v2.11.0)[​](#2110 "Direct link to heading") ##### [2.11.0](https://github.com/apify/apify-client-js/releases/tag/v2.11.0) (2024-12-16)[​](#2110-2024-12-16 "Direct link to heading") ###### 🚀 Features[​](#-features-8 "Direct link to heading") * **actor-build:** Add actorDefinition type for actor build detail, deprecate inputSchema and readme. 
([#611](https://github.com/apify/apify-client-js/pull/611)) ([123c2b8](https://github.com/apify/apify-client-js/commit/123c2b81c945a0ca6922221598aa73c42cc298d6)) by [@drobnikj](https://github.com/drobnikj) * Add `charge` method to the run client for "pay per event" ([#613](https://github.com/apify/apify-client-js/pull/613)) ([3d9c64d](https://github.com/apify/apify-client-js/commit/3d9c64d5442b4f8f27c2b19dd98dd3b758944287)) by [@Jkuzz](https://github.com/Jkuzz) * **request-queue:** Add queueHasLockedRequests and clientKey into RequestQueueClientListAndLockHeadResult ([#617](https://github.com/apify/apify-client-js/pull/617)) ([f58ce98](https://github.com/apify/apify-client-js/commit/f58ce989e431de54eb673e561e407a7066ea2b64)) by [@drobnikj](https://github.com/drobnikj) ###### 🐛 Bug Fixes[​](#-bug-fixes-6 "Direct link to heading") * **actor:** Correctly set type for ActorTaggedBuilds ([#612](https://github.com/apify/apify-client-js/pull/612)) ([3bda7ee](https://github.com/apify/apify-client-js/commit/3bda7ee741caf2ccfea249a42ed7512cda36bf0b)) by [@metalwarrior665](https://github.com/metalwarrior665) ### [2.10.0](https://github.com/apify/apify-client-js/releases/tag/v2.10.0)[​](#2100 "Direct link to heading") ##### [2.10.0](https://github.com/apify/apify-client-js/releases/tag/v2.10.0) (2024-11-01)[​](#2100-2024-11-01 "Direct link to heading") ###### 🚀 Features[​](#-features-9 "Direct link to heading") * Add user.updateLimits ([#595](https://github.com/apify/apify-client-js/pull/595)) ([bf97c0f](https://github.com/apify/apify-client-js/commit/bf97c0f5bf8d0cbd8decb60382f0605243b00dd5)) by [@MFori](https://github.com/MFori) * Allow appending custom parts to the user agent ([#602](https://github.com/apify/apify-client-js/pull/602)) ([d07452b](https://github.com/apify/apify-client-js/commit/d07452b7bff83d16b48bf3cfba5b88aa564ffe2b)) by [@B4nan](https://github.com/B4nan) ###### 🐛 Bug Fixes[​](#-bug-fixes-7 "Direct link to heading") * Allow `null` when updating dataset/kvs/rq `name` ([#604](https://github.com/apify/apify-client-js/pull/604)) ([0034c2e](https://github.com/apify/apify-client-js/commit/0034c2ee63d6d1c6856c4e7786da43d86a3d63ce)) by [@B4nan](https://github.com/B4nan) ### [v2.9.7](https://github.com/apify/apify-client-js/releases/tag/v2.9.7)[​](#v297 "Direct link to heading") ##### What's Changed[​](#whats-changed "Direct link to heading") * feat: Rename maxCostPerRunUsd to maxTotalChargeUsd by [@novotnyj](https://github.com/novotnyj) in [#592](https://github.com/apify/apify-client-js/pull/592) **Full Changelog**: ### [v2.9.6](https://github.com/apify/apify-client-js/releases/tag/v2.9.6)[​](#v296 "Direct link to heading") ##### What's Changed[​](#whats-changed-1 "Direct link to heading") * fix: Rename maxCostPerRun by [@novotnyj](https://github.com/novotnyj) in [#589](https://github.com/apify/apify-client-js/pull/589) **Full Changelog**: ### [v2.9.5](https://github.com/apify/apify-client-js/releases/tag/v2.9.5)[​](#v295 "Direct link to heading") ##### What's Changed[​](#whats-changed-2 "Direct link to heading") * fix: add `isDeprecated` to actor update type by [@Jkuzz](https://github.com/Jkuzz) in [#566](https://github.com/apify/apify-client-js/pull/566) * feat: add Actor Standby types by [@jirimoravcik](https://github.com/jirimoravcik) in [#569](https://github.com/apify/apify-client-js/pull/569) * feat: allow `unwind` param to `DatasetClient.listItems()` to be an array by [@fnesveda](https://github.com/fnesveda) in [#576](https://github.com/apify/apify-client-js/pull/576) * feat: add 
maxCostPerRun param by [@stetizu1](https://github.com/stetizu1) in [#578](https://github.com/apify/apify-client-js/pull/578) ##### New Contributors[​](#new-contributors "Direct link to heading") * [@stetizu1](https://github.com/stetizu1) made their first contribution in [#578](https://github.com/apify/apify-client-js/pull/578) **Full Changelog**: ### [v2.9.4](https://github.com/apify/apify-client-js/releases/tag/v2.9.4)[​](#v294 "Direct link to heading") ##### What's Changed[​](#whats-changed-3 "Direct link to heading") * fix: add missing `isApifyIntegration` field to `Webhook` type by [@omikader](https://github.com/omikader) in [#523](https://github.com/apify/apify-client-js/pull/523) * feat: add notifications field to Schedule by [@m-murasovs](https://github.com/m-murasovs) in [#545](https://github.com/apify/apify-client-js/pull/545) * feat: added data property to API error object by [@gippy](https://github.com/gippy) in [#559](https://github.com/apify/apify-client-js/pull/559) **Full Changelog**: ### [v2.9.3](https://github.com/apify/apify-client-js/releases/tag/v2.9.3)[​](#v293 "Direct link to heading") ##### What's Changed[​](#whats-changed-4 "Direct link to heading") * chore: remove warning when parseDateFields reaches depth limit by [@tobice](https://github.com/tobice) in [#521](https://github.com/apify/apify-client-js/pull/521) **Full Changelog**: ### [v2.9.2](https://github.com/apify/apify-client-js/releases/tag/v2.9.2)[​](#v292 "Direct link to heading") ##### What's Changed[​](#whats-changed-5 "Direct link to heading") * feat: add monthlyUsage() and limits() endpoints to UserClients by [@tobice](https://github.com/tobice) in [#517](https://github.com/apify/apify-client-js/pull/517) * feat: parse monthlyUsage.dailyServiceUsages\[].date as Date by [@tobice](https://github.com/tobice) in [#519](https://github.com/apify/apify-client-js/pull/519) **Full Changelog**: ### [v2.9.1](https://github.com/apify/apify-client-js/releases/tag/v2.9.1)[​](#v291 "Direct link to heading") ##### What's Changed[​](#whats-changed-6 "Direct link to heading") * fix: ensure axios headers are instance of AxiosHeaders via interceptor by [@B4nan](https://github.com/B4nan) in [#515](https://github.com/apify/apify-client-js/pull/515) **Full Changelog**: ### [v2.9.0](https://github.com/apify/apify-client-js/releases/tag/v2.9.0)[​](#v290 "Direct link to heading") ##### What's Changed[​](#whats-changed-7 "Direct link to heading") * fix: publish browser bundle by [@B4nan](https://github.com/B4nan) in [#506](https://github.com/apify/apify-client-js/pull/506) * fix: update axios to v1.6 by [@B4nan](https://github.com/B4nan) in [#505](https://github.com/apify/apify-client-js/pull/505) * feat: add `KeyValueStore.recordExists()` method by [@barjin](https://github.com/barjin) in [#510](https://github.com/apify/apify-client-js/pull/510) * feat: add `log()` method to BuildClient by [@tobice](https://github.com/tobice) in [#509](https://github.com/apify/apify-client-js/pull/509) * feat: add `runs()` and `builds()` top level endpoints by [@foxt451](https://github.com/foxt451) in [#468](https://github.com/apify/apify-client-js/pull/468) ##### New Contributors[​](#new-contributors-1 "Direct link to heading") * [@tobice](https://github.com/tobice) made their first contribution in [#509](https://github.com/apify/apify-client-js/pull/509) **Full Changelog**: ### [v2.8.6](https://github.com/apify/apify-client-js/releases/tag/v2.8.6)[​](#v286 "Direct link to heading") ##### What's Changed[​](#whats-changed-8 "Direct link to 
heading") * fix: replace ReadableStream with Readable by [@foxt451](https://github.com/foxt451) in [#463](https://github.com/apify/apify-client-js/pull/463) * fix: add missing properties to `ActorCollectionCreateOptions` type by [@jirimoravcik](https://github.com/jirimoravcik) in [#486](https://github.com/apify/apify-client-js/pull/486) * feat(request-queue): Limit payload size for batchAddRequests() by [@drobnikj](https://github.com/drobnikj) in [#489](https://github.com/apify/apify-client-js/pull/489) * docs: add code owner for documentation by [@TC-MO](https://github.com/TC-MO) in [#488](https://github.com/apify/apify-client-js/pull/488) ##### New Contributors[​](#new-contributors-2 "Direct link to heading") * [@foxt451](https://github.com/foxt451) made their first contribution in [#463](https://github.com/apify/apify-client-js/pull/463) * [@TC-MO](https://github.com/TC-MO) made their first contribution in [#488](https://github.com/apify/apify-client-js/pull/488) **Full Changelog**: ### [v2.8.4](https://github.com/apify/apify-client-js/releases/tag/v2.8.4)[​](#v284 "Direct link to heading") ##### What's Changed[​](#whats-changed-9 "Direct link to heading") * fix(schedule): expose other fields when id optional by [@omikader](https://github.com/omikader) in [#451](https://github.com/apify/apify-client-js/pull/451) ##### New Contributors[​](#new-contributors-3 "Direct link to heading") * [@omikader](https://github.com/omikader) made their first contribution in [#451](https://github.com/apify/apify-client-js/pull/451) **Full Changelog**: ### [v.2.8.2](https://github.com/apify/apify-client-js/releases/tag/v2.8.2)[​](#v282 "Direct link to heading") ##### What's Changed[​](#whats-changed-10 "Direct link to heading") * ci: test on node 20 + improve tests workflow by [@B4nan](https://github.com/B4nan) in [#430](https://github.com/apify/apify-client-js/pull/430) * feat: Add how to install javascript Apify client by [@webrdaniel](https://github.com/webrdaniel) in [#440](https://github.com/apify/apify-client-js/pull/440) * fix: ScheduleUpdateData type by [@magne4000](https://github.com/magne4000) in [#276](https://github.com/apify/apify-client-js/pull/276) ##### New Contributors[​](#new-contributors-4 "Direct link to heading") * [@webrdaniel](https://github.com/webrdaniel) made their first contribution in [#440](https://github.com/apify/apify-client-js/pull/440) * [@magne4000](https://github.com/magne4000) made their first contribution in [#276](https://github.com/apify/apify-client-js/pull/276) **Full Changelog**: ### [v2.8.1](https://github.com/apify/apify-client-js/releases/tag/v2.8.1)[​](#v281 "Direct link to heading") ##### What's Changed[​](#whats-changed-11 "Direct link to heading") * fix: don't parse non-date strings by [@barjin](https://github.com/barjin) in [#412](https://github.com/apify/apify-client-js/pull/412) * chore: Removed references to issuesEnabled by [@Jkuzz](https://github.com/Jkuzz) in [#416](https://github.com/apify/apify-client-js/pull/416) * feat: add new webhook fields by [@m-murasovs](https://github.com/m-murasovs) in [#426](https://github.com/apify/apify-client-js/pull/426) * feat: Add delete to runs and builds by [@Jkuzz](https://github.com/Jkuzz) in [#428](https://github.com/apify/apify-client-js/pull/428) ##### New Contributors[​](#new-contributors-5 "Direct link to heading") * [@Jkuzz](https://github.com/Jkuzz) made their first contribution in [#416](https://github.com/apify/apify-client-js/pull/416) **Full Changelog**: ### 
[v2.8.0](https://github.com/apify/apify-client-js/releases/tag/v2.8.0)[​](#v280 "Direct link to heading") ##### What's Changed[​](#whats-changed-12 "Direct link to heading") * feat: Add Actor reboot method by [@jirimoravcik](https://github.com/jirimoravcik) in [#408](https://github.com/apify/apify-client-js/pull/408) **Full Changelog**: ### [v2.7.2](https://github.com/apify/apify-client-js/releases/tag/v2.7.2)[​](#v272 "Direct link to heading") ##### What's Changed[​](#whats-changed-13 "Direct link to heading") * feat: Rename APIFY\_ACTOR\_MAX\_ITEMS to ACTOR\_MAX\_PAID\_DATASET\_ITEMS by [@novotnyj](https://github.com/novotnyj) in [#353](https://github.com/apify/apify-client-js/pull/353) * feat(runs, builds): Add usage usd into Actor run and build types by [@drobnikj](https://github.com/drobnikj) in [#355](https://github.com/apify/apify-client-js/pull/355) * feat: Add shouldInterpolateStrings field to webhook type by [@valekjo](https://github.com/valekjo) in [#358](https://github.com/apify/apify-client-js/pull/358) * feat: Use Actor/Apify env vars instead of `ENV_VARS` by [@jirimoravcik](https://github.com/jirimoravcik) in [#373](https://github.com/apify/apify-client-js/pull/373) * feat: Added StoreCollectionClient class useful for listing Actors in Apify Store by [@drobnikj](https://github.com/drobnikj) in [#395](https://github.com/apify/apify-client-js/pull/395) * docs: Change subtitle by [@barjin](https://github.com/barjin) in [#380](https://github.com/apify/apify-client-js/pull/380) * fix(docs): Fix docs for resource clients to hide constructor by [@drobnikj](https://github.com/drobnikj) in [#397](https://github.com/apify/apify-client-js/pull/397) * fix: Update index.js by [@jancurn](https://github.com/jancurn) in [#379](https://github.com/apify/apify-client-js/pull/379) * chore: Use new workflow secrets by [@fnesveda](https://github.com/fnesveda) in [#354](https://github.com/apify/apify-client-js/pull/354) * chore: Invalidate CloudFront cache after docs deploy by [@fnesveda](https://github.com/fnesveda) in [#357](https://github.com/apify/apify-client-js/pull/357) * chore: Update dependencies **Full Changelog**: ### [v2.7.1](https://github.com/apify/apify-client-js/releases/tag/v2.7.1)[​](#v271 "Direct link to heading") ##### What's Changed[​](#whats-changed-14 "Direct link to heading") * fix: add `types` to package `exports` by [@B4nan](https://github.com/B4nan) in [#349](https://github.com/apify/apify-client-js/pull/349) **Full Changelog**: ### [v2.7.0](https://github.com/apify/apify-client-js/releases/tag/v2.7.0)[​](#v270 "Direct link to heading") ##### What's Changed[​](#whats-changed-15 "Direct link to heading") * fix: `isAtHome` value in User-Agent header by [@mvolfik](https://github.com/mvolfik) in [#286](https://github.com/apify/apify-client-js/pull/286) * fix: types for actor run by [@drobnikj](https://github.com/drobnikj) in [#331](https://github.com/apify/apify-client-js/pull/331) * fix: improve reading of the version when using bundlers by [@vladfrangu](https://github.com/vladfrangu) in [#332](https://github.com/apify/apify-client-js/pull/332) * feat: add support for `maxItems` in run options by [@novotnyj](https://github.com/novotnyj) in [#330](https://github.com/apify/apify-client-js/pull/330) * feat: mark Request Queue v2 methods as stable by [@drobnikj](https://github.com/drobnikj) in [#334](https://github.com/apify/apify-client-js/pull/334) * feat: add standard handing for setStatusMessage by [@barjin](https://github.com/barjin) in 
[#333](https://github.com/apify/apify-client-js/pull/333) ##### New Contributors[​](#new-contributors-6 "Direct link to heading") * [@novotnyj](https://github.com/novotnyj) made their first contribution in [#330](https://github.com/apify/apify-client-js/pull/330) **Full Changelog**: ### [v2.6.3](https://github.com/apify/apify-client-js/releases/tag/v2.6.3)[​](#v263 "Direct link to heading") ##### What's Changed[​](#whats-changed-16 "Direct link to heading") * feat: isStatusMessageTerminal in RunUpdate interface by [@barjin](https://github.com/barjin) in [#306](https://github.com/apify/apify-client-js/pull/306) **Full Changelog**: ### [v2.6.2](https://github.com/apify/apify-client-js/releases/tag/v2.6.2)[​](#v262 "Direct link to heading") ##### What's Changed[​](#whats-changed-17 "Direct link to heading") * fix: `Actor.call` and `Task.call` accept `waitSecs` not `waitForFinish` by [@vladfrangu](https://github.com/vladfrangu) in [#283](https://github.com/apify/apify-client-js/pull/283) * feat: re-export useful types and classes by [@vladfrangu](https://github.com/vladfrangu) in [#285](https://github.com/apify/apify-client-js/pull/285) * fix(types): correct extends clause for Dataset entries by [@vladfrangu](https://github.com/vladfrangu) in [#284](https://github.com/apify/apify-client-js/pull/284) * fix: Correct docs links for actor env vars, some refactoring by [@jirimoravcik](https://github.com/jirimoravcik) in [#287](https://github.com/apify/apify-client-js/pull/287) * fix: make ActorUpdateOptions type have optional fields by [@metalwarrior665](https://github.com/metalwarrior665) in [#288](https://github.com/apify/apify-client-js/pull/288) * fix: correctly set default client headers by [@valekjo](https://github.com/valekjo) in [#290](https://github.com/apify/apify-client-js/pull/290) **Full Changelog**: --- # Code examples ## Passing an input to the Actor[​](#passing-an-input-to-the-actor "Direct link to heading") The fastest way to get results from an Actor is to pass the input directly to the `call` function. The `call` function starts the Actor run (and can optionally wait for it to finish), and a reference to the run is returned in the `runData` variable. This example starts an Actor that scrapes 20 posts from the Instagram website based on a hashtag. ``` import { ApifyClient } from 'apify-client'; // Client initialization with the API token const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); const actorClient = client.actor('apify/instagram-hashtag-scraper'); const input = { hashtags: ['rainbow'], resultsLimit: 20 }; // Run the Actor and wait up to 60 seconds for it to finish. // Input is not persisted for next runs. const runData = await actorClient.call(input, { waitSecs: 60 }); console.log('Run data:'); console.log(runData); ``` To run the same Actor with multiple inputs, the most convenient way is to create multiple [tasks](https://docs.apify.com/platform/actors/running/tasks) with different inputs. Task input is persisted on the Apify platform when the task is created. ``` import { ApifyClient } from 'apify-client'; // Client initialization with the API token const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); const animalsHashtags = ['zebra', 'lion', 'hippo']; // Multiple input schemas for one Actor can be persisted in tasks. // Tasks are saved in the Apify platform and can be run multiple times.
const socialsTasksPromises = animalsHashtags.map((hashtag) => client.tasks().create({ actId: 'apify/instagram-hashtag-scraper', name: `hashtags-${hashtag}`, input: { hashtags: [hashtag], resultsLimit: 20 }, options: { memoryMbytes: 1024 }, }), ); // Create all tasks in parallel const createdTasks = await Promise.all(socialsTasksPromises); console.log('Created tasks:'); console.log(createdTasks); // Run all tasks in parallel await Promise.all(createdTasks.map((task) => client.task(task.id).call())); ``` ## Getting latest data from an Actor, joining datasets[​](#getting-latest-data-from-an-actor-joining-datasets "Direct link to heading") Actor data is stored in [datasets](https://docs.apify.com/platform/storage/dataset), which can be retrieved from Actor runs. Dataset items can be listed with pagination. Datasets can also be merged together for further analysis in a single file, since a dataset can be exported to various data formats (CSV, JSON, XLSX, XML). [Integrations](https://docs.apify.com/platform/integrations) can do the trick as well. ``` import { ApifyClient } from 'apify-client'; // Client initialization with the API token const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); const actorClient = client.actor('apify/instagram-hashtag-scraper'); const actorRuns = actorClient.runs(); // See pagination to understand how to get more datasets const actorDatasets = await actorRuns.list({ limit: 20 }); console.log('Actor datasets:'); console.log(actorDatasets); const mergingDataset = await client.datasets().getOrCreate('merge-dataset'); for (const datasetItem of actorDatasets.items) { // Dataset items can be handled here. Dataset items can be paginated const datasetItems = await client.dataset(datasetItem.defaultDatasetId).listItems({ limit: 1000 }); // Items can be pushed to single dataset await client.dataset(mergingDataset.id).pushItems(datasetItems.items); // ... } ``` ## Handling webhooks[​](#handling-webhooks "Direct link to heading") [Webhooks](https://docs.apify.com/platform/integrations/webhooks) can be used to get notifications about Actor runs. For example, a webhook can be triggered when an Actor run finishes successfully. The webhook can receive the dataset ID for further processing. Webhook initialization: ``` import { ApifyClient } from 'apify-client'; // Client initialization with the API token const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); const webhooksClient = client.webhooks(); await webhooksClient.create({ description: 'Instagram hashtag actor succeeded', condition: { actorId: 'reGe1ST3OBgYZSsZJ' }, // Actor ID of apify/instagram-hashtag-scraper // Request URL can be generated using https://webhook.site. Any REST server can be used requestUrl: 'https://webhook.site/CUSTOM_WEBHOOK_ID', eventTypes: ['ACTOR.RUN.SUCCEEDED'], }); ``` A simple webhook listener can be built with the [`express`](https://expressjs.com/) library, which helps to create a REST server for handling webhooks: ``` import express from 'express'; import bodyParser from 'body-parser'; import { ApifyClient, DownloadItemsFormat } from 'apify-client'; // Initialize Apify client, express and define server port const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); const app = express(); const PORT = 3000; // Tell express to use body-parser's JSON parsing app.use(bodyParser.json()); app.post('/apify-webhook', async (req, res) => { // Log the payload from the webhook console.log(req.body); const runDataset = client.dataset(req.body.resource.defaultDatasetId); // e.g.
Save dataset locally as JSON await runDataset.downloadItems(DownloadItemsFormat.JSON); // Respond to the webhook res.send('Webhook received'); }); // Start express on the defined port app.listen(PORT, () => console.log(`🚀 Server running on port ${PORT}`)); ``` --- # apify-client ## Index[**](#Index) ### Enumerations * [**ActorListSortBy](https://docs.apify.com/api/client/js/api/client/js/reference/enum/ActorListSortBy.md) * [**DownloadItemsFormat](https://docs.apify.com/api/client/js/api/client/js/reference/enum/DownloadItemsFormat.md) * [**PlatformFeature](https://docs.apify.com/api/client/js/api/client/js/reference/enum/PlatformFeature.md) * [**ScheduleActions](https://docs.apify.com/api/client/js/api/client/js/reference/enum/ScheduleActions.md) * [**WebhookDispatchStatus](https://docs.apify.com/api/client/js/api/client/js/reference/enum/WebhookDispatchStatus.md) ### Classes * [**ActorClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorClient.md) * [**ActorCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorCollectionClient.md) * [**ApifyApiError](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyApiError.md) * [**ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) * [**BuildClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildClient.md) * [**BuildCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildCollectionClient.md) * [**DatasetClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetClient.md) * [**DatasetCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetCollectionClient.md) * [**InvalidResponseBodyError](https://docs.apify.com/api/client/js/api/client/js/reference/class/InvalidResponseBodyError.md) * [**KeyValueStoreClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreClient.md) * [**KeyValueStoreCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreCollectionClient.md) * [**LogClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/LogClient.md) * [**RequestQueueClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueClient.md) * [**RequestQueueCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueCollectionClient.md) * [**RunClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunClient.md) * [**RunCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunCollectionClient.md) * [**ScheduleClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ScheduleClient.md) * [**ScheduleCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ScheduleCollectionClient.md) * [**StoreCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/StoreCollectionClient.md) * [**TaskClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/TaskClient.md) * [**TaskCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/TaskCollectionClient.md) * [**UserClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/UserClient.md) * [**WebhookClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookClient.md) * 
[**WebhookCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookCollectionClient.md) * [**WebhookDispatchClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchClient.md) * [**WebhookDispatchCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchCollectionClient.md) ### Interfaces * [**AccountAndUsageLimits](https://docs.apify.com/api/client/js/api/client/js/reference/interface/AccountAndUsageLimits.md) * [**Actor](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Actor.md) * [**ActorBuildOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorBuildOptions.md) * [**ActorCallOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCallOptions.md) * [**ActorChargeEvent](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorChargeEvent.md) * [**ActorCollectionCreateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCollectionCreateOptions.md) * [**ActorCollectionListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCollectionListItem.md) * [**ActorCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCollectionListOptions.md) * [**ActorDefaultRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorDefaultRunOptions.md) * [**ActorDefinition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorDefinition.md) * [**ActorExampleRunInput](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorExampleRunInput.md) * [**ActorLastRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorLastRunOptions.md) * [**ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md) * [**ActorRunListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunListItem.md) * [**ActorRunMeta](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunMeta.md) * [**ActorRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunOptions.md) * [**ActorRunStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunStats.md) * [**ActorRunUsage](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunUsage.md) * [**ActorStandby](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStandby.md) * [**ActorStartOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStartOptions.md) * [**ActorStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStats.md) * [**ActorStoreList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStoreList.md) * [**ActorTaggedBuild](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorTaggedBuild.md) * [**ApifyClientOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ApifyClientOptions.md) * [**Build](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Build.md) * [**BuildClientGetOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildClientGetOptions.md) * [**BuildClientWaitForFinishOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildClientWaitForFinishOptions.md) * 
[**BuildCollectionClientListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildCollectionClientListOptions.md) * [**BuildMeta](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildMeta.md) * [**BuildOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildOptions.md) * [**BuildStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildStats.md) * [**BuildUsage](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildUsage.md) * [**Current](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Current.md) * [**Dataset](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Dataset.md) * [**DatasetClientDownloadItemsOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientDownloadItemsOptions.md) * [**DatasetClientListItemOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientListItemOptions.md) * [**DatasetClientUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientUpdateOptions.md) * [**DatasetCollectionClientGetOrCreateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetCollectionClientGetOrCreateOptions.md) * [**DatasetCollectionClientListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetCollectionClientListOptions.md) * [**DatasetStatistics](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetStatistics.md) * [**DatasetStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetStats.md) * [**FieldStatistics](https://docs.apify.com/api/client/js/api/client/js/reference/interface/FieldStatistics.md) * [**FlatPricePerMonthActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/FlatPricePerMonthActorPricingInfo.md) * [**FreeActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/FreeActorPricingInfo.md) * [**KeyValueClientGetRecordOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientGetRecordOptions.md) * [**KeyValueClientListKeysOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientListKeysOptions.md) * [**KeyValueClientListKeysResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientListKeysResult.md) * [**KeyValueClientUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientUpdateOptions.md) * [**KeyValueListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueListItem.md) * [**KeyValueStore](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStore.md) * [**KeyValueStoreCollectionClientGetOrCreateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreCollectionClientGetOrCreateOptions.md) * [**KeyValueStoreCollectionClientListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreCollectionClientListOptions.md) * [**KeyValueStoreRecord](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreRecord.md) * [**KeyValueStoreRecordOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreRecordOptions.md) * 
[**KeyValueStoreStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreStats.md) * [**Limits](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Limits.md) * [**MonthlyUsage](https://docs.apify.com/api/client/js/api/client/js/reference/interface/MonthlyUsage.md) * [**MonthlyUsageCycle](https://docs.apify.com/api/client/js/api/client/js/reference/interface/MonthlyUsageCycle.md) * [**OpenApiDefinition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/OpenApiDefinition.md) * [**PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md) * [**PricePerDatasetItemActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PricePerDatasetItemActorPricingInfo.md) * [**PricePerEventActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PricePerEventActorPricingInfo.md) * [**PricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PricingInfo.md) * [**ProxyGroup](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ProxyGroup.md) * [**RequestQueue](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueue.md) * [**RequestQueueClientAddRequestOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestOptions.md) * [**RequestQueueClientAddRequestResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestResult.md) * [**RequestQueueClientBatchAddRequestWithRetriesOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientBatchAddRequestWithRetriesOptions.md) * [**RequestQueueClientBatchRequestsOperationResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientBatchRequestsOperationResult.md) * [**RequestQueueClientDeleteRequestLockOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientDeleteRequestLockOptions.md) * [**RequestQueueClientListAndLockHeadOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListAndLockHeadOptions.md) * [**RequestQueueClientListAndLockHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListAndLockHeadResult.md) * [**RequestQueueClientListHeadOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListHeadOptions.md) * [**RequestQueueClientListHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListHeadResult.md) * [**RequestQueueClientListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListItem.md) * [**RequestQueueClientListRequestsOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListRequestsOptions.md) * [**RequestQueueClientListRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListRequestsResult.md) * [**RequestQueueClientPaginateRequestsOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientPaginateRequestsOptions.md) * [**RequestQueueClientProlongRequestLockOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientProlongRequestLockOptions.md) * 
[**RequestQueueClientProlongRequestLockResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientProlongRequestLockResult.md) * [**RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientRequestSchema.md) * [**RequestQueueClientUnlockRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientUnlockRequestsResult.md) * [**RequestQueueClientUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientUpdateOptions.md) * [**RequestQueueCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueCollectionListOptions.md) * [**RequestQueueStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueStats.md) * [**RequestQueueUserOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueUserOptions.md) * [**RunAbortOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunAbortOptions.md) * [**RunChargeOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunChargeOptions.md) * [**RunCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunCollectionListOptions.md) * [**RunGetOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunGetOptions.md) * [**RunMetamorphOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunMetamorphOptions.md) * [**RunResurrectOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunResurrectOptions.md) * [**RunUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunUpdateOptions.md) * [**RunWaitForFinishOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunWaitForFinishOptions.md) * [**Schedule](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Schedule.md) * [**ScheduleActionRunActor](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduleActionRunActor.md) * [**ScheduleActionRunActorTask](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduleActionRunActorTask.md) * [**ScheduleCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduleCollectionListOptions.md) * [**ScheduledActorRunInput](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduledActorRunInput.md) * [**ScheduledActorRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduledActorRunOptions.md) * [**StoreCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/StoreCollectionListOptions.md) * [**Task](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Task.md) * [**TaskCallOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskCallOptions.md) * [**TaskCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskCollectionListOptions.md) * [**TaskCreateData](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskCreateData.md) * [**TaskLastRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskLastRunOptions.md) * [**TaskOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskOptions.md) * 
[**TaskStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskStats.md) * [**UsageCycle](https://docs.apify.com/api/client/js/api/client/js/reference/interface/UsageCycle.md) * [**User](https://docs.apify.com/api/client/js/api/client/js/reference/interface/User.md) * [**UserPlan](https://docs.apify.com/api/client/js/api/client/js/reference/interface/UserPlan.md) * [**UserProxy](https://docs.apify.com/api/client/js/api/client/js/reference/interface/UserProxy.md) * [**Webhook](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Webhook.md) * [**WebhookAnyRunOfActorCondition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookAnyRunOfActorCondition.md) * [**WebhookAnyRunOfActorTaskCondition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookAnyRunOfActorTaskCondition.md) * [**WebhookCertainRunCondition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookCertainRunCondition.md) * [**WebhookCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookCollectionListOptions.md) * [**WebhookDispatch](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookDispatch.md) * [**WebhookDispatchCall](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookDispatchCall.md) * [**WebhookDispatchCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookDispatchCollectionListOptions.md) * [**WebhookIdempotencyKey](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookIdempotencyKey.md) * [**WebhookStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookStats.md) ### Type Aliases * [**ActorChargeEvents](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorChargeEvents) * [**ActorCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorCollectionListResult) * [**ActorRunPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorRunPricingInfo) * [**ActorTaggedBuilds](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorTaggedBuilds) * [**ActorUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorUpdateOptions) * [**AllowedHttpMethods](https://docs.apify.com/api/client/js/api/client/js/reference.md#AllowedHttpMethods) * [**BuildCollectionClientListItem](https://docs.apify.com/api/client/js/api/client/js/reference.md#BuildCollectionClientListItem) * [**BuildCollectionClientListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#BuildCollectionClientListResult) * [**DatasetCollectionClientListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#DatasetCollectionClientListResult) * [**Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) * [**KeyValueStoreCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#KeyValueStoreCollectionListResult) * [**LimitsUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference.md#LimitsUpdateOptions) * [**RequestQueueClientGetRequestResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueClientGetRequestResult) * [**RequestQueueClientRequestToDelete](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueClientRequestToDelete) * 
[**RequestQueueCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueCollectionListResult) * [**RequestQueueRequestsAsyncIterable](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueRequestsAsyncIterable) * [**ReturnTypeFromOptions](https://docs.apify.com/api/client/js/api/client/js/reference.md#ReturnTypeFromOptions) * [**ScheduleAction](https://docs.apify.com/api/client/js/api/client/js/reference.md#ScheduleAction) * [**ScheduleCreateOrUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#ScheduleCreateOrUpdateData) * [**TaskList](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskList) * [**TaskStartOptions](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskStartOptions) * [**TaskUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskUpdateData) * [**WebhookCondition](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookCondition) * [**WebhookEventType](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookEventType) * [**WebhookUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookUpdateData) ## Type Aliases[**](<#Type Aliases>) ### [**](#ActorChargeEvents)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L548)ActorChargeEvents **ActorChargeEvents: Record\ ### [**](#ActorCollectionListResult)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L69)ActorCollectionListResult **ActorCollectionListResult: [PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[ActorCollectionListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCollectionListItem.md)> ### [**](#ActorRunPricingInfo)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L558)ActorRunPricingInfo **ActorRunPricingInfo: [PricePerEventActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PricePerEventActorPricingInfo.md) | [PricePerDatasetItemActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PricePerDatasetItemActorPricingInfo.md) | [FlatPricePerMonthActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/FlatPricePerMonthActorPricingInfo.md) | [FreeActorPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/FreeActorPricingInfo.md) ### [**](#ActorTaggedBuilds)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L311)ActorTaggedBuilds **ActorTaggedBuilds: Record\ ### [**](#ActorUpdateOptions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L319)ActorUpdateOptions **ActorUpdateOptions: Partial\> ### [**](#AllowedHttpMethods)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L710)AllowedHttpMethods **AllowedHttpMethods: GET | HEAD | POST | PUT | DELETE | TRACE | OPTIONS | CONNECT | PATCH ### [**](#BuildCollectionClientListItem)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build_collection.ts#L42)BuildCollectionClientListItem **BuildCollectionClientListItem: Required\> & Partial\> ### 
[**](#BuildCollectionClientListResult)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build_collection.ts#L45)BuildCollectionClientListResult **BuildCollectionClientListResult: [PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[BuildCollectionClientListItem](https://docs.apify.com/api/client/js/api/client/js/reference.md#BuildCollectionClientListItem)> ### [**](#DatasetCollectionClientListResult)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L58)DatasetCollectionClientListResult **DatasetCollectionClientListResult: [PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[Dataset](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Dataset.md)> ### [**](#Dictionary)[**](https://github.com/apify/apify-client-js/blob/master/src/utils.ts#L260)Dictionary **Dictionary\: Record\ #### Type parameters * **T** = unknown ### [**](#KeyValueStoreCollectionListResult)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L63)KeyValueStoreCollectionListResult **KeyValueStoreCollectionListResult: Omit<[KeyValueStore](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStore.md), stats> & { username? : string } ### [**](#LimitsUpdateOptions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L251)LimitsUpdateOptions **LimitsUpdateOptions: Pick<[Limits](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Limits.md), maxMonthlyUsageUsd | dataRetentionDays> ### [**](#RequestQueueClientGetRequestResult)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L708)RequestQueueClientGetRequestResult **RequestQueueClientGetRequestResult: Omit<[RequestQueueClientListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListItem.md), retryCount> ### [**](#RequestQueueClientRequestToDelete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L704)RequestQueueClientRequestToDelete **RequestQueueClientRequestToDelete: Pick<[RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientRequestSchema.md), id> | Pick<[RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientRequestSchema.md), uniqueKey> ### [**](#RequestQueueCollectionListResult)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue_collection.ts#L53)RequestQueueCollectionListResult **RequestQueueCollectionListResult: [PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[RequestQueue](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueue.md) & { username? : string }> & { unnamed: boolean } ### [**](#RequestQueueRequestsAsyncIterable)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L712)RequestQueueRequestsAsyncIterable **RequestQueueRequestsAsyncIterable\: AsyncIterable\ #### Type parameters * **T** ### [**](#ReturnTypeFromOptions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L311)ReturnTypeFromOptions **ReturnTypeFromOptions\: Options\[stream] extends true ? 
Readable : Options\[buffer] extends true ? Buffer : JsonValue #### Type parameters * **Options**: [KeyValueClientGetRecordOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientGetRecordOptions.md) ### [**](#ScheduleAction)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L103)ScheduleAction **ScheduleAction: [ScheduleActionRunActor](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduleActionRunActor.md) | [ScheduleActionRunActorTask](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduleActionRunActorTask.md) ### [**](#ScheduleCreateOrUpdateData)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L84)ScheduleCreateOrUpdateData **ScheduleCreateOrUpdateData: Partial\ & { actions: DistributiveOptional<[ScheduleAction](https://docs.apify.com/api/client/js/api/client/js/reference.md#ScheduleAction), id>\[] }> ### [**](#TaskList)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task_collection.ts#L56)TaskList **TaskList: Omit<[Task](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Task.md), options | input> ### [**](#TaskStartOptions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L235)TaskStartOptions **TaskStartOptions: Omit<[ActorStartOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStartOptions.md), contentType> ### [**](#TaskUpdateData)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L227)TaskUpdateData **TaskUpdateData: Partial\> ### [**](#WebhookCondition)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L127)WebhookCondition **WebhookCondition: [WebhookAnyRunOfActorCondition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookAnyRunOfActorCondition.md) | [WebhookAnyRunOfActorTaskCondition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookAnyRunOfActorTaskCondition.md) | [WebhookCertainRunCondition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookCertainRunCondition.md) ### [**](#WebhookEventType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L125)WebhookEventType **WebhookEventType: typeof WEBHOOK\_EVENT\_TYPES\[keyof typeof WEBHOOK\_EVENT\_TYPES] ### [**](#WebhookUpdateData)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L103)WebhookUpdateData **WebhookUpdateData: Partial\> & [WebhookIdempotencyKey](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookIdempotencyKey.md) --- # ActorClient ### Hierarchy * ResourceClient * *ActorClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**build](#build) * [**builds](#builds) * [**call](#call) * [**defaultBuild](#defaultBuild) * [**delete](#delete) * [**get](#get) * [**lastRun](#lastRun) * [**runs](#runs) * [**start](#start) * [**update](#update) * [**version](#version) * [**versions](#versions) * [**webhooks](#webhooks) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient 
**apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L145)build * ****build**(versionNumber, options): Promise<[Build](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Build.md)> - *** #### Parameters * ##### versionNumber: string * ##### options: [ActorBuildOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorBuildOptions.md) = {} #### Returns Promise<[Build](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Build.md)> ### [**](#builds)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L213)builds * ****builds**(): [BuildCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildCollectionClient.md) - *** #### Returns [BuildCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildCollectionClient.md) ### [**](#call)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L115)call * ****call**(input, options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - Starts an actor and waits for it to finish before returning the Run object. It waits indefinitely, unless the `waitSecs` option is provided. 
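A minimal sketch of calling an Actor and waiting for its run to finish. The token, Actor ID, and input shape below are placeholders (the real input depends on the Actor), and `waitSecs` is optional; the code assumes an ES module or other async context:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

// Starts the Actor, waits up to 120 seconds for the run to finish
// and resolves with the resulting ActorRun object.
const run = await client.actor('username/my-actor').call(
    { startUrls: [{ url: 'https://example.com' }] }, // placeholder input
    { waitSecs: 120 },
);
console.log(`Run ${run.id} ended with status ${run.status}`);
```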
*** #### Parameters * ##### optionalinput: unknown * ##### options: [ActorCallOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCallOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#defaultBuild)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L172)defaultBuild * ****defaultBuild**(options): Promise<[BuildClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildClient.md)> - *** #### Parameters * ##### options: [BuildClientGetOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildClientGetOptions.md) = {} #### Returns Promise<[BuildClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildClient.md)> ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L51)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L35)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#lastRun)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L192)lastRun * ****lastRun**(options): [RunClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunClient.md) - *** #### Parameters * ##### options: [ActorLastRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorLastRunOptions.md) = {} #### Returns [RunClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunClient.md) ### [**](#runs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L224)runs * ****runs**(): [RunCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunCollectionClient.md) - *** #### Returns [RunCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunCollectionClient.md) ### [**](#start)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L59)start * ****start**(input, options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - Starts an actor and immediately returns the Run object. 
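A sketch of starting an Actor run without waiting for it, then waiting for that same run later through its `RunClient`. The token, Actor ID, and input are placeholders:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

// Returns immediately with the newly created run.
const run = await client.actor('username/my-actor').start({ someInput: true }); // placeholder input

// Wait for that run to finish later, e.g. after doing other work.
const finishedRun = await client.run(run.id).waitForFinish();
console.log(finishedRun.status);
```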
*** #### Parameters * ##### optionalinput: unknown * ##### options: [ActorStartOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStartOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L42)update * ****update**(newFields): Promise<[Actor](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Actor.md)> - *** #### Parameters * ##### newFields: Partial\> #### Returns Promise<[Actor](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Actor.md)> ### [**](#version)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L235)version * ****version**(versionNumber): ActorVersionClient - *** #### Parameters * ##### versionNumber: string #### Returns ActorVersionClient ### [**](#versions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L248)versions * ****versions**(): ActorVersionCollectionClient - *** #### Returns ActorVersionCollectionClient ### [**](#webhooks)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L256)webhooks * ****webhooks**(): [WebhookCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookCollectionClient.md) - *** #### Returns [WebhookCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookCollectionClient.md) --- # ActorCollectionClient ### Hierarchy * ResourceCollectionClient * *ActorCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**create](#create) * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#create)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L41)create * ****create**(actor): Promise<[Actor](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Actor.md)> - *** #### Parameters * ##### actor: [ActorCollectionCreateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCollectionCreateOptions.md) #### Returns Promise<[Actor](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Actor.md)> ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L23)list * ****list**(options): Promise<[ActorCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorCollectionListResult)> - *** #### Parameters * ##### options: [ActorCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorCollectionListOptions.md) = {} #### Returns Promise<[ActorCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorCollectionListResult)> --- # ApifyApiError An `ApifyApiError` is thrown for successful HTTP requests that reach the API, but the API responds with an error response. Typically, those are rate limit errors and internal errors, which are automatically retried, or validation errors, which are thrown immediately, because a correction by the user is needed. ### Hierarchy * Error * *ApifyApiError* ## Index[**](#Index) ### Properties * [**attempt](#attempt) * [**cause](#cause) * [**clientMethod](#clientMethod) * [**data](#data) * [**httpMethod](#httpMethod) * [**message](#message) * [**name](#name) * [**originalStack](#originalStack) * [**path](#path) * [**stack](#stack) * [**statusCode](#statusCode) * [**type](#type) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**prepareStackTrace](#prepareStackTrace) ## Properties[**](#Properties) ### [**](#attempt)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L46)attempt **attempt: number Number of the API call attempt. ### [**](#cause)[**](https://undefined/apify/apify-client-js/blob/master/website/node_modules/typescript/src/lib.es2022.error.d.ts#L24)externaloptionalinheritedcause **cause? : unknown Inherited from Error.cause ### [**](#clientMethod)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L31)clientMethod **clientMethod: string The invoked resource client and the method. Known issue: Sometimes it displays as `unknown` because it can't be parsed from a stack trace. ### [**](#data)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L67)optionaldata **data? : Record\ Additional data provided by the API about the error ### [**](#httpMethod)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L51)optionalhttpMethod **httpMethod? : string HTTP method of the API call. 
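A sketch of catching an `ApifyApiError` thrown by a failing API call. The token and the deliberately non-existent Actor ID are placeholders, and the exact `statusCode` and `type` you receive depend on the particular failure:

```
import { ApifyClient, ApifyApiError } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

try {
    await client.actor('nonexistent-user/nonexistent-actor').start();
} catch (error) {
    if (error instanceof ApifyApiError) {
        // e.g. a 404 with a "not found" error type for a missing Actor
        console.error(error.statusCode, error.type, error.message);
    } else {
        throw error;
    }
}
```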
### [**](#message)[**](https://undefined/apify/apify-client-js/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L25)name **name: string Overrides Error.name ### [**](#originalStack)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L62)originalStack **originalStack: string Original stack trace of the exception. It is replaced by a more informative stack with API call information. ### [**](#path)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L56)optionalpath **path? : string Full path of the API endpoint (URL excluding origin). ### [**](#stack)[**](https://undefined/apify/apify-client-js/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#statusCode)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L36)statusCode **statusCode: number HTTP status code of the error. ### [**](#type)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_api_error.ts#L41)optionaltype **type? : string The type of the error, as returned by the API. ### [**](#stackTraceLimit)[**](https://undefined/apify/apify-client-js/blob/master/node_modules/@types/node/globals.d.ts#L161)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from Error.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/apify-client-js/blob/master/node_modules/@types/node/globals.d.ts#L145)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from Error.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. 
const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#prepareStackTrace)[**](https://undefined/apify/apify-client-js/blob/master/node_modules/@types/node/globals.d.ts#L149)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from Error.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # ApifyClient ApifyClient is the official library to access [Apify API](https://docs.apify.com/api/v2) from your JavaScript applications. It runs both in Node.js and browser. ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**logger](#logger) * [**stats](#stats) * [**token](#token) ### Methods * [**actor](#actor) * [**actors](#actors) * [**build](#build) * [**builds](#builds) * [**dataset](#dataset) * [**datasets](#datasets) * [**keyValueStore](#keyValueStore) * [**keyValueStores](#keyValueStores) * [**log](#log) * [**requestQueue](#requestQueue) * [**requestQueues](#requestQueues) * [**run](#run) * [**runs](#runs) * [**schedule](#schedule) * [**schedules](#schedules) * [**setStatusMessage](#setStatusMessage) * [**store](#store) * [**task](#task) * [**tasks](#tasks) * [**user](#user) * [**webhook](#webhook) * [**webhookDispatch](#webhookDispatch) * [**webhookDispatches](#webhookDispatches) * [**webhooks](#webhooks) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L53)constructor * ****new ApifyClient**(options): [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) - #### Parameters * ##### options: [ApifyClientOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ApifyClientOptions.md) = {} #### Returns [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L43)baseUrl **baseUrl: string ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L51)httpClient **httpClient: HttpClient ### [**](#logger)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L49)logger **logger: Log ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L47)stats **stats: Statistics ### [**](#token)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L45)optionaltoken **token? 
: string ## Methods[**](#Methods) ### [**](#actor)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L111)actor * ****actor**(id): [ActorClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorClient.md) - *** #### Parameters * ##### id: string #### Returns [ActorClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorClient.md) ### [**](#actors)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L104)actors * ****actors**(): [ActorCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorCollectionClient.md) - *** #### Returns [ActorCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ActorCollectionClient.md) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L130)build * ****build**(id): [BuildClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildClient.md) - *** #### Parameters * ##### id: string #### Returns [BuildClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildClient.md) ### [**](#builds)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L123)builds * ****builds**(): [BuildCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildCollectionClient.md) - *** #### Returns [BuildCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/BuildCollectionClient.md) ### [**](#dataset)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L149)dataset * ****dataset**\(id): [DatasetClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetClient.md)\ - *** #### Parameters * ##### id: string #### Returns [DatasetClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetClient.md)\ ### [**](#datasets)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L142)datasets * ****datasets**(): [DatasetCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetCollectionClient.md) - *** #### Returns [DatasetCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetCollectionClient.md) ### [**](#keyValueStore)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L170)keyValueStore * ****keyValueStore**(id): [KeyValueStoreClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreClient.md) - *** #### Parameters * ##### id: string #### Returns [KeyValueStoreClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreClient.md) ### [**](#keyValueStores)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L163)keyValueStores * ****keyValueStores**(): [KeyValueStoreCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreCollectionClient.md) - *** #### Returns [KeyValueStoreCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreCollectionClient.md) ### [**](#log)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L182)log * ****log**(buildOrRunId): [LogClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/LogClient.md) - *** #### Parameters * ##### buildOrRunId: string #### Returns [LogClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/LogClient.md) 
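A sketch of fetching the log of a run or build through the `log()` client. The token and ID are placeholders; `get()` resolves to the log text, or `undefined` when the log does not exist:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

const logText = await client.log('RUN_OR_BUILD_ID').get();
console.log(logText ?? 'Log not found');
```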
### [**](#requestQueue)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L201)requestQueue * ****requestQueue**(id, options): [RequestQueueClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueClient.md) - *** #### Parameters * ##### id: string * ##### options: [RequestQueueUserOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueUserOptions.md) = {} #### Returns [RequestQueueClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueClient.md) ### [**](#requestQueues)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L194)requestQueues * ****requestQueues**(): [RequestQueueCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueCollectionClient.md) - *** #### Returns [RequestQueueCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueCollectionClient.md) ### [**](#run)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L231)run * ****run**(id): [RunClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunClient.md) - *** #### Parameters * ##### id: string #### Returns [RunClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunClient.md) ### [**](#runs)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L221)runs * ****runs**(): [RunCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunCollectionClient.md) - *** #### Returns [RunCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunCollectionClient.md) ### [**](#schedule)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L269)schedule * ****schedule**(id): [ScheduleClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ScheduleClient.md) - *** #### Parameters * ##### id: string #### Returns [ScheduleClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ScheduleClient.md) ### [**](#schedules)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L262)schedules * ****schedules**(): [ScheduleCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ScheduleCollectionClient.md) - *** #### Returns [ScheduleCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ScheduleCollectionClient.md) ### [**](#setStatusMessage)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L335)setStatusMessage * ****setStatusMessage**(message, options): Promise\ - #### Parameters * ##### message: string * ##### optionaloptions: SetStatusMessageOptions #### Returns Promise\ ### [**](#store)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L331)store * ****store**(): [StoreCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/StoreCollectionClient.md) - *** #### Returns [StoreCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/StoreCollectionClient.md) ### [**](#task)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L250)task * ****task**(id): [TaskClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/TaskClient.md) - *** #### Parameters * ##### id: string #### Returns [TaskClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/TaskClient.md) ### 
[**](#tasks)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L243)tasks * ****tasks**(): [TaskCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/TaskCollectionClient.md) - *** #### Returns [TaskCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/TaskCollectionClient.md) ### [**](#user)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L281)user * ****user**(id): [UserClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/UserClient.md) - *** #### Parameters * ##### id: string = ME\_USER\_NAME\_PLACEHOLDER #### Returns [UserClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/UserClient.md) ### [**](#webhook)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L300)webhook * ****webhook**(id): [WebhookClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookClient.md) - *** #### Parameters * ##### id: string #### Returns [WebhookClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookClient.md) ### [**](#webhookDispatch)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L319)webhookDispatch * ****webhookDispatch**(id): [WebhookDispatchClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchClient.md) - *** #### Parameters * ##### id: string #### Returns [WebhookDispatchClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchClient.md) ### [**](#webhookDispatches)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L312)webhookDispatches * ****webhookDispatches**(): [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchCollectionClient.md) - *** #### Returns [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchCollectionClient.md) ### [**](#webhooks)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L293)webhooks * ****webhooks**(): [WebhookCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookCollectionClient.md) - *** #### Returns [WebhookCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookCollectionClient.md) --- # BuildClient ### Hierarchy * ResourceClient * *BuildClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**abort](#abort) * [**delete](#delete) * [**get](#get) * [**getOpenApiDefinition](#getOpenApiDefinition) * [**log](#log) * [**waitForFinish](#waitForFinish) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from 
ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#abort)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L39)abort * ****abort**(): Promise<[Build](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Build.md)> - *** #### Returns Promise<[Build](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Build.md)> ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L52)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L25)get * ****get**(options): Promise\ - *** #### Parameters * ##### options: [BuildClientGetOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildClientGetOptions.md) = {} #### Returns Promise\ ### [**](#getOpenApiDefinition)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L59)getOpenApiDefinition * ****getOpenApiDefinition**(): Promise<[OpenApiDefinition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/OpenApiDefinition.md)> - *** #### Returns Promise<[OpenApiDefinition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/OpenApiDefinition.md)> ### [**](#log)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L94)log * ****log**(): [LogClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/LogClient.md) - *** #### Returns [LogClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/LogClient.md) ### [**](#waitForFinish)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L80)waitForFinish * ****waitForFinish**(options): Promise<[Build](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Build.md)> - Returns a promise that resolves with the finished Build object when the provided actor build finishes or with the unfinished Build object when the `waitSecs` timeout lapses. The promise is NOT rejected based on run status. You can inspect the `status` property of the Build object to find out its status. The difference between this function and the `waitForFinish` parameter of the `get` method is the fact that this function can wait indefinitely. Its use is preferable to the `waitForFinish` parameter alone, which it uses internally. This is useful when you need to immediately start a run after a build finishes. 
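A sketch combining `ActorClient.build()` with `waitForFinish()`: start a build, wait up to five minutes, then check its status yourself, since the promise is not rejected when the build fails. The token, Actor ID, and version number are placeholders:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

const build = await client.actor('username/my-actor').build('0.1');
const finished = await client.build(build.id).waitForFinish({ waitSecs: 300 });

// The returned Build may still be unfinished or failed - inspect its status.
if (finished.status !== 'SUCCEEDED') {
    throw new Error(`Build ${finished.id} ended with status ${finished.status}`);
}
```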
*** #### Parameters * ##### options: [BuildClientWaitForFinishOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildClientWaitForFinishOptions.md) = {} #### Returns Promise<[Build](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Build.md)> --- # BuildCollectionClient ### Hierarchy * ResourceCollectionClient * *BuildCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build_collection.ts#L22)list * ****list**(options): Promise<[BuildCollectionClientListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#BuildCollectionClientListResult)> - *** #### Parameters * ##### options: [BuildCollectionClientListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildCollectionClientListOptions.md) = {} #### Returns Promise<[BuildCollectionClientListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#BuildCollectionClientListResult)> --- # DatasetClient \ ### Hierarchy * ResourceClient * *DatasetClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**delete](#delete) * [**downloadItems](#downloadItems) * [**get](#get) * [**getStatistics](#getStatistics) * [**listItems](#listItems) * [**pushItems](#pushItems) * [**update](#update) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L49)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#downloadItems)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L89)downloadItems * ****downloadItems**(format, options): Promise\ - Unlike `listItems` which returns a PaginationList with an array of individual dataset items, `downloadItems` returns the items serialized to the provided format. 
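A sketch of downloading dataset items serialized as CSV and writing them to disk. The token and dataset ID are placeholders, and the `limit` option is only illustrative:

```
import { writeFile } from 'node:fs/promises';
import { ApifyClient, DownloadItemsFormat } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });

// downloadItems resolves to a Buffer with the items serialized in the chosen format.
const csv = await client.dataset('DATASET_ID').downloadItems(DownloadItemsFormat.CSV, { limit: 1000 });
await writeFile('items.csv', csv);
```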
*** #### Parameters * ##### format: [DownloadItemsFormat](https://docs.apify.com/api/client/js/api/client/js/reference/enum/DownloadItemsFormat.md) * ##### options: [DatasetClientDownloadItemsOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientDownloadItemsOptions.md) = {} #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L33)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#getStatistics)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L150)getStatistics * ****getStatistics**(): Promise\ - *** #### Returns Promise\ ### [**](#listItems)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L56)listItems * ****listItems**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)\> - *** #### Parameters * ##### options: [DatasetClientListItemOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientListItemOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)\> ### [**](#pushItems)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L131)pushItems * ****pushItems**(items): Promise\ - *** #### Parameters * ##### items: string | string\[] | Data | Data\[] #### Returns Promise\ ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L40)update * ****update**(newFields): Promise<[Dataset](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Dataset.md)> - *** #### Parameters * ##### newFields: [DatasetClientUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientUpdateOptions.md) #### Returns Promise<[Dataset](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Dataset.md)> --- # DatasetCollectionClient ### Hierarchy * ResourceCollectionClient * *DatasetCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? 
: Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L39)getOrCreate * ****getOrCreate**(name, options): Promise<[Dataset](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Dataset.md)> - *** #### Parameters * ##### optionalname: string * ##### optionaloptions: [DatasetCollectionClientGetOrCreateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetCollectionClientGetOrCreateOptions.md) #### Returns Promise<[Dataset](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Dataset.md)> ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L22)list * ****list**(options): Promise<[DatasetCollectionClientListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#DatasetCollectionClientListResult)> - *** #### Parameters * ##### options: [DatasetCollectionClientListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetCollectionClientListOptions.md) = {} #### Returns Promise<[DatasetCollectionClientListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#DatasetCollectionClientListResult)> --- # InvalidResponseBodyError This error exists for the quite common situation, where only a partial JSON response is received and an attempt to parse the JSON throws an error. In most cases this can be resolved by retrying the request. We do that by identifying this error in HttpClient. The properties mimic AxiosError for easier integration in HttpClient error handling. 
### Hierarchy * Error * *InvalidResponseBodyError* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**cause](#cause) * [**code](#code) * [**message](#message) * [**name](#name) * [**response](#response) * [**stack](#stack) * [**stackTraceLimit](#stackTraceLimit) ### Methods * [**captureStackTrace](#captureStackTrace) * [**prepareStackTrace](#prepareStackTrace) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/apify-client-js/blob/master/src/interceptors.ts#L24)constructor * ****new InvalidResponseBodyError**(response, cause): [InvalidResponseBodyError](https://docs.apify.com/api/client/js/api/client/js/reference/class/InvalidResponseBodyError.md) - Overrides Error.constructor #### Parameters * ##### response: AxiosResponse\ * ##### cause: Error #### Returns [InvalidResponseBodyError](https://docs.apify.com/api/client/js/api/client/js/reference/class/InvalidResponseBodyError.md) ## Properties[**](#Properties) ### [**](#cause)[**](https://github.com/apify/apify-client-js/blob/master/src/interceptors.ts#L22)cause **cause: Error Overrides Error.cause ### [**](#code)[**](https://github.com/apify/apify-client-js/blob/master/src/interceptors.ts#L18)code **code: string ### [**](#message)[**](https://undefined/apify/apify-client-js/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1077)externalinheritedmessage **message: string Inherited from Error.message ### [**](#name)[**](https://undefined/apify/apify-client-js/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1076)externalinheritedname **name: string Inherited from Error.name ### [**](#response)[**](https://github.com/apify/apify-client-js/blob/master/src/interceptors.ts#L20)response **response: AxiosResponse\ ### [**](#stack)[**](https://undefined/apify/apify-client-js/blob/master/website/node_modules/typescript/src/lib.es5.d.ts#L1078)externaloptionalinheritedstack **stack? : string Inherited from Error.stack ### [**](#stackTraceLimit)[**](https://undefined/apify/apify-client-js/blob/master/node_modules/@types/node/globals.d.ts#L161)staticexternalinheritedstackTraceLimit **stackTraceLimit: number Inherited from Error.stackTraceLimit The `Error.stackTraceLimit` property specifies the number of stack frames collected by a stack trace (whether generated by `new Error().stack` or `Error.captureStackTrace(obj)`). The default value is `10` but may be set to any valid JavaScript number. Changes will affect any stack trace captured *after* the value has been changed. If set to a non-number value, or set to a negative number, stack traces will not capture any frames. ## Methods[**](#Methods) ### [**](#captureStackTrace)[**](https://undefined/apify/apify-client-js/blob/master/node_modules/@types/node/globals.d.ts#L145)staticexternalinheritedcaptureStackTrace * ****captureStackTrace**(targetObject, constructorOpt): void - Inherited from Error.captureStackTrace Creates a `.stack` property on `targetObject`, which when accessed returns a string representing the location in the code at which `Error.captureStackTrace()` was called. ``` const myObject = {}; Error.captureStackTrace(myObject); myObject.stack; // Similar to `new Error().stack` ``` The first line of the trace will be prefixed with `${myObject.name}: ${myObject.message}`. The optional `constructorOpt` argument accepts a function. If given, all frames above `constructorOpt`, including `constructorOpt`, will be omitted from the generated stack trace. 
The `constructorOpt` argument is useful for hiding implementation details of error generation from the user. For instance: ``` function a() { b(); } function b() { c(); } function c() { // Create an error without stack trace to avoid calculating the stack trace twice. const { stackTraceLimit } = Error; Error.stackTraceLimit = 0; const error = new Error(); Error.stackTraceLimit = stackTraceLimit; // Capture the stack trace above function b Error.captureStackTrace(error, b); // Neither function c, nor b is included in the stack trace throw error; } a(); ``` *** #### Parameters * ##### externaltargetObject: object * ##### externaloptionalconstructorOpt: Function #### Returns void ### [**](#prepareStackTrace)[**](https://undefined/apify/apify-client-js/blob/master/node_modules/@types/node/globals.d.ts#L149)staticexternalinheritedprepareStackTrace * ****prepareStackTrace**(err, stackTraces): any - Inherited from Error.prepareStackTrace * **@see** *** #### Parameters * ##### externalerr: Error * ##### externalstackTraces: CallSite\[] #### Returns any --- # KeyValueStoreClient ### Hierarchy * ResourceClient * *KeyValueStoreClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**delete](#delete) * [**deleteRecord](#deleteRecord) * [**get](#get) * [**getRecord](#getRecord) * [**listKeys](#listKeys) * [**recordExists](#recordExists) * [**setRecord](#setRecord) * [**update](#update) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L50)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#deleteRecord)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L234)deleteRecord * ****deleteRecord**(key): Promise\ - *** #### Parameters * ##### key: string #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L34)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#getRecord)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L111)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L113)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L118)getRecord * ****getRecord**(key): Promise\> * ****getRecord**\(key, options): Promise\>> - You can use the `buffer` option to get the value in a Buffer (Node.js) or ArrayBuffer (browser) format. In Node.js (not in browser) you can also use the `stream` option to get a Readable stream. When the record does not exist, the function resolves to `undefined`. It does NOT resolve to a `KeyValueStore` record with an `undefined` value. *** #### Parameters * ##### key: string #### Returns Promise\> ### [**](#listKeys)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L57)listKeys * ****listKeys**(options): Promise<[KeyValueClientListKeysResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientListKeysResult.md)> - *** #### Parameters * ##### options: [KeyValueClientListKeysOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientListKeysOptions.md) = {} #### Returns Promise<[KeyValueClientListKeysResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientListKeysResult.md)> ### [**](#recordExists)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L85)recordExists * ****recordExists**(key): Promise\ - Tests whether a record with the given key exists in the key-value store without retrieving its value. *** #### Parameters * ##### key: string The queried record key. #### Returns Promise\ `true` if the record exists, `false` if it does not. ### [**](#setRecord)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L178)setRecord * ****setRecord**(record, options): Promise\ - The value in the record can be a stream object (detected by having the `.pipe` and `.on` methods). However, note that in that case following redirects or retrying the request if it fails (for example due to rate limiting) isn't possible. If you want to keep that behavior, you need to collect the whole stream contents into a Buffer and then send the full buffer. See [this StackOverflow answer](https://stackoverflow.com/a/14269536/7292139) for an example how to do that. 
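A sketch of writing and reading key-value store records, including a binary read using the `buffer` option. The token, store ID, and record keys are placeholders:

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' });
const store = client.keyValueStore('STORE_ID');

// Plain objects are stored as JSON.
await store.setRecord({ key: 'results', value: { pages: 42 } });

const record = await store.getRecord('results');
if (record) console.log(record.value); // { pages: 42 }

// Fetch a record as a Buffer instead of a parsed value (Node.js only for streams).
const screenshot = await store.getRecord('screenshot.png', { buffer: true });
console.log(screenshot?.value.length);
```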
*** #### Parameters * ##### record: [KeyValueStoreRecord](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreRecord.md)\ * ##### options: [KeyValueStoreRecordOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreRecordOptions.md) = {} #### Returns Promise\ ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L41)update * ****update**(newFields): Promise<[KeyValueStore](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStore.md)> - *** #### Parameters * ##### newFields: [KeyValueClientUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueClientUpdateOptions.md) #### Returns Promise<[KeyValueStore](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStore.md)> --- # KeyValueStoreCollectionClient ### Hierarchy * ResourceCollectionClient * *KeyValueStoreCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L41)getOrCreate * ****getOrCreate**(name, options): Promise<[KeyValueStore](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStore.md)> - *** #### Parameters * ##### optionalname: string * ##### optionaloptions: [KeyValueStoreCollectionClientGetOrCreateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreCollectionClientGetOrCreateOptions.md) #### Returns Promise<[KeyValueStore](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStore.md)> ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L22)list * ****list**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[KeyValueStoreCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#KeyValueStoreCollectionListResult)>> - *** #### Parameters * ##### options: [KeyValueStoreCollectionClientListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreCollectionClientListOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[KeyValueStoreCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#KeyValueStoreCollectionListResult)>> --- # LogClient ### Hierarchy * ResourceClient * *LogClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**get](#get) * [**stream](#stream) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/log.ts#L23)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#stream)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/log.ts#L44)stream * ****stream**(): Promise\ - Gets the log in a Readable stream format. Only works in Node.js. *** #### Returns Promise\ --- # RequestQueueClient ### Hierarchy * ResourceClient * *RequestQueueClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**addRequest](#addRequest) * [**batchAddRequests](#batchAddRequests) * [**batchDeleteRequests](#batchDeleteRequests) * [**delete](#delete) * [**deleteRequest](#deleteRequest) * [**deleteRequestLock](#deleteRequestLock) * [**get](#get) * [**getRequest](#getRequest) * [**listAndLockHead](#listAndLockHead) * [**listHead](#listHead) * [**listRequests](#listRequests) * [**paginateRequests](#paginateRequests) * [**prolongRequestLock](#prolongRequestLock) * [**unlockRequests](#unlockRequests) * [**update](#update) * [**updateRequest](#updateRequest) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#addRequest)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L123)addRequest * ****addRequest**(request, options): Promise<[RequestQueueClientAddRequestResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestResult.md)> - *** #### Parameters * ##### request: Omit<[RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientRequestSchema.md), id> * ##### options: [RequestQueueClientAddRequestOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestOptions.md) = {} #### Returns Promise<[RequestQueueClientAddRequestResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestResult.md)> ### [**](#batchAddRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L266)batchAddRequests * ****batchAddRequests**(requests, options): Promise<[RequestQueueClientBatchRequestsOperationResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientBatchRequestsOperationResult.md)> - *** #### Parameters * ##### requests: Omit<[RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientRequestSchema.md), id>\[] * ##### options: [RequestQueueClientBatchAddRequestWithRetriesOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientBatchAddRequestWithRetriesOptions.md) = {} #### Returns Promise<[RequestQueueClientBatchRequestsOperationResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientBatchRequestsOperationResult.md)> ### [**](#batchDeleteRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L330)batchDeleteRequests * ****batchDeleteRequests**(requests): Promise<[RequestQueueClientBatchRequestsOperationResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientBatchRequestsOperationResult.md)> - *** #### Parameters * ##### requests: [RequestQueueClientRequestToDelete](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueClientRequestToDelete)\[] #### Returns Promise<[RequestQueueClientBatchRequestsOperationResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientBatchRequestsOperationResult.md)> ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L64)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#deleteRequest)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L412)deleteRequest * ****deleteRequest**(id): Promise\ - #### Parameters * ##### id: string #### Returns Promise\ ### [**](#deleteRequestLock)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L458)deleteRequestLock * ****deleteRequestLock**(id, options): Promise\ - *** #### Parameters * ##### id: string * ##### options: 
[RequestQueueClientDeleteRequestLockOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientDeleteRequestLockOptions.md) = {} #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L48)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#getRequest)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L359)getRequest * ****getRequest**(id): Promise\ - *** #### Parameters * ##### id: string #### Returns Promise\ ### [**](#listAndLockHead)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L95)listAndLockHead * ****listAndLockHead**(options): Promise<[RequestQueueClientListAndLockHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListAndLockHeadResult.md)> - *** #### Parameters * ##### options: [RequestQueueClientListAndLockHeadOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListAndLockHeadOptions.md) #### Returns Promise<[RequestQueueClientListAndLockHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListAndLockHeadResult.md)> ### [**](#listHead)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L71)listHead * ****listHead**(options): Promise<[RequestQueueClientListHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListHeadResult.md)> - *** #### Parameters * ##### options: [RequestQueueClientListHeadOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListHeadOptions.md) = {} #### Returns Promise<[RequestQueueClientListHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListHeadResult.md)> ### [**](#listRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L481)listRequests * ****listRequests**(options): Promise<[RequestQueueClientListRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListRequestsResult.md)> - *** #### Parameters * ##### options: [RequestQueueClientListRequestsOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListRequestsOptions.md) = {} #### Returns Promise<[RequestQueueClientListRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListRequestsResult.md)> ### [**](#paginateRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L530)paginateRequests * ****paginateRequests**(options): [RequestQueueRequestsAsyncIterable](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueRequestsAsyncIterable)<[RequestQueueClientListRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListRequestsResult.md)> - Usage: for await (const { items } of client.paginateRequests({ limit: 10 })) { items.forEach((request) => console.log(request)); } *** #### Parameters * ##### options: [RequestQueueClientPaginateRequestsOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientPaginateRequestsOptions.md) = {} #### Returns 
[RequestQueueRequestsAsyncIterable](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueRequestsAsyncIterable)<[RequestQueueClientListRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListRequestsResult.md)> ### [**](#prolongRequestLock)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L428)prolongRequestLock * ****prolongRequestLock**(id, options): Promise<[RequestQueueClientProlongRequestLockResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientProlongRequestLockResult.md)> - *** #### Parameters * ##### id: string * ##### options: [RequestQueueClientProlongRequestLockOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientProlongRequestLockOptions.md) #### Returns Promise<[RequestQueueClientProlongRequestLockResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientProlongRequestLockResult.md)> ### [**](#unlockRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L509)unlockRequests * ****unlockRequests**(): Promise<[RequestQueueClientUnlockRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientUnlockRequestsResult.md)> - *** #### Returns Promise<[RequestQueueClientUnlockRequestsResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientUnlockRequestsResult.md)> ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L55)update * ****update**(newFields): Promise<[RequestQueue](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueue.md)> - *** #### Parameters * ##### newFields: [RequestQueueClientUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientUpdateOptions.md) #### Returns Promise<[RequestQueue](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueue.md)> ### [**](#updateRequest)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L380)updateRequest * ****updateRequest**(request, options): Promise<[RequestQueueClientAddRequestResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestResult.md)> - *** #### Parameters * ##### request: [RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientRequestSchema.md) * ##### options: [RequestQueueClientAddRequestOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestOptions.md) = {} #### Returns Promise<[RequestQueueClientAddRequestResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientAddRequestResult.md)> --- # RequestQueueCollectionClient ### Hierarchy * ResourceCollectionClient * *RequestQueueCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**getOrCreate](#getOrCreate) * [**list](#list) ## Properties[**](#Properties) ### 
[**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#getOrCreate)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue_collection.ts#L39)getOrCreate * ****getOrCreate**(name): Promise<[RequestQueue](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueue.md)> - *** #### Parameters * ##### optionalname: string #### Returns Promise<[RequestQueue](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueue.md)> ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue_collection.ts#L22)list * ****list**(options): Promise<[RequestQueueCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueCollectionListResult)> - *** #### Parameters * ##### options: [RequestQueueCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueCollectionListOptions.md) = {} #### Returns Promise<[RequestQueueCollectionListResult](https://docs.apify.com/api/client/js/api/client/js/reference.md#RequestQueueCollectionListResult)> --- # RunClient ### Hierarchy * ResourceClient * *RunClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**abort](#abort) * [**charge](#charge) * [**dataset](#dataset) * [**delete](#delete) * [**get](#get) * [**keyValueStore](#keyValueStore) * [**log](#log) * [**metamorph](#metamorph) * [**reboot](#reboot) * [**requestQueue](#requestQueue) * [**resurrect](#resurrect) * [**update](#update) * [**waitForFinish](#waitForFinish) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: 
[ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#abort)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L46)abort * ****abort**(options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - *** #### Parameters * ##### options: [RunAbortOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunAbortOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#charge)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L159)charge * ****charge**(options): Promise\>> - *** #### Parameters * ##### options: [RunChargeOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunChargeOptions.md) #### Returns Promise\>> ### [**](#dataset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L219)dataset * ****dataset**(): [DatasetClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetClient.md)\> - This also works through `actorClient.lastRun().dataset()`. *** #### Returns [DatasetClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/DatasetClient.md)\> ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L66)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L32)get * ****get**(options): Promise\ - *** #### Parameters * ##### options: [RunGetOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunGetOptions.md) = {} #### Returns Promise\ ### [**](#keyValueStore)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L233)keyValueStore * ****keyValueStore**(): [KeyValueStoreClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreClient.md) - This also works through `actorClient.lastRun().keyValueStore()`. 
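For example, here is a minimal sketch of reading a run's default storages through these accessors (the run ID, record key, and environment variable are placeholders, not values from this reference):

```typescript
import { ApifyClient } from 'apify-client';

// Placeholder token and run ID; adjust to your own account.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const runClient = client.run('my-run-id');

// Default key-value store of the run, no store ID needed.
const record = await runClient.keyValueStore().getRecord('OUTPUT');
console.log(record?.value);

// The same pattern works for the run's default dataset.
const { items } = await runClient.dataset().listItems({ limit: 10 });
console.log(items);
```

Because these accessors return regular `KeyValueStoreClient` and `DatasetClient` instances, all of their methods described elsewhere in this reference are available on them.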
*** #### Returns [KeyValueStoreClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/KeyValueStoreClient.md) ### [**](#log)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L261)log * ****log**(): [LogClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/LogClient.md) - This also works through `actorClient.lastRun().log()`. *** #### Returns [LogClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/LogClient.md) ### [**](#metamorph)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L73)metamorph * ****metamorph**(targetActorId, input, options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - *** #### Parameters * ##### targetActorId: string * ##### input: unknown * ##### options: [RunMetamorphOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunMetamorphOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#reboot)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L116)reboot * ****reboot**(): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - *** #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#requestQueue)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L247)requestQueue * ****requestQueue**(): [RequestQueueClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueClient.md) - This also works through `actorClient.lastRun().requestQueue()`. *** #### Returns [RequestQueueClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RequestQueueClient.md) ### [**](#resurrect)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L135)resurrect * ****resurrect**(options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - *** #### Parameters * ##### options: [RunResurrectOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunResurrectOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L126)update * ****update**(newFields): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - #### Parameters * ##### newFields: [RunUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunUpdateOptions.md) #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#waitForFinish)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L202)waitForFinish * ****waitForFinish**(options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - Returns a promise that resolves with the finished Run object when the provided actor run finishes or with the unfinished Run object when the `waitSecs` timeout lapses. The promise is NOT rejected based on run status. You can inspect the `status` property of the Run object to find out its status. 
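As an illustration, a minimal sketch of waiting for a run with a bounded `waitSecs` and then inspecting its status (the run ID is a placeholder):

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Wait at most 60 seconds; the returned run may still be running.
const run = await client.run('my-run-id').waitForFinish({ waitSecs: 60 });

if (run.status === 'SUCCEEDED') {
    console.log(`Run ${run.id} finished.`);
} else {
    console.log(`Run ${run.id} is not finished yet (status: ${run.status}).`);
}
```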
The difference between this function and the `waitForFinish` parameter of the `get` method is that this function can wait indefinitely. Its use is preferable to the `waitForFinish` parameter alone, which it uses internally. This is useful when you need to chain Actor executions. A similar effect can be achieved with webhooks, so be sure to review which technique fits your use case better. *** #### Parameters * ##### options: [RunWaitForFinishOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunWaitForFinishOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> --- # RunCollectionClient ### Hierarchy * ResourceCollectionClient * *RunCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId?
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run_collection.ts#L24)list * ****list**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[ActorRunListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunListItem.md)>> - *** #### Parameters * ##### options: [RunCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RunCollectionListOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[ActorRunListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunListItem.md)>> --- # ScheduleClient ### Hierarchy * ResourceClient * *ScheduleClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**delete](#delete) * [**get](#get) * [**getLog](#getLog) * [**update](#update) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L40)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L25)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#getLog)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L47)getLog * ****getLog**(): Promise\ - *** #### Returns Promise\ ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L32)update * ****update**(newFields): Promise<[Schedule](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Schedule.md)> - *** #### Parameters * ##### newFields: Partial\ & { actions: DistributiveOptional<[ScheduleAction](https://docs.apify.com/api/client/js/api/client/js/reference.md#ScheduleAction), id>\[] }> #### Returns Promise<[Schedule](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Schedule.md)> --- # ScheduleCollectionClient ### Hierarchy * ResourceCollectionClient * *ScheduleCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**create](#create) * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#create)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule_collection.ts#L38)create * ****create**(schedule): Promise<[Schedule](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Schedule.md)> - *** #### Parameters * ##### optionalschedule: Partial\ & { actions: DistributiveOptional<[ScheduleAction](https://docs.apify.com/api/client/js/api/client/js/reference.md#ScheduleAction), id>\[] }> #### Returns Promise<[Schedule](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Schedule.md)> ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule_collection.ts#L22)list * ****list**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[Schedule](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Schedule.md)>> - *** #### Parameters * ##### options: [ScheduleCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduleCollectionListOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[Schedule](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Schedule.md)>> --- # StoreCollectionClient ### Hierarchy * ResourceCollectionClient * *StoreCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L22)list * ****list**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[ActorStoreList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStoreList.md)>> - *** #### Parameters * ##### options: [StoreCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/StoreCollectionListOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[ActorStoreList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStoreList.md)>> --- # TaskClient ### Hierarchy * ResourceClient * *TaskClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**call](#call) * [**delete](#delete) * [**get](#get) * [**getInput](#getInput) * [**lastRun](#lastRun) * [**runs](#runs) * [**start](#start) * [**update](#update) * [**updateInput](#updateInput) * [**webhooks](#webhooks) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#call)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L103)call * ****call**(input, options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - Starts a task and waits for it to finish before returning the Run object. It waits indefinitely, unless the `waitSecs` option is provided. 
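For example, a minimal sketch of calling a task and reading its default dataset afterwards (the task ID and input fields are placeholders; `waitSecs` is optional):

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// call() resolves once the run finishes, or after waitSecs elapses.
const run = await client.task('username~my-task').call(
    { maxItems: 100 },   // placeholder input overriding the task defaults
    { waitSecs: 120 },   // optional; omit it to wait indefinitely
);

// Read the run's results from its default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Run ${run.id} produced ${items.length} items.`);
```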
*** #### Parameters * ##### optionalinput: [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) * ##### options: [TaskCallOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskCallOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L46)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L30)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#getInput)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L131)getInput * ****getInput**(): Promise\ - *** #### Returns Promise\ ### [**](#lastRun)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L164)lastRun * ****lastRun**(options): [RunClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunClient.md) - *** #### Parameters * ##### options: [TaskLastRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskLastRunOptions.md) = {} #### Returns [RunClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunClient.md) ### [**](#runs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L185)runs * ****runs**(): [RunCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunCollectionClient.md) - *** #### Returns [RunCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/RunCollectionClient.md) ### [**](#start)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L54)start * ****start**(input, options): Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> - Starts a task and immediately returns the Run object. 
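By contrast, a minimal sketch of the non-blocking variant (placeholder task ID again):

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// start() returns as soon as the run is created and does not wait for it.
const run = await client.task('username~my-task').start();
console.log(`Started run ${run.id}, current status: ${run.status}`);

// The run can be picked up later, e.g. with client.run(run.id).waitForFinish().
```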
*** #### Parameters * ##### optionalinput: [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) * ##### options: [TaskStartOptions](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskStartOptions) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md)> ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L37)update * ****update**(newFields): Promise<[Task](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Task.md)> - *** #### Parameters * ##### newFields: Partial\> #### Returns Promise<[Task](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Task.md)> ### [**](#updateInput)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L150)updateInput * ****updateInput**(newFields): Promise<[Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) | [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary)\[]> - *** #### Parameters * ##### newFields: [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) | [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary)\[] #### Returns Promise<[Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) | [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary)\[]> ### [**](#webhooks)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L196)webhooks * ****webhooks**(): [WebhookCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookCollectionClient.md) - *** #### Returns [WebhookCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookCollectionClient.md) --- # TaskCollectionClient ### Hierarchy * ResourceCollectionClient * *TaskCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**create](#create) * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? 
: Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#create)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task_collection.ts#L43)create * ****create**(task): Promise<[Task](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Task.md)> - *** #### Parameters * ##### task: [TaskCreateData](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskCreateData.md) #### Returns Promise<[Task](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Task.md)> ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task_collection.ts#L27)list * ****list**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[TaskList](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskList)>> - *** #### Parameters * ##### optionaloptions: [TaskCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskCollectionListOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[TaskList](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskList)>> --- # UserClient ### Hierarchy * ResourceClient * *UserClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**get](#get) * [**limits](#limits) * [**monthlyUsage](#monthlyUsage) * [**updateLimits](#updateLimits) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? 
: Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? : string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L23)get * ****get**(): Promise<[User](https://docs.apify.com/api/client/js/api/client/js/reference/interface/User.md)> - Depending on whether ApifyClient was created with a token, the method will either return public or private user data. *** #### Returns Promise<[User](https://docs.apify.com/api/client/js/api/client/js/reference/interface/User.md)> ### [**](#limits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L55)limits * ****limits**(): Promise\ - *** #### Returns Promise\ ### [**](#monthlyUsage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L30)monthlyUsage * ****monthlyUsage**(): Promise\ - *** #### Returns Promise\ ### [**](#updateLimits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L74)updateLimits * ****updateLimits**(options): Promise\ - *** #### Parameters * ##### options: [LimitsUpdateOptions](https://docs.apify.com/api/client/js/api/client/js/reference.md#LimitsUpdateOptions) #### Returns Promise\ --- # WebhookClient ### Hierarchy * ResourceClient * *WebhookClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**delete](#delete) * [**dispatches](#dispatches) * [**get](#get) * [**test](#test) * [**update](#update) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#delete)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L43)delete * ****delete**(): Promise\ - *** #### Returns Promise\ ### [**](#dispatches)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L70)dispatches * ****dispatches**(): [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchCollectionClient.md) - *** #### Returns [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/WebhookDispatchCollectionClient.md) ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L27)get * ****get**(): Promise\ - *** #### Returns Promise\ ### [**](#test)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L50)test * ****test**(): Promise\ - *** #### Returns Promise\ ### [**](#update)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L34)update * ****update**(newFields): Promise<[Webhook](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Webhook.md)> - *** #### Parameters * ##### newFields: [WebhookUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookUpdateData) #### Returns Promise<[Webhook](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Webhook.md)> --- # WebhookCollectionClient ### Hierarchy * ResourceCollectionClient * *WebhookCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**create](#create) * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#create)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_collection.ts#L40)create * ****create**(webhook): Promise<[Webhook](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Webhook.md)> - *** #### Parameters * ##### optionalwebhook: [WebhookUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookUpdateData) #### Returns Promise<[Webhook](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Webhook.md)> ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_collection.ts#L22)list * ****list**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)\>> - *** #### Parameters * ##### options: [WebhookCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookCollectionListOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)\>> --- # WebhookDispatchClient ### Hierarchy * ResourceClient * *WebhookDispatchClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**get](#get) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceClient.url ## Methods[**](#Methods) ### [**](#get)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L19)get * ****get**(): Promise\ - *** #### Returns Promise\ --- # WebhookDispatchCollectionClient ### Hierarchy * ResourceCollectionClient * *WebhookDispatchCollectionClient* ## Index[**](#Index) ### Properties * [**apifyClient](#apifyClient) * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**id](#id) * [**params](#params) * [**resourcePath](#resourcePath) * [**safeId](#safeId) * [**url](#url) ### Methods * [**list](#list) ## Properties[**](#Properties) ### [**](#apifyClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L32)inheritedapifyClient **apifyClient: [ApifyClient](https://docs.apify.com/api/client/js/api/client/js/reference/class/ApifyClient.md) Inherited from ResourceCollectionClient.apifyClient ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L26)inheritedbaseUrl **baseUrl: string Inherited from ResourceCollectionClient.baseUrl ### [**](#httpClient)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L34)inheritedhttpClient **httpClient: HttpClient Inherited from ResourceCollectionClient.httpClient ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L22)optionalinheritedid **id? : string Inherited from ResourceCollectionClient.id ### [**](#params)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L36)optionalinheritedparams **params? : Record\ Inherited from ResourceCollectionClient.params ### [**](#resourcePath)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L28)inheritedresourcePath **resourcePath: string Inherited from ResourceCollectionClient.resourcePath ### [**](#safeId)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L24)optionalinheritedsafeId **safeId? 
: string Inherited from ResourceCollectionClient.safeId ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/base/api_client.ts#L30)inheritedurl **url: string Inherited from ResourceCollectionClient.url ## Methods[**](#Methods) ### [**](#list)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch_collection.ts#L22)list * ****list**(options): Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[WebhookDispatch](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookDispatch.md)>> - *** #### Parameters * ##### options: [WebhookDispatchCollectionListOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookDispatchCollectionListOptions.md) = {} #### Returns Promise<[PaginatedList](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PaginatedList.md)<[WebhookDispatch](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookDispatch.md)>> --- # ActorListSortBy ## Index[**](#Index) ### Enumeration Members * [**CREATED\_AT](#CREATED_AT) * [**LAST\_RUN\_STARTED\_AT](#LAST_RUN_STARTED_AT) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#CREATED_AT)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L49)CREATED\_AT **CREATED\_AT: createdAt ### [**](#LAST_RUN_STARTED_AT)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L50)LAST\_RUN\_STARTED\_AT **LAST\_RUN\_STARTED\_AT: stats.lastRunStartedAt --- # DownloadItemsFormat ## Index[**](#Index) ### Enumeration Members * [**CSV](#CSV) * [**HTML](#HTML) * [**JSON](#JSON) * [**JSONL](#JSONL) * [**RSS](#RSS) * [**XLSX](#XLSX) * [**XML](#XML) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#CSV)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L228)CSV **CSV: csv ### [**](#HTML)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L227)HTML **HTML: html ### [**](#JSON)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L224)JSON **JSON: json ### [**](#JSONL)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L225)JSONL **JSONL: jsonl ### [**](#RSS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L230)RSS **RSS: rss ### [**](#XLSX)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L229)XLSX **XLSX: xlsx ### [**](#XML)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L226)XML **XML: xml --- # PlatformFeature ## Index[**](#Index) ### Enumeration Members * [**Actors](#Actors) * [**Proxy](#Proxy) * [**ProxyExternalAccess](#ProxyExternalAccess) * [**ProxySERPS](#ProxySERPS) * [**Scheduler](#Scheduler) * [**Storage](#Storage) * [**Webhooks](#Webhooks) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#Actors)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L146)Actors **Actors: ACTORS ### [**](#Proxy)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L151)Proxy **Proxy: PROXY ### [**](#ProxyExternalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L152)ProxyExternalAccess **ProxyExternalAccess: PROXY\_EXTERNAL\_ACCESS ### 
[**](#ProxySERPS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L148)ProxySERPS **ProxySERPS: PROXY\_SERPS ### [**](#Scheduler)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L149)Scheduler **Scheduler: SCHEDULER ### [**](#Storage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L147)Storage **Storage: STORAGE ### [**](#Webhooks)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L150)Webhooks **Webhooks: WEBHOOKS --- # ScheduleActions ## Index[**](#Index) ### Enumeration Members * [**RunActor](#RunActor) * [**RunActorTask](#RunActorTask) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#RunActor)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L94)RunActor **RunActor: RUN\_ACTOR ### [**](#RunActorTask)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L95)RunActorTask **RunActorTask: RUN\_ACTOR\_TASK --- # WebhookDispatchStatus ## Index[**](#Index) ### Enumeration Members * [**Active](#Active) * [**Failed](#Failed) * [**Succeeded](#Succeeded) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#Active)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L36)Active **Active: ACTIVE ### [**](#Failed)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L38)Failed **Failed: FAILED ### [**](#Succeeded)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L37)Succeeded **Succeeded: SUCCEEDED --- # AccountAndUsageLimits ## Index[**](#Index) ### Properties * [**current](#current) * [**limits](#limits) * [**monthlyUsageCycle](#monthlyUsageCycle) ## Properties[**](#Properties) ### [**](#current)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L229)current **current: [Current](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Current.md) ### [**](#limits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L228)limits **limits: [Limits](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Limits.md) ### [**](#monthlyUsageCycle)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L227)monthlyUsageCycle **monthlyUsageCycle: [MonthlyUsageCycle](https://docs.apify.com/api/client/js/api/client/js/reference/interface/MonthlyUsageCycle.md) --- # Actor ## Index[**](#Index) ### Properties * [**actorStandby](#actorStandby) * [**categories](#categories) * [**createdAt](#createdAt) * [**defaultRunOptions](#defaultRunOptions) * [**deploymentKey](#deploymentKey) * [**description](#description) * [**exampleRunInput](#exampleRunInput) * [**id](#id) * [**isAnonymouslyRunnable](#isAnonymouslyRunnable) * [**isDeprecated](#isDeprecated) * [**isPublic](#isPublic) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**pricingInfos](#pricingInfos) * [**restartOnError](#restartOnError) * [**seoDescription](#seoDescription) * [**seoTitle](#seoTitle) * [**stats](#stats) * [**taggedBuilds](#taggedBuilds) * [**title](#title) * [**userId](#userId) * [**username](#username) * [**versions](#versions) ## Properties[**](#Properties) ### [**](#actorStandby)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L284)optionalactorStandby **actorStandby? 
: [ActorStandby](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStandby.md) & { isEnabled: boolean } ### [**](#categories)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L283)optionalcategories **categories? : string\[] ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L270)createdAt **createdAt: Date ### [**](#defaultRunOptions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L275)defaultRunOptions **defaultRunOptions: [ActorDefaultRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorDefaultRunOptions.md) ### [**](#deploymentKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L278)deploymentKey **deploymentKey: string ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L266)optionaldescription **description? : string ### [**](#exampleRunInput)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L276)optionalexampleRunInput **exampleRunInput? : [ActorExampleRunInput](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorExampleRunInput.md) ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L262)id **id: string ### [**](#isAnonymouslyRunnable)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L269)optionalisAnonymouslyRunnable **isAnonymouslyRunnable? : boolean ### [**](#isDeprecated)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L277)optionalisDeprecated **isDeprecated? : boolean ### [**](#isPublic)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L268)isPublic **isPublic: boolean ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L271)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L264)name **name: string ### [**](#pricingInfos)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L274)optionalpricingInfos **pricingInfos? : [ActorRunPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorRunPricingInfo)\[] ### [**](#restartOnError)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L267)optionalrestartOnError **restartOnError? : boolean ### [**](#seoDescription)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L282)optionalseoDescription **seoDescription? : string ### [**](#seoTitle)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L281)optionalseoTitle **seoTitle? : string ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L272)stats **stats: [ActorStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStats.md) ### [**](#taggedBuilds)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L280)optionaltaggedBuilds **taggedBuilds? 
: [ActorTaggedBuilds](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorTaggedBuilds) ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L279)optionaltitle **title? : string ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L263)userId **userId: string ### [**](#username)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L265)username **username: string ### [**](#versions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L273)versions **versions: ActorVersion\[] --- # ActorBuildOptions ## Index[**](#Index) ### Properties * [**betaPackages](#betaPackages) * [**tag](#tag) * [**useCache](#useCache) * [**waitForFinish](#waitForFinish) ## Properties[**](#Properties) ### [**](#betaPackages)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L482)optionalbetaPackages **betaPackages? : boolean ### [**](#tag)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L483)optionaltag **tag? : string ### [**](#useCache)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L484)optionaluseCache **useCache? : boolean ### [**](#waitForFinish)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L485)optionalwaitForFinish **waitForFinish? : number --- # ActorCallOptions ### Hierarchy * Omit<[ActorStartOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStartOptions.md), waitForFinish> * *ActorCallOptions* ## Index[**](#Index) ### Properties * [**build](#build) * [**contentType](#contentType) * [**maxItems](#maxItems) * [**memory](#memory) * [**timeout](#timeout) * [**waitSecs](#waitSecs) * [**webhooks](#webhooks) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L350)optionalinheritedbuild **build? : string Inherited from Omit.build Tag or number of the actor build to run (e.g. `beta` or `1.2.345`). If not provided, the run uses build tag or number from the default actor run configuration (typically `latest`). ### [**](#contentType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L358)optionalinheritedcontentType **contentType? : string Inherited from Omit.contentType Content type for the `input`. If not specified, `input` is expected to be an object that will be stringified to JSON and content type set to `application/json; charset=utf-8`. If `options.contentType` is specified, then `input` must be a `String` or `Buffer`. ### [**](#maxItems)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L393)optionalinheritedmaxItems **maxItems? : number Inherited from Omit.maxItems Specifies maximum number of items that the actor run should return. This is used by pay per result actors to limit the maximum number of results that will be charged to customer. Value can be accessed in actor run using `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable. ### [**](#memory)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L364)optionalinheritedmemory **memory? : number Inherited from Omit.memory Memory in megabytes which will be allocated for the new actor run. If not provided, the run uses memory of the default actor run configuration. 
### [**](#timeout)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L369)optionalinheritedtimeout **timeout? : number Inherited from Omit.timeout Timeout for the actor run in seconds. Zero value means there is no timeout. If not provided, the run uses timeout of the default actor run configuration. ### [**](#waitSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L399)optionalwaitSecs **waitSecs? : number ### [**](#webhooks)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L386)optionalinheritedwebhooks **webhooks? : readonly [WebhookUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookUpdateData)\[] Inherited from Omit.webhooks Specifies optional webhooks associated with the actor run, which can be used to receive a notification e.g. when the actor finished or failed, see [ad hook webhooks documentation](https://docs.apify.com/webhooks/ad-hoc-webhooks) for detailed description. --- # ActorChargeEvent ## Index[**](#Index) ### Properties * [**eventDescription](#eventDescription) * [**eventPriceUsd](#eventPriceUsd) * [**eventTitle](#eventTitle) ## Properties[**](#Properties) ### [**](#eventDescription)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L545)optionaleventDescription **eventDescription? : string ### [**](#eventPriceUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L543)eventPriceUsd **eventPriceUsd: number ### [**](#eventTitle)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L544)eventTitle **eventTitle: string --- # ActorCollectionCreateOptions ## Index[**](#Index) ### Properties * [**actorStandby](#actorStandby) * [**categories](#categories) * [**defaultRunOptions](#defaultRunOptions) * [**description](#description) * [**exampleRunInput](#exampleRunInput) * [**isDeprecated](#isDeprecated) * [**isPublic](#isPublic) * [**name](#name) * [**restartOnError](#restartOnError) * [**seoDescription](#seoDescription) * [**seoTitle](#seoTitle) * [**title](#title) * [**versions](#versions) ## Properties[**](#Properties) ### [**](#actorStandby)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L84)optionalactorStandby **actorStandby? : [ActorStandby](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStandby.md) & { isEnabled: boolean } ### [**](#categories)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L72)optionalcategories **categories? : string\[] ### [**](#defaultRunOptions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L73)optionaldefaultRunOptions **defaultRunOptions? : [ActorDefaultRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorDefaultRunOptions.md) ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L74)optionaldescription **description? : string ### [**](#exampleRunInput)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L75)optionalexampleRunInput **exampleRunInput? 
: [ActorExampleRunInput](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorExampleRunInput.md) ### [**](#isDeprecated)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L76)optionalisDeprecated **isDeprecated? : boolean ### [**](#isPublic)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L77)optionalisPublic **isPublic? : boolean ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L78)optionalname **name? : string ### [**](#restartOnError)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L79)optionalrestartOnError **restartOnError? : boolean ### [**](#seoDescription)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L81)optionalseoDescription **seoDescription? : string ### [**](#seoTitle)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L80)optionalseoTitle **seoTitle? : string ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L82)optionaltitle **title? : string ### [**](#versions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L83)optionalversions **versions? : ActorVersion\[] --- # ActorCollectionListItem ## Index[**](#Index) ### Properties * [**createdAt](#createdAt) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**username](#username) ## Properties[**](#Properties) ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L63)createdAt **createdAt: Date ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L62)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L64)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L65)name **name: string ### [**](#username)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L66)username **username: string --- # ActorCollectionListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**my](#my) * [**offset](#offset) * [**sortBy](#sortBy) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L57)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L55)optionallimit **limit? : number ### [**](#my)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L54)optionalmy **my? : boolean ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L56)optionaloffset **offset? : number ### [**](#sortBy)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor_collection.ts#L58)optionalsortBy **sortBy? 
: [ActorListSortBy](https://docs.apify.com/api/client/js/api/client/js/reference/enum/ActorListSortBy.md) --- # ActorDefaultRunOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**memoryMbytes](#memoryMbytes) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L301)build **build: string ### [**](#memoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L303)memoryMbytes **memoryMbytes: number ### [**](#timeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L302)timeoutSecs **timeoutSecs: number --- # ActorDefinition ## Index[**](#Index) ### Properties * [**actorSpecification](#actorSpecification) * [**buildTag](#buildTag) * [**changelog](#changelog) * [**dockerContextDir](#dockerContextDir) * [**dockerfile](#dockerfile) * [**environmentVariables](#environmentVariables) * [**input](#input) * [**maxMemoryMbytes](#maxMemoryMbytes) * [**minMemoryMbytes](#minMemoryMbytes) * [**name](#name) * [**readme](#readme) * [**storages](#storages) * [**usesStandbyMode](#usesStandbyMode) * [**version](#version) ## Properties[**](#Properties) ### [**](#actorSpecification)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L493)actorSpecification **actorSpecification: number ### [**](#buildTag)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L496)optionalbuildTag **buildTag? : string ### [**](#changelog)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L502)optionalchangelog **changelog? : null | string ### [**](#dockerContextDir)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L499)optionaldockerContextDir **dockerContextDir? : string ### [**](#dockerfile)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L498)optionaldockerfile **dockerfile? : string ### [**](#environmentVariables)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L497)optionalenvironmentVariables **environmentVariables? : Record\ ### [**](#input)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L501)optionalinput **input? : null | object ### [**](#maxMemoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L507)optionalmaxMemoryMbytes **maxMemoryMbytes? : number ### [**](#minMemoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L506)optionalminMemoryMbytes **minMemoryMbytes? : number ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L494)name **name: string ### [**](#readme)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L500)optionalreadme **readme? : null | string ### [**](#storages)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L503)optionalstorages **storages? : { dataset? : object } #### Type declaration * ##### optionaldataset?: object ### [**](#usesStandbyMode)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L508)optionalusesStandbyMode **usesStandbyMode? 
: boolean ### [**](#version)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L495)version **version: string --- # ActorExampleRunInput ## Index[**](#Index) ### Properties * [**body](#body) * [**contentType](#contentType) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L307)body **body: string ### [**](#contentType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L308)contentType **contentType: string --- # ActorLastRunOptions ## Index[**](#Index) ### Properties * [**status](#status) ## Properties[**](#Properties) ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L489)optionalstatus **status? : READY | RUNNING | SUCCEEDED | FAILED | TIMING\_OUT | TIMED\_OUT | ABORTING | ABORTED --- # ActorRun ### Hierarchy * [ActorRunListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunListItem.md) * *ActorRun* ## Index[**](#Index) ### Properties * [**actId](#actId) * [**actorTaskId](#actorTaskId) * [**buildId](#buildId) * [**buildNumber](#buildNumber) * [**chargedEventCounts](#chargedEventCounts) * [**containerUrl](#containerUrl) * [**defaultDatasetId](#defaultDatasetId) * [**defaultKeyValueStoreId](#defaultKeyValueStoreId) * [**defaultRequestQueueId](#defaultRequestQueueId) * [**exitCode](#exitCode) * [**finishedAt](#finishedAt) * [**generalAccess](#generalAccess) * [**gitBranchName](#gitBranchName) * [**id](#id) * [**isContainerServerReady](#isContainerServerReady) * [**meta](#meta) * [**options](#options) * [**pricingInfo](#pricingInfo) * [**startedAt](#startedAt) * [**stats](#stats) * [**status](#status) * [**statusMessage](#statusMessage) * [**usage](#usage) * [**usageTotalUsd](#usageTotalUsd) * [**usageUsd](#usageUsd) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L404)inheritedactId **actId: string Inherited from ActorRunListItem.actId ### [**](#actorTaskId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L405)optionalinheritedactorTaskId **actorTaskId? : string Inherited from ActorRunListItem.actorTaskId ### [**](#buildId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L410)inheritedbuildId **buildId: string Inherited from ActorRunListItem.buildId ### [**](#buildNumber)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L411)inheritedbuildNumber **buildNumber: string Inherited from ActorRunListItem.buildNumber ### [**](#chargedEventCounts)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L430)optionalchargedEventCounts **chargedEventCounts? 
: Record\ ### [**](#containerUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L424)containerUrl **containerUrl: string ### [**](#defaultDatasetId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L413)inheriteddefaultDatasetId **defaultDatasetId: string Inherited from ActorRunListItem.defaultDatasetId ### [**](#defaultKeyValueStoreId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L412)inheriteddefaultKeyValueStoreId **defaultKeyValueStoreId: string Inherited from ActorRunListItem.defaultKeyValueStoreId ### [**](#defaultRequestQueueId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L414)inheriteddefaultRequestQueueId **defaultRequestQueueId: string Inherited from ActorRunListItem.defaultRequestQueueId ### [**](#exitCode)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L423)optionalexitCode **exitCode? : number ### [**](#finishedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L407)inheritedfinishedAt **finishedAt: Date Inherited from ActorRunListItem.finishedAt ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L431)optionalgeneralAccess **generalAccess? : null | RUN\_GENERAL\_ACCESS ### [**](#gitBranchName)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L426)optionalgitBranchName **gitBranchName? : string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L403)inheritedid **id: string Inherited from ActorRunListItem.id ### [**](#isContainerServerReady)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L425)optionalisContainerServerReady **isContainerServerReady? : boolean ### [**](#meta)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L409)inheritedmeta **meta: [ActorRunMeta](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunMeta.md) Inherited from ActorRunListItem.meta ### [**](#options)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L422)options **options: [ActorRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunOptions.md) ### [**](#pricingInfo)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L429)optionalpricingInfo **pricingInfo? : [ActorRunPricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorRunPricingInfo) ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L406)inheritedstartedAt **startedAt: Date Inherited from ActorRunListItem.startedAt ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L421)stats **stats: [ActorRunStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunStats.md) ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L408)inheritedstatus **status: READY | RUNNING | SUCCEEDED | FAILED | ABORTING | ABORTED | TIMING-OUT | TIMED-OUT Inherited from ActorRunListItem.status ### [**](#statusMessage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L420)optionalstatusMessage **statusMessage? 
: string ### [**](#usage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L427)optionalusage **usage? : [ActorRunUsage](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunUsage.md) ### [**](#usageTotalUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L415)optionalinheritedusageTotalUsd **usageTotalUsd? : number Inherited from ActorRunListItem.usageTotalUsd ### [**](#usageUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L428)optionalusageUsd **usageUsd? : [ActorRunUsage](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunUsage.md) ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L419)userId **userId: string --- # ActorRunListItem ### Hierarchy * *ActorRunListItem* * [ActorRun](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRun.md) ## Index[**](#Index) ### Properties * [**actId](#actId) * [**actorTaskId](#actorTaskId) * [**buildId](#buildId) * [**buildNumber](#buildNumber) * [**defaultDatasetId](#defaultDatasetId) * [**defaultKeyValueStoreId](#defaultKeyValueStoreId) * [**defaultRequestQueueId](#defaultRequestQueueId) * [**finishedAt](#finishedAt) * [**id](#id) * [**meta](#meta) * [**startedAt](#startedAt) * [**status](#status) * [**usageTotalUsd](#usageTotalUsd) ## Properties[**](#Properties) ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L404)actId **actId: string ### [**](#actorTaskId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L405)optionalactorTaskId **actorTaskId? : string ### [**](#buildId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L410)buildId **buildId: string ### [**](#buildNumber)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L411)buildNumber **buildNumber: string ### [**](#defaultDatasetId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L413)defaultDatasetId **defaultDatasetId: string ### [**](#defaultKeyValueStoreId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L412)defaultKeyValueStoreId **defaultKeyValueStoreId: string ### [**](#defaultRequestQueueId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L414)defaultRequestQueueId **defaultRequestQueueId: string ### [**](#finishedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L407)finishedAt **finishedAt: Date ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L403)id **id: string ### [**](#meta)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L409)meta **meta: [ActorRunMeta](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorRunMeta.md) ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L406)startedAt **startedAt: Date ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L408)status **status: READY | RUNNING | SUCCEEDED | FAILED | ABORTING | ABORTED | TIMING-OUT | TIMED-OUT ### 
[**](#usageTotalUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L415)optionalusageTotalUsd **usageTotalUsd? : number --- # ActorRunMeta ## Index[**](#Index) ### Properties * [**clientIp](#clientIp) * [**origin](#origin) * [**userAgent](#userAgent) ## Properties[**](#Properties) ### [**](#clientIp)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L451)optionalclientIp **clientIp? : string ### [**](#origin)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L450)origin **origin: string ### [**](#userAgent)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L452)userAgent **userAgent: string --- # ActorRunOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**diskMbytes](#diskMbytes) * [**maxTotalChargeUsd](#maxTotalChargeUsd) * [**memoryMbytes](#memoryMbytes) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L474)build **build: string ### [**](#diskMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L477)diskMbytes **diskMbytes: number ### [**](#maxTotalChargeUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L478)optionalmaxTotalChargeUsd **maxTotalChargeUsd? : number ### [**](#memoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L476)memoryMbytes **memoryMbytes: number ### [**](#timeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L475)timeoutSecs **timeoutSecs: number --- # ActorRunStats ## Index[**](#Index) ### Properties * [**computeUnits](#computeUnits) * [**cpuAvgUsage](#cpuAvgUsage) * [**cpuCurrentUsage](#cpuCurrentUsage) * [**cpuMaxUsage](#cpuMaxUsage) * [**durationMillis](#durationMillis) * [**inputBodyLen](#inputBodyLen) * [**memAvgBytes](#memAvgBytes) * [**memCurrentBytes](#memCurrentBytes) * [**memMaxBytes](#memMaxBytes) * [**metamorph](#metamorph) * [**netRxBytes](#netRxBytes) * [**netTxBytes](#netTxBytes) * [**restartCount](#restartCount) * [**resurrectCount](#resurrectCount) * [**runTimeSecs](#runTimeSecs) ## Properties[**](#Properties) ### [**](#computeUnits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L470)computeUnits **computeUnits: number ### [**](#cpuAvgUsage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L462)cpuAvgUsage **cpuAvgUsage: number ### [**](#cpuCurrentUsage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L464)cpuCurrentUsage **cpuCurrentUsage: number ### [**](#cpuMaxUsage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L463)cpuMaxUsage **cpuMaxUsage: number ### [**](#durationMillis)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L467)durationMillis **durationMillis: number ### [**](#inputBodyLen)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L456)inputBodyLen **inputBodyLen: number ### [**](#memAvgBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L459)memAvgBytes **memAvgBytes: number ### [**](#memCurrentBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L461)memCurrentBytes 
**memCurrentBytes: number ### [**](#memMaxBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L460)memMaxBytes **memMaxBytes: number ### [**](#metamorph)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L469)metamorph **metamorph: number ### [**](#netRxBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L465)netRxBytes **netRxBytes: number ### [**](#netTxBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L466)netTxBytes **netTxBytes: number ### [**](#restartCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L457)restartCount **restartCount: number ### [**](#resurrectCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L458)resurrectCount **resurrectCount: number ### [**](#runTimeSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L468)runTimeSecs **runTimeSecs: number --- # ActorRunUsage ## Index[**](#Index) ### Properties * [**ACTOR\_COMPUTE\_UNITS](#ACTOR_COMPUTE_UNITS) * [**DATA\_TRANSFER\_EXTERNAL\_GBYTES](#DATA_TRANSFER_EXTERNAL_GBYTES) * [**DATA\_TRANSFER\_INTERNAL\_GBYTES](#DATA_TRANSFER_INTERNAL_GBYTES) * [**DATASET\_READS](#DATASET_READS) * [**DATASET\_WRITES](#DATASET_WRITES) * [**KEY\_VALUE\_STORE\_LISTS](#KEY_VALUE_STORE_LISTS) * [**KEY\_VALUE\_STORE\_READS](#KEY_VALUE_STORE_READS) * [**KEY\_VALUE\_STORE\_WRITES](#KEY_VALUE_STORE_WRITES) * [**PROXY\_RESIDENTIAL\_TRANSFER\_GBYTES](#PROXY_RESIDENTIAL_TRANSFER_GBYTES) * [**PROXY\_SERPS](#PROXY_SERPS) * [**REQUEST\_QUEUE\_READS](#REQUEST_QUEUE_READS) * [**REQUEST\_QUEUE\_WRITES](#REQUEST_QUEUE_WRITES) ## Properties[**](#Properties) ### [**](#ACTOR_COMPUTE_UNITS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L435)optionalACTOR\_COMPUTE\_UNITS **ACTOR\_COMPUTE\_UNITS? : number ### [**](#DATA_TRANSFER_EXTERNAL_GBYTES)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L444)optionalDATA\_TRANSFER\_EXTERNAL\_GBYTES **DATA\_TRANSFER\_EXTERNAL\_GBYTES? : number ### [**](#DATA_TRANSFER_INTERNAL_GBYTES)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L443)optionalDATA\_TRANSFER\_INTERNAL\_GBYTES **DATA\_TRANSFER\_INTERNAL\_GBYTES? : number ### [**](#DATASET_READS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L436)optionalDATASET\_READS **DATASET\_READS? : number ### [**](#DATASET_WRITES)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L437)optionalDATASET\_WRITES **DATASET\_WRITES? : number ### [**](#KEY_VALUE_STORE_LISTS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L440)optionalKEY\_VALUE\_STORE\_LISTS **KEY\_VALUE\_STORE\_LISTS? : number ### [**](#KEY_VALUE_STORE_READS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L438)optionalKEY\_VALUE\_STORE\_READS **KEY\_VALUE\_STORE\_READS? : number ### [**](#KEY_VALUE_STORE_WRITES)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L439)optionalKEY\_VALUE\_STORE\_WRITES **KEY\_VALUE\_STORE\_WRITES? 
: number ### [**](#PROXY_RESIDENTIAL_TRANSFER_GBYTES)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L445)optionalPROXY\_RESIDENTIAL\_TRANSFER\_GBYTES **PROXY\_RESIDENTIAL\_TRANSFER\_GBYTES? : number ### [**](#PROXY_SERPS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L446)optionalPROXY\_SERPS **PROXY\_SERPS? : number ### [**](#REQUEST_QUEUE_READS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L441)optionalREQUEST\_QUEUE\_READS **REQUEST\_QUEUE\_READS? : number ### [**](#REQUEST_QUEUE_WRITES)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L442)optionalREQUEST\_QUEUE\_WRITES **REQUEST\_QUEUE\_WRITES? : number --- # ActorStandby ## Index[**](#Index) ### Properties * [**build](#build) * [**desiredRequestsPerActorRun](#desiredRequestsPerActorRun) * [**idleTimeoutSecs](#idleTimeoutSecs) * [**maxRequestsPerActorRun](#maxRequestsPerActorRun) * [**memoryMbytes](#memoryMbytes) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L341)build **build: string ### [**](#desiredRequestsPerActorRun)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L338)desiredRequestsPerActorRun **desiredRequestsPerActorRun: number ### [**](#idleTimeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L340)idleTimeoutSecs **idleTimeoutSecs: number ### [**](#maxRequestsPerActorRun)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L339)maxRequestsPerActorRun **maxRequestsPerActorRun: number ### [**](#memoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L342)memoryMbytes **memoryMbytes: number --- # ActorStartOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**contentType](#contentType) * [**maxItems](#maxItems) * [**memory](#memory) * [**timeout](#timeout) * [**waitForFinish](#waitForFinish) * [**webhooks](#webhooks) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L350)optionalbuild **build? : string Tag or number of the actor build to run (e.g. `beta` or `1.2.345`). If not provided, the run uses build tag or number from the default actor run configuration (typically `latest`). ### [**](#contentType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L358)optionalcontentType **contentType? : string Content type for the `input`. If not specified, `input` is expected to be an object that will be stringified to JSON and content type set to `application/json; charset=utf-8`. If `options.contentType` is specified, then `input` must be a `String` or `Buffer`. ### [**](#maxItems)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L393)optionalmaxItems **maxItems? : number Specifies maximum number of items that the actor run should return. This is used by pay per result actors to limit the maximum number of results that will be charged to customer. Value can be accessed in actor run using `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable. ### [**](#memory)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L364)optionalmemory **memory? : number Memory in megabytes which will be allocated for the new actor run. 
If not provided, the run uses memory of the default actor run configuration. ### [**](#timeout)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L369)optionaltimeout **timeout? : number Timeout for the actor run in seconds. Zero value means there is no timeout. If not provided, the run uses timeout of the default actor run configuration. ### [**](#waitForFinish)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L379)optionalwaitForFinish **waitForFinish? : number Maximum time to wait for the actor run to finish, in seconds. If the limit is reached, the returned promise is resolved to a run object that will have status `READY` or `RUNNING` and it will not contain the actor run output. By default (or when `waitForFinish` is set to `0`), the function resolves immediately without waiting. The wait is limited to 60s and happens on the API directly, as opposed to the `call` method and its `waitSecs` option, which is implemented via polling on the client side instead (and has no limit like that). ### [**](#webhooks)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L386)optionalwebhooks **webhooks? : readonly [WebhookUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookUpdateData)\[] Specifies optional webhooks associated with the actor run, which can be used to receive a notification e.g. when the actor finished or failed, see [ad hook webhooks documentation](https://docs.apify.com/webhooks/ad-hoc-webhooks) for detailed description. --- # ActorStats ## Index[**](#Index) ### Properties * [**lastRunStartedAt](#lastRunStartedAt) * [**totalBuilds](#totalBuilds) * [**totalMetamorphs](#totalMetamorphs) * [**totalRuns](#totalRuns) * [**totalUsers](#totalUsers) * [**totalUsers30Days](#totalUsers30Days) * [**totalUsers7Days](#totalUsers7Days) * [**totalUsers90Days](#totalUsers90Days) ## Properties[**](#Properties) ### [**](#lastRunStartedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L297)lastRunStartedAt **lastRunStartedAt: Date ### [**](#totalBuilds)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L290)totalBuilds **totalBuilds: number ### [**](#totalMetamorphs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L296)totalMetamorphs **totalMetamorphs: number ### [**](#totalRuns)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L291)totalRuns **totalRuns: number ### [**](#totalUsers)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L292)totalUsers **totalUsers: number ### [**](#totalUsers30Days)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L294)totalUsers30Days **totalUsers30Days: number ### [**](#totalUsers7Days)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L293)totalUsers7Days **totalUsers7Days: number ### [**](#totalUsers90Days)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L295)totalUsers90Days **totalUsers90Days: number --- # ActorStoreList ## Index[**](#Index) ### Properties * [**currentPricingInfo](#currentPricingInfo) * [**description](#description) * [**id](#id) * [**name](#name) * [**pictureUrl](#pictureUrl) * [**stats](#stats) * [**title](#title) * [**url](#url) * [**username](#username) * [**userPictureUrl](#userPictureUrl) ## 
Properties[**](#Properties) ### [**](#currentPricingInfo)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L51)currentPricingInfo **currentPricingInfo: [PricingInfo](https://docs.apify.com/api/client/js/api/client/js/reference/interface/PricingInfo.md) ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L49)optionaldescription **description? : string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L45)id **id: string ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L46)name **name: string ### [**](#pictureUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L52)optionalpictureUrl **pictureUrl? : string ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L50)stats **stats: [ActorStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStats.md) ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L48)optionaltitle **title? : string ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L54)url **url: string ### [**](#username)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L47)username **username: string ### [**](#userPictureUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L53)optionaluserPictureUrl **userPictureUrl? : string --- # ActorTaggedBuild ## Index[**](#Index) ### Properties * [**buildId](#buildId) * [**buildNumber](#buildNumber) * [**finishedAt](#finishedAt) ## Properties[**](#Properties) ### [**](#buildId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L314)optionalbuildId **buildId? : string ### [**](#buildNumber)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L315)optionalbuildNumber **buildNumber? : string ### [**](#finishedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L316)optionalfinishedAt **finishedAt? : Date --- # ApifyClientOptions ## Index[**](#Index) ### Properties * [**baseUrl](#baseUrl) * [**maxRetries](#maxRetries) * [**minDelayBetweenRetriesMillis](#minDelayBetweenRetriesMillis) * [**requestInterceptors](#requestInterceptors) * [**timeoutSecs](#timeoutSecs) * [**token](#token) * [**userAgentSuffix](#userAgentSuffix) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L349)optionalbaseUrl **baseUrl? : string = https\://api.apify.com ### [**](#maxRetries)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L351)optionalmaxRetries **maxRetries? : number = 8 ### [**](#minDelayBetweenRetriesMillis)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L353)optionalminDelayBetweenRetriesMillis **minDelayBetweenRetriesMillis? : number = 500 ### [**](#requestInterceptors)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L355)optionalrequestInterceptors **requestInterceptors? 
: (undefined | null | (value) => ApifyRequestConfig | Promise\)\[] = \[] ### [**](#timeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L357)optionaltimeoutSecs **timeoutSecs? : number = 360 ### [**](#token)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L358)optionaltoken **token? : string ### [**](#userAgentSuffix)[**](https://github.com/apify/apify-client-js/blob/master/src/apify_client.ts#L359)optionaluserAgentSuffix **userAgentSuffix? : string | string\[] --- # Build ## Index[**](#Index) ### Properties * [**actId](#actId) * [**actorDefinition](#actorDefinition) * [**buildNumber](#buildNumber) * [**finishedAt](#finishedAt) * [**id](#id) * [**inputSchema](#inputSchema) * [**meta](#meta) * [**options](#options) * [**readme](#readme) * [**startedAt](#startedAt) * [**stats](#stats) * [**status](#status) * [**usage](#usage) * [**usageTotalUsd](#usageTotalUsd) * [**usageUsd](#usageUsd) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L124)actId **actId: string ### [**](#actorDefinition)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L144)optionalactorDefinition **actorDefinition? : [ActorDefinition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorDefinition.md) ### [**](#buildNumber)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L140)buildNumber **buildNumber: string ### [**](#finishedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L127)optionalfinishedAt **finishedAt? : Date ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L123)id **id: string ### [**](#inputSchema)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L135)optionalinputSchema **inputSchema? : string * **@deprecated** This property is deprecated in favor of `actorDefinition.input`. ### [**](#meta)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L129)meta **meta: [BuildMeta](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildMeta.md) ### [**](#options)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L131)optionaloptions **options? : [BuildOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildOptions.md) ### [**](#readme)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L139)optionalreadme **readme? : string * **@deprecated** This property is deprecated in favor of `actorDefinition.readme`. ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L126)startedAt **startedAt: Date ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L130)optionalstats **stats? : [BuildStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildStats.md) ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L128)status **status: SUCCEEDED | FAILED | ABORTED | TIMED-OUT ### [**](#usage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L141)optionalusage **usage? 
--- # Build ## Index[**](#Index) ### Properties * [**actId](#actId) * [**actorDefinition](#actorDefinition) * [**buildNumber](#buildNumber) * [**finishedAt](#finishedAt) * [**id](#id) * [**inputSchema](#inputSchema) * [**meta](#meta) * [**options](#options) * [**readme](#readme) * [**startedAt](#startedAt) * [**stats](#stats) * [**status](#status) * [**usage](#usage) * [**usageTotalUsd](#usageTotalUsd) * [**usageUsd](#usageUsd) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L124)actId **actId: string ### [**](#actorDefinition)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L144)optionalactorDefinition **actorDefinition? : [ActorDefinition](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorDefinition.md) ### [**](#buildNumber)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L140)buildNumber **buildNumber: string ### [**](#finishedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L127)optionalfinishedAt **finishedAt? : Date ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L123)id **id: string ### [**](#inputSchema)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L135)optionalinputSchema **inputSchema? : string * **@deprecated** This property is deprecated in favor of `actorDefinition.input`. ### [**](#meta)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L129)meta **meta: [BuildMeta](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildMeta.md) ### [**](#options)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L131)optionaloptions **options? : [BuildOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildOptions.md) ### [**](#readme)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L139)optionalreadme **readme? : string * **@deprecated** This property is deprecated in favor of `actorDefinition.readme`. ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L126)startedAt **startedAt: Date ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L130)optionalstats **stats? : [BuildStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildStats.md) ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L128)status **status: SUCCEEDED | FAILED | ABORTED | TIMED-OUT ### [**](#usage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L141)optionalusage **usage? : [BuildUsage](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildUsage.md) ### [**](#usageTotalUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L142)optionalusageTotalUsd **usageTotalUsd? : number ### [**](#usageUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L143)optionalusageUsd **usageUsd? : [BuildUsage](https://docs.apify.com/api/client/js/api/client/js/reference/interface/BuildUsage.md) ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L125)userId **userId: string --- # BuildClientGetOptions ## Index[**](#Index) ### Properties * [**waitForFinish](#waitForFinish) ## Properties[**](#Properties) ### [**](#waitForFinish)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L104)optionalwaitForFinish **waitForFinish? : number --- # BuildClientWaitForFinishOptions ## Index[**](#Index) ### Properties * [**waitSecs](#waitSecs) ## Properties[**](#Properties) ### [**](#waitSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L113)optionalwaitSecs **waitSecs? : number Maximum time to wait for the build to finish, in seconds. If the limit is reached, the returned promise is resolved to a build object that will have status `READY` or `RUNNING`. If `waitSecs` is omitted, the function waits indefinitely. --- # BuildCollectionClientListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build_collection.ts#L39)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build_collection.ts#L37)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build_collection.ts#L38)optionaloffset **offset? : number --- # BuildMeta ## Index[**](#Index) ### Properties * [**clientIp](#clientIp) * [**origin](#origin) * [**userAgent](#userAgent) ## Properties[**](#Properties) ### [**](#clientIp)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L118)clientIp **clientIp: string ### [**](#origin)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L117)origin **origin: string ### [**](#userAgent)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L119)userAgent **userAgent: string --- # BuildOptions ## Index[**](#Index) ### Properties * [**betaPackages](#betaPackages) * [**diskMbytes](#diskMbytes) * [**memoryMbytes](#memoryMbytes) * [**useCache](#useCache) ## Properties[**](#Properties) ### [**](#betaPackages)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L159)optionalbetaPackages **betaPackages? : boolean ### [**](#diskMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L161)optionaldiskMbytes **diskMbytes? : number ### [**](#memoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L160)optionalmemoryMbytes **memoryMbytes? : number ### [**](#useCache)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L158)optionaluseCache **useCache?
: boolean --- # BuildStats ## Index[**](#Index) ### Properties * [**computeUnits](#computeUnits) * [**durationMillis](#durationMillis) * [**runTimeSecs](#runTimeSecs) ## Properties[**](#Properties) ### [**](#computeUnits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L154)computeUnits **computeUnits: number ### [**](#durationMillis)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L152)durationMillis **durationMillis: number ### [**](#runTimeSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L153)runTimeSecs **runTimeSecs: number --- # BuildUsage ## Index[**](#Index) ### Properties * [**ACTOR\_COMPUTE\_UNITS](#ACTOR_COMPUTE_UNITS) ## Properties[**](#Properties) ### [**](#ACTOR_COMPUTE_UNITS)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L148)optionalACTOR\_COMPUTE\_UNITS **ACTOR\_COMPUTE\_UNITS? : number --- # Current ## Index[**](#Index) ### Properties * [**activeActorJobCount](#activeActorJobCount) * [**actorCount](#actorCount) * [**actorMemoryGbytes](#actorMemoryGbytes) * [**actorTaskCount](#actorTaskCount) * [**monthlyActorComputeUnits](#monthlyActorComputeUnits) * [**monthlyExternalDataTransferGbytes](#monthlyExternalDataTransferGbytes) * [**monthlyProxySerps](#monthlyProxySerps) * [**monthlyResidentialProxyGbytes](#monthlyResidentialProxyGbytes) * [**monthlyUsageUsd](#monthlyUsageUsd) * [**teamAccountSeatCount](#teamAccountSeatCount) ## Properties[**](#Properties) ### [**](#activeActorJobCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L262)activeActorJobCount **activeActorJobCount: number ### [**](#actorCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L260)actorCount **actorCount: number ### [**](#actorMemoryGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L259)actorMemoryGbytes **actorMemoryGbytes: number ### [**](#actorTaskCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L261)actorTaskCount **actorTaskCount: number ### [**](#monthlyActorComputeUnits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L255)monthlyActorComputeUnits **monthlyActorComputeUnits: number ### [**](#monthlyExternalDataTransferGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L256)monthlyExternalDataTransferGbytes **monthlyExternalDataTransferGbytes: number ### [**](#monthlyProxySerps)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L257)monthlyProxySerps **monthlyProxySerps: number ### [**](#monthlyResidentialProxyGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L258)monthlyResidentialProxyGbytes **monthlyResidentialProxyGbytes: number ### [**](#monthlyUsageUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L254)monthlyUsageUsd **monthlyUsageUsd: number ### [**](#teamAccountSeatCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L263)teamAccountSeatCount **teamAccountSeatCount: number --- # Dataset ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**cleanItemCount](#cleanItemCount) * [**createdAt](#createdAt) * [**fields](#fields) * [**generalAccess](#generalAccess) * 
[**id](#id) * [**itemCount](#itemCount) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**stats](#stats) * [**title](#title) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L186)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L189)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L190)optionalactRunId **actRunId? : string ### [**](#cleanItemCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L188)cleanItemCount **cleanItemCount: number ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L184)createdAt **createdAt: Date ### [**](#fields)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L192)fields **fields: string\[] ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L193)optionalgeneralAccess **generalAccess? : null | STORAGE\_GENERAL\_ACCESS ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L180)id **id: string ### [**](#itemCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L187)itemCount **itemCount: number ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L185)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L181)optionalname **name? : string ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L191)stats **stats: [DatasetStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetStats.md) ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L182)optionaltitle **title? : string ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L183)userId **userId: string --- # DatasetClientDownloadItemsOptions ### Hierarchy * [DatasetClientListItemOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientListItemOptions.md) * *DatasetClientDownloadItemsOptions* ## Index[**](#Index) ### Properties * [**attachment](#attachment) * [**bom](#bom) * [**clean](#clean) * [**delimiter](#delimiter) * [**desc](#desc) * [**fields](#fields) * [**flatten](#flatten) * [**limit](#limit) * [**offset](#offset) * [**omit](#omit) * [**skipEmpty](#skipEmpty) * [**skipHeaderRow](#skipHeaderRow) * [**skipHidden](#skipHidden) * [**unwind](#unwind) * [**view](#view) * [**xmlRoot](#xmlRoot) * [**xmlRow](#xmlRow) ## Properties[**](#Properties) ### [**](#attachment)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L236)optionalattachment **attachment? : boolean ### [**](#bom)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L237)optionalbom **bom? : boolean ### [**](#clean)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L210)optionalinheritedclean **clean? 
: boolean Inherited from DatasetClientListItemOptions.clean ### [**](#delimiter)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L238)optionaldelimiter **delimiter? : string ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L211)optionalinheriteddesc **desc? : boolean Inherited from DatasetClientListItemOptions.desc ### [**](#fields)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L213)optionalinheritedfields **fields? : string\[] Inherited from DatasetClientListItemOptions.fields ### [**](#flatten)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L212)optionalinheritedflatten **flatten? : string\[] Inherited from DatasetClientListItemOptions.flatten ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L215)optionalinheritedlimit **limit? : number Inherited from DatasetClientListItemOptions.limit ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L216)optionalinheritedoffset **offset? : number Inherited from DatasetClientListItemOptions.offset ### [**](#omit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L214)optionalinheritedomit **omit? : string\[] Inherited from DatasetClientListItemOptions.omit ### [**](#skipEmpty)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L217)optionalinheritedskipEmpty **skipEmpty? : boolean Inherited from DatasetClientListItemOptions.skipEmpty ### [**](#skipHeaderRow)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L239)optionalskipHeaderRow **skipHeaderRow? : boolean ### [**](#skipHidden)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L218)optionalinheritedskipHidden **skipHidden? : boolean Inherited from DatasetClientListItemOptions.skipHidden ### [**](#unwind)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L219)optionalinheritedunwind **unwind? : string | string\[] Inherited from DatasetClientListItemOptions.unwind ### [**](#view)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L220)optionalinheritedview **view? : string Inherited from DatasetClientListItemOptions.view ### [**](#xmlRoot)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L240)optionalxmlRoot **xmlRoot? : string ### [**](#xmlRow)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L241)optionalxmlRow **xmlRow? : string --- # DatasetClientListItemOptions ### Hierarchy * *DatasetClientListItemOptions* * [DatasetClientDownloadItemsOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/DatasetClientDownloadItemsOptions.md) ## Index[**](#Index) ### Properties * [**clean](#clean) * [**desc](#desc) * [**fields](#fields) * [**flatten](#flatten) * [**limit](#limit) * [**offset](#offset) * [**omit](#omit) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**unwind](#unwind) * [**view](#view) ## Properties[**](#Properties) ### [**](#clean)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L210)optionalclean **clean? 
: boolean ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L211)optionaldesc **desc? : boolean ### [**](#fields)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L213)optionalfields **fields? : string\[] ### [**](#flatten)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L212)optionalflatten **flatten? : string\[] ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L215)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L216)optionaloffset **offset? : number ### [**](#omit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L214)optionalomit **omit? : string\[] ### [**](#skipEmpty)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L217)optionalskipEmpty **skipEmpty? : boolean ### [**](#skipHidden)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L218)optionalskipHidden **skipHidden? : boolean ### [**](#unwind)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L219)optionalunwind **unwind? : string | string\[] ### [**](#view)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L220)optionalview **view? : string --- # DatasetClientUpdateOptions ## Index[**](#Index) ### Properties * [**generalAccess](#generalAccess) * [**name](#name) * [**title](#title) ## Properties[**](#Properties) ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L206)optionalgeneralAccess **generalAccess? : null | STORAGE\_GENERAL\_ACCESS ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L204)optionalname **name? : null | string ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L205)optionaltitle **title? : string --- # DatasetCollectionClientGetOrCreateOptions ## Index[**](#Index) ### Properties * [**schema](#schema) ## Properties[**](#Properties) ### [**](#schema)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L55)optionalschema **schema? : Record\ --- # DatasetCollectionClientListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) * [**unnamed](#unnamed) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L51)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L49)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L50)optionaloffset **offset? : number ### [**](#unnamed)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset_collection.ts#L48)optionalunnamed **unnamed? 
: boolean --- # DatasetStatistics ## Index[**](#Index) ### Properties * [**fieldStatistics](#fieldStatistics) ## Properties[**](#Properties) ### [**](#fieldStatistics)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L245)fieldStatistics **fieldStatistics: Record\ --- # DatasetStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L199)optionaldeleteCount **deleteCount? : number ### [**](#readCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L197)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L200)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L198)optionalwriteCount **writeCount? : number --- # FieldStatistics ## Index[**](#Index) ### Properties * [**emptyCount](#emptyCount) * [**max](#max) * [**min](#min) * [**nullCount](#nullCount) ## Properties[**](#Properties) ### [**](#emptyCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L252)optionalemptyCount **emptyCount? : number ### [**](#max)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L250)optionalmax **max? : number ### [**](#min)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L249)optionalmin **min? : number ### [**](#nullCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/dataset.ts#L251)optionalnullCount **nullCount? : number --- # FlatPricePerMonthActorPricingInfo ### Hierarchy * CommonActorPricingInfo * *FlatPricePerMonthActorPricingInfo* ## Index[**](#Index) ### Properties * [**apifyMarginPercentage](#apifyMarginPercentage) * [**createdAt](#createdAt) * [**notifiedAboutChangeAt](#notifiedAboutChangeAt) * [**notifiedAboutFutureChangeAt](#notifiedAboutFutureChangeAt) * [**pricePerUnitUsd](#pricePerUnitUsd) * [**pricingModel](#pricingModel) * [**reasonForChange](#reasonForChange) * [**startedAt](#startedAt) * [**trialMinutes](#trialMinutes) ## Properties[**](#Properties) ### [**](#apifyMarginPercentage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L513)inheritedapifyMarginPercentage **apifyMarginPercentage: number Inherited from CommonActorPricingInfo.apifyMarginPercentage In \[0, 1], fraction of pricePerUnitUsd that goes to Apify ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L515)inheritedcreatedAt **createdAt: Date Inherited from CommonActorPricingInfo.createdAt When this pricing info record has been created ### [**](#notifiedAboutChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L519)optionalinheritednotifiedAboutChangeAt **notifiedAboutChangeAt? : Date Inherited from CommonActorPricingInfo.notifiedAboutChangeAt ### [**](#notifiedAboutFutureChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L518)optionalinheritednotifiedAboutFutureChangeAt **notifiedAboutFutureChangeAt? 
: Date Inherited from CommonActorPricingInfo.notifiedAboutFutureChangeAt ### [**](#pricePerUnitUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L532)pricePerUnitUsd **pricePerUnitUsd: number Monthly flat price in USD ### [**](#pricingModel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L528)pricingModel **pricingModel: FLAT\_PRICE\_PER\_MONTH ### [**](#reasonForChange)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L520)optionalinheritedreasonForChange **reasonForChange? : string Inherited from CommonActorPricingInfo.reasonForChange ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L517)inheritedstartedAt **startedAt: Date Inherited from CommonActorPricingInfo.startedAt Since when is this pricing info record effective for a given Actor ### [**](#trialMinutes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L530)optionaltrialMinutes **trialMinutes? : number For how long this Actor can be used for free in trial period --- # FreeActorPricingInfo ### Hierarchy * CommonActorPricingInfo * *FreeActorPricingInfo* ## Index[**](#Index) ### Properties * [**apifyMarginPercentage](#apifyMarginPercentage) * [**createdAt](#createdAt) * [**notifiedAboutChangeAt](#notifiedAboutChangeAt) * [**notifiedAboutFutureChangeAt](#notifiedAboutFutureChangeAt) * [**pricingModel](#pricingModel) * [**reasonForChange](#reasonForChange) * [**startedAt](#startedAt) ## Properties[**](#Properties) ### [**](#apifyMarginPercentage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L513)inheritedapifyMarginPercentage **apifyMarginPercentage: number Inherited from CommonActorPricingInfo.apifyMarginPercentage In \[0, 1], fraction of pricePerUnitUsd that goes to Apify ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L515)inheritedcreatedAt **createdAt: Date Inherited from CommonActorPricingInfo.createdAt When this pricing info record has been created ### [**](#notifiedAboutChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L519)optionalinheritednotifiedAboutChangeAt **notifiedAboutChangeAt? : Date Inherited from CommonActorPricingInfo.notifiedAboutChangeAt ### [**](#notifiedAboutFutureChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L518)optionalinheritednotifiedAboutFutureChangeAt **notifiedAboutFutureChangeAt? : Date Inherited from CommonActorPricingInfo.notifiedAboutFutureChangeAt ### [**](#pricingModel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L524)pricingModel **pricingModel: FREE ### [**](#reasonForChange)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L520)optionalinheritedreasonForChange **reasonForChange? 
: string Inherited from CommonActorPricingInfo.reasonForChange ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L517)inheritedstartedAt **startedAt: Date Inherited from CommonActorPricingInfo.startedAt Since when is this pricing info record effective for a given Actor --- # KeyValueClientGetRecordOptions ## Index[**](#Index) ### Properties * [**buffer](#buffer) * [**stream](#stream) ## Properties[**](#Properties) ### [**](#buffer)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L296)optionalbuffer **buffer? : boolean ### [**](#stream)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L297)optionalstream **stream? : boolean --- # KeyValueClientListKeysOptions ## Index[**](#Index) ### Properties * [**collection](#collection) * [**exclusiveStartKey](#exclusiveStartKey) * [**limit](#limit) * [**prefix](#prefix) ## Properties[**](#Properties) ### [**](#collection)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L277)optionalcollection **collection? : string ### [**](#exclusiveStartKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L276)optionalexclusiveStartKey **exclusiveStartKey? : string ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L275)optionallimit **limit? : number ### [**](#prefix)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L278)optionalprefix **prefix? : string --- # KeyValueClientListKeysResult ## Index[**](#Index) ### Properties * [**count](#count) * [**exclusiveStartKey](#exclusiveStartKey) * [**isTruncated](#isTruncated) * [**items](#items) * [**limit](#limit) * [**nextExclusiveStartKey](#nextExclusiveStartKey) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L282)count **count: number ### [**](#exclusiveStartKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L284)exclusiveStartKey **exclusiveStartKey: string ### [**](#isTruncated)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L285)isTruncated **isTruncated: boolean ### [**](#items)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L287)items **items: [KeyValueListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueListItem.md)\[] ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L283)limit **limit: number ### [**](#nextExclusiveStartKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L286)nextExclusiveStartKey **nextExclusiveStartKey: string --- # KeyValueClientUpdateOptions ## Index[**](#Index) ### Properties * [**generalAccess](#generalAccess) * [**name](#name) * [**title](#title) ## Properties[**](#Properties) ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L271)optionalgeneralAccess **generalAccess? : null | STORAGE\_GENERAL\_ACCESS ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L269)optionalname **name? 
: null | string ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L270)optionaltitle **title? : string --- # KeyValueListItem ## Index[**](#Index) ### Properties * [**key](#key) * [**size](#size) ## Properties[**](#Properties) ### [**](#key)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L291)key **key: string ### [**](#size)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L292)size **size: number --- # KeyValueStore ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**generalAccess](#generalAccess) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**stats](#stats) * [**title](#title) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L253)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L254)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L255)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L251)createdAt **createdAt: Date ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L257)optionalgeneralAccess **generalAccess? : null | STORAGE\_GENERAL\_ACCESS ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L247)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L252)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L248)optionalname **name? : string ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L256)optionalstats **stats? : [KeyValueStoreStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/KeyValueStoreStats.md) ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L249)optionaltitle **title? : string ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L250)userId **userId: string --- # KeyValueStoreCollectionClientGetOrCreateOptions ## Index[**](#Index) ### Properties * [**schema](#schema) ## Properties[**](#Properties) ### [**](#schema)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L60)optionalschema **schema? : Record\ --- # KeyValueStoreCollectionClientListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) * [**unnamed](#unnamed) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L56)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L54)optionallimit **limit? 
: number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L55)optionaloffset **offset? : number ### [**](#unnamed)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store_collection.ts#L53)optionalunnamed **unnamed? : boolean --- # KeyValueStoreRecord \<T\> ## Index[**](#Index) ### Properties * [**contentType](#contentType) * [**key](#key) * [**value](#value) ## Properties[**](#Properties) ### [**](#contentType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L303)optionalcontentType **contentType? : string ### [**](#key)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L301)key **key: string ### [**](#value)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L302)value **value: T --- # KeyValueStoreRecordOptions ## Index[**](#Index) ### Properties * [**doNotRetryTimeouts](#doNotRetryTimeouts) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#doNotRetryTimeouts)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L308)optionaldoNotRetryTimeouts **doNotRetryTimeouts? : boolean ### [**](#timeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L307)optionaltimeoutSecs **timeoutSecs? : number
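As a usage sketch for the record interfaces above (the store ID and keys are placeholders, and an async context is assumed): reading a record resolves to a `KeyValueStoreRecord<T>` with `key`, `value` and an optional `contentType`, and writing one accepts the same shape.

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // placeholder token
const store = client.keyValueStore('my-store-id');           // placeholder store ID

// Read a record; the result carries key, value and (optionally) contentType.
const record = await store.getRecord('INPUT');
console.log(record?.key, record?.contentType, record?.value);

// Write a record; contentType is optional (plain objects are typically stored as JSON).
await store.setRecord({
    key: 'OUTPUT',
    value: { done: true },
    contentType: 'application/json',
});
```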
--- # KeyValueStoreStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**listCount](#listCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L263)optionaldeleteCount **deleteCount? : number ### [**](#listCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L264)optionallistCount **listCount? : number ### [**](#readCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L261)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L265)optionalstorageBytes **storageBytes? : number ### [**](#writeCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/key_value_store.ts#L262)optionalwriteCount **writeCount? : number --- # Limits ## Index[**](#Index) ### Properties * [**dataRetentionDays](#dataRetentionDays) * [**maxActorCount](#maxActorCount) * [**maxActorMemoryGbytes](#maxActorMemoryGbytes) * [**maxActorTaskCount](#maxActorTaskCount) * [**maxConcurrentActorJobs](#maxConcurrentActorJobs) * [**maxMonthlyActorComputeUnits](#maxMonthlyActorComputeUnits) * [**maxMonthlyExternalDataTransferGbytes](#maxMonthlyExternalDataTransferGbytes) * [**maxMonthlyProxySerps](#maxMonthlyProxySerps) * [**maxMonthlyResidentialProxyGbytes](#maxMonthlyResidentialProxyGbytes) * [**maxMonthlyUsageUsd](#maxMonthlyUsageUsd) * [**maxTeamAccountSeatCount](#maxTeamAccountSeatCount) ## Properties[**](#Properties) ### [**](#dataRetentionDays)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L248)dataRetentionDays **dataRetentionDays: number ### [**](#maxActorCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L244)maxActorCount **maxActorCount: number ### [**](#maxActorMemoryGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L243)maxActorMemoryGbytes **maxActorMemoryGbytes: number ### [**](#maxActorTaskCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L245)maxActorTaskCount **maxActorTaskCount: number ### [**](#maxConcurrentActorJobs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L246)maxConcurrentActorJobs **maxConcurrentActorJobs: number ### [**](#maxMonthlyActorComputeUnits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L239)maxMonthlyActorComputeUnits **maxMonthlyActorComputeUnits: number ### [**](#maxMonthlyExternalDataTransferGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L240)maxMonthlyExternalDataTransferGbytes **maxMonthlyExternalDataTransferGbytes: number ### [**](#maxMonthlyProxySerps)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L241)maxMonthlyProxySerps **maxMonthlyProxySerps: number ### [**](#maxMonthlyResidentialProxyGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L242)maxMonthlyResidentialProxyGbytes **maxMonthlyResidentialProxyGbytes: number ### [**](#maxMonthlyUsageUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L238)maxMonthlyUsageUsd **maxMonthlyUsageUsd: number ### [**](#maxTeamAccountSeatCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L247)maxTeamAccountSeatCount **maxTeamAccountSeatCount: number --- # MonthlyUsage ## Index[**](#Index) ### Properties * [**dailyServiceUsages](#dailyServiceUsages) * [**monthlyServiceUsage](#monthlyServiceUsage) * [**totalUsageCreditsUsdAfterVolumeDiscount](#totalUsageCreditsUsdAfterVolumeDiscount) * [**totalUsageCreditsUsdBeforeVolumeDiscount](#totalUsageCreditsUsdBeforeVolumeDiscount) * [**usageCycle](#usageCycle) ## Properties[**](#Properties) ### [**](#dailyServiceUsages)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L183)dailyServiceUsages **dailyServiceUsages: DailyServiceUsage\[] ### [**](#monthlyServiceUsage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L182)monthlyServiceUsage **monthlyServiceUsage: {} #### Type declaration ### 
[**](#totalUsageCreditsUsdAfterVolumeDiscount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L185)totalUsageCreditsUsdAfterVolumeDiscount **totalUsageCreditsUsdAfterVolumeDiscount: number ### [**](#totalUsageCreditsUsdBeforeVolumeDiscount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L184)totalUsageCreditsUsdBeforeVolumeDiscount **totalUsageCreditsUsdBeforeVolumeDiscount: number ### [**](#usageCycle)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L181)usageCycle **usageCycle: [UsageCycle](https://docs.apify.com/api/client/js/api/client/js/reference/interface/UsageCycle.md) --- # MonthlyUsageCycle ## Index[**](#Index) ### Properties * [**endAt](#endAt) * [**startAt](#startAt) ## Properties[**](#Properties) ### [**](#endAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L234)endAt **endAt: Date ### [**](#startAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L233)startAt **startAt: Date --- # OpenApiDefinition ## Index[**](#Index) ### Properties * [**components](#components) * [**info](#info) * [**openapi](#openapi) * [**paths](#paths) * [**servers](#servers) ## Properties[**](#Properties) ### [**](#components)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L174)components **components: { schemas: {} } #### Type declaration * ##### schemas: {} ### [**](#info)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L166)info **info: { description? : string; title: string; version? : string; x-build-id: string } #### Type declaration * ##### optionaldescription?: string * ##### title: string * ##### optionalversion?: string * ##### x-build-id: string ### [**](#openapi)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L165)openapi **openapi: string ### [**](#paths)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L173)paths **paths: {} #### Type declaration ### [**](#servers)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/build.ts#L172)servers **servers: { url: string }\[] --- # PaginatedList \ ## Index[**](#Index) ### Properties * [**count](#count) * [**desc](#desc) * [**items](#items) * [**limit](#limit) * [**offset](#offset) * [**total](#total) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/apify-client-js/blob/master/src/utils.ts#L237)count **count: number Count of dataset entries returned in this set. ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/utils.ts#L243)desc **desc: boolean Should the results be in descending order. ### [**](#items)[**](https://github.com/apify/apify-client-js/blob/master/src/utils.ts#L245)items **items: Data\[] Dataset entries based on chosen format parameter. ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/utils.ts#L241)limit **limit: number Maximum number of dataset entries requested. ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/utils.ts#L239)offset **offset: number Position of the first returned entry in the dataset. ### [**](#total)[**](https://github.com/apify/apify-client-js/blob/master/src/utils.ts#L235)total **total: number Total count of entries in the dataset. 
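The `PaginatedList` shape above is what the client's list endpoints resolve to; for instance, dataset items can be paged through with `offset`/`limit` and the returned `total`. A minimal sketch (the dataset ID is a placeholder and an async context is assumed):

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // placeholder token
const dataset = client.dataset('my-dataset-id');             // placeholder dataset ID

// Page through the dataset 1,000 items at a time using the PaginatedList fields.
const pageSize = 1000;
let offset = 0;
for (;;) {
    const page = await dataset.listItems({ offset, limit: pageSize, clean: true });
    console.log(`Got ${page.count} of ${page.total} items (offset ${page.offset})`);
    for (const item of page.items) {
        // ...process the item...
    }
    offset += page.count;
    if (page.count === 0 || offset >= page.total) break;
}
```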
--- # PricePerDatasetItemActorPricingInfo ### Hierarchy * CommonActorPricingInfo * *PricePerDatasetItemActorPricingInfo* ## Index[**](#Index) ### Properties * [**apifyMarginPercentage](#apifyMarginPercentage) * [**createdAt](#createdAt) * [**notifiedAboutChangeAt](#notifiedAboutChangeAt) * [**notifiedAboutFutureChangeAt](#notifiedAboutFutureChangeAt) * [**pricePerUnitUsd](#pricePerUnitUsd) * [**pricingModel](#pricingModel) * [**reasonForChange](#reasonForChange) * [**startedAt](#startedAt) * [**unitName](#unitName) ## Properties[**](#Properties) ### [**](#apifyMarginPercentage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L513)inheritedapifyMarginPercentage **apifyMarginPercentage: number Inherited from CommonActorPricingInfo.apifyMarginPercentage In \[0, 1], fraction of pricePerUnitUsd that goes to Apify ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L515)inheritedcreatedAt **createdAt: Date Inherited from CommonActorPricingInfo.createdAt When this pricing info record has been created ### [**](#notifiedAboutChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L519)optionalinheritednotifiedAboutChangeAt **notifiedAboutChangeAt? : Date Inherited from CommonActorPricingInfo.notifiedAboutChangeAt ### [**](#notifiedAboutFutureChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L518)optionalinheritednotifiedAboutFutureChangeAt **notifiedAboutFutureChangeAt? : Date Inherited from CommonActorPricingInfo.notifiedAboutFutureChangeAt ### [**](#pricePerUnitUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L539)pricePerUnitUsd **pricePerUnitUsd: number ### [**](#pricingModel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L536)pricingModel **pricingModel: PRICE\_PER\_DATASET\_ITEM ### [**](#reasonForChange)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L520)optionalinheritedreasonForChange **reasonForChange? : string Inherited from CommonActorPricingInfo.reasonForChange ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L517)inheritedstartedAt **startedAt: Date Inherited from CommonActorPricingInfo.startedAt Since when is this pricing info record effective for a given Actor ### [**](#unitName)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L538)optionalunitName **unitName? 
: string Name of the unit that is being charged --- # PricePerEventActorPricingInfo ### Hierarchy * CommonActorPricingInfo * *PricePerEventActorPricingInfo* ## Index[**](#Index) ### Properties * [**apifyMarginPercentage](#apifyMarginPercentage) * [**createdAt](#createdAt) * [**minimalMaxTotalChargeUsd](#minimalMaxTotalChargeUsd) * [**notifiedAboutChangeAt](#notifiedAboutChangeAt) * [**notifiedAboutFutureChangeAt](#notifiedAboutFutureChangeAt) * [**pricingModel](#pricingModel) * [**pricingPerEvent](#pricingPerEvent) * [**reasonForChange](#reasonForChange) * [**startedAt](#startedAt) ## Properties[**](#Properties) ### [**](#apifyMarginPercentage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L513)inheritedapifyMarginPercentage **apifyMarginPercentage: number Inherited from CommonActorPricingInfo.apifyMarginPercentage In \[0, 1], fraction of pricePerUnitUsd that goes to Apify ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L515)inheritedcreatedAt **createdAt: Date Inherited from CommonActorPricingInfo.createdAt When this pricing info record has been created ### [**](#minimalMaxTotalChargeUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L555)optionalminimalMaxTotalChargeUsd **minimalMaxTotalChargeUsd? : number ### [**](#notifiedAboutChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L519)optionalinheritednotifiedAboutChangeAt **notifiedAboutChangeAt? : Date Inherited from CommonActorPricingInfo.notifiedAboutChangeAt ### [**](#notifiedAboutFutureChangeAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L518)optionalinheritednotifiedAboutFutureChangeAt **notifiedAboutFutureChangeAt? : Date Inherited from CommonActorPricingInfo.notifiedAboutFutureChangeAt ### [**](#pricingModel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L551)pricingModel **pricingModel: PAY\_PER\_EVENT ### [**](#pricingPerEvent)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L552)pricingPerEvent **pricingPerEvent: { actorChargeEvents: [ActorChargeEvents](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorChargeEvents) } #### Type declaration * ##### actorChargeEvents: [ActorChargeEvents](https://docs.apify.com/api/client/js/api/client/js/reference.md#ActorChargeEvents) ### [**](#reasonForChange)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L520)optionalinheritedreasonForChange **reasonForChange? 
: string Inherited from CommonActorPricingInfo.reasonForChange ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L517)inheritedstartedAt **startedAt: Date Inherited from CommonActorPricingInfo.startedAt Since when is this pricing info record effective for a given Actor --- # PricingInfo ## Index[**](#Index) ### Properties * [**pricingModel](#pricingModel) ## Properties[**](#Properties) ### [**](#pricingModel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L41)pricingModel **pricingModel: string --- # ProxyGroup ## Index[**](#Index) ### Properties * [**availableCount](#availableCount) * [**description](#description) * [**name](#name) ## Properties[**](#Properties) ### [**](#availableCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L119)availableCount **availableCount: number ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L118)description **description: string ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L117)name **name: string --- # RequestQueue ## Index[**](#Index) ### Properties * [**accessedAt](#accessedAt) * [**actId](#actId) * [**actRunId](#actRunId) * [**createdAt](#createdAt) * [**expireAt](#expireAt) * [**generalAccess](#generalAccess) * [**hadMultipleClients](#hadMultipleClients) * [**handledRequestCount](#handledRequestCount) * [**id](#id) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**pendingRequestCount](#pendingRequestCount) * [**stats](#stats) * [**title](#title) * [**totalRequestCount](#totalRequestCount) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#accessedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L563)accessedAt **accessedAt: Date ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L568)optionalactId **actId? : string ### [**](#actRunId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L569)optionalactRunId **actRunId? : string ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L561)createdAt **createdAt: Date ### [**](#expireAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L564)optionalexpireAt **expireAt? : string ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L572)optionalgeneralAccess **generalAccess? : null | STORAGE\_GENERAL\_ACCESS ### [**](#hadMultipleClients)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L570)hadMultipleClients **hadMultipleClients: boolean ### [**](#handledRequestCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L566)handledRequestCount **handledRequestCount: number ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L557)id **id: string ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L562)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L558)optionalname **name? 
: string ### [**](#pendingRequestCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L567)pendingRequestCount **pendingRequestCount: number ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L571)stats **stats: [RequestQueueStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueStats.md) ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L559)optionaltitle **title? : string ### [**](#totalRequestCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L565)totalRequestCount **totalRequestCount: number ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L560)userId **userId: string --- # RequestQueueClientAddRequestOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L638)optionalforefront **forefront? : boolean --- # RequestQueueClientAddRequestResult ## Index[**](#Index) ### Properties * [**requestId](#requestId) * [**wasAlreadyHandled](#wasAlreadyHandled) * [**wasAlreadyPresent](#wasAlreadyPresent) ## Properties[**](#Properties) ### [**](#requestId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L677)requestId **requestId: string ### [**](#wasAlreadyHandled)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L679)wasAlreadyHandled **wasAlreadyHandled: boolean ### [**](#wasAlreadyPresent)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L678)wasAlreadyPresent **wasAlreadyPresent: boolean --- # RequestQueueClientBatchAddRequestWithRetriesOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) * [**maxParallel](#maxParallel) * [**maxUnprocessedRequestsRetries](#maxUnprocessedRequestsRetries) * [**minDelayBetweenUnprocessedRequestsRetriesMillis](#minDelayBetweenUnprocessedRequestsRetriesMillis) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L655)optionalforefront **forefront? : boolean ### [**](#maxParallel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L657)optionalmaxParallel **maxParallel? : number ### [**](#maxUnprocessedRequestsRetries)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L656)optionalmaxUnprocessedRequestsRetries **maxUnprocessedRequestsRetries? : number ### [**](#minDelayBetweenUnprocessedRequestsRetriesMillis)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L658)optionalminDelayBetweenUnprocessedRequestsRetriesMillis **minDelayBetweenUnprocessedRequestsRetriesMillis? 
: number --- # RequestQueueClientBatchRequestsOperationResult ## Index[**](#Index) ### Properties * [**processedRequests](#processedRequests) * [**unprocessedRequests](#unprocessedRequests) ## Properties[**](#Properties) ### [**](#processedRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L700)processedRequests **processedRequests: ProcessedRequest\[] ### [**](#unprocessedRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L701)unprocessedRequests **unprocessedRequests: UnprocessedRequest\[] --- # RequestQueueClientDeleteRequestLockOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L647)optionalforefront **forefront? : boolean --- # RequestQueueClientListAndLockHeadOptions ## Index[**](#Index) ### Properties * [**limit](#limit) * [**lockSecs](#lockSecs) ## Properties[**](#Properties) ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L619)optionallimit **limit? : number ### [**](#lockSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L618)lockSecs **lockSecs: number --- # RequestQueueClientListAndLockHeadResult ### Hierarchy * [RequestQueueClientListHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListHeadResult.md) * *RequestQueueClientListAndLockHeadResult* ## Index[**](#Index) ### Properties * [**clientKey](#clientKey) * [**hadMultipleClients](#hadMultipleClients) * [**items](#items) * [**limit](#limit) * [**lockSecs](#lockSecs) * [**queueHasLockedRequests](#queueHasLockedRequests) * [**queueModifiedAt](#queueModifiedAt) ## Properties[**](#Properties) ### [**](#clientKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L625)clientKey **clientKey: string ### [**](#hadMultipleClients)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L596)inheritedhadMultipleClients **hadMultipleClients: boolean Inherited from RequestQueueClientListHeadResult.hadMultipleClients ### [**](#items)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L597)inheriteditems **items: [RequestQueueClientListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListItem.md)\[] Inherited from RequestQueueClientListHeadResult.items ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L594)inheritedlimit **limit: number Inherited from RequestQueueClientListHeadResult.limit ### [**](#lockSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L623)lockSecs **lockSecs: number ### [**](#queueHasLockedRequests)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L624)queueHasLockedRequests **queueHasLockedRequests: boolean ### [**](#queueModifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L595)inheritedqueueModifiedAt **queueModifiedAt: Date Inherited from RequestQueueClientListHeadResult.queueModifiedAt
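The lock-related interfaces above describe the request-locking workflow: `listAndLockHead` locks a batch of requests from the head of the queue for `lockSecs` seconds, and each lock can then be prolonged or deleted. A minimal sketch of that flow (the queue ID and processing logic are placeholders, and an async context is assumed):

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY-APIFY-TOKEN' }); // placeholder token
const queue = client.requestQueue('my-queue-id');            // placeholder queue ID

// Lock up to 10 requests from the head of the queue for 60 seconds.
const { items } = await queue.listAndLockHead({ limit: 10, lockSecs: 60 });

for (const request of items) {
    // If processing may take longer than the lock, extend it first.
    await queue.prolongRequestLock(request.id, { lockSecs: 60 });
    try {
        // ...process the request (placeholder)...
    } catch {
        // Processing failed: release the lock so the request becomes available again.
        await queue.deleteRequestLock(request.id);
    }
}
```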
--- # RequestQueueClientListHeadOptions ## Index[**](#Index) ### Properties * [**limit](#limit) ## Properties[**](#Properties) ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L590)optionallimit **limit? : number --- # RequestQueueClientListHeadResult ### Hierarchy * *RequestQueueClientListHeadResult* * [RequestQueueClientListAndLockHeadResult](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListAndLockHeadResult.md) ## Index[**](#Index) ### Properties * [**hadMultipleClients](#hadMultipleClients) * [**items](#items) * [**limit](#limit) * [**queueModifiedAt](#queueModifiedAt) ## Properties[**](#Properties) ### [**](#hadMultipleClients)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L596)hadMultipleClients **hadMultipleClients: boolean ### [**](#items)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L597)items **items: [RequestQueueClientListItem](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientListItem.md)\[] ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L594)limit **limit: number ### [**](#queueModifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L595)queueModifiedAt **queueModifiedAt: Date --- # RequestQueueClientListItem ## Index[**](#Index) ### Properties * [**id](#id) * [**lockExpiresAt](#lockExpiresAt) * [**method](#method) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) ## Properties[**](#Properties) ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L629)id **id: string ### [**](#lockExpiresAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L634)optionallockExpiresAt **lockExpiresAt? : Date ### [**](#method)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L633)method **method: [AllowedHttpMethods](https://docs.apify.com/api/client/js/api/client/js/reference.md#AllowedHttpMethods) ### [**](#retryCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L630)retryCount **retryCount: number ### [**](#uniqueKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L631)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L632)url **url: string --- # RequestQueueClientListRequestsOptions ## Index[**](#Index) ### Properties * [**exclusiveStartId](#exclusiveStartId) * [**limit](#limit) ## Properties[**](#Properties) ### [**](#exclusiveStartId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L602)optionalexclusiveStartId **exclusiveStartId? : string ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L601)optionallimit **limit? : number --- # RequestQueueClientListRequestsResult ## Index[**](#Index) ### Properties * [**exclusiveStartId](#exclusiveStartId) * [**items](#items) * [**limit](#limit) ## Properties[**](#Properties) ### [**](#exclusiveStartId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L613)optionalexclusiveStartId **exclusiveStartId?
: string ### [**](#items)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L614)items **items: [RequestQueueClientRequestSchema](https://docs.apify.com/api/client/js/api/client/js/reference/interface/RequestQueueClientRequestSchema.md)\[] ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L612)limit **limit: number --- # RequestQueueClientPaginateRequestsOptions ## Index[**](#Index) ### Properties * [**exclusiveStartId](#exclusiveStartId) * [**limit](#limit) * [**maxPageLimit](#maxPageLimit) ## Properties[**](#Properties) ### [**](#exclusiveStartId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L608)optionalexclusiveStartId **exclusiveStartId? : string ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L606)optionallimit **limit? : number ### [**](#maxPageLimit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L607)optionalmaxPageLimit **maxPageLimit? : number --- # RequestQueueClientProlongRequestLockOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) * [**lockSecs](#lockSecs) ## Properties[**](#Properties) ### [**](#forefront)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L642)optionalforefront **forefront? : boolean ### [**](#lockSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L643)lockSecs **lockSecs: number --- # RequestQueueClientProlongRequestLockResult ## Index[**](#Index) ### Properties * [**lockExpiresAt](#lockExpiresAt) ## Properties[**](#Properties) ### [**](#lockExpiresAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L651)lockExpiresAt **lockExpiresAt: Date --- # RequestQueueClientRequestSchema ## Index[**](#Index) ### Properties * [**errorMessages](#errorMessages) * [**handledAt](#handledAt) * [**headers](#headers) * [**id](#id) * [**loadedUrl](#loadedUrl) * [**method](#method) * [**noRetry](#noRetry) * [**payload](#payload) * [**retryCount](#retryCount) * [**uniqueKey](#uniqueKey) * [**url](#url) * [**userData](#userData) ## Properties[**](#Properties) ### [**](#errorMessages)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L668)optionalerrorMessages **errorMessages? : string\[] ### [**](#handledAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L671)optionalhandledAt **handledAt? : string ### [**](#headers)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L669)optionalheaders **headers? : Record\ ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L662)id **id: string ### [**](#loadedUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L673)optionalloadedUrl **loadedUrl? : string ### [**](#method)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L665)optionalmethod **method? : [AllowedHttpMethods](https://docs.apify.com/api/client/js/api/client/js/reference.md#AllowedHttpMethods) ### [**](#noRetry)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L672)optionalnoRetry **noRetry? 
: boolean ### [**](#payload)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L666)optionalpayload **payload? : string ### [**](#retryCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L667)optionalretryCount **retryCount? : number ### [**](#uniqueKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L663)uniqueKey **uniqueKey: string ### [**](#url)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L664)url **url: string ### [**](#userData)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L670)optionaluserData **userData? : Record\ --- # RequestQueueClientUnlockRequestsResult ## Index[**](#Index) ### Properties * [**unlockedCount](#unlockedCount) ## Properties[**](#Properties) ### [**](#unlockedCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L696)unlockedCount **unlockedCount: number --- # RequestQueueClientUpdateOptions ## Index[**](#Index) ### Properties * [**generalAccess](#generalAccess) * [**name](#name) * [**title](#title) ## Properties[**](#Properties) ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L586)optionalgeneralAccess **generalAccess? : null | STORAGE\_GENERAL\_ACCESS ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L584)optionalname **name? : null | string ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L585)optionaltitle **title? : string --- # RequestQueueCollectionListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) * [**unnamed](#unnamed) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue_collection.ts#L50)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue_collection.ts#L48)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue_collection.ts#L49)optionaloffset **offset? : number ### [**](#unnamed)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue_collection.ts#L47)optionalunnamed **unnamed? : boolean --- # RequestQueueStats ## Index[**](#Index) ### Properties * [**deleteCount](#deleteCount) * [**headItemReadCount](#headItemReadCount) * [**readCount](#readCount) * [**storageBytes](#storageBytes) * [**writeCount](#writeCount) ## Properties[**](#Properties) ### [**](#deleteCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L578)optionaldeleteCount **deleteCount? : number ### [**](#headItemReadCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L579)optionalheadItemReadCount **headItemReadCount? : number ### [**](#readCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L576)optionalreadCount **readCount? : number ### [**](#storageBytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L580)optionalstorageBytes **storageBytes? 
: number ### [**](#writeCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L577)optionalwriteCount **writeCount? : number --- # RequestQueueUserOptions ## Index[**](#Index) ### Properties * [**clientKey](#clientKey) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#clientKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L552)optionalclientKey **clientKey? : string ### [**](#timeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/request_queue.ts#L553)optionaltimeoutSecs **timeoutSecs? : number --- # RunAbortOptions ## Index[**](#Index) ### Properties * [**gracefully](#gracefully) ## Properties[**](#Properties) ### [**](#gracefully)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L275)optionalgracefully **gracefully? : boolean --- # RunChargeOptions ## Index[**](#Index) ### Properties * [**count](#count) * [**eventName](#eventName) * [**idempotencyKey](#idempotencyKey) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L300)optionalcount **count? : number Defaults to 1 ### [**](#eventName)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L298)eventName **eventName: string Name of the event to charge. Must be defined in the Actor's pricing info, otherwise the API will throw an error. ### [**](#idempotencyKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L302)optionalidempotencyKey **idempotencyKey? : string Defaults to `runId-eventName-timestamp` --- # RunCollectionListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) * [**status](#status) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run_collection.ts#L42)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run_collection.ts#L40)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run_collection.ts#L41)optionaloffset **offset? : number ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run_collection.ts#L43)optionalstatus **status? : READY | RUNNING | SUCCEEDED | FAILED | ABORTING | ABORTED | TIMING-OUT | TIMED-OUT --- # RunGetOptions ## Index[**](#Index) ### Properties * [**waitForFinish](#waitForFinish) ## Properties[**](#Properties) ### [**](#waitForFinish)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L271)optionalwaitForFinish **waitForFinish? : number --- # RunMetamorphOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**contentType](#contentType) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L280)optionalbuild **build? : string ### [**](#contentType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L279)optionalcontentType **contentType?
: string --- # RunResurrectOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**maxItems](#maxItems) * [**maxTotalChargeUsd](#maxTotalChargeUsd) * [**memory](#memory) * [**timeout](#timeout) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L289)optionalbuild **build? : string ### [**](#maxItems)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L292)optionalmaxItems **maxItems? : number ### [**](#maxTotalChargeUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L293)optionalmaxTotalChargeUsd **maxTotalChargeUsd? : number ### [**](#memory)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L290)optionalmemory **memory? : number ### [**](#timeout)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L291)optionaltimeout **timeout? : number --- # RunUpdateOptions ## Index[**](#Index) ### Properties * [**generalAccess](#generalAccess) * [**isStatusMessageTerminal](#isStatusMessageTerminal) * [**statusMessage](#statusMessage) ## Properties[**](#Properties) ### [**](#generalAccess)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L285)optionalgeneralAccess **generalAccess? : null | RUN\_GENERAL\_ACCESS ### [**](#isStatusMessageTerminal)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L284)optionalisStatusMessageTerminal **isStatusMessageTerminal? : boolean ### [**](#statusMessage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L283)optionalstatusMessage **statusMessage? : string --- # RunWaitForFinishOptions ## Index[**](#Index) ### Properties * [**waitSecs](#waitSecs) ## Properties[**](#Properties) ### [**](#waitSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/run.ts#L311)optionalwaitSecs **waitSecs? : number Maximum time to wait for the run to finish, in seconds. If the limit is reached, the returned promise is resolved to a run object that will have status `READY` or `RUNNING`. If `waitSecs` is omitted, the function waits indefinitely. --- # Schedule ## Index[**](#Index) ### Properties * [**actions](#actions) * [**createdAt](#createdAt) * [**cronExpression](#cronExpression) * [**description](#description) * [**id](#id) * [**isEnabled](#isEnabled) * [**isExclusive](#isExclusive) * [**lastRunAt](#lastRunAt) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**nextRunAt](#nextRunAt) * [**notifications](#notifications) * [**timezone](#timezone) * [**title](#title) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#actions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L78)actions **actions: [ScheduleAction](https://docs.apify.com/api/client/js/api/client/js/reference.md#ScheduleAction)\[] ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L74)createdAt **createdAt: Date ### [**](#cronExpression)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L69)cronExpression **cronExpression: string ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L73)optionaldescription **description?
: string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L65)id **id: string ### [**](#isEnabled)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L71)isEnabled **isEnabled: boolean ### [**](#isExclusive)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L72)isExclusive **isExclusive: boolean ### [**](#lastRunAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L77)lastRunAt **lastRunAt: string ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L75)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L67)name **name: string ### [**](#nextRunAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L76)nextRunAt **nextRunAt: string ### [**](#notifications)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L79)notifications **notifications: { email: boolean } #### Type declaration * ##### email: boolean ### [**](#timezone)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L70)timezone **timezone: Africa/Abidjan | Africa/Accra | Africa/Addis\_Ababa | Africa/Algiers | Africa/Asmara | Africa/Asmera | Africa/Bamako | Africa/Bangui | Africa/Banjul | Africa/Bissau | Africa/Blantyre | Africa/Brazzaville | Africa/Bujumbura | Africa/Cairo | Africa/Casablanca | Africa/Ceuta | Africa/Conakry | Africa/Dakar | Africa/Dar\_es\_Salaam | Africa/Djibouti | Africa/Douala | Africa/El\_Aaiun | Africa/Freetown | Africa/Gaborone | Africa/Harare | Africa/Johannesburg | Africa/Juba | Africa/Kampala | Africa/Khartoum | Africa/Kigali | Africa/Kinshasa | Africa/Lagos | Africa/Libreville | Africa/Lome | Africa/Luanda | Africa/Lubumbashi | Africa/Lusaka | Africa/Malabo | Africa/Maputo | Africa/Maseru | Africa/Mbabane | Africa/Mogadishu | Africa/Monrovia | Africa/Nairobi | Africa/Ndjamena | Africa/Niamey | Africa/Nouakchott | Africa/Ouagadougou | Africa/Porto-Novo | Africa/Sao\_Tome | Africa/Timbuktu | Africa/Tripoli | Africa/Tunis | Africa/Windhoek | America/Adak | America/Anchorage | America/Anguilla | America/Antigua | America/Araguaina | America/Argentina/Buenos\_Aires | America/Argentina/Catamarca | America/Argentina/ComodRivadavia | America/Argentina/Cordoba | America/Argentina/Jujuy | America/Argentina/La\_Rioja | America/Argentina/Mendoza | America/Argentina/Rio\_Gallegos | America/Argentina/Salta | America/Argentina/San\_Juan | America/Argentina/San\_Luis | America/Argentina/Tucuman | America/Argentina/Ushuaia | America/Aruba | America/Asuncion | America/Atikokan | America/Atka | America/Bahia | America/Bahia\_Banderas | America/Barbados | America/Belem | America/Belize | America/Blanc-Sablon | America/Boa\_Vista | America/Bogota | America/Boise | America/Buenos\_Aires | America/Cambridge\_Bay | America/Campo\_Grande | America/Cancun | America/Caracas | America/Catamarca | America/Cayenne | America/Cayman | America/Chicago | America/Chihuahua | America/Coral\_Harbour | America/Cordoba | America/Costa\_Rica | America/Creston | America/Cuiaba | America/Curacao | America/Danmarkshavn | America/Dawson | America/Dawson\_Creek | America/Denver | America/Detroit | America/Dominica | America/Edmonton | America/Eirunepe | America/El\_Salvador | America/Ensenada | America/Fort\_Nelson | 
America/Fort\_Wayne | America/Fortaleza | America/Glace\_Bay | America/Godthab | America/Goose\_Bay | America/Grand\_Turk | America/Grenada | America/Guadeloupe | America/Guatemala | America/Guayaquil | America/Guyana | America/Halifax | America/Havana | America/Hermosillo | America/Indiana/Indianapolis | America/Indiana/Knox | America/Indiana/Marengo | America/Indiana/Petersburg | America/Indiana/Tell\_City | America/Indiana/Vevay | America/Indiana/Vincennes | America/Indiana/Winamac | America/Indianapolis | America/Inuvik | America/Iqaluit | America/Jamaica | America/Jujuy | America/Juneau | America/Kentucky/Louisville | America/Kentucky/Monticello | America/Knox\_IN | America/Kralendijk | America/La\_Paz | America/Lima | America/Los\_Angeles | America/Louisville | America/Lower\_Princes | America/Maceio | America/Managua | America/Manaus | America/Marigot | America/Martinique | America/Matamoros | America/Mazatlan | America/Mendoza | America/Menominee | America/Merida | America/Metlakatla | America/Mexico\_City | America/Miquelon | America/Moncton | America/Monterrey | America/Montevideo | America/Montreal | America/Montserrat | America/Nassau | America/New\_York | America/Nipigon | America/Nome | America/Noronha | America/North\_Dakota/Beulah | America/North\_Dakota/Center | America/North\_Dakota/New\_Salem | America/Nuuk | America/Ojinaga | America/Panama | America/Pangnirtung | America/Paramaribo | America/Phoenix | America/Port-au-Prince | America/Port\_of\_Spain | America/Porto\_Acre | America/Porto\_Velho | America/Puerto\_Rico | America/Punta\_Arenas | America/Rainy\_River | America/Rankin\_Inlet | America/Recife | America/Regina | America/Resolute | America/Rio\_Branco | America/Rosario | America/Santa\_Isabel | America/Santarem | America/Santiago | America/Santo\_Domingo | America/Sao\_Paulo | America/Scoresbysund | America/Shiprock | America/Sitka | America/St\_Barthelemy | America/St\_Johns | America/St\_Kitts | America/St\_Lucia | America/St\_Thomas | America/St\_Vincent | America/Swift\_Current | America/Tegucigalpa | America/Thule | America/Thunder\_Bay | America/Tijuana | America/Toronto | America/Tortola | America/Vancouver | America/Virgin | America/Whitehorse | America/Winnipeg | America/Yakutat | America/Yellowknife | Antarctica/Casey | Antarctica/Davis | Antarctica/DumontDUrville | Antarctica/Macquarie | Antarctica/Mawson | Antarctica/McMurdo | Antarctica/Palmer | Antarctica/Rothera | Antarctica/South\_Pole | Antarctica/Syowa | Antarctica/Troll | Antarctica/Vostok | Arctic/Longyearbyen | Asia/Aden | Asia/Almaty | Asia/Amman | Asia/Anadyr | Asia/Aqtau | Asia/Aqtobe | Asia/Ashgabat | Asia/Ashkhabad | Asia/Atyrau | Asia/Baghdad | Asia/Bahrain | Asia/Baku | Asia/Bangkok | Asia/Barnaul | Asia/Beirut | Asia/Bishkek | Asia/Brunei | Asia/Calcutta | Asia/Chita | Asia/Choibalsan | Asia/Chongqing | Asia/Chungking | Asia/Colombo | Asia/Dacca | Asia/Damascus | Asia/Dhaka | Asia/Dili | Asia/Dubai | Asia/Dushanbe | Asia/Famagusta | Asia/Gaza | Asia/Harbin | Asia/Hebron | Asia/Ho\_Chi\_Minh | Asia/Hong\_Kong | Asia/Hovd | Asia/Irkutsk | Asia/Istanbul | Asia/Jakarta | Asia/Jayapura | Asia/Jerusalem | Asia/Kabul | Asia/Kamchatka | Asia/Karachi | Asia/Kashgar | Asia/Kathmandu | Asia/Katmandu | Asia/Khandyga | Asia/Kolkata | Asia/Krasnoyarsk | Asia/Kuala\_Lumpur | Asia/Kuching | Asia/Kuwait | Asia/Macao | Asia/Macau | Asia/Magadan | Asia/Makassar | Asia/Manila | Asia/Muscat | Asia/Nicosia | Asia/Novokuznetsk | Asia/Novosibirsk | Asia/Omsk | Asia/Oral | Asia/Phnom\_Penh | Asia/Pontianak 
| Asia/Pyongyang | Asia/Qatar | Asia/Qostanay | Asia/Qyzylorda | Asia/Rangoon | Asia/Riyadh | Asia/Saigon | Asia/Sakhalin | Asia/Samarkand | Asia/Seoul | Asia/Shanghai | Asia/Singapore | Asia/Srednekolymsk | Asia/Taipei | Asia/Tashkent | Asia/Tbilisi | Asia/Tehran | Asia/Tel\_Aviv | Asia/Thimbu | Asia/Thimphu | Asia/Tokyo | Asia/Tomsk | Asia/Ujung\_Pandang | Asia/Ulaanbaatar | Asia/Ulan\_Bator | Asia/Urumqi | Asia/Ust-Nera | Asia/Vientiane | Asia/Vladivostok | Asia/Yakutsk | Asia/Yangon | Asia/Yekaterinburg | Asia/Yerevan | Atlantic/Azores | Atlantic/Bermuda | Atlantic/Canary | Atlantic/Cape\_Verde | Atlantic/Faeroe | Atlantic/Faroe | Atlantic/Jan\_Mayen | Atlantic/Madeira | Atlantic/Reykjavik | Atlantic/South\_Georgia | Atlantic/St\_Helena | Atlantic/Stanley | Australia/ACT | Australia/Adelaide | Australia/Brisbane | Australia/Broken\_Hill | Australia/Canberra | Australia/Currie | Australia/Darwin | Australia/Eucla | Australia/Hobart | Australia/LHI | Australia/Lindeman | Australia/Lord\_Howe | Australia/Melbourne | Australia/NSW | Australia/North | Australia/Perth | Australia/Queensland | Australia/South | Australia/Sydney | Australia/Tasmania | Australia/Victoria | Australia/West | Australia/Yancowinna | Brazil/Acre | Brazil/DeNoronha | Brazil/East | Brazil/West | CET | CST6CDT | Canada/Atlantic | Canada/Central | Canada/Eastern | Canada/Mountain | Canada/Newfoundland | Canada/Pacific | Canada/Saskatchewan | Canada/Yukon | Chile/Continental | Chile/EasterIsland | Cuba | EET | EST | EST5EDT | Egypt | Eire | Etc/GMT | Etc/GMT+0 | Etc/GMT+1 | Etc/GMT+10 | Etc/GMT+11 | Etc/GMT+12 | Etc/GMT+2 | Etc/GMT+3 | Etc/GMT+4 | Etc/GMT+5 | Etc/GMT+6 | Etc/GMT+7 | Etc/GMT+8 | Etc/GMT+9 | Etc/GMT-0 | Etc/GMT-1 | Etc/GMT-10 | Etc/GMT-11 | Etc/GMT-12 | Etc/GMT-13 | Etc/GMT-14 | Etc/GMT-2 | Etc/GMT-3 | Etc/GMT-4 | Etc/GMT-5 | Etc/GMT-6 | Etc/GMT-7 | Etc/GMT-8 | Etc/GMT-9 | Etc/GMT0 | Etc/Greenwich | Etc/UCT | Etc/UTC | Etc/Universal | Etc/Zulu | Europe/Amsterdam | Europe/Andorra | Europe/Astrakhan | Europe/Athens | Europe/Belfast | Europe/Belgrade | Europe/Berlin | Europe/Bratislava | Europe/Brussels | Europe/Bucharest | Europe/Budapest | Europe/Busingen | Europe/Chisinau | Europe/Copenhagen | Europe/Dublin | Europe/Gibraltar | Europe/Guernsey | Europe/Helsinki | Europe/Isle\_of\_Man | Europe/Istanbul | Europe/Jersey | Europe/Kaliningrad | Europe/Kiev | Europe/Kirov | Europe/Lisbon | Europe/Ljubljana | Europe/London | Europe/Luxembourg | Europe/Madrid | Europe/Malta | Europe/Mariehamn | Europe/Minsk | Europe/Monaco | Europe/Moscow | Europe/Nicosia | Europe/Oslo | Europe/Paris | Europe/Podgorica | Europe/Prague | Europe/Riga | Europe/Rome | Europe/Samara | Europe/San\_Marino | Europe/Sarajevo | Europe/Saratov | Europe/Simferopol | Europe/Skopje | Europe/Sofia | Europe/Stockholm | Europe/Tallinn | Europe/Tirane | Europe/Tiraspol | Europe/Ulyanovsk | Europe/Uzhgorod | Europe/Vaduz | Europe/Vatican | Europe/Vienna | Europe/Vilnius | Europe/Volgograd | Europe/Warsaw | Europe/Zagreb | Europe/Zaporozhye | Europe/Zurich | GB | GB-Eire | GMT | GMT+0 | GMT-0 | GMT0 | Greenwich | HST | Hongkong | Iceland | Indian/Antananarivo | Indian/Chagos | Indian/Christmas | Indian/Cocos | Indian/Comoro | Indian/Kerguelen | Indian/Mahe | Indian/Maldives | Indian/Mauritius | Indian/Mayotte | Indian/Reunion | Iran | Israel | Jamaica | Japan | Kwajalein | Libya | MET | MST | MST7MDT | Mexico/BajaNorte | Mexico/BajaSur | Mexico/General | NZ | NZ-CHAT | Navajo | PRC | PST8PDT | Pacific/Apia | Pacific/Auckland | Pacific/Bougainville | 
Pacific/Chatham | Pacific/Chuuk | Pacific/Easter | Pacific/Efate | Pacific/Enderbury | Pacific/Fakaofo | Pacific/Fiji | Pacific/Funafuti | Pacific/Galapagos | Pacific/Gambier | Pacific/Guadalcanal | Pacific/Guam | Pacific/Honolulu | Pacific/Johnston | Pacific/Kiritimati | Pacific/Kosrae | Pacific/Kwajalein | Pacific/Majuro | Pacific/Marquesas | Pacific/Midway | Pacific/Nauru | Pacific/Niue | Pacific/Norfolk | Pacific/Noumea | Pacific/Pago\_Pago | Pacific/Palau | Pacific/Pitcairn | Pacific/Pohnpei | Pacific/Ponape | Pacific/Port\_Moresby | Pacific/Rarotonga | Pacific/Saipan | Pacific/Samoa | Pacific/Tahiti | Pacific/Tarawa | Pacific/Tongatapu | Pacific/Truk | Pacific/Wake | Pacific/Wallis | Pacific/Yap | Poland | Portugal | ROC | ROK | Singapore | Turkey | UCT | US/Alaska | US/Aleutian | US/Arizona | US/Central | US/East-Indiana | US/Eastern | US/Hawaii | US/Indiana-Starke | US/Michigan | US/Mountain | US/Pacific | US/Samoa | UTC | Universal | W-SU | WET | Zulu ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L68)optionaltitle **title? : string ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L66)userId **userId: string --- # ScheduleActionRunActor ### Hierarchy * BaseScheduleAction\ * *ScheduleActionRunActor* ## Index[**](#Index) ### Properties * [**actorId](#actorId) * [**id](#id) * [**runInput](#runInput) * [**runOptions](#runOptions) * [**type](#type) ## Properties[**](#Properties) ### [**](#actorId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L106)actorId **actorId: string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L99)inheritedid **id: string Inherited from BaseScheduleAction.id ### [**](#runInput)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L107)optionalrunInput **runInput? : [ScheduledActorRunInput](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduledActorRunInput.md) ### [**](#runOptions)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L108)optionalrunOptions **runOptions? : [ScheduledActorRunOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ScheduledActorRunOptions.md) ### [**](#type)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L100)inheritedtype **type: RunActor Inherited from BaseScheduleAction.type --- # ScheduleActionRunActorTask ### Hierarchy * BaseScheduleAction\ * *ScheduleActionRunActorTask* ## Index[**](#Index) ### Properties * [**actorTaskId](#actorTaskId) * [**id](#id) * [**input](#input) * [**type](#type) ## Properties[**](#Properties) ### [**](#actorTaskId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L123)actorTaskId **actorTaskId: string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L99)inheritedid **id: string Inherited from BaseScheduleAction.id ### [**](#input)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L124)optionalinput **input? 
: string ### [**](#type)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L100)inheritedtype **type: RunActorTask Inherited from BaseScheduleAction.type --- # ScheduleCollectionListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule_collection.ts#L48)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule_collection.ts#L46)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule_collection.ts#L47)optionaloffset **offset? : number --- # ScheduledActorRunInput ## Index[**](#Index) ### Properties * [**body](#body) * [**contentType](#contentType) ## Properties[**](#Properties) ### [**](#body)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L112)body **body: string ### [**](#contentType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L113)contentType **contentType: string --- # ScheduledActorRunOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**memoryMbytes](#memoryMbytes) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L117)build **build: string ### [**](#memoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L119)memoryMbytes **memoryMbytes: number ### [**](#timeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/schedule.ts#L118)timeoutSecs **timeoutSecs: number --- # StoreCollectionListOptions ## Index[**](#Index) ### Properties * [**category](#category) * [**limit](#limit) * [**offset](#offset) * [**pricingModel](#pricingModel) * [**search](#search) * [**sortBy](#sortBy) * [**username](#username) ## Properties[**](#Properties) ### [**](#category)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L62)optionalcategory **category? : string ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L58)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L59)optionaloffset **offset? : number ### [**](#pricingModel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L64)optionalpricingModel **pricingModel? : string ### [**](#search)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L60)optionalsearch **search? : string ### [**](#sortBy)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L61)optionalsortBy **sortBy? : string ### [**](#username)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/store_collection.ts#L63)optionalusername **username? 
: string --- # Task ## Index[**](#Index) ### Properties * [**actId](#actId) * [**actorStandby](#actorStandby) * [**createdAt](#createdAt) * [**description](#description) * [**id](#id) * [**input](#input) * [**modifiedAt](#modifiedAt) * [**name](#name) * [**options](#options) * [**stats](#stats) * [**title](#title) * [**userId](#userId) * [**username](#username) ## Properties[**](#Properties) ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L204)actId **actId: string ### [**](#actorStandby)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L214)optionalactorStandby **actorStandby? : Partial<[ActorStandby](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStandby.md)> ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L209)createdAt **createdAt: Date ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L207)optionaldescription **description? : string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L202)id **id: string ### [**](#input)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L213)optionalinput **input? : [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) | [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary)\[] ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L210)modifiedAt **modifiedAt: Date ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L205)name **name: string ### [**](#options)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L212)optionaloptions **options? : [TaskOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskOptions.md) ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L211)stats **stats: [TaskStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskStats.md) ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L206)optionaltitle **title? : string ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L203)userId **userId: string ### [**](#username)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L208)optionalusername **username? : string --- # TaskCallOptions ### Hierarchy * Omit<[TaskStartOptions](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskStartOptions), waitForFinish> * *TaskCallOptions* ## Index[**](#Index) ### Properties * [**build](#build) * [**maxItems](#maxItems) * [**memory](#memory) * [**timeout](#timeout) * [**waitSecs](#waitSecs) * [**webhooks](#webhooks) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L350)optionalinheritedbuild **build? : string Inherited from Omit.build Tag or number of the actor build to run (e.g. `beta` or `1.2.345`). If not provided, the run uses build tag or number from the default actor run configuration (typically `latest`). 
### [**](#maxItems)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L393)optionalinheritedmaxItems **maxItems? : number Inherited from Omit.maxItems Specifies the maximum number of items that the actor run should return. This is used by pay-per-result Actors to limit the maximum number of results that will be charged to the customer. The value can be accessed in the actor run using the `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable. ### [**](#memory)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L364)optionalinheritedmemory **memory? : number Inherited from Omit.memory Memory in megabytes which will be allocated for the new actor run. If not provided, the run uses the memory of the default actor run configuration. ### [**](#timeout)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L369)optionalinheritedtimeout **timeout? : number Inherited from Omit.timeout Timeout for the actor run in seconds. A zero value means there is no timeout. If not provided, the run uses the timeout of the default actor run configuration. ### [**](#waitSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L238)optionalwaitSecs **waitSecs? : number ### [**](#webhooks)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/actor.ts#L386)optionalinheritedwebhooks **webhooks? : readonly [WebhookUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookUpdateData)\[] Inherited from Omit.webhooks Specifies optional webhooks associated with the actor run, which can be used to receive a notification, e.g. when the actor finishes or fails; see the [ad-hoc webhooks documentation](https://docs.apify.com/webhooks/ad-hoc-webhooks) for a detailed description. --- # TaskCollectionListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task_collection.ts#L53)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task_collection.ts#L51)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task_collection.ts#L52)optionaloffset **offset? : number --- # TaskCreateData ### Hierarchy * [TaskUpdateData](https://docs.apify.com/api/client/js/api/client/js/reference.md#TaskUpdateData) * *TaskCreateData* ## Index[**](#Index) ### Properties * [**actId](#actId) * [**actorStandby](#actorStandby) * [**description](#description) * [**input](#input) * [**name](#name) * [**options](#options) * [**title](#title) ## Properties[**](#Properties) ### [**](#actId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task_collection.ts#L59)actId **actId: string ### [**](#actorStandby)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L214)optionalinheritedactorStandby **actorStandby? : Partial<[ActorStandby](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ActorStandby.md)> Inherited from TaskUpdateData.actorStandby ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L207)optionalinheriteddescription **description?
: string Inherited from TaskUpdateData.description ### [**](#input)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L213)optionalinheritedinput **input? : [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary) | [Dictionary](https://docs.apify.com/api/client/js/api/client/js/reference.md#Dictionary)\[] Inherited from TaskUpdateData.input ### [**](#name)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L205)optionalinheritedname **name? : string Inherited from TaskUpdateData.name ### [**](#options)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L212)optionalinheritedoptions **options? : [TaskOptions](https://docs.apify.com/api/client/js/api/client/js/reference/interface/TaskOptions.md) Inherited from TaskUpdateData.options ### [**](#title)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L206)optionalinheritedtitle **title? : string Inherited from TaskUpdateData.title --- # TaskLastRunOptions ## Index[**](#Index) ### Properties * [**status](#status) ## Properties[**](#Properties) ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L232)optionalstatus **status? : READY | RUNNING | SUCCEEDED | FAILED | TIMING\_OUT | TIMED\_OUT | ABORTING | ABORTED --- # TaskOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**memoryMbytes](#memoryMbytes) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L222)optionalbuild **build? : string ### [**](#memoryMbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L224)optionalmemoryMbytes **memoryMbytes? : number ### [**](#timeoutSecs)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L223)optionaltimeoutSecs **timeoutSecs? : number --- # TaskStats ## Index[**](#Index) ### Properties * [**totalRuns](#totalRuns) ## Properties[**](#Properties) ### [**](#totalRuns)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/task.ts#L218)totalRuns **totalRuns: number --- # UsageCycle ## Index[**](#Index) ### Properties * [**endAt](#endAt) * [**startAt](#startAt) ## Properties[**](#Properties) ### [**](#endAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L190)endAt **endAt: Date ### [**](#startAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L189)startAt **startAt: Date --- # User ## Index[**](#Index) ### Properties * [**createdAt](#createdAt) * [**effectivePlatformFeatures](#effectivePlatformFeatures) * [**email](#email) * [**id](#id) * [**isPaying](#isPaying) * [**plan](#plan) * [**profile](#profile) * [**proxy](#proxy) * [**username](#username) ## Properties[**](#Properties) ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L107)optionalcreatedAt **createdAt? : Date ### [**](#effectivePlatformFeatures)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L106)optionaleffectivePlatformFeatures **effectivePlatformFeatures? : EffectivePlatformFeatures ### [**](#email)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L103)optionalemail **email? 
: string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L102)optionalid **id? : string ### [**](#isPaying)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L108)optionalisPaying **isPaying? : boolean ### [**](#plan)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L105)optionalplan **plan? : [UserPlan](https://docs.apify.com/api/client/js/api/client/js/reference/interface/UserPlan.md) ### [**](#profile)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L93)profile **profile: { bio? : string; githubUsername? : string; name? : string; pictureUrl? : string; twitterUsername? : string; websiteUrl? : string } #### Type declaration * ##### optionalbio?: string * ##### optionalgithubUsername?: string * ##### optionalname?: string * ##### optionalpictureUrl?: string * ##### optionaltwitterUsername?: string * ##### optionalwebsiteUrl?: string ### [**](#proxy)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L104)optionalproxy **proxy? : [UserProxy](https://docs.apify.com/api/client/js/api/client/js/reference/interface/UserProxy.md) ### [**](#username)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L92)username **username: string --- # UserPlan ## Index[**](#Index) ### Properties * [**availableAddOns](#availableAddOns) * [**availableProxyGroups](#availableProxyGroups) * [**dataRetentionDays](#dataRetentionDays) * [**description](#description) * [**enabledPlatformFeatures](#enabledPlatformFeatures) * [**id](#id) * [**isEnabled](#isEnabled) * [**maxActorCount](#maxActorCount) * [**maxActorMemoryGbytes](#maxActorMemoryGbytes) * [**maxActorTaskCount](#maxActorTaskCount) * [**maxMonthlyActorComputeUnits](#maxMonthlyActorComputeUnits) * [**maxMonthlyExternalDataTransferGbytes](#maxMonthlyExternalDataTransferGbytes) * [**maxMonthlyProxySerps](#maxMonthlyProxySerps) * [**maxMonthlyResidentialProxyGbytes](#maxMonthlyResidentialProxyGbytes) * [**maxMonthlyUsageUsd](#maxMonthlyUsageUsd) * [**monthlyBasePriceUsd](#monthlyBasePriceUsd) * [**monthlyUsageCreditsUsd](#monthlyUsageCreditsUsd) * [**supportLevel](#supportLevel) * [**teamAccountSeatCount](#teamAccountSeatCount) * [**usageDiscountPercent](#usageDiscountPercent) ## Properties[**](#Properties) ### [**](#availableAddOns)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L142)availableAddOns **availableAddOns: unknown\[] ### [**](#availableProxyGroups)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L139)availableProxyGroups **availableProxyGroups: Record\ ### [**](#dataRetentionDays)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L138)dataRetentionDays **dataRetentionDays: number ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L124)description **description: string ### [**](#enabledPlatformFeatures)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L129)enabledPlatformFeatures **enabledPlatformFeatures: [PlatformFeature](https://docs.apify.com/api/client/js/api/client/js/reference/enum/PlatformFeature.md)\[] ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L123)id **id: string ### 
[**](#isEnabled)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L125)isEnabled **isEnabled: boolean ### [**](#maxActorCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L136)maxActorCount **maxActorCount: number ### [**](#maxActorMemoryGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L131)maxActorMemoryGbytes **maxActorMemoryGbytes: number ### [**](#maxActorTaskCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L137)maxActorTaskCount **maxActorTaskCount: number ### [**](#maxMonthlyActorComputeUnits)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L132)maxMonthlyActorComputeUnits **maxMonthlyActorComputeUnits: number ### [**](#maxMonthlyExternalDataTransferGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L135)maxMonthlyExternalDataTransferGbytes **maxMonthlyExternalDataTransferGbytes: number ### [**](#maxMonthlyProxySerps)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L134)maxMonthlyProxySerps **maxMonthlyProxySerps: number ### [**](#maxMonthlyResidentialProxyGbytes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L133)maxMonthlyResidentialProxyGbytes **maxMonthlyResidentialProxyGbytes: number ### [**](#maxMonthlyUsageUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L130)maxMonthlyUsageUsd **maxMonthlyUsageUsd: number ### [**](#monthlyBasePriceUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L126)monthlyBasePriceUsd **monthlyBasePriceUsd: number ### [**](#monthlyUsageCreditsUsd)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L127)monthlyUsageCreditsUsd **monthlyUsageCreditsUsd: number ### [**](#supportLevel)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L141)supportLevel **supportLevel: string ### [**](#teamAccountSeatCount)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L140)teamAccountSeatCount **teamAccountSeatCount: number ### [**](#usageDiscountPercent)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L128)usageDiscountPercent **usageDiscountPercent: number --- # UserProxy ## Index[**](#Index) ### Properties * [**groups](#groups) * [**password](#password) ## Properties[**](#Properties) ### [**](#groups)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L113)groups **groups: [ProxyGroup](https://docs.apify.com/api/client/js/api/client/js/reference/interface/ProxyGroup.md)\[] ### [**](#password)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/user.ts#L112)password **password: string --- # Webhook ## Index[**](#Index) ### Properties * [**condition](#condition) * [**createdAt](#createdAt) * [**description](#description) * [**doNotRetry](#doNotRetry) * [**eventTypes](#eventTypes) * [**headersTemplate](#headersTemplate) * [**id](#id) * [**ignoreSslErrors](#ignoreSslErrors) * [**isAdHoc](#isAdHoc) * [**isApifyIntegration](#isApifyIntegration) * [**lastDispatch](#lastDispatch) * [**modifiedAt](#modifiedAt) * [**payloadTemplate](#payloadTemplate) * [**requestUrl](#requestUrl) * [**shouldInterpolateStrings](#shouldInterpolateStrings) * [**stats](#stats) * 
[**userId](#userId) ## Properties[**](#Properties) ### [**](#condition)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L86)condition **condition: [WebhookCondition](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookCondition) ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L82)createdAt **createdAt: Date ### [**](#description)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L96)optionaldescription **description? : string ### [**](#doNotRetry)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L88)doNotRetry **doNotRetry: boolean ### [**](#eventTypes)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L85)eventTypes **eventTypes: [WebhookEventType](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookEventType)\[] ### [**](#headersTemplate)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L95)optionalheadersTemplate **headersTemplate? : string ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L80)id **id: string ### [**](#ignoreSslErrors)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L87)ignoreSslErrors **ignoreSslErrors: boolean ### [**](#isAdHoc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L84)isAdHoc **isAdHoc: boolean ### [**](#isApifyIntegration)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L94)optionalisApifyIntegration **isApifyIntegration? : boolean ### [**](#lastDispatch)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L91)lastDispatch **lastDispatch: string ### [**](#modifiedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L83)modifiedAt **modifiedAt: Date ### [**](#payloadTemplate)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L90)payloadTemplate **payloadTemplate: string ### [**](#requestUrl)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L89)requestUrl **requestUrl: string ### [**](#shouldInterpolateStrings)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L93)shouldInterpolateStrings **shouldInterpolateStrings: boolean ### [**](#stats)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L92)stats **stats: [WebhookStats](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookStats.md) ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L81)userId **userId: string --- # WebhookAnyRunOfActorCondition ## Index[**](#Index) ### Properties * [**actorId](#actorId) ## Properties[**](#Properties) ### [**](#actorId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L133)actorId **actorId: string --- # WebhookAnyRunOfActorTaskCondition ## Index[**](#Index) ### Properties * [**actorTaskId](#actorTaskId) ## Properties[**](#Properties) ### [**](#actorTaskId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L137)actorTaskId **actorTaskId: string --- # WebhookCertainRunCondition ## Index[**](#Index) ### Properties * 
[**actorRunId](#actorRunId) ## Properties[**](#Properties) ### [**](#actorRunId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L141)actorRunId **actorRunId: string --- # WebhookCollectionListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_collection.ts#L50)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_collection.ts#L48)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_collection.ts#L49)optionaloffset **offset? : number --- # WebhookDispatch ## Index[**](#Index) ### Properties * [**calls](#calls) * [**createdAt](#createdAt) * [**eventType](#eventType) * [**id](#id) * [**status](#status) * [**userId](#userId) * [**webhook](#webhook) * [**webhookId](#webhookId) ## Properties[**](#Properties) ### [**](#calls)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L31)calls **calls: [WebhookDispatchCall](https://docs.apify.com/api/client/js/api/client/js/reference/interface/WebhookDispatchCall.md)\[] ### [**](#createdAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L28)createdAt **createdAt: Date ### [**](#eventType)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L30)eventType **eventType: [WebhookEventType](https://docs.apify.com/api/client/js/api/client/js/reference.md#WebhookEventType) ### [**](#id)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L25)id **id: string ### [**](#status)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L29)status **status: [WebhookDispatchStatus](https://docs.apify.com/api/client/js/api/client/js/reference/enum/WebhookDispatchStatus.md) ### [**](#userId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L26)userId **userId: string ### [**](#webhook)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L32)webhook **webhook: Pick<[Webhook](https://docs.apify.com/api/client/js/api/client/js/reference/interface/Webhook.md), isAdHoc | requestUrl> ### [**](#webhookId)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L27)webhookId **webhookId: string --- # WebhookDispatchCall ## Index[**](#Index) ### Properties * [**errorMessage](#errorMessage) * [**finishedAt](#finishedAt) * [**responseBody](#responseBody) * [**responseStatus](#responseStatus) * [**startedAt](#startedAt) ## Properties[**](#Properties) ### [**](#errorMessage)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L44)errorMessage **errorMessage: null | string ### [**](#finishedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L43)finishedAt **finishedAt: Date ### [**](#responseBody)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L46)responseBody **responseBody: null | string ### 
[**](#responseStatus)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L45)responseStatus **responseStatus: null | number ### [**](#startedAt)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch.ts#L42)startedAt **startedAt: Date --- # WebhookDispatchCollectionListOptions ## Index[**](#Index) ### Properties * [**desc](#desc) * [**limit](#limit) * [**offset](#offset) ## Properties[**](#Properties) ### [**](#desc)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch_collection.ts#L39)optionaldesc **desc? : boolean ### [**](#limit)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch_collection.ts#L37)optionallimit **limit? : number ### [**](#offset)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook_dispatch_collection.ts#L38)optionaloffset **offset? : number --- # WebhookIdempotencyKey ## Index[**](#Index) ### Properties * [**idempotencyKey](#idempotencyKey) ## Properties[**](#Properties) ### [**](#idempotencyKey)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L100)optionalidempotencyKey **idempotencyKey? : string --- # WebhookStats ## Index[**](#Index) ### Properties * [**totalDispatches](#totalDispatches) ## Properties[**](#Properties) ### [**](#totalDispatches)[**](https://github.com/apify/apify-client-js/blob/master/src/resource_clients/webhook.ts#L122)totalDispatches **totalDispatches: number
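The webhook and webhook dispatch interfaces above are plain data shapes returned by the client. As an illustrative sketch only (the token is a placeholder and the field handling is an example, not the library's documented snippet), the `desc`/`limit` options from `WebhookDispatchCollectionListOptions` and the result fields described above can be used roughly like this:

```
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'MY-APIFY-TOKEN', // placeholder token
});

// List recent webhook dispatches, newest first
// (desc/limit come from WebhookDispatchCollectionListOptions).
const { items } = await client.webhookDispatches().list({ desc: true, limit: 10 });

for (const dispatch of items) {
    // id, status, eventType and webhookId are described by the WebhookDispatch interface.
    console.log(dispatch.id, dispatch.status, dispatch.eventType, dispatch.webhookId);

    // Each call records responseStatus and errorMessage (both may be null),
    // as described by the WebhookDispatchCall interface.
    for (const call of dispatch.calls ?? []) {
        console.log('  ', call.responseStatus, call.errorMessage);
    }
}
```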
[2.6](https://docs.apify.com/api/client/js/api/client/js/docs/2.6)

# Apify API client for JavaScript

The official library for interacting with the Apify API from web browsers, Node.js, JavaScript, and TypeScript applications, providing convenience functions and automatic retries on errors.

[Get Started](https://docs.apify.com/api/client/js/api/client/js/docs.md) [GitHub](https://github.com/apify/apify-client-js)

```
npm install apify-client
```

Easily run Actors, wait for them to finish using the convenient `.call()` method, and retrieve results from the resulting dataset.

```
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'MY-APIFY-TOKEN',
});

// Starts an Actor and waits for it to finish.
const { defaultDatasetId } = await client.actor('john-doe/my-cool-actor').call();

// Fetches results from the Actor's dataset.
const { items } = await client.dataset(defaultDatasetId).listItems();
```

---

# API client for Python | Apify Documentation

## api

- [Search the documentation](https://docs.apify.com/api/client/python/search.md)
- [Changelog](https://docs.apify.com/api/client/python/docs/changelog.md): All notable changes to this project will be documented in this file.
- [Asyncio support](https://docs.apify.com/api/client/python/docs/concepts/asyncio-support.md): The package provides an asynchronous version of the client, ApifyClientAsync, which allows you to interact with the Apify API using Python's standard async/await syntax. This enables you to perform non-blocking operations, see the Python asyncio documentation for more information.
- [Convenience methods](https://docs.apify.com/api/client/python/docs/concepts/convenience-methods.md): The Apify client provides several convenience methods to handle actions that the API alone cannot perform efficiently, such as waiting for an Actor run to finish without running into network timeouts. These methods simplify common tasks and enhance the usability of the client.
- [Error handling](https://docs.apify.com/api/client/python/docs/concepts/error-handling.md): When you use the Apify client, it automatically extracts all relevant data from the endpoint and returns it in the expected format. Date strings, for instance, are seamlessly converted to Python datetime.datetime objects. If an error occurs, the client raises an ApifyApiError. This exception wraps the raw JSON errors returned by the API and provides additional context, making it easier to debug any issues that arise.
- [Logging](https://docs.apify.com/api/client/python/docs/concepts/logging.md): The library logs useful debug information to the apify_client logger whenever it sends requests to the Apify API.
You can configure this logger to print debug information to the standard output by adding a handler: - [Nested clients](https://docs.apify.com/api/client/python/docs/concepts/nested-clients.md): In some cases, the Apify client provides nested clients to simplify working with related collections. For example, you can easily manage the runs of a specific Actor without having to construct multiple endpoints or client instances manually. - [Pagination](https://docs.apify.com/api/client/python/docs/concepts/pagination.md): Most methods named list or list_something in the Apify client return a ListPage object. This object provides a consistent interface for working with paginated data and includes the following properties: - [Retries](https://docs.apify.com/api/client/python/docs/concepts/retries.md): When dealing with network communication, failures can occasionally occur. The Apify client automatically retries requests that fail due to: - [Single and collection clients](https://docs.apify.com/api/client/python/docs/concepts/single-and-collection-clients.md): The Apify client interface is designed to be consistent and intuitive across all of its components. When you call specific methods on the main client, you create specialized clients to manage individual API resources. There are two main types of clients: - [Streaming resources](https://docs.apify.com/api/client/python/docs/concepts/streaming-resources.md): Certain resources, such as dataset items, key-value store records, and logs, support streaming directly from the Apify API. This allows you to process large resources incrementally without downloading them entirely into memory, making it ideal for handling large or continuously updated data. - [Integration with data libraries](https://docs.apify.com/api/client/python/docs/examples/integration-with-data-libraries.md): The Apify client for Python seamlessly integrates with data analysis libraries like Pandas. This allows you to load dataset items directly into a Pandas DataFrame for efficient manipulation and analysis. Pandas provides robust data structures and tools for handling large datasets, making it a powerful addition to your Apify workflows. - [Manage tasks for reusable input](https://docs.apify.com/api/client/python/docs/examples/manage-tasks-for-reusable-input.md): When you need to run multiple inputs with the same Actor, the most convenient approach is to create multiple tasks, each with different input configurations. Task inputs are stored on the Apify platform when the task is created, allowing you to reuse them easily. - [Passing input to Actor](https://docs.apify.com/api/client/python/docs/examples/passing-input-to-actor.md): The efficient way to run an Actor and retrieve results is by passing input data directly to the call method. This method allows you to configure the Actor's input, execute it, and either get a reference to the running Actor or wait for its completion. - [Retrieve Actor data](https://docs.apify.com/api/client/python/docs/examples/retrieve-actor-data.md): Actor output data is stored in datasets, which can be retrieved from individual Actor runs. Dataset items support pagination for efficient retrieval, and multiple datasets can be merged into a single dataset for further analysis. This merged dataset can then be exported into various formats such as CSV, JSON, XLSX, or XML. Additionally, integrations provide powerful tools to automate data workflows. 
- [Getting started](https://docs.apify.com/api/client/python/docs/overview/getting-started.md): This guide will walk you through how to use the Apify Client for Python to run Actors on the Apify platform, provide input to them, and retrieve results from their datasets. You'll learn the basics of running serverless programs (we're calling them Actors) and managing their output efficiently. - [Introduction](https://docs.apify.com/api/client/python/docs/overview/introduction.md): The Apify client for Python is the official library to access the Apify REST API from your Python applications. It provides useful features like automatic retries and convenience functions that improve the experience of using the Apify API. All requests and responses (including errors) are encoded in JSON format with UTF-8 encoding. The client provides both synchronous and asynchronous interfaces. - [Setting up](https://docs.apify.com/api/client/python/docs/overview/setting-up.md): This guide will help you get started with Apify client for Python by setting it up on your computer. Follow the steps below to ensure a smooth installation process. - [apify-client-python](https://docs.apify.com/api/client/python/reference.md) - [_BaseApifyClient](https://docs.apify.com/api/client/python/reference/class/_BaseApifyClient.md) - [_BaseBaseClient](https://docs.apify.com/api/client/python/reference/class/_BaseBaseClient.md) - [_BaseHTTPClient](https://docs.apify.com/api/client/python/reference/class/_BaseHTTPClient.md) - [_ContextInjectingFilter](https://docs.apify.com/api/client/python/reference/class/_ContextInjectingFilter.md) - [_DebugLogFormatter](https://docs.apify.com/api/client/python/reference/class/_DebugLogFormatter.md) - [ActorClient](https://docs.apify.com/api/client/python/reference/class/ActorClient.md): Sub-client for manipulating a single Actor. - [ActorClientAsync](https://docs.apify.com/api/client/python/reference/class/ActorClientAsync.md): Async sub-client for manipulating a single Actor. - [ActorCollectionClient](https://docs.apify.com/api/client/python/reference/class/ActorCollectionClient.md): Sub-client for manipulating Actors. - [ActorCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/ActorCollectionClientAsync.md): Async sub-client for manipulating Actors. - [ActorEnvVarClient](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarClient.md): Sub-client for manipulating a single Actor environment variable. - [ActorEnvVarClientAsync](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarClientAsync.md): Async sub-client for manipulating a single Actor environment variable. - [ActorEnvVarCollectionClient](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarCollectionClient.md): Sub-client for manipulating actor env vars. - [ActorEnvVarCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/ActorEnvVarCollectionClientAsync.md): Async sub-client for manipulating actor env vars. - [ActorVersionClient](https://docs.apify.com/api/client/python/reference/class/ActorVersionClient.md): Sub-client for manipulating a single Actor version. - [ActorVersionClientAsync](https://docs.apify.com/api/client/python/reference/class/ActorVersionClientAsync.md): Async sub-client for manipulating a single Actor version. - [ActorVersionCollectionClient](https://docs.apify.com/api/client/python/reference/class/ActorVersionCollectionClient.md): Sub-client for manipulating Actor versions. 
- [ActorVersionCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/ActorVersionCollectionClientAsync.md): Async sub-client for manipulating Actor versions. - [ApifyApiError](https://docs.apify.com/api/client/python/reference/class/ApifyApiError.md): Error specific to requests to the Apify API. An `ApifyApiError` is thrown for successful HTTP requests that reach the API, but the API responds with an error response. Typically, those are rate limit errors and internal errors, which are automatically retried, or validation errors, which are thrown immediately, because a correction by the user is needed. - [ApifyClient](https://docs.apify.com/api/client/python/reference/class/ApifyClient.md): The Apify API client. - [ApifyClientAsync](https://docs.apify.com/api/client/python/reference/class/ApifyClientAsync.md): The asynchronous version of the Apify API client. - [ApifyClientError](https://docs.apify.com/api/client/python/reference/class/ApifyClientError.md): Base class for errors specific to the Apify API Client. - [BatchAddRequestsResult](https://docs.apify.com/api/client/python/reference/class/BatchAddRequestsResult.md): Result of the batch add requests operation. - [BuildClient](https://docs.apify.com/api/client/python/reference/class/BuildClient.md): Sub-client for manipulating a single Actor build. - [BuildClientAsync](https://docs.apify.com/api/client/python/reference/class/BuildClientAsync.md): Async sub-client for manipulating a single Actor build. - [BuildCollectionClient](https://docs.apify.com/api/client/python/reference/class/BuildCollectionClient.md): Sub-client for listing Actor builds. - [BuildCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/BuildCollectionClientAsync.md): Async sub-client for listing Actor builds. - [DatasetClient](https://docs.apify.com/api/client/python/reference/class/DatasetClient.md): Sub-client for manipulating a single dataset. - [DatasetClientAsync](https://docs.apify.com/api/client/python/reference/class/DatasetClientAsync.md): Async sub-client for manipulating a single dataset. - [DatasetCollectionClient](https://docs.apify.com/api/client/python/reference/class/DatasetCollectionClient.md): Sub-client for manipulating datasets. - [DatasetCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/DatasetCollectionClientAsync.md): Async sub-client for manipulating datasets. - [HTTPClient](https://docs.apify.com/api/client/python/reference/class/HTTPClient.md) - [HTTPClientAsync](https://docs.apify.com/api/client/python/reference/class/HTTPClientAsync.md) - [InvalidResponseBodyError](https://docs.apify.com/api/client/python/reference/class/InvalidResponseBodyError.md): Error caused by the response body failing to be parsed. This error exists for the quite common situation, where only a partial JSON response is received and an attempt to parse the JSON throws an error. In most cases this can be resolved by retrying the request. We do that by identifying this error in the HTTPClient. - [KeyValueStoreClient](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClient.md): Sub-client for manipulating a single key-value store. - [KeyValueStoreClientAsync](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreClientAsync.md): Async sub-client for manipulating a single key-value store. 
- [KeyValueStoreCollectionClient](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreCollectionClient.md): Sub-client for manipulating key-value stores. - [KeyValueStoreCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/KeyValueStoreCollectionClientAsync.md): Async sub-client for manipulating key-value stores. - [ListPage](https://docs.apify.com/api/client/python/reference/class/ListPage.md): A single page of items returned from a list() method. - [LogClient](https://docs.apify.com/api/client/python/reference/class/LogClient.md): Sub-client for manipulating logs. - [LogClientAsync](https://docs.apify.com/api/client/python/reference/class/LogClientAsync.md): Async sub-client for manipulating logs. - [LogContext](https://docs.apify.com/api/client/python/reference/class/LogContext.md) - [RedirectLogFormatter](https://docs.apify.com/api/client/python/reference/class/RedirectLogFormatter.md): Formater applied to default redirect logger. - [RequestQueueClient](https://docs.apify.com/api/client/python/reference/class/RequestQueueClient.md): Sub-client for manipulating a single request queue. - [RequestQueueClientAsync](https://docs.apify.com/api/client/python/reference/class/RequestQueueClientAsync.md): Async sub-client for manipulating a single request queue. - [RequestQueueCollectionClient](https://docs.apify.com/api/client/python/reference/class/RequestQueueCollectionClient.md): Sub-client for manipulating request queues. - [RequestQueueCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/RequestQueueCollectionClientAsync.md): Async sub-client for manipulating request queues. - [RunClient](https://docs.apify.com/api/client/python/reference/class/RunClient.md): Sub-client for manipulating a single Actor run. - [RunClientAsync](https://docs.apify.com/api/client/python/reference/class/RunClientAsync.md): Async sub-client for manipulating a single Actor run. - [RunCollectionClient](https://docs.apify.com/api/client/python/reference/class/RunCollectionClient.md): Sub-client for listing Actor runs. - [RunCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/RunCollectionClientAsync.md): Async sub-client for listing Actor runs. - [ScheduleClient](https://docs.apify.com/api/client/python/reference/class/ScheduleClient.md): Sub-client for manipulating a single schedule. - [ScheduleClientAsync](https://docs.apify.com/api/client/python/reference/class/ScheduleClientAsync.md): Async sub-client for manipulating a single schedule. - [ScheduleCollectionClient](https://docs.apify.com/api/client/python/reference/class/ScheduleCollectionClient.md): Sub-client for manipulating schedules. - [ScheduleCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/ScheduleCollectionClientAsync.md): Async sub-client for manipulating schedules. - [Statistics](https://docs.apify.com/api/client/python/reference/class/Statistics.md): Statistics about API client usage and rate limit errors. - [StatusMessageWatcher](https://docs.apify.com/api/client/python/reference/class/StatusMessageWatcher.md): Utility class for logging status messages from another Actor run. Status message is logged at fixed time intervals, and there is no guarantee that all messages will be logged, especially in cases of frequent status message changes. - [StatusMessageWatcherAsync](https://docs.apify.com/api/client/python/reference/class/StatusMessageWatcherAsync.md): Async variant of `StatusMessageWatcher` that is logging in task. 
- [StatusMessageWatcherSync](https://docs.apify.com/api/client/python/reference/class/StatusMessageWatcherSync.md): Sync variant of `StatusMessageWatcher` that is logging in thread. - [StoreCollectionClient](https://docs.apify.com/api/client/python/reference/class/StoreCollectionClient.md): Sub-client for Apify store. - [StoreCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/StoreCollectionClientAsync.md): Async sub-client for Apify store. - [StreamedLog](https://docs.apify.com/api/client/python/reference/class/StreamedLog.md): Utility class for streaming logs from another Actor. It uses buffer to deal with possibly chunked logs. Chunked logs are stored in buffer. Chunks are expected to contain specific markers that indicate the start of the log message. Each time a new chunk with complete split marker arrives, the buffer is processed, logged and emptied. This works only if the logs have datetime marker in ISO format. For example, `2025-05-12T15:35:59.429Z` This is the default log standard for the actors. - [StreamedLogAsync](https://docs.apify.com/api/client/python/reference/class/StreamedLogAsync.md): Async variant of `StreamedLog` that is logging in tasks. - [StreamedLogSync](https://docs.apify.com/api/client/python/reference/class/StreamedLogSync.md): Sync variant of `StreamedLog` that is logging in threads. - [TaskClient](https://docs.apify.com/api/client/python/reference/class/TaskClient.md): Sub-client for manipulating a single task. - [TaskClientAsync](https://docs.apify.com/api/client/python/reference/class/TaskClientAsync.md): Async sub-client for manipulating a single task. - [TaskCollectionClient](https://docs.apify.com/api/client/python/reference/class/TaskCollectionClient.md): Sub-client for manipulating tasks. - [TaskCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/TaskCollectionClientAsync.md): Async sub-client for manipulating tasks. - [UserClient](https://docs.apify.com/api/client/python/reference/class/UserClient.md): Sub-client for querying user data. - [UserClientAsync](https://docs.apify.com/api/client/python/reference/class/UserClientAsync.md): Async sub-client for querying user data. - [WebhookClient](https://docs.apify.com/api/client/python/reference/class/WebhookClient.md): Sub-client for manipulating a single webhook. - [WebhookClientAsync](https://docs.apify.com/api/client/python/reference/class/WebhookClientAsync.md): Async sub-client for manipulating a single webhook. - [WebhookCollectionClient](https://docs.apify.com/api/client/python/reference/class/WebhookCollectionClient.md): Sub-client for manipulating webhooks. - [WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/WebhookCollectionClientAsync.md): Async sub-client for manipulating webhooks. - [WebhookDispatchClient](https://docs.apify.com/api/client/python/reference/class/WebhookDispatchClient.md): Sub-client for querying information about a webhook dispatch. - [WebhookDispatchClientAsync](https://docs.apify.com/api/client/python/reference/class/WebhookDispatchClientAsync.md): Async sub-client for querying information about a webhook dispatch. - [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/python/reference/class/WebhookDispatchCollectionClient.md): Sub-client for listing webhook dispatches. - [WebhookDispatchCollectionClientAsync](https://docs.apify.com/api/client/python/reference/class/WebhookDispatchCollectionClientAsync.md): Async sub-client for listing webhook dispatches. 
- [WithLogDetailsClient](https://docs.apify.com/api/client/python/reference/class/WithLogDetailsClient.md)
- [Apify API client for Python](https://docs.apify.com/api/client/python/index.md)

---

# Full Documentation Content

---

# Changelog

All notable changes to this project will be documented in this file.
### 1.12.3 - **not yet released**[​](#1123---not-yet-released "Direct link to 1123---not-yet-released") #### 🚀 Features[​](#-features "Direct link to 🚀 Features") * Extend status parameter to an array of possible statuses ([#455](https://github.com/apify/apify-client-python/pull/455)) ([76f6769](https://github.com/apify/apify-client-python/commit/76f676973d067ce8af398d8e6ceea55595da5ecf)) by [@JanHranicky](https://github.com/JanHranicky) ### [1.12.2](https://github.com/apify/apify-client-python/releases/tag/v1.12.2) (2025-08-08)[​](#1122-2025-08-08 "Direct link to 1122-2025-08-08") #### 🐛 Bug Fixes[​](#-bug-fixes "Direct link to 🐛 Bug Fixes") * Fix API error with stream ([#459](https://github.com/apify/apify-client-python/pull/459)) ([0c91ca5](https://github.com/apify/apify-client-python/commit/0c91ca516a01a6fca7bc8fa07f7bf9c15c75bf9d)) by [@Pijukatel](https://github.com/Pijukatel) ### [1.12.1](https://github.com/apify/apify-client-python/releases/tag/v1.12.1) (2025-07-30)[​](#1121-2025-07-30 "Direct link to 1121-2025-07-30") #### 🐛 Bug Fixes[​](#-bug-fixes-1 "Direct link to 🐛 Bug Fixes") * Restrict apify-shared version ([#447](https://github.com/apify/apify-client-python/pull/447)) ([22cd220](https://github.com/apify/apify-client-python/commit/22cd220e8f22af01f5fdfcedc684015c006b6fe6)) by [@vdusek](https://github.com/vdusek) ### [1.12.0](https://github.com/apify/apify-client-python/releases/tag/v1.12.0) (2025-06-26)[​](#1120-2025-06-26 "Direct link to 1120-2025-06-26") #### 🚀 Features[​](#-features-1 "Direct link to 🚀 Features") * Allow sorting of Actors collection ([#422](https://github.com/apify/apify-client-python/pull/422)) ([df6e47d](https://github.com/apify/apify-client-python/commit/df6e47d3b72e0aa5563f1ece7abc9d9da50b77a2)) by [@protoss70](https://github.com/protoss70) * Add `KeyValueStoreClient.record_exists` ([#427](https://github.com/apify/apify-client-python/pull/427)) ([519529b](https://github.com/apify/apify-client-python/commit/519529b01895958aa33516d8ec4853290c388d05)) by [@janbuchar](https://github.com/janbuchar) #### 🐛 Bug Fixes[​](#-bug-fixes-2 "Direct link to 🐛 Bug Fixes") * Enable to add headers template in webhooks created dynamically ([#419](https://github.com/apify/apify-client-python/pull/419)) ([b84d1ec](https://github.com/apify/apify-client-python/commit/b84d1ec0491ad2623defcfba5fe1aa06274cf533)) by [@gaelloyoly](https://github.com/gaelloyoly) * Rename sortBy parameters option ([#426](https://github.com/apify/apify-client-python/pull/426)) ([a270409](https://github.com/apify/apify-client-python/commit/a2704095928651bf183743bf85fb365c65480d80)) by [@protoss70](https://github.com/protoss70) ### [1.11.0](https://github.com/apify/apify-client-python/releases/tag/v1.11.0) (2025-06-13)[​](#1110-2025-06-13 "Direct link to 1110-2025-06-13") #### 🚀 Features[​](#-features-2 "Direct link to 🚀 Features") * Add `validate_input` endpoint ([#396](https://github.com/apify/apify-client-python/pull/396)) ([1c5bf85](https://github.com/apify/apify-client-python/commit/1c5bf8550ffd91b94ea83694f7c933cf2767fadc)) by [@Pijukatel](https://github.com/Pijukatel), closes [#151](https://github.com/apify/apify-client-python/issues/151) * Add list kv store keys by collection or prefix ([#397](https://github.com/apify/apify-client-python/pull/397)) ([6747c20](https://github.com/apify/apify-client-python/commit/6747c201cd654953a97a4c3fe8256756eb7568c7)) by [@MFori](https://github.com/MFori) * Add redirected actor logs ([#403](https://github.com/apify/apify-client-python/pull/403)) 
([fd02cd8](https://github.com/apify/apify-client-python/commit/fd02cd8726f1664677a47dcb946a0186080d7839)) by [@Pijukatel](https://github.com/Pijukatel), closes [#402](https://github.com/apify/apify-client-python/issues/402) * Add `unlock_requests` method to RequestQueue clients ([#408](https://github.com/apify/apify-client-python/pull/408)) ([d4f0018](https://github.com/apify/apify-client-python/commit/d4f00186016fab4e909a7886467e619b23e627e5)) by [@drobnikj](https://github.com/drobnikj) * Add `StatusMessageWatcher` ([#407](https://github.com/apify/apify-client-python/pull/407)) ([a535512](https://github.com/apify/apify-client-python/commit/a53551217b62a2a6ca2ccbc81130043560fbc475)) by [@Pijukatel](https://github.com/Pijukatel), closes [#404](https://github.com/apify/apify-client-python/issues/404) ### [1.10.0](https://github.com/apify/apify-client-python/releases/tag/v1.10.0) (2025-04-29)[​](#1100-2025-04-29 "Direct link to 1100-2025-04-29") #### 🚀 Features[​](#-features-3 "Direct link to 🚀 Features") * Add support for general resource access ([#394](https://github.com/apify/apify-client-python/pull/394)) ([cc79c30](https://github.com/apify/apify-client-python/commit/cc79c30a7d0b57d21a5fc7efb94c08cc4035c8b4)) by [@tobice](https://github.com/tobice) ### [1.9.4](https://github.com/apify/apify-client-python/releases/tag/v1.9.4) (2025-04-24)[​](#194-2025-04-24 "Direct link to 194-2025-04-24") #### 🐛 Bug Fixes[​](#-bug-fixes-3 "Direct link to 🐛 Bug Fixes") * Default\_build() returns BuildClient ([#389](https://github.com/apify/apify-client-python/pull/389)) ([8149052](https://github.com/apify/apify-client-python/commit/8149052a97032f1336147a48c8a8f6cd5e076b95)) by [@danpoletaev](https://github.com/danpoletaev) ### [1.9.3](https://github.com/apify/apify-client-python/releases/tag/v1.9.3) (2025-04-14)[​](#193-2025-04-14 "Direct link to 193-2025-04-14") #### 🚀 Features[​](#-features-4 "Direct link to 🚀 Features") * Add maxItems and maxTotalChargeUsd to resurrect ([#360](https://github.com/apify/apify-client-python/pull/360)) ([a020807](https://github.com/apify/apify-client-python/commit/a0208073ef93804358e4377959a56d8342f83447)) by [@novotnyj](https://github.com/novotnyj) * Add get default build method ([#385](https://github.com/apify/apify-client-python/pull/385)) ([f818b95](https://github.com/apify/apify-client-python/commit/f818b95fec1c4e57e98b28ad0b2b346ee2f64602)) by [@danpoletaev](https://github.com/danpoletaev) ### [1.9.2](https://github.com/apify/apify-client-python/releases/tag/v1.9.2) (2025-02-14)[​](#192-2025-02-14 "Direct link to 192-2025-02-14") #### 🐛 Bug Fixes[​](#-bug-fixes-4 "Direct link to 🐛 Bug Fixes") * Add missing PPE-related Actor parameters ([#351](https://github.com/apify/apify-client-python/pull/351)) ([75b1c6c](https://github.com/apify/apify-client-python/commit/75b1c6c4d26c21d69ce10ef4424c6ba458bd5a33)) by [@janbuchar](https://github.com/janbuchar) ### [1.9.1](https://github.com/apify/apify-client-python/releases/tag/v1.9.1) (2025-02-07)[​](#191-2025-02-07 "Direct link to 191-2025-02-07") #### 🐛 Bug Fixes[​](#-bug-fixes-5 "Direct link to 🐛 Bug Fixes") * Add `stats` attribute for `ApifyClientAsync` ([#348](https://github.com/apify/apify-client-python/pull/348)) ([6631f8c](https://github.com/apify/apify-client-python/commit/6631f8ccbd56107647a6b886ddcd5cbae378069d)) by [@Mantisus](https://github.com/Mantisus) * Fix return type of charge API call ([#350](https://github.com/apify/apify-client-python/pull/350)) 
([28102fe](https://github.com/apify/apify-client-python/commit/28102fe42039df2f1f2bb3c4e4aa652e37933456)) by [@janbuchar](https://github.com/janbuchar) ### [1.9.0](https://github.com/apify/apify-client-python/releases/tag/v1.9.0) (2025-02-04)[​](#190-2025-02-04 "Direct link to 190-2025-02-04") #### 🚀 Features[​](#-features-5 "Direct link to 🚀 Features") * Add user.update\_limits ([#279](https://github.com/apify/apify-client-python/pull/279)) ([7aed9c9](https://github.com/apify/apify-client-python/commit/7aed9c928958831168ac8d293538d6fd3adbc5e5)) by [@MFori](https://github.com/MFori), closes [#329](https://github.com/apify/apify-client-python/issues/329) * Add charge method to the run client for "pay per event" ([#304](https://github.com/apify/apify-client-python/pull/304)) ([3bd6bbb](https://github.com/apify/apify-client-python/commit/3bd6bbb86d2b777863f0c3d0459b61da9a7f15ff)) by [@Jkuzz](https://github.com/Jkuzz) * Add error data to ApifyApiError ([#314](https://github.com/apify/apify-client-python/pull/314)) ([df2398b](https://github.com/apify/apify-client-python/commit/df2398b51d774c5f8653a80f83b320d0f5394dde)) by [@Pijukatel](https://github.com/Pijukatel), closes [#306](https://github.com/apify/apify-client-python/issues/306) * Add GET: dataset.statistics ([#324](https://github.com/apify/apify-client-python/pull/324)) ([19ea4ad](https://github.com/apify/apify-client-python/commit/19ea4ad46068520885bd098739a9b64d1f17e1fc)) by [@MFori](https://github.com/MFori) * Add `get_open_api_specification` method to `BuildClient` ([#336](https://github.com/apify/apify-client-python/pull/336)) ([9ebcedb](https://github.com/apify/apify-client-python/commit/9ebcedbaede53add167f1c51ec6196e793e67917)) by [@danpoletaev](https://github.com/danpoletaev) * Add rate limit statistics ([#343](https://github.com/apify/apify-client-python/pull/343)) ([f35c68f](https://github.com/apify/apify-client-python/commit/f35c68ff824ce83bf9aca893589381782a1a48c7)) by [@Mantisus](https://github.com/Mantisus) ### [1.8.1](https://github.com/apify/apify-client-python/releases/tags/v1.8.1) (2024-09-17)[​](#181-2024-09-17 "Direct link to 181-2024-09-17") #### 🐛 Bug Fixes[​](#-bug-fixes-6 "Direct link to 🐛 Bug Fixes") * Batch add requests can handle more than 25 requests ([#268](https://github.com/apify/apify-client-python/pull/268)) ([9110ee0](https://github.com/apify/apify-client-python/commit/9110ee08954762aed00ac09cd042e802c1d041f7)) by [@vdusek](https://github.com/vdusek), closes [#264](https://github.com/apify/apify-client-python/issues/264) ### [1.8.0](https://github.com/apify/apify-client-python/releases/tags/v1.8.0) (2024-08-30)[​](#180-2024-08-30 "Direct link to 180-2024-08-30") * drop support for Python 3.8 #### 🚀 Features[​](#-features-6 "Direct link to 🚀 Features") * Adds headers\_template to webhooks and webhooks\_collection ([#239](https://github.com/apify/apify-client-python/pull/239)) ([6dbd781](https://github.com/apify/apify-client-python/commit/6dbd781d24d9deb6a7669193ce4d5a4190fe5026)) by [@jakerobers](https://github.com/jakerobers) * Add actor standby ([#248](https://github.com/apify/apify-client-python/pull/248)) ([dd4bf90](https://github.com/apify/apify-client-python/commit/dd4bf9072a4caa189af5f90e513e37df325dc929)) by [@jirimoravcik](https://github.com/jirimoravcik) * Allow passing list of fields to unwind parameter ([#256](https://github.com/apify/apify-client-python/pull/256)) ([036b455](https://github.com/apify/apify-client-python/commit/036b455c51243e0ef81cb74a44fe670abc085ce7)) by 
[@fnesveda](https://github.com/fnesveda) ### [1.7.1](https://github.com/apify/apify-client-python/releases/tag/v1.7.1) (2024-07-11)[​](#171-2024-07-11 "Direct link to 171-2024-07-11") #### 🐛 Bug Fixes[​](#-bug-fixes-7 "Direct link to 🐛 Bug Fixes") * Fix breaking change (sync -> async) in 1.7.0 * Fix getting storages of last run ### [1.7.0](https://github.com/apify/apify-client-python/releases/tag/v1.7.0) (2024-05-20)[​](#170-2024-05-20 "Direct link to 170-2024-05-20") #### 🐛 Bug Fixes[​](#-bug-fixes-8 "Direct link to 🐛 Bug Fixes") * Fix abort of last task run * Fix abort of last Actor run * `ActorClient`'s and `TaskClient`'s `last_run` methods are asynchronous ### [1.6.4](https://github.com/apify/apify-client-python/releases/tag/v1.6.4) (2024-02-27)[​](#164-2024-02-27 "Direct link to 164-2024-02-27") #### 🚀 Features[​](#-features-7 "Direct link to 🚀 Features") * Add `monthlyUsage()` and `limits()` methods to `UserClient` ### [1.6.3](https://github.com/apify/apify-client-python/releases/tag/v1.6.3) (2023-02-16)[​](#163-2023-02-16 "Direct link to 163-2023-02-16") #### 🚀 Features[​](#-features-8 "Direct link to 🚀 Features") * Add `log()` method to `BuildClient` ### [1.6.2](https://github.com/apify/apify-client-python/releases/tag/v1.6.2) (2023-01-08)[​](#162-2023-01-08 "Direct link to 162-2023-01-08") #### Chore[​](#chore "Direct link to Chore") * Relative imports were replaced for absolute imports ### [1.6.1](https://github.com/apify/apify-client-python/releases/tag/v1.6.1) (2023-12-11)[​](#161-2023-12-11 "Direct link to 161-2023-12-11") #### 🐛 Bug Fixes[​](#-bug-fixes-9 "Direct link to 🐛 Bug Fixes") * Fix `_BaseHTTPClient._parse_params()` method to ensure correct conversion of API list parameters ### [1.6.0](https://github.com/apify/apify-client-python/releases/tag/v1.6.0) (2023-11-16)[​](#160-2023-11-16 "Direct link to 160-2023-11-16") #### Chore[​](#chore-1 "Direct link to Chore") * Migrate from Autopep8 and Flake8 to Ruff ### [1.5.0](https://github.com/apify/apify-client-python/releases/tag/v1.5.0) (2023-10-18)[​](#150-2023-10-18 "Direct link to 150-2023-10-18") #### 🚀 Features[​](#-features-9 "Direct link to 🚀 Features") * Add support for Python 3.12 * Add DELETE to Actor runs * Add DELETE to Actor builds #### Chore[​](#chore-2 "Direct link to Chore") * Rewrite documentation publication to use Docusaurus * Remove PR Toolkit workflow ### [1.4.1](https://github.com/apify/apify-client-python/releases/tag/v1.4.1) (2023-09-06)[​](#141-2023-09-06 "Direct link to 141-2023-09-06") #### 🚀 Features[​](#-features-10 "Direct link to 🚀 Features") * Add `StoreCollectionClient` for listing Actors in the Apify Store * Add support for specifying the `max_items` parameter for pay-per result Actors and their runs #### Chore[​](#chore-3 "Direct link to Chore") * Improve logging of HTTP requests * Remove `pytest-randomly` Pytest plugin ### [1.4.0](https://github.com/apify/apify-client-python/releases/tag/v1.4.0) (2023-08-23)[​](#140-2023-08-23 "Direct link to 140-2023-08-23") #### 🚀 Features[​](#-features-11 "Direct link to 🚀 Features") * Add `RunClient.reboot` method to reboot Actor runs #### Chore[​](#chore-4 "Direct link to Chore") * Simplify code via `flake8-simplify` * Unify indentation in configuration files ### [1.3.1](https://github.com/apify/apify-client-python/releases/tag/v1.3.1) (2023-07-28)[​](#131-2023-07-28 "Direct link to 131-2023-07-28") #### Chore[​](#chore-5 "Direct link to Chore") * Start importing general constants and utilities from the `apify-shared` library ### 
[1.3.0](https://github.com/apify/apify-client-python/releases/tag/v1.3.0) (2023-07-24)[​](#130-2023-07-24 "Direct link to 130-2023-07-24") #### 🚀 Features[​](#-features-12 "Direct link to 🚀 Features") * Add `list_and_lock_head`, `delete_request_lock`, `prolong_request_lock` methods to `RequestQueueClient` * Add `batch_add_requests`, `batch_delete_requests`, `list_requests` methods `RequestQueueClient` ### [1.2.2](https://github.com/apify/apify-client-python/releases/tag/v1.2.2) (2023-05-31)[​](#122-2023-05-31 "Direct link to 122-2023-05-31") #### 🐛 Bug Fixes[​](#-bug-fixes-10 "Direct link to 🐛 Bug Fixes") * Fix encoding webhook lists in request parameters ### [1.2.1](https://github.com/apify/apify-client-python/releases/tag/v1.2.1) (2023-05-23)[​](#121-2023-05-23 "Direct link to 121-2023-05-23") #### 🐛 Bug Fixes[​](#-bug-fixes-11 "Direct link to 🐛 Bug Fixes") * Relax dependency requirements to improve compatibility with other libraries ### [1.2.0](https://github.com/apify/apify-client-python/releases/tag/v1.2.0) (2023-05-23)[​](#120-2023-05-23 "Direct link to 120-2023-05-23") #### 🚀 Features[​](#-features-13 "Direct link to 🚀 Features") * Add option to change the build, memory limit and timeout when resurrecting a run #### Chore[​](#chore-6 "Direct link to Chore") * Update dependencies ### [1.1.1](https://github.com/apify/apify-client-python/releases/tag/v1.1.1) (2023-05-05)[​](#111-2023-05-05 "Direct link to 111-2023-05-05") #### Chore[​](#chore-7 "Direct link to Chore") * Change GitHub workflows to use new secrets ### [1.1.0](https://github.com/apify/apify-client-python/releases/tag/v1.1.0) (2023-05-05)[​](#110-2023-05-05 "Direct link to 110-2023-05-05") #### 🚀 Features[​](#-features-14 "Direct link to 🚀 Features") * Add support for `is_status_message_terminal` flag in Actor run status message update #### Chore[​](#chore-8 "Direct link to Chore") * Switch from `setup.py` to `pyproject.toml` for specifying project setup ### [1.0.0](https://github.com/apify/apify-client-python/releases/tag/v1.0.0) (2023-03-13)[​](#100-2023-03-13 "Direct link to 100-2023-03-13") #### Breaking changes[​](#breaking-changes "Direct link to Breaking changes") * Drop support for Python 3.7, add support for Python 3.11 * Unify methods for streaming resources * Switch underlying HTTP library from `requests` to `httpx` #### 🚀 Features[​](#-features-15 "Direct link to 🚀 Features") * Add support for asynchronous usage via `ApifyClientAsync` * Add configurable socket timeout for requests to the Apify API * Add `py.typed` file to signal type checkers that this package is typed * Add method to update status message for a run * Add option to set up webhooks for Actor builds * Add logger with basic debugging info * Add support for `schema` parameter in `get_or_create` method for datasets and key-value stores * Add support for `title` parameter in task and schedule methods * Add `x-apify-workflow-key` header support * Add support for `flatten` and `view` parameters in dataset items methods * Add support for `origin` parameter in Actor/task run methods * Add clients for Actor version environment variables #### 🐛 Bug Fixes[​](#-bug-fixes-12 "Direct link to 🐛 Bug Fixes") * Disallow `NaN` and `Infinity` values in JSONs sent to the Apify API #### Chore[​](#chore-9 "Direct link to Chore") * Simplify retrying with exponential backoff * Improve checks for "not found" errors * Simplify flake8 config * Update development dependencies * Simplify development scripts * Update GitHub Actions versions to fix deprecations * Unify unit 
test style * Unify preparing resource representation * Update output management in GitHub Workflows to fix deprecations * Improve type hints across codebase * Add option to manually publish the package with a workflow dispatch * Add `pre-commit` to run code quality checks before committing * Convert `unittest`-style tests to `pytest`-style tests * Backport project setup improvements from `apify-sdk-python` ### [0.6.0](https://github.com/apify/apify-client-python/releases/tag/v0.6.0) (2022-06-27)[​](#060-2022-06-27 "Direct link to 060-2022-06-27") #### Removed[​](#removed "Direct link to Removed") * Drop support for single-file Actors #### Chore[​](#chore-10 "Direct link to Chore") * Update dependencies * Fix some lint issues in shell scripts and `setup.py` * Add Python 3.10 to unit test roster ### [0.5.0](https://github.com/apify/apify-client-python/releases/tag/v0.5.0) (2021-09-16)[​](#050-2021-09-16 "Direct link to 050-2021-09-16") #### Changed[​](#changed "Direct link to Changed") * Improve retrying broken API server connections #### 🐛 Bug Fixes[​](#-bug-fixes-13 "Direct link to 🐛 Bug Fixes") * Fix timeout value in actively waiting for a run to finish #### Chore[​](#chore-11 "Direct link to Chore") * Update development dependencies ### [0.4.0](https://github.com/apify/apify-client-python/releases/tag/v0.4.0) (2021-09-07)[​](#040-2021-09-07 "Direct link to 040-2021-09-07") #### Changed[​](#changed-1 "Direct link to Changed") * Improve handling of `Enum` arguments * Improve support for storing more data types in key-value stores #### 🐛 Bug Fixes[​](#-bug-fixes-14 "Direct link to 🐛 Bug Fixes") * Fix values of some `ActorJobStatus` `Enum` members ### [0.3.0](https://github.com/apify/apify-client-python/releases/tag/v0.3.0) (2021-08-26)[​](#030-2021-08-26 "Direct link to 030-2021-08-26") #### 🚀 Features[​](#-features-16 "Direct link to 🚀 Features") * Add the `test()` method to the webhook client * Add support for indicating the pagination direction in the `ListPage` objects #### Changed[​](#changed-2 "Direct link to Changed") * Improve support for storing more data types in datasets #### 🐛 Bug Fixes[​](#-bug-fixes-15 "Direct link to 🐛 Bug Fixes") * Fix return type in the `DatasetClient.list_items()` method docs #### Chore[​](#chore-12 "Direct link to Chore") * Add human-friendly names to the jobs in Github Action workflows * Update development dependencies ### [0.2.0](https://github.com/apify/apify-client-python/releases/tag/v0.2.0) (2021-08-09)[​](#020-2021-08-09 "Direct link to 020-2021-08-09") #### 🚀 Features[​](#-features-17 "Direct link to 🚀 Features") * Add the `gracefully` parameter to the "Abort run" method #### Changed[​](#changed-3 "Direct link to Changed") * Replace `base_url` with `api_url` in the client constructor to enable easier passing of the API server url from environment variables available to Actors on the Apify platform #### Chore[​](#chore-13 "Direct link to Chore") * Change tags for Actor images with this client on Docker Hub to be aligned with the Apify SDK Node.js images * Update the `requests` dependency to 2.26.0 * Update development dependencies ### [0.1.0](https://github.com/apify/apify-client-python/releases/tag/v0.1.0) (2021-08-02)[​](#010-2021-08-02 "Direct link to 010-2021-08-02") #### Changed[​](#changed-4 "Direct link to Changed") * Methods using specific option values for arguments now use well-defined and documented `Enum`s for those arguments instead of generic strings * Make the submodule `apify_client.consts` containing those `Enum`s available #### 
Chore[​](#chore-14 "Direct link to Chore") * Update development dependencies * Enforce unified use of single quotes and double quotes * Add repository dispatch to build Actor images with this client when publishing a new version ### [0.0.1](https://github.com/apify/apify-client-python/releases/tag/v0.0.1) (2021-05-13)[​](#001-2021-05-13 "Direct link to 001-2021-05-13") * Initial release of the package. --- # Asyncio support The package provides an asynchronous version of the client, [`ApifyClientAsync`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md), which allows you to interact with the Apify API using Python's standard async/await syntax. This enables you to perform non-blocking operations, see the Python [asyncio documentation](https://docs.python.org/3/library/asyncio-task.html) for more information. The following example demonstrates how to run an Actor asynchronously and stream its logs while it is running: ``` import asyncio from apify_client import ApifyClientAsync TOKEN = 'MY-APIFY-TOKEN' async def main() -> None: apify_client = ApifyClientAsync(TOKEN) actor_client = apify_client.actor('my-actor-id') # Start the Actor and get the run ID run_result = await actor_client.start() run_client = apify_client.run(run_result['id']) log_client = run_client.log() # Stream the logs async with log_client.stream() as async_log_stream: if async_log_stream: async for line in async_log_stream.aiter_lines(): print(line) if __name__ == '__main__': asyncio.run(main()) ``` --- # Convenience methods The Apify client provides several convenience methods to handle actions that the API alone cannot perform efficiently, such as waiting for an Actor run to finish without running into network timeouts. These methods simplify common tasks and enhance the usability of the client. * [`ActorClient.call`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#call) - Starts an Actor and waits for it to finish, handling network timeouts internally. * [`ActorClient.start`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#start) - Explicitly waits for an Actor run to finish with customizable timeouts. Additionally, storage-related resources offer flexible options for data retrieval: * [Key-value store](https://docs.apify.com/platform/storage/key-value-store) records can be retrieved as objects, buffers, or streams. * [Dataset](https://docs.apify.com/platform/storage/dataset) items can be fetched as individual objects, serialized data, or iterated asynchronously. 
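For the storage retrieval options listed above, here is a minimal sketch using the sync client. The store ID, dataset ID, and record key are placeholders; the method names (`get_record`, `get_record_as_bytes`, `stream_record`, `list_items`, `iterate_items`) are taken from the client reference earlier in this document, and `ApifyClientAsync` exposes the same methods asynchronously. The Async/Sync examples that follow illustrate the `call` and `start` convenience methods themselves.

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)

    # Key-value store records: parsed object, raw bytes, or a stream
    kvs_client = apify_client.key_value_store('my-store-id')
    record = kvs_client.get_record('some-key')  # record as a parsed object
    record_bytes = kvs_client.get_record_as_bytes('some-key')  # raw bytes
    with kvs_client.stream_record('some-key') as record_stream:
        ...  # consume the streamed record here

    # Dataset items: one page of objects, or lazy iteration over all items
    dataset_client = apify_client.dataset('my-dataset-id')
    first_page = dataset_client.list_items(limit=100).items
    for item in dataset_client.iterate_items():
        print(item)
```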
- Async client
- Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)
    actor_client = apify_client.actor('username/actor-name')

    # Start an Actor and wait for it to finish
    finished_actor_run = await actor_client.call()

    # Start an Actor and wait at most 60 s (1 minute) for it to finish
    actor_run = await actor_client.start(wait_for_finish=60)
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)
    actor_client = apify_client.actor('username/actor-name')

    # Start an Actor and wait for it to finish
    finished_actor_run = actor_client.call()

    # Start an Actor and wait at most 60 s (1 minute) for it to finish
    actor_run = actor_client.start(wait_for_finish=60)
```

---

# Error handling

When you use the Apify client, it automatically extracts all relevant data from the endpoint and returns it in the expected format. Date strings, for instance, are seamlessly converted to Python `datetime.datetime` objects. If an error occurs, the client raises an [`ApifyApiError`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyApiError.md). This exception wraps the raw JSON errors returned by the API and provides additional context, making it easier to debug any issues that arise.

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)

    try:
        # Try to list items from a non-existent dataset
        dataset_client = apify_client.dataset('not-existing-dataset-id')
        dataset_items = (await dataset_client.list_items()).items
    except Exception as error:
        # The caught exception is an instance of ApifyApiError
        print(error)
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)

    try:
        # Try to list items from a non-existent dataset
        dataset_client = apify_client.dataset('not-existing-dataset-id')
        dataset_items = dataset_client.list_items().items
    except Exception as error:
        # The caught exception is an instance of ApifyApiError
        print(error)
```

---

# Logging

The library logs useful debug information to the `apify_client` logger whenever it sends requests to the Apify API. You can configure this logger to print debug information to the standard output by adding a handler:

```
import logging

# Configure the Apify client logger
apify_client_logger = logging.getLogger('apify_client')
apify_client_logger.setLevel(logging.DEBUG)
apify_client_logger.addHandler(logging.StreamHandler())
```

The log records include additional properties, provided via the `extra` argument, which can be helpful for debugging. Some of these properties are:

* `attempt` - Number of retry attempts for the request.
* `status_code` - HTTP status code of the response.
* `url` - URL of the API endpoint being called.
* `client_method` - Method name of the client that initiated the request.
* `resource_id` - Identifier of the resource being accessed.

To display these additional properties in the log output, you need to use a custom log formatter.
Here's a basic example: ``` import logging # Configure the Apify client logger apify_client_logger = logging.getLogger('apify_client') apify_client_logger.setLevel(logging.DEBUG) apify_client_logger.addHandler(logging.StreamHandler()) # Create a custom logging formatter formatter = logging.Formatter( '%(asctime)s - %(name)s - %(levelname)s - %(message)s - ' '%(attempt)s - %(status_code)s - %(url)s' ) handler = logging.StreamHandler() handler.setFormatter(formatter) apify_client_logger.addHandler(handler) ``` For more information on creating and using custom log formatters, refer to the official Python [logging documentation](https://docs.python.org/3/howto/logging.html#formatters). --- # Nested clients In some cases, the Apify client provides nested clients to simplify working with related collections. For example, you can easily manage the runs of a specific Actor without having to construct multiple endpoints or client instances manually. * Async client * Sync client ``` from apify_client import ApifyClientAsync TOKEN = 'MY-APIFY-TOKEN' async def main() -> None: apify_client = ApifyClientAsync(TOKEN) actor_client = apify_client.actor('username/actor-name') runs_client = actor_client.runs() # List the last 10 runs of the Actor actor_runs = (await runs_client.list(limit=10, desc=True)).items # Select the last run of the Actor that finished with a SUCCEEDED status last_succeeded_run_client = actor_client.last_run(status='SUCCEEDED') # type: ignore[arg-type] # Get dataset actor_run_dataset_client = last_succeeded_run_client.dataset() # Fetch items from the run's dataset dataset_items = (await actor_run_dataset_client.list_items()).items ``` ``` from apify_client import ApifyClient TOKEN = 'MY-APIFY-TOKEN' def main() -> None: apify_client = ApifyClient(TOKEN) actor_client = apify_client.actor('username/actor-name') runs_client = actor_client.runs() # List the last 10 runs of the Actor actor_runs = runs_client.list(limit=10, desc=True).items # Select the last run of the Actor that finished with a SUCCEEDED status last_succeeded_run_client = actor_client.last_run(status='SUCCEEDED') # type: ignore[arg-type] # Get dataset actor_run_dataset_client = last_succeeded_run_client.dataset() # Fetch items from the run's dataset dataset_items = actor_run_dataset_client.list_items().items ``` This direct access to [Dataset](https://docs.apify.com/platform/storage/dataset) (and other storage resources) from the [`RunClient`](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) is especially convenient when used alongside the [`ActorClient.last_run`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#last_run) method. --- # Pagination Most methods named `list` or `list_something` in the Apify client return a [`ListPage`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md) object. This object provides a consistent interface for working with paginated data and includes the following properties: * `items` - The main results you're looking for. * `total` - The total number of items available. * `offset` - The starting point of the current page. * `count` - The number of items in the current page. * `limit` - The maximum number of items per page. Some methods, such as `list_keys` or `list_head`, paginate differently. Regardless, the primary results are always stored under the items property, and the limit property can be used to control the number of results returned. 
The following example demonstrates how to fetch all items from a dataset using pagination: * Async client * Sync client ``` from apify_client import ApifyClientAsync TOKEN = 'MY-APIFY-TOKEN' async def main() -> None: apify_client = ApifyClientAsync(TOKEN) # Initialize the dataset client dataset_client = apify_client.dataset('dataset-id') # Define the pagination parameters limit = 1000 # Number of items per page offset = 0 # Starting offset all_items = [] # List to store all fetched items while True: # Fetch a page of items response = await dataset_client.list_items(limit=limit, offset=offset) items = response.items total = response.total print(f'Fetched {len(items)} items') # Add the fetched items to the complete list all_items.extend(items) # Exit the loop if there are no more items to fetch if offset + limit >= total: break # Increment the offset for the next page offset += limit print(f'Overall fetched {len(all_items)} items') ``` ``` from apify_client import ApifyClient TOKEN = 'MY-APIFY-TOKEN' def main() -> None: apify_client = ApifyClient(TOKEN) # Initialize the dataset client dataset_client = apify_client.dataset('dataset-id') # Define the pagination parameters limit = 1000 # Number of items per page offset = 0 # Starting offset all_items = [] # List to store all fetched items while True: # Fetch a page of items response = dataset_client.list_items(limit=limit, offset=offset) items = response.items total = response.total print(f'Fetched {len(items)} items') # Add the fetched items to the complete list all_items.extend(items) # Exit the loop if there are no more items to fetch if offset + limit >= total: break # Increment the offset for the next page offset += limit print(f'Overall fetched {len(all_items)} items') ``` The [`ListPage`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md) interface offers several key benefits. Its consistent structure ensures predictable results for most `list` methods, providing a uniform way to work with paginated data. It also offers flexibility, allowing you to customize the `limit` and `offset` parameters to control data fetching according to your needs. Additionally, it provides scalability, enabling you to efficiently handle large datasets through pagination. This approach ensures efficient data retrieval while keeping memory usage under control, making it ideal for managing and processing large collections. --- # Retries When dealing with network communication, failures can occasionally occur. The Apify client automatically retries requests that fail due to: * Network errors * Internal errors in the Apify API (HTTP status codes 500 and above) * Rate limit errors (HTTP status code 429) By default, the client will retry a failed request up to 8 times. The retry intervals use an exponential backoff strategy: * The first retry occurs after approximately 500 milliseconds. * The second retry occurs after approximately 1,000 milliseconds, and so on. You can customize this behavior using the following options in the [`ApifyClient`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md) constructor: * `max_retries`: Defines the maximum number of retry attempts. * `min_delay_between_retries_millis`: Sets the minimum delay between retries (in milliseconds). Retries with exponential backoff are a common strategy for handling network errors. They help to reduce the load on the server and increase the chances of a successful request. 
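The doubling schedule described above works out roughly as follows. This is an illustrative sketch only; the client applies it to retryable failures and may add randomization on top of the base delays. The constructor options themselves are shown in the examples right below.

```
# Approximate exponential backoff: ~0.5 s, ~1 s, ~2 s, ... for each retry
min_delay_between_retries_millis = 500
max_retries = 8

for attempt in range(1, max_retries + 1):
    delay_ms = min_delay_between_retries_millis * 2 ** (attempt - 1)
    print(f'Retry {attempt}: wait ~{delay_ms} ms before trying again')
```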
* Async client
* Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(
        token=TOKEN,
        max_retries=8,
        min_delay_between_retries_millis=500,  # 0.5s
        timeout_secs=360,  # 6 mins
    )
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(
        token=TOKEN,
        max_retries=8,
        min_delay_between_retries_millis=500,  # 0.5s
        timeout_secs=360,  # 6 mins
    )
```

---

# Single and collection clients

The Apify client interface is designed to be consistent and intuitive across all of its components. When you call specific methods on the main client, you create specialized clients to manage individual API resources. There are two main types of clients:

* [`ActorClient`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md) - Manages a single resource.
* [`ActorCollectionClient`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md) - Manages a collection of resources.

- Async client
- Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)

    # Collection clients do not require a parameter
    actor_collection_client = apify_client.actors()

    # Create an Actor with the name: my-actor
    my_actor = await actor_collection_client.create(name='my-actor')

    # List all of your Actors
    actor_list = (await actor_collection_client.list()).items
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)

    # Collection clients do not require a parameter
    actor_collection_client = apify_client.actors()

    # Create an Actor with the name: my-actor
    my_actor = actor_collection_client.create(name='my-actor')

    # List all of your Actors
    actor_list = actor_collection_client.list().items
```

The resource ID can be the resource's `id` or a combination of `username/resource-name`.

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)

    # Resource clients accept an ID of the resource
    actor_client = apify_client.actor('username/actor-name')

    # Fetch the 'username/actor-name' object from the API
    my_actor = await actor_client.get()

    # Start the run of 'username/actor-name' and return the Run object
    my_actor_run = await actor_client.start()
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)

    # Resource clients accept an ID of the resource
    actor_client = apify_client.actor('username/actor-name')

    # Fetch the 'username/actor-name' object from the API
    my_actor = actor_client.get()

    # Start the run of 'username/actor-name' and return the Run object
    my_actor_run = actor_client.start()
```

By utilizing the appropriate collection or resource client, you can simplify how you interact with the Apify API.

---

# Streaming resources

Certain resources, such as dataset items, key-value store records, and logs, support streaming directly from the Apify API. This allows you to process large resources incrementally without downloading them entirely into memory, making it ideal for handling large or continuously updated data.
Supported streaming methods:

* [`DatasetClient.stream_items`](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#stream_items) - Stream dataset items incrementally.
* [`KeyValueStoreClient.stream_record`](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#stream_record) - Stream key-value store records as raw data.
* [`LogClient.stream`](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md#stream) - Stream logs in real time.

These methods return a raw, context-managed `httpx.Response` object. The response must be consumed within a `with` block to ensure that the connection is closed automatically, preventing memory leaks or unclosed connections.

The following example demonstrates how to stream the logs of an Actor run incrementally:

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)

    run_client = apify_client.run('MY-RUN-ID')
    log_client = run_client.log()

    async with log_client.stream() as log_stream:
        if log_stream:
            async for line in log_stream.aiter_lines():
                print(line)
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)

    run_client = apify_client.run('MY-RUN-ID')
    log_client = run_client.log()

    with log_client.stream() as log_stream:
        if log_stream:
            for line in log_stream.iter_lines():
                print(line)
```

Streaming offers several key benefits. It ensures memory efficiency by loading only a small portion of the resource into memory at any given time, making it ideal for handling large data. It enables real-time processing, allowing you to start working with data immediately as it is received. With automatic resource management, using the `with` statement ensures that connections are properly closed, preventing memory leaks or unclosed connections.

This approach is valuable for processing large logs, datasets, or files on the fly without the need to download them entirely.

---

# Integration with data libraries

The Apify client for Python seamlessly integrates with data analysis libraries like [Pandas](https://pandas.pydata.org/). This allows you to load dataset items directly into a Pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) for efficient manipulation and analysis. Pandas provides robust data structures and tools for handling large datasets, making it a powerful addition to your Apify workflows.
The following example demonstrates how to retrieve items from the most recent dataset of an Actor run and load them into a Pandas DataFrame for further analysis:

* Async client
* Sync client

```
import asyncio

import pandas as pd

from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    # Initialize the Apify client
    apify_client = ApifyClientAsync(token=TOKEN)

    actor_client = apify_client.actor('apify/web-scraper')
    run_client = actor_client.last_run()
    dataset_client = run_client.dataset()

    # Load items from the last run's dataset
    dataset_data = await dataset_client.list_items()

    # Pass dataset items to Pandas DataFrame
    data_frame = pd.DataFrame(dataset_data.items)
    data_frame.info()


if __name__ == '__main__':
    asyncio.run(main())
```

```
import pandas as pd

from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    # Initialize the Apify client
    apify_client = ApifyClient(token=TOKEN)

    actor_client = apify_client.actor('apify/web-scraper')
    run_client = actor_client.last_run()
    dataset_client = run_client.dataset()

    # Load items from the last run's dataset
    dataset_data = dataset_client.list_items()

    # Pass dataset items to Pandas DataFrame
    data_frame = pd.DataFrame(dataset_data.items)
    data_frame.info()


if __name__ == '__main__':
    main()
```

---

# Manage tasks for reusable input

When you need to run multiple inputs with the same Actor, the most convenient approach is to create multiple [tasks](https://docs.apify.com/platform/actors/running/tasks), each with different input configurations. Task inputs are stored on the Apify platform when the task is created, allowing you to reuse them easily.

The following example demonstrates how to create tasks for the `apify/instagram-hashtag-scraper` Actor with different inputs, manage task clients, and execute them asynchronously:

* Async client
* Sync client

```
import asyncio

from apify_client import ApifyClientAsync
from apify_client.clients.resource_clients import TaskClientAsync

TOKEN = 'MY-APIFY-TOKEN'
HASHTAGS = ['zebra', 'lion', 'hippo']


async def run_apify_task(client: TaskClientAsync) -> dict:
    result = await client.call()
    return result or {}


async def main() -> None:
    apify_client = ApifyClientAsync(token=TOKEN)

    # Create Apify tasks
    apify_tasks = list[dict]()
    apify_tasks_client = apify_client.tasks()

    for hashtag in HASHTAGS:
        apify_task = await apify_tasks_client.create(
            name=f'hashtags-{hashtag}',
            actor_id='apify/instagram-hashtag-scraper',
            task_input={'hashtags': [hashtag], 'resultsLimit': 20},
            memory_mbytes=1024,
        )
        apify_tasks.append(apify_task)

    print('Tasks created:', apify_tasks)

    # Create Apify task clients
    apify_task_clients = list[TaskClientAsync]()

    for apify_task in apify_tasks:
        task_id = apify_task['id']
        apify_task_client = apify_client.task(task_id)
        apify_task_clients.append(apify_task_client)

    print('Task clients created:', apify_task_clients)

    # Execute Apify tasks
    run_apify_tasks = [run_apify_task(client) for client in apify_task_clients]
    task_run_results = await asyncio.gather(*run_apify_tasks)

    print('Task results:', task_run_results)


if __name__ == '__main__':
    asyncio.run(main())
```

```
from apify_client import ApifyClient
from apify_client.clients.resource_clients import TaskClient

TOKEN = 'MY-APIFY-TOKEN'
HASHTAGS = ['zebra', 'lion', 'hippo']


def run_apify_task(client: TaskClient) -> dict:
    result = client.call()
    return result or {}


def main() -> None:
    apify_client = ApifyClient(token=TOKEN)

    # Create Apify tasks
    apify_tasks = list[dict]()
    apify_tasks_client = apify_client.tasks()

    for hashtag in HASHTAGS:
        apify_task = apify_tasks_client.create(
            name=f'hashtags-{hashtag}',
            actor_id='apify/instagram-hashtag-scraper',
            task_input={'hashtags': [hashtag], 'resultsLimit': 20},
            memory_mbytes=1024,
        )
        apify_tasks.append(apify_task)

    print('Tasks created:', apify_tasks)

    # Create Apify task clients
    apify_task_clients = list[TaskClient]()

    for apify_task in apify_tasks:
        task_id = apify_task['id']
        apify_task_client = apify_client.task(task_id)
        apify_task_clients.append(apify_task_client)

    print('Task clients created:', apify_task_clients)

    # Execute Apify tasks
    task_run_results = list[dict]()

    for client in apify_task_clients:
        result = run_apify_task(client)
        task_run_results.append(result)

    print('Task results:', task_run_results)


if __name__ == '__main__':
    main()
```

---

# Passing input to Actor

The most efficient way to run an Actor and retrieve its results is to pass the input data directly to the `call` method. This method allows you to configure the Actor's input, execute it, and either get a reference to the running Actor or wait for its completion.

The following example demonstrates how to pass input to the `apify/instagram-hashtag-scraper` Actor and wait for it to finish.

* Async client
* Sync client

```
import asyncio

from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    # Client initialization with the API token
    apify_client = ApifyClientAsync(token=TOKEN)

    # Get the Actor client
    actor_client = apify_client.actor('apify/instagram-hashtag-scraper')

    input_data = {'hashtags': ['rainbow'], 'resultsLimit': 20}

    # Run the Actor and wait for it to finish up to 60 seconds.
    # Input is not persisted for next runs.
    run_result = await actor_client.call(run_input=input_data, timeout_secs=60)


if __name__ == '__main__':
    asyncio.run(main())
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    # Client initialization with the API token
    apify_client = ApifyClient(token=TOKEN)

    # Get the Actor client
    actor_client = apify_client.actor('apify/instagram-hashtag-scraper')

    input_data = {'hashtags': ['rainbow'], 'resultsLimit': 20}

    # Run the Actor and wait for it to finish up to 60 seconds.
    # Input is not persisted for next runs.
    run_result = actor_client.call(run_input=input_data, timeout_secs=60)


if __name__ == '__main__':
    main()
```

---

# Retrieve Actor data

Actor output data is stored in [datasets](https://docs.apify.com/platform/storage/dataset), which can be retrieved from individual Actor runs. Dataset items support pagination for efficient retrieval, and multiple datasets can be merged into a single dataset for further analysis. This merged dataset can then be exported into various formats such as CSV, JSON, XLSX, or XML. Additionally, [integrations](https://docs.apify.com/platform/integrations) provide powerful tools to automate data workflows.
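Because dataset items are returned as plain Python dictionaries, you can also export them locally yourself. The following minimal sketch (sync client, a placeholder dataset ID, and assuming all items share the same fields) writes one page of items to a CSV file using only the standard library:

```
import csv

from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)
    dataset_client = apify_client.dataset('dataset-id')

    # Fetch a single page of items; see Pagination for fetching everything.
    items = dataset_client.list_items(limit=1000).items

    if items:
        with open('dataset.csv', 'w', newline='', encoding='utf-8') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=list(items[0].keys()))
            writer.writeheader()
            writer.writerows(items)
```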
The following example demonstrates how to fetch datasets from an Actor's runs, paginate through their items, and merge them into a single dataset for unified analysis:

* Async client
* Sync client

```
import asyncio

from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    # Client initialization with the API token
    apify_client = ApifyClientAsync(token=TOKEN)

    actor_client = apify_client.actor('apify/instagram-hashtag-scraper')
    runs_client = actor_client.runs()

    # See the Pagination section to understand how to list more runs
    actor_runs = await runs_client.list(limit=20)

    datasets_client = apify_client.datasets()
    merging_dataset = await datasets_client.get_or_create(name='merge-dataset')

    for actor_run in actor_runs.items:
        # Each run has a default dataset; its items can be paginated
        dataset_client = apify_client.dataset(actor_run['defaultDatasetId'])
        dataset_items = await dataset_client.list_items(limit=1000)

        # Items can be pushed to a single merged dataset
        merging_dataset_client = apify_client.dataset(merging_dataset['id'])
        await merging_dataset_client.push_items(dataset_items.items)

    # ...


if __name__ == '__main__':
    asyncio.run(main())
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    # Client initialization with the API token
    apify_client = ApifyClient(token=TOKEN)

    actor_client = apify_client.actor('apify/instagram-hashtag-scraper')
    runs_client = actor_client.runs()

    # See the Pagination section to understand how to list more runs
    actor_runs = runs_client.list(limit=20)

    datasets_client = apify_client.datasets()
    merging_dataset = datasets_client.get_or_create(name='merge-dataset')

    for actor_run in actor_runs.items:
        # Each run has a default dataset; its items can be paginated
        dataset_client = apify_client.dataset(actor_run['defaultDatasetId'])
        dataset_items = dataset_client.list_items(limit=1000)

        # Items can be pushed to a single merged dataset
        merging_dataset_client = apify_client.dataset(merging_dataset['id'])
        merging_dataset_client.push_items(dataset_items.items)

    # ...


if __name__ == '__main__':
    main()
```

---

# Getting started

This guide will walk you through how to use the [Apify Client for Python](https://github.com/apify/apify-client-python) to run [Actors](https://apify.com/actors) on the [Apify platform](https://docs.apify.com/platform), provide input to them, and retrieve results from their datasets. You'll learn the basics of running serverless programs (we're calling them Actors) and managing their output efficiently.

## Running your first Actor

To start an Actor, you need its ID (e.g., `john-doe/my-cool-actor`) and an API token. The Actor ID is a combination of the Actor owner's username and the Actor's name. Use the [`ActorClient`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md) to run the Actor and wait for it to complete. You can run both your own Actors and [Actors from Apify store](https://docs.apify.com/platform/actors/running/actors-in-store).

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

# You can find your API token at https://console.apify.com/settings/integrations.
TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)

    # Start an Actor and wait for it to finish.
    actor_client = apify_client.actor('john-doe/my-cool-actor')
    call_result = await actor_client.call()

    if call_result is None:
        print('Actor run failed.')
        return

    # Fetch results from the Actor run's default dataset.
    dataset_client = apify_client.dataset(call_result['defaultDatasetId'])
    list_items_result = await dataset_client.list_items()
    print(f'Dataset: {list_items_result}')
```

```
from apify_client import ApifyClient

# You can find your API token at https://console.apify.com/settings/integrations.
TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)

    # Start an Actor and wait for it to finish.
    actor_client = apify_client.actor('john-doe/my-cool-actor')
    call_result = actor_client.call()

    if call_result is None:
        print('Actor run failed.')
        return

    # Fetch results from the Actor run's default dataset.
    dataset_client = apify_client.dataset(call_result['defaultDatasetId'])
    list_items_result = dataset_client.list_items()
    print(f'Dataset: {list_items_result}')
```

## Providing input to Actor

Actors often require input, such as URLs to scrape, search terms, or other configuration data. You can pass input as a JSON object when starting the Actor using the [`ActorClient.call`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#call) method. The input must conform to the Actor's [input schema](https://docs.apify.com/platform/actors/development/actor-definition/input-schema).

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)
    actor_client = apify_client.actor('username/actor-name')

    # Define the input for the Actor.
    run_input = {
        'some': 'input',
    }

    # Start an Actor and wait for it to finish.
    call_result = await actor_client.call(run_input=run_input)
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)
    actor_client = apify_client.actor('username/actor-name')

    # Define the input for the Actor.
    run_input = {
        'some': 'input',
    }

    # Start an Actor and wait for it to finish.
    call_result = actor_client.call(run_input=run_input)
```

## Getting results from the dataset

To get the results from the dataset, you can use the [`DatasetClient`](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md) ([`ApifyClient.dataset`](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#dataset)) and the [`DatasetClient.list_items`](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#list_items) method. You need to pass the dataset ID to define which dataset you want to access. You can get the dataset ID from the Actor's run dictionary (represented by `defaultDatasetId`).

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)
    actor_client = apify_client.actor('username/actor-name')

    # Define the input for the Actor.
    run_input = {
        'some': 'input',
    }

    # Start an Actor and wait for it to finish.
    call_result = await actor_client.call(run_input=run_input)

    if call_result is None:
        print('Actor run failed.')
        return

    # Fetch the items from the run's default dataset.
    dataset_client = apify_client.dataset(call_result['defaultDatasetId'])
    list_items_result = await dataset_client.list_items()
    print(f'Dataset: {list_items_result}')
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)
    actor_client = apify_client.actor('username/actor-name')

    # Define the input for the Actor.
    run_input = {
        'some': 'input',
    }

    # Start an Actor and wait for it to finish.
    call_result = actor_client.call(run_input=run_input)

    if call_result is None:
        print('Actor run failed.')
        return

    # Fetch the items from the run's default dataset.
    dataset_client = apify_client.dataset(call_result['defaultDatasetId'])
    list_items_result = dataset_client.list_items()
    print(f'Dataset: {list_items_result}')
```

Dataset access

Running an Actor might take time, depending on the Actor's complexity and the amount of data it processes. If you only want to retrieve data and need an immediate response, access the existing dataset of a finished [Actor run](https://docs.apify.com/platform/actors/running/runs-and-builds#runs).

---

# Introduction

The [Apify client for Python](https://github.com/apify/apify-client-python) is the official library to access the [Apify REST API](https://docs.apify.com/api/v2) from your Python applications. It provides useful features like automatic retries and convenience functions that improve the experience of using the Apify API. All requests and responses (including errors) are encoded in JSON format with UTF-8 encoding. The client provides both synchronous and asynchronous interfaces.

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

# You can find your API token at https://console.apify.com/settings/integrations.
TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    apify_client = ApifyClientAsync(TOKEN)

    # Start an Actor and wait for it to finish.
    actor_client = apify_client.actor('john-doe/my-cool-actor')
    call_result = await actor_client.call()

    if call_result is None:
        print('Actor run failed.')
        return

    # Fetch results from the Actor run's default dataset.
    dataset_client = apify_client.dataset(call_result['defaultDatasetId'])
    list_items_result = await dataset_client.list_items()
    print(f'Dataset: {list_items_result}')
```

```
from apify_client import ApifyClient

# You can find your API token at https://console.apify.com/settings/integrations.
TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    apify_client = ApifyClient(TOKEN)

    # Start an Actor and wait for it to finish.
    actor_client = apify_client.actor('john-doe/my-cool-actor')
    call_result = actor_client.call()

    if call_result is None:
        print('Actor run failed.')
        return

    # Fetch results from the Actor run's default dataset.
    dataset_client = apify_client.dataset(call_result['defaultDatasetId'])
    list_items_result = dataset_client.list_items()
    print(f'Dataset: {list_items_result}')
```

---

# Setting up

This guide will help you get started with the [Apify client for Python](https://github.com/apify/apify-client-python) by setting it up on your computer. Follow the steps below to ensure a smooth installation process.

## Prerequisites

Before installing `apify-client` itself, make sure that your system meets the following requirements:

* **Python 3.10 or higher**: `apify-client` requires Python 3.10 or a newer version. You can download Python from the [official website](https://www.python.org/downloads/).
* **Python package manager**: While this guide uses pip (the most common package manager), you can use any Python package manager you prefer. You can download pip from the [official website](https://pip.pypa.io/en/stable/installation/).
### Verifying prerequisites

To check if Python and the pip package manager are installed, run the following commands:

```
python --version
```

```
pip --version
```

If these commands return the respective versions, you're ready to continue.

## Installation

The Apify client for Python is available as the [`apify-client`](https://pypi.org/project/apify-client/) package on PyPI. To install it, run:

```
pip install apify-client
```

After installation, verify that `apify-client` is installed correctly by checking its version:

```
python -c 'import apify_client; print(apify_client.__version__)'
```

## Authentication and initialization

To use the client, you need an [API token](https://docs.apify.com/platform/integrations/api#api-token). You can find your token under the [Integrations](https://console.apify.com/account/integrations) tab in Apify Console. Copy the token and initialize the client by providing the token (`MY-APIFY-TOKEN`) as a parameter to the `ApifyClient` constructor.

* Async client
* Sync client

```
from apify_client import ApifyClientAsync

TOKEN = 'MY-APIFY-TOKEN'


async def main() -> None:
    # Client initialization with the API token.
    apify_client = ApifyClientAsync(TOKEN)
```

```
from apify_client import ApifyClient

TOKEN = 'MY-APIFY-TOKEN'


def main() -> None:
    # Client initialization with the API token.
    apify_client = ApifyClient(TOKEN)
```

Secure access

The API token is used to authorize your requests to the Apify API. You can be charged for the usage of the underlying services, so do not share your API token with untrusted parties or expose it on the client side of your applications.
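One way to keep the token out of your source code is to read it from an environment variable before constructing the client. A minimal sketch (the variable name `APIFY_TOKEN` is just the one used in this example):

```
import os

from apify_client import ApifyClient


def main() -> None:
    # Read the API token from an environment variable instead of hard-coding it.
    token = os.environ['APIFY_TOKEN']
    apify_client = ApifyClient(token)
```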
--- # apify-client-python ## Index[**](#Index) ### Async Resource Clients * [**retry\_with\_exp\_backoff\_async](https://docs.apify.com/api/client/python/api/client/python/reference.md#retry_with_exp_backoff_async) ### Classes * [**\_BaseApifyClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseApifyClient.md) * [**\_BaseBaseClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseBaseClient.md) * [**\_BaseHTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseHTTPClient.md) * [**\_ContextInjectingFilter](https://docs.apify.com/api/client/python/api/client/python/reference/class/_ContextInjectingFilter.md) * [**\_DebugLogFormatter](https://docs.apify.com/api/client/python/api/client/python/reference/class/_DebugLogFormatter.md) * [**ActorClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md) * [**ActorClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md) * [**ActorCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md) * [**ActorCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClientAsync.md) * [**ActorEnvVarClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClient.md) * [**ActorEnvVarClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClientAsync.md) * [**ActorEnvVarCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClient.md) * [**ActorEnvVarCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClientAsync.md) * [**ActorVersionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md) * [**ActorVersionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md) * [**ActorVersionCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClient.md) * [**ActorVersionCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClientAsync.md) * [**ApifyClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md) * [**ApifyClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md) * [**BatchAddRequestsResult](https://docs.apify.com/api/client/python/api/client/python/reference/class/BatchAddRequestsResult.md) * [**BuildClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md) * [**BuildClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md) * [**BuildCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClient.md) * [**BuildCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClientAsync.md) * [**DatasetClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md) * [**DatasetClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md) * 
[**DatasetCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClient.md) * [**DatasetCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClientAsync.md) * [**HTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md) * [**HTTPClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md) * [**KeyValueStoreClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md) * [**KeyValueStoreClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md) * [**KeyValueStoreCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClient.md) * [**KeyValueStoreCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClientAsync.md) * [**ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md) * [**LogClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md) * [**LogClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md) * [**LogContext](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogContext.md) * [**RedirectLogFormatter](https://docs.apify.com/api/client/python/api/client/python/reference/class/RedirectLogFormatter.md) * [**RequestQueueClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md) * [**RequestQueueClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md) * [**RequestQueueCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClient.md) * [**RequestQueueCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClientAsync.md) * [**RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) * [**RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) * [**RunCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md) * [**RunCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md) * [**ScheduleClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClient.md) * [**ScheduleClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClientAsync.md) * [**ScheduleCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClient.md) * [**ScheduleCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClientAsync.md) * [**StatusMessageWatcher](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcher.md) * [**StatusMessageWatcherAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md) * 
[**StatusMessageWatcherSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md) * [**StoreCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClient.md) * [**StoreCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClientAsync.md) * [**StreamedLog](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLog.md) * [**StreamedLogAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md) * [**StreamedLogSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md) * [**TaskClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md) * [**TaskClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md) * [**TaskCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClient.md) * [**TaskCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClientAsync.md) * [**UserClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClient.md) * [**UserClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClientAsync.md) * [**WebhookClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md) * [**WebhookClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md) * [**WebhookCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md) * [**WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md) * [**WebhookDispatchClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClient.md) * [**WebhookDispatchClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClientAsync.md) * [**WebhookDispatchCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClient.md) * [**WebhookDispatchCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClientAsync.md) * [**WithLogDetailsClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WithLogDetailsClient.md) ### Data structures * [**Statistics](https://docs.apify.com/api/client/python/api/client/python/reference/class/Statistics.md) ### Errors * [**ApifyApiError](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyApiError.md) * [**ApifyClientError](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientError.md) * [**InvalidResponseBodyError](https://docs.apify.com/api/client/python/api/client/python/reference/class/InvalidResponseBodyError.md) ### Methods * [**\_\_getattr\_\_](https://docs.apify.com/api/client/python/api/client/python/reference.md#__getattr__) * [**catch\_not\_found\_or\_throw](https://docs.apify.com/api/client/python/api/client/python/reference.md#catch_not_found_or_throw) * 
[**create\_redirect\_logger](https://docs.apify.com/api/client/python/api/client/python/reference.md#create_redirect_logger) * [**encode\_key\_value\_store\_record\_value](https://docs.apify.com/api/client/python/api/client/python/reference.md#encode_key_value_store_record_value) * [**encode\_webhook\_list\_to\_base64](https://docs.apify.com/api/client/python/api/client/python/reference.md#encode_webhook_list_to_base64) * [**get\_actor\_env\_var\_representation](https://docs.apify.com/api/client/python/api/client/python/reference.md#get_actor_env_var_representation) * [**get\_actor\_representation](https://docs.apify.com/api/client/python/api/client/python/reference.md#get_actor_representation) * [**get\_task\_representation](https://docs.apify.com/api/client/python/api/client/python/reference.md#get_task_representation) * [**get\_webhook\_representation](https://docs.apify.com/api/client/python/api/client/python/reference.md#get_webhook_representation) * [**is\_retryable\_error](https://docs.apify.com/api/client/python/api/client/python/reference.md#is_retryable_error) * [**pluck\_data](https://docs.apify.com/api/client/python/api/client/python/reference.md#pluck_data) * [**pluck\_data\_as\_list](https://docs.apify.com/api/client/python/api/client/python/reference.md#pluck_data_as_list) * [**retry\_with\_exp\_backoff](https://docs.apify.com/api/client/python/api/client/python/reference.md#retry_with_exp_backoff) * [**to\_safe\_id](https://docs.apify.com/api/client/python/api/client/python/reference.md#to_safe_id) ### Properties * [**\_\_version\_\_](https://docs.apify.com/api/client/python/api/client/python/reference.md#__version__) * [**API\_VERSION](https://docs.apify.com/api/client/python/api/client/python/reference.md#API_VERSION) * [**DEFAULT\_API\_URL](https://docs.apify.com/api/client/python/api/client/python/reference.md#DEFAULT_API_URL) * [**DEFAULT\_BACKOFF\_EXPONENTIAL\_FACTOR](https://docs.apify.com/api/client/python/api/client/python/reference.md#DEFAULT_BACKOFF_EXPONENTIAL_FACTOR) * [**DEFAULT\_BACKOFF\_RANDOM\_FACTOR](https://docs.apify.com/api/client/python/api/client/python/reference.md#DEFAULT_BACKOFF_RANDOM_FACTOR) * [**DEFAULT\_TIMEOUT](https://docs.apify.com/api/client/python/api/client/python/reference.md#DEFAULT_TIMEOUT) * [**DEFAULT\_WAIT\_FOR\_FINISH\_SEC](https://docs.apify.com/api/client/python/api/client/python/reference.md#DEFAULT_WAIT_FOR_FINISH_SEC) * [**DEFAULT\_WAIT\_WHEN\_JOB\_NOT\_EXIST\_SEC](https://docs.apify.com/api/client/python/api/client/python/reference.md#DEFAULT_WAIT_WHEN_JOB_NOT_EXIST_SEC) * [**DEPRECATED\_NAMES](https://docs.apify.com/api/client/python/api/client/python/reference.md#DEPRECATED_NAMES) * [**http\_client](https://docs.apify.com/api/client/python/api/client/python/reference.md#http_client) * [**http\_client](https://docs.apify.com/api/client/python/api/client/python/reference.md#http_client) * [**log\_context](https://docs.apify.com/api/client/python/api/client/python/reference.md#log_context) * [**logger](https://docs.apify.com/api/client/python/api/client/python/reference.md#logger) * [**logger](https://docs.apify.com/api/client/python/api/client/python/reference.md#logger) * [**logger](https://docs.apify.com/api/client/python/api/client/python/reference.md#logger) * [**logger\_name](https://docs.apify.com/api/client/python/api/client/python/reference.md#logger_name) * [**PARSE\_DATE\_FIELDS\_KEY\_SUFFIX](https://docs.apify.com/api/client/python/api/client/python/reference.md#PARSE_DATE_FIELDS_KEY_SUFFIX) * 
[**PARSE\_DATE\_FIELDS\_MAX\_DEPTH](https://docs.apify.com/api/client/python/api/client/python/reference.md#PARSE_DATE_FIELDS_MAX_DEPTH) * [**RECORD\_NOT\_FOUND\_EXCEPTION\_TYPES](https://docs.apify.com/api/client/python/api/client/python/reference.md#RECORD_NOT_FOUND_EXCEPTION_TYPES) * [**root\_client](https://docs.apify.com/api/client/python/api/client/python/reference.md#root_client) * [**root\_client](https://docs.apify.com/api/client/python/api/client/python/reference.md#root_client) * [**StopRetryingType](https://docs.apify.com/api/client/python/api/client/python/reference.md#StopRetryingType) * [**T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T) * [**T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T) ## Async Resource Clients[**](<#Async Resource Clients>) ### [**](#retry_with_exp_backoff_async)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L82)retry\_with\_exp\_backoff\_async * **async **retry\_with\_exp\_backoff\_async**(async\_func, \*, max\_retries, backoff\_base\_millis, backoff\_factor, random\_factor): [T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T) - #### Parameters * ##### async\_func: Callable\[\[StopRetryingType, int], Awaitable\[[T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T)]] * ##### optionalkeyword-onlymax\_retries: int = 8 * ##### optionalkeyword-onlybackoff\_base\_millis: int = 500 * ##### optionalkeyword-onlybackoff\_factor: float = 2 * ##### optionalkeyword-onlyrandom\_factor: float = 1 #### Returns [T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T) ## Methods[**](#Methods) ### [**](#__getattr__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/consts.py#L22)\_\_getattr\_\_ * ****\_\_getattr\_\_**(name): Any - #### Parameters * ##### name: str #### Returns Any ### [**](#catch_not_found_or_throw)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L115)catch\_not\_found\_or\_throw * ****catch\_not\_found\_or\_throw**(exc): None - #### Parameters * ##### exc: [ApifyApiError](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyApiError.md) #### Returns None ### [**](#create_redirect_logger)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L129)create\_redirect\_logger * ****create\_redirect\_logger**(name): logging.Logger - Create a logger for redirecting logs from another Actor. *** #### Parameters * ##### name: str The name of the logger. It can be used to inherit from other loggers. Example: `apify.xyz` will use logger named `xyz` and make it a children of `apify` logger. #### Returns logging.Logger ### [**](#encode_key_value_store_record_value)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L139)encode\_key\_value\_store\_record\_value * ****encode\_key\_value\_store\_record\_value**(value, content\_type): tuple\[Any, str] - #### Parameters * ##### value: Any * ##### optionalcontent\_type: str | None = None #### Returns tuple\[Any, str] ### [**](#encode_webhook_list_to_base64)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L122)encode\_webhook\_list\_to\_base64 * ****encode\_webhook\_list\_to\_base64**(webhooks): str - Encode a list of dictionaries representing webhooks to their base64-encoded representation for the API. 
*** #### Parameters * ##### webhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] #### Returns str ### [**](#get_actor_env_var_representation)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var.py#L10)get\_actor\_env\_var\_representation * ****get\_actor\_env\_var\_representation**(\*, is\_secret, name, value): dict - Return an environment variable representation of the Actor in a dictionary. *** #### Parameters * ##### optionalkeyword-onlyis\_secret: bool | None = None * ##### optionalkeyword-onlyname: str | None = None * ##### optionalkeyword-onlyvalue: str | None = None #### Returns dict ### [**](#get_actor_representation)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L35)get\_actor\_representation * ****get\_actor\_representation**(\*, name, title, description, seo\_title, seo\_description, versions, restart\_on\_error, is\_public, is\_deprecated, is\_anonymously\_runnable, categories, default\_run\_build, default\_run\_max\_items, default\_run\_memory\_mbytes, default\_run\_timeout\_secs, example\_run\_input\_body, example\_run\_input\_content\_type, actor\_standby\_is\_enabled, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes, pricing\_infos): dict - Get dictionary representation of the Actor. *** #### Parameters * ##### keyword-onlyname: str | None * ##### optionalkeyword-onlytitle: str | None = None * ##### optionalkeyword-onlydescription: str | None = None * ##### optionalkeyword-onlyseo\_title: str | None = None * ##### optionalkeyword-onlyseo\_description: str | None = None * ##### optionalkeyword-onlyversions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None * ##### optionalkeyword-onlyrestart\_on\_error: bool | None = None * ##### optionalkeyword-onlyis\_public: bool | None = None * ##### optionalkeyword-onlyis\_deprecated: bool | None = None * ##### optionalkeyword-onlyis\_anonymously\_runnable: bool | None = None * ##### optionalkeyword-onlycategories: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None * ##### optionalkeyword-onlydefault\_run\_build: str | None = None * ##### optionalkeyword-onlydefault\_run\_max\_items: int | None = None * ##### optionalkeyword-onlydefault\_run\_memory\_mbytes: int | None = None * ##### optionalkeyword-onlydefault\_run\_timeout\_secs: int | None = None * ##### optionalkeyword-onlyexample\_run\_input\_body: Any = None * ##### optionalkeyword-onlyexample\_run\_input\_content\_type: str | None = None * ##### optionalkeyword-onlyactor\_standby\_is\_enabled: bool | None = None * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None * ##### optionalkeyword-onlypricing\_infos: 
[list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None #### Returns dict ### [**](#get_task_representation)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L26)get\_task\_representation * ****get\_task\_representation**(actor\_id, name, task\_input, build, max\_items, memory\_mbytes, timeout\_secs, title, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes): dict - Get the dictionary representation of a task. *** #### Parameters * ##### optionalactor\_id: str | None = None * ##### optionalname: str | None = None * ##### optionaltask\_input: dict | None = None * ##### optionalbuild: str | None = None * ##### optionalmax\_items: int | None = None * ##### optionalmemory\_mbytes: int | None = None * ##### optionaltimeout\_secs: int | None = None * ##### optionaltitle: str | None = None * ##### optionalactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None * ##### optionalactor\_standby\_max\_requests\_per\_actor\_run: int | None = None * ##### optionalactor\_standby\_idle\_timeout\_secs: int | None = None * ##### optionalactor\_standby\_build: str | None = None * ##### optionalactor\_standby\_memory\_mbytes: int | None = None #### Returns dict ### [**](#get_webhook_representation)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L24)get\_webhook\_representation * ****get\_webhook\_representation**(\*, event\_types, request\_url, payload\_template, headers\_template, actor\_id, actor\_task\_id, actor\_run\_id, ignore\_ssl\_errors, do\_not\_retry, idempotency\_key, is\_ad\_hoc): dict - Prepare webhook dictionary representation for clients. *** #### Parameters * ##### optionalkeyword-onlyevent\_types: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[WebhookEventType] | None = None * ##### optionalkeyword-onlyrequest\_url: str | None = None * ##### optionalkeyword-onlypayload\_template: str | None = None * ##### optionalkeyword-onlyheaders\_template: str | None = None * ##### optionalkeyword-onlyactor\_id: str | None = None * ##### optionalkeyword-onlyactor\_task\_id: str | None = None * ##### optionalkeyword-onlyactor\_run\_id: str | None = None * ##### optionalkeyword-onlyignore\_ssl\_errors: bool | None = None * ##### optionalkeyword-onlydo\_not\_retry: bool | None = None * ##### optionalkeyword-onlyidempotency\_key: str | None = None * ##### optionalkeyword-onlyis\_ad\_hoc: bool | None = None #### Returns dict ### [**](#is_retryable_error)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_errors.py#L77)is\_retryable\_error * ****is\_retryable\_error**(exc): bool - Check if the given error is retryable. 
*** #### Parameters * ##### exc: Exception #### Returns bool ### [**](#pluck_data)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L35)pluck\_data * ****pluck\_data**(parsed\_response): dict - #### Parameters * ##### parsed\_response: Any #### Returns dict ### [**](#pluck_data_as_list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L42)pluck\_data\_as\_list * ****pluck\_data\_as\_list**(parsed\_response): [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list) - #### Parameters * ##### parsed\_response: Any #### Returns [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list) ### [**](#retry_with_exp_backoff)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L49)retry\_with\_exp\_backoff * ****retry\_with\_exp\_backoff**(func, \*, max\_retries, backoff\_base\_millis, backoff\_factor, random\_factor): [T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T) - #### Parameters * ##### func: Callable\[\[StopRetryingType, int], [T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T)] * ##### optionalkeyword-onlymax\_retries: int = 8 * ##### optionalkeyword-onlybackoff\_base\_millis: int = 500 * ##### optionalkeyword-onlybackoff\_factor: float = 2 * ##### optionalkeyword-onlyrandom\_factor: float = 1 #### Returns [T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T) ### [**](#to_safe_id)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L28)to\_safe\_id * ****to\_safe\_id**(id): str - #### Parameters * ##### id: str #### Returns str ## Properties[**](#Properties) ### [**](#__version__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/__init__.py#L5)\_\_version\_\_ **\_\_version\_\_: Undefined ### [**](#API_VERSION)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L58)API\_VERSION **API\_VERSION: Undefined ### [**](#DEFAULT_API_URL)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L56)DEFAULT\_API\_URL **DEFAULT\_API\_URL: Undefined ### [**](#DEFAULT_BACKOFF_EXPONENTIAL_FACTOR)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_http_client.py#L25)DEFAULT\_BACKOFF\_EXPONENTIAL\_FACTOR **DEFAULT\_BACKOFF\_EXPONENTIAL\_FACTOR: Undefined ### [**](#DEFAULT_BACKOFF_RANDOM_FACTOR)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_http_client.py#L26)DEFAULT\_BACKOFF\_RANDOM\_FACTOR **DEFAULT\_BACKOFF\_RANDOM\_FACTOR: Undefined ### [**](#DEFAULT_TIMEOUT)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L57)DEFAULT\_TIMEOUT **DEFAULT\_TIMEOUT: Undefined ### [**](#DEFAULT_WAIT_FOR_FINISH_SEC)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/actor_job_base_client.py#L15)DEFAULT\_WAIT\_FOR\_FINISH\_SEC **DEFAULT\_WAIT\_FOR\_FINISH\_SEC: Undefined ### [**](#DEFAULT_WAIT_WHEN_JOB_NOT_EXIST_SEC)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/actor_job_base_client.py#L18)DEFAULT\_WAIT\_WHEN\_JOB\_NOT\_EXIST\_SEC **DEFAULT\_WAIT\_WHEN\_JOB\_NOT\_EXIST\_SEC: Undefined ### 
[**](#DEPRECATED_NAMES)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/consts.py#L11)DEPRECATED\_NAMES **DEPRECATED\_NAMES: Undefined ### [**](#http_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L52)http\_client **http\_client: [HTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md) ### [**](#http_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L95)http\_client **http\_client: [HTTPClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md) ### [**](#log_context)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L34)log\_context **log\_context: Undefined ### [**](#logger)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L22)logger **logger: Undefined ### [**](#logger)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_http_client.py#L28)logger **logger: Undefined ### [**](#logger)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L22)logger **logger: Undefined ### [**](#logger_name)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L19)logger\_name **logger\_name: Undefined ### [**](#PARSE_DATE_FIELDS_KEY_SUFFIX)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L20)PARSE\_DATE\_FIELDS\_KEY\_SUFFIX **PARSE\_DATE\_FIELDS\_KEY\_SUFFIX: Undefined ### [**](#PARSE_DATE_FIELDS_MAX_DEPTH)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L19)PARSE\_DATE\_FIELDS\_MAX\_DEPTH **PARSE\_DATE\_FIELDS\_MAX\_DEPTH: Undefined ### [**](#RECORD_NOT_FOUND_EXCEPTION_TYPES)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L22)RECORD\_NOT\_FOUND\_EXCEPTION\_TYPES **RECORD\_NOT\_FOUND\_EXCEPTION\_TYPES: Undefined ### [**](#root_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L53)root\_client **root\_client: [ApifyClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md) ### [**](#root_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L96)root\_client **root\_client: [ApifyClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md) ### [**](#StopRetryingType)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L25)StopRetryingType **StopRetryingType: Undefined ### [**](#T)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_utils.py#L24)T **T: Undefined ### [**](#T)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/resource_collection_client.py#L10)T **T: Undefined --- # \_BaseApifyClient ### Hierarchy * *\_BaseApifyClient* * [ApifyClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md) * [ApifyClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md) ## Index[**](#Index) ### Properties * [**http\_client](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseApifyClient.md#http_client) ## 
Properties[**](#Properties) ### [**](#http_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L62)http\_client **http\_client: [HTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md) | [HTTPClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md) --- # \_BaseBaseClient ## Index[**](#Index) ### Properties * [**http\_client](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseBaseClient.md#http_client) * [**params](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseBaseClient.md#params) * [**resource\_id](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseBaseClient.md#resource_id) * [**root\_client](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseBaseClient.md#root_client) * [**url](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseBaseClient.md#url) ## Properties[**](#Properties) ### [**](#http_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L20)http\_client **http\_client: [HTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md) | [HTTPClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md) ### [**](#params)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L19)params **params: dict ### [**](#resource_id)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L17)resource\_id **resource\_id: str | None ### [**](#root_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L21)root\_client **root\_client: [ApifyClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md) | [ApifyClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md) ### [**](#url)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/base_client.py#L18)url **url: str --- # \_BaseHTTPClient ### Hierarchy * *\_BaseHTTPClient* * [HTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md) * [HTTPClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md) --- # \_ContextInjectingFilter ## Index[**](#Index) ### Methods * [**filter](https://docs.apify.com/api/client/python/api/client/python/reference/class/_ContextInjectingFilter.md#filter) ## Methods[**](#Methods) ### [**](#filter)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L94)filter * ****filter**(record): bool - #### Parameters * ##### record: logging.LogRecord #### Returns bool --- # \_DebugLogFormatter ## Index[**](#Index) ### Methods * [**format](https://docs.apify.com/api/client/python/api/client/python/reference/class/_DebugLogFormatter.md#format) ### Properties * [**empty\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/_DebugLogFormatter.md#empty_record) ## Methods[**](#Methods) ### [**](#format)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L120)format * ****format**(record): str - #### Parameters * ##### record: 
logging.LogRecord #### Returns str ## Properties[**](#Properties) ### [**](#empty_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L109)empty\_record **empty\_record: Undefined --- # ActorClient Sub-client for manipulating a single Actor. ## Index[**](#Index) ### Methods * [**build](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#build) * [**builds](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#builds) * [**call](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#call) * [**default\_build](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#default_build) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#get) * [**last\_run](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#last_run) * [**runs](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#runs) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#start) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#update) * [**validate\_input](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#validate_input) * [**version](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#version) * [**versions](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#versions) * [**webhooks](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md#webhooks) ## Methods[**](#Methods) ### [**](#build)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L347)build * ****build**(\*, version\_number, beta\_packages, tag, use\_cache, wait\_for\_finish): dict - Build the Actor. *** #### Parameters * ##### keyword-onlyversion\_number: str Actor version number to be built. * ##### optionalkeyword-onlybeta\_packages: bool | None = None If True, then the Actor is built with beta versions of Apify NPM packages. By default, the build uses latest stable packages. * ##### optionalkeyword-onlytag: str | None = None Tag to be applied to the build on success. By default, the tag is taken from the Actor version's build tag property. * ##### optionalkeyword-onlyuse\_cache: bool | None = None If true, the Actor's Docker container will be rebuilt using layer cache (). This is to enable quick rebuild during development. By default, the cache is not used. * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the build to finish before returning. By default it is 0, the maximum value is 60. #### Returns dict ### [**](#builds)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L391)builds * ****builds**(): [BuildCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClient.md) - Retrieve a client for the builds of this Actor. 
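To make the `build` and `builds` methods above more concrete, here is a minimal sketch of triggering a build and listing recent builds. The token, Actor ID, and version number are placeholders; adjust them to an Actor you own.

```python
from apify_client import ApifyClient

# Placeholder token and Actor ID - replace with your own values.
client = ApifyClient('MY-APIFY-TOKEN')
actor_client = client.actor('username~my-actor')

# Trigger a build of version 0.1, tag it "beta", and wait up to 60 seconds for it to finish.
build = actor_client.build(version_number='0.1', tag='beta', wait_for_finish=60)
print(build['status'])

# List the ten most recent builds of this Actor.
recent_builds = actor_client.builds().list(limit=10, desc=True)
for item in recent_builds.items:
    print(item['id'], item['status'])
```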
*** #### Returns [BuildCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClient.md) ### [**](#call)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L281)call * ****call**(\*, run\_input, content\_type, build, max\_items, max\_total\_charge\_usd, memory\_mbytes, timeout\_secs, webhooks, wait\_secs, logger): dict | None - Start the Actor and wait for it to finish before returning the Run object. It waits indefinitely, unless the wait\_secs argument is provided. *** #### Parameters * ##### optionalkeyword-onlyrun\_input: Any = None The input to pass to the Actor run. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymax\_total\_charge\_usd: Decimal | None = None A limit on the total charged amount for pay-per-event actors. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses the timeout specified in the default run configuration for the Actor. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Optional webhooks associated with the Actor run, which can be used to receive a notification, e.g. when the Actor finishes or fails. If you already have a webhook set up for the Actor, you do not have to add it again here. * ##### optionalkeyword-onlywait\_secs: int | None = None The maximum number of seconds the server waits for the run to finish. If not provided, waits indefinitely. * ##### optionalkeyword-onlylogger: (Logger | None) | Literal\[default] = 'default' Logger used to redirect logs from the Actor run. Using the "default" literal means that a predefined default logger will be used. Setting `None` will disable any log propagation. Passing a custom logger will redirect logs to the provided logger. The logger is also used to capture the status and status message of the Actor run. #### Returns dict | None ### [**](#default_build)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L399)default\_build * **async **default\_build**(\*, wait\_for\_finish): [BuildClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md) - Retrieve the Actor's default build. *** #### Parameters * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the build to finish before returning. By default it is 0; the maximum value is 60.
#### Returns [BuildClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L210)delete * ****delete**(): None - Delete the Actor. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L105)get * ****get**(): dict | None - Retrieve the Actor. *** #### Returns dict | None ### [**](#last_run)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L429)last\_run * ****last\_run**(\*, status, origin): [RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) - Retrieve the client for the last run of this Actor. Last run is retrieved based on the start time of the runs. *** #### Parameters * ##### optionalkeyword-onlystatus: ActorJobStatus | None = None Consider only runs with this status. * ##### optionalkeyword-onlyorigin: MetaOrigin | None = None Consider only runs started with this origin. #### Returns [RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) ### [**](#runs)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L395)runs * ****runs**(): [RunCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md) - Retrieve a client for the runs of this Actor. *** #### Returns [RunCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md) ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L217)start * ****start**(\*, run\_input, content\_type, build, max\_items, max\_total\_charge\_usd, memory\_mbytes, timeout\_secs, wait\_for\_finish, webhooks): dict - Start the Actor and immediately return the Run object. *** #### Parameters * ##### optionalkeyword-onlyrun\_input: Any = None The input to pass to the Actor run. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymax\_total\_charge\_usd: Decimal | None = None A limit on the total charged amount for pay-per-event actors. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the default run configuration for the Actor. * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the run to finish. By default, it is 0, the maximum value is 60. 
* ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Optional ad-hoc webhooks () associated with the Actor run which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor or task, you do not have to add it again here. Each webhook is represented by a dictionary containing these items: * `event_types`: List of `WebhookEventType` values which trigger the webhook. * `request_url`: URL to which to send the webhook HTTP request. * `payload_template`: Optional template for the request payload. #### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L115)update * ****update**(\*, name, title, description, seo\_title, seo\_description, versions, restart\_on\_error, is\_public, is\_deprecated, is\_anonymously\_runnable, categories, default\_run\_build, default\_run\_max\_items, default\_run\_memory\_mbytes, default\_run\_timeout\_secs, example\_run\_input\_body, example\_run\_input\_content\_type, actor\_standby\_is\_enabled, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes, pricing\_infos): dict - Update the Actor with the specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the Actor. * ##### optionalkeyword-onlytitle: str | None = None The title of the Actor (human-readable). * ##### optionalkeyword-onlydescription: str | None = None The description for the Actor. * ##### optionalkeyword-onlyseo\_title: str | None = None The title of the Actor optimized for search engines. * ##### optionalkeyword-onlyseo\_description: str | None = None The description of the Actor optimized for search engines. * ##### optionalkeyword-onlyversions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None The list of Actor versions. * ##### optionalkeyword-onlyrestart\_on\_error: bool | None = None If true, the main Actor run process will be restarted whenever it exits with a non-zero status code. * ##### optionalkeyword-onlyis\_public: bool | None = None Whether the Actor is public. * ##### optionalkeyword-onlyis\_deprecated: bool | None = None Whether the Actor is deprecated. * ##### optionalkeyword-onlyis\_anonymously\_runnable: bool | None = None Whether the Actor is anonymously runnable. * ##### optionalkeyword-onlycategories: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None The categories to which the Actor belongs to. * ##### optionalkeyword-onlydefault\_run\_build: str | None = None Tag or number of the build that you want to run by default. * ##### optionalkeyword-onlydefault\_run\_max\_items: int | None = None Default limit of the number of results that will be returned by runs of this Actor, if the Actor is charged per result. * ##### optionalkeyword-onlydefault\_run\_memory\_mbytes: int | None = None Default amount of memory allocated for the runs of this Actor, in megabytes. * ##### optionalkeyword-onlydefault\_run\_timeout\_secs: int | None = None Default timeout for the runs of this Actor in seconds. 
* ##### optionalkeyword-onlyexample\_run\_input\_body: Any = None Input to be prefilled as the default input for new users of this Actor. * ##### optionalkeyword-onlyexample\_run\_input\_content\_type: str | None = None The content type of the example run input. * ##### optionalkeyword-onlyactor\_standby\_is\_enabled: bool | None = None Whether Actor Standby is enabled. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. * ##### optionalkeyword-onlypricing\_infos: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None A list of objects that describe the pricing of the Actor. #### Returns dict ### [**](#validate_input)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L476)validate\_input * ****validate\_input**(run\_input, \*, build\_tag, content\_type): bool - Validate an input for the Actor that defines an input schema. *** #### Parameters * ##### optionalrun\_input: Any = None The input to validate. * ##### optionalkeyword-onlybuild\_tag: str | None = None The Actor's build tag. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. #### Returns bool ### [**](#version)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L461)version * ****version**(version\_number): [ActorVersionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md) - Retrieve the client for the specified version of this Actor. *** #### Parameters * ##### version\_number: str The version number for which to retrieve the resource client. #### Returns [ActorVersionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md) ### [**](#versions)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L457)versions * ****versions**(): [ActorVersionCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClient.md) - Retrieve a client for the versions of this Actor. *** #### Returns [ActorVersionCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClient.md) ### [**](#webhooks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L472)webhooks * ****webhooks**(): [WebhookCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md) - Retrieve a client for webhooks associated with this Actor.
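Putting several of the `ActorClient` methods from this section together, the following sketch runs an Actor synchronously, reads the items from its default dataset, and then inspects its last run. The token, Actor ID, and input fields are placeholders, and the dataset access via `client.dataset(...)` belongs to the wider client, documented elsewhere in this reference.

```python
from apify_client import ApifyClient

client = ApifyClient('MY-APIFY-TOKEN')  # placeholder token
actor_client = client.actor('username~my-actor')  # placeholder Actor ID

# Run the Actor with some input and wait up to 120 seconds for it to finish.
run = actor_client.call(run_input={'query': 'web scraping'}, wait_secs=120)

if run is not None:
    # Read the items the run stored in its default dataset.
    for item in client.dataset(run['defaultDatasetId']).iterate_items():
        print(item)

# Inspect the most recent run of this Actor, whatever its status.
last_run_details = actor_client.last_run().get()
print(last_run_details)
```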
*** #### Returns [WebhookCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md) --- # ActorClientAsync Async sub-client for manipulating a single Actor. ## Index[**](#Index) ### Methods * [**build](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#build) * [**builds](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#builds) * [**call](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#call) * [**default\_build](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#default_build) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#get) * [**last\_run](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#last_run) * [**runs](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#runs) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#start) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#update) * [**validate\_input](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#validate_input) * [**version](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#version) * [**versions](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#versions) * [**webhooks](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md#webhooks) ## Methods[**](#Methods) ### [**](#build)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L756)build * **async **build**(\*, version\_number, beta\_packages, tag, use\_cache, wait\_for\_finish): dict - Build the Actor. *** #### Parameters * ##### keyword-onlyversion\_number: str Actor version number to be built. * ##### optionalkeyword-onlybeta\_packages: bool | None = None If True, then the Actor is built with beta versions of Apify NPM packages. By default, the build uses latest stable packages. * ##### optionalkeyword-onlytag: str | None = None Tag to be applied to the build on success. By default, the tag is taken from the Actor version's build tag property. * ##### optionalkeyword-onlyuse\_cache: bool | None = None If true, the Actor's Docker container will be rebuilt using layer cache (). This is to enable quick rebuild during development. By default, the cache is not used. * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the build to finish before returning. By default it is 0, the maximum value is 60. #### Returns dict ### [**](#builds)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L800)builds * ****builds**(): [BuildCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClientAsync.md) - Retrieve a client for the builds of this Actor. 
*** #### Returns [BuildCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClientAsync.md) ### [**](#call)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L686)call * **async **call**(\*, run\_input, content\_type, build, max\_items, max\_total\_charge\_usd, memory\_mbytes, timeout\_secs, webhooks, wait\_secs, logger): dict | None - Start the Actor and wait for it to finish before returning the Run object. It waits indefinitely, unless the wait\_secs argument is provided. *** #### Parameters * ##### optionalkeyword-onlyrun\_input: Any = None The input to pass to the Actor run. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymax\_total\_charge\_usd: Decimal | None = None A limit on the total charged amount for pay-per-event actors. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the default run configuration for the Actor. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Optional webhooks () associated with the Actor run, which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor, you do not have to add it again here. * ##### optionalkeyword-onlywait\_secs: int | None = None The maximum number of seconds the server waits for the run to finish. If not provided, waits indefinitely. * ##### optionalkeyword-onlylogger: (Logger | None) | Literal\[default] = 'default' Logger used to redirect logs from the Actor run. Using "default" literal means that a predefined default logger will be used. Setting `None` will disable any log propagation. Passing custom logger will redirect logs to the provided logger. The logger is also used to capture status and status message of the other Actor run. #### Returns dict | None ### [**](#default_build)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L808)default\_build * **async **default\_build**(\*, wait\_for\_finish): [BuildClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md) - Retrieve Actor's default build. *** #### Parameters * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the build to finish before returning. By default it is 0, the maximum value is 60. 
#### Returns [BuildClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L615)delete * **async **delete**(): None - Delete the Actor. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L510)get * **async **get**(): dict | None - Retrieve the Actor. *** #### Returns dict | None ### [**](#last_run)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L842)last\_run * ****last\_run**(\*, status, origin): [RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) - Retrieve the client for the last run of this Actor. Last run is retrieved based on the start time of the runs. *** #### Parameters * ##### optionalkeyword-onlystatus: ActorJobStatus | None = None Consider only runs with this status. * ##### optionalkeyword-onlyorigin: MetaOrigin | None = None Consider only runs started with this origin. #### Returns [RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) ### [**](#runs)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L804)runs * ****runs**(): [RunCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md) - Retrieve a client for the runs of this Actor. *** #### Returns [RunCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md) ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L622)start * **async **start**(\*, run\_input, content\_type, build, max\_items, max\_total\_charge\_usd, memory\_mbytes, timeout\_secs, wait\_for\_finish, webhooks): dict - Start the Actor and immediately return the Run object. *** #### Parameters * ##### optionalkeyword-onlyrun\_input: Any = None The input to pass to the Actor run. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymax\_total\_charge\_usd: Decimal | None = None A limit on the total charged amount for pay-per-event actors. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the default run configuration for the Actor. * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the run to finish. By default, it is 0, the maximum value is 60. 
* ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Optional ad-hoc webhooks () associated with the Actor run which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor or task, you do not have to add it again here. Each webhook is represented by a dictionary containing these items: * `event_types`: List of `WebhookEventType` values which trigger the webhook. * `request_url`: URL to which to send the webhook HTTP request. * `payload_template`: Optional template for the request payload. #### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L520)update * **async **update**(\*, name, title, description, seo\_title, seo\_description, versions, restart\_on\_error, is\_public, is\_deprecated, is\_anonymously\_runnable, categories, default\_run\_build, default\_run\_max\_items, default\_run\_memory\_mbytes, default\_run\_timeout\_secs, example\_run\_input\_body, example\_run\_input\_content\_type, actor\_standby\_is\_enabled, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes, pricing\_infos): dict - Update the Actor with the specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the Actor. * ##### optionalkeyword-onlytitle: str | None = None The title of the Actor (human-readable). * ##### optionalkeyword-onlydescription: str | None = None The description for the Actor. * ##### optionalkeyword-onlyseo\_title: str | None = None The title of the Actor optimized for search engines. * ##### optionalkeyword-onlyseo\_description: str | None = None The description of the Actor optimized for search engines. * ##### optionalkeyword-onlyversions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None The list of Actor versions. * ##### optionalkeyword-onlyrestart\_on\_error: bool | None = None If true, the main Actor run process will be restarted whenever it exits with a non-zero status code. * ##### optionalkeyword-onlyis\_public: bool | None = None Whether the Actor is public. * ##### optionalkeyword-onlyis\_deprecated: bool | None = None Whether the Actor is deprecated. * ##### optionalkeyword-onlyis\_anonymously\_runnable: bool | None = None Whether the Actor is anonymously runnable. * ##### optionalkeyword-onlycategories: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None The categories to which the Actor belongs to. * ##### optionalkeyword-onlydefault\_run\_build: str | None = None Tag or number of the build that you want to run by default. * ##### optionalkeyword-onlydefault\_run\_max\_items: int | None = None Default limit of the number of results that will be returned by runs of this Actor, if the Actor is charged per result. * ##### optionalkeyword-onlydefault\_run\_memory\_mbytes: int | None = None Default amount of memory allocated for the runs of this Actor, in megabytes. * ##### optionalkeyword-onlydefault\_run\_timeout\_secs: int | None = None Default timeout for the runs of this Actor in seconds. 
* ##### optionalkeyword-onlyexample\_run\_input\_body: Any = None Input to be prefilled as default input to new users of this Actor. * ##### optionalkeyword-onlyexample\_run\_input\_content\_type: str | None = None The content type of the example run input. * ##### optionalkeyword-onlyactor\_standby\_is\_enabled: bool | None = None Whether the Actor Standby is enabled. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. * ##### optionalkeyword-onlypricing\_infos: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None A list of objects that describes the pricing of the Actor. #### Returns dict ### [**](#validate_input)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L889)validate\_input * **async **validate\_input**(run\_input, \*, build\_tag, content\_type): bool - Validate an input for the Actor that defines an input schema. *** #### Parameters * ##### optionalrun\_input: Any = None The input to validate. * ##### optionalkeyword-onlybuild\_tag: str | None = None The actor's build tag. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. #### Returns bool ### [**](#version)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L874)version * ****version**(version\_number): [ActorVersionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md) - Retrieve the client for the specified version of this Actor. *** #### Parameters * ##### version\_number: str The version number for which to retrieve the resource client. #### Returns [ActorVersionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md) ### [**](#versions)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L870)versions * ****versions**(): [ActorVersionCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClientAsync.md) - Retrieve a client for the versions of this Actor. *** #### Returns [ActorVersionCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClientAsync.md) ### [**](#webhooks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor.py#L885)webhooks * ****webhooks**(): [WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md) - Retrieve a client for webhooks associated with this Actor. 
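The asynchronous client mirrors the synchronous one, except that its methods are awaited. Below is a minimal sketch of `start` and `call` on `ActorClientAsync`, using a placeholder token, Actor ID, and input.

```python
import asyncio

from apify_client import ApifyClientAsync


async def main() -> None:
    client = ApifyClientAsync('MY-APIFY-TOKEN')  # placeholder token
    actor_client = client.actor('username~my-actor')  # placeholder Actor ID

    # Fire-and-forget start: returns the Run object immediately.
    run = await actor_client.start(run_input={'query': 'web scraping'})
    print(run['id'], run['status'])

    # Or run the Actor and wait for it to finish, with a 120-second cap.
    finished_run = await actor_client.call(run_input={'query': 'web scraping'}, wait_secs=120)
    if finished_run is not None:
        print(finished_run['status'])


asyncio.run(main())
```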
*** #### Returns [WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md) --- # ActorCollectionClient Sub-client for manipulating Actors. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_collection.py#L47)create * ****create**(\*, name, title, description, seo\_title, seo\_description, versions, restart\_on\_error, is\_public, is\_deprecated, is\_anonymously\_runnable, categories, default\_run\_build, default\_run\_max\_items, default\_run\_memory\_mbytes, default\_run\_timeout\_secs, example\_run\_input\_body, example\_run\_input\_content\_type, actor\_standby\_is\_enabled, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes): dict - Create a new Actor. *** #### Parameters * ##### keyword-onlyname: str The name of the Actor. * ##### optionalkeyword-onlytitle: str | None = None The title of the Actor (human-readable). * ##### optionalkeyword-onlydescription: str | None = None The description for the Actor. * ##### optionalkeyword-onlyseo\_title: str | None = None The title of the Actor optimized for search engines. * ##### optionalkeyword-onlyseo\_description: str | None = None The description of the Actor optimized for search engines. * ##### optionalkeyword-onlyversions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None The list of Actor versions. * ##### optionalkeyword-onlyrestart\_on\_error: bool | None = None If true, the main Actor run process will be restarted whenever it exits with a non-zero status code. * ##### optionalkeyword-onlyis\_public: bool | None = None Whether the Actor is public. * ##### optionalkeyword-onlyis\_deprecated: bool | None = None Whether the Actor is deprecated. * ##### optionalkeyword-onlyis\_anonymously\_runnable: bool | None = None Whether the Actor is anonymously runnable. * ##### optionalkeyword-onlycategories: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None The categories to which the Actor belongs to. * ##### optionalkeyword-onlydefault\_run\_build: str | None = None Tag or number of the build that you want to run by default. * ##### optionalkeyword-onlydefault\_run\_max\_items: int | None = None Default limit of the number of results that will be returned by runs of this Actor, if the Actor is charged per result. * ##### optionalkeyword-onlydefault\_run\_memory\_mbytes: int | None = None Default amount of memory allocated for the runs of this Actor, in megabytes. * ##### optionalkeyword-onlydefault\_run\_timeout\_secs: int | None = None Default timeout for the runs of this Actor in seconds. * ##### optionalkeyword-onlyexample\_run\_input\_body: Any = None Input to be prefilled as default input to new users of this Actor. * ##### optionalkeyword-onlyexample\_run\_input\_content\_type: str | None = None The content type of the example run input. 
* ##### optionalkeyword-onlyactor\_standby\_is\_enabled: bool | None = None Whether the Actor Standby is enabled. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_collection.py#L22)list * ****list**(\*, my, limit, offset, desc, sort\_by): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the Actors the user has created or used. *** #### Parameters * ##### optionalkeyword-onlymy: bool | None = None If True, will return only Actors which the user has created themselves. * ##### optionalkeyword-onlylimit: int | None = None How many Actors to list. * ##### optionalkeyword-onlyoffset: int | None = None What Actor to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the Actors in descending order based on their creation date. * ##### optionalkeyword-onlysort\_by: Literal\[createdAt, stats.lastRunStartedAt] | None = 'createdAt' Field to sort the results by. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ActorCollectionClientAsync Async sub-client for manipulating Actors. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClientAsync.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_collection.py#L173)create * **async **create**(\*, name, title, description, seo\_title, seo\_description, versions, restart\_on\_error, is\_public, is\_deprecated, is\_anonymously\_runnable, categories, default\_run\_build, default\_run\_max\_items, default\_run\_memory\_mbytes, default\_run\_timeout\_secs, example\_run\_input\_body, example\_run\_input\_content\_type, actor\_standby\_is\_enabled, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes): dict - Create a new Actor. *** #### Parameters * ##### keyword-onlyname: str The name of the Actor. * ##### optionalkeyword-onlytitle: str | None = None The title of the Actor (human-readable). * ##### optionalkeyword-onlydescription: str | None = None The description for the Actor. * ##### optionalkeyword-onlyseo\_title: str | None = None The title of the Actor optimized for search engines. 
* ##### optionalkeyword-onlyseo\_description: str | None = None The description of the Actor optimized for search engines. * ##### optionalkeyword-onlyversions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None The list of Actor versions. * ##### optionalkeyword-onlyrestart\_on\_error: bool | None = None If true, the main Actor run process will be restarted whenever it exits with a non-zero status code. * ##### optionalkeyword-onlyis\_public: bool | None = None Whether the Actor is public. * ##### optionalkeyword-onlyis\_deprecated: bool | None = None Whether the Actor is deprecated. * ##### optionalkeyword-onlyis\_anonymously\_runnable: bool | None = None Whether the Actor is anonymously runnable. * ##### optionalkeyword-onlycategories: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None The categories to which the Actor belongs to. * ##### optionalkeyword-onlydefault\_run\_build: str | None = None Tag or number of the build that you want to run by default. * ##### optionalkeyword-onlydefault\_run\_max\_items: int | None = None Default limit of the number of results that will be returned by runs of this Actor, if the Actor is charged per result. * ##### optionalkeyword-onlydefault\_run\_memory\_mbytes: int | None = None Default amount of memory allocated for the runs of this Actor, in megabytes. * ##### optionalkeyword-onlydefault\_run\_timeout\_secs: int | None = None Default timeout for the runs of this Actor in seconds. * ##### optionalkeyword-onlyexample\_run\_input\_body: Any = None Input to be prefilled as default input to new users of this Actor. * ##### optionalkeyword-onlyexample\_run\_input\_content\_type: str | None = None The content type of the example run input. * ##### optionalkeyword-onlyactor\_standby\_is\_enabled: bool | None = None Whether the Actor Standby is enabled. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_collection.py#L148)list * **async **list**(\*, my, limit, offset, desc, sort\_by): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the Actors the user has created or used. *** #### Parameters * ##### optionalkeyword-onlymy: bool | None = None If True, will return only Actors which the user has created themselves. * ##### optionalkeyword-onlylimit: int | None = None How many Actors to list. * ##### optionalkeyword-onlyoffset: int | None = None What Actor to include as first when retrieving the list. 
* ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the Actors in descending order based on their creation date. * ##### optionalkeyword-onlysort\_by: Literal\[createdAt, stats.lastRunStartedAt] | None = 'createdAt' Field to sort the results by. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ActorEnvVarClient Sub-client for manipulating a single Actor environment variable. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClient.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClient.md#get) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClient.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var.py#L69)delete * ****delete**(): None - Delete the Actor environment variable. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var.py#L32)get * ****get**(): dict | None - Return information about the Actor environment variable. *** #### Returns dict | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var.py#L42)update * ****update**(\*, is\_secret, name, value): dict - Update the Actor environment variable with specified fields. *** #### Parameters * ##### optionalkeyword-onlyis\_secret: bool | None = None Whether the environment variable is secret or not. * ##### keyword-onlyname: str The name of the environment variable. * ##### keyword-onlyvalue: str The value of the environment variable. #### Returns dict --- # ActorEnvVarClientAsync Async sub-client for manipulating a single Actor environment variable. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClientAsync.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClientAsync.md#get) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClientAsync.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var.py#L122)delete * **async **delete**(): None - Delete the Actor environment variable. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var.py#L85)get * **async **get**(): dict | None - Return information about the Actor environment variable. *** #### Returns dict | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var.py#L95)update * **async **update**(\*, is\_secret, name, value): dict - Update the Actor environment variable with specified fields. *** #### Parameters * ##### optionalkeyword-onlyis\_secret: bool | None = None Whether the environment variable is secret or not. * ##### keyword-onlyname: str The name of the environment variable. * ##### keyword-onlyvalue: str The value of the environment variable. 
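As a quick sketch of the `ActorCollectionClient` documented a little earlier in this section, the following lists the Actors owned by the account and creates a new one. The token and Actor names are placeholders.

```python
from apify_client import ApifyClient

client = ApifyClient('MY-APIFY-TOKEN')  # placeholder token

# List the ten most recently created Actors owned by this account.
my_actors = client.actors().list(my=True, limit=10, desc=True)
for actor in my_actors.items:
    print(actor['id'], actor['name'])

# Create a new, empty Actor; versions and source code can be added later.
new_actor = client.actors().create(name='my-new-actor', title='My new Actor')
print(new_actor['id'])
```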
#### Returns dict --- # ActorEnvVarCollectionClient Sub-client for manipulating actor env vars. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClient.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var_collection.py#L32)create * ****create**(\*, is\_secret, name, value): dict - Create a new actor environment variable. *** #### Parameters * ##### optionalkeyword-onlyis\_secret: bool | None = None Whether the environment variable is secret or not. * ##### keyword-onlyname: str The name of the environment variable. * ##### keyword-onlyvalue: str The value of the environment variable. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var_collection.py#L22)list * ****list**(): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available actor environment variables. *** #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ActorEnvVarCollectionClientAsync Async sub-client for manipulating actor env vars. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClientAsync.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var_collection.py#L78)create * **async **create**(\*, is\_secret, name, value): dict - Create a new actor environment variable. *** #### Parameters * ##### optionalkeyword-onlyis\_secret: bool | None = None Whether the environment variable is secret or not. * ##### keyword-onlyname: str The name of the environment variable. * ##### keyword-onlyvalue: str The value of the environment variable. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_env_var_collection.py#L68)list * **async **list**(): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available actor environment variables. *** #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ActorVersionClient Sub-client for manipulating a single Actor version. 
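The environment-variable clients above are reached through an Actor version, using the `env_vars` and `env_var` accessors whose methods are documented below. A minimal sketch with placeholder token, Actor ID, version number, and variable values:

```python
from apify_client import ApifyClient

client = ApifyClient('MY-APIFY-TOKEN')  # placeholder token
version_client = client.actor('username~my-actor').version('0.1')  # placeholder Actor ID and version

# Create a secret environment variable on this version.
version_client.env_vars().create(name='API_KEY', value='placeholder-secret', is_secret=True)

# List all environment variables of the version.
for env_var in version_client.env_vars().list().items:
    print(env_var['name'])

# Update, then delete, a single environment variable.
env_var_client = version_client.env_var('API_KEY')
env_var_client.update(name='API_KEY', value='new-placeholder-secret', is_secret=True)
env_var_client.delete()
```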
## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md#delete) * [**env\_var](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md#env_var) * [**env\_vars](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md#env_vars) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md#get) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClient.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L109)delete * ****delete**(): None - Delete the Actor version. *** #### Returns None ### [**](#env_var)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L120)env\_var * ****env\_var**(env\_var\_name): [ActorEnvVarClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClient.md) - Retrieve the client for the specified environment variable of this Actor version. *** #### Parameters * ##### env\_var\_name: str The name of the environment variable for which to retrieve the resource client. #### Returns [ActorEnvVarClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClient.md) ### [**](#env_vars)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L116)env\_vars * ****env\_vars**(): [ActorEnvVarCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClient.md) - Retrieve a client for the environment variables of this Actor version. *** #### Returns [ActorEnvVarCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClient.md) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L51)get * ****get**(): dict | None - Return information about the Actor version. *** #### Returns dict | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L61)update * ****update**(\*, build\_tag, env\_vars, apply\_env\_vars\_to\_build, source\_type, source\_files, git\_repo\_url, tarball\_url, github\_gist\_url): dict - Update the Actor version with specified fields. *** #### Parameters * ##### optionalkeyword-onlybuild\_tag: str | None = None Tag that is automatically set to the latest successful build of the current version. * ##### optionalkeyword-onlyenv\_vars: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Environment variables that will be available to the Actor run process, and optionally also to the build process. See the API docs for their exact structure. * ##### optionalkeyword-onlyapply\_env\_vars\_to\_build: bool | None = None Whether the environment variables specified for the Actor run will also be set to the Actor build process. * ##### optionalkeyword-onlysource\_type: ActorSourceType | None = None What source type is the Actor version using. 
* ##### optionalkeyword-onlysource\_files: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Source code comprised of multiple files, each an item of the array. Required when `source_type` is `ActorSourceType.SOURCE_FILES`. See the API docs for the exact structure. * ##### optionalkeyword-onlygit\_repo\_url: str | None = None The URL of a Git repository from which the source code will be cloned. Required when `source_type` is `ActorSourceType.GIT_REPO`. * ##### optionalkeyword-onlytarball\_url: str | None = None The URL of a tarball or a zip archive from which the source code will be downloaded. Required when `source_type` is `ActorSourceType.TARBALL`. * ##### optionalkeyword-onlygithub\_gist\_url: str | None = None The URL of a GitHub Gist from which the source will be downloaded. Required when `source_type` is `ActorSourceType.GITHUB_GIST`. #### Returns dict --- # ActorVersionClientAsync Async sub-client for manipulating a single Actor version. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md#delete) * [**env\_var](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md#env_var) * [**env\_vars](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md#env_vars) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md#get) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionClientAsync.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L198)delete * **async **delete**(): None - Delete the Actor version. *** #### Returns None ### [**](#env_var)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L209)env\_var * ****env\_var**(env\_var\_name): [ActorEnvVarClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClientAsync.md) - Retrieve the client for the specified environment variable of this Actor version. *** #### Parameters * ##### env\_var\_name: str The name of the environment variable for which to retrieve the resource client. #### Returns [ActorEnvVarClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarClientAsync.md) ### [**](#env_vars)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L205)env\_vars * ****env\_vars**(): [ActorEnvVarCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClientAsync.md) - Retrieve a client for the environment variables of this Actor version. *** #### Returns [ActorEnvVarCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorEnvVarCollectionClientAsync.md) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L140)get * **async **get**(): dict | None - Return information about the Actor version. 
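For the synchronous `ActorVersionClient` documented above, here is a short sketch that reads a version and changes its build tag. The token, Actor ID, and version number are placeholders.

```python
from apify_client import ApifyClient

client = ApifyClient('MY-APIFY-TOKEN')  # placeholder token
version_client = client.actor('username~my-actor').version('0.1')  # placeholder Actor ID and version

# Fetch the version details; None is returned if the version does not exist.
version = version_client.get()
if version is not None:
    print(version['versionNumber'], version.get('buildTag'))

# Point new builds of this version at a different build tag.
version_client.update(build_tag='beta')
```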
*** #### Returns dict | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version.py#L150)update * **async **update**(\*, build\_tag, env\_vars, apply\_env\_vars\_to\_build, source\_type, source\_files, git\_repo\_url, tarball\_url, github\_gist\_url): dict - Update the Actor version with specified fields. *** #### Parameters * ##### optionalkeyword-onlybuild\_tag: str | None = None Tag that is automatically set to the latest successful build of the current version. * ##### optionalkeyword-onlyenv\_vars: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Environment variables that will be available to the Actor run process, and optionally also to the build process. See the API docs for their exact structure. * ##### optionalkeyword-onlyapply\_env\_vars\_to\_build: bool | None = None Whether the environment variables specified for the Actor run will also be set to the Actor build process. * ##### optionalkeyword-onlysource\_type: ActorSourceType | None = None What source type is the Actor version using. * ##### optionalkeyword-onlysource\_files: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Source code comprised of multiple files, each an item of the array. Required when `source_type` is `ActorSourceType.SOURCE_FILES`. See the API docs for the exact structure. * ##### optionalkeyword-onlygit\_repo\_url: str | None = None The URL of a Git repository from which the source code will be cloned. Required when `source_type` is `ActorSourceType.GIT_REPO`. * ##### optionalkeyword-onlytarball\_url: str | None = None The URL of a tarball or a zip archive from which the source code will be downloaded. Required when `source_type` is `ActorSourceType.TARBALL`. * ##### optionalkeyword-onlygithub\_gist\_url: str | None = None The URL of a GitHub Gist from which the source will be downloaded. Required when `source_type` is `ActorSourceType.GITHUB_GIST`. #### Returns dict --- # ActorVersionCollectionClient Sub-client for manipulating Actor versions. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClient.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version_collection.py#L34)create * ****create**(\*, version\_number, build\_tag, env\_vars, apply\_env\_vars\_to\_build, source\_type, source\_files, git\_repo\_url, tarball\_url, github\_gist\_url): dict - Create a new Actor version. *** #### Parameters * ##### keyword-onlyversion\_number: str Major and minor version of the Actor (e.g. `1.0`). * ##### optionalkeyword-onlybuild\_tag: str | None = None Tag that is automatically set to the latest successful build of the current version. * ##### optionalkeyword-onlyenv\_vars: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Environment variables that will be available to the Actor run process, and optionally also to the build process. See the API docs for their exact structure. 
* ##### optionalkeyword-onlyapply\_env\_vars\_to\_build: bool | None = None Whether the environment variables specified for the Actor run will also be set to the Actor build process. * ##### keyword-onlysource\_type: ActorSourceType What source type is the Actor version using. * ##### optionalkeyword-onlysource\_files: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Source code comprised of multiple files, each an item of the array. Required when `source_type` is `ActorSourceType.SOURCE_FILES`. See the API docs for the exact structure. * ##### optionalkeyword-onlygit\_repo\_url: str | None = None The URL of a Git repository from which the source code will be cloned. Required when `source_type` is `ActorSourceType.GIT_REPO`. * ##### optionalkeyword-onlytarball\_url: str | None = None The URL of a tarball or a zip archive from which the source code will be downloaded. Required when `source_type` is `ActorSourceType.TARBALL`. * ##### optionalkeyword-onlygithub\_gist\_url: str | None = None The URL of a GitHub Gist from which the source will be downloaded. Required when `source_type` is `ActorSourceType.GITHUB_GIST`. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version_collection.py#L24)list * ****list**(): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available Actor versions. *** #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ActorVersionCollectionClientAsync Async sub-client for manipulating Actor versions. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClientAsync.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorVersionCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version_collection.py#L104)create * **async **create**(\*, version\_number, build\_tag, env\_vars, apply\_env\_vars\_to\_build, source\_type, source\_files, git\_repo\_url, tarball\_url, github\_gist\_url): dict - Create a new Actor version. *** #### Parameters * ##### keyword-onlyversion\_number: str Major and minor version of the Actor (e.g. `1.0`). * ##### optionalkeyword-onlybuild\_tag: str | None = None Tag that is automatically set to the latest successful build of the current version. * ##### optionalkeyword-onlyenv\_vars: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Environment variables that will be available to the Actor run process, and optionally also to the build process. See the API docs for their exact structure. * ##### optionalkeyword-onlyapply\_env\_vars\_to\_build: bool | None = None Whether the environment variables specified for the Actor run will also be set to the Actor build process. * ##### keyword-onlysource\_type: ActorSourceType What source type is the Actor version using. 
* ##### optionalkeyword-onlysource\_files: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Source code comprised of multiple files, each an item of the array. Required when `source_type` is `ActorSourceType.SOURCE_FILES`. See the API docs for the exact structure. * ##### optionalkeyword-onlygit\_repo\_url: str | None = None The URL of a Git repository from which the source code will be cloned. Required when `source_type` is `ActorSourceType.GIT_REPO`. * ##### optionalkeyword-onlytarball\_url: str | None = None The URL of a tarball or a zip archive from which the source code will be downloaded. Required when `source_type` is `ActorSourceType.TARBALL`. * ##### optionalkeyword-onlygithub\_gist\_url: str | None = None The URL of a GitHub Gist from which the source will be downloaded. Required when `source_type` is `ActorSourceType.GITHUB_GIST`. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/actor_version_collection.py#L94)list * **async **list**(): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available Actor versions. *** #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ApifyApiError Error specific to requests to the Apify API. An `ApifyApiError` is thrown for successful HTTP requests that reach the API, but the API responds with an error response. Typically, those are rate limit errors and internal errors, which are automatically retried, or validation errors, which are thrown immediately, because a correction by the user is needed. ### Hierarchy * [ApifyClientError](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientError.md) * *ApifyApiError* --- # ApifyClient The Apify API client. 
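Before the method reference, a minimal usage sketch (not an authoritative example): the API token and Actor ID below are placeholders, and the `get()` and `list()` calls on the returned sub-clients are assumed from their own reference pages.

```python
from apify_client import ApifyClient

# A minimal sketch; replace the placeholder token and Actor ID with real values.
client = ApifyClient(token="my-apify-token")

# actor() returns an ActorClient for a single Actor (see the reference below);
# its get() method is assumed to return the Actor's details, or None if not found.
actor_info = client.actor("my-actor-id").get()
print(actor_info["name"] if actor_info else "Actor not found")

# Collection sub-clients work analogously, e.g. actors() for the user's Actors.
actors_page = client.actors().list()
print(actors_page.total, "Actors in the account")
```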
### Hierarchy * [\_BaseApifyClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseApifyClient.md) * *ApifyClient* ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#__init__) * [**actor](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#actor) * [**actors](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#actors) * [**build](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#build) * [**builds](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#builds) * [**dataset](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#dataset) * [**datasets](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#datasets) * [**key\_value\_store](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#key_value_store) * [**key\_value\_stores](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#key_value_stores) * [**log](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#log) * [**request\_queue](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#request_queue) * [**request\_queues](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#request_queues) * [**run](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#run) * [**runs](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#runs) * [**schedule](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#schedule) * [**schedules](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#schedules) * [**store](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#store) * [**task](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#task) * [**tasks](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#tasks) * [**user](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#user) * [**webhook](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#webhook) * [**webhook\_dispatch](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#webhook_dispatch) * [**webhook\_dispatches](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#webhook_dispatches) * [**webhooks](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#webhooks) ### Properties * [**http\_client](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClient.md#http_client) ## Methods[**](#Methods) ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L104)\_\_init\_\_ * ****\_\_init\_\_**(token, \*, api\_url, max\_retries, min\_delay\_between\_retries\_millis, timeout\_secs): None - Initialize a new instance. *** #### Parameters * ##### optionaltoken: str | None = None The Apify API token. 
* ##### optionalkeyword-onlyapi\_url: str | None = None The URL of the Apify API server to connect to. If not provided, the official Apify API is used. * ##### optionalkeyword-onlymax\_retries: int | None = 8 How many times to retry a failed request at most. * ##### optionalkeyword-onlymin\_delay\_between\_retries\_millis: int | None = 500 How long the client waits between retries of a failed request (the delay increases exponentially from this value). * ##### optionalkeyword-onlytimeout\_secs: int | None = DEFAULT\_TIMEOUT The socket timeout of the HTTP requests sent to the Apify API. #### Returns None ### [**](#actor)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L140)actor * ****actor**(actor\_id): [ActorClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md) - Retrieve the sub-client for manipulating a single Actor. *** #### Parameters * ##### actor\_id: str ID of the Actor to be manipulated. #### Returns [ActorClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClient.md) ### [**](#actors)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L148)actors * ****actors**(): [ActorCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md) - Retrieve the sub-client for manipulating Actors. *** #### Returns [ActorCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md) ### [**](#build)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L152)build * ****build**(build\_id): [BuildClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md) - Retrieve the sub-client for manipulating a single Actor build. *** #### Parameters * ##### build\_id: str ID of the Actor build to be manipulated. #### Returns [BuildClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md) ### [**](#builds)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L160)builds * ****builds**(): [BuildCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClient.md) - Retrieve the sub-client for querying multiple builds of a user. *** #### Returns [BuildCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClient.md) ### [**](#dataset)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L176)dataset * ****dataset**(dataset\_id): [DatasetClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md) - Retrieve the sub-client for manipulating a single dataset. *** #### Parameters * ##### dataset\_id: str ID of the dataset to be manipulated. #### Returns [DatasetClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md) ### [**](#datasets)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L184)datasets * ****datasets**(): [DatasetCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClient.md) - Retrieve the sub-client for manipulating datasets.
*** #### Returns [DatasetCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClient.md) ### [**](#key_value_store)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L188)key\_value\_store * ****key\_value\_store**(key\_value\_store\_id): [KeyValueStoreClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md) - Retrieve the sub-client for manipulating a single key-value store. *** #### Parameters * ##### key\_value\_store\_id: str ID of the key-value store to be manipulated. #### Returns [KeyValueStoreClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md) ### [**](#key_value_stores)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L196)key\_value\_stores * ****key\_value\_stores**(): [KeyValueStoreCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClient.md) - Retrieve the sub-client for manipulating key-value stores. *** #### Returns [KeyValueStoreCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClient.md) ### [**](#log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L249)log * ****log**(build\_or\_run\_id): [LogClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md) - Retrieve the sub-client for retrieving logs. *** #### Parameters * ##### build\_or\_run\_id: str ID of the Actor build or run for which to access the log. #### Returns [LogClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md) ### [**](#request_queue)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L200)request\_queue * ****request\_queue**(request\_queue\_id, \*, client\_key): [RequestQueueClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md) - Retrieve the sub-client for manipulating a single request queue. *** #### Parameters * ##### request\_queue\_id: str ID of the request queue to be manipulated. * ##### optionalkeyword-onlyclient\_key: str | None = None A unique identifier of the client accessing the request queue. #### Returns [RequestQueueClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md) ### [**](#request_queues)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L209)request\_queues * ****request\_queues**(): [RequestQueueCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClient.md) - Retrieve the sub-client for manipulating request queues. *** #### Returns [RequestQueueCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClient.md) ### [**](#run)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L164)run * ****run**(run\_id): [RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) - Retrieve the sub-client for manipulating a single Actor run. *** #### Parameters * ##### run\_id: str ID of the Actor run to be manipulated. 
#### Returns [RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) ### [**](#runs)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L172)runs * ****runs**(): [RunCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md) - Retrieve the sub-client for querying multiple Actor runs of a user. *** #### Returns [RunCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md) ### [**](#schedule)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L237)schedule * ****schedule**(schedule\_id): [ScheduleClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClient.md) - Retrieve the sub-client for manipulating a single schedule. *** #### Parameters * ##### schedule\_id: str ID of the schedule to be manipulated. #### Returns [ScheduleClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClient.md) ### [**](#schedules)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L245)schedules * ****schedules**(): [ScheduleCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClient.md) - Retrieve the sub-client for manipulating schedules. *** #### Returns [ScheduleCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClient.md) ### [**](#store)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L277)store * ****store**(): [StoreCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClient.md) - Retrieve the sub-client for Apify store. *** #### Returns [StoreCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClient.md) ### [**](#task)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L257)task * ****task**(task\_id): [TaskClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md) - Retrieve the sub-client for manipulating a single task. *** #### Parameters * ##### task\_id: str ID of the task to be manipulated. #### Returns [TaskClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md) ### [**](#tasks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L265)tasks * ****tasks**(): [TaskCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClient.md) - Retrieve the sub-client for manipulating tasks. *** #### Returns [TaskCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClient.md) ### [**](#user)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L269)user * ****user**(user\_id): [UserClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClient.md) - Retrieve the sub-client for querying users. *** #### Parameters * ##### optionaluser\_id: str | None = None ID of user to be queried. If None, queries the user belonging to the token supplied to the client. 
#### Returns [UserClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClient.md) ### [**](#webhook)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L213)webhook * ****webhook**(webhook\_id): [WebhookClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md) - Retrieve the sub-client for manipulating a single webhook. *** #### Parameters * ##### webhook\_id: str ID of the webhook to be manipulated. #### Returns [WebhookClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md) ### [**](#webhook_dispatch)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L225)webhook\_dispatch * ****webhook\_dispatch**(webhook\_dispatch\_id): [WebhookDispatchClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClient.md) - Retrieve the sub-client for accessing a single webhook dispatch. *** #### Parameters * ##### webhook\_dispatch\_id: str ID of the webhook dispatch to access. #### Returns [WebhookDispatchClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClient.md) ### [**](#webhook_dispatches)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L233)webhook\_dispatches * ****webhook\_dispatches**(): [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClient.md) - Retrieve the sub-client for querying multiple webhook dispatches of a user. *** #### Returns [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClient.md) ### [**](#webhooks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L221)webhooks * ****webhooks**(): [WebhookCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md) - Retrieve the sub-client for querying multiple webhooks of a user. *** #### Returns [WebhookCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md) ## Properties[**](#Properties) ### [**](#http_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L102)http\_client **http\_client: [HTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md) | [HTTPClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md) Overrides [\_BaseApifyClient.http\_client](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseApifyClient.md#http_client) --- # ApifyClientAsync The asynchronous version of the Apify API client. 
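As with the synchronous client above, a minimal sketch follows; the token and Actor ID are placeholders, and the awaited `get()` call on the Actor sub-client is assumed from its own reference page.

```python
import asyncio

from apify_client import ApifyClientAsync

async def main() -> None:
    # A minimal sketch; the token and Actor ID are placeholders.
    client = ApifyClientAsync(token="my-apify-token")

    # The accessors mirror the synchronous client; resource methods are awaited.
    actor_info = await client.actor("my-actor-id").get()
    print(actor_info["name"] if actor_info else "Actor not found")

if __name__ == "__main__":
    asyncio.run(main())
```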
### Hierarchy * [\_BaseApifyClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseApifyClient.md) * *ApifyClientAsync* ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#__init__) * [**actor](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#actor) * [**actors](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#actors) * [**build](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#build) * [**builds](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#builds) * [**dataset](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#dataset) * [**datasets](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#datasets) * [**key\_value\_store](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#key_value_store) * [**key\_value\_stores](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#key_value_stores) * [**log](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#log) * [**request\_queue](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#request_queue) * [**request\_queues](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#request_queues) * [**run](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#run) * [**runs](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#runs) * [**schedule](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#schedule) * [**schedules](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#schedules) * [**store](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#store) * [**task](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#task) * [**tasks](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#tasks) * [**user](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#user) * [**webhook](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#webhook) * [**webhook\_dispatch](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#webhook_dispatch) * [**webhook\_dispatches](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#webhook_dispatches) * [**webhooks](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#webhooks) ### Properties * [**http\_client](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientAsync.md#http_client) ## Methods[**](#Methods) ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L287)\_\_init\_\_ * ****\_\_init\_\_**(token, \*, api\_url, max\_retries, min\_delay\_between\_retries\_millis, timeout\_secs): None - Initialize a new instance. 
*** #### Parameters * ##### optionaltoken: str | None = None The Apify API token. * ##### optionalkeyword-onlyapi\_url: str | None = None The URL of the Apify API server to connect to. If not provided, the official Apify API is used. * ##### optionalkeyword-onlymax\_retries: int | None = 8 How many times to retry a failed request at most. * ##### optionalkeyword-onlymin\_delay\_between\_retries\_millis: int | None = 500 How long the client waits between retries of a failed request (the delay increases exponentially from this value). * ##### optionalkeyword-onlytimeout\_secs: int | None = DEFAULT\_TIMEOUT The socket timeout of the HTTP requests sent to the Apify API. #### Returns None ### [**](#actor)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L323)actor * ****actor**(actor\_id): [ActorClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md) - Retrieve the sub-client for manipulating a single Actor. *** #### Parameters * ##### actor\_id: str ID of the Actor to be manipulated. #### Returns [ActorClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorClientAsync.md) ### [**](#actors)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L331)actors * ****actors**(): [ActorCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClientAsync.md) - Retrieve the sub-client for manipulating Actors. *** #### Returns [ActorCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClientAsync.md) ### [**](#build)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L335)build * ****build**(build\_id): [BuildClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md) - Retrieve the sub-client for manipulating a single Actor build. *** #### Parameters * ##### build\_id: str ID of the Actor build to be manipulated. #### Returns [BuildClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md) ### [**](#builds)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L343)builds * ****builds**(): [BuildCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClientAsync.md) - Retrieve the sub-client for querying multiple builds of a user. *** #### Returns [BuildCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClientAsync.md) ### [**](#dataset)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L359)dataset * ****dataset**(dataset\_id): [DatasetClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md) - Retrieve the sub-client for manipulating a single dataset. *** #### Parameters * ##### dataset\_id: str ID of the dataset to be manipulated. #### Returns [DatasetClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md) ### [**](#datasets)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L367)datasets * ****datasets**(): [DatasetCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClientAsync.md) - Retrieve the sub-client for manipulating datasets.
*** #### Returns [DatasetCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClientAsync.md) ### [**](#key_value_store)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L371)key\_value\_store * ****key\_value\_store**(key\_value\_store\_id): [KeyValueStoreClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md) - Retrieve the sub-client for manipulating a single key-value store. *** #### Parameters * ##### key\_value\_store\_id: str ID of the key-value store to be manipulated. #### Returns [KeyValueStoreClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md) ### [**](#key_value_stores)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L379)key\_value\_stores * ****key\_value\_stores**(): [KeyValueStoreCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClientAsync.md) - Retrieve the sub-client for manipulating key-value stores. *** #### Returns [KeyValueStoreCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClientAsync.md) ### [**](#log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L432)log * ****log**(build\_or\_run\_id): [LogClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md) - Retrieve the sub-client for retrieving logs. *** #### Parameters * ##### build\_or\_run\_id: str ID of the Actor build or run for which to access the log. #### Returns [LogClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md) ### [**](#request_queue)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L383)request\_queue * ****request\_queue**(request\_queue\_id, \*, client\_key): [RequestQueueClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md) - Retrieve the sub-client for manipulating a single request queue. *** #### Parameters * ##### request\_queue\_id: str ID of the request queue to be manipulated. * ##### optionalkeyword-onlyclient\_key: str | None = None A unique identifier of the client accessing the request queue. #### Returns [RequestQueueClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md) ### [**](#request_queues)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L392)request\_queues * ****request\_queues**(): [RequestQueueCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClientAsync.md) - Retrieve the sub-client for manipulating request queues. *** #### Returns [RequestQueueCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClientAsync.md) ### [**](#run)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L347)run * ****run**(run\_id): [RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) - Retrieve the sub-client for manipulating a single Actor run. *** #### Parameters * ##### run\_id: str ID of the Actor run to be manipulated. 
#### Returns [RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) ### [**](#runs)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L355)runs * ****runs**(): [RunCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md) - Retrieve the sub-client for querying multiple Actor runs of a user. *** #### Returns [RunCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md) ### [**](#schedule)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L420)schedule * ****schedule**(schedule\_id): [ScheduleClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClientAsync.md) - Retrieve the sub-client for manipulating a single schedule. *** #### Parameters * ##### schedule\_id: str ID of the schedule to be manipulated. #### Returns [ScheduleClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClientAsync.md) ### [**](#schedules)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L428)schedules * ****schedules**(): [ScheduleCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClientAsync.md) - Retrieve the sub-client for manipulating schedules. *** #### Returns [ScheduleCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClientAsync.md) ### [**](#store)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L460)store * ****store**(): [StoreCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClientAsync.md) - Retrieve the sub-client for Apify store. *** #### Returns [StoreCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClientAsync.md) ### [**](#task)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L440)task * ****task**(task\_id): [TaskClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md) - Retrieve the sub-client for manipulating a single task. *** #### Parameters * ##### task\_id: str ID of the task to be manipulated. #### Returns [TaskClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md) ### [**](#tasks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L448)tasks * ****tasks**(): [TaskCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClientAsync.md) - Retrieve the sub-client for manipulating tasks. *** #### Returns [TaskCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClientAsync.md) ### [**](#user)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L452)user * ****user**(user\_id): [UserClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClientAsync.md) - Retrieve the sub-client for querying users. *** #### Parameters * ##### optionaluser\_id: str | None = None ID of user to be queried. 
If None, queries the user belonging to the token supplied to the client. #### Returns [UserClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClientAsync.md) ### [**](#webhook)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L396)webhook * ****webhook**(webhook\_id): [WebhookClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md) - Retrieve the sub-client for manipulating a single webhook. *** #### Parameters * ##### webhook\_id: str ID of the webhook to be manipulated. #### Returns [WebhookClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md) ### [**](#webhook_dispatch)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L408)webhook\_dispatch * ****webhook\_dispatch**(webhook\_dispatch\_id): [WebhookDispatchClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClientAsync.md) - Retrieve the sub-client for accessing a single webhook dispatch. *** #### Parameters * ##### webhook\_dispatch\_id: str ID of the webhook dispatch to access. #### Returns [WebhookDispatchClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClientAsync.md) ### [**](#webhook_dispatches)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L416)webhook\_dispatches * ****webhook\_dispatches**(): [WebhookDispatchCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClientAsync.md) - Retrieve the sub-client for querying multiple webhook dispatches of a user. *** #### Returns [WebhookDispatchCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClientAsync.md) ### [**](#webhooks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L404)webhooks * ****webhooks**(): [WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md) - Retrieve the sub-client for querying multiple webhooks of a user. *** #### Returns [WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md) ## Properties[**](#Properties) ### [**](#http_client)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/client.py#L285)http\_client **http\_client: [HTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md) | [HTTPClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md) Overrides [\_BaseApifyClient.http\_client](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseApifyClient.md#http_client) --- # ApifyClientError Base class for errors specific to the Apify API Client. ### Hierarchy * *ApifyClientError* * [ApifyApiError](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyApiError.md) * [InvalidResponseBodyError](https://docs.apify.com/api/client/python/api/client/python/reference/class/InvalidResponseBodyError.md) --- # BatchAddRequestsResult Result of the batch add requests operation. 
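A hedged sketch of where this result typically appears: the request queue client's batch add-requests operation. The token, queue ID, request payloads, and the exact shape of the returned value are assumptions for illustration; see the request queue reference for the authoritative signature.

```python
from apify_client import ApifyClient

# Assumptions for illustration only: a valid token, an existing request queue,
# and a batch_add_requests() method on the request queue client that returns
# a result with the processedRequests / unprocessedRequests fields documented below.
client = ApifyClient(token="my-apify-token")
queue_client = client.request_queue("my-queue-id")

result = queue_client.batch_add_requests(
    [{"url": "https://example.com", "uniqueKey": "https://example.com"}]
)

print(len(result["processedRequests"]), "requests added")
print(len(result["unprocessedRequests"]), "requests not added")
```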
## Index[**](#Index) ### Properties * [**processedRequests](https://docs.apify.com/api/client/python/api/client/python/reference/class/BatchAddRequestsResult.md#processedRequests) * [**unprocessedRequests](https://docs.apify.com/api/client/python/api/client/python/reference/class/BatchAddRequestsResult.md#unprocessedRequests) ## Properties[**](#Properties) ### [**](#processedRequests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L40)processedRequests **processedRequests: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] List of successfully added requests. ### [**](#unprocessedRequests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L41)unprocessedRequests **unprocessedRequests: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] List of requests that failed to be added. --- # BuildClient Sub-client for manipulating a single Actor build. ## Index[**](#Index) ### Methods * [**abort](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md#abort) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md#get) * [**get\_open\_api\_definition](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md#get_open_api_definition) * [**log](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md#log) * [**wait\_for\_finish](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClient.md#wait_for_finish) ## Methods[**](#Methods) ### [**](#abort)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L36)abort * ****abort**(): dict - Abort the Actor build which is starting or currently running and return its details. *** #### Returns dict ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L29)delete * ****delete**(): None - Delete the build. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L19)get * ****get**(): dict | None - Return information about the Actor build. *** #### Returns dict | None ### [**](#get_open_api_definition)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L46)get\_open\_api\_definition * ****get\_open\_api\_definition**(): dict | None - Return OpenAPI definition of the Actor's build. *** #### Returns dict | None ### [**](#log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L75)log * ****log**(): [LogClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md) - Get the client for the log of the Actor build. 
*** #### Returns [LogClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md) ### [**](#wait_for_finish)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L63)wait\_for\_finish * ****wait\_for\_finish**(\*, wait\_secs): dict | None - Wait synchronously until the build finishes or the server times out. *** #### Parameters * ##### optionalkeyword-onlywait\_secs: int | None = None How long the client waits for the build to finish. None for indefinite. #### Returns dict | None --- # BuildClientAsync Async sub-client for manipulating a single Actor build. ## Index[**](#Index) ### Methods * [**abort](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md#abort) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md#get) * [**get\_open\_api\_definition](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md#get_open_api_definition) * [**log](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md#log) * [**wait\_for\_finish](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildClientAsync.md#wait_for_finish) ## Methods[**](#Methods) ### [**](#abort)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L106)abort * **async **abort**(): dict - Abort the Actor build which is starting or currently running and return its details. *** #### Returns dict ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L116)delete * **async **delete**(): None - Delete the build. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L96)get * **async **get**(): dict | None - Return information about the Actor build. *** #### Returns dict | None ### [**](#get_open_api_definition)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L123)get\_open\_api\_definition * **async **get\_open\_api\_definition**(): dict | None - Return OpenAPI definition of the Actor's build. *** #### Returns dict | None ### [**](#log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L152)log * ****log**(): [LogClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md) - Get the client for the log of the Actor build. *** #### Returns [LogClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md) ### [**](#wait_for_finish)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build.py#L140)wait\_for\_finish * **async **wait\_for\_finish**(\*, wait\_secs): dict | None - Wait synchronously until the build finishes or the server times out. *** #### Parameters * ##### optionalkeyword-onlywait\_secs: int | None = None How long the client waits for the build to finish. None for indefinite. #### Returns dict | None --- # BuildCollectionClient Sub-client for listing Actor builds.
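A minimal sketch of listing builds with the `list()` method documented below; the token is a placeholder, and the `items` attribute of the returned `ListPage` is assumed from its own reference page.

```python
from apify_client import ApifyClient

# A minimal sketch; the token is a placeholder.
client = ApifyClient(token="my-apify-token")

# List the user's ten most recent builds, newest first.
builds_page = client.builds().list(limit=10, desc=True)
for build in builds_page.items:
    print(build["id"], build.get("status"))
```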
## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build_collection.py#L21)list * ****list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List all Actor builds. List all Actor builds, either of a single Actor, or of all the user's Actors, depending on where this client was initialized from. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many builds to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None Which build to include as the first one when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the builds in descending order based on their start date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # BuildCollectionClientAsync Async sub-client for listing Actor builds. ## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/BuildCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/build_collection.py#L55)list * **async **list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List all Actor builds. List all Actor builds, either of a single Actor, or of all the user's Actors, depending on where this client was initialized from. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many builds to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None Which build to include as the first one when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the builds in descending order based on their start date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # DatasetClient Sub-client for manipulating a single dataset.
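A minimal sketch using two of the methods listed in the reference below (`push_items` and `iterate_items`); the token and dataset ID are placeholders.

```python
from apify_client import ApifyClient

# A minimal sketch; the token and dataset ID are placeholders.
client = ApifyClient(token="my-apify-token")
dataset_client = client.dataset("my-dataset-id")

# Push a couple of items into the dataset, then read them back.
dataset_client.push_items([{"foo": 1}, {"foo": 2}])
for item in dataset_client.iterate_items():
    print(item)
```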
## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#delete) * [**download\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#download_items) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#get) * [**get\_items\_as\_bytes](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#get_items_as_bytes) * [**get\_statistics](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#get_statistics) * [**iterate\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#iterate_items) * [**list\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#list_items) * [**push\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#push_items) * [**stream\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#stream_items) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L62)delete * ****delete**(): None - Delete the dataset. *** #### Returns None ### [**](#download_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L239)download\_items * ****download\_items**(\*, item\_format, offset, limit, desc, clean, bom, delimiter, fields, omit, unwind, skip\_empty, skip\_header\_row, skip\_hidden, xml\_root, xml\_row, flatten): bytes - Get the items in the dataset as raw bytes. Deprecated: this function is a deprecated alias of `get_items_as_bytes`. It will be removed in a future version. *** #### Parameters * ##### optionalkeyword-onlyitem\_format: str = 'json' Format of the results; possible values are: json, jsonl, csv, html, xlsx, xml and rss. The default value is json. * ##### optionalkeyword-onlyoffset: int | None = None Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for the skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, the result might contain fewer items than the limit value. * ##### optionalkeyword-onlybom: bool | None = None All text responses are encoded in UTF-8 encoding. By default, csv files are prefixed with the UTF-8 Byte Order Mark (BOM), while json, jsonl, xml, html and rss files are not. If you want to override this default behavior, specify the bom=True query parameter to include the BOM or bom=False to skip it. * ##### optionalkeyword-onlydelimiter: str | None = None A delimiter character for CSV files. The default delimiter is a simple comma (,).
* ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items; only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in the order in which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and will be merged with the parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain fewer items than the limit value. * ##### optionalkeyword-onlyskip\_header\_row: bool | None = None If True, then the header row in the csv format is skipped. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. * ##### optionalkeyword-onlyxml\_root: str | None = None Overrides the default root element name of the xml output. By default the root element is items. * ##### optionalkeyword-onlyxml\_row: str | None = None Overrides the default element name that wraps each page or page function result object in the xml output. By default the element name is item. * ##### optionalkeyword-onlyflatten: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields that should be flattened. #### Returns bytes ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L33)get * ****get**(): dict | None - Retrieve the dataset. *** #### Returns dict | None ### [**](#get_items_as_bytes)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L333)get\_items\_as\_bytes * ****get\_items\_as\_bytes**(\*, item\_format, offset, limit, desc, clean, bom, delimiter, fields, omit, unwind, skip\_empty, skip\_header\_row, skip\_hidden, xml\_root, xml\_row, flatten): bytes - Get the items in the dataset as raw bytes. *** #### Parameters * ##### optionalkeyword-onlyitem\_format: str = 'json' Format of the results; possible values are: json, jsonl, csv, html, xlsx, xml and rss. The default value is json. * ##### optionalkeyword-onlyoffset: int | None = None Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit.
* ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlybom: bool | None = None All text responses are encoded in UTF-8 encoding. By default, csv files are prefixed with the UTF-8 Byte Order Mark (BOM), while json, jsonl, xml, html and rss files are not. If you want to override this default behavior, specify bom=True query parameter to include the BOM or bom=False to skip it. * ##### optionalkeyword-onlydelimiter: str | None = None A delimiter character for CSV files. The default delimiter is a simple comma (,). * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain less items than the limit value. * ##### optionalkeyword-onlyskip\_header\_row: bool | None = None If True, then header row in the csv format is skipped. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. * ##### optionalkeyword-onlyxml\_root: str | None = None Overrides default root element name of xml output. By default the root element is items. * ##### optionalkeyword-onlyxml\_row: str | None = None Overrides default element name that wraps each page or page function result object in xml output. By default the element name is item. * ##### optionalkeyword-onlyflatten: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields that should be flattened. 
#### Returns bytes ### [**](#get_statistics)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L553)get\_statistics * ****get\_statistics**(): dict | None - Get the dataset statistics. *** #### Returns dict | None ### [**](#iterate_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L157)iterate\_items * ****iterate\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden): Iterator\[dict] - Iterate over the items in the dataset. *** #### Parameters * ##### optionalkeyword-onlyoffset: int = 0 Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain less items than the limit value. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. 
#### Returns Iterator\[dict] ### [**](#list_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L69)list\_items * ****list\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden, flatten, view): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md) - List the items of the dataset. *** #### Parameters * ##### optionalkeyword-onlyoffset: int | None = None Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain less items than the limit value. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. * ##### optionalkeyword-onlyflatten: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields that should be flattened. * ##### optionalkeyword-onlyview: str | None = None Name of the dataset view to be used. 
#### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md) ### [**](#push_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L526)push\_items * ****push\_items**(items): None - Push items to the dataset. *** #### Parameters * ##### items: JSONSerializable The items to push to the dataset. Either a stringified JSON, a dictionary, or a list of strings or dictionaries. #### Returns None ### [**](#stream_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L429)stream\_items * ****stream\_items**(\*, item\_format, offset, limit, desc, clean, bom, delimiter, fields, omit, unwind, skip\_empty, skip\_header\_row, skip\_hidden, xml\_root, xml\_row): Iterator\[httpx.Response] - Retrieve the items in the dataset as a stream. *** #### Parameters * ##### optionalkeyword-onlyitem\_format: str = 'json' Format of the results, possible values are: json, jsonl, csv, html, xlsx, xml and rss. The default value is json. * ##### optionalkeyword-onlyoffset: int | None = None Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlybom: bool | None = None All text responses are encoded in UTF-8 encoding. By default, csv files are prefixed with the UTF-8 Byte Order Mark (BOM), while json, jsonl, xml, html and rss files are not. If you want to override this default behavior, specify bom=True query parameter to include the BOM or bom=False to skip it. * ##### optionalkeyword-onlydelimiter: str | None = None A delimiter character for CSV files. The default delimiter is a simple comma (,). * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. 
If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain less items than the limit value. * ##### optionalkeyword-onlyskip\_header\_row: bool | None = None If True, then header row in the csv format is skipped. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. * ##### optionalkeyword-onlyxml\_root: str | None = None Overrides default root element name of xml output. By default the root element is items. * ##### optionalkeyword-onlyxml\_row: str | None = None Overrides default element name that wraps each page or page function result object in xml output. By default the element name is item. #### Returns Iterator\[httpx.Response] ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L43)update * ****update**(\*, name, general\_access): dict - Update the dataset with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The new name for the dataset. * ##### optionalkeyword-onlygeneral\_access: StorageGeneralAccess | None = None Determines how others can access the dataset. #### Returns dict --- # DatasetClientAsync Async sub-client for manipulating a single dataset. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#get) * [**get\_items\_as\_bytes](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#get_items_as_bytes) * [**get\_statistics](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#get_statistics) * [**iterate\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#iterate_items) * [**list\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#list_items) * [**push\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#push_items) * [**stream\_items](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#stream_items) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L612)delete * **async **delete**(): None - Delete the dataset. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L583)get * **async **get**(): dict | None - Retrieve the dataset. 
*** #### Returns dict | None ### [**](#get_items_as_bytes)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L790)get\_items\_as\_bytes * **async **get\_items\_as\_bytes**(\*, item\_format, offset, limit, desc, clean, bom, delimiter, fields, omit, unwind, skip\_empty, skip\_header\_row, skip\_hidden, xml\_root, xml\_row, flatten): bytes - Get the items in the dataset as raw bytes. *** #### Parameters * ##### optionalkeyword-onlyitem\_format: str = 'json' Format of the results, possible values are: json, jsonl, csv, html, xlsx, xml and rss. The default value is json. * ##### optionalkeyword-onlyoffset: int | None = None Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlybom: bool | None = None All text responses are encoded in UTF-8 encoding. By default, csv files are prefixed with the UTF-8 Byte Order Mark (BOM), while json, jsonl, xml, html and rss files are not. If you want to override this default behavior, specify bom=True query parameter to include the BOM or bom=False to skip it. * ##### optionalkeyword-onlydelimiter: str | None = None A delimiter character for CSV files. The default delimiter is a simple comma (,). * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. 
Note that if used, the results might contain less items than the limit value. * ##### optionalkeyword-onlyskip\_header\_row: bool | None = None If True, then header row in the csv format is skipped. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. * ##### optionalkeyword-onlyxml\_root: str | None = None Overrides default root element name of xml output. By default the root element is items. * ##### optionalkeyword-onlyxml\_row: str | None = None Overrides default element name that wraps each page or page function result object in xml output. By default the element name is item. * ##### optionalkeyword-onlyflatten: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields that should be flattened. #### Returns bytes ### [**](#get_statistics)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L1010)get\_statistics * **async **get\_statistics**(): dict | None - Get the dataset statistics. *** #### Returns dict | None ### [**](#iterate_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L707)iterate\_items * **async **iterate\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden): AsyncIterator\[dict] - Iterate over the items in the dataset. *** #### Parameters * ##### optionalkeyword-onlyoffset: int = 0 Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. 
If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain less items than the limit value. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. #### Returns AsyncIterator\[dict] ### [**](#list_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L619)list\_items * **async **list\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden, flatten, view): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md) - List the items of the dataset. *** #### Parameters * ##### optionalkeyword-onlyoffset: int | None = None Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain less items than the limit value. 
* ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. * ##### optionalkeyword-onlyflatten: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields that should be flattened. * ##### optionalkeyword-onlyview: str | None = None Name of the dataset view to be used. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md) ### [**](#push_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L983)push\_items * **async **push\_items**(items): None - Push items to the dataset. *** #### Parameters * ##### items: JSONSerializable The items to push to the dataset. Either a stringified JSON, a dictionary, or a list of strings or dictionaries. #### Returns None ### [**](#stream_items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L886)stream\_items * **async **stream\_items**(\*, item\_format, offset, limit, desc, clean, bom, delimiter, fields, omit, unwind, skip\_empty, skip\_header\_row, skip\_hidden, xml\_root, xml\_row): AsyncIterator\[httpx.Response] - Retrieve the items in the dataset as a stream. *** #### Parameters * ##### optionalkeyword-onlyitem\_format: str = 'json' Format of the results, possible values are: json, jsonl, csv, html, xlsx, xml and rss. The default value is json. * ##### optionalkeyword-onlyoffset: int | None = None Number of items that should be skipped at the start. The default value is 0. * ##### optionalkeyword-onlylimit: int | None = None Maximum number of items to return. By default there is no limit. * ##### optionalkeyword-onlydesc: bool | None = None By default, results are returned in the same order as they were stored. To reverse the order, set this parameter to True. * ##### optionalkeyword-onlyclean: bool | None = None If True, returns only non-empty items and skips hidden fields (i.e. fields starting with the # character). The clean parameter is just a shortcut for skip\_hidden=True and skip\_empty=True parameters. Note that since some objects might be skipped from the output, that the result might contain less items than the limit value. * ##### optionalkeyword-onlybom: bool | None = None All text responses are encoded in UTF-8 encoding. By default, csv files are prefixed with the UTF-8 Byte Order Mark (BOM), while json, jsonl, xml, html and rss files are not. If you want to override this default behavior, specify bom=True query parameter to include the BOM or bom=False to skip it. * ##### optionalkeyword-onlydelimiter: str | None = None A delimiter character for CSV files. The default delimiter is a simple comma (,). * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be picked from the items, only these fields will remain in the resulting record objects. Note that the fields in the outputted items are sorted the same way as they are specified in the fields parameter. You can use this feature to effectively fix the output format. 
* ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str] | None = None A list of fields which should be omitted from the items. * ##### optionalkeyword-onlyunwind: (str | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[str]) | None = None A list of fields which should be unwound, in order which they should be processed. Each field should be either an array or an object. If the field is an array then every element of the array will become a separate record and merged with parent object. If the unwound field is an object then it is merged with the parent object. If the unwound field is missing or its value is neither an array nor an object and therefore cannot be merged with a parent object, then the item gets preserved as it is. Note that the unwound items ignore the desc parameter. * ##### optionalkeyword-onlyskip\_empty: bool | None = None If True, then empty items are skipped from the output. Note that if used, the results might contain less items than the limit value. * ##### optionalkeyword-onlyskip\_header\_row: bool | None = None If True, then header row in the csv format is skipped. * ##### optionalkeyword-onlyskip\_hidden: bool | None = None If True, then hidden fields are skipped from the output, i.e. fields starting with the # character. * ##### optionalkeyword-onlyxml\_root: str | None = None Overrides default root element name of xml output. By default the root element is items. * ##### optionalkeyword-onlyxml\_row: str | None = None Overrides default element name that wraps each page or page function result object in xml output. By default the element name is item. #### Returns AsyncIterator\[httpx.Response] ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset.py#L593)update * **async **update**(\*, name, general\_access): dict - Update the dataset with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The new name for the dataset. * ##### optionalkeyword-onlygeneral\_access: StorageGeneralAccess | None = None Determines how others can access the dataset. #### Returns dict --- # DatasetCollectionClient Sub-client for manipulating datasets. ## Index[**](#Index) ### Methods * [**get\_or\_create](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClient.md#get_or_create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#get_or_create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset_collection.py#L44)get\_or\_create * ****get\_or\_create**(\*, name, schema): dict - Retrieve a named dataset, or create a new one when it doesn't exist. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the dataset to retrieve or create. * ##### optionalkeyword-onlyschema: dict | None = None The schema of the dataset. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset_collection.py#L21)list * ****list**(\*, unnamed, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available datasets. 
*** #### Parameters * ##### optionalkeyword-onlyunnamed: bool | None = None Whether to include unnamed datasets in the list. * ##### optionalkeyword-onlylimit: int | None = None How many datasets to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What dataset to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the datasets in descending order based on their modification date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # DatasetCollectionClientAsync Async sub-client for manipulating datasets. ## Index[**](#Index) ### Methods * [**get\_or\_create](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClientAsync.md#get_or_create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#get_or_create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset_collection.py#L90)get\_or\_create * **async **get\_or\_create**(\*, name, schema): dict - Retrieve a named dataset, or create a new one when it doesn't exist. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the dataset to retrieve or create. * ##### optionalkeyword-onlyschema: dict | None = None The schema of the dataset. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/dataset_collection.py#L67)list * **async **list**(\*, unnamed, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available datasets. *** #### Parameters * ##### optionalkeyword-onlyunnamed: bool | None = None Whether to include unnamed datasets in the list. * ##### optionalkeyword-onlylimit: int | None = None How many datasets to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What dataset to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the datasets in descending order based on their modification date. 
#### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # HTTPClient ### Hierarchy * [\_BaseHTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseHTTPClient.md) * *HTTPClient* ## Index[**](#Index) ### Methods * [**call](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClient.md#call) ## Methods[**](#Methods) ### [**](#call)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_http_client.py#L136)call * ****call**(\*, method, url, headers, params, data, json, stream, parse\_response, timeout\_secs): httpx.Response - #### Parameters * ##### keyword-onlymethod: str * ##### keyword-onlyurl: str * ##### optionalkeyword-onlyheaders: dict | None = None * ##### optionalkeyword-onlyparams: dict | None = None * ##### optionalkeyword-onlydata: Any = None * ##### optionalkeyword-onlyjson: JSONSerializable | None = None * ##### optionalkeyword-onlystream: bool | None = None * ##### optionalkeyword-onlyparse\_response: bool | None = True * ##### optionalkeyword-onlytimeout\_secs: int | None = None #### Returns httpx.Response --- # HTTPClientAsync ### Hierarchy * [\_BaseHTTPClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/_BaseHTTPClient.md) * *HTTPClientAsync* ## Index[**](#Index) ### Methods * [**call](https://docs.apify.com/api/client/python/api/client/python/reference/class/HTTPClientAsync.md#call) ## Methods[**](#Methods) ### [**](#call)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_http_client.py#L232)call * **async **call**(\*, method, url, headers, params, data, json, stream, parse\_response, timeout\_secs): httpx.Response - #### Parameters * ##### keyword-onlymethod: str * ##### keyword-onlyurl: str * ##### optionalkeyword-onlyheaders: dict | None = None * ##### optionalkeyword-onlyparams: dict | None = None * ##### optionalkeyword-onlydata: Any = None * ##### optionalkeyword-onlyjson: JSONSerializable | None = None * ##### optionalkeyword-onlystream: bool | None = None * ##### optionalkeyword-onlyparse\_response: bool | None = True * ##### optionalkeyword-onlytimeout\_secs: int | None = None #### Returns httpx.Response --- # InvalidResponseBodyError Error caused by the response body failing to be parsed. This error exists for the quite common situation, where only a partial JSON response is received and an attempt to parse the JSON throws an error. In most cases this can be resolved by retrying the request. We do that by identifying this error in the HTTPClient. ### Hierarchy * [ApifyClientError](https://docs.apify.com/api/client/python/api/client/python/reference/class/ApifyClientError.md) * *InvalidResponseBodyError* --- # KeyValueStoreClient Sub-client for manipulating a single key-value store. 
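For orientation, here is a minimal usage sketch of the sync key-value store client. The API token and store ID are placeholders you would replace with your own values; the methods used (`set_record`, `record_exists`, `get_record`) are documented below.

```python
from apify_client import ApifyClient

client = ApifyClient('<APIFY_API_TOKEN>')        # placeholder token
store = client.key_value_store('<STORE_ID>')     # placeholder store ID or name

# Write a JSON record, check that it exists, then read it back.
store.set_record('OUTPUT', {'status': 'done'}, content_type='application/json')
if store.record_exists('OUTPUT'):
    record = store.get_record('OUTPUT')          # dict with the key, value and content type
    print(record['value'])
```
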
## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#delete) * [**delete\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#delete_record) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#get) * [**get\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#get_record) * [**get\_record\_as\_bytes](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#get_record_as_bytes) * [**list\_keys](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#list_keys) * [**record\_exists](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#record_exists) * [**set\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#set_record) * [**stream\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#stream_record) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L60)delete * ****delete**(): None - Delete the key-value store. *** #### Returns None ### [**](#delete_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L275)delete\_record * ****delete\_record**(key): None - Delete the specified record from the key-value store. *** #### Parameters * ##### key: str The key of the record which to delete. #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L31)get * ****get**(): dict | None - Retrieve the key-value store. *** #### Returns dict | None ### [**](#get_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L104)get\_record * ****get\_record**(key, \*, as\_bytes, as\_file): dict | None - Retrieve the given record from the key-value store. *** #### Parameters * ##### key: str Key of the record to retrieve. * ##### optionalkeyword-onlyas\_bytes: bool = False Deprecated, use `get_record_as_bytes()` instead. Whether to retrieve the record as raw bytes, default False. * ##### optionalkeyword-onlyas\_file: bool = False Deprecated, use `stream_record()` instead. Whether to retrieve the record as a file-like object, default False. #### Returns dict | None ### [**](#get_record_as_bytes)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L183)get\_record\_as\_bytes * ****get\_record\_as\_bytes**(key): dict | None - Retrieve the given record from the key-value store, without parsing it. *** #### Parameters * ##### key: str Key of the record to retrieve. #### Returns dict | None ### [**](#list_keys)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L67)list\_keys * ****list\_keys**(\*, limit, exclusive\_start\_key, collection, prefix): dict - List the keys in the key-value store. 
*** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None Number of keys to be returned. Maximum value is 1000. * ##### optionalkeyword-onlyexclusive\_start\_key: str | None = None All keys up to this one (including) are skipped from the result. * ##### optionalkeyword-onlycollection: str | None = None The name of the collection in store schema to list keys from. * ##### optionalkeyword-onlyprefix: str | None = None The prefix of the keys to be listed. #### Returns dict ### [**](#record_exists)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L158)record\_exists * ****record\_exists**(key): bool - Check if given record is present in the key-value store. *** #### Parameters * ##### key: str Key of the record to check. #### Returns bool ### [**](#set_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L248)set\_record * ****set\_record**(key, value, content\_type): None - Set a value to the given record in the key-value store. *** #### Parameters * ##### key: str The key of the record to save the value to. * ##### value: Any The value to save into the record. * ##### optionalcontent\_type: str | None = None The content type of the saved value. #### Returns None ### [**](#stream_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L214)stream\_record * ****stream\_record**(key): Iterator\[dict | None] - Retrieve the given record from the key-value store, as a stream. *** #### Parameters * ##### key: str Key of the record to retrieve. #### Returns Iterator\[dict | None] ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L41)update * ****update**(\*, name, general\_access): dict - Update the key-value store with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The new name for key-value store. * ##### optionalkeyword-onlygeneral\_access: StorageGeneralAccess | None = None Determines how others can access the key-value store. #### Returns dict --- # KeyValueStoreClientAsync Async sub-client for manipulating a single key-value store. 
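The async client mirrors the sync one; every call is awaited. A minimal sketch, again with a placeholder token and store ID:

```python
import asyncio

from apify_client import ApifyClientAsync


async def main() -> None:
    client = ApifyClientAsync('<APIFY_API_TOKEN>')   # placeholder token
    store = client.key_value_store('<STORE_ID>')     # placeholder store ID or name

    # Same operations as the sync client, awaited.
    await store.set_record('INPUT', {'query': 'example'})
    record = await store.get_record('INPUT')
    print(record)


asyncio.run(main())
```
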
## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#delete) * [**delete\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#delete_record) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#get) * [**get\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#get_record) * [**get\_record\_as\_bytes](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#get_record_as_bytes) * [**list\_keys](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#list_keys) * [**record\_exists](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#record_exists) * [**set\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#set_record) * [**stream\_record](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#stream_record) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L328)delete * **async **delete**(): None - Delete the key-value store. *** #### Returns None ### [**](#delete_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L522)delete\_record * **async **delete\_record**(key): None - Delete the specified record from the key-value store. *** #### Parameters * ##### key: str The key of the record which to delete. #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L299)get * **async **get**(): dict | None - Retrieve the key-value store. *** #### Returns dict | None ### [**](#get_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L372)get\_record * **async **get\_record**(key): dict | None - Retrieve the given record from the key-value store. *** #### Parameters * ##### key: str Key of the record to retrieve. #### Returns dict | None ### [**](#get_record_as_bytes)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L430)get\_record\_as\_bytes * **async **get\_record\_as\_bytes**(key): dict | None - Retrieve the given record from the key-value store, without parsing it. *** #### Parameters * ##### key: str Key of the record to retrieve. #### Returns dict | None ### [**](#list_keys)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L335)list\_keys * **async **list\_keys**(\*, limit, exclusive\_start\_key, collection, prefix): dict - List the keys in the key-value store. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None Number of keys to be returned. Maximum value is 1000. * ##### optionalkeyword-onlyexclusive\_start\_key: str | None = None All keys up to this one (including) are skipped from the result. 
* ##### optionalkeyword-onlycollection: str | None = None The name of the collection in store schema to list keys from. * ##### optionalkeyword-onlyprefix: str | None = None The prefix of the keys to be listed. #### Returns dict ### [**](#record_exists)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L405)record\_exists * **async **record\_exists**(key): bool - Check if given record is present in the key-value store. *** #### Parameters * ##### key: str Key of the record to check. #### Returns bool ### [**](#set_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L495)set\_record * **async **set\_record**(key, value, content\_type): None - Set a value to the given record in the key-value store. *** #### Parameters * ##### key: str The key of the record to save the value to. * ##### value: Any The value to save into the record. * ##### optionalcontent\_type: str | None = None The content type of the saved value. #### Returns None ### [**](#stream_record)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L461)stream\_record * **async **stream\_record**(key): AsyncIterator\[dict | None] - Retrieve the given record from the key-value store, as a stream. *** #### Parameters * ##### key: str Key of the record to retrieve. #### Returns AsyncIterator\[dict | None] ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store.py#L309)update * **async **update**(\*, name, general\_access): dict - Update the key-value store with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The new name for key-value store. * ##### optionalkeyword-onlygeneral\_access: StorageGeneralAccess | None = None Determines how others can access the key-value store. #### Returns dict --- # KeyValueStoreCollectionClient Sub-client for manipulating key-value stores. ## Index[**](#Index) ### Methods * [**get\_or\_create](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClient.md#get_or_create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#get_or_create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store_collection.py#L44)get\_or\_create * ****get\_or\_create**(\*, name, schema): dict - Retrieve a named key-value store, or create a new one when it doesn't exist. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the key-value store to retrieve or create. * ##### optionalkeyword-onlyschema: dict | None = None The schema of the key-value store. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store_collection.py#L21)list * ****list**(\*, unnamed, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available key-value stores. *** #### Parameters * ##### optionalkeyword-onlyunnamed: bool | None = None Whether to include unnamed key-value stores in the list. * ##### optionalkeyword-onlylimit: int | None = None How many key-value stores to retrieve. 
* ##### optionalkeyword-onlyoffset: int | None = None What key-value store to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the key-value stores in descending order based on their modification date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # KeyValueStoreCollectionClientAsync Async sub-client for manipulating key-value stores. ## Index[**](#Index) ### Methods * [**get\_or\_create](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClientAsync.md#get_or_create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#get_or_create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store_collection.py#L95)get\_or\_create * **async **get\_or\_create**(\*, name, schema): dict - Retrieve a named key-value store, or create a new one when it doesn't exist. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the key-value store to retrieve or create. * ##### optionalkeyword-onlyschema: dict | None = None The schema of the key-value store. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/key_value_store_collection.py#L72)list * **async **list**(\*, unnamed, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available key-value stores. *** #### Parameters * ##### optionalkeyword-onlyunnamed: bool | None = None Whether to include unnamed key-value stores in the list. * ##### optionalkeyword-onlylimit: int | None = None How many key-value stores to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What key-value store to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the key-value stores in descending order based on their modification date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ListPage A single page of items returned from a list() method. 
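Because the various `list()` and `list_items()` methods above return this `ListPage` shape, its `items`, `count` and `total` fields can drive simple pagination. A sketch using a dataset with a placeholder ID:

```python
from apify_client import ApifyClient

client = ApifyClient('<APIFY_API_TOKEN>')    # placeholder token
dataset = client.dataset('<DATASET_ID>')     # placeholder dataset ID

# Walk the dataset page by page using the ListPage metadata.
offset = 0
while True:
    page = dataset.list_items(offset=offset, limit=1000)
    for item in page.items:
        print(item)                           # each item is a dict
    offset += page.count
    if offset >= page.total:
        break
```
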
## Index[**](#Index) ### Properties * [**count](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md#count) * [**desc](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md#desc) * [**items](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md#items) * [**limit](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md#limit) * [**offset](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md#offset) * [**total](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md#total) ## Properties[**](#Properties) ### [**](#count)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/resource_collection_client.py#L19)count **count: int Count of the returned objects on this page ### [**](#desc)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/resource_collection_client.py#L31)desc **desc: bool Whether the listing is descending or not ### [**](#items)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/resource_collection_client.py#L16)items **items: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[[T](https://docs.apify.com/api/client/python/api/client/python/reference.md#T)] List of returned objects on this page ### [**](#limit)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/resource_collection_client.py#L25)limit **limit: int The limit on the number of returned objects specified in the API call ### [**](#offset)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/resource_collection_client.py#L22)offset **offset: int The offset of the first returned object specified in the API call ### [**](#total)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/base/resource_collection_client.py#L28)total **total: int Total number of objects matching the API call criteria --- # LogClient Sub-client for manipulating logs. ## Index[**](#Index) ### Methods * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md#get) * [**get\_as\_bytes](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md#get_as_bytes) * [**stream](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md#stream) ## Methods[**](#Methods) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L38)get * ****get**(\*, raw): str | None - Retrieve the log as text. *** #### Parameters * ##### optionalkeyword-onlyraw: bool = False If True, the log will include formatting. For example, coloring character sequences. #### Returns str | None ### [**](#get_as_bytes)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L63)get\_as\_bytes * ****get\_as\_bytes**(\*, raw): bytes | None - Retrieve the log as raw bytes. *** #### Parameters * ##### optionalkeyword-onlyraw: bool = False If True, the log will include formatting. For example, coloring character sequences. 
#### Returns bytes | None ### [**](#stream)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L90)stream * ****stream**(\*, raw): Iterator\[httpx.Response | None] - Retrieve the log as a stream. *** #### Parameters * ##### optionalkeyword-onlyraw: bool = False If true, the log will include formatting. For example, coloring character sequences. #### Returns Iterator\[httpx.Response | None] --- # LogClientAsync Async sub-client for manipulating logs. ## Index[**](#Index) ### Methods * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md#get) * [**get\_as\_bytes](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md#get_as_bytes) * [**stream](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md#stream) ## Methods[**](#Methods) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L128)get * **async **get**(\*, raw): str | None - Retrieve the log as text. *** #### Parameters * ##### optionalkeyword-onlyraw: bool = False If true, the log will include formatting. For example, coloring character sequences. #### Returns str | None ### [**](#get_as_bytes)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L153)get\_as\_bytes * **async **get\_as\_bytes**(\*, raw): bytes | None - Retrieve the log as raw bytes. *** #### Parameters * ##### optionalkeyword-onlyraw: bool = False If true, the log will include formatting. For example, coloring character sequences. #### Returns bytes | None ### [**](#stream)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L180)stream * **async **stream**(\*, raw): AsyncIterator\[httpx.Response | None] - Retrieve the log as a stream. *** #### Parameters * ##### optionalkeyword-onlyraw: bool = False If true, the log will include formatting. For example, coloring character sequences.
#### Returns AsyncIterator\[httpx.Response | None] --- # LogContext ## Index[**](#Index) ### Properties * [**attempt](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogContext.md#attempt) * [**client\_method](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogContext.md#client_method) * [**method](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogContext.md#method) * [**resource\_id](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogContext.md#resource_id) * [**url](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogContext.md#url) ## Properties[**](#Properties) ### [**](#attempt)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L27)attempt **attempt: ContextVar\[int | None] ### [**](#client_method)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L28)client\_method **client\_method: ContextVar\[str | None] ### [**](#method)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L29)method **method: ContextVar\[str | None] ### [**](#resource_id)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L30)resource\_id **resource\_id: ContextVar\[str | None] ### [**](#url)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L31)url **url: ContextVar\[str | None] --- # RedirectLogFormatter Formatter applied to the default redirect logger. ## Index[**](#Index) ### Methods * [**format](https://docs.apify.com/api/client/python/api/client/python/reference/class/RedirectLogFormatter.md#format) ## Methods[**](#Methods) ### [**](#format)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L160)format * ****format**(record): str - Format the log by prepending the logger name to the original message. *** #### Parameters * ##### record: logging.LogRecord Log record to be formatted. #### Returns str --- # RequestQueueClient Sub-client for manipulating a single request queue.
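A minimal usage sketch for the sync request queue client follows; the token, queue name, and URL are placeholders, and error handling is omitted.

```python
from apify_client import ApifyClient

client = ApifyClient(token='MY-APIFY-TOKEN')  # placeholder token

# Open (or create) a named request queue and get its resource client.
queue_info = client.request_queues().get_or_create(name='my-queue')  # placeholder name
queue = client.request_queue(queue_info['id'])

# Enqueue a request; uniqueKey is used for deduplication.
queue.add_request({'url': 'https://example.com', 'uniqueKey': 'https://example.com'})

# Fetch and lock up to 10 requests from the head of the queue for 60 seconds.
head = queue.list_and_lock_head(lock_secs=60, limit=10)
for request in head['items']:
    # ... process the request here ...
    queue.delete_request(request['id'])  # remove it from the queue once processed
```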
## Index[**](#Index) ### Methods * [**add\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#add_request) * [**batch\_add\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#batch_add_requests) * [**batch\_delete\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#batch_delete_requests) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#delete) * [**delete\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#delete_request) * [**delete\_request\_lock](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#delete_request_lock) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#get) * [**get\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#get_request) * [**list\_and\_lock\_head](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#list_and_lock_head) * [**list\_head](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#list_head) * [**list\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#list_requests) * [**prolong\_request\_lock](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#prolong_request_lock) * [**unlock\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#unlock_requests) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#update) * [**update\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md#update_request) ## Methods[**](#Methods) ### [**](#add_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L144)add\_request * ****add\_request**(request, \*, forefront): dict - Add a request to the queue. *** #### Parameters * ##### request: dict The request to add to the queue. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to add the request to the head or the end of the queue. #### Returns dict ### [**](#batch_add_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L283)batch\_add\_requests * ****batch\_add\_requests**(requests, \*, forefront, max\_parallel, max\_unprocessed\_requests\_retries, min\_delay\_between\_unprocessed\_requests\_retries): [BatchAddRequestsResult](https://docs.apify.com/api/client/python/api/client/python/reference/class/BatchAddRequestsResult.md) - Add requests to the request queue in batches. Requests are split into batches based on size and processed in parallel. *** #### Parameters * ##### requests: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] List of requests to be added to the queue. * ##### optionalkeyword-onlyforefront: bool = False Whether to add requests to the front of the queue. * ##### optionalkeyword-onlymax\_parallel: int = 1 Specifies the maximum number of parallel tasks for API calls. 
This is only applicable to the async client. For the sync client, this value must be set to 1, as parallel execution is not supported. * ##### optionalkeyword-onlymax\_unprocessed\_requests\_retries: int | None = None Deprecated argument. Will be removed in next major release. * ##### optionalkeyword-onlymin\_delay\_between\_unprocessed\_requests\_retries: timedelta | None = None Deprecated argument. Will be removed in next major release. #### Returns [BatchAddRequestsResult](https://docs.apify.com/api/client/python/api/client/python/reference/class/BatchAddRequestsResult.md) ### [**](#batch_delete_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L361)batch\_delete\_requests * ****batch\_delete\_requests**(requests): dict - Delete given requests from the queue. *** #### Parameters * ##### requests: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] List of the requests to delete. #### Returns dict ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L92)delete * ****delete**(): None - Delete the request queue. *** #### Returns None ### [**](#delete_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L219)delete\_request * ****delete\_request**(request\_id): None - Delete a request from the queue. *** #### Parameters * ##### request\_id: str ID of the request to delete. #### Returns None ### [**](#delete_request_lock)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L265)delete\_request\_lock * ****delete\_request\_lock**(request\_id, \*, forefront): None - Delete the lock on a request. *** #### Parameters * ##### request\_id: str ID of the request to delete the lock. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to put the request in the beginning or the end of the queue after the lock is deleted. #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L63)get * ****get**(): dict | None - Retrieve the request queue. *** #### Returns dict | None ### [**](#get_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L168)get\_request * ****get\_request**(request\_id): dict | None - Retrieve a request from the queue. *** #### Parameters * ##### request\_id: str ID of the request to retrieve. #### Returns dict | None ### [**](#list_and_lock_head)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L121)list\_and\_lock\_head * ****list\_and\_lock\_head**(\*, lock\_secs, limit): dict - Retrieve a given number of unlocked requests from the beginning of the queue and lock them for a given time. *** #### Parameters * ##### keyword-onlylock\_secs: int How long the requests will be locked for, in seconds. * ##### optionalkeyword-onlylimit: int | None = None How many requests to retrieve. #### Returns dict ### [**](#list_head)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L99)list\_head * ****list\_head**(\*, limit): dict - Retrieve a given number of requests from the beginning of the queue. 
*** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many requests to retrieve. #### Returns dict ### [**](#list_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L381)list\_requests * ****list\_requests**(\*, limit, exclusive\_start\_id): dict - List requests in the queue. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many requests to retrieve. * ##### optionalkeyword-onlyexclusive\_start\_id: str | None = None All requests up to this one (inclusive) are skipped from the result. #### Returns dict ### [**](#prolong_request_lock)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L238)prolong\_request\_lock * ****prolong\_request\_lock**(request\_id, \*, forefront, lock\_secs): dict - Prolong the lock on a request. *** #### Parameters * ##### request\_id: str ID of the request to prolong the lock. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to put the request in the beginning or the end of the queue after lock expires. * ##### keyword-onlylock\_secs: int By how much to prolong the lock, in seconds. #### Returns dict ### [**](#unlock_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L406)unlock\_requests * ****unlock\_requests**(): dict - Unlock all requests in the queue that were locked by the same clientKey or from the same Actor run. *** #### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L73)update * ****update**(\*, name, general\_access): dict - Update the request queue with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The new name for the request queue. * ##### optionalkeyword-onlygeneral\_access: StorageGeneralAccess | None = None Determines how others can access the request queue. #### Returns dict ### [**](#update_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L193)update\_request * ****update\_request**(request, \*, forefront): dict - Update a request in the queue. *** #### Parameters * ##### request: dict The updated request. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to put the updated request in the beginning or the end of the queue. #### Returns dict --- # RequestQueueClientAsync Async sub-client for manipulating a single request queue.
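As a sketch of the async variant, the snippet below adds requests in parallel batches via `batch_add_requests`; the token, queue name, and URLs are placeholders.

```python
import asyncio

from apify_client import ApifyClientAsync


async def main() -> None:
    client = ApifyClientAsync(token='MY-APIFY-TOKEN')  # placeholder token
    queue_info = await client.request_queues().get_or_create(name='my-queue')  # placeholder name
    queue = client.request_queue(queue_info['id'])

    # Requests are split into batches by size and sent in up to 5 parallel tasks.
    requests = [
        {'url': f'https://example.com/page/{i}', 'uniqueKey': f'page-{i}'}
        for i in range(250)
    ]
    result = await queue.batch_add_requests(requests, max_parallel=5)
    # The result distinguishes processed from unprocessed requests (see BatchAddRequestsResult).
    print(result)


asyncio.run(main())
```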
## Index[**](#Index) ### Methods * [**add\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#add_request) * [**batch\_add\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#batch_add_requests) * [**batch\_delete\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#batch_delete_requests) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#delete) * [**delete\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#delete_request) * [**delete\_request\_lock](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#delete_request_lock) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#get) * [**get\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#get_request) * [**list\_and\_lock\_head](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#list_and_lock_head) * [**list\_head](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#list_head) * [**list\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#list_requests) * [**prolong\_request\_lock](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#prolong_request_lock) * [**unlock\_requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#unlock_requests) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#update) * [**update\_request](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md#update_request) ## Methods[**](#Methods) ### [**](#add_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L525)add\_request * **async **add\_request**(request, \*, forefront): dict - Add a request to the queue. *** #### Parameters * ##### request: dict The request to add to the queue. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to add the request to the head or the end of the queue. #### Returns dict ### [**](#batch_add_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L711)batch\_add\_requests * **async **batch\_add\_requests**(requests, \*, forefront, max\_parallel, max\_unprocessed\_requests\_retries, min\_delay\_between\_unprocessed\_requests\_retries): [BatchAddRequestsResult](https://docs.apify.com/api/client/python/api/client/python/reference/class/BatchAddRequestsResult.md) - Add requests to the request queue in batches. Requests are split into batches based on size and processed in parallel. *** #### Parameters * ##### requests: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] List of requests to be added to the queue. * ##### optionalkeyword-onlyforefront: bool = False Whether to add requests to the front of the queue. 
* ##### optionalkeyword-onlymax\_parallel: int = 5 Specifies the maximum number of parallel tasks for API calls. This is only applicable to the async client. For the sync client, this value must be set to 1, as parallel execution is not supported. * ##### optionalkeyword-onlymax\_unprocessed\_requests\_retries: int | None = None Deprecated argument. Will be removed in next major release. * ##### optionalkeyword-onlymin\_delay\_between\_unprocessed\_requests\_retries: timedelta | None = None Deprecated argument. Will be removed in next major release. #### Returns [BatchAddRequestsResult](https://docs.apify.com/api/client/python/api/client/python/reference/class/BatchAddRequestsResult.md) ### [**](#batch_delete_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L791)batch\_delete\_requests * **async **batch\_delete\_requests**(requests): dict - Delete given requests from the queue. *** #### Parameters * ##### requests: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] List of the requests to delete. #### Returns dict ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L473)delete * **async **delete**(): None - Delete the request queue. *** #### Returns None ### [**](#delete_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L600)delete\_request * **async **delete\_request**(request\_id): None - Delete a request from the queue. *** #### Parameters * ##### request\_id: str ID of the request to delete. #### Returns None ### [**](#delete_request_lock)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L644)delete\_request\_lock * **async **delete\_request\_lock**(request\_id, \*, forefront): None - Delete the lock on a request. *** #### Parameters * ##### request\_id: str ID of the request to delete the lock. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to put the request in the beginning or the end of the queue after the lock is deleted. #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L444)get * **async **get**(): dict | None - Retrieve the request queue. *** #### Returns dict | None ### [**](#get_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L549)get\_request * **async **get\_request**(request\_id): dict | None - Retrieve a request from the queue. *** #### Parameters * ##### request\_id: str ID of the request to retrieve. #### Returns dict | None ### [**](#list_and_lock_head)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L502)list\_and\_lock\_head * **async **list\_and\_lock\_head**(\*, lock\_secs, limit): dict - Retrieve a given number of unlocked requests from the beginning of the queue and lock them for a given time. *** #### Parameters * ##### keyword-onlylock\_secs: int How long the requests will be locked for, in seconds. * ##### optionalkeyword-onlylimit: int | None = None How many requests to retrieve. 
#### Returns dict ### [**](#list_head)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L480)list\_head * **async **list\_head**(\*, limit): dict - Retrieve a given number of requests from the beginning of the queue. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many requests to retrieve. #### Returns dict ### [**](#list_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L810)list\_requests * **async **list\_requests**(\*, limit, exclusive\_start\_id): dict - List requests in the queue. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many requests to retrieve. * ##### optionalkeyword-onlyexclusive\_start\_id: str | None = None All requests up to this one (inclusive) are skipped from the result. #### Returns dict ### [**](#prolong_request_lock)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L617)prolong\_request\_lock * **async **prolong\_request\_lock**(request\_id, \*, forefront, lock\_secs): dict - Prolong the lock on a request. *** #### Parameters * ##### request\_id: str ID of the request to prolong the lock. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to put the request in the beginning or the end of the queue after lock expires. * ##### keyword-onlylock\_secs: int By how much to prolong the lock, in seconds. #### Returns dict ### [**](#unlock_requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L835)unlock\_requests * **async **unlock\_requests**(): dict - Unlock all requests in the queue that were locked by the same clientKey or from the same Actor run. *** #### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L454)update * **async **update**(\*, name, general\_access): dict - Update the request queue with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The new name for the request queue. * ##### optionalkeyword-onlygeneral\_access: StorageGeneralAccess | None = None Determines how others can access the request queue. #### Returns dict ### [**](#update_request)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue.py#L574)update\_request * **async **update\_request**(request, \*, forefront): dict - Update a request in the queue. *** #### Parameters * ##### request: dict The updated request. * ##### optionalkeyword-onlyforefront: bool | None = None Whether to put the updated request in the beginning or the end of the queue. #### Returns dict --- # RequestQueueCollectionClient Sub-client for manipulating request queues.
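A short sketch of listing request queues with the sync collection client; the token is a placeholder.

```python
from apify_client import ApifyClient

client = ApifyClient(token='MY-APIFY-TOKEN')  # placeholder token

# List named and unnamed request queues, most recently modified first.
page = client.request_queues().list(unnamed=True, limit=20, desc=True)
for queue in page.items:
    print(queue['id'], queue.get('name'))
```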
## Index[**](#Index) ### Methods * [**get\_or\_create](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClient.md#get_or_create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#get_or_create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue_collection.py#L44)get\_or\_create * ****get\_or\_create**(\*, name): dict - Retrieve a named request queue, or create a new one when it doesn't exist. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the request queue to retrieve or create. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue_collection.py#L21)list * ****list**(\*, unnamed, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available request queues. *** #### Parameters * ##### optionalkeyword-onlyunnamed: bool | None = None Whether to include unnamed request queues in the list. * ##### optionalkeyword-onlylimit: int | None = None How many request queues to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What request queue to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the request queues in descending order based on their modification date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # RequestQueueCollectionClientAsync Async sub-client for manipulating request queues. ## Index[**](#Index) ### Methods * [**get\_or\_create](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClientAsync.md#get_or_create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#get_or_create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue_collection.py#L89)get\_or\_create * **async **get\_or\_create**(\*, name): dict - Retrieve a named request queue, or create a new one when it doesn't exist. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None The name of the request queue to retrieve or create. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/request_queue_collection.py#L66)list * **async **list**(\*, unnamed, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available request queues. *** #### Parameters * ##### optionalkeyword-onlyunnamed: bool | None = None Whether to include unnamed request queues in the list. * ##### optionalkeyword-onlylimit: int | None = None How many request queues to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What request queue to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the request queues in descending order based on their modification date.
#### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # RunClient Sub-client for manipulating a single Actor run. ## Index[**](#Index) ### Methods * [**abort](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#abort) * [**charge](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#charge) * [**dataset](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#dataset) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#get) * [**get\_status\_message\_watcher](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#get_status_message_watcher) * [**get\_streamed\_log](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#get_streamed_log) * [**key\_value\_store](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#key_value_store) * [**log](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#log) * [**metamorph](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#metamorph) * [**reboot](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#reboot) * [**request\_queue](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#request_queue) * [**resurrect](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#resurrect) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#update) * [**wait\_for\_finish](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md#wait_for_finish) ## Methods[**](#Methods) ### [**](#abort)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L87)abort * ****abort**(\*, gracefully): dict - Abort the Actor run which is starting or currently running and return its details. *** #### Parameters * ##### optionalkeyword-onlygracefully: bool | None = None If True, the Actor run will abort gracefully. It will send `aborting` and `persistStates` events into the run and force-stop the run after 30 seconds. It is helpful in cases where you plan to resurrect the run later. #### Returns dict ### [**](#charge)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L289)charge * ****charge**(event\_name, count, idempotency\_key): None - Charge for an event of a Pay-Per-Event Actor run. *** #### Parameters * ##### event\_name: str * ##### optionalcount: int | None = None * ##### optionalidempotency\_key: str | None = None #### Returns None ### [**](#dataset)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L214)dataset * ****dataset**(): [DatasetClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md) - Get the client for the default dataset of the Actor run. 
*** #### Returns [DatasetClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClient.md) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L80)delete * ****delete**(): None - Delete the run. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L43)get * ****get**(): dict | None - Return information about the Actor run. *** #### Returns dict | None ### [**](#get_status_message_watcher)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L325)get\_status\_message\_watcher * ****get\_status\_message\_watcher**(to\_logger, check\_period): [StatusMessageWatcherSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md) - Get `StatusMessageWatcher` instance that can be used to redirect status and status messages to logs. `StatusMessageWatcher` can be explicitly started and stopped or used as a context manager. *** #### Parameters * ##### optionalto\_logger: logging.Logger | None = None `Logger` used for logging the status and status messages. If not provided, a new logger is created. * ##### optionalcheck\_period: timedelta = timedelta(seconds=1) The period with which the status message will be polled. #### Returns [StatusMessageWatcherSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md) ### [**](#get_streamed_log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L262)get\_streamed\_log * ****get\_streamed\_log**(to\_logger, \*, from\_start): [StreamedLogSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md) - Get `StreamedLog` instance that can be used to redirect logs. `StreamedLog` can be explicitly started and stopped or used as a context manager. *** #### Parameters * ##### optionalto\_logger: logging.Logger | None = None `Logger` used for logging the redirected messages. If not provided, a new logger is created * ##### optionalkeyword-onlyfrom\_start: bool = True If `True`, all logs from the start of the actor run will be redirected. If `False`, only newly arrived logs will be redirected. This can be useful for redirecting only a small portion of relevant logs for long-running actors in stand-by. #### Returns [StreamedLogSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md) ### [**](#key_value_store)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L226)key\_value\_store * ****key\_value\_store**(): [KeyValueStoreClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md) - Get the client for the default key-value store of the Actor run. *** #### Returns [KeyValueStoreClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClient.md) ### [**](#log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L250)log * ****log**(): [LogClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md) - Get the client for the log of the Actor run. 
*** #### Returns [LogClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClient.md) ### [**](#metamorph)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L114)metamorph * ****metamorph**(\*, target\_actor\_id, target\_actor\_build, run\_input, content\_type): dict - Transform an Actor run into a run of another Actor with a new input. *** #### Parameters * ##### keyword-onlytarget\_actor\_id: str ID of the target Actor that the run should be transformed into. * ##### optionalkeyword-onlytarget\_actor\_build: str | None = None The build of the target Actor. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the target Actor (typically the latest build). * ##### optionalkeyword-onlyrun\_input: Any = None The input to pass to the new run. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. #### Returns dict ### [**](#reboot)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L200)reboot * ****reboot**(): dict - Reboot an Actor run. Only runs that are running, i.e. runs with status RUNNING can be rebooted. *** #### Returns dict ### [**](#request_queue)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L238)request\_queue * ****request\_queue**(): [RequestQueueClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md) - Get the client for the default request queue of the Actor run. *** #### Returns [RequestQueueClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClient.md) ### [**](#resurrect)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L153)resurrect * ****resurrect**(\*, build, memory\_mbytes, timeout\_secs, max\_items, max\_total\_charge\_usd): dict - Resurrect a finished Actor run. Only finished runs, i.e. runs with status FINISHED, FAILED, ABORTED and TIMED-OUT can be resurrected. Run status will be updated to RUNNING and its container will be restarted with the same default storages. *** #### Parameters * ##### optionalkeyword-onlybuild: str | None = None Which Actor build the resurrected run should use. It can be either a build tag or build number. By default, the resurrected run uses the same build as before. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None New memory limit for the resurrected run, in megabytes. By default, the resurrected run uses the same memory limit as before. * ##### optionalkeyword-onlytimeout\_secs: int | None = None New timeout for the resurrected run, in seconds. By default, the resurrected run uses the same timeout as before. * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of items that the resurrected pay-per-result run will return. By default, the resurrected run uses the same limit as before. Limit can be only increased. * ##### optionalkeyword-onlymax\_total\_charge\_usd: Decimal | None = None Maximum cost for the resurrected pay-per-event run in USD. By default, the resurrected run uses the same limit as before. Limit can be only increased. 
#### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L53)update * ****update**(\*, status\_message, is\_status\_message\_terminal, general\_access): dict - Update the run with the specified fields. *** #### Parameters * ##### optionalkeyword-onlystatus\_message: str | None = None The new status message for the run. * ##### optionalkeyword-onlyis\_status\_message\_terminal: bool | None = None Set this flag to True if this is the final status message of the Actor run. * ##### optionalkeyword-onlygeneral\_access: RunGeneralAccess | None = None Determines how others can access the run and its storages. #### Returns dict ### [**](#wait_for_finish)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L102)wait\_for\_finish * ****wait\_for\_finish**(\*, wait\_secs): dict | None - Wait synchronously until the run finishes or the server times out. *** #### Parameters * ##### optionalkeyword-onlywait\_secs: int | None = None How long does the client wait for run to finish. None for indefinite. #### Returns dict | None --- # RunClientAsync Async sub-client for manipulating a single Actor run. ## Index[**](#Index) ### Methods * [**abort](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#abort) * [**charge](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#charge) * [**dataset](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#dataset) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#get) * [**get\_status\_message\_watcher](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#get_status_message_watcher) * [**get\_streamed\_log](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#get_streamed_log) * [**key\_value\_store](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#key_value_store) * [**log](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#log) * [**metamorph](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#metamorph) * [**reboot](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#reboot) * [**request\_queue](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#request_queue) * [**resurrect](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#resurrect) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#update) * [**wait\_for\_finish](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md#wait_for_finish) ## Methods[**](#Methods) ### [**](#abort)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L399)abort * **async **abort**(\*, gracefully): dict - Abort the Actor run which is starting or currently running and return its details. *** #### Parameters * ##### optionalkeyword-onlygracefully: bool | None = None If True, the Actor run will abort gracefully. 
It will send `aborting` and `persistStates` events into the run and force-stop the run after 30 seconds. It is helpful in cases where you plan to resurrect the run later. #### Returns dict ### [**](#charge)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L613)charge * **async **charge**(event\_name, count, idempotency\_key): None - Charge for an event of a Pay-Per-Event Actor run. *** #### Parameters * ##### event\_name: str * ##### optionalcount: int | None = None * ##### optionalidempotency\_key: str | None = None #### Returns None ### [**](#dataset)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L536)dataset * ****dataset**(): [DatasetClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md) - Get the client for the default dataset of the Actor run. *** #### Returns [DatasetClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/DatasetClientAsync.md) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L426)delete * **async **delete**(): None - Delete the run. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L362)get * **async **get**(): dict | None - Return information about the Actor run. *** #### Returns dict | None ### [**](#get_status_message_watcher)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L648)get\_status\_message\_watcher * **async **get\_status\_message\_watcher**(to\_logger, check\_period): [StatusMessageWatcherAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md) - Get `StatusMessageWatcher` instance that can be used to redirect status and status messages to logs. `StatusMessageWatcher` can be explicitly started and stopped or used as a context manager. *** #### Parameters * ##### optionalto\_logger: logging.Logger | None = None `Logger` used for logging the status and status messages. If not provided, a new logger is created. * ##### optionalcheck\_period: timedelta = timedelta(seconds=1) The period with which the status message will be polled. #### Returns [StatusMessageWatcherAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md) ### [**](#get_streamed_log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L584)get\_streamed\_log * **async **get\_streamed\_log**(to\_logger, \*, from\_start): [StreamedLogAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md) - Get `StreamedLog` instance that can be used to redirect logs. `StreamedLog` can be explicitly started and stopped or used as a context manager. *** #### Parameters * ##### optionalto\_logger: logging.Logger | None = None `Logger` used for logging the redirected messages. If not provided, a new logger is created * ##### optionalkeyword-onlyfrom\_start: bool = True If `True`, all logs from the start of the actor run will be redirected. If `False`, only newly arrived logs will be redirected. This can be useful for redirecting only a small portion of relevant logs for long-running actors in stand-by. 
#### Returns [StreamedLogAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md) ### [**](#key_value_store)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L548)key\_value\_store * ****key\_value\_store**(): [KeyValueStoreClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md) - Get the client for the default key-value store of the Actor run. *** #### Returns [KeyValueStoreClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/KeyValueStoreClientAsync.md) ### [**](#log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L572)log * ****log**(): [LogClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md) - Get the client for the log of the Actor run. *** #### Returns [LogClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/LogClientAsync.md) ### [**](#metamorph)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L433)metamorph * **async **metamorph**(\*, target\_actor\_id, target\_actor\_build, run\_input, content\_type): dict - Transform an Actor run into a run of another Actor with a new input. *** #### Parameters * ##### keyword-onlytarget\_actor\_id: str ID of the target Actor that the run should be transformed into. * ##### optionalkeyword-onlytarget\_actor\_build: str | None = None The build of the target Actor. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the target Actor (typically the latest build). * ##### optionalkeyword-onlyrun\_input: Any = None The input to pass to the new run. * ##### optionalkeyword-onlycontent\_type: str | None = None The content type of the input. #### Returns dict ### [**](#reboot)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L522)reboot * **async **reboot**(): dict - Reboot an Actor run. Only runs that are running, i.e. runs with status RUNNING can be rebooted. *** #### Returns dict ### [**](#request_queue)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L560)request\_queue * ****request\_queue**(): [RequestQueueClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md) - Get the client for the default request queue of the Actor run. *** #### Returns [RequestQueueClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RequestQueueClientAsync.md) ### [**](#resurrect)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L475)resurrect * **async **resurrect**(\*, build, memory\_mbytes, timeout\_secs, max\_items, max\_total\_charge\_usd): dict - Resurrect a finished Actor run. Only finished runs, i.e. runs with status FINISHED, FAILED, ABORTED and TIMED-OUT can be resurrected. Run status will be updated to RUNNING and its container will be restarted with the same default storages. *** #### Parameters * ##### optionalkeyword-onlybuild: str | None = None Which Actor build the resurrected run should use. It can be either a build tag or build number. By default, the resurrected run uses the same build as before. 
* ##### optionalkeyword-onlymemory\_mbytes: int | None = None New memory limit for the resurrected run, in megabytes. By default, the resurrected run uses the same memory limit as before. * ##### optionalkeyword-onlytimeout\_secs: int | None = None New timeout for the resurrected run, in seconds. By default, the resurrected run uses the same timeout as before. * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of items that the resurrected pay-per-result run will return. By default, the resurrected run uses the same limit as before. Limit can be only increased. * ##### optionalkeyword-onlymax\_total\_charge\_usd: Decimal | None = None Maximum cost for the resurrected pay-per-event run in USD. By default, the resurrected run uses the same limit as before. Limit can be only increased. #### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L372)update * **async **update**(\*, status\_message, is\_status\_message\_terminal, general\_access): dict - Update the run with the specified fields. *** #### Parameters * ##### optionalkeyword-onlystatus\_message: str | None = None The new status message for the run. * ##### optionalkeyword-onlyis\_status\_message\_terminal: bool | None = None Set this flag to True if this is the final status message of the Actor run. * ##### optionalkeyword-onlygeneral\_access: RunGeneralAccess | None = None Determines how others can access the run and its storages. #### Returns dict ### [**](#wait_for_finish)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run.py#L414)wait\_for\_finish * **async **wait\_for\_finish**(\*, wait\_secs): dict | None - Wait synchronously until the run finishes or the server times out. *** #### Parameters * ##### optionalkeyword-onlywait\_secs: int | None = None How long does the client wait for run to finish. None for indefinite. #### Returns dict | None --- # RunCollectionClient Sub-client for listing Actor runs. ## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run_collection.py#L23)list * ****list**(\*, limit, offset, desc, status): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List all Actor runs. List all Actor runs, either of a single Actor, or all user's Actors, depending on where this client was initialized from. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many runs to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What run to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the runs in descending order based on their start date. * ##### optionalkeyword-onlystatus: (ActorJobStatus | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[ActorJobStatus]) | None = None Retrieve only runs with the provided statuses. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # RunCollectionClientAsync Async sub-client for listing Actor runs. 
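As an illustration, the following sketch lists recent runs filtered by status with the async client; the token is a placeholder, and the `ActorJobStatus` import assumes the enum shipped with the apify-shared package.

```python
import asyncio

from apify_client import ApifyClientAsync
from apify_shared.consts import ActorJobStatus  # assumption: enum provided by apify-shared


async def main() -> None:
    client = ApifyClientAsync(token='MY-APIFY-TOKEN')  # placeholder token

    # List the ten most recently started failed runs across all of the user's Actors.
    page = await client.runs().list(limit=10, desc=True, status=ActorJobStatus.FAILED)
    for run in page.items:
        print(run['id'], run['status'], run.get('startedAt'))


asyncio.run(main())
```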
## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/run_collection.py#L69)list * **async **list**(\*, limit, offset, desc, status): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List all Actor runs. List all Actor runs, either of a single Actor, or all user's Actors, depending on where this client was initialized from. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many runs to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What run to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the runs in descending order based on their start date. * ##### optionalkeyword-onlystatus: (ActorJobStatus | [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[ActorJobStatus]) | None = None Retrieve only runs with the provided statuses. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ScheduleClient Sub-client for manipulating a single schedule. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClient.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClient.md#get) * [**get\_log](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClient.md#get_log) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClient.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L97)delete * ****delete**(): None - Delete the schedule. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L43)get * ****get**(): dict | None - Return information about the schedule. *** #### Returns dict | None ### [**](#get_log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L104)get\_log * ****get\_log**(): [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list) | None - Return log for the given schedule. *** #### Returns [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list) | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L53)update * ****update**(\*, cron\_expression, is\_enabled, is\_exclusive, name, actions, description, timezone, title): dict - Update the schedule with specified fields. *** #### Parameters * ##### optionalkeyword-onlycron\_expression: str | None = None The cron expression used by this schedule. * ##### optionalkeyword-onlyis\_enabled: bool | None = None True if the schedule should be enabled. * ##### optionalkeyword-onlyis\_exclusive: bool | None = None When set to true, don't start Actor or Actor task if it's still running from the previous schedule. 
* ##### optionalkeyword-onlyname: str | None = None The name of the schedule to create. * ##### optionalkeyword-onlyactions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Actors or tasks that should be run on this schedule. See the API documentation for exact structure. * ##### optionalkeyword-onlydescription: str | None = None Description of this schedule. * ##### optionalkeyword-onlytimezone: str | None = None Timezone in which your cron expression runs (TZ database name). * ##### optionalkeyword-onlytitle: str | None = None A human-friendly equivalent of the name. #### Returns dict --- # ScheduleClientAsync Async sub-client for manipulating a single schedule. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClientAsync.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClientAsync.md#get) * [**get\_log](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClientAsync.md#get_log) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleClientAsync.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L187)delete * **async **delete**(): None - Delete the schedule. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L133)get * **async **get**(): dict | None - Return information about the schedule. *** #### Returns dict | None ### [**](#get_log)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L194)get\_log * **async **get\_log**(): [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list) | None - Return log for the given schedule. *** #### Returns [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list) | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule.py#L143)update * **async **update**(\*, cron\_expression, is\_enabled, is\_exclusive, name, actions, description, timezone, title): dict - Update the schedule with specified fields. *** #### Parameters * ##### optionalkeyword-onlycron\_expression: str | None = None The cron expression used by this schedule. * ##### optionalkeyword-onlyis\_enabled: bool | None = None True if the schedule should be enabled. * ##### optionalkeyword-onlyis\_exclusive: bool | None = None When set to true, don't start Actor or Actor task if it's still running from the previous schedule. * ##### optionalkeyword-onlyname: str | None = None The name of the schedule to create. * ##### optionalkeyword-onlyactions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Actors or tasks that should be run on this schedule. See the API documentation for exact structure. * ##### optionalkeyword-onlydescription: str | None = None Description of this schedule. * ##### optionalkeyword-onlytimezone: str | None = None Timezone in which your cron expression runs (TZ database name).
* ##### optionalkeyword-onlytitle: str | None = None A human-friendly equivalent of the name. #### Returns dict --- # ScheduleCollectionClient Sub-client for manipulating schedules. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClient.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule_collection.py#L43)create * ****create**(\*, cron\_expression, is\_enabled, is\_exclusive, name, actions, description, timezone, title): dict - Create a new schedule. *** #### Parameters * ##### keyword-onlycron\_expression: str The cron expression used by this schedule. * ##### keyword-onlyis\_enabled: bool True if the schedule should be enabled. * ##### keyword-onlyis\_exclusive: bool When set to true, don't start Actor or Actor task if it's still running from the previous schedule. * ##### optionalkeyword-onlyname: str | None = None The name of the schedule to create. * ##### optionalkeyword-onlyactions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Actors or tasks that should be run on this schedule. See the API documentation for exact structure. * ##### optionalkeyword-onlydescription: str | None = None Description of this schedule. * ##### optionalkeyword-onlytimezone: str | None = None Timezone in which your cron expression runs (TZ database name). * ##### optionalkeyword-onlytitle: str | None = None Title of this schedule. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule_collection.py#L22)list * ****list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available schedules. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many schedules to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None Which schedule to include as the first one when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the schedules in descending order based on their modification date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # ScheduleCollectionClientAsync Async sub-client for manipulating schedules. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClientAsync.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ScheduleCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule_collection.py#L120)create * **async **create**(\*, cron\_expression, is\_enabled, is\_exclusive, name, actions, description, timezone, title): dict - Create a new schedule. *** #### Parameters * ##### keyword-onlycron\_expression: str The cron expression used by this schedule. * ##### keyword-onlyis\_enabled: bool True if the schedule should be enabled.
* ##### keyword-onlyis\_exclusive: bool When set to true, don't start Actor or Actor task if it's still running from the previous schedule. * ##### optionalkeyword-onlyname: str | None = None The name of the schedule to create. * ##### optionalkeyword-onlyactions: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Actors or tasks that should be run on this schedule. See the API documentation for exact structure. * ##### optionalkeyword-onlydescription: str | None = None Description of this schedule. * ##### optionalkeyword-onlytimezone: str | None = None Timezone in which your cron expression runs (TZ database name). * ##### optionalkeyword-onlytitle: str | None = None Title of this schedule. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/schedule_collection.py#L99)list * **async **list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available schedules. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many schedules to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None Which schedule to include as the first one when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the schedules in descending order based on their modification date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # Statistics Statistics about API client usage and rate limit errors. ## Index[**](#Index) ### Methods * [**add\_rate\_limit\_error](https://docs.apify.com/api/client/python/api/client/python/reference/class/Statistics.md#add_rate_limit_error) ### Properties * [**calls](https://docs.apify.com/api/client/python/api/client/python/reference/class/Statistics.md#calls) * [**rate\_limit\_errors](https://docs.apify.com/api/client/python/api/client/python/reference/class/Statistics.md#rate_limit_errors) * [**requests](https://docs.apify.com/api/client/python/api/client/python/reference/class/Statistics.md#requests) ## Methods[**](#Methods) ### [**](#add_rate_limit_error)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_statistics.py#L18)add\_rate\_limit\_error * ****add\_rate\_limit\_error**(attempt): None - Add a rate limit error for a specific attempt. *** #### Parameters * ##### attempt: int The attempt number (1-based indexing). #### Returns None ## Properties[**](#Properties) ### [**](#calls)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_statistics.py#L9)calls **calls: int Total number of API method calls made by the client. ### [**](#rate_limit_errors)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_statistics.py#L15)rate\_limit\_errors **rate\_limit\_errors: defaultdict\[int, int] Mapping that tracks which retry attempts encountered rate limit (429) errors. ### [**](#requests)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_statistics.py#L12)requests **requests: int Total number of HTTP requests sent, including retries. --- # StatusMessageWatcher Utility class for logging status messages from another Actor run. The status message is logged at fixed time intervals, so there is no guarantee that every message will be logged, especially when the status message changes frequently.
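For illustration only, a minimal sketch of driving the sync variant of this watcher (`StatusMessageWatcherSync`, described below) by hand. The import path is inferred from the source links in this reference, and the token and run ID are placeholders; adjust both to your setup.

```python
import logging
import time
from datetime import timedelta

from apify_client import ApifyClient
# Import path inferred from the source links in this reference; adjust if your
# installed version re-exports the class elsewhere.
from apify_client.clients.resource_clients.log import StatusMessageWatcherSync

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('monitored-run')

client = ApifyClient(token='MY-APIFY-TOKEN')   # placeholder token
run_client = client.run('RUN-ID-PLACEHOLDER')  # placeholder run ID

# Poll the run's status message every 2 seconds and forward it to `logger`
# while the body of the context manager does other work.
with StatusMessageWatcherSync(
    run_client=run_client,
    to_logger=logger,
    check_period=timedelta(seconds=2),
):
    time.sleep(30)  # stand-in for whatever work happens while the run is watched
```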
### Hierarchy * *StatusMessageWatcher* * [StatusMessageWatcherAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md) * [StatusMessageWatcherSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md) ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcher.md#__init__) ## Methods[**](#Methods) ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L402)\_\_init\_\_ * ****\_\_init\_\_**(\*, to\_logger, check\_period): None - Initialize `StatusMessageWatcher`. *** #### Parameters * ##### keyword-onlyto\_logger: logging.Logger The logger to which the status message will be redirected. * ##### optionalkeyword-onlycheck\_period: timedelta = timedelta(seconds=5) The period with which the status message will be polled. #### Returns None --- # StatusMessageWatcherAsync Async variant of `StatusMessageWatcher` that runs the logging in an asyncio task. ### Hierarchy * [StatusMessageWatcher](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcher.md) * *StatusMessageWatcherAsync* ## Index[**](#Index) ### Methods * [**\_\_aenter\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md#__aenter__) * [**\_\_aexit\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md#__aexit__) * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md#__init__) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md#start) * [**stop](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherAsync.md#stop) ## Methods[**](#Methods) ### [**](#__aenter__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L472)\_\_aenter\_\_ * **async **\_\_aenter\_\_**(): Self - Start the logging task within the context. Exiting the context will cancel the logging task. *** #### Returns Self ### [**](#__aexit__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L477)\_\_aexit\_\_ * **async **\_\_aexit\_\_**(exc\_type, exc\_val, exc\_tb): None - Cancel the logging task. *** #### Parameters * ##### exc\_type: type\[BaseException] | None * ##### exc\_val: BaseException | None * ##### exc\_tb: TracebackType | None #### Returns None ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L440)\_\_init\_\_ * ****\_\_init\_\_**(\*, run\_client, to\_logger, check\_period): None - Overrides [StatusMessageWatcher.\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcher.md#__init__) Initialize `StatusMessageWatcherAsync`. *** #### Parameters * ##### keyword-onlyrun\_client: [RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) The client for the run whose status and status message will be retrieved. * ##### keyword-onlyto\_logger: logging.Logger The logger to which the status message will be redirected.
* ##### optionalkeyword-onlycheck\_period: timedelta = timedelta(seconds=1) The period with which the status message will be polled. #### Returns None ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L454)start * ****start**(): Task - Start the logging task. The caller has to handle any cleanup by manually calling the `stop` method. *** #### Returns Task ### [**](#stop)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L461)stop * **async **stop**(): None - Stop the logging task. *** #### Returns None --- # StatusMessageWatcherSync Sync variant of `StatusMessageWatcher` that runs the logging in a separate thread. ### Hierarchy * [StatusMessageWatcher](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcher.md) * *StatusMessageWatcherSync* ## Index[**](#Index) ### Methods * [**\_\_enter\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md#__enter__) * [**\_\_exit\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md#__exit__) * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md#__init__) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md#start) * [**stop](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcherSync.md#stop) ## Methods[**](#Methods) ### [**](#__enter__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L529)\_\_enter\_\_ * ****\_\_enter\_\_**(): Self - Start the logging thread within the context. Exiting the context will stop the logging thread. *** #### Returns Self ### [**](#__exit__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L534)\_\_exit\_\_ * ****\_\_exit\_\_**(exc\_type, exc\_val, exc\_tb): None - Stop the logging thread. *** #### Parameters * ##### exc\_type: type\[BaseException] | None * ##### exc\_val: BaseException | None * ##### exc\_tb: TracebackType | None #### Returns None ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L495)\_\_init\_\_ * ****\_\_init\_\_**(\*, run\_client, to\_logger, check\_period): None - Overrides [StatusMessageWatcher.\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StatusMessageWatcher.md#__init__) Initialize `StatusMessageWatcherSync`. *** #### Parameters * ##### keyword-onlyrun\_client: [RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) The client for the run whose status and status message will be retrieved. * ##### keyword-onlyto\_logger: logging.Logger The logger to which the status message will be redirected. * ##### optionalkeyword-onlycheck\_period: timedelta = timedelta(seconds=1) The period with which the status message will be polled. #### Returns None ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L510)start * ****start**(): Thread - Start the logging thread. The caller has to handle any cleanup by manually calling the `stop` method.
*** #### Returns Thread ### [**](#stop)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L519)stop * ****stop**(): None - Signal the logging thread to stop logging and wait for it to finish. *** #### Returns None --- # StoreCollectionClient Sub-client for Apify store. ## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/store_collection.py#L21)list * ****list**(\*, limit, offset, search, sort\_by, category, username, pricing\_model): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List Actors in Apify store. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many Actors to list. * ##### optionalkeyword-onlyoffset: int | None = None Which Actor to include as the first one when retrieving the list. * ##### optionalkeyword-onlysearch: str | None = None String to search by. The search runs on the following fields: title, name, description, username, readme. * ##### optionalkeyword-onlysort\_by: str | None = None Specifies the field by which to sort the results. * ##### optionalkeyword-onlycategory: str | None = None Filter by this category. * ##### optionalkeyword-onlyusername: str | None = None Filter by this username. * ##### optionalkeyword-onlypricing\_model: str | None = None Filter by this pricing model. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # StoreCollectionClientAsync Async sub-client for Apify store. ## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/StoreCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/store_collection.py#L68)list * **async **list**(\*, limit, offset, search, sort\_by, category, username, pricing\_model): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List Actors in Apify store. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many Actors to list. * ##### optionalkeyword-onlyoffset: int | None = None Which Actor to include as the first one when retrieving the list. * ##### optionalkeyword-onlysearch: str | None = None String to search by. The search runs on the following fields: title, name, description, username, readme. * ##### optionalkeyword-onlysort\_by: str | None = None Specifies the field by which to sort the results. * ##### optionalkeyword-onlycategory: str | None = None Filter by this category. * ##### optionalkeyword-onlyusername: str | None = None Filter by this username. * ##### optionalkeyword-onlypricing\_model: str | None = None Filter by this pricing model. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # StreamedLog Utility class for streaming logs from another Actor. It uses a buffer to deal with logs that may arrive in chunks. The chunks are stored in the buffer and are expected to contain specific markers that indicate the start of each log message.
Each time a new chunk containing a complete split marker arrives, the buffer is processed, logged, and emptied. This works only if the logs contain a datetime marker in ISO format, for example `2025-05-12T15:35:59.429Z`, which is the default log format for Actors. ### Hierarchy * *StreamedLog* * [StreamedLogSync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md) * [StreamedLogAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md) ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLog.md#__init__) ## Methods[**](#Methods) ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L224)\_\_init\_\_ * ****\_\_init\_\_**(to\_logger, \*, from\_start): None - Initialize `StreamedLog`. *** #### Parameters * ##### to\_logger: logging.Logger The logger to which the logs will be redirected. * ##### optionalkeyword-onlyfrom\_start: bool = True If `True`, all logs from the start of the Actor run will be redirected. If `False`, only newly arrived logs will be redirected. This can be useful for redirecting only a small portion of relevant logs for long-running Actors in Standby mode. #### Returns None --- # StreamedLogAsync Async variant of `StreamedLog` that runs the logging in asyncio tasks. ### Hierarchy * [StreamedLog](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLog.md) * *StreamedLogAsync* ## Index[**](#Index) ### Methods * [**\_\_aenter\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md#__aenter__) * [**\_\_aexit\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md#__aexit__) * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md#__init__) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md#start) * [**stop](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogAsync.md#stop) ## Methods[**](#Methods) ### [**](#__aenter__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L367)\_\_aenter\_\_ * **async **\_\_aenter\_\_**(): Self - Start the streaming task within the context. Exiting the context will cancel the streaming task. *** #### Returns Self ### [**](#__aexit__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L372)\_\_aexit\_\_ * **async **\_\_aexit\_\_**(exc\_type, exc\_val, exc\_tb): None - Cancel the streaming task. *** #### Parameters * ##### exc\_type: type\[BaseException] | None * ##### exc\_val: BaseException | None * ##### exc\_tb: TracebackType | None #### Returns None ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L344)\_\_init\_\_ * ****\_\_init\_\_**(to\_logger, \*, from\_start): None - Overrides [StreamedLog.\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLog.md#__init__) Initialize `StreamedLogAsync`. *** #### Parameters * ##### to\_logger: logging.Logger The logger to which the logs will be redirected.
* ##### optionalkeyword-onlyfrom\_start: bool = True If `True`, all logs from the start of the Actor run will be redirected. If `False`, only newly arrived logs will be redirected. This can be useful for redirecting only a small portion of relevant logs for long-running Actors in Standby mode. #### Returns None ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L349)start * ****start**(): Task - Start the streaming task. The caller has to handle any cleanup by manually calling the `stop` method. *** #### Returns Task ### [**](#stop)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L356)stop * **async **stop**(): None - Stop the streaming task. *** #### Returns None --- # StreamedLogSync Sync variant of `StreamedLog` that runs the logging in a separate thread. ### Hierarchy * [StreamedLog](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLog.md) * *StreamedLogSync* ## Index[**](#Index) ### Methods * [**\_\_enter\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md#__enter__) * [**\_\_exit\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md#__exit__) * [**\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md#__init__) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md#start) * [**stop](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLogSync.md#stop) ## Methods[**](#Methods) ### [**](#__enter__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L316)\_\_enter\_\_ * ****\_\_enter\_\_**(): Self - Start the streaming thread within the context. Exiting the context will finish the streaming thread. *** #### Returns Self ### [**](#__exit__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L321)\_\_exit\_\_ * ****\_\_exit\_\_**(exc\_type, exc\_val, exc\_tb): None - Stop the streaming thread. *** #### Parameters * ##### exc\_type: type\[BaseException] | None * ##### exc\_val: BaseException | None * ##### exc\_tb: TracebackType | None #### Returns None ### [**](#__init__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L292)\_\_init\_\_ * ****\_\_init\_\_**(to\_logger, \*, from\_start): None - Overrides [StreamedLog.\_\_init\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/StreamedLog.md#__init__) Initialize `StreamedLogSync`. *** #### Parameters * ##### to\_logger: logging.Logger The logger to which the logs will be redirected. * ##### optionalkeyword-onlyfrom\_start: bool = True If `True`, all logs from the start of the Actor run will be redirected. If `False`, only newly arrived logs will be redirected. This can be useful for redirecting only a small portion of relevant logs for long-running Actors in Standby mode. #### Returns None ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L298)start * ****start**(): Thread - Start the streaming thread. The caller has to handle any cleanup by manually calling the `stop` method.
*** #### Returns Thread ### [**](#stop)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/log.py#L307)stop * ****stop**(): None - Signal the streaming thread to stop logging and wait for it to finish. *** #### Returns None --- # TaskClient Sub-client for manipulating a single task. ## Index[**](#Index) ### Methods * [**call](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#call) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#get) * [**get\_input](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#get_input) * [**last\_run](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#last_run) * [**runs](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#runs) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#start) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#update) * [**update\_input](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#update_input) * [**webhooks](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClient.md#webhooks) ## Methods[**](#Methods) ### [**](#call)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L206)call * ****call**(\*, task\_input, build, max\_items, memory\_mbytes, timeout\_secs, webhooks, wait\_secs): dict | None - Start a task and wait for it to finish before returning the Run object. It waits indefinitely, unless the wait\_secs argument is provided. *** #### Parameters * ##### optionalkeyword-onlytask\_input: dict | None = None Task input dictionary. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Specifies optional webhooks associated with the Actor run, which can be used to receive a notification e.g. when the Actor finished or failed. Note: if you already have a webhook set up for the Actor or task, you do not have to add it again here. * ##### optionalkeyword-onlywait\_secs: int | None = None The maximum number of seconds the server waits for the task run to finish. If not provided, waits indefinitely. 
#### Returns dict | None ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L142)delete * ****delete**(): None - Delete the task. *** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L71)get * ****get**(): dict | None - Retrieve the task. *** #### Returns dict | None ### [**](#get_input)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L253)get\_input * ****get\_input**(): dict | None - Retrieve the default input for this task. *** #### Returns dict | None ### [**](#last_run)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L292)last\_run * ****last\_run**(\*, status, origin): [RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) - Retrieve the client for the last run of this task. Last run is retrieved based on the start time of the runs. *** #### Parameters * ##### optionalkeyword-onlystatus: ActorJobStatus | None = None Consider only runs with this status. * ##### optionalkeyword-onlyorigin: MetaOrigin | None = None Consider only runs started with this origin. #### Returns [RunClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClient.md) ### [**](#runs)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L288)runs * ****runs**(): [RunCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md) - Retrieve a client for the runs of this task. *** #### Returns [RunCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClient.md) ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L149)start * ****start**(\*, task\_input, build, max\_items, memory\_mbytes, timeout\_secs, wait\_for\_finish, webhooks): dict - Start the task and immediately return the Run object. *** #### Parameters * ##### optionalkeyword-onlytask\_input: dict | None = None Task input dictionary. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the run to finish. By default, it is 0, the maximum value is 60. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Optional ad-hoc webhooks () associated with the Actor run which can be used to receive a notification, e.g. 
when the Actor finished or failed. If you already have a webhook set up for the Actor or task, you do not have to add it again here. Each webhook is represented by a dictionary containing these items: * `event_types`: List of `` `WebhookEventType` `` values which trigger the webhook. * `request_url`: URL to which to send the webhook HTTP request. * `payload_template`: Optional template for the request payload. #### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L81)update * ****update**(\*, name, task\_input, build, max\_items, memory\_mbytes, timeout\_secs, title, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes): dict - Update the task with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None Name of the task. * ##### optionalkeyword-onlytask\_input: dict | None = None Task input dictionary. * ##### optionalkeyword-onlybuild: str | None = None Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlytitle: str | None = None A human-friendly equivalent of the name. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. #### Returns dict ### [**](#update_input)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L272)update\_input * ****update\_input**(\*, task\_input): dict - Update the default input for this task. *** #### Parameters * ##### keyword-onlytask\_input: dict #### Returns dict ### [**](#webhooks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L315)webhooks * ****webhooks**(): [WebhookCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md) - Retrieve a client for webhooks associated with this task. 
*** #### Returns [WebhookCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md) --- # TaskClientAsync Async sub-client for manipulating a single task. ## Index[**](#Index) ### Methods * [**call](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#call) * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#delete) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#get) * [**get\_input](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#get_input) * [**last\_run](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#last_run) * [**runs](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#runs) * [**start](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#start) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#update) * [**update\_input](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#update_input) * [**webhooks](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskClientAsync.md#webhooks) ## Methods[**](#Methods) ### [**](#call)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L463)call * **async **call**(\*, task\_input, build, max\_items, memory\_mbytes, timeout\_secs, webhooks, wait\_secs): dict | None - Start a task and wait for it to finish before returning the Run object. It waits indefinitely, unless the wait\_secs argument is provided. *** #### Parameters * ##### optionalkeyword-onlytask\_input: dict | None = None Task input dictionary. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Specifies optional webhooks associated with the Actor run, which can be used to receive a notification e.g. when the Actor finished or failed. Note: if you already have a webhook set up for the Actor or task, you do not have to add it again here. * ##### optionalkeyword-onlywait\_secs: int | None = None The maximum number of seconds the server waits for the task run to finish. If not provided, waits indefinitely. #### Returns dict | None ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L399)delete * **async **delete**(): None - Delete the task. 
*** #### Returns None ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L328)get * **async **get**(): dict | None - Retrieve the task. *** #### Returns dict | None ### [**](#get_input)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L510)get\_input * **async **get\_input**(): dict | None - Retrieve the default input for this task. *** #### Returns dict | None ### [**](#last_run)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L549)last\_run * ****last\_run**(\*, status, origin): [RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) - Retrieve the client for the last run of this task. Last run is retrieved based on the start time of the runs. *** #### Parameters * ##### optionalkeyword-onlystatus: ActorJobStatus | None = None Consider only runs with this status. * ##### optionalkeyword-onlyorigin: MetaOrigin | None = None Consider only runs started with this origin. #### Returns [RunClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunClientAsync.md) ### [**](#runs)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L545)runs * ****runs**(): [RunCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md) - Retrieve a client for the runs of this task. *** #### Returns [RunCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/RunCollectionClientAsync.md) ### [**](#start)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L406)start * **async **start**(\*, task\_input, build, max\_items, memory\_mbytes, timeout\_secs, wait\_for\_finish, webhooks): dict - Start the task and immediately return the Run object. *** #### Parameters * ##### optionalkeyword-onlytask\_input: dict | None = None Task input dictionary. * ##### optionalkeyword-onlybuild: str | None = None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlywait\_for\_finish: int | None = None The maximum number of seconds the server waits for the run to finish. By default, it is 0, the maximum value is 60. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[dict] | None = None Optional ad-hoc webhooks () associated with the Actor run which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor or task, you do not have to add it again here. 
Each webhook is represented by a dictionary containing these items: * `event_types`: List of `` `WebhookEventType` `` values which trigger the webhook. * `request_url`: URL to which to send the webhook HTTP request. * `payload_template`: Optional template for the request payload. #### Returns dict ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L338)update * **async **update**(\*, name, task\_input, build, max\_items, memory\_mbytes, timeout\_secs, title, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes): dict - Update the task with specified fields. *** #### Parameters * ##### optionalkeyword-onlyname: str | None = None Name of the task. * ##### optionalkeyword-onlytask\_input: dict | None = None Task input dictionary. * ##### optionalkeyword-onlybuild: str | None = None Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by this run. If the Actor is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlytitle: str | None = None A human-friendly equivalent of the name. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. #### Returns dict ### [**](#update_input)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L529)update\_input * **async **update\_input**(\*, task\_input): dict - Update the default input for this task. *** #### Parameters * ##### keyword-onlytask\_input: dict #### Returns dict ### [**](#webhooks)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task.py#L572)webhooks * ****webhooks**(): [WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md) - Retrieve a client for webhooks associated with this task. *** #### Returns [WebhookCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md) --- # TaskCollectionClient Sub-client for manipulating tasks. 
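A short usage sketch of the `create` and `list` methods documented below. The token, Actor ID, task name, and input are placeholders.

```python
from apify_client import ApifyClient

client = ApifyClient(token='MY-APIFY-TOKEN')  # placeholder token
tasks_client = client.tasks()

# Create a task that runs a given Actor with a preset input.
task = tasks_client.create(
    actor_id='janedoe~my-actor',  # placeholder Actor ID
    name='my-daily-scrape',       # placeholder task name
    task_input={'startUrls': [{'url': 'https://example.com'}]},
    memory_mbytes=1024,
)
print(task['id'])

# List up to 10 tasks, most recently created first.
task_page = tasks_client.list(limit=10, desc=True)
for item in task_page.items:
    print(item['id'], item.get('name'))
```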
## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClient.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task_collection.py#L43)create * ****create**(\*, actor\_id, name, build, timeout\_secs, memory\_mbytes, max\_items, task\_input, title, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes): dict - Create a new task. *** #### Parameters * ##### keyword-onlyactor\_id: str Id of the Actor that should be run. * ##### keyword-onlyname: str Name of the task. * ##### optionalkeyword-onlybuild: str | None = None Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by runs of this task. If the Actor of this task is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlytask\_input: dict | None = None Task input object. * ##### optionalkeyword-onlytitle: str | None = None A human-friendly equivalent of the name. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task_collection.py#L22)list * ****list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available tasks. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many tasks to list. * ##### optionalkeyword-onlyoffset: int | None = None What task to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the tasks in descending order based on their creation date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # TaskCollectionClientAsync Async sub-client for manipulating tasks. 
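The async counterpart mirrors the sync client; a brief sketch with a placeholder token:

```python
import asyncio

from apify_client import ApifyClientAsync


async def main() -> None:
    client = ApifyClientAsync(token='MY-APIFY-TOKEN')  # placeholder token
    tasks_client = client.tasks()

    # List up to 10 tasks, most recently created first.
    task_page = await tasks_client.list(limit=10, desc=True)
    for item in task_page.items:
        print(item['id'], item.get('name'))


asyncio.run(main())
```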
## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClientAsync.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/TaskCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task_collection.py#L137)create * **async **create**(\*, actor\_id, name, build, timeout\_secs, memory\_mbytes, max\_items, task\_input, title, actor\_standby\_desired\_requests\_per\_actor\_run, actor\_standby\_max\_requests\_per\_actor\_run, actor\_standby\_idle\_timeout\_secs, actor\_standby\_build, actor\_standby\_memory\_mbytes): dict - Create a new task. *** #### Parameters * ##### keyword-onlyactor\_id: str Id of the Actor that should be run. * ##### keyword-onlyname: str Name of the task. * ##### optionalkeyword-onlybuild: str | None = None Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the task settings (typically latest). * ##### optionalkeyword-onlytimeout\_secs: int | None = None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the task settings. * ##### optionalkeyword-onlymemory\_mbytes: int | None = None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the task settings. * ##### optionalkeyword-onlymax\_items: int | None = None Maximum number of results that will be returned by runs of this task. If the Actor of this task is charged per result, you will not be charged for more results than the given limit. * ##### optionalkeyword-onlytask\_input: dict | None = None Task input object. * ##### optionalkeyword-onlytitle: str | None = None A human-friendly equivalent of the name. * ##### optionalkeyword-onlyactor\_standby\_desired\_requests\_per\_actor\_run: int | None = None The desired number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_max\_requests\_per\_actor\_run: int | None = None The maximum number of concurrent HTTP requests for a single Actor Standby run. * ##### optionalkeyword-onlyactor\_standby\_idle\_timeout\_secs: int | None = None If the Actor run does not receive any requests for this time, it will be shut down. * ##### optionalkeyword-onlyactor\_standby\_build: str | None = None The build tag or number to run when the Actor is in Standby mode. * ##### optionalkeyword-onlyactor\_standby\_memory\_mbytes: int | None = None The memory in megabytes to use when the Actor is in Standby mode. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/task_collection.py#L116)list * **async **list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available tasks. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many tasks to list. * ##### optionalkeyword-onlyoffset: int | None = None What task to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the tasks in descending order based on their creation date. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # UserClient Sub-client for querying user data. 
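A minimal sketch of reading account data with this sub-client, assuming a placeholder token; calling `user()` without a user ID should target the account that owns the token.

```python
from apify_client import ApifyClient

client = ApifyClient(token='MY-APIFY-TOKEN')  # placeholder token
user_client = client.user()  # no user ID: the account that owns the token

account = user_client.get()          # public or full info, depending on token permissions
usage = user_client.monthly_usage()  # current usage cycle, with a daily breakdown
limits = user_client.limits()        # same data as the account's Limits page

if account is not None:
    print(account.get('username'))
print(usage)
print(limits)
```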
## Index[**](#Index) ### Methods * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClient.md#get) * [**limits](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClient.md#limits) * [**monthly\_usage](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClient.md#monthly_usage) * [**update\_limits](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClient.md#update_limits) ## Methods[**](#Methods) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L23)get * ****get**(): dict | None - Return information about user account. You receive all or only public info based on your token permissions. *** #### Returns dict | None ### [**](#limits)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L60)limits * ****limits**(): dict | None - Return a complete summary of the user account's limits. It is the same information which is available on the account's Limits page. The returned data includes the current usage cycle, a summary of the account's limits, and the current usage. *** #### Returns dict | None ### [**](#monthly_usage)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L35)monthly\_usage * ****monthly\_usage**(): dict | None - Return monthly usage of the user account. This includes a complete usage summary for the current usage cycle, an overall sum, as well as a daily breakdown of usage. It is the same information which is available on the account's Billing page. The information includes use of storage, data transfer, and request queue usage. *** #### Returns dict | None ### [**](#update_limits)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L84)update\_limits * ****update\_limits**(\*, max\_monthly\_usage\_usd, data\_retention\_days): None - Update the account's limits manageable on your account's Limits page. *** #### Parameters * ##### optionalkeyword-onlymax\_monthly\_usage\_usd: int | None = None * ##### optionalkeyword-onlydata\_retention\_days: int | None = None #### Returns None --- # UserClientAsync Async sub-client for querying user data. ## Index[**](#Index) ### Methods * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClientAsync.md#get) * [**limits](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClientAsync.md#limits) * [**monthly\_usage](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClientAsync.md#monthly_usage) * [**update\_limits](https://docs.apify.com/api/client/python/api/client/python/reference/class/UserClientAsync.md#update_limits) ## Methods[**](#Methods) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L115)get * **async **get**(): dict | None - Return information about user account. You receive all or only public info based on your token permissions. *** #### Returns dict | None ### [**](#limits)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L152)limits * **async **limits**(): dict | None - Return a complete summary of the user account's limits. It is the same information which is available on the account's Limits page. 
The returned data includes the current usage cycle, a summary of the account's limits, and the current usage. *** #### Returns dict | None ### [**](#monthly_usage)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L127)monthly\_usage * **async **monthly\_usage**(): dict | None - Return monthly usage of the user account. This includes a complete usage summary for the current usage cycle, an overall sum, as well as a daily breakdown of usage. It is the same information which is available on the account's Billing page. The information includes use of storage, data transfer, and request queue usage. *** #### Returns dict | None ### [**](#update_limits)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/user.py#L176)update\_limits * **async **update\_limits**(\*, max\_monthly\_usage\_usd, data\_retention\_days): None - Update the account's limits manageable on your account's Limits page. *** #### Parameters * ##### optionalkeyword-onlymax\_monthly\_usage\_usd: int | None = None * ##### optionalkeyword-onlydata\_retention\_days: int | None = None #### Returns None --- # WebhookClient Sub-client for manipulating a single webhook. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md#delete) * [**dispatches](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md#dispatches) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md#get) * [**test](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md#test) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClient.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L130)delete * ****delete**(): None - Delete the webhook. *** #### Returns None ### [**](#dispatches)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L161)dispatches * ****dispatches**(): [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClient.md) - Get dispatches of the webhook. *** #### Returns [WebhookDispatchCollectionClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClient.md) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L71)get * ****get**(): dict | None - Retrieve the webhook. *** #### Returns dict | None ### [**](#test)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L137)test * ****test**(): dict | None - Test a webhook. Creates a webhook dispatch with a dummy payload. *** #### Returns dict | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L81)update * ****update**(\*, event\_types, request\_url, payload\_template, headers\_template, actor\_id, actor\_task\_id, actor\_run\_id, ignore\_ssl\_errors, do\_not\_retry, is\_ad\_hoc): dict - Update the webhook. 
*** #### Parameters * ##### optionalkeyword-onlyevent\_types: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[WebhookEventType] | None = None List of event types that should trigger the webhook. At least one is required. * ##### optionalkeyword-onlyrequest\_url: str | None = None URL that will be invoked once the webhook is triggered. * ##### optionalkeyword-onlypayload\_template: str | None = None Specification of the payload that will be sent to request\_url. * ##### optionalkeyword-onlyheaders\_template: str | None = None Headers that will be sent to the request\_url. * ##### optionalkeyword-onlyactor\_id: str | None = None Id of the Actor whose runs should trigger the webhook. * ##### optionalkeyword-onlyactor\_task\_id: str | None = None Id of the Actor task whose runs should trigger the webhook. * ##### optionalkeyword-onlyactor\_run\_id: str | None = None Id of the Actor run which should trigger the webhook. * ##### optionalkeyword-onlyignore\_ssl\_errors: bool | None = None Whether the webhook should ignore SSL errors returned by request\_url. * ##### optionalkeyword-onlydo\_not\_retry: bool | None = None Whether the webhook should retry sending the payload to request\_url upon failure. * ##### optionalkeyword-onlyis\_ad\_hoc: bool | None = None Set to True if you want the webhook to be triggered only the first time the condition is fulfilled. Only applicable when actor\_run\_id is filled. #### Returns dict --- # WebhookClientAsync Async sub-client for manipulating a single webhook. ## Index[**](#Index) ### Methods * [**delete](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md#delete) * [**dispatches](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md#dispatches) * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md#get) * [**test](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md#test) * [**update](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookClientAsync.md#update) ## Methods[**](#Methods) ### [**](#delete)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L241)delete * **async **delete**(): None - Delete the webhook. *** #### Returns None ### [**](#dispatches)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L272)dispatches * ****dispatches**(): [WebhookDispatchCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClientAsync.md) - Get dispatches of the webhook. *** #### Returns [WebhookDispatchCollectionClientAsync](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClientAsync.md) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L182)get * **async **get**(): dict | None - Retrieve the webhook. *** #### Returns dict | None ### [**](#test)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L248)test * **async **test**(): dict | None - Test a webhook. Creates a webhook dispatch with a dummy payload. 
*** #### Returns dict | None ### [**](#update)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook.py#L192)update * **async **update**(\*, event\_types, request\_url, payload\_template, headers\_template, actor\_id, actor\_task\_id, actor\_run\_id, ignore\_ssl\_errors, do\_not\_retry, is\_ad\_hoc): dict - Update the webhook. *** #### Parameters * ##### optionalkeyword-onlyevent\_types: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[WebhookEventType] | None = None List of event types that should trigger the webhook. At least one is required. * ##### optionalkeyword-onlyrequest\_url: str | None = None URL that will be invoked once the webhook is triggered. * ##### optionalkeyword-onlypayload\_template: str | None = None Specification of the payload that will be sent to request\_url. * ##### optionalkeyword-onlyheaders\_template: str | None = None Headers that will be sent to the request\_url. * ##### optionalkeyword-onlyactor\_id: str | None = None Id of the Actor whose runs should trigger the webhook. * ##### optionalkeyword-onlyactor\_task\_id: str | None = None Id of the Actor task whose runs should trigger the webhook. * ##### optionalkeyword-onlyactor\_run\_id: str | None = None Id of the Actor run which should trigger the webhook. * ##### optionalkeyword-onlyignore\_ssl\_errors: bool | None = None Whether the webhook should ignore SSL errors returned by request\_url. * ##### optionalkeyword-onlydo\_not\_retry: bool | None = None Whether the webhook should retry sending the payload to request\_url upon failure. * ##### optionalkeyword-onlyis\_ad\_hoc: bool | None = None Set to True if you want the webhook to be triggered only the first time the condition is fulfilled. Only applicable when actor\_run\_id is filled. #### Returns dict --- # WebhookCollectionClient Sub-client for manipulating webhooks. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_collection.py#L45)create * ****create**(\*, event\_types, request\_url, payload\_template, headers\_template, actor\_id, actor\_task\_id, actor\_run\_id, ignore\_ssl\_errors, do\_not\_retry, idempotency\_key, is\_ad\_hoc): dict - Create a new webhook. You have to specify exactly one out of actor\_id, actor\_task\_id or actor\_run\_id. *** #### Parameters * ##### keyword-onlyevent\_types: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[WebhookEventType] List of event types that should trigger the webhook. At least one is required. * ##### keyword-onlyrequest\_url: str URL that will be invoked once the webhook is triggered. * ##### optionalkeyword-onlypayload\_template: str | None = None Specification of the payload that will be sent to request\_url. * ##### optionalkeyword-onlyheaders\_template: str | None = None Headers that will be sent to the request\_url. * ##### optionalkeyword-onlyactor\_id: str | None = None Id of the Actor whose runs should trigger the webhook. 
* ##### optionalkeyword-onlyactor\_task\_id: str | None = None Id of the Actor task whose runs should trigger the webhook. * ##### optionalkeyword-onlyactor\_run\_id: str | None = None Id of the Actor run which should trigger the webhook. * ##### optionalkeyword-onlyignore\_ssl\_errors: bool | None = None Whether the webhook should ignore SSL errors returned by request\_url. * ##### optionalkeyword-onlydo\_not\_retry: bool | None = None If True, the webhook will not retry sending the payload to request\_url upon failure. * ##### optionalkeyword-onlyidempotency\_key: str | None = None A unique identifier of a webhook. You can use it to ensure that you won't create the same webhook multiple times. * ##### optionalkeyword-onlyis\_ad\_hoc: bool | None = None Set to True if you want the webhook to be triggered only the first time the condition is fulfilled. Only applicable when actor\_run\_id is filled. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_collection.py#L24)list * ****list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available webhooks. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many webhooks to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What webhook to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the webhooks in descending order based on their date of creation. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict]
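Putting `WebhookCollectionClient.create()` and `list()` together with the single-webhook client documented above, here is a minimal, illustrative sketch. The `ApifyClient.webhooks()` and `ApifyClient.webhook()` accessors, the `WebhookEventType` import path, and all placeholder tokens and IDs are assumptions, not values taken from this reference.

```python
from apify_client import ApifyClient
from apify_shared.consts import WebhookEventType  # assumed import path for the event type enum

client = ApifyClient('MY-APIFY-TOKEN')  # hypothetical token

# Create a webhook that fires whenever runs of one specific Actor succeed.
# Exactly one of actor_id, actor_task_id or actor_run_id must be provided.
new_webhook = client.webhooks().create(
    event_types=[WebhookEventType.ACTOR_RUN_SUCCEEDED],
    request_url='https://example.com/webhook-endpoint',
    actor_id='MY-ACTOR-ID',  # hypothetical Actor ID
    idempotency_key='my-unique-key',  # guards against creating the same webhook twice
)

# Verify the endpoint with a dummy payload, then list all webhooks, newest first.
client.webhook(new_webhook['id']).test()
for webhook in client.webhooks().list(limit=10, desc=True).items:
    print(webhook['id'])
```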
--- # WebhookCollectionClientAsync Async sub-client for manipulating webhooks. ## Index[**](#Index) ### Methods * [**create](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md#create) * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#create)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_collection.py#L130)create * **async **create**(\*, event\_types, request\_url, payload\_template, headers\_template, actor\_id, actor\_task\_id, actor\_run\_id, ignore\_ssl\_errors, do\_not\_retry, idempotency\_key, is\_ad\_hoc): dict - Create a new webhook. You have to specify exactly one out of actor\_id, actor\_task\_id or actor\_run\_id. *** #### Parameters * ##### keyword-onlyevent\_types: [list](https://docs.apify.com/api/client/python/api/client/python/reference/class/ActorCollectionClient.md#list)\[WebhookEventType] List of event types that should trigger the webhook. At least one is required. * ##### keyword-onlyrequest\_url: str URL that will be invoked once the webhook is triggered. * ##### optionalkeyword-onlypayload\_template: str | None = None Specification of the payload that will be sent to request\_url. * ##### optionalkeyword-onlyheaders\_template: str | None = None Headers that will be sent to the request\_url. * ##### optionalkeyword-onlyactor\_id: str | None = None Id of the Actor whose runs should trigger the webhook. * ##### optionalkeyword-onlyactor\_task\_id: str | None = None Id of the Actor task whose runs should trigger the webhook. * ##### optionalkeyword-onlyactor\_run\_id: str | None = None Id of the Actor run which should trigger the webhook. * ##### optionalkeyword-onlyignore\_ssl\_errors: bool | None = None Whether the webhook should ignore SSL errors returned by request\_url. * ##### optionalkeyword-onlydo\_not\_retry: bool | None = None If True, the webhook will not retry sending the payload to request\_url upon failure. * ##### optionalkeyword-onlyidempotency\_key: str | None = None A unique identifier of a webhook. You can use it to ensure that you won't create the same webhook multiple times. * ##### optionalkeyword-onlyis\_ad\_hoc: bool | None = None Set to True if you want the webhook to be triggered only the first time the condition is fulfilled. Only applicable when actor\_run\_id is filled. #### Returns dict ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_collection.py#L109)list * **async **list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List the available webhooks. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many webhooks to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What webhook to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the webhooks in descending order based on their date of creation. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # WebhookDispatchClient Sub-client for querying information about a webhook dispatch. ## Index[**](#Index) ### Methods * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClient.md#get) ## Methods[**](#Methods) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_dispatch.py#L18)get * ****get**(): dict | None - Retrieve the webhook dispatch. *** #### Returns dict | None --- # WebhookDispatchClientAsync Async sub-client for querying information about a webhook dispatch. ## Index[**](#Index) ### Methods * [**get](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchClientAsync.md#get) ## Methods[**](#Methods) ### [**](#get)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_dispatch.py#L37)get * **async **get**(): dict | None - Retrieve the webhook dispatch. *** #### Returns dict | None --- # WebhookDispatchCollectionClient Sub-client for listing webhook dispatches. ## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClient.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_dispatch_collection.py#L21)list * ****list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List all webhook dispatches of a user. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many webhook dispatches to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What webhook dispatch to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the webhook dispatches in descending order based on the date of their creation. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict]
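As a quick illustration of the dispatch clients above, a minimal sketch follows. The `ApifyClient.webhook_dispatches()` and `ApifyClient.webhook_dispatch()` accessors and the placeholder token are assumptions, not values taken from this reference.

```python
from apify_client import ApifyClient

client = ApifyClient('MY-APIFY-TOKEN')  # hypothetical token

# List the account's most recent webhook dispatches...
dispatch_page = client.webhook_dispatches().list(limit=20, desc=True)

# ...and fetch the full detail of the newest one by its ID.
if dispatch_page.items:
    newest = dispatch_page.items[0]
    print(client.webhook_dispatch(newest['id']).get())
```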
--- # WebhookDispatchCollectionClientAsync Async sub-client for listing webhook dispatches. ## Index[**](#Index) ### Methods * [**list](https://docs.apify.com/api/client/python/api/client/python/reference/class/WebhookDispatchCollectionClientAsync.md#list) ## Methods[**](#Methods) ### [**](#list)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/clients/resource_clients/webhook_dispatch_collection.py#L51)list * **async **list**(\*, limit, offset, desc): [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] - List all webhook dispatches of a user. *** #### Parameters * ##### optionalkeyword-onlylimit: int | None = None How many webhook dispatches to retrieve. * ##### optionalkeyword-onlyoffset: int | None = None What webhook dispatch to include as first when retrieving the list. * ##### optionalkeyword-onlydesc: bool | None = None Whether to sort the webhook dispatches in descending order based on the date of their creation. #### Returns [ListPage](https://docs.apify.com/api/client/python/api/client/python/reference/class/ListPage.md)\[dict] --- # WithLogDetailsClient ## Index[**](#Index) ### Methods * [**\_\_new\_\_](https://docs.apify.com/api/client/python/api/client/python/reference/class/WithLogDetailsClient.md#__new__) ## Methods[**](#Methods) ### [**](#__new__)[**](https://undefined/apify/apify-client-python/blob/master//src/apify_client/_logging.py#L46)\_\_new\_\_ * ****\_\_new\_\_**(name, bases, attrs): [WithLogDetailsClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WithLogDetailsClient.md) - #### Parameters * ##### name: str * ##### bases: tuple * ##### attrs: dict #### Returns [WithLogDetailsClient](https://docs.apify.com/api/client/python/api/client/python/reference/class/WithLogDetailsClient.md)
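To round off this reference, here is a minimal, illustrative sketch of the user-account methods documented earlier in this reference (`UserClient` / `UserClientAsync`). The `ApifyClient.user()` accessor and the placeholder token are assumptions; the calls return plain dictionaries as delivered by the API.

```python
import asyncio

from apify_client import ApifyClientAsync


async def main() -> None:
    apify_client = ApifyClientAsync('MY-APIFY-TOKEN')  # hypothetical token
    user_client = apify_client.user()  # without an ID, refers to the account owning the token

    # Account details - full or public-only, depending on the token's permissions.
    print(await user_client.get())

    # Usage for the current cycle and the account's limits (Billing and Limits pages).
    print(await user_client.monthly_usage())
    print(await user_client.limits())

    # Tighten the account's manageable limits.
    await user_client.update_limits(max_monthly_usage_usd=100, data_retention_days=30)


asyncio.run(main())
```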
--- # Apify API client for Python The Apify API Client for Python is the official library to access the Apify API from your Python applications. It provides useful features like automatic retries and convenience functions to improve your experience with the Apify API. [Get Started](https://docs.apify.com/api/client/python/api/client/python/docs/overview/introduction.md) [GitHub](https://github.com/apify/apify-client-python)

```
pip install apify-client
```

For example, the Apify API Client for Python makes it easy to run your own Actors or Actors from the [Apify Store](https://apify.com/store) by simply using the `.call()` method to start an Actor and wait for it to finish.

```
from apify_client import ApifyClientAsync


async def main() -> None:
    apify_client = ApifyClientAsync('MY-APIFY-TOKEN')

    # Start an Actor and wait for it to finish.
    actor_client = apify_client.actor('john-doe/my-cool-actor')
    call_result = await actor_client.call()

    if call_result is None:
        print('Actor run failed.')
        return

    # Fetch results from the Actor run's default dataset.
    dataset_client = apify_client.dataset(call_result['defaultDatasetId'])
    list_items_result = await dataset_client.list_items()
    print(f'Dataset: {list_items_result}')
```

--- # SDK for JavaScript | Apify Documentation ## sdk - [Search the documentation](https://docs.apify.com/sdk/js/search.md) - [Changelog](https://docs.apify.com/sdk/js/docs/changelog.md): It seems that the changelog is not available. - [Accept user input](https://docs.apify.com/sdk/js/docs/examples/accept-user-input.md): This example accepts and logs user input: - [Add data to dataset](https://docs.apify.com/sdk/js/docs/examples/add-data-to-dataset.md): This example saves data to the default dataset. If the dataset doesn't exist, it will be created. - [Basic crawler](https://docs.apify.com/sdk/js/docs/examples/basic-crawler.md): This is the most bare-bones example of the Apify SDK, which demonstrates some of its building blocks such as the BasicCrawler. You probably don't need to go this deep though, and it would be better to start with one of the full-featured crawlers - [Call actor](https://docs.apify.com/sdk/js/docs/examples/call-actor.md): This example demonstrates how to start an Apify actor using - [Capture a screenshot using Puppeteer](https://docs.apify.com/sdk/js/docs/examples/capture-screenshot.md): To run this example on the Apify Platform, select the apify/actor-node-puppeteer-chrome image for your Dockerfile. - [Cheerio crawler](https://docs.apify.com/sdk/js/docs/examples/cheerio-crawler.md): This example demonstrates how to use CheerioCrawler to crawl a list of URLs from an external file, load each URL using a plain HTTP request, parse the HTML using the Cheerio library and extract some data from it: the page title and all h1 tags.
- [Crawl all links on a website](https://docs.apify.com/sdk/js/docs/examples/crawl-all-links.md): This example uses the enqueueLinks() method to add new links to the RequestQueue as the crawler navigates from page to page. If only the - [Crawl multiple URLs](https://docs.apify.com/sdk/js/docs/examples/crawl-multiple-urls.md): This example crawls the specified list of URLs. - [Crawl a website with relative links](https://docs.apify.com/sdk/js/docs/examples/crawl-relative-links.md): When crawling a website, you may encounter different types of links present that you may want to crawl. - [Crawl a single URL](https://docs.apify.com/sdk/js/docs/examples/crawl-single-url.md): This example uses the got-scraping npm package - [Crawl a sitemap](https://docs.apify.com/sdk/js/docs/examples/crawl-sitemap.md): This example downloads and crawls the URLs from a sitemap. - [Crawl some links on a website](https://docs.apify.com/sdk/js/docs/examples/crawl-some-links.md): This CheerioCrawler example uses the pseudoUrls property in the enqueueLinks() method to only add links to the RequestQueue queue if they match the specified regular expression. - [Forms](https://docs.apify.com/sdk/js/docs/examples/forms.md): This example demonstrates how to use PuppeteerCrawler to - [Dataset Map and Reduce methods](https://docs.apify.com/sdk/js/docs/examples/map-and-reduce.md): This example shows an easy use-case of the Dataset map - [Playwright crawler](https://docs.apify.com/sdk/js/docs/examples/playwright-crawler.md): This example demonstrates how to use PlaywrightCrawler - [Puppeteer crawler](https://docs.apify.com/sdk/js/docs/examples/puppeteer-crawler.md): This example demonstrates how to use PuppeteerCrawler in combination - [Puppeteer recursive crawl](https://docs.apify.com/sdk/js/docs/examples/puppeteer-recursive-crawl.md): Run the following example to perform a recursive crawl of a website using PuppeteerCrawler. - [Puppeteer with proxy](https://docs.apify.com/sdk/js/docs/examples/puppeteer-with-proxy.md): This example demonstrates how to load pages in headless Chrome / Puppeteer over Apify Proxy. - [Apify Platform](https://docs.apify.com/sdk/js/docs/guides/apify-platform.md): Apify platform - large-scale and high-performance web scraping - [Running in Docker](https://docs.apify.com/sdk/js/docs/guides/docker-images.md): Example Docker images to run your crawlers - [Environment Variables](https://docs.apify.com/sdk/js/docs/guides/environment-variables.md): The following is a list of the environment variables used by Apify SDK that are available to the user. - [Pay-per-event Monetization](https://docs.apify.com/sdk/js/docs/guides/pay-per-event.md): Monetize your Actors using the pay-per-event pricing model - [Proxy Management](https://docs.apify.com/sdk/js/docs/guides/proxy-management.md): IP address blocking is one of the oldest - [Request Storage](https://docs.apify.com/sdk/js/docs/guides/request-storage.md): The Apify SDK has several request storage types that are useful for specific tasks. The requests are stored either on local disk to a directory defined by the - [Result Storage](https://docs.apify.com/sdk/js/docs/guides/result-storage.md): The Apify SDK has several result storage types that are useful for specific tasks. 
The data is stored either on local disk to a directory defined by the - [Session Management](https://docs.apify.com/sdk/js/docs/guides/session-management.md): SessionPool is a - [Setting up a TypeScript project](https://docs.apify.com/sdk/js/docs/guides/type-script-actor.md): Apify SDK supports TypeScript by covering public APIs with type declarations. This - [Apify SDK: The scalable web crawling and scraping library for JavaScript](https://docs.apify.com/sdk/js/docs/readme/introduction.md): npm version - [overview](https://docs.apify.com/sdk/js/docs/readme/overview.md): Overview - [support](https://docs.apify.com/sdk/js/docs/readme/support.md): Support - [Upgrading to v1](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v1.md): Summary - [Upgrading to v2](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v2.md): - BREAKING: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy. - [Upgrading to v3](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v3.md): This page summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3. - [apify](https://docs.apify.com/sdk/js/reference.md) - [Changelog](https://docs.apify.com/sdk/js/reference/changelog.md): Change Log - [Actor ](https://docs.apify.com/sdk/js/reference/class/Actor.md): `Actor` class serves as an alternative approach to the static helpers exported from the package. It allows to pass configuration that will be used on the instance methods. Environment variables will have precedence over this configuration. See {@apilink Configuration} for details about what can be configured and what are the default values. - [externalApifyClient](https://docs.apify.com/sdk/js/reference/class/ApifyClient.md): ApifyClient is the official library to access [Apify API](https://docs.apify.com/api/v2) from your JavaScript applications. It runs both in Node.js and browser. - [ChargingManager](https://docs.apify.com/sdk/js/reference/class/ChargingManager.md): Handles pay-per-event charging. - [Configuration](https://docs.apify.com/sdk/js/reference/class/Configuration.md): `Configuration` is a value object holding the SDK configuration. We can use it in two ways: 1. When using `Actor` class, we can get the instance configuration via `sdk.config` ```javascript import { Actor } from 'apify'; import { BasicCrawler } from 'crawlee'; const sdk = new Actor({ token: '123' }); console.log(sdk.config.get('token')); // '123' const crawler = new BasicCrawler({ // ... crawler options }, sdk.config); ``` 2. To get the global configuration (singleton instance). It will respect the environment variables. 
```javascript
import { BasicCrawler, Configuration } from 'crawlee';

// Get the global configuration
const config = Configuration.getGlobalConfig();

// Set the 'persistStateIntervalMillis' option
// of global configuration to 30 seconds
config.set('persistStateIntervalMillis', 30_000);

// No need to pass the configuration to the crawler,
// as it's using the global configuration by default
const crawler = new BasicCrawler();
```

## Supported Configuration Options

Key | Environment Variable | Default Value
---|---|---
`memoryMbytes` | `ACTOR_MEMORY_MBYTES` | -
`headless` | `APIFY_HEADLESS` | -
`persistStateIntervalMillis` | `APIFY_PERSIST_STATE_INTERVAL_MILLIS` | `60e3`
`token` | `APIFY_TOKEN` | -
`isAtHome` | `APIFY_IS_AT_HOME` | -
`defaultDatasetId` | `ACTOR_DEFAULT_DATASET_ID` | `'default'`
`defaultKeyValueStoreId` | `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | `'default'`
`defaultRequestQueueId` | `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | `'default'`

## Advanced Configuration Options

Key | Environment Variable | Default Value
---|---|---
`actorEventsWsUrl` | `ACTOR_EVENTS_WEBSOCKET_URL` | -
`actorId` | `ACTOR_ID` | -
`actorRunId` | `ACTOR_RUN_ID` | -
`actorTaskId` | `ACTOR_TASK_ID` | -
`apiBaseUrl` | `APIFY_API_BASE_URL` | `'https://api.apify.com'`
`containerPort` | `ACTOR_WEB_SERVER_PORT` | `4321`
`containerUrl` | `ACTOR_WEB_SERVER_URL` | `'http://localhost:4321'`
`inputKey` | `ACTOR_INPUT_KEY` | `'INPUT'`
`metamorphAfterSleepMillis` | `APIFY_METAMORPH_AFTER_SLEEP_MILLIS` | `300e3`
`metaOrigin` | `APIFY_META_ORIGIN` | -
`proxyHostname` | `APIFY_PROXY_HOSTNAME` | `'proxy.apify.com'`
`proxyPassword` | `APIFY_PROXY_PASSWORD` | -
`proxyPort` | `APIFY_PROXY_PORT` | `8000`
`proxyStatusUrl` | `APIFY_PROXY_STATUS_URL` | `'http://proxy.apify.com'`
`userId` | `APIFY_USER_ID` | -
`xvfb` | `APIFY_XVFB` | -
`standbyPort` | `ACTOR_STANDBY_PORT` | `4321`
`standbyUrl` | `ACTOR_STANDBY_URL` | -
`chromeExecutablePath` | `APIFY_CHROME_EXECUTABLE_PATH` | -
`defaultBrowserPath` | `APIFY_DEFAULT_BROWSER_PATH` | -

- [externalDataset ](https://docs.apify.com/sdk/js/reference/class/Dataset.md): The `Dataset` class represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - you can only add new records to it but you cannot modify or remove existing records. Typically it is used to store crawling results. Do not instantiate this class directly, use the {@apilink Dataset.open} function instead. `Dataset` stores its data either on local disk or in the Apify cloud, depending on whether the `APIFY_LOCAL_STORAGE_DIR` or `APIFY_TOKEN` environment variables are set. If the `APIFY_LOCAL_STORAGE_DIR` environment variable is set, the data is stored in the local directory in the following files: ``` {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` Note that `{DATASET_ID}` is the name or ID of the dataset. The default dataset has ID: `default`, unless you override it by setting the `APIFY_DEFAULT_DATASET_ID` environment variable. Each dataset item is stored as a separate JSON file, where `{INDEX}` is a zero-based index of the item in the dataset. If the `APIFY_TOKEN` environment variable is set but `APIFY_LOCAL_STORAGE_DIR` not, the data is stored in the [Apify Dataset](https://docs.apify.com/storage/dataset) cloud storage.
Note that you can force usage of the cloud storage also by passing the `forceCloud` option to {@apilink Dataset.open} function, even if the `APIFY_LOCAL_STORAGE_DIR` variable is set. **Example usage:** ```javascript // Write a single row to the default dataset await Dataset.pushData({ col1: 123, col2: 'val2' }); // Open a named dataset const dataset = await Dataset.open('some-name'); // Write a single row await dataset.pushData({ foo: 'bar' }); // Write multiple rows await dataset.pushData([ { foo: 'bar2', col2: 'val2' }, { col3: 123 }, ]); // Export the entirety of the dataset to one file in the key-value store await dataset.exportToCSV('MY-DATA'); ``` - [KeyValueStore](https://docs.apify.com/sdk/js/reference/class/KeyValueStore.md) - [externalLog](https://docs.apify.com/sdk/js/reference/class/Log.md): The log instance enables level aware logging of messages and we advise to use it instead of `console.log()` and its aliases in most development scenarios. A very useful use case for `log` is using `log.debug` liberally throughout the codebase to get useful logging messages only when appropriate log level is set and keeping the console tidy in production environments. The available logging levels are, in this order: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `OFF` and can be referenced from the `log.LEVELS` constant, such as `log.LEVELS.ERROR`. To log messages to the system console, use the `log.level(message)` invocation, such as `log.debug('this is a debug message')`. To prevent writing of messages above a certain log level to the console, simply set the appropriate level. The default log level is `INFO`, which means that `DEBUG` messages will not be printed, unless enabled. **Example:** ```js import log from '@apify/log'; // importing from the Apify SDK or Crawlee is also supported: // import { log } from 'apify'; // import { log } from 'crawlee'; log.info('Information message', { someData: 123 }); // prints message log.debug('Debug message', { debugData: 'hello' }); // doesn't print anything log.setLevel(log.LEVELS.DEBUG); log.debug('Debug message'); // prints message log.setLevel(log.LEVELS.ERROR); log.debug('Debug message'); // doesn't print anything log.info('Info message'); // doesn't print anything log.error('Error message', { errorDetails: 'This is bad!' }); // prints message try { throw new Error('Not good!'); } catch (e) { log.exception(e, 'Exception occurred', { errorDetails: 'This is really bad!' }); // prints message } log.setOptions({ prefix: 'My actor' }); log.info('I am running!'); // prints "My actor: I am running" const childLog = log.child({ prefix: 'Crawler' }); log.info('I am crawling!'); // prints "My actor:Crawler: I am crawling" ``` Another very useful way of setting the log level is by setting the `APIFY_LOG_LEVEL` environment variable, such as `APIFY_LOG_LEVEL=DEBUG`. This way, no code changes are necessary to turn on your debug messages and start debugging right away. To add timestamps to your logs, you can override the default logger settings: ```js log.setOptions({ logger: new log.LoggerText({ skipTime: false }), }); ``` You can customize your logging further by extending or replacing the default logger instances with your own implementations. - [externalLogger](https://docs.apify.com/sdk/js/reference/class/Logger.md): This is an abstract class that should be extended by custom logger classes. this._log() method must be implemented by them. 
- [externalLoggerJson](https://docs.apify.com/sdk/js/reference/class/LoggerJson.md): This is an abstract class that should be extended by custom logger classes. this._log() method must be implemented by them. - [externalLoggerText](https://docs.apify.com/sdk/js/reference/class/LoggerText.md): This is an abstract class that should be extended by custom logger classes. this._log() method must be implemented by them. - [PlatformEventManager](https://docs.apify.com/sdk/js/reference/class/PlatformEventManager.md): Gets an instance of a Node.js' [EventEmitter](https://nodejs.org/api/events.html#events_class_eventemitter) class that emits various events from the SDK or the Apify platform. The event emitter is initialized by calling the {@apilink Actor.main} function. **Example usage:** ```javascript Actor.on('cpuInfo', (data) => { if (data.isCpuOverloaded) console.log('Oh no, the CPU is overloaded!'); }); ``` The following events are emitted: - `cpuInfo`: `{ "isCpuOverloaded": Boolean }` The event is emitted approximately every second and it indicates whether the Actor is using the maximum of available CPU resources. If that's the case, the Actor should not add more workload. For example, this event is used by the {@apilink AutoscaledPool} class. - `migrating`: `void` Emitted when the Actor running on the Apify platform is going to be migrated to another worker server soon. You can use it to persist the state of the Actor and gracefully stop your in-progress tasks, so that they are not interrupted by the migration. For example, this is used by the {@apilink RequestList} class. - `aborting`: `void` When a user aborts an Actor run on the Apify platform, they can choose to abort gracefully to allow the Actor some time before getting killed. This graceful abort emits the `aborting` event which the SDK uses to gracefully stop running crawls and you can use it to do your own cleanup as well. - `persistState`: `{ "isMigrating": Boolean }` Emitted in regular intervals (by default 60 seconds) to notify all components of Apify SDK that it is time to persist their state, in order to avoid repeating all work when the Actor restarts. This event is automatically emitted together with the `migrating` event, in which case the `isMigrating` flag is set to `true`. Otherwise the flag is `false`. Note that the `persistState` event is provided merely for user convenience, you can achieve the same effect using `setInterval()` and listening for the `migrating` event. - [ProxyConfiguration](https://docs.apify.com/sdk/js/reference/class/ProxyConfiguration.md): Configures connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. You can get information about the currently used proxy by inspecting the {@apilink ProxyInfo} property in your crawler's page function. There, you can inspect the proxy's URL and other attributes. The proxy servers are managed by [Apify Proxy](https://docs.apify.com/proxy). To be able to use Apify Proxy, you need an Apify account and access to the selected proxies. If you provide no configuration option, the proxies will be managed automatically using a smart algorithm. If you want to use your own proxies, use the {@apilink ProxyConfigurationOptions.proxyUrls} option. Your list of proxy URLs will be rotated by the configuration if this option is provided. 
**Example usage:** ```javascript const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['GROUP1', 'GROUP2'] // List of Apify Proxy groups countryCode: 'US', }); const crawler = new CheerioCrawler({ // ... proxyConfiguration, requestHandler({ proxyInfo }) { const usedProxyUrl = proxyInfo.url; // Getting the proxy URL } }) ``` - [externalRequestQueue](https://docs.apify.com/sdk/js/reference/class/RequestQueue.md): Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each URL is represented using an instance of the {@apilink Request} class. The queue can only contain unique URLs. More precisely, it can only contain {@apilink Request} instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding {@apilink Request} objects will need to have different `uniqueKey` properties. Do not instantiate this class directly, use the {@apilink RequestQueue.open} function instead. `RequestQueue` is used by {@apilink BasicCrawler}, {@apilink CheerioCrawler}, {@apilink PuppeteerCrawler} and {@apilink PlaywrightCrawler} as a source of URLs to crawl. Unlike {@apilink RequestList}, `RequestQueue` supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch. **Example usage:** ```javascript // Open the default request queue associated with the crawler run const queue = await RequestQueue.open(); // Open a named request queue const queueWithName = await RequestQueue.open('some-name'); // Enqueue few requests await queue.addRequest({ url: 'http://example.com/aaa' }); await queue.addRequest({ url: 'http://example.com/bbb' }); await queue.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true }); ``` - [externalLogLevel](https://docs.apify.com/sdk/js/reference/enum/LogLevel.md) - [AbortOptions](https://docs.apify.com/sdk/js/reference/interface/AbortOptions.md) - [ActorPricingInfo](https://docs.apify.com/sdk/js/reference/interface/ActorPricingInfo.md) - [externalActorRun](https://docs.apify.com/sdk/js/reference/interface/ActorRun.md) - [externalApifyClientOptions](https://docs.apify.com/sdk/js/reference/interface/ApifyClientOptions.md) - [ApifyEnv](https://docs.apify.com/sdk/js/reference/interface/ApifyEnv.md): Parsed representation of the Apify environment variables. This object is returned by the {@apilink Actor.getEnv} function. - [CallOptions](https://docs.apify.com/sdk/js/reference/interface/CallOptions.md) - [CallTaskOptions](https://docs.apify.com/sdk/js/reference/interface/CallTaskOptions.md) - [ChargeOptions](https://docs.apify.com/sdk/js/reference/interface/ChargeOptions.md) - [ChargeResult](https://docs.apify.com/sdk/js/reference/interface/ChargeResult.md) - [ConfigurationOptions](https://docs.apify.com/sdk/js/reference/interface/ConfigurationOptions.md) - [externalDatasetConsumer ](https://docs.apify.com/sdk/js/reference/interface/DatasetConsumer.md): User-function used in the `Dataset.forEach()` API. 
- [externalDatasetContent ](https://docs.apify.com/sdk/js/reference/interface/DatasetContent.md) - [externalDatasetDataOptions](https://docs.apify.com/sdk/js/reference/interface/DatasetDataOptions.md) - [externalDatasetIteratorOptions](https://docs.apify.com/sdk/js/reference/interface/DatasetIteratorOptions.md) - [externalDatasetMapper ](https://docs.apify.com/sdk/js/reference/interface/DatasetMapper.md): User-function used in the `Dataset.map()` API. - [externalDatasetOptions](https://docs.apify.com/sdk/js/reference/interface/DatasetOptions.md) - [externalDatasetReducer ](https://docs.apify.com/sdk/js/reference/interface/DatasetReducer.md): User-function used in the `Dataset.reduce()` API. - [ExitOptions](https://docs.apify.com/sdk/js/reference/interface/ExitOptions.md) - [InitOptions](https://docs.apify.com/sdk/js/reference/interface/InitOptions.md) - [externalKeyConsumer](https://docs.apify.com/sdk/js/reference/interface/KeyConsumer.md): User-function used in the {@apilink KeyValueStore.forEachKey} method. - [externalKeyValueStoreIteratorOptions](https://docs.apify.com/sdk/js/reference/interface/KeyValueStoreIteratorOptions.md) - [externalKeyValueStoreOptions](https://docs.apify.com/sdk/js/reference/interface/KeyValueStoreOptions.md) - [externalLoggerOptions](https://docs.apify.com/sdk/js/reference/interface/LoggerOptions.md) - [MainOptions](https://docs.apify.com/sdk/js/reference/interface/MainOptions.md) - [MetamorphOptions](https://docs.apify.com/sdk/js/reference/interface/MetamorphOptions.md) - [OpenStorageOptions](https://docs.apify.com/sdk/js/reference/interface/OpenStorageOptions.md) - [ProxyConfigurationOptions](https://docs.apify.com/sdk/js/reference/interface/ProxyConfigurationOptions.md) - [ProxyInfo](https://docs.apify.com/sdk/js/reference/interface/ProxyInfo.md): The main purpose of the ProxyInfo object is to provide information about the current proxy connection used by the crawler for the request. Outside of crawlers, you can get this object by calling {@apilink ProxyConfiguration.newProxyInfo}. **Example usage:** ```javascript const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['GROUP1', 'GROUP2'] // List of Apify Proxy groups countryCode: 'US', }); // Getting proxyInfo object by calling class method directly const proxyInfo = proxyConfiguration.newProxyInfo(); // In crawler const crawler = new CheerioCrawler({ // ... proxyConfiguration, requestHandler({ proxyInfo }) { // Getting used proxy URL const proxyUrl = proxyInfo.url; // Getting ID of used Session const sessionIdentifier = proxyInfo.sessionId; } }) ``` - [externalQueueOperationInfo](https://docs.apify.com/sdk/js/reference/interface/QueueOperationInfo.md): A helper class that is used to report results from various {@apilink RequestQueue} functions as well as {@apilink enqueueLinks}. 
- [RebootOptions](https://docs.apify.com/sdk/js/reference/interface/RebootOptions.md) - [externalRecordOptions](https://docs.apify.com/sdk/js/reference/interface/RecordOptions.md) - [externalRequestQueueOperationOptions](https://docs.apify.com/sdk/js/reference/interface/RequestQueueOperationOptions.md) - [externalRequestQueueOptions](https://docs.apify.com/sdk/js/reference/interface/RequestQueueOptions.md) - [WebhookOptions](https://docs.apify.com/sdk/js/reference/interface/WebhookOptions.md) - [Apify SDK for JavaScript and Node.js](https://docs.apify.com/sdk/js/index.md) --- # Full Documentation Content --- # Changelog ###
[apify@3.4.4](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.4.4)[​](#apify344 "Direct link to heading") ##### [3.4.4](https://github.com/apify/apify-sdk-js/compare/apify@3.4.3...apify@3.4.4) (2025-07-28)[​](#344-2025-07-28 "Direct link to heading") **Note:** Version bump only for package apify ##### What's Changed[​](#whats-changed "Direct link to heading") * docs: remove versions page by [@barjin](https://github.com/barjin) in [#416](https://github.com/apify/apify-sdk-js/pull/416) * chore: add `userIsPaying` to the env var interface by [@stepskop](https://github.com/stepskop) in [#415](https://github.com/apify/apify-sdk-js/pull/415) **Full Changelog**: ### [apify@3.4.3](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.4.3)[​](#apify343 "Direct link to heading") ##### [3.4.3](https://github.com/apify/apify-sdk-js/compare/apify@3.4.2...apify@3.4.3) (2025-07-14)[​](#343-2025-07-14 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes "Direct link to heading") * Return eventChargeLimitReached=false for charge calls with count=0 ([#395](https://github.com/apify/apify-sdk-js/issues/395)) ([4f97da0](https://github.com/apify/apify-sdk-js/commit/4f97da0cf4bbda33dcaa3d91f0f543f080dbab8b)), closes [#372](https://github.com/apify/apify-sdk-js/issues/372) ### [apify@3.4.2](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.4.2)[​](#apify342 "Direct link to heading") ##### [3.4.2](https://github.com/apify/apify-sdk-js/compare/apify@3.4.1...apify@3.4.2) (2025-05-19)[​](#342-2025-05-19 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-1 "Direct link to heading") * improve extension of `Configuration` class to fix issues in native ESM projects ([#394](https://github.com/apify/apify-sdk-js/issues/394)) ([8842706](https://github.com/apify/apify-sdk-js/commit/884270611e09a0fec40903958f74d458ba454300)) ### [apify@3.4.1](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.4.1)[​](#apify341 "Direct link to heading") ##### [3.4.1](https://github.com/apify/apify-sdk-js/compare/apify@3.4.0...apify@3.4.1) (2025-05-07)[​](#341-2025-05-07 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-2 "Direct link to heading") * convert `@apilink` to `@link` on build ([#383](https://github.com/apify/apify-sdk-js/issues/383)) ([ccae1ac](https://github.com/apify/apify-sdk-js/commit/ccae1ac9737dfc5bfc64e4586846e413ddb54a37)) * improve check for crawlee version mismatch ([#386](https://github.com/apify/apify-sdk-js/issues/386)) ([721e67d](https://github.com/apify/apify-sdk-js/commit/721e67dbde367b01e1347900b73394221bca0c9d)), closes [#375](https://github.com/apify/apify-sdk-js/issues/375) * prefer proxy password from env var ([#385](https://github.com/apify/apify-sdk-js/issues/385)) ([132b5dc](https://github.com/apify/apify-sdk-js/commit/132b5dc5b0c5b77cad357b4d022b53ab6801a3a2)), closes [#20502](https://github.com/apify/apify-sdk-js/issues/20502) ### [apify@3.4.0](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.4.0)[​](#apify340 "Direct link to heading") #### [3.4.0](https://github.com/apify/apify-sdk-js/compare/apify@3.3.2...apify@3.4.0) (2025-04-01)[​](#340-2025-04-01 "Direct link to heading") ###### Features[​](#features "Direct link to heading") * sign record's public url ([#358](https://github.com/apify/apify-sdk-js/issues/358)) ([6274cc0](https://github.com/apify/apify-sdk-js/commit/6274cc018ab3b02787c234eacbb025d4d973a95d)) ### [apify@3.3.2](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.3.2)[​](#apify332 "Direct link to heading") ##### 
[3.3.2](https://github.com/apify/apify-sdk-js/compare/apify@3.3.1...apify@3.3.2) (2025-02-20)[​](#332-2025-02-20 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-3 "Direct link to heading") * ensure `maxTotalChargeUsd` is correctly mapped to number, consider empty string as infinity ([#361](https://github.com/apify/apify-sdk-js/issues/361)) ([bb65f70](https://github.com/apify/apify-sdk-js/commit/bb65f70be4750c8dc1bee368f849fafa924add39)) ### [apify@3.3.1](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.3.1)[​](#apify331 "Direct link to heading") ##### [3.3.1](https://github.com/apify/apify-sdk-js/compare/apify@3.3.0...apify@3.3.1) (2025-02-19)[​](#331-2025-02-19 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-4 "Direct link to heading") * Add workaround for incorrect handling of maxTotalChargeUsd ([#360](https://github.com/apify/apify-sdk-js/issues/360)) ([acb2abe](https://github.com/apify/apify-sdk-js/commit/acb2abe9a3422e5b0b28972085377664173fd3ff)) ### [apify@3.3.0](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.3.0)[​](#apify330 "Direct link to heading") #### [3.3.0](https://github.com/apify/apify-sdk-js/compare/apify@3.2.6...apify@3.3.0) (2025-02-06)[​](#330-2025-02-06 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-5 "Direct link to heading") * load `initialCount` in `openRequestQueue()` ([#339](https://github.com/apify/apify-sdk-js/issues/339)) ([48548cd](https://github.com/apify/apify-sdk-js/commit/48548cd088365b84a0178ba38c5d3da7f4922ade)) * prevent reboot loops ([#345](https://github.com/apify/apify-sdk-js/issues/345)) ([271bc99](https://github.com/apify/apify-sdk-js/commit/271bc999c1a6c75f8e8359214237b51f8ade03c7)) ###### Features[​](#features-1 "Direct link to heading") * Actor.charge() ([#346](https://github.com/apify/apify-sdk-js/issues/346)) ([e26e496](https://github.com/apify/apify-sdk-js/commit/e26e49669cae04df11f2138b80549e5cd8611b3c)) * add SDK and Crawlee version to the `ApifyClient`'s user agent ([#335](https://github.com/apify/apify-sdk-js/issues/335)) ([9c069a1](https://github.com/apify/apify-sdk-js/commit/9c069a1643f0a5f417765e9391550ae06c50160f)), closes [#331](https://github.com/apify/apify-sdk-js/issues/331) * add standby URL, change default port ([#328](https://github.com/apify/apify-sdk-js/issues/328)) ([7d265f3](https://github.com/apify/apify-sdk-js/commit/7d265f3e2a7dfdda232e0bbf7c6bb73736112950)) ### [apify@3.2.6](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.2.6)[​](#apify326 "Direct link to heading") ##### [3.2.6](https://github.com/apify/apify-sdk-js/compare/apify@3.2.5...apify@3.2.6) (2024-10-07)[​](#326-2024-10-07 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-6 "Direct link to heading") * decode special characters in proxy `username` and `password` ([#326](https://github.com/apify/apify-sdk-js/issues/326)) ([9a7a4d0](https://github.com/apify/apify-sdk-js/commit/9a7a4d0ecc30f21d2be607840ce28903dbf1d191)) ### [apify@3.2.5](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.2.5)[​](#apify325 "Direct link to heading") ##### [3.2.5](https://github.com/apify/apify-sdk-js/compare/apify@3.2.4...apify@3.2.5) (2024-08-14)[​](#325-2024-08-14 "Direct link to heading") ###### Features[​](#features-2 "Direct link to heading") * add `metaOrigin` to `Actor.config` ([#320](https://github.com/apify/apify-sdk-js/issues/320)) ([5a4d2eb](https://github.com/apify/apify-sdk-js/commit/5a4d2ebb9218bd342438f740d035a563444037d3)) ### 
[apify@3.2.4](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.2.4)[​](#apify324 "Direct link to heading") ##### [3.2.4](https://github.com/apify/apify-sdk-js/compare/apify@3.2.3...apify@3.2.4) (2024-07-04)[​](#324-2024-07-04 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-7 "Direct link to heading") * add `standbyPort` to `ConfigurationOptions` ([#311](https://github.com/apify/apify-sdk-js/issues/311)) ([530b8a1](https://github.com/apify/apify-sdk-js/commit/530b8a133f2808c61e079449156e5ed1fe73ce64)) ###### Features[​](#features-3 "Direct link to heading") * add standby port to configuration ([#310](https://github.com/apify/apify-sdk-js/issues/310)) ([cc26098](https://github.com/apify/apify-sdk-js/commit/cc26098da862a7338fdf776956b904d5672a5daf)) ### [apify@3.2.3](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.2.3)[​](#apify323 "Direct link to heading") ##### [3.2.3](https://github.com/apify/apify-sdk-js/compare/apify@3.2.2...apify@3.2.3) (2024-06-03)[​](#323-2024-06-03 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-8 "Direct link to heading") * respect `KVS.getPublicUrl()` from core when not on apify platform ([#302](https://github.com/apify/apify-sdk-js/issues/302)) ([a4d80bb](https://github.com/apify/apify-sdk-js/commit/a4d80bbeee2fd2db145638b17757fa5f673e7452)) ### [apify@3.2.2](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.2.2)[​](#apify322 "Direct link to heading") ##### [3.2.2](https://github.com/apify/apify-sdk-js/compare/apify@3.2.1...apify@3.2.2) (2024-05-23)[​](#322-2024-05-23 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-9 "Direct link to heading") * dont fail on resolution of not installed packages ([0cea251](https://github.com/apify/apify-sdk-js/commit/0cea251b35c652d529320a0570d6b283f52f0ac1)) ### [apify@3.2.1](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.2.1)[​](#apify321 "Direct link to heading") ##### [3.2.1](https://github.com/apify/apify-sdk-js/compare/apify@3.2.0...apify@3.2.1) (2024-05-23)[​](#321-2024-05-23 "Direct link to heading") ###### Features[​](#features-4 "Direct link to heading") * validate crawlee versions in `Actor.init` ([#301](https://github.com/apify/apify-sdk-js/issues/301)) ([66ff6a9](https://github.com/apify/apify-sdk-js/commit/66ff6a9090e9e3321a217e14019e8c3001e3df4d)), closes [#237](https://github.com/apify/apify-sdk-js/issues/237) ### [apify@3.2.0](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.2.0)[​](#apify320 "Direct link to heading") #### [3.2.0](https://github.com/apify/apify-sdk-js/compare/apify@3.1.16...apify@3.2.0) (2024-04-11)[​](#320-2024-04-11 "Direct link to heading") ###### Features[​](#features-5 "Direct link to heading") * support for proxy tiers ([#290](https://github.com/apify/apify-sdk-js/issues/290)) ([fff3a66](https://github.com/apify/apify-sdk-js/commit/fff3a66d3a0fe5080121cc083e27f59db3d979b5)) ### [@apify/scraper-tools@1.1.4](https://github.com/apify/apify-sdk-js/releases/tag/@apify/scraper-tools@1.1.4)[​](#apifyscraper-tools114 "Direct link to heading") ##### [1.1.4](https://github.com/apify/apify-sdk-js/compare/@apify/scraper-tools@1.1.2...@apify/scraper-tools@1.1.4) (2024-03-25)[​](#114-2024-03-25 "Direct link to heading") **Note:** Version bump only for package @apify/scraper-tools ### [apify@3.1.16](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.16)[​](#apify3116 "Direct link to heading") ##### [3.1.16](https://github.com/apify/apify-sdk-js/compare/apify@3.1.15...apify@3.1.16) (2024-02-23)[​](#3116-2024-02-23 "Direct 
link to heading") **Note:** Version bump only for package apify ### [apify@3.1.15](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.15)[​](#apify3115 "Direct link to heading") ##### [3.1.15](https://github.com/apify/apify-sdk-js/compare/apify@3.1.14...apify@3.1.15) (2024-01-08)[​](#3115-2024-01-08 "Direct link to heading") ###### Features[​](#features-6 "Direct link to heading") * ignore proxy configuration locally if no valid token or password is found ([#272](https://github.com/apify/apify-sdk-js/issues/272)) ([0931c2e](https://github.com/apify/apify-sdk-js/commit/0931c2e27e48425bfc58c5df80cd42ed66b9395d)), closes [#262](https://github.com/apify/apify-sdk-js/issues/262) ### [apify@3.1.14](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.14)[​](#apify3114 "Direct link to heading") ##### [3.1.14](https://github.com/apify/apify-sdk-js/compare/apify@3.1.13...apify@3.1.14) (2023-11-27)[​](#3114-2023-11-27 "Direct link to heading") This release updates internal dependencies, including the new got-scraping v4 (used for resolving proxy URLs). ### [apify@3.1.13](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.13)[​](#apify3113 "Direct link to heading") ##### [3.1.13](https://github.com/apify/apify-sdk-js/compare/apify@3.1.12...apify@3.1.13) (2023-11-15)[​](#3113-2023-11-15 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-10 "Direct link to heading") * **apify:** declare got-scraping as dependency ([#252](https://github.com/apify/apify-sdk-js/issues/252)) ([a6bcf1d](https://github.com/apify/apify-sdk-js/commit/a6bcf1d578a7c7ebbb23b3768e8bbf9e94e2b404)) ### [apify@3.1.12](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.12)[​](#apify3112 "Direct link to heading") ##### [3.1.12](https://github.com/apify/apify-sdk-js/compare/apify@3.1.11...apify@3.1.12) (2023-10-05)[​](#3112-2023-10-05 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-11 "Direct link to heading") * add more logging to `Actor.init` and `Actor.exit` ([#236](https://github.com/apify/apify-sdk-js/issues/236)) ([b7e01fc](https://github.com/apify/apify-sdk-js/commit/b7e01fc649de84d6f1391bf95e0f349f7ca32536)) ### [apify@3.1.11](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.11)[​](#apify3111 "Direct link to heading") ##### [3.1.11](https://github.com/apify/apify-sdk-js/compare/apify@3.1.10...apify@3.1.11) (2023-10-04)[​](#3111-2023-10-04 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-12 "Direct link to heading") * run the whole `Actor.exit()` code inside a timeout handler ([#235](https://github.com/apify/apify-sdk-js/issues/235)) ([c8aabae](https://github.com/apify/apify-sdk-js/commit/c8aabaee5f2de1ab40947f47f95f54ccff37cad0)) ###### Features[​](#features-7 "Direct link to heading") * Use `.reboot()` instead of `.metamorph()` for reboot ([#227](https://github.com/apify/apify-sdk-js/issues/227)) ([8c0bff5](https://github.com/apify/apify-sdk-js/commit/8c0bff5a8d3ea65e532b3700b34b9c563856158a)) ### [apify@3.1.10](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.10)[​](#apify3110 "Direct link to heading") ##### [3.1.10](https://github.com/apify/apify-sdk-js/compare/apify@3.1.9...apify@3.1.10) (2023-09-07)[​](#3110-2023-09-07 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-13 "Direct link to heading") * require newer version of crawlee to fix possible issues with `purgeDefaultStorages` ([#226](https://github.com/apify/apify-sdk-js/issues/226)) 
([95cf31f](https://github.com/apify/apify-sdk-js/commit/95cf31f3d1d054a1c8e3daac89f41bbb0aaddbba)) ### [apify@3.1.9](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.9)[​](#apify319 "Direct link to heading") ##### [3.1.9](https://github.com/apify/apify-sdk-js/compare/apify@3.1.8...apify@3.1.9) (2023-09-06)[​](#319-2023-09-06 "Direct link to heading") **Note:** Version bump only for package apify ### [@apify/scraper-tools@1.1.2](https://github.com/apify/apify-sdk-js/releases/tag/@apify/scraper-tools@1.1.2)[​](#apifyscraper-tools112 "Direct link to heading") ##### [1.1.2](https://github.com/apify/apify-sdk-js/compare/@apify/scraper-tools@1.1.1...@apify/scraper-tools@1.1.2) (2023-07-28)[​](#112-2023-07-28 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-14 "Direct link to heading") * declare missing dependency on tslib ([bc27118](https://github.com/apify/apify-sdk-js/commit/bc27118daab211857305f7617b1ee1433da13d4a)) ###### Features[​](#features-8 "Direct link to heading") * terminal message on Actor.exit() ([#172](https://github.com/apify/apify-sdk-js/issues/172)) ([e0feca8](https://github.com/apify/apify-sdk-js/commit/e0feca895766af0d92fbf78ca4c2d7b49bd2acff)) ### [apify@3.1.8](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.8)[​](#apify318 "Direct link to heading") ##### [3.1.8](https://github.com/apify/apify-sdk-js/compare/apify@3.1.7...apify@3.1.8) (2023-07-20)[​](#318-2023-07-20 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-15 "Direct link to heading") * require newer version of apify-client and other packages ([24a3a4b](https://github.com/apify/apify-sdk-js/commit/24a3a4b5bf2f61e690348727e7f24c06c45a0999)) ###### Features[​](#features-9 "Direct link to heading") * Use Actor env vars ([#216](https://github.com/apify/apify-sdk-js/issues/216)) ([11ff740](https://github.com/apify/apify-sdk-js/commit/11ff740ad3d2bdd37fce011d94b64ea01413b0d9)) ### [apify@3.1.7](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.7)[​](#apify317 "Direct link to heading") ##### [3.1.7](https://github.com/apify/apify-sdk-js/compare/apify@3.1.6...apify@3.1.7) (2023-06-09)[​](#317-2023-06-09 "Direct link to heading") **Note:** Version bump only for package apify ### [apify@3.1.6](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.6)[​](#apify316 "Direct link to heading") ##### [3.1.6](https://github.com/apify/apify-sdk-js/compare/apify@3.1.5...apify@3.1.6) (2023-06-09)[​](#316-2023-06-09 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-16 "Direct link to heading") * only print status message when explicitly provided in `Actor.exit()` ([#203](https://github.com/apify/apify-sdk-js/issues/203)) ([85159e4](https://github.com/apify/apify-sdk-js/commit/85159e499984c78eee90b6d92332ea63b9f46c8c)) ### [apify@3.1.5](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.5)[​](#apify315 "Direct link to heading") ##### [3.1.5](https://github.com/apify/apify-sdk-js/compare/apify@3.1.4...apify@3.1.5) (2023-05-31)[​](#315-2023-05-31 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-17 "Direct link to heading") * add missing `options` parameter to `Actor.setStatusMessage()` ([712e8c6](https://github.com/apify/apify-sdk-js/commit/712e8c66755ac8baeb35fcc1ad000487da8b2c48)) ###### Features[​](#features-10 "Direct link to heading") * add `Actor.getInputOrThrow()` method ([#198](https://github.com/apify/apify-sdk-js/issues/198)) ([5fbbfe4](https://github.com/apify/apify-sdk-js/commit/5fbbfe4960a79fbbd23f4fdd7d07a1a5063820f4)) ### 
[apify@3.1.4](https://github.com/apify/apify-sdk-js/releases/tag/apify@3.1.4)[​](#apify314 "Direct link to heading") ##### [3.1.4](https://github.com/apify/apify-sdk-js/compare/apify@3.1.3...apify@3.1.4) (2023-03-23)[​](#314-2023-03-23 "Direct link to heading") ###### Bug Fixes[​](#bug-fixes-18 "Direct link to heading") * log status message only once and without prefix ([#179](https://github.com/apify/apify-sdk-js/issues/179)) ([1f11a6a](https://github.com/apify/apify-sdk-js/commit/1f11a6ad8ebc8a0cfaef58be47ba8b12c75018f1)) --- # Accept user input This example accepts and logs user input: ``` import { Actor } from 'apify'; await Actor.init(); const input = await Actor.getInput(); console.log(input); await Actor.exit(); ``` To provide the actor with input, create a `INPUT.json` file inside the "default" key-value store: ``` {PROJECT_FOLDER}/storage/key_value_stores/default/INPUT.json ``` Anything in this file will be available to the actor when it runs. To learn about other ways to provide an actor with input, refer to the [Apify Platform Documentation](https://apify.com/docs/actor#run). --- # Add data to dataset This example saves data to the default dataset. If the dataset doesn't exist, it will be created. You can save data to custom datasets by using [`Actor.openDataset()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#open) [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbi8vIENyZWF0ZSBhIGRhdGFzZXQgd2hlcmUgd2Ugd2lsbCBzdG9yZSB0aGUgcmVzdWx0cy5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgLy8gRnVuY3Rpb24gY2FsbGVkIGZvciBlYWNoIFVSTFxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGJvZHkgfSkge1xcbiAgICAgICAgLy8gU2F2ZSBkYXRhIHRvIGRlZmF1bHQgZGF0YXNldFxcbiAgICAgICAgYXdhaXQgQWN0b3IucHVzaERhdGEoe1xcbiAgICAgICAgICAgIHVybDogcmVxdWVzdC51cmwsXFxuICAgICAgICAgICAgaHRtbDogYm9keSxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKFtcXG4gICAgeyB1cmw6ICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMScgfSxcXG4gICAgeyB1cmw6ICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMicgfSxcXG4gICAgeyB1cmw6ICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMycgfSxcXG5dKTtcXG5cXG5hd2FpdCBBY3Rvci5leGl0KCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.fhpAfqCjjEMd7THx-jtJurjuRe7si1RztaBrOcDRcQ8\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); // Create a dataset where we will store the results. const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ request, body }) { // Save data to default dataset await Actor.pushData({ url: request.url, html: body, }); }, }); // Run the crawler await crawler.run([ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]); await Actor.exit(); ``` Each item in this dataset will be saved to its own file in the following directory: ``` {PROJECT_FOLDER}/storage/datasets/default/ ``` --- # Basic crawler This is the most bare-bones example of the Apify SDK, which demonstrates some of its building blocks such as the [`BasicCrawler`](https://crawlee.dev/api/basic-crawler/class/BasicCrawler). 
You probably don't need to go this deep though, and it would be better to start with one of the full-featured crawlers like [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) or [`PlaywrightCrawler`](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler). The script simply downloads several web pages with plain HTTP requests using the [`got-scraping`](https://github.com/apify/got-scraping) npm package and stores their raw HTML and URL in the default dataset. In local configuration, the data will be stored as JSON files in `./storage/datasets/default`. [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IEJhc2ljQ3Jhd2xlciB9IGZyb20gJ2NyYXdsZWUnO1xcbmltcG9ydCB7IGdvdFNjcmFwaW5nIH0gZnJvbSAnZ290LXNjcmFwaW5nJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuLy8gQ3JlYXRlIGEgZGF0YXNldCB3aGVyZSB3ZSB3aWxsIHN0b3JlIHRoZSByZXN1bHRzLlxcbi8vIENyZWF0ZSBhIEJhc2ljQ3Jhd2xlciAtIHRoZSBzaW1wbGVzdCBjcmF3bGVyIHRoYXQgZW5hYmxlc1xcbi8vIHVzZXJzIHRvIGltcGxlbWVudCB0aGUgY3Jhd2xpbmcgbG9naWMgdGhlbXNlbHZlcy5cXG5jb25zdCBjcmF3bGVyID0gbmV3IEJhc2ljQ3Jhd2xlcih7XFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gd2lsbCBiZSBjYWxsZWQgZm9yIGVhY2ggVVJMIHRvIGNyYXdsLlxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QgfSkge1xcbiAgICAgICAgY29uc3QgeyB1cmwgfSA9IHJlcXVlc3Q7XFxuICAgICAgICBjb25zb2xlLmxvZyhgUHJvY2Vzc2luZyAke3VybH0uLi5gKTtcXG5cXG4gICAgICAgIC8vIEZldGNoIHRoZSBwYWdlIEhUTUwgdmlhIEFwaWZ5IHV0aWxzIGdvdFNjcmFwaW5nXFxuICAgICAgICBjb25zdCB7IGJvZHkgfSA9IGF3YWl0IGdvdFNjcmFwaW5nKHsgdXJsIH0pO1xcblxcbiAgICAgICAgLy8gU3RvcmUgdGhlIEhUTUwgYW5kIFVSTCB0byB0aGUgZGVmYXVsdCBkYXRhc2V0LlxcbiAgICAgICAgYXdhaXQgQWN0b3IucHVzaERhdGEoe1xcbiAgICAgICAgICAgIHVybDogcmVxdWVzdC51cmwsXFxuICAgICAgICAgICAgaHRtbDogYm9keSxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFRoZSBpbml0aWFsIGxpc3Qgb2YgVVJMcyB0byBjcmF3bC4gSGVyZSB3ZSB1c2UganVzdCBhIGZldyBoYXJkLWNvZGVkIFVSTHMuXFxuYXdhaXQgY3Jhd2xlci5ydW4oW1xcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZ29vZ2xlLmNvbS8nIH0sXFxuICAgIHsgdXJsOiAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS8nIH0sXFxuICAgIHsgdXJsOiAnaHR0cDovL3d3dy5iaW5nLmNvbS8nIH0sXFxuICAgIHsgdXJsOiAnaHR0cDovL3d3dy53aWtpcGVkaWEuY29tLycgfSxcXG5dKTtcXG5cXG5jb25zb2xlLmxvZygnQ3Jhd2xlciBmaW5pc2hlZC4nKTtcXG5cXG5hd2FpdCBBY3Rvci5leGl0KCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.-TdH8qt-fjSHPGoP8mJHr2LqYkhq6aWUhY9IdesMFrM\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { BasicCrawler } from 'crawlee'; import { gotScraping } from 'got-scraping'; await Actor.init(); // Create a dataset where we will store the results. // Create a BasicCrawler - the simplest crawler that enables // users to implement the crawling logic themselves. const crawler = new BasicCrawler({ // This function will be called for each URL to crawl. async requestHandler({ request }) { const { url } = request; console.log(`Processing ${url}...`); // Fetch the page HTML via Apify utils gotScraping const { body } = await gotScraping({ url }); // Store the HTML and URL to the default dataset. await Actor.pushData({ url: request.url, html: body, }); }, }); // The initial list of URLs to crawl. Here we use just a few hard-coded URLs. 
await crawler.run([ { url: 'http://www.google.com/' }, { url: 'http://www.example.com/' }, { url: 'http://www.bing.com/' }, { url: 'http://www.wikipedia.com/' }, ]); console.log('Crawler finished.'); await Actor.exit(); ``` --- # Call actor This example demonstrates how to start an Apify actor using [`Actor.call()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#call) and how to call the Apify API using [`Actor.newClient()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#newClient). The script gets a random weird word and its explanation from [randomword.com](https://randomword.com/) and sends it to your email using the [`apify/send-mail`](https://apify.com/apify/send-mail) actor. To make the example work, you'll need an [Apify account](https://console.apify.com/). Go to the [Settings - Integrations](https://console.apify.com/account?tab=integrations) page to obtain your API token and set it to the [`APIFY_TOKEN`](https://docs.apify.com/sdk/js/sdk/js/docs/guides/environment-variables.md#APIFY_TOKEN) environment variable, or run the script using the Apify CLI. If you deploy this actor to the Apify Cloud, you can do things like set up a scheduler to run your actor early in the morning. To see what other actors are available, visit the [Apify Store](https://apify.com/store). > To run this example on the Apify Platform, use the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. ``` import { Actor } from 'apify'; import { launchPuppeteer } from 'crawlee'; await Actor.init(); // Launch the web browser. const browser = await launchPuppeteer(); console.log('Obtaining own email address...'); const apifyClient = Actor.newClient(); const { email } = await apifyClient.user().get(); // Load randomword.com and get a random word console.log('Fetching a random word.'); const page = await browser.newPage(); await page.goto('https://randomword.com/'); const randomWord = await page.$eval('#shared_section', (el) => el.outerHTML); // Send random word to your email. For that, you can use an actor we already // have available on the platform under the name: apify/send-mail. // The second parameter to the Actor.call() invocation is the actor's // desired input. You can find the required input parameters by checking // the actor's documentation page: https://apify.com/apify/send-mail console.log(`Sending email to ${email}...`); await Actor.call('apify/send-mail', { to: email, subject: 'Random Word', html: `
<h1>Random Word</h1>
${randomWord}`, }); console.log('Email sent. Good luck!'); // Close Browser await browser.close(); await Actor.exit(); ``` --- # Capture a screenshot using Puppeteer tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. This example captures a screenshot of a web page using `Puppeteer`. It would look almost exactly the same with `Playwright`. * Page Screenshot * Crawler Utils Screenshot Using `page.screenshot()`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IGxhdW5jaFB1cHBldGVlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbmF3YWl0IEFjdG9yLmluaXQoKTtcXG5cXG5jb25zdCB1cmwgPSAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS8nO1xcbi8vIFN0YXJ0IGEgYnJvd3NlclxcbmNvbnN0IGJyb3dzZXIgPSBhd2FpdCBsYXVuY2hQdXBwZXRlZXIoKTtcXG5cXG4vLyBPcGVuIG5ldyB0YWIgaW4gdGhlIGJyb3dzZXJcXG5jb25zdCBwYWdlID0gYXdhaXQgYnJvd3Nlci5uZXdQYWdlKCk7XFxuXFxuLy8gTmF2aWdhdGUgdG8gdGhlIFVSTFxcbmF3YWl0IHBhZ2UuZ290byh1cmwpO1xcblxcbi8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3RcXG5jb25zdCBzY3JlZW5zaG90ID0gYXdhaXQgcGFnZS5zY3JlZW5zaG90KCk7XFxuXFxuLy8gU2F2ZSB0aGUgc2NyZWVuc2hvdCB0byB0aGUgZGVmYXVsdCBrZXktdmFsdWUgc3RvcmVcXG5hd2FpdCBBY3Rvci5zZXRWYWx1ZSgnbXkta2V5Jywgc2NyZWVuc2hvdCwgeyBjb250ZW50VHlwZTogJ2ltYWdlL3BuZycgfSk7XFxuXFxuLy8gQ2xvc2UgUHVwcGV0ZWVyXFxuYXdhaXQgYnJvd3Nlci5jbG9zZSgpO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.xy-Qn13nROyNEPEB6pUG8xQ1VfIjq56rsat4wKqhq9o\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { launchPuppeteer } from 'crawlee'; await Actor.init(); const url = 'http://www.example.com/'; // Start a browser const browser = await launchPuppeteer(); // Open new tab in the browser const page = await browser.newPage(); // Navigate to the URL await page.goto(url); // Capture the screenshot const screenshot = await page.screenshot(); // Save the screenshot to the default key-value store await Actor.setValue('my-key', screenshot, { contentType: 'image/png' }); // Close Puppeteer await browser.close(); await Actor.exit(); ``` Using `puppeteerUtils.saveSnapshot()`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IGxhdW5jaFB1cHBldGVlciwgdXRpbHMgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgdXJsID0gJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vJztcXG4vLyBTdGFydCBhIGJyb3dzZXJcXG5jb25zdCBicm93c2VyID0gYXdhaXQgbGF1bmNoUHVwcGV0ZWVyKCk7XFxuXFxuLy8gT3BlbiBuZXcgdGFiIGluIHRoZSBicm93c2VyXFxuY29uc3QgcGFnZSA9IGF3YWl0IGJyb3dzZXIubmV3UGFnZSgpO1xcblxcbi8vIE5hdmlnYXRlIHRvIHRoZSBVUkxcXG5hd2FpdCBwYWdlLmdvdG8odXJsKTtcXG5cXG4vLyBDYXB0dXJlIHRoZSBzY3JlZW5zaG90XFxuYXdhaXQgdXRpbHMucHVwcGV0ZWVyLnNhdmVTbmFwc2hvdChwYWdlLCB7IGtleTogJ215LWtleScsIHNhdmVIdG1sOiBmYWxzZSB9KTtcXG5cXG4vLyBDbG9zZSBQdXBwZXRlZXJcXG5hd2FpdCBicm93c2VyLmNsb3NlKCk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.QSyAaQjtq2wJi2-pHooiFMBrLOELGoFYIBj8kQcDYtA\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { launchPuppeteer, utils } from 'crawlee'; await Actor.init(); const url = 'http://www.example.com/'; // Start a browser const browser = await launchPuppeteer(); // 
Open new tab in the browser const page = await browser.newPage(); // Navigate to the URL await page.goto(url); // Capture the screenshot await utils.puppeteer.saveSnapshot(page, { key: 'my-key', saveHtml: false }); // Close Puppeteer await browser.close(); await Actor.exit(); ``` This example captures a screenshot of multiple web pages when using `PuppeteerCrawler`: * Page Screenshot * Crawler Utils Screenshot Using `page.screenshot()`: [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuLy8gQ3JlYXRlIGEgUHVwcGV0ZWVyQ3Jhd2xlclxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUHVwcGV0ZWVyQ3Jhd2xlcih7XFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgcGFnZSB9KSB7XFxuICAgICAgICAvLyBDYXB0dXJlIHRoZSBzY3JlZW5zaG90IHdpdGggUHVwcGV0ZWVyXFxuICAgICAgICBjb25zdCBzY3JlZW5zaG90ID0gYXdhaXQgcGFnZS5zY3JlZW5zaG90KCk7XFxuICAgICAgICAvLyBDb252ZXJ0IHRoZSBVUkwgaW50byBhIHZhbGlkIGtleVxcbiAgICAgICAgY29uc3Qga2V5ID0gcmVxdWVzdC51cmwucmVwbGFjZSgvWzovXS9nLCAnXycpO1xcbiAgICAgICAgLy8gU2F2ZSB0aGUgc2NyZWVuc2hvdCB0byB0aGUgZGVmYXVsdCBrZXktdmFsdWUgc3RvcmVcXG4gICAgICAgIGF3YWl0IEFjdG9yLnNldFZhbHVlKGtleSwgc2NyZWVuc2hvdCwgeyBjb250ZW50VHlwZTogJ2ltYWdlL3BuZycgfSk7XFxuICAgIH0sXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oW1xcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJyB9LFxcbiAgICB7IHVybDogJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJyB9LFxcbl0pO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.V_BcbfCWH__rcmGznaMSLm6R1wTtqF583QKH4Z3n5Uc\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); // Create a PuppeteerCrawler const crawler = new PuppeteerCrawler({ async requestHandler({ request, page }) { // Capture the screenshot with Puppeteer const screenshot = await page.screenshot(); // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Save the screenshot to the default key-value store await Actor.setValue(key, screenshot, { contentType: 'image/png' }); }, }); // Run the crawler await crawler.run([ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]); await Actor.exit(); ``` Using `puppeteerUtils.saveSnapshot()`: [Run 
on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIsIHB1cHBldGVlclV0aWxzIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbi8vIENyZWF0ZSBhIFB1cHBldGVlckNyYXdsZXJcXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UgfSkge1xcbiAgICAgICAgLy8gQ29udmVydCB0aGUgVVJMIGludG8gYSB2YWxpZCBrZXlcXG4gICAgICAgIGNvbnN0IGtleSA9IHJlcXVlc3QudXJsLnJlcGxhY2UoL1s6L10vZywgJ18nKTtcXG4gICAgICAgIC8vIENhcHR1cmUgdGhlIHNjcmVlbnNob3RcXG4gICAgICAgIGF3YWl0IHB1cHBldGVlclV0aWxzLnNhdmVTbmFwc2hvdChwYWdlLCB7IGtleSwgc2F2ZUh0bWw6IGZhbHNlIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKFtcXG4gICAgeyB1cmw6ICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMScgfSxcXG4gICAgeyB1cmw6ICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMicgfSxcXG4gICAgeyB1cmw6ICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMycgfSxcXG5dKTtcXG5cXG5hd2FpdCBBY3Rvci5leGl0KCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.udR8araTvFL0crHf63ENyHe6LCZ4yd1J7FwSdJauc5M\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PuppeteerCrawler, puppeteerUtils } from 'crawlee'; await Actor.init(); // Create a PuppeteerCrawler const crawler = new PuppeteerCrawler({ async requestHandler({ request, page }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Capture the screenshot await puppeteerUtils.saveSnapshot(page, { key, saveHtml: false }); }, }); // Run the crawler await crawler.run([ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]); await Actor.exit(); ``` In both examples using `page.screenshot()`, a `key` variable is created based on the URL of the web page. This variable is used as the key when saving each screenshot into a key-value store. --- # Cheerio crawler This example demonstrates how to use [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) to crawl a list of URLs from an external file, load each URL using a plain HTTP request, parse the HTML using the [Cheerio library](https://www.npmjs.com/package/cheerio) and extract some data from it: the page title and all `h1` tags. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBsb2csIExvZ0xldmVsIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuLy8gQ3Jhd2xlcnMgY29tZSB3aXRoIHZhcmlvdXMgdXRpbGl0aWVzLCBlLmcuIGZvciBsb2dnaW5nLlxcbi8vIEhlcmUgd2UgdXNlIGRlYnVnIGxldmVsIG9mIGxvZ2dpbmcgdG8gaW1wcm92ZSB0aGUgZGVidWdnaW5nIGV4cGVyaWVuY2UuXFxuLy8gVGhpcyBmdW5jdGlvbmFsaXR5IGlzIG9wdGlvbmFsIVxcbmxvZy5zZXRMZXZlbChMb2dMZXZlbC5ERUJVRyk7XFxuXFxuLy8gQ3JlYXRlIGFuIGluc3RhbmNlIG9mIHRoZSBDaGVlcmlvQ3Jhd2xlciBjbGFzcyAtIGEgY3Jhd2xlclxcbi8vIHRoYXQgYXV0b21hdGljYWxseSBsb2FkcyB0aGUgVVJMcyBhbmQgcGFyc2VzIHRoZWlyIEhUTUwgdXNpbmcgdGhlIGNoZWVyaW8gbGlicmFyeS5cXG5jb25zdCBjcmF3bGVyID0gbmV3IENoZWVyaW9DcmF3bGVyKHtcXG4gICAgLy8gVGhlIGNyYXdsZXIgZG93bmxvYWRzIGFuZCBwcm9jZXNzZXMgdGhlIHdlYiBwYWdlcyBpbiBwYXJhbGxlbCwgd2l0aCBhIGNvbmN1cnJlbmN5XFxuICAgIC8vIGF1dG9tYXRpY2FsbHkgbWFuYWdlZCBiYXNlZCBvbiB0aGUgYXZhaWxhYmxlIHN5c3RlbSBtZW1vcnkgYW5kIENQVSAoc2VlIEF1dG9zY2FsZWRQb29sIGNsYXNzKS5cXG4gICAgLy8gSGVyZSB3ZSBkZWZpbmUgc29tZSBoYXJkIGxpbWl0cyBmb3IgdGhlIGNvbmN1cnJlbmN5LlxcbiAgICBtaW5Db25jdXJyZW5jeTogMTAsXFxuICAgIG1heENvbmN1cnJlbmN5OiA1MCxcXG5cXG4gICAgLy8gT24gZXJyb3IsIHJldHJ5IGVhY2ggcGFnZSBhdCBtb3N0IG9uY2UuXFxuICAgIG1heFJlcXVlc3RSZXRyaWVzOiAxLFxcblxcbiAgICAvLyBJbmNyZWFzZSB0aGUgdGltZW91dCBmb3IgcHJvY2Vzc2luZyBvZiBlYWNoIHBhZ2UuXFxuICAgIHJlcXVlc3RIYW5kbGVyVGltZW91dFNlY3M6IDMwLFxcblxcbiAgICAvLyBMaW1pdCB0byAxMCByZXF1ZXN0cyBwZXIgb25lIGNyYXdsXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLFxcblxcbiAgICAvLyBUaGlzIGZ1bmN0aW9uIHdpbGwgYmUgY2FsbGVkIGZvciBlYWNoIFVSTCB0byBjcmF3bC5cXG4gICAgLy8gSXQgYWNjZXB0cyBhIHNpbmdsZSBwYXJhbWV0ZXIsIHdoaWNoIGlzIGFuIG9iamVjdCB3aXRoIG9wdGlvbnMgYXM6XFxuICAgIC8vIGh0dHBzOi8vc2RrLmFwaWZ5LmNvbS9kb2NzL3R5cGVkZWZzL2NoZWVyaW8tY3Jhd2xlci1vcHRpb25zI2hhbmRsZXBhZ2VmdW5jdGlvblxcbiAgICAvLyBXZSB1c2UgZm9yIGRlbW9uc3RyYXRpb24gb25seSAyIG9mIHRoZW06XFxuICAgIC8vIC0gcmVxdWVzdDogYW4gaW5zdGFuY2Ugb2YgdGhlIFJlcXVlc3QgY2xhc3Mgd2l0aCBpbmZvcm1hdGlvbiBzdWNoIGFzIFVSTCBhbmQgSFRUUCBtZXRob2RcXG4gICAgLy8gLSAkOiB0aGUgY2hlZXJpbyBvYmplY3QgY29udGFpbmluZyBwYXJzZWQgSFRNTFxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsICQgfSkge1xcbiAgICAgICAgbG9nLmRlYnVnKGBQcm9jZXNzaW5nICR7cmVxdWVzdC51cmx9Li4uYCk7XFxuXFxuICAgICAgICAvLyBFeHRyYWN0IGRhdGEgZnJvbSB0aGUgcGFnZSB1c2luZyBjaGVlcmlvLlxcbiAgICAgICAgY29uc3QgdGl0bGUgPSAkKCd0aXRsZScpLnRleHQoKTtcXG4gICAgICAgIGNvbnN0IGgxdGV4dHMgPSBbXTtcXG4gICAgICAgICQoJ2gxJykuZWFjaCgoaW5kZXgsIGVsKSA9PiB7XFxuICAgICAgICAgICAgaDF0ZXh0cy5wdXNoKHtcXG4gICAgICAgICAgICAgICAgdGV4dDogJChlbCkudGV4dCgpLFxcbiAgICAgICAgICAgIH0pO1xcbiAgICAgICAgfSk7XFxuXFxuICAgICAgICAvLyBTdG9yZSB0aGUgcmVzdWx0cyB0byB0aGUgZGF0YXNldC4gSW4gbG9jYWwgY29uZmlndXJhdGlvbixcXG4gICAgICAgIC8vIHRoZSBkYXRhIHdpbGwgYmUgc3RvcmVkIGFzIEpTT04gZmlsZXMgaW4gLi9zdG9yYWdlL2RhdGFzZXRzL2RlZmF1bHRcXG4gICAgICAgIGF3YWl0IEFjdG9yLnB1c2hEYXRhKHtcXG4gICAgICAgICAgICB1cmw6IHJlcXVlc3QudXJsLFxcbiAgICAgICAgICAgIHRpdGxlLFxcbiAgICAgICAgICAgIGgxdGV4dHMsXFxuICAgICAgICB9KTtcXG4gICAgfSxcXG5cXG4gICAgLy8gVGhpcyBmdW5jdGlvbiBpcyBjYWxsZWQgaWYgdGhlIHBhZ2UgcHJvY2Vzc2luZyBmYWlsZWQgbW9yZSB0aGFuIG1heFJlcXVlc3RSZXRyaWVzKzEgdGltZXMuXFxuICAgIGZhaWxlZFJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCB9KSB7XFxuICAgICAgICBsb2cuZGVidWcoYFJlcXVlc3QgJHtyZXF1ZXN0LnVybH0gZmFpbGVkIHR3aWNlLmApO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKCk7XFxuXFxubG9nLmRlYnVnKCdDcmF3bGVyIGZpbmlzaGVkLicpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.d
OZILM56nUwSSVMoLgQB0brbbjQm2W2FDao35eLD72s\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler, log, LogLevel } from 'crawlee'; await Actor.init(); // Crawlers come with various utilities, e.g. for logging. // Here we use debug level of logging to improve the debugging experience. // This functionality is optional! log.setLevel(LogLevel.DEBUG); // Create an instance of the CheerioCrawler class - a crawler // that automatically loads the URLs and parses their HTML using the cheerio library. const crawler = new CheerioCrawler({ // The crawler downloads and processes the web pages in parallel, with a concurrency // automatically managed based on the available system memory and CPU (see AutoscaledPool class). // Here we define some hard limits for the concurrency. minConcurrency: 10, maxConcurrency: 50, // On error, retry each page at most once. maxRequestRetries: 1, // Increase the timeout for processing of each page. requestHandlerTimeoutSecs: 30, // Limit to 10 requests per one crawl maxRequestsPerCrawl: 10, // This function will be called for each URL to crawl. // It accepts a single parameter, which is an object with options as: // https://sdk.apify.com/docs/typedefs/cheerio-crawler-options#handlepagefunction // We use for demonstration only 2 of them: // - request: an instance of the Request class with information such as URL and HTTP method // - $: the cheerio object containing parsed HTML async requestHandler({ request, $ }) { log.debug(`Processing ${request.url}...`); // Extract data from the page using cheerio. const title = $('title').text(); const h1texts = []; $('h1').each((index, el) => { h1texts.push({ text: $(el).text(), }); }); // Store the results to the dataset. In local configuration, // the data will be stored as JSON files in ./storage/datasets/default await Actor.pushData({ url: request.url, title, h1texts, }); }, // This function is called if the page processing failed more than maxRequestRetries+1 times. failedRequestHandler({ request }) { log.debug(`Request ${request.url} failed twice.`); }, }); // Run the crawler and wait for it to finish. await crawler.run(); log.debug('Crawler finished.'); await Actor.exit(); ``` --- # Crawl all links on a website This example uses the `enqueueLinks()` method to add new links to the `RequestQueue` as the crawler navigates from page to page. If only the required parameters are defined, all links will be crawled.
* Cheerio Crawler * Puppeteer Crawler * Playwright Crawler Using `CheerioCrawler`: [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgQ2hlZXJpb0NyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcyB9KSB7XFxuICAgICAgICBjb25zb2xlLmxvZyhyZXF1ZXN0LnVybCk7XFxuICAgICAgICAvLyBBZGQgYWxsIGxpbmtzIGZyb20gcGFnZSB0byBSZXF1ZXN0UXVldWVcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiAxMCwgLy8gTGltaXRhdGlvbiBmb3Igb25seSAxMCByZXF1ZXN0cyAoZG8gbm90IHVzZSBpZiB5b3Ugd2FudCB0byBjcmF3bCBhbGwgbGlua3MpXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwczovL2FwaWZ5LmNvbS8nXSk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.WZ0oMu6yd1pBKWHbkngs3qzaOVhpacPP6PKxjXnRLbc\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ async requestHandler({ request, enqueueLinks }) { console.log(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler await crawler.run(['https://apify.com/']); await Actor.exit(); ``` Using `PuppeteerCrawler`: tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBlbnF1ZXVlTGlua3MgfSkge1xcbiAgICAgICAgY29uc29sZS5sb2cocmVxdWVzdC51cmwpO1xcbiAgICAgICAgLy8gQWRkIGFsbCBsaW5rcyBmcm9tIHBhZ2UgdG8gUmVxdWVzdFF1ZXVlXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3MoKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYWxsIGxpbmtzKVxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9hcGlmeS5jb20vJ10pO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.gNhqxwBfIYMReWTkgUMf9WC-YJ_1Vy7-cQOmxNZDobM\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); const crawler = new PuppeteerCrawler({ async requestHandler({ request, enqueueLinks }) { console.log(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler await crawler.run(['https://apify.com/']); await Actor.exit(); ``` Using `PlaywrightCrawler`: tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIGVucXVldWVMaW5rcyB9KSB7XFxuICAgICAgICBjb25zb2xlLmxvZyhyZXF1ZXN0LnVybCk7XFxuICAgICAgICAvLyBBZGQgYWxsIGxpbmtzIGZyb20gcGFnZSB0byBSZXF1ZXN0UXVldWVcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcygpO1xcbiAgICB9LFxcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiAxMCwgLy8gTGltaXRhdGlvbiBmb3Igb25seSAxMCByZXF1ZXN0cyAoZG8gbm90IHVzZSBpZiB5b3Ugd2FudCB0byBjcmF3bCBhbGwgbGlua3MpXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwczovL2FwaWZ5LmNvbS8nXSk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.eVg0BTpLuA9jZtmijHGMjetPuME0zmTZX4oo8kxSAh8\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PlaywrightCrawler } from 'crawlee'; await Actor.init(); const crawler = new PlaywrightCrawler({ async requestHandler({ request, enqueueLinks }) { console.log(request.url); // Add all links from page to RequestQueue await enqueueLinks(); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) }); // Run the crawler await crawler.run(['https://apify.com/']); await Actor.exit(); ``` --- # Crawl multiple URLs This example crawls the specified list of URLs. * Cheerio Crawler * Puppeteer Crawler * Playwright Crawler Using `CheerioCrawler`: [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgQ2hlZXJpb0NyYXdsZXIoe1xcbiAgICAvLyBGdW5jdGlvbiBjYWxsZWQgZm9yIGVhY2ggVVJMXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgJCB9KSB7XFxuICAgICAgICBjb25zdCB0aXRsZSA9ICQoJ3RpdGxlJykudGV4dCgpO1xcbiAgICAgICAgY29uc29sZS5sb2coYFVSTDogJHtyZXF1ZXN0LnVybH1cXFxcblRJVExFOiAke3RpdGxlfWApO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKFtcXG4gICAgJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0xJyxcXG4gICAgJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0yJyxcXG4gICAgJ2h0dHA6Ly93d3cuZXhhbXBsZS5jb20vcGFnZS0zJyxcXG5dKTtcXG5cXG5hd2FpdCBBY3Rvci5leGl0KCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.CeiVEdnjPDfQ0i8PLiJLQhDJFF2dN9OtHDx7MiAmQD8\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ request, $ }) { const title = $('title').text(); console.log(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3', ]); await Actor.exit(); ``` Using `PuppeteerCrawler`: tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgLy8gRnVuY3Rpb24gY2FsbGVkIGZvciBlYWNoIFVSTFxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBjb25zb2xlLmxvZyhgVVJMOiAke3JlcXVlc3QudXJsfVxcXFxuVElUTEU6ICR7dGl0bGV9YCk7XFxuICAgIH0sXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oW1xcbiAgICAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTEnLFxcbiAgICAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTInLFxcbiAgICAnaHR0cDovL3d3dy5leGFtcGxlLmNvbS9wYWdlLTMnLFxcbl0pO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.N3_G0e276h-8f8FDQW4iLmyjhKEPItvUgrKXe3Rpxy8\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); const crawler = new PuppeteerCrawler({ // Function called for each URL async requestHandler({ request, page }) { const title = await page.title(); console.log(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3', ]); await Actor.exit(); ``` Using `PlaywrightCrawler`: tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. [Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICAvLyBGdW5jdGlvbiBjYWxsZWQgZm9yIGVhY2ggVVJMXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgcGFnZSB9KSB7XFxuICAgICAgICBjb25zdCB0aXRsZSA9IGF3YWl0IHBhZ2UudGl0bGUoKTtcXG4gICAgICAgIGNvbnNvbGUubG9nKGBVUkw6ICR7cmVxdWVzdC51cmx9XFxcXG5USVRMRTogJHt0aXRsZX1gKTtcXG4gICAgfSxcXG59KTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXJcXG5hd2FpdCBjcmF3bGVyLnJ1bihbXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMScsXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMicsXFxuICAgICdodHRwOi8vd3d3LmV4YW1wbGUuY29tL3BhZ2UtMycsXFxuXSk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.tFxeTZWttzvkWqmTccMmErP36zwOU4YG608H07ALpD0\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PlaywrightCrawler } from 'crawlee'; await Actor.init(); const crawler = new PlaywrightCrawler({ // Function called for each URL async requestHandler({ request, page }) { const title = await page.title(); console.log(`URL: ${request.url}\nTITLE: ${title}`); }, }); // Run the crawler await crawler.run([ 'http://www.example.com/page-1', 'http://www.example.com/page-2', 'http://www.example.com/page-3', ]); await Actor.exit(); ``` --- # Crawl a website with relative links When crawling a website, you may encounter different types of links present that you may want to crawl. 
To facilitate the easy crawling of such links, we provide the `enqueueLinks()` method on the crawler context, which will automatically find links and add them to the crawler's [`RequestQueue`](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md). We provide 3 different strategies for crawling relative links: * [All](https://crawlee.dev/api/core/enum/EnqueueStrategy#All) which will enqueue all links found, regardless of the domain they point to. * [SameHostname](https://crawlee.dev/api/core/enum/EnqueueStrategy#SameHostname) which will enqueue all links found for the same hostname (regardless of any subdomains present). * [SameSubdomain](https://crawlee.dev/api/core/enum/EnqueueStrategy#SameSubdomain) which will enqueue all links found that have the same subdomain and hostname. This is the default strategy. note For these examples, we are using the [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler), however the same method is available for both the [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) and [`PlaywrightCrawler`](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler), and you use it the exact same way. * All Links * Same Hostname * Same Subdomain Example domains Any urls found will be matched by this strategy, even if they go off of the site you are currently crawling. [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgQ2hlZXJpb0NyYXdsZXIoe1xcbiAgICBtYXhSZXF1ZXN0c1BlckNyYXdsOiAxMCwgLy8gTGltaXRhdGlvbiBmb3Igb25seSAxMCByZXF1ZXN0cyAoZG8gbm90IHVzZSBpZiB5b3Ugd2FudCB0byBjcmF3bCBhbGwgbGlua3MpXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCwgZW5xdWV1ZUxpbmtzIH0pIHtcXG4gICAgICAgIGNvbnNvbGUubG9nKHJlcXVlc3QudXJsKTtcXG4gICAgICAgIGF3YWl0IGVucXVldWVMaW5rcyh7XFxuICAgICAgICAgICAgLy8gU2V0dGluZyB0aGUgc3RyYXRlZ3kgdG8gJ2FsbCcgd2lsbCBlbnF1ZXVlIGFsbCBsaW5rcyBmb3VuZFxcbiAgICAgICAgICAgIC8vIGhpZ2hsaWdodC1uZXh0LWxpbmVcXG4gICAgICAgICAgICBzdHJhdGVneTogJ2FsbCcsXFxuICAgICAgICB9KTtcXG4gICAgfSxcXG59KTtcXG5cXG4vLyBSdW4gdGhlIGNyYXdsZXJcXG5hd2FpdCBjcmF3bGVyLnJ1bihbJ2h0dHBzOi8vYXBpZnkuY29tLyddKTtcXG5cXG5hd2FpdCBBY3Rvci5leGl0KCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.a1IFpzCtFyz6kXkEkdwjYb-WWnJaRH4hJxbbzFMcYfg\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks }) { console.log(request.url); await enqueueLinks({ // Setting the strategy to 'all' will enqueue all links found strategy: 'all', }); }, }); // Run the crawler await crawler.run(['https://apify.com/']); await Actor.exit(); ``` Example domains For a url of `https://example.com`, `enqueueLinks()` will match relative urls, urls that point to the same full domain or urls that point to any subdomain of the provided domain. For instance, hyperlinks like `https://subdomain.example.com/some/path`, `https://example.com/some/path`, `/absolute/example` or `./relative/example` will all be matched by this strategy. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLCAvLyBMaW1pdGF0aW9uIGZvciBvbmx5IDEwIHJlcXVlc3RzIChkbyBub3QgdXNlIGlmIHlvdSB3YW50IHRvIGNyYXdsIGFsbCBsaW5rcylcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBlbnF1ZXVlTGlua3MgfSkge1xcbiAgICAgICAgY29uc29sZS5sb2cocmVxdWVzdC51cmwpO1xcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKHtcXG4gICAgICAgICAgICAvLyBTZXR0aW5nIHRoZSBzdHJhdGVneSB0byAnc2FtZS1zdWJkb21haW4nIHdpbGwgZW5xdWV1ZSBhbGwgbGlua3MgZm91bmQgdGhhdCBhcmUgb24gdGhlIHNhbWUgaG9zdG5hbWVcXG4gICAgICAgICAgICAvLyBhcyByZXF1ZXN0LmxvYWRlZFVybCBvciByZXF1ZXN0LnVybFxcbiAgICAgICAgICAgIC8vIGhpZ2hsaWdodC1uZXh0LWxpbmVcXG4gICAgICAgICAgICBzdHJhdGVneTogRW5xdWV1ZVN0cmF0ZWd5LlNhbWVIb3N0bmFtZSxcXG4gICAgICAgICAgICAvLyBBbHRlcm5hdGl2ZWx5LCB5b3UgY2FuIHBhc3MgaW4gdGhlIHN0cmluZyAnc2FtZS1ob3N0bmFtZSdcXG4gICAgICAgICAgICAvLyBzdHJhdGVneTogJ3NhbWUtaG9zdG5hbWUnLFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwczovL2FwaWZ5LmNvbS8nXSk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.yF2AJFRXorzWRuCXhRGjM8nWXBFT585D7nwOkBPAPf0\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks }) { console.log(request.url); await enqueueLinks({ // Setting the strategy to 'same-subdomain' will enqueue all links found that are on the same hostname // as request.loadedUrl or request.url strategy: EnqueueStrategy.SameHostname, // Alternatively, you can pass in the string 'same-hostname' // strategy: 'same-hostname', }); }, }); // Run the crawler await crawler.run(['https://apify.com/']); await Actor.exit(); ``` tip This is the default strategy when calling `enqueueLinks()`, so you don't have to specify it. Example domains For a url of `https://subdomain.example.com`, `enqueueLinks()` will only match relative urls or urls that point to the same full domain. For instance, hyperlinks like `https://subdomain.example.com/some/path`, `/absolute/example` or `./relative/example` will all be matched by this strategy, while `https://other-subdomain.example.com` or `https://otherexample.com` will not. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBFbnF1ZXVlU3RyYXRlZ3kgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLCAvLyBMaW1pdGF0aW9uIGZvciBvbmx5IDEwIHJlcXVlc3RzIChkbyBub3QgdXNlIGlmIHlvdSB3YW50IHRvIGNyYXdsIGFsbCBsaW5rcylcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBlbnF1ZXVlTGlua3MgfSkge1xcbiAgICAgICAgY29uc29sZS5sb2cocmVxdWVzdC51cmwpO1xcbiAgICAgICAgYXdhaXQgZW5xdWV1ZUxpbmtzKHtcXG4gICAgICAgICAgICAvLyBTZXR0aW5nIHRoZSBzdHJhdGVneSB0byAnc2FtZS1zdWJkb21haW4nIHdpbGwgZW5xdWV1ZSBhbGwgbGlua3MgZm91bmQgdGhhdCBhcmUgb24gdGhlIHNhbWUgc3ViZG9tYWluIGFuZCBob3N0bmFtZVxcbiAgICAgICAgICAgIC8vIGFzIHJlcXVlc3QubG9hZGVkVXJsIG9yIHJlcXVlc3QudXJsXFxuICAgICAgICAgICAgLy8gaGlnaGxpZ2h0LW5leHQtbGluZVxcbiAgICAgICAgICAgIHN0cmF0ZWd5OiBFbnF1ZXVlU3RyYXRlZ3kuU2FtZUhvc3RuYW1lLFxcbiAgICAgICAgICAgIC8vIEFsdGVybmF0aXZlbHksIHlvdSBjYW4gcGFzcyBpbiB0aGUgc3RyaW5nICdzYW1lLXN1YmRvbWFpbidcXG4gICAgICAgICAgICAvLyBzdHJhdGVneTogJ3NhbWUtc3ViZG9tYWluJyxcXG4gICAgICAgIH0pO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9hcGlmeS5jb20vJ10pO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.dyU8vmMEV9LyeUOm-72BRE7THBxt7nDR7zN35H27ulw\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler, EnqueueStrategy } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl all links) async requestHandler({ request, enqueueLinks }) { console.log(request.url); await enqueueLinks({ // Setting the strategy to 'same-subdomain' will enqueue all links found that are on the same subdomain and hostname // as request.loadedUrl or request.url strategy: EnqueueStrategy.SameHostname, // Alternatively, you can pass in the string 'same-subdomain' // strategy: 'same-subdomain', }); }, }); // Run the crawler await crawler.run(['https://apify.com/']); await Actor.exit(); ``` --- # Crawl a single URL This example uses the [`got-scraping`](https://github.com/apify/got-scraping) npm package to grab the HTML of a web page. [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IGdvdFNjcmFwaW5nIH0gZnJvbSAnZ290LXNjcmFwaW5nJztcXG5cXG4vLyBHZXQgdGhlIEhUTUwgb2YgYSB3ZWIgcGFnZVxcbmNvbnN0IHsgYm9keSB9ID0gYXdhaXQgZ290U2NyYXBpbmcoeyB1cmw6ICdodHRwczovL3d3dy5leGFtcGxlLmNvbScgfSk7XFxuY29uc29sZS5sb2coYm9keSk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.0S1i1yD10_82mLCH3VWFtCZTU4-BDrDU1UGY208IqgE\&asrc=run_on_apify) ``` import { gotScraping } from 'got-scraping'; // Get the HTML of a web page const { body } = await gotScraping({ url: 'https://www.example.com' }); console.log(body); ``` If you don't want to hard-code the URL into the script, refer to the [Accept User Input](https://docs.apify.com/sdk/js/sdk/js/docs/examples/accept-user-input.md) example. --- # Crawl a sitemap This example downloads and crawls the URLs from a sitemap. 
* Cheerio Crawler * Puppeteer Crawler * Playwright Crawler Using `CheerioCrawler`: [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyLCBkb3dubG9hZExpc3RPZlVybHMgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0IH0pIHtcXG4gICAgICAgIGNvbnNvbGUubG9nKHJlcXVlc3QudXJsKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYSBzaXRlbWFwKVxcbn0pO1xcblxcbmNvbnN0IGxpc3RPZlVybHMgPSBhd2FpdCBkb3dubG9hZExpc3RPZlVybHMoe1xcbiAgICB1cmw6ICdodHRwczovL2FwaWZ5LmNvbS9zaXRlbWFwLnhtbCcsXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4obGlzdE9mVXJscyk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjEwMjQsInRpbWVvdXQiOjE4MH19.XWC3QQWKIsRIB8TdL40CGjzvHiqadKnt7F-9rhoHEEo\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler, downloadListOfUrls } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ // Function called for each URL async requestHandler({ request }) { console.log(request.url); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap) }); const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml', }); // Run the crawler await crawler.run(listOfUrls); await Actor.exit(); ``` Using `PuppeteerCrawler`: tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IGRvd25sb2FkTGlzdE9mVXJscywgUHVwcGV0ZWVyQ3Jhd2xlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbmF3YWl0IEFjdG9yLmluaXQoKTtcXG5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICAvLyBGdW5jdGlvbiBjYWxsZWQgZm9yIGVhY2ggVVJMXFxuICAgIGFzeW5jIHJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCB9KSB7XFxuICAgICAgICBjb25zb2xlLmxvZyhyZXF1ZXN0LnVybCk7XFxuICAgIH0sXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLCAvLyBMaW1pdGF0aW9uIGZvciBvbmx5IDEwIHJlcXVlc3RzIChkbyBub3QgdXNlIGlmIHlvdSB3YW50IHRvIGNyYXdsIGEgc2l0ZW1hcClcXG59KTtcXG5cXG5jb25zdCBsaXN0T2ZVcmxzID0gYXdhaXQgZG93bmxvYWRMaXN0T2ZVcmxzKHtcXG4gICAgdXJsOiAnaHR0cHM6Ly9hcGlmeS5jb20vc2l0ZW1hcC54bWwnLFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlclxcbmF3YWl0IGNyYXdsZXIucnVuKGxpc3RPZlVybHMpO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ._LOESjvhSiJamXz1EhpRWvA_afgRmfQABfI1Wgts8c8\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { downloadListOfUrls, PuppeteerCrawler } from 'crawlee'; await Actor.init(); const crawler = new PuppeteerCrawler({ // Function called for each URL async requestHandler({ request }) { console.log(request.url); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap) }); const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml', }); // Run the crawler await crawler.run(listOfUrls); await Actor.exit(); ``` Using `PlaywrightCrawler`: tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IGRvd25sb2FkTGlzdE9mVXJscywgUGxheXdyaWdodENyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQbGF5d3JpZ2h0Q3Jhd2xlcih7XFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0IH0pIHtcXG4gICAgICAgIGNvbnNvbGUubG9nKHJlcXVlc3QudXJsKTtcXG4gICAgfSxcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsIC8vIExpbWl0YXRpb24gZm9yIG9ubHkgMTAgcmVxdWVzdHMgKGRvIG5vdCB1c2UgaWYgeW91IHdhbnQgdG8gY3Jhd2wgYSBzaXRlbWFwKVxcbn0pO1xcblxcbmNvbnN0IGxpc3RPZlVybHMgPSBhd2FpdCBkb3dubG9hZExpc3RPZlVybHMoe1xcbiAgICB1cmw6ICdodHRwczovL2FwaWZ5LmNvbS9zaXRlbWFwLnhtbCcsXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4obGlzdE9mVXJscyk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.qbl4ro1qZvqNhlkeysCWDSDwM0LV0A3CVXl89bDLbR4\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { downloadListOfUrls, PlaywrightCrawler } from 'crawlee'; await Actor.init(); const crawler = new PlaywrightCrawler({ // Function called for each URL async requestHandler({ request }) { console.log(request.url); }, maxRequestsPerCrawl: 10, // Limitation for only 10 requests (do not use if you want to crawl a sitemap) }); const listOfUrls = await downloadListOfUrls({ url: 'https://apify.com/sitemap.xml', }); // Run the crawler await crawler.run(listOfUrls); await Actor.exit(); ``` --- # Crawl some links on a website This [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) example uses the [`pseudoUrls`](https://crawlee.dev/api/core/class/PseudoUrl) property in the [`enqueueLinks()`](https://crawlee.dev/api/cheerio-crawler/interface/CheerioRequestHandlerInputs#enqueueLinks) method to only add links to the [`RequestQueue`](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) queue if they match the specified regular expression. 
[Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IENoZWVyaW9DcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbi8vIENyZWF0ZSBhIENoZWVyaW9DcmF3bGVyXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBDaGVlcmlvQ3Jhd2xlcih7XFxuICAgIC8vIExpbWl0cyB0aGUgY3Jhd2xlciB0byBvbmx5IDEwIHJlcXVlc3RzIChkbyBub3QgdXNlIGlmIHlvdSB3YW50IHRvIGNyYXdsIGFsbCBsaW5rcylcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogMTAsXFxuICAgIC8vIEZ1bmN0aW9uIGNhbGxlZCBmb3IgZWFjaCBVUkxcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBlbnF1ZXVlTGlua3MgfSkge1xcbiAgICAgICAgY29uc29sZS5sb2cocmVxdWVzdC51cmwpO1xcbiAgICAgICAgLy8gQWRkIHNvbWUgbGlua3MgZnJvbSBwYWdlIHRvIHRoZSBjcmF3bGVyJ3MgUmVxdWVzdFF1ZXVlXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHBzZXVkb1VybHM6IFsnaHR0cFtzP106Ly9hcGlmeS5jb20vWy4rXS9bLitdJ10sXFxuICAgICAgICB9KTtcXG4gICAgfSxcXG59KTtcXG5cXG4vLyBEZWZpbmUgdGhlIHN0YXJ0aW5nIFVSTCBhbmQgcnVuIHRoZSBjcmF3bGVyXFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwczovL2FwaWZ5LmNvbS9zdG9yZSddKTtcXG5cXG5hd2FpdCBBY3Rvci5leGl0KCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.wHLH-CdKCylWDVcIRISOGWdcfzTZHeVAVlfiQhkzdko\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); // Create a CheerioCrawler const crawler = new CheerioCrawler({ // Limits the crawler to only 10 requests (do not use if you want to crawl all links) maxRequestsPerCrawl: 10, // Function called for each URL async requestHandler({ request, enqueueLinks }) { console.log(request.url); // Add some links from page to the crawler's RequestQueue await enqueueLinks({ pseudoUrls: ['http[s?]://apify.com/[.+]/[.+]'], }); }, }); // Define the starting URL and run the crawler await crawler.run(['https://apify.com/store']); await Actor.exit(); ``` --- # Forms This example demonstrates how to use [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) to automatically fill and submit a search form to look up repositories on [GitHub](https://github.com) using headless Chrome / Puppeteer. The actor first fills in the search term, repository owner, start date and language of the repository, then submits the form and prints out the results. Finally, the results are saved either on the Apify platform to the default [`dataset`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) or on the local machine as JSON files in `./storage/datasets/default`. tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IGxhdW5jaFB1cHBldGVlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbmF3YWl0IEFjdG9yLmluaXQoKTtcXG5cXG4vLyBMYXVuY2ggdGhlIHdlYiBicm93c2VyLlxcbmNvbnN0IGJyb3dzZXIgPSBhd2FpdCBsYXVuY2hQdXBwZXRlZXIoKTtcXG5cXG4vLyBDcmVhdGUgYW5kIG5hdmlnYXRlIG5ldyBwYWdlXFxuY29uc29sZS5sb2coJ09wZW4gdGFyZ2V0IHBhZ2UnKTtcXG5jb25zdCBwYWdlID0gYXdhaXQgYnJvd3Nlci5uZXdQYWdlKCk7XFxuYXdhaXQgcGFnZS5nb3RvKCdodHRwczovL2dpdGh1Yi5jb20vc2VhcmNoL2FkdmFuY2VkJyk7XFxuXFxuLy8gRmlsbCBmb3JtIGZpZWxkcyBhbmQgc2VsZWN0IGRlc2lyZWQgc2VhcmNoIG9wdGlvbnNcXG5jb25zb2xlLmxvZygnRmlsbCBpbiBzZWFyY2ggZm9ybScpO1xcbmF3YWl0IHBhZ2UudHlwZSgnI2Fkdl9jb2RlX3NlYXJjaCBpbnB1dC5qcy1hZHZhbmNlZC1zZWFyY2gtaW5wdXQnLCAnYXBpZnktanMnKTtcXG5hd2FpdCBwYWdlLnR5cGUoJyNzZWFyY2hfZnJvbScsICdhcGlmeScpO1xcbmF3YWl0IHBhZ2UudHlwZSgnI3NlYXJjaF9kYXRlJywgJz4yMDE1Jyk7XFxuYXdhaXQgcGFnZS5zZWxlY3QoJ3NlbGVjdCNzZWFyY2hfbGFuZ3VhZ2UnLCAnSmF2YVNjcmlwdCcpO1xcblxcbi8vIFN1Ym1pdCB0aGUgZm9ybSBhbmQgd2FpdCBmb3IgZnVsbCBsb2FkIG9mIG5leHQgcGFnZVxcbmNvbnNvbGUubG9nKCdTdWJtaXQgc2VhcmNoIGZvcm0nKTtcXG5hd2FpdCBQcm9taXNlLmFsbChbXFxuICAgIHBhZ2Uud2FpdEZvck5hdmlnYXRpb24oKSxcXG4gICAgcGFnZS5jbGljaygnI2Fkdl9jb2RlX3NlYXJjaCBidXR0b25bdHlwZT1cXFwic3VibWl0XFxcIl0nKSxcXG5dKTtcXG5cXG4vLyBPYnRhaW4gYW5kIHByaW50IGxpc3Qgb2Ygc2VhcmNoIHJlc3VsdHNcXG5jb25zdCByZXN1bHRzID0gYXdhaXQgcGFnZS4kJGV2YWwoJ2Rpdi5mNC50ZXh0LW5vcm1hbCBhJywgKG5vZGVzKSA9PlxcbiAgICBub2Rlcy5tYXAoKG5vZGUpID0-ICh7XFxuICAgICAgICB1cmw6IG5vZGUuaHJlZixcXG4gICAgICAgIG5hbWU6IG5vZGUuaW5uZXJUZXh0LFxcbiAgICB9KSksXFxuKTtcXG5cXG5jb25zb2xlLmxvZygnUmVzdWx0czonLCByZXN1bHRzKTtcXG5cXG4vLyBTdG9yZSBkYXRhIGluIGRlZmF1bHQgZGF0YXNldFxcbmF3YWl0IEFjdG9yLnB1c2hEYXRhKHJlc3VsdHMpO1xcblxcbi8vIENsb3NlIGJyb3dzZXJcXG5hd2FpdCBicm93c2VyLmNsb3NlKCk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.50kP3gcHDUJWt6VevBrpm1zXyG6s5l7JYuSd2JiWhVg\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { launchPuppeteer } from 'crawlee'; await Actor.init(); // Launch the web browser. 
const browser = await launchPuppeteer(); // Create and navigate new page console.log('Open target page'); const page = await browser.newPage(); await page.goto('https://github.com/search/advanced'); // Fill form fields and select desired search options console.log('Fill in search form'); await page.type('#adv_code_search input.js-advanced-search-input', 'apify-js'); await page.type('#search_from', 'apify'); await page.type('#search_date', '>2015'); await page.select('select#search_language', 'JavaScript'); // Submit the form and wait for full load of next page console.log('Submit search form'); await Promise.all([ page.waitForNavigation(), page.click('#adv_code_search button[type="submit"]'), ]); // Obtain and print list of search results const results = await page.$$eval('div.f4.text-normal a', (nodes) => nodes.map((node) => ({ url: node.href, name: node.innerText, })), ); console.log('Results:', results); // Store data in default dataset await Actor.pushData(results); // Close browser await browser.close(); await Actor.exit(); ``` --- # Dataset Map and Reduce methods This example shows an easy use-case of the [`Dataset`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) [`map`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#map) and [`reduce`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#reduce) methods. Both methods can be used to simplify the dataset results workflow process. Both can be called on the [dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) directly. Important to mention is that both methods return a new result (`map` returns a new array and `reduce` can return any type) - neither method updates the dataset in any way. Examples for both methods are demonstrated on a simple dataset containing the results scraped from a page: the `URL` and a hypothetical number of `h1` - `h3` header elements under the `headingCount` key. This data structure is stored in the default dataset under `{PROJECT_FOLDER}/storage/datasets/default/`. If you want to simulate the functionality, you can use the [`Actor.pushData()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#pushData) method to save the example `JSON array` to your dataset. ``` [ { "url": "https://apify.com/", "headingCount": 11 }, { "url": "https://apify.com/storage", "headingCount": 8 }, { "url": "https://apify.com/proxy", "headingCount": 4 } ] ``` ### Map[​](#map "Direct link to heading") The dataset `map` method is very similar to standard Array mapping methods. It produces a new array of values by mapping each value in the existing array through a transformation function and an options parameter. 
The `map` method is used to check whether there are more than 5 header elements on each page: [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcblxcbmF3YWl0IEFjdG9yLmluaXQoKTtcXG5cXG5jb25zdCBkYXRhc2V0ID0gYXdhaXQgQWN0b3Iub3BlbkRhdGFzZXQ8eyBoZWFkaW5nQ291bnQ6IG51bWJlciB9PigpO1xcblxcbi8vIGNhbGxpbmcgbWFwIGZ1bmN0aW9uIGFuZCBmaWx0ZXJpbmcgdGhyb3VnaCBtYXBwZWQgaXRlbXNcXG5jb25zdCBtb3JlVGhhbjVoZWFkZXJzID0gKFxcbiAgICBhd2FpdCBkYXRhc2V0Lm1hcCgoaXRlbSkgPT4gaXRlbS5oZWFkaW5nQ291bnQpXFxuKS5maWx0ZXIoKGNvdW50KSA9PiBjb3VudCA-IDUpO1xcblxcbi8vIHNhdmluZyByZXN1bHQgb2YgbWFwIHRvIGRlZmF1bHQgS2V5LXZhbHVlIHN0b3JlXFxuYXdhaXQgQWN0b3Iuc2V0VmFsdWUoJ3BhZ2VzX3dpdGhfbW9yZV90aGFuXzVfaGVhZGVycycsIG1vcmVUaGFuNWhlYWRlcnMpO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5IjoxMDI0LCJ0aW1lb3V0IjoxODB9fQ.rEYgnbXjDJ4eTxXqxEB8PtTkf-Ky6EDTYuYqMHw2XKE\&asrc=run_on_apify) ``` import { Actor } from 'apify'; await Actor.init(); const dataset = await Actor.openDataset<{ headingCount: number }>(); // calling map function and filtering through mapped items const moreThan5headers = ( await dataset.map((item) => item.headingCount) ).filter((count) => count > 5); // saving result of map to default Key-value store await Actor.setValue('pages_with_more_than_5_headers', moreThan5headers); await Actor.exit(); ``` The `moreThan5headers` variable is an array of `headingCount` attributes where the number of headers is greater than 5. The `map` method's result value saved to the [`key-value store`](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) should be: ``` [11, 8]; ``` ### Reduce[​](#reduce "Direct link to heading") The dataset `reduce` method does not produce a new array of values - it reduces a list of values down to a single value. The method iterates through the items in the dataset using the [`memo` argument](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#reduce). After performing the necessary calculation, the `memo` is sent to the next iteration, while the item just processed is reduced (removed).
Using the `reduce` method to get the total number of headers scraped (all items in the dataset): [Run on](https://console.apify.com/actors/kk67IcZkKSSBTslXI?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcblxcbmNvbnN0IGRhdGFzZXQgPSBhd2FpdCBBY3Rvci5vcGVuRGF0YXNldCgpO1xcblxcbi8vIGNhbGxpbmcgcmVkdWNlIGZ1bmN0aW9uIGFuZCB1c2luZyBtZW1vIHRvIGNhbGN1bGF0ZSBudW1iZXIgb2YgaGVhZGVyc1xcbmNvbnN0IHBhZ2VzSGVhZGluZ0NvdW50ID0gYXdhaXQgZGF0YXNldC5yZWR1Y2UoKG1lbW8sIHZhbHVlKSA9PiB7XFxuICAgIHJldHVybiBtZW1vICsgdmFsdWUuaGVhZGluZ0NvdW50O1xcbn0sIDApO1xcblxcbi8vIHNhdmluZyByZXN1bHQgb2YgbWFwIHRvIGRlZmF1bHQgS2V5LXZhbHVlIHN0b3JlXFxuYXdhaXQgQWN0b3Iuc2V0VmFsdWUoJ3BhZ2VzX2hlYWRpbmdfY291bnQnLCBwYWdlc0hlYWRpbmdDb3VudCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6MTAyNCwidGltZW91dCI6MTgwfX0.Ca5oSQWfTfmi-fp-gu9TTQKaoJ4BQW-1AhHXekmCV9c\&asrc=run_on_apify) ``` import { Actor } from 'apify'; const dataset = await Actor.openDataset(); // calling reduce function and using memo to calculate number of headers const pagesHeadingCount = await dataset.reduce((memo, value) => { return memo + value.headingCount; }, 0); // saving result of map to default Key-value store await Actor.setValue('pages_heading_count', pagesHeadingCount); ``` The original dataset will be reduced to a single value, `pagesHeadingCount`, which contains the count of all headers for all scraped pages (all dataset items). The `reduce` method's result value saved to the [`key-value store`](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) should be: ``` 23; ``` --- # Playwright crawler This example demonstrates how to use [`PlaywrightCrawler`](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler) in combination with [`RequestQueue`](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) using headless Chrome / Playwright. The crawler starts with a single URL, finds links to next pages, enqueues them and continues until no more desired links are available. The results are stored to the default dataset. In local configuration, the results are stored as JSON files in `./storage/datasets/default` tip To run this example on the Apify Platform, select the `apify/actor-node-playwright-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFBsYXl3cmlnaHRDcmF3bGVyIH0gZnJvbSAnY3Jhd2xlZSc7XFxuXFxuYXdhaXQgQWN0b3IuaW5pdCgpO1xcblxcbi8vIENyZWF0ZSBhbiBpbnN0YW5jZSBvZiB0aGUgUGxheXdyaWdodENyYXdsZXIgY2xhc3MgLSBhIGNyYXdsZXJcXG4vLyB0aGF0IGF1dG9tYXRpY2FsbHkgbG9hZHMgdGhlIFVSTHMgaW4gaGVhZGxlc3MgQ2hyb21lIC8gUGxheXdyaWdodC5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFBsYXl3cmlnaHRDcmF3bGVyKHtcXG4gICAgbGF1bmNoQ29udGV4dDoge1xcbiAgICAgICAgLy8gSGVyZSB5b3UgY2FuIHNldCBvcHRpb25zIHRoYXQgYXJlIHBhc3NlZCB0byB0aGUgcGxheXdyaWdodCAubGF1bmNoKCkgZnVuY3Rpb24uXFxuICAgICAgICBsYXVuY2hPcHRpb25zOiB7XFxuICAgICAgICAgICAgaGVhZGxlc3M6IHRydWUsXFxuICAgICAgICB9LFxcbiAgICB9LFxcblxcbiAgICAvLyBTdG9wIGNyYXdsaW5nIGFmdGVyIHNldmVyYWwgcGFnZXNcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogNTAsXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gd2lsbCBiZSBjYWxsZWQgZm9yIGVhY2ggVVJMIHRvIGNyYXdsLlxcbiAgICAvLyBIZXJlIHlvdSBjYW4gd3JpdGUgdGhlIFBsYXl3cmlnaHQgc2NyaXB0cyB5b3UgYXJlIGZhbWlsaWFyIHdpdGgsXFxuICAgIC8vIHdpdGggdGhlIGV4Y2VwdGlvbiB0aGF0IGJyb3dzZXJzIGFuZCBwYWdlcyBhcmUgYXV0b21hdGljYWxseSBtYW5hZ2VkIGJ5IHRoZSBBcGlmeSBTREsuXFxuICAgIC8vIFRoZSBmdW5jdGlvbiBhY2NlcHRzIGEgc2luZ2xlIHBhcmFtZXRlciwgd2hpY2ggaXMgYW4gb2JqZWN0IHdpdGggYSBsb3Qgb2YgcHJvcGVydGllcyxcXG4gICAgLy8gdGhlIG1vc3QgaW1wb3J0YW50IGJlaW5nOlxcbiAgICAvLyAtIHJlcXVlc3Q6IGFuIGluc3RhbmNlIG9mIHRoZSBSZXF1ZXN0IGNsYXNzIHdpdGggaW5mb3JtYXRpb24gc3VjaCBhcyBVUkwgYW5kIEhUVFAgbWV0aG9kXFxuICAgIC8vIC0gcGFnZTogUGxheXdyaWdodCdzIFBhZ2Ugb2JqZWN0IChzZWUgaHR0cHM6Ly9wbGF5d3JpZ2h0LmRldi9kb2NzL2FwaS9jbGFzcy1wYWdlKVxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QsIHBhZ2UsIGVucXVldWVMaW5rcyB9KSB7XFxuICAgICAgICBjb25zb2xlLmxvZyhgUHJvY2Vzc2luZyAke3JlcXVlc3QudXJsfS4uLmApO1xcblxcbiAgICAgICAgLy8gQSBmdW5jdGlvbiB0byBiZSBldmFsdWF0ZWQgYnkgUGxheXdyaWdodCB3aXRoaW4gdGhlIGJyb3dzZXIgY29udGV4dC5cXG4gICAgICAgIGNvbnN0IGRhdGEgPSBhd2FpdCBwYWdlLiQkZXZhbCgnLmF0aGluZycsICgkcG9zdHMpID0-IHtcXG4gICAgICAgICAgICBjb25zdCBzY3JhcGVkRGF0YSA9IFtdO1xcblxcbiAgICAgICAgICAgIC8vIFdlJ3JlIGdldHRpbmcgdGhlIHRpdGxlLCByYW5rIGFuZCBVUkwgb2YgZWFjaCBwb3N0IG9uIEhhY2tlciBOZXdzLlxcbiAgICAgICAgICAgICRwb3N0cy5mb3JFYWNoKCgkcG9zdCkgPT4ge1xcbiAgICAgICAgICAgICAgICBzY3JhcGVkRGF0YS5wdXNoKHtcXG4gICAgICAgICAgICAgICAgICAgIHRpdGxlOiAkcG9zdC5xdWVyeVNlbGVjdG9yKCcudGl0bGUgYScpLmlubmVyVGV4dCxcXG4gICAgICAgICAgICAgICAgICAgIHJhbms6ICRwb3N0LnF1ZXJ5U2VsZWN0b3IoJy5yYW5rJykuaW5uZXJUZXh0LFxcbiAgICAgICAgICAgICAgICAgICAgaHJlZjogJHBvc3QucXVlcnlTZWxlY3RvcignLnRpdGxlIGEnKS5ocmVmLFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICByZXR1cm4gc2NyYXBlZERhdGE7XFxuICAgICAgICB9KTtcXG5cXG4gICAgICAgIC8vIFN0b3JlIHRoZSByZXN1bHRzIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBBY3Rvci5wdXNoRGF0YShkYXRhKTtcXG5cXG4gICAgICAgIC8vIEZpbmQgYSBsaW5rIHRvIHRoZSBuZXh0IHBhZ2UgYW5kIGVucXVldWUgaXQgaWYgaXQgZXhpc3RzLlxcbiAgICAgICAgY29uc3QgaW5mb3MgPSBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHNlbGVjdG9yOiAnLm1vcmVsaW5rJyxcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgaWYgKGluZm9zLnByb2Nlc3NlZFJlcXVlc3RzLmxlbmd0aCA9PT0gMClcXG4gICAgICAgICAgICBjb25zb2xlLmxvZyhgJHtyZXF1ZXN0LnVybH0gaXMgdGhlIGxhc3QgcGFnZSFgKTtcXG4gICAgfSxcXG5cXG4gICAgLy8gVGhpcyBmdW5jdGlvbiBpcyBjYWxsZWQgaWYgdGhlIHBhZ2UgcHJvY2Vzc2luZyBmYWlsZWQgbW9yZSB0aGFuIG1heFJlcXVlc3RSZXRyaWVzKzEgdGltZXMuXFxuICAgIGZhaWxlZFJlcXVlc3RIYW5kbGVyKHsgcmVxdWVzdCB9KSB7XFxuICAgICAgICBjb25zb2xlLmxvZyhgUmVxdWVzdCAke3JlcXVlc3QudXJsfSBmYWlsZWQgdG9vIG1hbnkgdGltZXMuYCk7XFxuICAgIH0sXFxufSk7XFxuXFxuLy8gUnVuIHRoZSBjcmF3bGVyIGFuZCB3YWl0IGZvciBpdCB0byBmaW5pc2guXFxuYX
dhaXQgY3Jhd2xlci5ydW4oWydodHRwczovL25ld3MueWNvbWJpbmF0b3IuY29tLyddKTtcXG5cXG5jb25zb2xlLmxvZygnQ3Jhd2xlciBmaW5pc2hlZC4nKTtcXG5cXG5hd2FpdCBBY3Rvci5leGl0KCk7XFxuXCJ9Iiwib3B0aW9ucyI6eyJidWlsZCI6ImxhdGVzdCIsImNvbnRlbnRUeXBlIjoiYXBwbGljYXRpb24vanNvbjsgY2hhcnNldD11dGYtOCIsIm1lbW9yeSI6NDA5NiwidGltZW91dCI6MTgwfX0.pYKENUrfvL61rPML7uc96hLxWD7O0UxTc_ZALKmFpyA\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PlaywrightCrawler } from 'crawlee'; await Actor.init(); // Create an instance of the PlaywrightCrawler class - a crawler // that automatically loads the URLs in headless Chrome / Playwright. const crawler = new PlaywrightCrawler({ launchContext: { // Here you can set options that are passed to the playwright .launch() function. launchOptions: { headless: true, }, }, // Stop crawling after several pages maxRequestsPerCrawl: 50, // This function will be called for each URL to crawl. // Here you can write the Playwright scripts you are familiar with, // with the exception that browsers and pages are automatically managed by the Apify SDK. // The function accepts a single parameter, which is an object with a lot of properties, // the most important being: // - request: an instance of the Request class with information such as URL and HTTP method // - page: Playwright's Page object (see https://playwright.dev/docs/api/class-page) async requestHandler({ request, page, enqueueLinks }) { console.log(`Processing ${request.url}...`); // A function to be evaluated by Playwright within the browser context. const data = await page.$$eval('.athing', ($posts) => { const scrapedData = []; // We're getting the title, rank and URL of each post on Hacker News. $posts.forEach(($post) => { scrapedData.push({ title: $post.querySelector('.title a').innerText, rank: $post.querySelector('.rank').innerText, href: $post.querySelector('.title a').href, }); }); return scrapedData; }); // Store the results to the default dataset. await Actor.pushData(data); // Find a link to the next page and enqueue it if it exists. const infos = await enqueueLinks({ selector: '.morelink', }); if (infos.processedRequests.length === 0) console.log(`${request.url} is the last page!`); }, // This function is called if the page processing failed more than maxRequestRetries+1 times. failedRequestHandler({ request }) { console.log(`Request ${request.url} failed too many times.`); }, }); // Run the crawler and wait for it to finish. await crawler.run(['https://news.ycombinator.com/']); console.log('Crawler finished.'); await Actor.exit(); ``` --- # Puppeteer crawler This example demonstrates how to use [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) in combination with [`RequestQueue`](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) to recursively scrape the [Hacker News website](https://news.ycombinator.com) using headless Chrome / Puppeteer. The crawler starts with a single URL, finds links to next pages, enqueues them and continues until no more desired links are available. The results are stored to the default dataset. In local configuration, the results are stored as JSON files in `./storage/datasets/default` tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuLy8gQ3JlYXRlIGFuIGluc3RhbmNlIG9mIHRoZSBQdXBwZXRlZXJDcmF3bGVyIGNsYXNzIC0gYSBjcmF3bGVyXFxuLy8gdGhhdCBhdXRvbWF0aWNhbGx5IGxvYWRzIHRoZSBVUkxzIGluIGhlYWRsZXNzIENocm9tZSAvIFB1cHBldGVlci5cXG5jb25zdCBjcmF3bGVyID0gbmV3IFB1cHBldGVlckNyYXdsZXIoe1xcbiAgICAvLyBIZXJlIHlvdSBjYW4gc2V0IG9wdGlvbnMgdGhhdCBhcmUgcGFzc2VkIHRvIHRoZSBsYXVuY2hQdXBwZXRlZXIoKSBmdW5jdGlvbi5cXG4gICAgbGF1bmNoQ29udGV4dDoge1xcbiAgICAgICAgbGF1bmNoT3B0aW9uczoge1xcbiAgICAgICAgICAgIGhlYWRsZXNzOiB0cnVlLFxcbiAgICAgICAgICAgIC8vIE90aGVyIFB1cHBldGVlciBvcHRpb25zXFxuICAgICAgICB9LFxcbiAgICB9LFxcblxcbiAgICAvLyBTdG9wIGNyYXdsaW5nIGFmdGVyIHNldmVyYWwgcGFnZXNcXG4gICAgbWF4UmVxdWVzdHNQZXJDcmF3bDogNTAsXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gd2lsbCBiZSBjYWxsZWQgZm9yIGVhY2ggVVJMIHRvIGNyYXdsLlxcbiAgICAvLyBIZXJlIHlvdSBjYW4gd3JpdGUgdGhlIFB1cHBldGVlciBzY3JpcHRzIHlvdSBhcmUgZmFtaWxpYXIgd2l0aCxcXG4gICAgLy8gd2l0aCB0aGUgZXhjZXB0aW9uIHRoYXQgYnJvd3NlcnMgYW5kIHBhZ2VzIGFyZSBhdXRvbWF0aWNhbGx5IG1hbmFnZWQgYnkgdGhlIEFwaWZ5IFNESy5cXG4gICAgLy8gVGhlIGZ1bmN0aW9uIGFjY2VwdHMgYSBzaW5nbGUgcGFyYW1ldGVyLCB3aGljaCBpcyBhbiBvYmplY3Qgd2l0aCB0aGUgZm9sbG93aW5nIGZpZWxkczpcXG4gICAgLy8gLSByZXF1ZXN0OiBhbiBpbnN0YW5jZSBvZiB0aGUgUmVxdWVzdCBjbGFzcyB3aXRoIGluZm9ybWF0aW9uIHN1Y2ggYXMgVVJMIGFuZCBIVFRQIG1ldGhvZFxcbiAgICAvLyAtIHBhZ2U6IFB1cHBldGVlcidzIFBhZ2Ugb2JqZWN0IChzZWUgaHR0cHM6Ly9wcHRyLmRldi8jc2hvdz1hcGktY2xhc3MtcGFnZSlcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MgfSkge1xcbiAgICAgICAgY29uc29sZS5sb2coYFByb2Nlc3NpbmcgJHtyZXF1ZXN0LnVybH0uLi5gKTtcXG5cXG4gICAgICAgIC8vIEEgZnVuY3Rpb24gdG8gYmUgZXZhbHVhdGVkIGJ5IFB1cHBldGVlciB3aXRoaW4gdGhlIGJyb3dzZXIgY29udGV4dC5cXG4gICAgICAgIGNvbnN0IGRhdGEgPSBhd2FpdCBwYWdlLiQkZXZhbCgnLmF0aGluZycsICgkcG9zdHMpID0-IHtcXG4gICAgICAgICAgICBjb25zdCBzY3JhcGVkRGF0YSA9IFtdO1xcblxcbiAgICAgICAgICAgIC8vIFdlJ3JlIGdldHRpbmcgdGhlIHRpdGxlLCByYW5rIGFuZCBVUkwgb2YgZWFjaCBwb3N0IG9uIEhhY2tlciBOZXdzLlxcbiAgICAgICAgICAgICRwb3N0cy5mb3JFYWNoKCgkcG9zdCkgPT4ge1xcbiAgICAgICAgICAgICAgICBzY3JhcGVkRGF0YS5wdXNoKHtcXG4gICAgICAgICAgICAgICAgICAgIHRpdGxlOiAkcG9zdC5xdWVyeVNlbGVjdG9yKCcudGl0bGUgYScpLmlubmVyVGV4dCxcXG4gICAgICAgICAgICAgICAgICAgIHJhbms6ICRwb3N0LnF1ZXJ5U2VsZWN0b3IoJy5yYW5rJykuaW5uZXJUZXh0LFxcbiAgICAgICAgICAgICAgICAgICAgaHJlZjogJHBvc3QucXVlcnlTZWxlY3RvcignLnRpdGxlIGEnKS5ocmVmLFxcbiAgICAgICAgICAgICAgICB9KTtcXG4gICAgICAgICAgICB9KTtcXG5cXG4gICAgICAgICAgICByZXR1cm4gc2NyYXBlZERhdGE7XFxuICAgICAgICB9KTtcXG5cXG4gICAgICAgIC8vIFN0b3JlIHRoZSByZXN1bHRzIHRvIHRoZSBkZWZhdWx0IGRhdGFzZXQuXFxuICAgICAgICBhd2FpdCBBY3Rvci5wdXNoRGF0YShkYXRhKTtcXG5cXG4gICAgICAgIC8vIEZpbmQgYSBsaW5rIHRvIHRoZSBuZXh0IHBhZ2UgYW5kIGVucXVldWUgaXQgaWYgaXQgZXhpc3RzLlxcbiAgICAgICAgY29uc3QgaW5mb3MgPSBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHNlbGVjdG9yOiAnLm1vcmVsaW5rJyxcXG4gICAgICAgIH0pO1xcblxcbiAgICAgICAgaWYgKGluZm9zLmxlbmd0aCA9PT0gMCkgY29uc29sZS5sb2coYCR7cmVxdWVzdC51cmx9IGlzIHRoZSBsYXN0IHBhZ2UhYCk7XFxuICAgIH0sXFxuXFxuICAgIC8vIFRoaXMgZnVuY3Rpb24gaXMgY2FsbGVkIGlmIHRoZSBwYWdlIHByb2Nlc3NpbmcgZmFpbGVkIG1vcmUgdGhhbiBtYXhSZXF1ZXN0UmV0cmllcysxIHRpbWVzLlxcbiAgICBmYWlsZWRSZXF1ZXN0SGFuZGxlcih7IHJlcXVlc3QgfSkge1xcbiAgICAgICAgY29uc29sZS5sb2coYFJlcXVlc3QgJHtyZXF1ZXN0LnVybH0gZmFpbGVkIHRvbyBtYW55IHRpbWVzLmApO1xcbiAgICB9LFxcbn0pO1xcblxcbi8vIFJ1biB0aGUgY3Jhd2xlciBhbmQgd2FpdCBmb3IgaXQgdG8gZmluaXNoLlxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly9uZXdzLnljb21iaW5hdG9yLm
NvbS8nXSk7XFxuXFxuY29uc29sZS5sb2coJ0NyYXdsZXIgZmluaXNoZWQuJyk7XFxuXFxuYXdhaXQgQWN0b3IuZXhpdCgpO1xcblwifSIsIm9wdGlvbnMiOnsiYnVpbGQiOiJsYXRlc3QiLCJjb250ZW50VHlwZSI6ImFwcGxpY2F0aW9uL2pzb247IGNoYXJzZXQ9dXRmLTgiLCJtZW1vcnkiOjQwOTYsInRpbWVvdXQiOjE4MH19.88cqtP3DJA1811DUd2fOqdjsLFRPvz91Pi_WHe8Yt5U\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); // Create an instance of the PuppeteerCrawler class - a crawler // that automatically loads the URLs in headless Chrome / Puppeteer. const crawler = new PuppeteerCrawler({ // Here you can set options that are passed to the launchPuppeteer() function. launchContext: { launchOptions: { headless: true, // Other Puppeteer options }, }, // Stop crawling after several pages maxRequestsPerCrawl: 50, // This function will be called for each URL to crawl. // Here you can write the Puppeteer scripts you are familiar with, // with the exception that browsers and pages are automatically managed by the Apify SDK. // The function accepts a single parameter, which is an object with the following fields: // - request: an instance of the Request class with information such as URL and HTTP method // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page) async requestHandler({ request, page, enqueueLinks }) { console.log(`Processing ${request.url}...`); // A function to be evaluated by Puppeteer within the browser context. const data = await page.$$eval('.athing', ($posts) => { const scrapedData = []; // We're getting the title, rank and URL of each post on Hacker News. $posts.forEach(($post) => { scrapedData.push({ title: $post.querySelector('.title a').innerText, rank: $post.querySelector('.rank').innerText, href: $post.querySelector('.title a').href, }); }); return scrapedData; }); // Store the results to the default dataset. await Actor.pushData(data); // Find a link to the next page and enqueue it if it exists. const infos = await enqueueLinks({ selector: '.morelink', }); if (infos.length === 0) console.log(`${request.url} is the last page!`); }, // This function is called if the page processing failed more than maxRequestRetries+1 times. failedRequestHandler({ request }) { console.log(`Request ${request.url} failed too many times.`); }, }); // Run the crawler and wait for it to finish. await crawler.run(['https://news.ycombinator.com/']); console.log('Crawler finished.'); await Actor.exit(); ``` --- # Puppeteer recursive crawl Run the following example to perform a recursive crawl of a website using [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler). tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBjb25zb2xlLmxvZyhgVGl0bGUgb2YgJHtyZXF1ZXN0LnVybH06ICR7dGl0bGV9YCk7XFxuXFxuICAgICAgICBhd2FpdCBlbnF1ZXVlTGlua3Moe1xcbiAgICAgICAgICAgIHBzZXVkb1VybHM6IFsnaHR0cHM6Ly93d3cuaWFuYS5vcmcvWy4qXSddLFxcbiAgICAgICAgfSk7XFxuICAgIH0sXFxuICAgIG1heFJlcXVlc3RzUGVyQ3Jhd2w6IDEwLFxcbn0pO1xcblxcbmF3YWl0IGNyYXdsZXIucnVuKFsnaHR0cHM6Ly93d3cuaWFuYS5vcmcvJ10pO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.8btSf7N11TyAU4POBztQTOCmNYkaZxZ9FeoCUoRa5YE\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); const crawler = new PuppeteerCrawler({ async requestHandler({ request, page, enqueueLinks }) { const title = await page.title(); console.log(`Title of ${request.url}: ${title}`); await enqueueLinks({ pseudoUrls: ['https://www.iana.org/[.*]'], }); }, maxRequestsPerCrawl: 10, }); await crawler.run(['https://www.iana.org/']); await Actor.exit(); ``` --- # Puppeteer with proxy This example demonstrates how to load pages in headless Chrome / Puppeteer over [Apify Proxy](https://docs.apify.com/proxy). To make it work, you'll need an Apify account with access to the proxy. Visit the [Apify platform introduction](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md) to find how to log into your account from the SDK. tip To run this example on the Apify Platform, select the `apify/actor-node-puppeteer-chrome` image for your Dockerfile. 
[Run on](https://console.apify.com/actors/7tWSD8hrYzuc9Lte7?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcImNvZGVcIjpcImltcG9ydCB7IEFjdG9yIH0gZnJvbSAnYXBpZnknO1xcbmltcG9ydCB7IFB1cHBldGVlckNyYXdsZXIgfSBmcm9tICdjcmF3bGVlJztcXG5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuLy8gUHJveHkgY29ubmVjdGlvbiBpcyBhdXRvbWF0aWNhbGx5IGVzdGFibGlzaGVkIGluIHRoZSBDcmF3bGVyXFxuY29uc3QgcHJveHlDb25maWd1cmF0aW9uID0gYXdhaXQgQWN0b3IuY3JlYXRlUHJveHlDb25maWd1cmF0aW9uKCk7XFxuXFxuY29uc3QgY3Jhd2xlciA9IG5ldyBQdXBwZXRlZXJDcmF3bGVyKHtcXG4gICAgcHJveHlDb25maWd1cmF0aW9uLFxcbiAgICBhc3luYyByZXF1ZXN0SGFuZGxlcih7IHBhZ2UgfSkge1xcbiAgICAgICAgY29uc3Qgc3RhdHVzID0gYXdhaXQgcGFnZS4kZXZhbCgndGQuc3RhdHVzJywgKGVsKSA9PiBlbC50ZXh0Q29udGVudCk7XFxuICAgICAgICBjb25zb2xlLmxvZyhgUHJveHkgU3RhdHVzOiAke3N0YXR1c31gKTtcXG4gICAgfSxcXG59KTtcXG5cXG5jb25zb2xlLmxvZygnUnVubmluZyBQdXBwZXRlZXIgc2NyaXB0Li4uJyk7XFxuXFxuYXdhaXQgY3Jhd2xlci5ydW4oWydodHRwOi8vcHJveHkuYXBpZnkuY29tJ10pO1xcblxcbmNvbnNvbGUubG9nKCdQdXBwZXRlZXIgY2xvc2VkLicpO1xcblxcbmF3YWl0IEFjdG9yLmV4aXQoKTtcXG5cIn0iLCJvcHRpb25zIjp7ImJ1aWxkIjoibGF0ZXN0IiwiY29udGVudFR5cGUiOiJhcHBsaWNhdGlvbi9qc29uOyBjaGFyc2V0PXV0Zi04IiwibWVtb3J5Ijo0MDk2LCJ0aW1lb3V0IjoxODB9fQ.Z2NfopKj1DbaGy58OZ3N2Og8hM7AvkFTeEbBFCwOtGk\&asrc=run_on_apify) ``` import { Actor } from 'apify'; import { PuppeteerCrawler } from 'crawlee'; await Actor.init(); // Proxy connection is automatically established in the Crawler const proxyConfiguration = await Actor.createProxyConfiguration(); const crawler = new PuppeteerCrawler({ proxyConfiguration, async requestHandler({ page }) { const status = await page.$eval('td.status', (el) => el.textContent); console.log(`Proxy Status: ${status}`); }, }); console.log('Running Puppeteer script...'); await crawler.run(['http://proxy.apify.com']); console.log('Puppeteer closed.'); await Actor.exit(); ``` --- # Apify Platform Apify is a [platform](https://apify.com) built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to [compute instances (Actors)](#what-is-an-actor), convenient [request](https://docs.apify.com/sdk/js/sdk/js/docs/guides/request-storage.md) and [result](https://docs.apify.com/sdk/js/sdk/js/docs/guides/result-storage.md) storages, [proxies](https://docs.apify.com/sdk/js/sdk/js/docs/guides/proxy-management.md), [scheduling](https://docs.apify.com/scheduler), [webhooks](https://docs.apify.com/webhooks) and [more](https://docs.apify.com/), accessible through a [web interface](https://console.apify.com) or an [API](https://docs.apify.com/api). While we think that the Apify platform is super cool, and it's definitely worth signing up for a [free account](https://console.apify.com/sign-up), **Crawlee is and will always be open source**, runnable locally or on any cloud infrastructure. note We do not test Crawlee in other cloud environments such as Lambda or on specific architectures such as Raspberry PI. We strive to make it work, but there are no guarantees. ## Logging into Apify platform from Crawlee[​](#logging-into-apify-platform-from-crawlee "Direct link to heading") To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by utilizing [Apify CLI](https://github.com/apify/apify-cli) or with environment variables. 
Once you provide credentials to your scraper, you will be able to use all the Apify platform features, such as calling actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on. ### Log in with CLI[​](#log-in-with-cli "Direct link to heading") Apify CLI allows you to log in to your Apify account on your computer. If you then run your scraper using the CLI, your credentials will automatically be added. ``` npm install -g apify-cli apify login -t YOUR_API_TOKEN ``` ### Log in with environment variables[​](#log-in-with-environment-variables "Direct link to heading") Alternatively, you can always provide credentials to your scraper by setting the [`APIFY_TOKEN`](#apify_token) environment variable to your API token. > There's also the [`APIFY_PROXY_PASSWORD`](#apify_proxy_password) environment variable. Actor automatically infers that from your token, but it can be useful when you need to access proxies from a different account than your token represents. ### Log in with Configuration[​](#log-in-with-configuration "Direct link to heading") Another option is to use the [`Configuration`](https://sdk.apify.com/api/apify/class/Configuration) instance and set your API token there. ``` import { Actor } from 'apify'; const sdk = new Actor({ token: 'your_api_token' }); ``` ## What is an actor[​](#what-is-an-actor "Direct link to heading") When you deploy your script to the Apify platform, it becomes an [actor](https://apify.com/actors). An actor is a serverless microservice that accepts an input and produces an output. It can run for a few seconds, hours or even infinitely. An actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset. Actors can be shared in the [Apify Store](https://apify.com/store) so that other people can use them. But don't worry, if you share your actor in the store and somebody uses it, it runs under their account, not yours. **Related links** * [Store of existing actors](https://apify.com/store) * [Documentation](https://docs.apify.com/actors) * [View actors in Apify Console](https://console.apify.com/actors) * [API reference](https://apify.com/docs/api/v2#/reference/actors) ## Running an actor locally[​](#running-an-actor-locally "Direct link to heading") First, let's create a boilerplate for a new actor. You could use Apify CLI and just run: ``` apify create my-hello-world ``` The CLI will prompt you to select a project boilerplate template - let's pick "Hello world". The tool will create a directory called `my-hello-world` containing the Node.js project files. You can run the actor as follows: ``` cd my-hello-world apify run ``` ## Running Crawlee code as an actor[​](#running-crawlee-code-as-an-actor "Direct link to heading") To run Crawlee code as an actor on the [Apify platform](https://apify.com/actors), you should either: * use a combination of the [`Actor.init()`](https://sdk.apify.com/api/apify/class/Actor#init) and [`Actor.exit()`](https://sdk.apify.com/api/apify/class/Actor#exit) functions; * or wrap it in the [`Actor.main()`](https://sdk.apify.com/api/apify/class/Actor#main) function. NOTE * Adding [`Actor.init()`](https://sdk.apify.com/api/apify/class/Actor#init) and [`Actor.exit()`](https://sdk.apify.com/api/apify/class/Actor#exit) to your code is all that's needed to run it on the Apify platform as an actor. `Actor.init()` is needed to initialize your actor (e.g.
to set the correct storage implementation), while without `Actor.exit()` the process will simply never stop. * [`Actor.main()`](https://sdk.apify.com/api/apify/class/Actor#main) is an alternative to `Actor.init()` and `Actor.exit()` as it calls both behind the scenes. Let's look at the `CheerioCrawler` example from the [Quick Start](https://crawlee.dev/docs/quick-start) guide: * Using Actor.main() * Using Actor.init() and Actor.exit() ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.main(async () => { const crawler = new CheerioCrawler({ async requestHandler({ request, $, enqueueLinks }) { const { url } = request; // Extract HTML title of the page. const title = $('title').text(); console.log(`Title of ${url}: ${title}`); // Add URLs that match the provided pattern. await enqueueLinks({ globs: ['https://www.iana.org/*'], }); // Save extracted data to dataset. await Actor.pushData({ url, title }); }, }); // Enqueue the initial request and run the crawler await crawler.run(['https://www.iana.org/']); }); ``` ``` import { Actor } from 'apify'; import { CheerioCrawler } from 'crawlee'; await Actor.init(); const crawler = new CheerioCrawler({ async requestHandler({ request, $, enqueueLinks }) { const { url } = request; // Extract HTML title of the page. const title = $('title').text(); console.log(`Title of ${url}: ${title}`); // Add URLs that match the provided pattern. await enqueueLinks({ globs: ['https://www.iana.org/*'], }); // Save extracted data to dataset. await Actor.pushData({ url, title }); }, }); // Enqueue the initial request and run the crawler await crawler.run(['https://www.iana.org/']); await Actor.exit(); ``` Note that you could also run your actor (that is using Crawlee) locally with Apify CLI. You could start it via the following command in your project folder: ``` apify run ``` ## Deploying an actor to Apify platform[​](#deploying-an-actor-to-apify-platform "Direct link to heading") Now (assuming you are already logged in to your Apify account) you can easily deploy your code to the Apify platform by running: ``` apify push ``` Your script will be uploaded to and built on the Apify platform so that it can be run there. For more information, view the [Apify Actor](https://docs.apify.com/cli) documentation. ## Usage on Apify platform[​](#usage-on-apify-platform "Direct link to heading") You can also develop your actor in an online code editor directly on the platform (you'll need an Apify Account). Let's go to the [Actors](https://console.apify.com/actors) page in the app, click *Create new* and then go to the *Source* tab and start writing the code or paste one of the examples from the [Examples](https://docs.apify.com/sdk/js/sdk/js/docs/examples) section. ## Storages[​](#storages "Direct link to heading") There are several things worth mentioning here. 
### Helper functions for default Key-Value Store and Dataset[​](#helper-functions-for-default-key-value-store-and-dataset "Direct link to heading") To simplify access to the *default* storages, instead of using the helper functions of respective storage classes, you could use: * [`Actor.setValue()`](https://sdk.apify.com/api/apify/class/Actor#setValue), [`Actor.getValue()`](https://sdk.apify.com/api/apify/class/Actor#getValue), [`Actor.getInput()`](https://sdk.apify.com/api/apify/class/Actor#getInput) for `Key-Value Store` * [`Actor.pushData()`](https://sdk.apify.com/api/apify/class/Actor#pushData) for `Dataset` ### Using platform storage in a local actor[​](#using-platform-storage-in-a-local-actor "Direct link to heading") When you plan to use the platform storage while developing and running your actor locally, you should use [`Actor.openKeyValueStore()`](https://sdk.apify.com/api/apify/class/Actor#openKeyValueStore), [`Actor.openDataset()`](https://sdk.apify.com/api/apify/class/Actor#openDataset) and [`Actor.openRequestQueue()`](https://sdk.apify.com/api/apify/class/Actor#openRequestQueue) to open the respective storage. Each of these methods allows you to pass the [`OpenStorageOptions`](https://sdk.apify.com/api/apify/interface/OpenStorageOptions) as a second argument, which has only one optional property: [`forceCloud`](https://sdk.apify.com/api/apify/interface/OpenStorageOptions#forceCloud). If set to `true`, cloud storage will be used instead of the folder on the local disk. note If you don't plan to force usage of the platform storages when running the actor locally, there is no need to use the [`Actor`](https://sdk.apify.com/api/apify/class/Actor) class for it. The Crawlee variants [`KeyValueStore.open()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#open), [`Dataset.open()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#open) and [`RequestQueue.open()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md#open) will work the same. ### Getting public url of an item in the platform storage[​](#getting-public-url-of-an-item-in-the-platform-storage "Direct link to heading") If you need to share a link to some file stored in a Key-Value Store on the Apify Platform, you can use the [`getPublicUrl()`](https://sdk.apify.com/api/apify/class/KeyValueStore#getPublicUrl) method. It accepts only one parameter: `key` - the key of the item you want to share. ``` import { KeyValueStore } from 'apify'; const store = await KeyValueStore.open(); await store.setValue('your-file', { foo: 'bar' }); const url = store.getPublicUrl('your-file'); // https://api.apify.com/v2/key-value-stores/<store-id>/records/your-file ``` ### Exporting dataset data[​](#exporting-dataset-data "Direct link to heading") When the [`Dataset`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) is stored on the [Apify platform](https://apify.com/actors), you can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. The datasets are displayed on the actor run details page and in the [Storage](https://console.apify.com/storage) section in the Apify Console. The actual data is exported using the [Get dataset items](https://apify.com/docs/api/v2#/reference/datasets/item-collection/get-items) Apify API endpoint. This way you can easily share the crawling results.
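If you need the export programmatically rather than through the Console, the same endpoint can be called over plain HTTP. The following is only a minimal sketch, not an official recipe: the dataset ID and the output format are placeholders, and it assumes your API token is available in the `APIFY_TOKEN` environment variable and that you run it on Node.js 18+ where `fetch` is available globally.

```
import { writeFile } from 'node:fs/promises';

// Placeholders - replace with your own dataset ID and preferred format.
const datasetId = 'YOUR_DATASET_ID';
const format = 'csv'; // e.g. json, csv, xlsx, html, xml, rss

const url = `https://api.apify.com/v2/datasets/${datasetId}/items?format=${format}`;
const response = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.APIFY_TOKEN}` },
});

// Write the exported items to a local file.
await writeFile(`dataset-export.${format}`, Buffer.from(await response.arrayBuffer()));
console.log(`Saved dataset ${datasetId} as dataset-export.${format}`);
```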
**Related links** * [Apify platform storage documentation](https://docs.apify.com/storage) * [View storage in Apify Console](https://console.apify.com/storage) * [Key-value stores API reference](https://apify.com/docs/api/v2#/reference/key-value-stores) * [Datasets API reference](https://docs.apify.com/api/v2#/reference/datasets) * [Request queues API reference](https://docs.apify.com/api/v2#/reference/request-queues) ## Environment variables[​](#environment-variables "Direct link to heading") The following are some additional environment variables specific to the Apify platform. More Crawlee-specific environment variables can be found in the [Environment Variables](https://crawlee.dev/docs/guides/configuration#environment-variables) guide. note It's important to note that `CRAWLEE_` environment variables don't need to be replaced with equivalent `APIFY_` ones. Likewise, Crawlee understands `APIFY_` environment variables after calling `Actor.init()` or when using `Actor.main()`. ### `APIFY_TOKEN`[​](#apify_token "Direct link to heading") The API token for your Apify account. It is used to access the Apify API, e.g. to access cloud storage or to run an actor on the Apify platform. You can find your API token on the [Account Settings / Integrations](https://console.apify.com/account?tab=integrations) page. ### Combinations of `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`[​](#combinations-of-apify_token-and-crawlee_storage_dir "Direct link to heading") > The `CRAWLEE_STORAGE_DIR` env variable is described in the [Environment Variables](https://crawlee.dev/docs/guides/configuration#crawlee_storage_dir) guide. By combining the env vars in various ways, you can greatly influence the actor's behavior.

| Env Vars | API | Storages |
| --------------------------------------- | --- | ---------------- |
| none OR `CRAWLEE_STORAGE_DIR` | no | local |
| `APIFY_TOKEN` | yes | Apify platform |
| `APIFY_TOKEN` AND `CRAWLEE_STORAGE_DIR` | yes | local + platform |

When using both `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`, you can use all the Apify platform features and your data will be stored locally by default. If you want to access platform storages, you can use the `{ forceCloud: true }` option in their respective functions. ``` import { Actor } from 'apify'; import { Dataset } from 'crawlee'; // or Dataset.open('my-local-data') const localDataset = await Actor.openDataset('my-local-data'); // but here we need the `Actor` class const remoteDataset = await Actor.openDataset('my-dataset', { forceCloud: true, }); ``` ### `APIFY_PROXY_PASSWORD`[​](#apify_proxy_password "Direct link to heading") Optional password to [Apify Proxy](https://docs.apify.com/proxy) for IP address rotation. Assuming your Apify account has already been created, you can find the password on the [Proxy page](https://console.apify.com/proxy) in the Apify Console. The password is automatically inferred using the `APIFY_TOKEN` env var, so in most cases, you don't need to touch it. You should use it when, for some reason, you need access to Apify Proxy, but not access to the Apify API, or when you need access to the proxy from a different account than your token represents. ## Proxy management[​](#proxy-management "Direct link to heading") In addition to your own proxy servers and proxy servers acquired from third-party providers used together with Crawlee, you can also rely on [Apify Proxy](https://apify.com/proxy) for your scraping needs.
### Apify Proxy[​](#apify-proxy "Direct link to heading") If you are already subscribed to Apify Proxy, you can start using it immediately in only a few lines of code (for local usage, you should first be [logged in](#logging-into-apify-platform-from-crawlee) to your Apify account). ``` import { Actor } from 'apify'; const proxyConfiguration = await Actor.createProxyConfiguration(); const proxyUrl = await proxyConfiguration.newUrl(); ``` Note that unlike using your own proxies in Crawlee, you shouldn't use the constructor to create a [`ProxyConfiguration`](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md) instance. To use Apify Proxy, you should create an instance using the [`Actor.createProxyConfiguration()`](https://sdk.apify.com/api/apify/class/Actor#createProxyConfiguration) function instead. ### Apify Proxy Configuration[​](#apify-proxy-configuration "Direct link to heading") With Apify Proxy, you can select specific proxy groups to use, or countries to connect from. This allows you to get better proxy performance after some initial research. ``` import { Actor } from 'apify'; const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], countryCode: 'US', }); const proxyUrl = await proxyConfiguration.newUrl(); ``` Now your crawlers will use only Residential proxies from the US. Note that you must first get access to a proxy group before you are able to use it. You can check proxy groups available to you in the [proxy dashboard](https://console.apify.com/proxy). ### Apify Proxy vs. Own proxies[​](#apify-proxy-vs-own-proxies "Direct link to heading") The `ProxyConfiguration` class covers both Apify Proxy and custom proxy URLs so that you can easily switch between proxy providers. However, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy. It's not a single proxy server, but an API endpoint that allows connection through millions of different IP addresses. So the class essentially has two modes: Apify Proxy or Own (third-party) proxy. The difference is easy to remember. * If you're using your own proxies, you should create an instance with the ProxyConfiguration [`constructor`](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md#constructor) function based on the provided [`ProxyConfigurationOptions`](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md). * If you are planning to use Apify Proxy, you should create an instance using the [`Actor.createProxyConfiguration()`](https://sdk.apify.com/api/apify/class/Actor#createProxyConfiguration) function. [`ProxyConfigurationOptions.proxyUrls`](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md#proxyUrls) and [`ProxyConfigurationOptions.newUrlFunction`](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md#newUrlFunction) enable use of your custom proxy URLs, whereas all the other options are there to configure Apify Proxy. **Related links** * [Apify Proxy docs](https://docs.apify.com/proxy) --- # Running in Docker Running headless browsers in Docker requires a lot of setup to do it right. But there's no need to worry about that, because we already created base images that you can freely use. We use them every day on the [Apify Platform](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md).
All images can be found in their [GitHub repo](https://github.com/apify/apify-actor-docker) and in our [DockerHub](https://hub.docker.com/orgs/apify). ## Overview[​](#overview "Direct link to heading") Browsers are pretty big, so we try to provide a wide variety of images to suit the specific needs. Here's a full list of our Docker images. * [`apify/actor-node`](#actor-node) * [`apify/actor-node-puppeteer-chrome`](#actor-node-puppeteer-chrome) * [`apify/actor-node-playwright`](#actor-node-playwright) * [`apify/actor-node-playwright-chrome`](#actor-node-playwright-chrome) * [`apify/actor-node-playwright-firefox`](#actor-node-playwright-firefox) * [`apify/actor-node-playwright-webkit`](#actor-node-playwright-webkit) ## Versioning[​](#versioning "Direct link to heading") Each image is tagged with up to 2 version tags, depending on the type of the image. One for Node.js version and second for pre-installed web automation library version. If you use the image name without a version tag, you'll always get the latest available version. > We recommend always using at least the Node.js version tag in production Dockerfiles. It will ensure that a future update of Node.js will not break our automations. ### Node.js versioning[​](#nodejs-versioning "Direct link to heading") Our images are built with multiple Node.js versions to ensure backwards compatibility. Currently, Node.js **versions 16 and 18 are supported** (legacy versions still exist, see DockerHub). To select the preferred version, use the appropriate number as the image tag. ``` # Use Node.js 16 FROM apify/actor-node:16 ``` ### Automation library versioning[​](#automation-library-versioning "Direct link to heading") Images that include a pre-installed automation library, which means all images that include `puppeteer` or `playwright` in their name, are also tagged with the pre-installed version of the library. For example, `apify/actor-node-puppeteer-chrome:16-13.7.0` comes with Node.js 16 and Puppeteer v13.7.0. If you try to install a different version of Puppeteer into this image, you may run into compatibility issues, because the Chromium version bundled with `puppeteer` will not match the version of Chromium that's pre-installed. Similarly `apify/actor-node-playwright-firefox:14-1.21.1` runs on Node.js 14 and is pre-installed with the Firefox version that comes with v1.21.1. Installing `apify/actor-node-puppeteer-chrome` (without a tag) will install the latest available version of Node.js and `puppeteer`. ### Pre-release tags[​](#pre-release-tags "Direct link to heading") We also build pre-release versions of the images to test the changes we make. Those are typically denoted by a `beta` suffix, but it can vary depending on our needs. If you need to try a pre-release version, you can do it like this: ``` # Without library version. FROM apify/actor-node:16-beta ``` ``` # With library version. FROM apify/actor-node-playwright-chrome:16-1.10.0-beta ``` ## Best practices[​](#best-practices "Direct link to heading") * Node.js version tag should **always** be used. * The automation library version tag should be used for **added security**. * Asterisk `*` should be used as the automation library version in our `package.json` files. It makes sure the pre-installed version of Puppeteer or Playwright is not re-installed on build. This is important, because those libraries are only guaranteed to work with specific versions of browsers, and those browsers come pre-installed in the image. 
``` FROM apify/actor-node-playwright-chrome:16 ``` ``` { "dependencies": { "crawlee": "^3.0.0", "playwright": "*" } } ``` ### Warning about image size[​](#warning-about-image-size "Direct link to heading") Browsers are huge. If you don't need them all in your image, it's better to use a smaller image with only the one browser you need. You should also be careful when installing new dependencies. Nothing prevents you from installing Playwright into the `actor-node-puppeteer-chrome` image, but the resulting image will be about 3 times larger and extremely slow to download and build. When you use only what you need, you'll be rewarded with reasonable build and start times. ## Apify Docker Images[​](#apify-docker-images "Direct link to heading") ### actor-node[​](#actor-node "Direct link to heading") This is the smallest image we have, based on Alpine Linux. It does not include any browsers, and it's therefore best used with [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler). It benefits from lightning fast builds and container startups. [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler), [`PlaywrightCrawler`](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler) and other browser-based features will **NOT** work with this image. ``` FROM apify/actor-node:16 ``` ### actor-node-puppeteer-chrome[​](#actor-node-puppeteer-chrome "Direct link to heading") This image includes Puppeteer (Chromium) and the Chrome browser. It can be used with [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) and [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler), but **NOT** with [`PlaywrightCrawler`](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler). The image supports XVFB by default, so you can run both `headless` and `headful` browsers with it. ``` FROM apify/actor-node-puppeteer-chrome:16 ``` ### actor-node-playwright[​](#actor-node-playwright "Direct link to heading") A very large and slow image that can run all Playwright browsers: Chromium, Chrome, Firefox, WebKit. Everything is installed. If you need to develop or test with multiple browsers, this is the image to choose, but in most cases, it's better to use the specialized images below. ``` FROM apify/actor-node-playwright:16 ``` ### actor-node-playwright-chrome[​](#actor-node-playwright-chrome "Direct link to heading") Similar to [`actor-node-puppeteer-chrome`](#actor-node-puppeteer-chrome), but for Playwright. You can run [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) and [`PlaywrightCrawler`](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler), but **NOT** [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler). It uses the [`PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD`](https://playwright.dev/docs/api/environment-variables/) environment variable to block installation of more browsers into the image to keep it small. If you want more browsers, either use the [`actor-node-playwright`](#actor-node-playwright) image or override this env var. The image supports XVFB by default, so we can run both `headless` and `headful` browsers with it. ``` FROM apify/actor-node-playwright-chrome:16 ``` ### actor-node-playwright-firefox[​](#actor-node-playwright-firefox "Direct link to heading") Same idea as [`actor-node-playwright-chrome`](#actor-node-playwright-chrome), but with Firefox pre-installed.
``` FROM apify/actor-node-playwright-firefox:16 ``` ### actor-node-playwright-webkit[​](#actor-node-playwright-webkit "Direct link to heading") Same idea as [`actor-node-playwright-chrome`](#actor-node-playwright-chrome), but with WebKit pre-installed. ``` FROM apify/actor-node-playwright-webkit:16 ``` ## Example Dockerfile[​](#example-dockerfile "Direct link to heading") To use the above images, it's necessary to have a [`Dockerfile`](https://docs.docker.com/engine/reference/builder/). You can either use this example, or bootstrap your projects with the [Crawlee CLI](https://crawlee.dev/docs/introduction/setting-up) which automatically adds the correct Dockerfile into our project folder. * Node+JavaScript * Node+TypeScript * Browser+JavaScript * Browser+TypeScript ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node:16 # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY . ./ # Run the image. CMD npm start --silent ``` ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node:16 AS builder # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install all dependencies. Don't audit to speed up the installation. RUN npm install --include=dev --audit=false # Next, copy the source files using the user set # in the base image. COPY . ./ # Install all dependencies and build the project. # Don't audit to speed up the installation. RUN npm run build # Create final image FROM apify/actor-node:16 # Copy only built JS files from builder image COPY --from=builder /usr/src/app/dist ./dist # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY . ./ # Run the image. CMD npm run start:prod --silent ``` This example is for Playwright. If you want to use Puppeteer, simply replace **playwright** with **puppeteer** in the `FROM` declaration. ``` # Specify the base Docker image. 
You can read more about # the available images at https://crawlee.dev/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node-playwright-chrome:16 # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser . ./ # Run the image. CMD npm start --silent ``` This example is for Playwright. If you want to use Puppeteer, simply replace **playwright** with **puppeteer** in both `FROM` declarations. ``` # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node-playwright-chrome:16 AS builder # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install all dependencies. Don't audit to speed up the installation. RUN npm install --include=dev --audit=false # Next, copy the source files using the user set # in the base image. COPY --chown=myuser . ./ # Install all dependencies and build the project. # Don't audit to speed up the installation. RUN npm run build # Create final image FROM apify/actor-node-playwright-chrome:16 # Copy only built JS files from builder image COPY --from=builder --chown=myuser /home/myuser/dist ./dist # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser . ./ # Run the image. If you know you won't need headful browsers, # you can remove the XVFB start script for a micro perf gain. CMD ./start_xvfb_and_run_cmd.sh && npm run start:prod --silent ``` --- # Environment Variables The following is a list of the environment variables used by Apify SDK that are available to the user. The SDK is capable of running without any env vars present, but certain features will only become available after env vars are properly set. You can use [Apify CLI](https://github.com/apify/apify-cli) to set the env vars for you. [Apify platform](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md) also sets the variables automatically. 
## Important env vars[​](#important-env-vars "Direct link to heading") The following environment variables have large impact on the way Apify SDK works and its behavior can be changed significantly by setting or unsetting them. ### `APIFY_LOCAL_STORAGE_DIR`[​](#apify_local_storage_dir "Direct link to heading") Defines the path to a local directory where [`KeyValueStore`](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md), [`Dataset`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md), and [`RequestQueue`](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) store their data. Typically, it is set to `./storage`. If omitted, you should define the [`APIFY_TOKEN`](#apify_token) environment variable instead. ### `APIFY_TOKEN`[​](#apify_token "Direct link to heading") The API token for your Apify account. It is used to access the Apify API, e.g. to access cloud storage or to run an actor on the Apify platform. You can find your API token on the [Account - Integrations](https://console.apify.com/account#/integrations) page. If omitted, you should define the `APIFY_LOCAL_STORAGE_DIR` environment variable instead. ### Combinations of `APIFY_LOCAL_STORAGE_DIR` and `APIFY_TOKEN`[​](#combinations-of-apify_local_storage_dir-and-apify_token "Direct link to heading") By combining the env vars in various ways, you can greatly influence the behavior of Apify SDK. | Env Vars | API | Storages | | ------------------------------------------- | --- | -------------- | | none OR `APIFY_LOCAL_STORAGE_DIR` | no | local | | `APIFY_TOKEN` | yes | Apify platform | | `APIFY_TOKEN` AND `APIFY_LOCAL_STORAGE_DIR` | yes | local+platform | When using both `APIFY_TOKEN` and `APIFY_LOCAL_STORAGE_DIR`, you can use all the Apify platform features and your data will be stored locally by default. If you want to access platform storages, you can use the `{ forceCloud: true }` option in their respective functions. ``` const localDataset = await Actor.openDataset('my-local-data'); const remoteDataset = await Actor.openDataset('my-remote-data', { forceCloud: true, }); ``` ## Convenience env vars[​](#convenience-env-vars "Direct link to heading") The next group includes env vars that can help achieve certain goals without having to change your code, such as temporarily switching log level to DEBUG. ### `APIFY_HEADLESS`[​](#apify_headless "Direct link to heading") If set to `1`, web browsers launched by Apify SDK will run in the headless mode. You can still override this setting in the code, e.g. by passing the `headless: true` option to the [`Actor.launchPuppeteer()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#launchPuppeteer) function. But having this setting in an environment variable allows you to develop the crawler locally in headful mode to simplify the debugging, and only run the crawler in headless mode once you deploy it to the Apify platform. By default, the browsers are launched in headful mode, i.e. with windows. ### `APIFY_LOG_LEVEL`[​](#apify_log_level "Direct link to heading") Specifies the minimum log level, which can be one of the following values (in order of severity): `DEBUG`, `INFO`, `WARNING` and `ERROR`. By default, the log level is set to `INFO`, which means that `DEBUG` messages are not printed to console. See the [`utils.log`](https://crawlee.dev/api/core/class/Log) namespace for logging utilities. 
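For instance, you can switch these convenience variables for a single local run directly on the command line (a small sketch; `main.js` stands in for your own entry point):

```
APIFY_LOG_LEVEL=DEBUG APIFY_HEADLESS=1 node main.js
```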
### `APIFY_MEMORY_MBYTES`[​](#apify_memory_mbytes "Direct link to heading")

Sets the amount of system memory in megabytes to be used by the [`AutoscaledPool`](https://crawlee.dev/api/core/class/AutoscaledPool). It is used to limit the number of concurrently running tasks. By default, the maximum amount of memory to be used is set to one quarter of total system memory, i.e. on a system with 8192 MB of memory, the autoscaling feature will only use up to 2048 MB of memory.

### `APIFY_PROXY_PASSWORD`[​](#apify_proxy_password "Direct link to heading")

Optional password to [Apify Proxy](https://docs.apify.com/proxy) for IP address rotation. If you have an Apify account, you can find the password on the [Proxy page](https://console.apify.com/proxy) in the Apify Console. The password is automatically inferred from the `APIFY_TOKEN` env var, so in most cases, you don't need to touch it. You should use it when, for some reason, you need access to Apify Proxy but not to the Apify API, or when you need to access the proxy from a different account than the one your token represents.

---

# Pay-per-event Monetization

Apify provides several [pricing models](https://docs.apify.com/platform/actors/publishing/monetize) for monetizing your Actors. The most recent and most flexible one is [pay-per-event](https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event), which lets you charge your users programmatically directly from your Actor. As the name suggests, you may charge the users each time a specific event occurs, for example a call to an external API or when you return a result.

To use the pay-per-event pricing model, you first need to [set it up](https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event) for your Actor in the Apify console. After that, you're free to start charging for events.

## Charging for events[​](#charging-for-events "Direct link to heading")

After monetization is set up in the Apify console, you can add [`Actor.charge`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#charge) calls to your code and start monetizing!

```
import { Actor } from 'apify';

await Actor.init();

// Charge for a single occurrence of an event
await Actor.charge({ eventName: 'init' });

// Prepare some mock results
const result = [
    { word: 'Lorem' },
    { word: 'Ipsum' },
    { word: 'Dolor' },
    { word: 'Sit' },
    { word: 'Amet' },
];

// Shortcut for charging for each pushed dataset item
await Actor.pushData(result, 'result-item');

// Or you can charge for a given number of events manually
await Actor.charge({
    eventName: 'result-item',
    count: result.length,
});

await Actor.exit();
```

Then you just push your code to Apify and that's it! The SDK will even keep track of the max total charge setting for you, so you will not provide more value than what the user chose to pay for.

If you need finer control over charging, you can call [`Actor.getChargingManager()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#getChargingManager) to access the [`ChargingManager`](https://docs.apify.com/sdk/js/sdk/js/reference/class/ChargingManager.md), which can provide more detailed information, for example how many events of each type can still be charged before reaching the configured limit.
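For illustration, the following sketch queries the remaining budget for a single event type before pushing more results. It assumes the `calculateMaxEventChargeCountWithinLimit()` method described in the `ChargingManager` reference and a `result-item` event configured for the Actor; verify both against your SDK version and setup:

```
import { Actor } from 'apify';

await Actor.init();

const chargingManager = Actor.getChargingManager();

// How many more 'result-item' events fit under the user's spending limit?
// (method name taken from the ChargingManager reference linked above)
const remaining = chargingManager.calculateMaxEventChargeCountWithinLimit('result-item');
console.log(`Up to ${remaining} more result items can be charged in this run.`);

await Actor.exit();
```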
## Transitioning from a different pricing model[​](#transitioning-from-a-different-pricing-model "Direct link to heading")

When you plan to start using the pay-per-event pricing model for an Actor that is already monetized with a different pricing model, your source code will need to support both pricing models during the transition period enforced by the Apify platform. Arguably the most frequent case is the transition from the pay-per-result model, which utilizes the `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable to prevent returning unpaid dataset items. The following is an example of how to handle such a scenario. The key part is the [`ChargingManager.getPricingInfo`](https://docs.apify.com/sdk/js/sdk/js/reference/class/ChargingManager.md#getPricingInfo) method, which returns information about the current pricing model.

```
import { Actor } from 'apify';

await Actor.init();

// Check the dataset because there might already be items if the run migrated or was restarted
const defaultDataset = await Actor.openDataset();
let chargedItems = (await defaultDataset.getInfo())!.itemCount;

if (Actor.getChargingManager().getPricingInfo().isPayPerEvent) {
    await Actor.pushData({ hello: 'world' }, 'dataset-item');
} else if (chargedItems < Number(process.env.ACTOR_MAX_PAID_DATASET_ITEMS)) {
    await Actor.pushData({ hello: 'world' });
    chargedItems += 1;
}

await Actor.exit();
```

## Local development[​](#local-development "Direct link to heading")

You are encouraged to test your monetization code on your machine before releasing it to the public. To tell your Actor that it should work in pay-per-event mode, pass it the `ACTOR_TEST_PAY_PER_EVENT` environment variable:

```
ACTOR_TEST_PAY_PER_EVENT=true npm start
```

If you also wish to see a log of all the events charged throughout the run, pass the `ACTOR_USE_CHARGING_LOG_DATASET` environment variable as well. Your charging dataset will then be available under the `charging_log` name (unless you change your storage settings, this dataset is stored in `storage/datasets/charging_log/`). Please note that this log is not available when running the Actor in production on the Apify platform. Because the pricing configuration is stored by the Apify platform and is not available locally, all events will have a default price of $1 during local development.

---

# Proxy Management

[IP address blocking](https://en.wikipedia.org/wiki/IP_address_blocking) is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy-to-use but powerful tools which can work around IP blocking. The most powerful weapon in your anti IP blocking arsenal is a [proxy server](https://en.wikipedia.org/wiki/Proxy_server). With Apify SDK you can use your own proxy servers, proxy servers acquired from third-party providers, or you can rely on [Apify Proxy](https://apify.com/proxy) for your scraping needs.

## Quick start[​](#quick-start "Direct link to heading")

If you are already subscribed to Apify Proxy or have proxy URLs of your own, you can start using them immediately in only a few lines of code.

> If you want to use Apify Proxy, make sure that your [scraper is logged in](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md).
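Locally, one common way to do that is to authenticate through the Apify CLI, which stores your token so the SDK can pick it up (a minimal sketch; see the CLI documentation for details):

```
apify login
```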
``` const proxyConfiguration = await Actor.createProxyConfiguration(); const proxyUrl = proxyConfiguration.newUrl(); ``` ``` const proxyConfiguration = await Actor.createProxyConfiguration({ proxyUrls: ['http://proxy-1.com', 'http://proxy-2.com'], }); const proxyUrl = proxyConfiguration.newUrl(); ``` ## Proxy Configuration[​](#proxy-configuration "Direct link to heading") All your proxy needs are managed by the [`ProxyConfiguration`](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md) class. You create an instance using the [`Actor.createProxyConfiguration()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#createProxyConfiguration) function. See the [`ProxyConfigurationOptions`](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md) for all the possible constructor options. ### Crawler integration[​](#crawler-integration "Direct link to heading") `ProxyConfiguration` integrates seamlessly into [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) and [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler). ``` const proxyConfiguration = await Actor.createProxyConfiguration({ /* your proxy opts */ }); const crawler = new CheerioCrawler({ proxyConfiguration, // ... }); ``` ``` const proxyConfiguration = await Actor.createProxyConfiguration({ /* your proxy opts */ }); const crawler = new PuppeteerCrawler({ proxyConfiguration, // ... }); ``` Your crawlers will now use the selected proxies for all connections. ### IP Rotation and session management[​](#ip-rotation-and-session-management "Direct link to heading") [`proxyConfiguration.newUrl()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md#newUrl) allows you to pass a `sessionId` parameter. It will then be used to create a `sessionId`-`proxyUrl` pair, and subsequent `newUrl()` calls with the same `sessionId` will always return the same `proxyUrl`. This is extremely useful in scraping, because you want to create the impression of a real user. See the [session management guide](https://docs.apify.com/sdk/js/sdk/js/docs/guides/session-management.md) and [`SessionPool`](https://crawlee.dev/api/core/class/SessionPool) class for more information on how keeping a real session helps you avoid blocking. When no `sessionId` is provided, your proxy URLs are rotated round-robin, whereas Apify Proxy manages their rotation using black magic to get the best performance. ``` const proxyConfiguration = await Actor.createProxyConfiguration({ /* opts */ }); const sessionPool = await SessionPool.open({ /* opts */ }); const session = await sessionPool.getSession(); const proxyUrl = proxyConfiguration.newUrl(session.id); ``` ``` const proxyConfiguration = await Actor.createProxyConfiguration({ /* opts */ }); const crawler = new PuppeteerCrawler({ useSessionPool: true, persistCookiesPerSession: true, proxyConfiguration, // ... }); ``` ## Apify Proxy vs. Your own proxies[​](#apify-proxy-vs-your-own-proxies "Direct link to heading") The `ProxyConfiguration` class covers both Apify Proxy and custom proxy URLs so that you can easily switch between proxy providers, however, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy. It's not a single proxy server, but an API endpoint that allows connection through millions of different IP addresses. So the class essentially has two modes: Apify Proxy or Your proxy. 
The difference is easy to remember. [`ProxyConfigurationOptions.proxyUrls`](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md#proxyUrls) and [`ProxyConfigurationOptions.newUrlFunction`](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md#newUrlFunction) enable use of your custom proxy URLs, whereas all the other options are there to configure Apify Proxy. Visit the [Apify Proxy docs](https://docs.apify.com/proxy) for more info on how these parameters work. ## Apify Proxy Configuration[​](#apify-proxy-configuration "Direct link to heading") With Apify Proxy, you can select specific proxy groups to use, or countries to connect from. This allows you to get better proxy performance after some initial research. ``` const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], countryCode: 'US', }); const proxyUrl = proxyConfiguration.newUrl(); ``` Now your crawlers will use only Residential proxies from the US. Note that you must first get access to a proxy group before you are able to use it. You can find your available proxy groups in the [proxy dashboard](https://console.apify.com/proxy). ## Inspecting current proxy in Crawlers[​](#inspecting-current-proxy-in-crawlers "Direct link to heading") `CheerioCrawler` and `PuppeteerCrawler` grant access to information about the currently used proxy in their `handlePageFunction` using a [`proxyInfo`](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyInfo.md) object. With the object, you can easily access the proxy URL. If you're using Apify Proxy, the other configuration parameters will also be available in the `proxyInfo` object. --- # Request Storage The Apify SDK has several request storage types that are useful for specific tasks. The requests are stored either on local disk to a directory defined by the `APIFY_LOCAL_STORAGE_DIR` environment variable, or on the [Apify platform](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md) under the user account identified by the API token defined by the `APIFY_TOKEN` environment variable. If neither of these variables is defined, by default Apify SDK sets `APIFY_LOCAL_STORAGE_DIR` to `./storage` in the current working directory and prints a warning. Typically, you will be developing the code on your local computer and thus set the `APIFY_LOCAL_STORAGE_DIR` environment variable. Once the code is ready, you will deploy it to the Apify platform, where it will automatically set the `APIFY_TOKEN` environment variable and thus use cloud storage. No code changes are needed. **Related links** * [Apify platform storage documentation](https://docs.apify.com/storage) * [View storage in Apify Console](https://console.apify.com/storage) * [Request queues API reference](https://docs.apify.com/api/v2#/reference/request-queues) ## Request queue[​](#request-queue "Direct link to heading") The request queue is a storage of URLs to crawl. The queue is used for the deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each actor run is associated with a **default request queue**, which is created exclusively for the actor run. Typically, it is used to store URLs to crawl in the specific actor run. Its usage is optional. In Apify SDK, the request queue is represented by the [`RequestQueue`](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) class. 
In local configuration, the request queue is emulated by [@apify/storage-local](https://github.com/apify/apify-storage-local-js) NPM package and its data is stored in SQLite database in the directory specified by the `APIFY_LOCAL_STORAGE_DIR` environment variable as follows: ``` {APIFY_LOCAL_STORAGE_DIR}/request_queues/{QUEUE_ID}/db.sqlite ``` Note that `{QUEUE_ID}` is the name or ID of the request queue. The default queue has ID `default`, unless you override it by setting the `APIFY_DEFAULT_REQUEST_QUEUE_ID` environment variable. The following code demonstrates basic operations of the request queue: ``` // Open the default request queue associated with the actor run const requestQueue = await RequestQueue.open(); // Enqueue the initial request await requestQueue.addRequest({ url: 'https://example.com' }); // The crawler will automatically process requests from the queue const crawler = new CheerioCrawler({ requestQueue, handlePageFunction: async ({ $, request }) => { // Add new request to the queue await requestQueue.addRequest({ url: 'https://example.com/new-page' }); // Add links found on page to the queue await Actor.utils.enqueueLinks({ $, requestQueue }); }, }); ``` To see more detailed example of how to use the request queue with a crawler, see the [Puppeteer Crawler](https://docs.apify.com/sdk/js/sdk/js/docs/examples/puppeteer-crawler.md) example. ## Request list[​](#request-list "Direct link to heading") The request list is not a storage per se - it represents the list of URLs to crawl that is stored in a run memory (or optionally in default [Key-Value Store](https://docs.apify.com/sdk/js/sdk/js/docs/guides/result-storage.md#key-value-store) associated with the run, if specified). The list is used for the crawling of a large number of URLs, when you know all the URLs which should be visited by the crawler and no URLs would be added during the run. The URLs can be provided either in code or parsed from a text file hosted on the web. Request list is created exclusively for the actor run and only if its usage is explicitly specified in the code. Its usage is optional. In Apify SDK, the request list is represented by the [`RequestList`](https://crawlee.dev/api/core/class/RequestList) class. The following code demonstrates basic operations of the request list: ``` // Prepare the sources array with URLs to visit const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]; // Open the request list. // List name is used to persist the sources and the list state in the key-value store const requestList = await RequestList.open('my-list', sources); // The crawler will automatically process requests from the list const crawler = new PuppeteerCrawler({ requestList, handlePageFunction: async ({ page, request }) => { // Process the page (extract data, take page screenshot, etc). // No more requests could be added to the request list here }, }); ``` To see more detailed example of how to use the request list with a crawler, see the [Puppeteer with proxy](https://docs.apify.com/sdk/js/sdk/js/docs/examples/puppeteer-with-proxy.md) example. ## Which one to choose?[​](#which-one-to-choose "Direct link to heading") When using Request queue - you would normally have several start URLs (e.g. category pages on e-commerce website) and then recursively add more (e.g. individual item pages) programmatically to the queue, it supports dynamic adding and removing of requests. 
No more URLs can be added to Request list after its initialization as it is immutable, URLs cannot be removed from the list either. On the other hand, the Request queue is not optimized for adding or removing numerous URLs in a batch. This is technically possible, but requests are added one by one to the queue, and thus it would take significant time with a larger number of requests. Request list however can contain even millions of URLs, and it would take significantly less time to add them to the list, compared to the queue. Note that Request queue and Request list can be used together by the same crawler. In such cases, each request from the Request list is enqueued into the Request queue first (to the foremost position in the queue, even if Request queue is not empty) and then consumed from the latter. This is necessary to avoid the same URL being processed more than once (from the list first and then possibly from the queue). In practical terms, such a combination can be useful when there are numerous initial URLs, but more URLs would be added dynamically by the crawler. The following code demonstrates how to use Request queue and Request list in the same crawler: ``` // Prepare the sources array with URLs to visit (it can contain millions of URLs) const sources = [ { url: 'http://www.example.com/page-1' }, { url: 'http://www.example.com/page-2' }, { url: 'http://www.example.com/page-3' }, ]; // Open the request list const requestList = await RequestList.open('my-list', sources); // Open the default request queue. It's not necessary to add any requests to the queue const requestQueue = await RequestQueue.open(); // The crawler will automatically process requests from the list and the queue const crawler = new PuppeteerCrawler({ requestList, requestQueue, // Each request from the request list is enqueued to the request queue one by one. // At this point request with the same URL would exist in the list and the queue handlePageFunction: async ({ request, page }) => { // Add new request to the queue await requestQueue.addRequest({ url: 'http://www.example.com/new-page', }); // Add links found on page to the queue await Actor.utils.enqueueLinks({ page, requestQueue }); // The requests above would be added to the queue (but not to the list) // and would be processed after the request list is empty. // No more requests could be added to the list here }, }); ``` --- # Result Storage The Apify SDK has several result storage types that are useful for specific tasks. The data is stored either on local disk to a directory defined by the `APIFY_LOCAL_STORAGE_DIR` environment variable, or on the [Apify platform](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md) under the user account identified by the API token defined by the `APIFY_TOKEN` environment variable. If neither of these variables is defined, by default Apify SDK sets `APIFY_LOCAL_STORAGE_DIR` to `./storage` in the current working directory and prints a warning. Typically, you will be developing the code on your local computer and thus set the `APIFY_LOCAL_STORAGE_DIR` environment variable. Once the code is ready, you will deploy it to the Apify platform, where it will automatically set the `APIFY_TOKEN` environment variable and thus use cloud storage. No code changes are needed. 
**Related links** * [Apify platform storage documentation](https://docs.apify.com/storage) * [View storage in Apify Console](https://console.apify.com/storage) * [Key-value stores API reference](https://apify.com/docs/api/v2#/reference/key-value-stores) * [Datasets API reference](https://docs.apify.com/api/v2#/reference/datasets) ## Key-value store[​](#key-value-store "Direct link to heading") The key-value store is used for saving and reading data records or files. Each data record is represented by a unique key and associated with a MIME content type. Key-value stores are ideal for saving screenshots of web pages, PDFs or to persist the state of crawlers. Each actor run is associated with a **default key-value store**, which is created exclusively for the actor run. By convention, the actor run input and output is stored in the default key-value store under the `INPUT` and `OUTPUT` key, respectively. Typically the input and output is a JSON file, although it can be any other format. In the Apify SDK, the key-value store is represented by the [`KeyValueStore`](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) class. In order to simplify access to the default key-value store, the SDK also provides [`Actor.getValue()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#getValue) and [`Actor.setValue()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#setValue) functions. In local configuration, the data is stored in the directory specified by the `APIFY_LOCAL_STORAGE_DIR` environment variable as follows: ``` {APIFY_LOCAL_STORAGE_DIR}/key_value_stores/{STORE_ID}/{KEY}.{EXT} ``` Note that `{STORE_ID}` is the name or ID of the key-value store. The default key value store has ID `default`, unless you override it by setting the `APIFY_DEFAULT_KEY_VALUE_STORE_ID` environment variable. The `{KEY}` is the key of the record and `{EXT}` corresponds to the MIME content type of the data value. The following code demonstrates basic operations of key-value stores: ``` // Get actor input from the default key-value store const input = await Actor.getInput(); // Write actor output to the default key-value store. await Actor.setValue('OUTPUT', { myResult: 123 }); // Open a named key-value store const store = await Actor.openKeyValueStore('some-name'); // Write record. JavaScript object is automatically converted to JSON, // strings and binary buffers are stored as they are await store.setValue('some-key', { foo: 'bar' }); // Read record. Note that JSON is automatically parsed to a JavaScript object, // text data returned as a string and other data is returned as binary buffer const value = await store.getValue('some-key'); // Delete record await store.setValue('some-key', null); ``` To see a real-world example of how to get the input from the key-value store, see the [Screenshots](https://docs.apify.com/sdk/js/sdk/js/docs/examples/capture-screenshot.md) example. ## Dataset[​](#dataset "Direct link to heading") Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine a dataset as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - you can only add new records to it but you cannot modify or remove existing records. 
When the dataset is stored on the [Apify platform](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md), you can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. The datasets are displayed on the actor run details page and in the [Storage](https://console.apify.com/storage) section in the Apify Console. The actual data is exported using the [Get dataset items](https://apify.com/docs/api/v2#/reference/datasets/item-collection/get-items) Apify API endpoint. This way you can easily share crawling results.

Each actor run is associated with a **default dataset**, which is created exclusively for the actor run. Typically, it is used to store crawling results specific to the actor run. Its usage is optional.

In the Apify SDK, the dataset is represented by the [`Dataset`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) class. In order to simplify writes to the default dataset, the SDK also provides the [`Actor.pushData()`](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#pushData) function.

In local configuration, the data is stored in the directory specified by the `APIFY_LOCAL_STORAGE_DIR` environment variable as follows:

```
{APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json
```

Note that `{DATASET_ID}` is the name or ID of the dataset. The default dataset has ID `default`, unless you override it by setting the `APIFY_DEFAULT_DATASET_ID` environment variable. Each dataset item is stored as a separate JSON file, where `{INDEX}` is a zero-based index of the item in the dataset.

The following code demonstrates basic operations of the dataset:

```
// Write a single row to the default dataset
await Actor.pushData({ col1: 123, col2: 'val2' });

// Open a named dataset
const dataset = await Actor.openDataset('some-name');

// Write a single row
await dataset.pushData({ foo: 'bar' });

// Write multiple rows
await dataset.pushData([{ foo: 'bar2', col2: 'val2' }, { col3: 123 }]);
```

To see how to use the dataset to store crawler results, see the [Cheerio Crawler](https://docs.apify.com/sdk/js/sdk/js/docs/examples/cheerio-crawler.md) example.

---

# Session Management

[`SessionPool`](https://crawlee.dev/api/core/class/SessionPool) is a class that allows you to handle the rotation of proxy IP addresses along with cookies and other custom settings in Apify SDK.

The main benefit of a Session pool is that you can filter out blocked or non-working proxies, so your actor does not retry requests over known blocked/non-working proxies. Another benefit of using SessionPool is that you can store information tied tightly to an IP address, such as cookies, auth tokens, and particular headers. Having your cookies and other identifiers used only with a specific IP will reduce the chance of being blocked. Last but not least, another benefit is the even rotation of IP addresses - SessionPool picks the session randomly, which should prevent burning out a small pool of available IPs.

Now let's take a look at how to use a Session pool.

**Example usage in [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler)**

```
const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PuppeteerCrawler({
    requestQueue,
    // To use the proxy IP session rotation logic, you must turn the proxy usage on.
    proxyConfiguration,
    // Activates the Session pool.
    useSessionPool: true,
    // Overrides default Session pool configuration
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
    // Set to true if you want the crawler to save cookies per session,
    // and set the cookies to page before navigation automatically.
    persistCookiesPerSession: true,
    handlePageFunction: async ({ request, page, session }) => {
        const title = await page.title();
        if (title === 'Blocked') {
            session.retire();
        } else if (title === 'Not sure if blocked, might also be a connection error') {
            session.markBad();
        } else {
            // session.markGood() - this step is done automatically in puppeteer pool.
        }
    },
});
```

**Example usage in [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler)**

```
const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new CheerioCrawler({
    requestQueue,
    // To use the proxy IP session rotation logic, you must turn the proxy usage on.
    proxyConfiguration,
    // Activates the Session pool.
    useSessionPool: true,
    // Overrides default Session pool configuration.
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
    // Set to true if you want the crawler to save cookies per session,
    // and set the cookie header to request automatically...
    persistCookiesPerSession: true,
    handlePageFunction: async ({ request, $, session }) => {
        const title = $('title').text();
        if (title === 'Blocked') {
            session.retire();
        } else if (title === 'Not sure if blocked, might also be a connection error') {
            session.markBad();
        } else {
            // session.markGood() - this step is done automatically in BasicCrawler.
        }
    },
});
```

**Example usage in [`BasicCrawler`](https://crawlee.dev/api/basic-crawler/class/BasicCrawler)**

```
const { gotScraping } = require('got-scraping');

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new BasicCrawler({
    requestQueue,
    // Allows access to proxyInfo object in handleRequestFunction
    proxyConfiguration,
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 100,
    },
    handleRequestFunction: async ({ request, session, proxyInfo }) => {
        // To use the proxy IP session rotation logic, you must turn the proxy usage on.
        const proxyUrl = proxyInfo.url;
        const requestOptions = {
            url: request.url,
            proxyUrl,
            throwHttpErrors: false,
            headers: {
                // If you want to use the cookie jar,
                // this is how you get the Cookie header string from the session.
                Cookie: session.getCookieString(),
            },
        };
        let response;

        try {
            response = await gotScraping(requestOptions);
        } catch (e) {
            if (e === 'SomeNetworkError') {
                // If a network error happens, such as timeout, socket hangup etc...
                // There is usually a chance that it was just bad luck and the proxy works.
                // No need to throw it away.
                session.markBad();
            }
            throw e;
        }

        // Automatically retires the session based on response HTTP status code.
        session.retireOnBlockedStatusCodes(response.statusCode);

        if (response.body.blocked) {
            // You are sure it is blocked.
            // This will throw away the session.
            session.retire();
        }

        // Everything is ok, you can get the data.
        // No need to call session.markGood -> BasicCrawler calls it for you.

        // If you want to use the cookie jar in the session, save the response cookies:
        session.setCookiesFromResponse(response);
    },
});
```

**Example solo usage**

```
Actor.main(async () => {
    const sessionPoolOptions = {
        maxPoolSize: 100,
    };
    const sessionPool = await SessionPool.open(sessionPoolOptions);

    // Get a session
    const session = await sessionPool.getSession();

    // Increase the errorScore.
    session.markBad();

    // Throw away the session
    session.retire();

    // Lower the errorScore and mark the session good.
    session.markGood();
});
```

These are the basics of configuring SessionPool. Please bear in mind that a Session pool needs time to find working IPs and build up the pool, so you will probably see a lot of errors until it stabilizes.

---

Apify SDK supports TypeScript by covering public APIs with type declarations. This allows writing code with auto-completion for TypeScript and JavaScript code alike. Besides that, actors written in TypeScript can take advantage of compile-time type-checking and avoid many coding mistakes, while providing documentation for functions, parameters and return values.

# Setting up a TypeScript project

To use TypeScript in your actors, you'll need the following prerequisites.

1. TypeScript compiler `tsc` installed somewhere:

   ```
   npm install --save-dev typescript
   ```

   TypeScript can be a development dependency in your project, as shown above. There's no need to pollute your production environment or your system's global repository with TypeScript.

2. A build script invoking `tsc` and a correctly specified `main` entry point defined in your `package.json`:

   ```
   {
       "scripts": {
           "build": "tsc -p tsconfig.json"
       },
       "main": "build/main.js"
   }
   ```

3. Type declarations for NodeJS, so you can take advantage of type-checking in all the features you'll use:

   ```
   npm install --save-dev @types/node
   ```

4. TypeScript configuration file allowing `tsc` to understand your project layout and the features used in your project:

   ```
   {
       "compilerOptions": {
           "target": "es2019",
           "module": "commonjs",
           "moduleResolution": "node",
           "strict": true,
           "noImplicitAny": false,
           "strictNullChecks": false,
           "lib": [
               "DOM", "DOM.Iterable", "ES2015", "ES2016", "ES2018",
               "ES2019.Object", "ES2018.AsyncIterable", "ES2020.String", "ES2019.Array"
           ],
           "rootDir": "src/",
           "outDir": "build/"
       },
       "include": ["src/"]
   }
   ```

   Place the content above inside a `tsconfig.json` in your root folder. Also, if you are a VSCode user who is using JavaScript, create a `jsconfig.json` with the same content, adding `"checkJs": true` to `"compilerOptions"`, so you can enjoy using the types in your `.js` source files.

# Auto-completion

IDE auto-completion should work in most places. That's true even if you are writing actors in pure JavaScript. Due to time constraints, some internal APIs have not been annotated yet; those annotations will be added as the SDK developers write new code and enhance old code.

# SDK Documentation

SDK documentation has grown a lot. There is a new API Reference section **Type definitions** that holds documentation for all constructible types, function parameters and return types in the Apify SDK.

# Caveats

As of version 1.0+, due to JSDoc limitations, the generated typings have some properties and parameters annotated with the `any` type. Therefore, enabling the `noImplicitAny` and `strictNullChecks` settings may not be advisable. You may try enabling them, but it might hinder development because of the typecasts needed to compile; your mileage may vary.
Besides the *implicit any* errors that might occur in the code when writing in TypeScript, the current typings don't offer generics that make outputs type-safe, so you need to declare your types yourself, like this:

```
interface MySchema {
    expectedParam1?: string;
    expectedParam2?: number;
}

const input: MySchema = (await Actor.getInput()) as any; // getInput returns Promise<object | string | Buffer | null> here

if (!input?.expectedParam1) {
    // input is MySchema now and you can check it in a type-safe way
    throw new Error('Missing expectedParam1');
}
```

There are also other places where you need to explicitly provide your interface/type, like in Dataset iterators (`map`, `reduce`, `forEach`):

```
interface ExpectedShape {
    id: string;
    someFields: string[];
}

const dataset = await Actor.openDataset();

await dataset.forEach((item: ExpectedShape) => {
    // deal with item.id / item.someFields
    // otherwise item is "any"
});
```

---

# Apify SDK: The scalable web crawling and scraping library for JavaScript

[![npm version](https://badge.fury.io/js/apify.svg)](https://www.npmjs.com/package/apify)

Apify SDK simplifies the development of web crawlers, scrapers, data extractors and web automation jobs. It provides tools to manage and automatically scale a pool of headless browsers, to maintain queues of URLs to crawl, store crawling results to a local filesystem or into the cloud, rotate proxies and much more. The SDK is available as the [`apify`](https://www.npmjs.com/package/apify) NPM package. It can be used either stand-alone in your own applications or in [actors](https://docs.apify.com/actor) running on the [Apify Cloud](https://apify.com/).

**View full documentation, guides and examples on the [Apify SDK project website](https://sdk.apify.com)**

---

# overview

## Overview[​](#overview "Direct link to heading")

The Apify SDK is available as the [`apify`](https://www.npmjs.com/package/apify) NPM package and it provides the following tools:

* [`Actor`](https://sdk.apify.com/api/apify/class/Actor) - Serves as an alternative approach to the static helpers exported from the package. This class can be used to control the current actor run and to interact with the actor's environment.
* [`ApifyClient`](https://sdk.apify.com/api/apify/class/ApifyClient) - Allows the user to interact with the Apify platform from code, control and schedule actors on the platform and access the result data stores.
* [`Configuration`](https://sdk.apify.com/api/apify/class/Configuration) - Helper class encapsulating the configuration of the current actor run.
* [`PlatformEventManager`](https://sdk.apify.com/api/apify/class/PlatformEventManager) - Event emitter for the platform and SDK events. Can be used to track actor run performance or serverless container migration.
* [`ProxyConfiguration`](https://sdk.apify.com/api/apify/class/ProxyConfiguration) - Configures connection to a proxy server with the provided options. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. The proxy servers are managed by Apify Proxy.
* [`RequestQueue`](https://sdk.apify.com/api/apify/class/RequestQueue) - Represents a queue of URLs to crawl, which is stored either on a local filesystem or in the [Apify Cloud](https://apify.com). The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders.
* [`Dataset`](https://sdk.apify.com/api/apify/class/Dataset) - Provides a store for structured data and enables their export to formats like JSON, JSONL, CSV, XML, Excel or HTML. The data is stored on a local filesystem or in the Apify Cloud. Datasets are useful for storing and sharing large tabular crawling results, such as a list of products or real estate offers. * [`KeyValueStore`](https://sdk.apify.com/api/apify/class/KeyValueStore) - A simple key-value store for arbitrary data records or files, along with their MIME content type. It is ideal for saving screenshots of web pages, PDFs or to persist the state of your crawlers. The data is stored on a local filesystem or in the Apify Cloud. Additionally, the package provides various helper functions to simplify running your code on the Apify Cloud and thus take advantage of its pool of proxies, job scheduler, data storage, etc. For more information, see the [Apify SDK Programmer's Reference](https://sdk.apify.com). --- # support ## Support[​](#support "Direct link to heading") If you find any bug or issue with the Apify SDK, please [submit an issue on GitHub](https://github.com/apify/apify-js/issues). For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify) or contact ## Contributing[​](#contributing "Direct link to heading") Your code contributions are welcome and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see [CONTRIBUTING.md](https://github.com/apify/apify-js/blob/master/CONTRIBUTING.md). ## License[​](#license "Direct link to heading") This project is licensed under the Apache License 2.0 - see the [LICENSE.md](https://github.com/apify/apify-js/blob/master/LICENSE.md) file for details. ## Acknowledgments[​](#acknowledgments "Direct link to heading") Many thanks to [Chema Balsas](https://www.npmjs.com/~jbalsas) for giving up the `apify` package name on NPM and renaming his project to [jsdocify](https://www.npmjs.com/package/jsdocify). --- # Upgrading to v1 ## Summary[​](#summary "Direct link to heading") After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result - **Apify SDK v1**. There were two goals for this release. **Stability** and **adding support for more browsers** - Firefox and Webkit (Safari). The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in and by releasing SDK v1, **we commit to only make breaking changes once a year, with a new major release**. We added support for more browsers by replacing `PuppeteerPool` with [`browser-pool`](https://github.com/apify/browser-pool). A new library that we created specifically for this purpose. It builds on the ideas from `PuppeteerPool` and extends them to support [Playwright](https://github.com/microsoft/playwright). Playwright is a browser automation library similar to Puppeteer. It works with all well known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new `BrowserPool`. A large breaking change is that neither `puppeteer` nor `playwright` are bundled with the SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. 
This allows us to add support for even more libraries in the future. Thanks to the addition of Playwright we now have a `PlaywrightCrawler`. It is very similar to `PuppeteerCrawler` and you can pick the one you prefer. It also means we needed to make some interface changes. The `launchPuppeteerFunction` option of `PuppeteerCrawler` is gone and `launchPuppeteerOptions` were replaced by `launchContext`. We also moved things around in the `handlePageFunction` arguments. See the [migration guide](#migration-guide) for more detailed explanation and migration examples. What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well. ## Migration Guide[​](#migration-guide "Direct link to heading") There are a lot of breaking changes in the v1.0.0 release, but we're confident that updating your code will be a matter of minutes. Below, you'll find examples how to do it and also short tutorials how to use many of the new features. > Many of the new features are made with power users in mind, so don't worry if something looks complicated. You don't need to use it. ## Installation[​](#installation "Direct link to heading") Previous versions of the SDK bundled the `puppeteer` package, so you did not have to install it. SDK v1 supports also `playwright` and we don't want to force users to install both. To install SDK v1 with Puppeteer (same as previous versions), run: ``` npm install apify puppeteer ``` To install SDK v1 with Playwright run: ``` npm install apify playwright ``` > While we tried to add the most important functionality in the initial release, you may find that there are still some utilities or options that are only supported by Puppeteer and not Playwright. ## Running on Apify Platform[​](#running-on-apify-platform "Direct link to heading") If you want to make use of Playwright on the Apify Platform, you need to use a Docker image that supports Playwright. We've created them for you, so head over to the new [Docker image guide](https://sdk.apify.com/docs/guides/docker-images) and pick the one that best suits your needs. Note that your `package.json` **MUST** include `puppeteer` and/or `playwright` as dependencies. If you don't list them, the libraries will be uninstalled from your `node_modules` folder when you build your actors. ## Handler arguments are now Crawling Context[​](#handler-arguments-are-now-crawling-context "Direct link to heading") Previously, arguments of user provided handler functions were provided in separate objects. This made it difficult to track values across function invocations. ``` const handlePageFunction = async (args1) => { args1.hasOwnProperty('proxyInfo'); // true }; const handleFailedRequestFunction = async (args2) => { args2.hasOwnProperty('proxyInfo'); // false }; args1 === args2; // false ``` This happened because a new arguments object was created for each function. With SDK v1 we now have a single object called Crawling Context. ``` const handlePageFunction = async (crawlingContext1) => { crawlingContext1.hasOwnProperty('proxyInfo'); // true }; const handleFailedRequestFunction = async (crawlingContext2) => { crawlingContext2.hasOwnProperty('proxyInfo'); // true }; // All contexts are the same object. 
crawlingContext1 === crawlingContext2; // true
```

### `Map` of crawling contexts and their IDs[​](#map-of-crawling-contexts-and-their-ids "Direct link to heading")

Now that all the objects are the same, we can keep track of all running crawling contexts. We can do that by working with the new `id` property of `crawlingContext`. This is useful when you need cross-context access.

```
let masterContextId;
const handlePageFunction = async ({ id, page, request, crawler }) => {
    if (request.userData.masterPage) {
        masterContextId = id;
        // Prepare the master page.
    } else {
        const masterContext = crawler.crawlingContexts.get(masterContextId);
        const masterPage = masterContext.page;
        const masterRequest = masterContext.request;
        // Now we can manipulate the master data from another handlePageFunction.
    }
};
```

### `autoscaledPool` was moved under `crawlingContext.crawler`[​](#autoscaledpool-was-moved-under-crawlingcontextcrawler "Direct link to heading")

To prevent bloat and to make access to certain key objects easier, we exposed a `crawler` property on the handle page arguments.

```
const handlePageFunction = async ({ request, page, crawler }) => {
    await crawler.requestQueue.addRequest({ url: 'https://example.com' });
    await crawler.autoscaledPool.pause();
};
```

This also means that some shorthands like `puppeteerPool` or `autoscaledPool` are no longer necessary.

```
const handlePageFunction = async (crawlingContext) => {
    crawlingContext.autoscaledPool; // does NOT exist anymore
    crawlingContext.crawler.autoscaledPool; // <= this is correct usage
};
```

## Replacement of `PuppeteerPool` with `BrowserPool`[​](#replacement-of-puppeteerpool-with-browserpool "Direct link to heading")

`BrowserPool` was created to extend `PuppeteerPool` with the ability to manage other browser automation libraries. The API is similar, but not the same.

### Access to running `BrowserPool`[​](#access-to-running-browserpool "Direct link to heading")

Only `PuppeteerCrawler` and `PlaywrightCrawler` use `BrowserPool`. You can access it on the `crawler` object.

```
const crawler = new Apify.PlaywrightCrawler({
    handlePageFunction: async ({ page, crawler }) => {
        crawler.browserPool; // <-----
    },
});

crawler.browserPool; // <-----
```

### Pages now have IDs[​](#pages-now-have-ids "Direct link to heading")

And they're equal to `crawlingContext.id`, which gives you access to the full `crawlingContext` in hooks. See [Lifecycle hooks](#configuration-and-lifecycle-hooks) below.

```
const pageId = browserPool.getPageId(page);
```

### Configuration and lifecycle hooks[​](#configuration-and-lifecycle-hooks "Direct link to heading")

The most important addition that comes with `BrowserPool` is the [lifecycle hooks](https://github.com/apify/browser-pool#browserpool). You can access them via `browserPoolOptions` in both crawlers. A full list of `browserPoolOptions` can be found in the [`browser-pool` readme](https://github.com/apify/browser-pool#new-browserpooloptions).

```
const crawler = new Apify.PuppeteerCrawler({
    browserPoolOptions: {
        retireBrowserAfterPageCount: 10,
        preLaunchHooks: [
            async (pageId, launchContext) => {
                const { request } = crawler.crawlingContexts.get(pageId);
                if (request.userData.useHeadful === true) {
                    launchContext.launchOptions.headless = false;
                }
            },
        ],
    },
});
```

### Introduction of `BrowserController`[​](#introduction-of-browsercontroller "Direct link to heading")

[`BrowserController`](https://github.com/apify/browser-pool#browsercontroller) is a class of `browser-pool` that's responsible for browser management.
Its purpose is to provide a single API for working with both Puppeteer and Playwright browsers. It works automatically in the background, but if you ever wanted to close a browser properly, you should use a `browserController` to do it. You can find it in the handle page arguments. ``` const handlePageFunction = async ({ page, browserController }) => { // Wrong usage. Could backfire because it bypasses BrowserPool. await page.browser().close(); // Correct usage. Allows graceful shutdown. await browserController.close(); const cookies = [ /* some cookie objects */ ]; // Wrong usage. Will only work in Puppeteer and not Playwright. await page.setCookies(...cookies); // Correct usage. Will work in both. await browserController.setCookies(page, cookies); }; ``` The `BrowserController` also includes important information about the browser, such as the context it was launched with. This was difficult to do before SDK v1. ``` const handlePageFunction = async ({ browserController }) => { // Information about the proxy used by the browser browserController.launchContext.proxyInfo; // Session used by the browser browserController.launchContext.session; }; ``` ### `BrowserPool` methods vs `PuppeteerPool`[​](#browserpool-methods-vs-puppeteerpool "Direct link to heading") Some functions were removed (in line with earlier deprecations), and some were changed a bit: ``` // OLD await puppeteerPool.recyclePage(page); // NEW await page.close(); ``` ``` // OLD await puppeteerPool.retire(page.browser()); // NEW browserPool.retireBrowserByPage(page); ``` ``` // OLD await puppeteerPool.serveLiveViewSnapshot(); // NEW // There's no LiveView in BrowserPool ``` ## Updated `PuppeteerCrawlerOptions`[​](#updated-puppeteercrawleroptions "Direct link to heading") To keep `PuppeteerCrawler` and `PlaywrightCrawler` consistent, we updated the options. ### Removal of `gotoFunction`[​](#removal-of-gotofunction "Direct link to heading") The concept of a configurable `gotoFunction` is not ideal. Especially since we use a modified `gotoExtended`. Users have to know this when they override `gotoFunction` if they want to extend default behavior. We decided to replace `gotoFunction` with `preNavigationHooks` and `postNavigationHooks`. The following example illustrates how `gotoFunction` makes things complicated. ``` const gotoFunction = async ({ request, page }) => { // pre-processing await makePageStealthy(page); // Have to remember how to do this: const response = await gotoExtended(page, request, { /* have to remember the defaults */ }); // post-processing await page.evaluate(() => { window.foo = 'bar'; }); // Must not forget! return response; }; const crawler = new Apify.PuppeteerCrawler({ gotoFunction, // ... }); ``` With `preNavigationHooks` and `postNavigationHooks` it's much easier. `preNavigationHooks` are called with two arguments: `crawlingContext` and `gotoOptions`. `postNavigationHooks` are called only with `crawlingContext`. ``` const preNavigationHooks = [async ({ page }) => makePageStealthy(page)]; const postNavigationHooks = [ async ({ page }) => page.evaluate(() => { window.foo = 'bar'; }), ]; const crawler = new Apify.PuppeteerCrawler({ preNavigationHooks, postNavigationHooks, // ... }); ``` ### `launchPuppeteerOptions` => `launchContext`[​](#launchpuppeteeroptions--launchcontext "Direct link to heading") Those were always a point of confusion because they merged custom Apify options with `launchOptions` of Puppeteer. 
``` const launchPuppeteerOptions = { useChrome: true, // Apify option headless: false, // Puppeteer option }; ``` Use the new `launchContext` object, which explicitly defines `launchOptions`. `launchPuppeteerOptions` were removed. ``` const crawler = new Apify.PuppeteerCrawler({ launchContext: { useChrome: true, // Apify option launchOptions: { headless: false, // Puppeteer option }, }, }); ``` > LaunchContext is also a type of [`browser-pool`](https://github.com/apify/browser-pool) and the structure is exactly the same there. SDK only adds extra options. ### Removal of `launchPuppeteerFunction`[​](#removal-of-launchpuppeteerfunction "Direct link to heading") `browser-pool` introduces the idea of [lifecycle hooks](https://github.com/apify/browser-pool#browserpool), which are functions that are executed when a certain event in the browser lifecycle happens. ``` const launchPuppeteerFunction = async (launchPuppeteerOptions) => { if (someVariable === 'chrome') { launchPuppeteerOptions.useChrome = true; } return Apify.launchPuppeteer(launchPuppeteerOptions); }; const crawler = new Apify.PuppeteerCrawler({ launchPuppeteerFunction, // ... }); ``` Now you can recreate the same functionality with a `preLaunchHook`: ``` const maybeLaunchChrome = (pageId, launchContext) => { if (someVariable === 'chrome') { launchContext.useChrome = true; } }; const crawler = new Apify.PuppeteerCrawler({ browserPoolOptions: { preLaunchHooks: [maybeLaunchChrome], }, // ... }); ``` This is better in multiple ways. It is consistent across both Puppeteer and Playwright. It allows you to easily construct your browsers with pre-defined behavior: ``` const preLaunchHooks = [ maybeLaunchChrome, useHeadfulIfNeeded, injectNewFingerprint, ]; ``` And thanks to the addition of [`crawler.crawlingContexts`](#handler-arguments-are-now-crawling-context) the functions also have access to the `crawlingContext` of the `request` that triggered the launch. ``` const preLaunchHooks = [ async function maybeLaunchChrome(pageId, launchContext) { const { request } = crawler.crawlingContexts.get(pageId); if (request.userData.useHeadful === true) { launchContext.launchOptions.headless = false; } }, ]; ``` ## Launch functions[​](#launch-functions "Direct link to heading") In addition to `Apify.launchPuppeteer()` we now also have `Apify.launchPlaywright()`. ### Updated arguments[​](#updated-arguments "Direct link to heading") We [updated the launch options object](#launchpuppeteeroptions--launchcontext) because it was a frequent source of confusion. ``` // OLD await Apify.launchPuppeteer({ useChrome: true, headless: true, }); // NEW await Apify.launchPuppeteer({ useChrome: true, launchOptions: { headless: true, }, }); ``` ### Custom modules[​](#custom-modules "Direct link to heading") `Apify.launchPuppeteer` already supported the `puppeteerModule` option. With Playwright, we normalized the name to `launcher` because the `playwright` module itself does not launch browsers. ``` const puppeteer = require('puppeteer'); const playwright = require('playwright'); await Apify.launchPuppeteer(); // Is the same as: await Apify.launchPuppeteer({ launcher: puppeteer, }); await Apify.launchPlaywright(); // Is the same as: await Apify.launchPlaywright({ launcher: playwright.chromium, }); ``` --- # Upgrading to v2 * **BREAKING**: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy. * **BREAKING**: Bump `cheerio` to `1.0.0-rc.10` from `rc.3`. 
There were breaking changes in `cheerio` between the versions so this bump might be breaking for you as well. * Remove `LiveViewServer` which was deprecated before release of SDK v1. --- # Upgrading to v3 This page summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3. Crawlee vs Apify SDK v2 Up until version 3 of `apify`, the package contained both scraping related tools and Apify platform related helper methods. With v3 we are splitting the whole project into two main parts: * [Crawlee](https://github.com/apify/crawlee), the new web-scraping library, available as [`crawlee`](https://www.npmjs.com/package/crawlee) package on NPM * [Apify SDK](https://github.com/apify/apify-sdk-js), helpers for the Apify platform, available as [`apify`](https://www.npmjs.com/package/apify) package on NPM ## Crawlee monorepo[​](#crawlee-monorepo "Direct link to heading") The [`crawlee`](https://www.npmjs.com/package/crawlee) package consists of several smaller packages, released separately under `@crawlee` namespace: * `@crawlee/core`: the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * `@crawlee/basic`: exports `BasicCrawler` * `@crawlee/http`: exports `HttpCrawler` (which is used for creating `@crawlee/dom` and `@crawlee/cheerio`) * `@crawlee/dom`: exports `DOMCrawler` * `@crawlee/cheerio`: exports `CheerioCrawler` * `@crawlee/browser`: exports `BrowserCrawler` (which is used for creating `@crawlee/playwright` and `@crawlee/puppeteer`) * `@crawlee/playwright`: exports `PlaywrightCrawler` * `@crawlee/puppeteer`: exports `PuppeteerCrawler` * `@crawlee/memory-storage`: `@apify/storage-local` alternative * `@crawlee/browser-pool`: previously `browser-pool` package * `@crawlee/utils`: utility methods * `@crawlee/types`: holds TS interfaces mainly about the `StorageClient` ### Installing Crawlee[​](#installing-crawlee "Direct link to heading") Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. If we don't care much about additional code being pulled in, we can just use the `crawlee` meta-package, which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. ``` npm install crawlee ``` Or if all we need is cheerio support, we can install only `@crawlee/cheerio`. ``` npm install @crawlee/cheerio ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. ``` npm install crawlee playwright # or npm install @crawlee/playwright playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). 
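To make the re-exports concrete, here is a small sketch (assuming `crawlee` and `playwright` are installed; the start URL is just a placeholder) - the same classes could equally be imported from `@crawlee/playwright`:

```
import { PlaywrightCrawler, Dataset } from 'crawlee';
// The scoped package re-exports the same classes:
// import { PlaywrightCrawler, Dataset } from '@crawlee/playwright';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        // Store the page title to the default dataset.
        await Dataset.pushData({ url: request.url, title: await page.title() });
    },
});

await crawler.run(['https://crawlee.dev']);
```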
## Full TypeScript support[​](#full-typescript-support "Direct link to heading") Both Crawlee and Apify SDK are full TypeScript rewrite, so they include up-to-date types in the package. For your TypeScript crawlers we recommend using our predefined TypeScript configuration from `@apify/tsconfig` package. Don't forget to set the `module` and `target` to `ES2022` or above to be able to use top level await. > The `@apify/tsconfig` config has [`noImplicitAny`](https://www.typescriptlang.org/tsconfig#noImplicitAny) enabled, you might want to disable it during the initial development as it will cause build failures if you left some unused local variables in your code. tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist", "lib": ["DOM"] }, "include": ["./src/**/*"] } ``` ### Docker build[​](#docker-build "Direct link to heading") For `Dockerfile` we recommend using multi-stage build, so you don't install the dev dependencies like TypeScript in your final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:16 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:16 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/README.md ./ COPY --from=builder /usr/src/app/dist ./dist COPY --from=builder /usr/src/app/apify.json ./apify.json COPY --from=builder /usr/src/app/INPUT_SCHEMA.json ./INPUT_SCHEMA.json # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed NPM packages:" \ && (npm list --only=prod --no-optional --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # run compiled code CMD npm run start:prod ``` ## Browser fingerprints[​](#browser-fingerprints "Direct link to heading") Previously we had a magical `stealth` option in the puppeteer crawler that enabled several tricks aiming to mimic the real users as much as possible. While this worked to a certain degree, we decided to replace it with generated browser fingerprints. In case we don't want to have dynamic fingerprints, we can disable this behaviour via `useFingerprints` in `browserPoolOptions`: ``` const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, }); ``` ## Session cookie method renames[​](#session-cookie-method-renames "Direct link to heading") Previously, if we wanted to get or add cookies for the session that would be used for the request, we had to call `session.getPuppeteerCookies()` or `session.setPuppeteerCookies()`. Since this method could be used for any of our crawlers, not just `PuppeteerCrawler`, the methods have been renamed to `session.getCookies()` and `session.setCookies()` respectively. Otherwise, their usage is exactly the same! ## Memory storage[​](#memory-storage "Direct link to heading") When we store some data or intermediate state (like the one `RequestQueue` holds), we now use `@crawlee/memory-storage` by default. It is an alternative to the `@apify/storage-local`, that stores the state inside memory (as opposed to SQLite database used by `@apify/storage-local`). 
While the state is stored in memory, it also dumps it to the file system, so we can observe it, and it respects the existing data stored in the KeyValueStore (e.g. the `INPUT.json` file). When we want to run the crawler on the Apify platform, we need to use `Actor.init` or `Actor.main`, which will automatically switch the storage client to `ApifyClient` when on the Apify platform. We can still use `@apify/storage-local`; to do so, first install it and then pass it to the `Actor.init` or `Actor.main` options: > `@apify/storage-local` v2.1.0+ is required for Crawlee ``` import { Actor } from 'apify'; import { ApifyStorageLocal } from '@apify/storage-local'; const storage = new ApifyStorageLocal(/* options like `enableWalMode` belong here */); await Actor.init({ storage }); ``` ## Purging of the default storage[​](#purging-of-the-default-storage "Direct link to heading") Previously the state was preserved between local runs, and we had to use the `--purge` argument of the `apify-cli`. With Crawlee, this is now the default behaviour: we purge the storage automatically on the `Actor.init/main` call. We can opt out of it via `purge: false` in the `Actor.init` options. ## Renamed crawler options and interfaces[​](#renamed-crawler-options-and-interfaces "Direct link to heading") Some options were renamed to better reflect what they do. We still support all the old parameter names too, but not at the TS level. * `handleRequestFunction` -> `requestHandler` * `handlePageFunction` -> `requestHandler` * `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs` * `handlePageTimeoutSecs` -> `requestHandlerTimeoutSecs` * `requestTimeoutSecs` -> `navigationTimeoutSecs` * `handleFailedRequestFunction` -> `failedRequestHandler` We also renamed the crawling context interfaces, so they follow the same convention and are more meaningful: * `CheerioHandlePageInputs` -> `CheerioCrawlingContext` * `PlaywrightHandlePageFunction` -> `PlaywrightCrawlingContext` * `PuppeteerHandlePageFunction` -> `PuppeteerCrawlingContext` ## Context aware helpers[​](#context-aware-helpers "Direct link to heading") Some utilities previously available under the `Apify.utils` namespace are now moved to the crawling context and are *context aware*. This means they have some parameters automatically filled in from the context, like the current `Request` instance or current `Page` object, or the `RequestQueue` bound to the crawler. ### Enqueuing links[​](#enqueuing-links "Direct link to heading") One common helper that received more attention is `enqueueLinks`. As mentioned above, it is context aware - we no longer need to pass in the `requestQueue` or `page` arguments (or the cheerio handle `$`). In addition to that, it now offers 3 enqueuing strategies: * `EnqueueStrategy.All` (`'all'`): Matches any URLs found * `EnqueueStrategy.SameHostname` (`'same-hostname'`): Matches any URLs that have the same subdomain as the base URL (default) * `EnqueueStrategy.SameDomain` (`'same-domain'`): Matches any URLs that have the same domain name. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base URL of `https://example.com`. This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on the current page and filter only those targeting the same subdomain.
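For illustration, here is a short sketch of picking a strategy explicitly via the `strategy` option (the crawler class and URL are just placeholders):

```
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Widen the default same-hostname behaviour to the whole domain,
        // so links on any subdomain of the start URL get enqueued too.
        await enqueueLinks({ strategy: EnqueueStrategy.SameDomain });
    },
});

await crawler.run(['https://example.com']);
```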
Moreover, we can specify patterns the URL should match via globs: ``` const crawler = new PlaywrightCrawler({ async requestHandler({ enqueueLinks }) { await enqueueLinks({ globs: ['https://crawlee.dev/*/*'], // we can also use `regexps` and `pseudoUrls` keys here }); }, }); ``` ## Implicit `RequestQueue` instance[​](#implicit-requestqueue-instance "Direct link to heading") All crawlers now have the `RequestQueue` instance automatically available via the `crawler.getRequestQueue()` method. It will create the instance for you if it does not exist yet. This means we no longer need to create the `RequestQueue` instance manually, and we can just use the `crawler.addRequests()` method described below. > We can still create the `RequestQueue` explicitly; the `crawler.getRequestQueue()` method will respect that and return the instance provided via crawler options. ## `crawler.addRequests()`[​](#crawleraddrequests "Direct link to heading") We can now add multiple requests in batches. The newly added `addRequests` method will handle everything for us. It enqueues the first 1000 requests and resolves, while continuing with the rest in the background, again in smaller batches of 1000 items, so we don't fall into any API rate limits. This means the crawling will start almost immediately (within a few seconds at most), something previously possible only with a combination of `RequestQueue` and `RequestList`. ``` // will resolve right after the initial batch of 1000 requests is added const result = await crawler.addRequests([ /* many requests, can be even millions */ ]); // if we want to wait for all the requests to be added, we can await the `waitForAllRequestsToBeAdded` promise await result.waitForAllRequestsToBeAdded; ``` ## Less verbose error logging[​](#less-verbose-error-logging "Direct link to heading") Previously, an error thrown from inside the request handler resulted in the full error object being logged. With Crawlee, we log only the error message as a warning as long as we know the request will be retried. If you want to enable verbose logging like in v2, use the `CRAWLEE_VERBOSE_LOG` env var. ## `Request.label` shortcut[​](#requestlabel-shortcut "Direct link to heading") Labeling requests used to work via the `Request.userData` object. With Crawlee, we can also use the `Request.label` shortcut. It is implemented as a `get/set` pair, using the value from `Request.userData`. Support for this shortcut is also added to the `enqueueLinks` options interface. ``` async requestHandler({ request, enqueueLinks }) { if (request.label !== 'DETAIL') { await enqueueLinks({ globs: ['...'], label: 'DETAIL', }); } } ``` ## Removal of `requestAsBrowser`[​](#removal-of-requestasbrowser "Direct link to heading") In v1 we replaced the underlying implementation of `requestAsBrowser` with just a proxy over calling [`got-scraping`](https://github.com/apify/got-scraping) - our custom extension to `got` that tries to mimic real browsers as much as possible. With v3, we are removing `requestAsBrowser` and encouraging the use of [`got-scraping`](https://github.com/apify/got-scraping) directly.
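If we only need a one-off request outside of a crawler, we can call `gotScraping` directly - a minimal sketch (the URL is just a placeholder):

```
import { gotScraping } from 'got-scraping';

// got-scraping fills in browser-like headers for us automatically.
const response = await gotScraping({
    url: 'https://example.com',
    responseType: 'text',
});

console.log(response.statusCode, response.body.length);
```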
For easier migration, we also added the `context.sendRequest()` helper that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping): ``` const crawler = new BasicCrawler({ async requestHandler({ sendRequest, log }) { // we can use the options parameter to override gotScraping options const res = await sendRequest({ responseType: 'json' }); log.info('received body', res.body); }, }); ``` ### How to use `sendRequest()`?[​](#how-to-use-sendrequest "Direct link to heading") See [the Got Scraping guide](https://crawlee.dev/docs/guides/got-scraping). ### Removed options[​](#removed-options "Direct link to heading") The `useInsecureHttpParser` option has been removed. It's permanently set to `true` in order to better mimic browsers' behavior. Got Scraping automatically performs protocol negotiation, hence we removed the `useHttp2` option. It's set to `true` - 100% of browsers nowadays are capable of HTTP/2 requests, and more and more of the web is using it too. ### Renamed options[​](#renamed-options "Direct link to heading") In the `requestAsBrowser` approach, some of the options were named differently. Here's a list of renamed options: #### `payload`[​](#payload "Direct link to heading") This option represents the body to send. It could be a `string` or a `Buffer`. However, there is no `payload` option anymore. You need to use `body` instead. Or, if you wish to send JSON, `json`. Here's an example: ``` // Before: await Apify.utils.requestAsBrowser({ …, payload: 'Hello, world!' }); await Apify.utils.requestAsBrowser({ …, payload: Buffer.from('c0ffe', 'hex') }); await Apify.utils.requestAsBrowser({ …, json: { hello: 'world' } }); // After: await gotScraping({ …, body: 'Hello, world!' }); await gotScraping({ …, body: Buffer.from('c0ffe', 'hex') }); await gotScraping({ …, json: { hello: 'world' } }); ``` #### `ignoreSslErrors`[​](#ignoresslerrors "Direct link to heading") It has been renamed to `https.rejectUnauthorized`. By default, it's set to `false` for convenience. However, if you want to make sure the connection is secure, you can do the following: ``` // Before: await Apify.utils.requestAsBrowser({ …, ignoreSslErrors: false }); // After: await gotScraping({ …, https: { rejectUnauthorized: true } }); ``` Please note: the meanings are opposite, so the values need to be inverted as well. #### `header-generator` options[​](#header-generator-options "Direct link to heading") `useMobileVersion`, `languageCode` and `countryCode` no longer exist. Instead, you need to use `headerGeneratorOptions` directly: ``` // Before: await Apify.utils.requestAsBrowser({ …, useMobileVersion: true, languageCode: 'en', countryCode: 'US', }); // After: await gotScraping({ …, headerGeneratorOptions: { devices: ['mobile'], // or ['desktop'] locales: ['en-US'], }, }); ``` #### `timeoutSecs`[​](#timeoutsecs "Direct link to heading") In order to set a timeout, use `timeout.request` (which is now in **milliseconds**). ``` // Before: await Apify.utils.requestAsBrowser({ …, timeoutSecs: 30, }); // After: await gotScraping({ …, timeout: { request: 30 * 1000, }, }); ``` #### `throwOnHttpErrors`[​](#throwonhttperrors "Direct link to heading") `throwOnHttpErrors` → `throwHttpErrors`. This option throws on unsuccessful HTTP status codes, for example `404`. By default, it's set to `false`. #### `decodeBody`[​](#decodebody "Direct link to heading") `decodeBody` → `decompress`. This option decompresses the body.
Defaults to `true` - please do not change this, or websites will break (unless you know what you're doing!). #### `abortFunction`[​](#abortfunction "Direct link to heading") This function used to make the promise throw on specific responses if it returned `true`. However, it wasn't that useful. You probably want to cancel the request instead, which you can do in the following way: ``` const promise = gotScraping(…); promise.on('request', request => { // Please note this is not a Got Request instance, but a ClientRequest one. // https://nodejs.org/api/http.html#class-httpclientrequest if (request.protocol !== 'https:') { // Insecure request, abort. promise.cancel(); // If you set `isStream` to `true`, please use `stream.destroy()` instead. } }); const response = await promise; ``` ## Removal of browser pool plugin mixing[​](#removal-of-browser-pool-plugin-mixing "Direct link to heading") Previously, you were able to have a browser pool that would mix Puppeteer and Playwright plugins (or even your own custom plugins if you've built any). As of this version, that is no longer allowed, and creating such a browser pool will cause an error to be thrown (it's expected that all plugins that will be used are of the same type). As an example, this change disallows a pool that mixes Puppeteer with Playwright. You can still create pools that use multiple Playwright plugins, each with a different launcher if you want! ## Handling requests outside of browser[​](#handling-requests-outside-of-browser "Direct link to heading") One small feature worth mentioning is the ability to handle requests with browser crawlers outside the browser. To do that, we can use a combination of `Request.skipNavigation` and `context.sendRequest()`. Take a look at how to achieve this by checking out the [Skipping navigation for certain requests](https://crawlee.dev/docs/examples/skip-navigation) example! ## Logging[​](#logging "Direct link to heading") Crawlee exports the default `log` instance directly as a named export. We also have a scoped `log` instance provided in the crawling context - this one will log messages prefixed with the crawler name and should be preferred for logging inside the request handler. ``` const crawler = new CheerioCrawler({ async requestHandler({ log, request }) { log.info(`Opened ${request.loadedUrl}`); }, }); ``` ## Auto-saved crawler state[​](#auto-saved-crawler-state "Direct link to heading") Every crawler instance now has a `useState()` method that will return a state object we can use. It will be automatically saved when the `persistState` event occurs. The value is cached, so we can freely call this method multiple times and get the exact same reference. No need to worry about saving the value either, as it will happen automatically. ``` const crawler = new CheerioCrawler({ async requestHandler({ crawler }) { const state = await crawler.useState({ foo: [] as number[] }); // just change the value, no need to care about saving it state.foo.push(123); }, }); ``` ## Apify SDK[​](#apify-sdk "Direct link to heading") The Apify platform helpers can now be found in the Apify SDK (`apify` NPM package).
It exports the `Actor` class that offers the following static helpers: * `ApifyClient` shortcuts: `addWebhook()`, `call()`, `callTask()`, `metamorph()` * helpers for running on the Apify platform: `init()`, `exit()`, `fail()`, `main()`, `isAtHome()`, `createProxyConfiguration()` * storage support: `getInput()`, `getValue()`, `openDataset()`, `openKeyValueStore()`, `openRequestQueue()`, `pushData()`, `setValue()` * events support: `on()`, `off()` * other utilities: `getEnv()`, `newClient()`, `reboot()` `Actor.main` is now just syntax sugar around calling `Actor.init()` at the beginning and `Actor.exit()` at the end (plus wrapping the user function in a try/catch block). All those methods are async and should be awaited - with Node.js 16 we can use top-level await for that. In other words, the following are equivalent: ``` import { Actor } from 'apify'; await Actor.init(); // your code await Actor.exit('Crawling finished!'); ``` ``` import { Actor } from 'apify'; await Actor.main( async () => { // your code }, { statusMessage: 'Crawling finished!' }, ); ``` `Actor.init()` will conditionally set the storage implementation of Crawlee to `ApifyClient` when running on the Apify platform, or keep the default (memory storage) implementation otherwise. It will also subscribe to the websocket events (or mimic them locally). `Actor.exit()` will handle the teardown and call `process.exit()` to ensure our process won't hang indefinitely. ### Events[​](#events "Direct link to heading") Apify SDK (v2) exports `Apify.events`, which is an `EventEmitter` instance. With Crawlee, the events are managed by the [`EventManager`](https://crawlee.dev/api/core/class/EventManager) class instead. We can either access it via the `Actor.eventManager` getter, or use the `Actor.on` and `Actor.off` shortcuts. ``` -Apify.events.on(...); +Actor.on(...); ``` > We can also get the [`EventManager`](https://crawlee.dev/api/core/class/EventManager) instance via `Configuration.getEventManager()`. In addition to the existing events, we now have an `exit` event fired when calling `Actor.exit()` (which is called at the end of `Actor.main()`). This event allows you to gracefully shut down any resources when `Actor.exit` is called.
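For example, a minimal sketch of hooking into that `exit` event to release a long-lived resource (assuming the event name is passed to `Actor.on()` as described above; the heartbeat timer is just a stand-in for any resource that needs explicit cleanup):

```
import { Actor } from 'apify';

await Actor.init();

// Some long-lived resource we want to release on shutdown,
// e.g. a periodic heartbeat logger.
const heartbeat = setInterval(() => console.log('still running...'), 60_000);

Actor.on('exit', () => {
    // Fired when Actor.exit() is called (also at the end of Actor.main()).
    clearInterval(heartbeat);
});

// ... crawling code ...

await Actor.exit();
```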
## Smaller/internal breaking changes[​](#smallerinternal-breaking-changes "Direct link to heading") * `Apify.call()` is now just a shortcut for running `ApifyClient.actor(actorId).call(input, options)`, while also taking the token inside env vars into account * `Apify.callTask()` is now just a shortcut for running `ApifyClient.task(taskId).call(input, options)`, while also taking the token inside env vars into account * `Apify.metamorph()` is now just a shortcut for running `ApifyClient.task(taskId).metamorph(input, options)`, while also taking the ACTOR\_RUN\_ID inside env vars into account * `Apify.waitForRunToFinish()` has been removed, use `ApifyClient.waitForFinish()` instead * `Actor.main/init` purges the storage by default * remove `purgeLocalStorage` helper, move purging to the storage class directly * `StorageClient` interface now has optional `purge` method * purging happens automatically via `Actor.init()` (you can opt out via `purge: false` in the options of `init/main` methods) * `QueueOperationInfo.request` is no longer available * `Request.handledAt` is now string date in ISO format * `Request.inProgress` and `Request.reclaimed` are now `Set`s instead of POJOs * `injectUnderscore` from puppeteer utils has been removed * `APIFY_MEMORY_MBYTES` is no longer taken into account, use `CRAWLEE_AVAILABLE_MEMORY_RATIO` instead * some `AutoscaledPool` options are no longer available: * `cpuSnapshotIntervalSecs` and `memorySnapshotIntervalSecs` has been replaced with top level `systemInfoIntervalMillis` configuration * `maxUsedCpuRatio` has been moved to the top level configuration * `ProxyConfiguration.newUrlFunction` can be async. `.newUrl()` and `.newProxyInfo()` now return promises. * `prepareRequestFunction` and `postResponseFunction` options are removed, use navigation hooks instead * `gotoFunction` and `gotoTimeoutSecs` are removed * removed compatibility fix for old/broken request queues with null `Request` props * `fingerprintsOptions` renamed to `fingerprintOptions` (`fingerprints` -> `fingerprint`). * `fingerprintOptions` now accept `useFingerprintCache` and `fingerprintCacheSize` (instead of `useFingerprintPerProxyCache` and `fingerprintPerProxyCacheSize`, which are now no longer available). This is because the cached fingerprints are no longer connected to proxy URLs but to sessions. --- # apify [![npm version](https://badge.fury.io/js/apify.svg)](https://www.npmjs.com/package/apify) [![Downloads](https://img.shields.io/npm/dm/apify.svg)](https://www.npmjs.com/package/apify) [![Chat on discord](https://img.shields.io/discord/801163717915574323?label=discord)](https://discord.gg/jyEM2PRvMU) [![Build Status](https://github.com/apify/apify-sdk-js/actions/workflows/test-and-release.yaml/badge.svg?branch=master)](https://github.com/apify/apify-sdk-js/actions/workflows/test-and-release.yaml) Apify SDK provides the tools required to run your own Apify Actors. The crawlers and scraping related tools, previously included in Apify SDK (v2), have been split into a brand-new module - [`crawlee`](https://npmjs.org/crawlee), while keeping the Apify specific parts in this module. > Would you like to work with us on Crawlee, Apify SDK or similar projects? We are hiring [Node.js engineers](https://apify.com/jobs#senior-node.js-engineer). ## Upgrading from v2[​](#upgrading-from-v2 "Direct link to heading") A lot of things have changed since version 2 of the Apify SDK, including the split of the crawlers to the new [`crawlee`](https://npmjs.org/crawlee) module. 
We've written a guide to help you easily migrate from v2 to v3. Visit the [Upgrading Guide](https://docs.apify.com/sdk/js/sdk/js/docs/upgrading/upgrading-to-v3.md) to find out what changes you need to make (especially the section related to this very [Apify SDK](https://docs.apify.com/sdk/js/sdk/js/docs/upgrading/upgrading-to-v3.md#apify-sdk)), and, if you encounter any issues, join our [Discord server](https://discord.gg/jyEM2PRvMU) for help! ## Quick Start[​](#quick-start "Direct link to heading") This short tutorial will set you up to start using Apify SDK in a minute or two. If you want to learn more, proceed to the [Apify Platform](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md) guide that will take you step by step through running your Actor on Apify's platform. Apify SDK requires [Node.js](https://nodejs.org/en/) 16 or later. Add Apify SDK to any Node.js project by running: ``` npm install apify crawlee playwright ``` > For this example, we'll also install the [`crawlee`](https://npmjs.org/crawlee) module, as it now provides the crawlers that were previously exported by Apify SDK. If you don't plan to use crawlers in your Actors, then you don't need to install it. Keep in mind that neither `playwright` nor `puppeteer` are bundled with `crawlee` in order to reduce install size and allow greater flexibility. That's why we manually install it with NPM. You can choose one, both, or neither. There are two ways to initialize your Actor: by using the `Actor.main()` function you're probably used to, or by calling `Actor.init()` and `Actor.exit()` manually. We prefer explicitly calling `init` and `exit`. ### Using `Actor.init()` and `Actor.exit()`[​](#using-actorinit-and-actorexit "Direct link to heading") ``` import { Actor } from 'apify'; import { PlaywrightCrawler } from 'crawlee'; await Actor.init(); const crawler = new PlaywrightCrawler({ async requestHandler({ request, page, enqueueLinks }) { // Extract HTML title of the page. const title = await page.title(); console.log(`Title of ${request.url}: ${title}`); // Add URLs that point to the same hostname. await enqueueLinks(); }, }); await crawler.run(['https://crawlee.dev/']); await Actor.exit(); ``` ### Using `Actor.main()`[​](#using-actormain "Direct link to heading") ``` import { Actor } from 'apify'; import { PlaywrightCrawler } from 'crawlee'; await Actor.main(async () => { const crawler = new PlaywrightCrawler({ async requestHandler({ request, page, enqueueLinks }) { // Extract HTML title of the page. const title = await page.title(); console.log(`Title of ${request.url}: ${title}`); // Add URLs that point to the same hostname. await enqueueLinks(); }, }); await crawler.run(['https://crawlee.dev/']); }); ``` ## Support[​](#support "Direct link to heading") If you find any bug or issue with the Apify SDK, please [submit an issue on GitHub](https://github.com/apify/apify-sdk-js/issues). For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify) or contact ## Contributing[​](#contributing "Direct link to heading") Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see [CONTRIBUTING.md](https://github.com/apify/apify-sdk-js/blob/master/CONTRIBUTING.md). 
## License[​](#license "Direct link to heading") This project is licensed under the Apache License 2.0 - see the [LICENSE.md](https://github.com/apify/apify-sdk-js/blob/master/LICENSE.md) file for details. ## Acknowledgments[​](#acknowledgments "Direct link to heading") Many thanks to [Chema Balsas](https://www.npmjs.com/~jbalsas) for giving up the `apify` package name on NPM and renaming his project to [jsdocify](https://www.npmjs.com/package/jsdocify). ## Index[**](#Index) ### Result Stores * [**Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) ### Scaling * [**ProxyConfiguration](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md) ### Sources * [**RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) ### Other * [**LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * [**Actor](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md) * [**ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) * [**Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) * [**ChargingManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/ChargingManager.md) * [**KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) * [**Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) * [**Logger](https://docs.apify.com/sdk/js/sdk/js/reference/class/Logger.md) * [**LoggerJson](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerJson.md) * [**LoggerText](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerText.md) * [**PlatformEventManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/PlatformEventManager.md) * [**AbortOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/AbortOptions.md) * [**ActorPricingInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorPricingInfo.md) * [**ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md) * [**ApifyClientOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ApifyClientOptions.md) * [**ApifyEnv](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ApifyEnv.md) * [**CallOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/CallOptions.md) * [**CallTaskOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/CallTaskOptions.md) * [**ConfigurationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ConfigurationOptions.md) * [**DatasetConsumer](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetConsumer.md) * [**DatasetContent](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetContent.md) * [**DatasetDataOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetDataOptions.md) * [**DatasetIteratorOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetIteratorOptions.md) * [**DatasetMapper](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetMapper.md) * [**DatasetOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetOptions.md) * [**DatasetReducer](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetReducer.md) * [**ExitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ExitOptions.md) * [**ChargeOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeOptions.md) * [**ChargeResult](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeResult.md) * 
[**InitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/InitOptions.md) * [**KeyConsumer](https://docs.apify.com/sdk/js/sdk/js/reference/interface/KeyConsumer.md) * [**KeyValueStoreIteratorOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/KeyValueStoreIteratorOptions.md) * [**KeyValueStoreOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/KeyValueStoreOptions.md) * [**LoggerOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/LoggerOptions.md) * [**MainOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/MainOptions.md) * [**MetamorphOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/MetamorphOptions.md) * [**OpenStorageOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/OpenStorageOptions.md) * [**ProxyConfigurationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md) * [**ProxyInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyInfo.md) * [**QueueOperationInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/QueueOperationInfo.md) * [**RebootOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RebootOptions.md) * [**RecordOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RecordOptions.md) * [**RequestQueueOperationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RequestQueueOperationOptions.md) * [**RequestQueueOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RequestQueueOptions.md) * [**WebhookOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/WebhookOptions.md) * [**UserFunc](https://docs.apify.com/sdk/js/sdk/js/reference.md#UserFunc) * [**log](https://docs.apify.com/sdk/js/sdk/js/reference.md#log) ## Other ### UserFunc ([view source](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1889)) **UserFunc\<T\>**: () => Awaitable\<T\> #### Type parameters * **T** = unknown #### Type declaration * **(): Awaitable\<T\>** #### Returns Awaitable\<T\> ### log external const **log**: [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) --- # Changelog # Change Log All notable changes to this project will be documented in this file. See [Conventional Commits](https://conventionalcommits.org) for commit guidelines.
## [3.4.4](https://github.com/apify/apify-sdk-js/compare/apify@3.4.3...apify@3.4.4) (2025-07-28)[​](#344-2025-07-28 "Direct link to heading") **Note:** Version bump only for package apify ## [3.4.3](https://github.com/apify/apify-sdk-js/compare/apify@3.4.2...apify@3.4.3) (2025-07-14)[​](#343-2025-07-14 "Direct link to heading") ### Bug Fixes[​](#bug-fixes "Direct link to heading") * Return eventChargeLimitReached=false for charge calls with count=0 ([#395](https://github.com/apify/apify-sdk-js/issues/395)) ([4f97da0](https://github.com/apify/apify-sdk-js/commit/4f97da0cf4bbda33dcaa3d91f0f543f080dbab8b)), closes [#372](https://github.com/apify/apify-sdk-js/issues/372) ## [3.4.2](https://github.com/apify/apify-sdk-js/compare/apify@3.4.1...apify@3.4.2) (2025-05-19)[​](#342-2025-05-19 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-1 "Direct link to heading") * improve extension of `Configuration` class to fix issues in native ESM projects ([#394](https://github.com/apify/apify-sdk-js/issues/394)) ([8842706](https://github.com/apify/apify-sdk-js/commit/884270611e09a0fec40903958f74d458ba454300)) ## [3.4.1](https://github.com/apify/apify-sdk-js/compare/apify@3.4.0...apify@3.4.1) (2025-05-07)[​](#341-2025-05-07 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-2 "Direct link to heading") * convert `[@apilink](https://github.com/apilink)` to `[@link](https://github.com/link)` on build ([#383](https://github.com/apify/apify-sdk-js/issues/383)) ([ccae1ac](https://github.com/apify/apify-sdk-js/commit/ccae1ac9737dfc5bfc64e4586846e413ddb54a37)) * improve check for crawlee version mismatch ([#386](https://github.com/apify/apify-sdk-js/issues/386)) ([721e67d](https://github.com/apify/apify-sdk-js/commit/721e67dbde367b01e1347900b73394221bca0c9d)), closes [#375](https://github.com/apify/apify-sdk-js/issues/375) * prefer proxy password from env var ([#385](https://github.com/apify/apify-sdk-js/issues/385)) ([132b5dc](https://github.com/apify/apify-sdk-js/commit/132b5dc5b0c5b77cad357b4d022b53ab6801a3a2)), closes [#20502](https://github.com/apify/apify-sdk-js/issues/20502) # [3.4.0](https://github.com/apify/apify-sdk-js/compare/apify@3.3.2...apify@3.4.0) (2025-04-01) ### Features[​](#features "Direct link to heading") * sign record's public url ([#358](https://github.com/apify/apify-sdk-js/issues/358)) ([6274cc0](https://github.com/apify/apify-sdk-js/commit/6274cc018ab3b02787c234eacbb025d4d973a95d)) ## [3.3.2](https://github.com/apify/apify-sdk-js/compare/apify@3.3.1...apify@3.3.2) (2025-02-20)[​](#332-2025-02-20 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-3 "Direct link to heading") * ensure `maxTotalChargeUsd` is correctly mapped to number, consider empty string as infinity ([#361](https://github.com/apify/apify-sdk-js/issues/361)) ([bb65f70](https://github.com/apify/apify-sdk-js/commit/bb65f70be4750c8dc1bee368f849fafa924add39)) ## [3.3.1](https://github.com/apify/apify-sdk-js/compare/apify@3.3.0...apify@3.3.1) (2025-02-19)[​](#331-2025-02-19 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-4 "Direct link to heading") * Add workaround for incorrect handling of maxTotalChargeUsd ([#360](https://github.com/apify/apify-sdk-js/issues/360)) ([acb2abe](https://github.com/apify/apify-sdk-js/commit/acb2abe9a3422e5b0b28972085377664173fd3ff)) # [3.3.0](https://github.com/apify/apify-sdk-js/compare/apify@3.2.6...apify@3.3.0) (2025-02-06) ### Bug Fixes[​](#bug-fixes-5 "Direct link to heading") * load `initialCount` in `openRequestQueue()` 
([#339](https://github.com/apify/apify-sdk-js/issues/339)) ([48548cd](https://github.com/apify/apify-sdk-js/commit/48548cd088365b84a0178ba38c5d3da7f4922ade)) * prevent reboot loops ([#345](https://github.com/apify/apify-sdk-js/issues/345)) ([271bc99](https://github.com/apify/apify-sdk-js/commit/271bc999c1a6c75f8e8359214237b51f8ade03c7)) ### Features[​](#features-1 "Direct link to heading") * Actor.charge() ([#346](https://github.com/apify/apify-sdk-js/issues/346)) ([e26e496](https://github.com/apify/apify-sdk-js/commit/e26e49669cae04df11f2138b80549e5cd8611b3c)) * add SDK and Crawlee version to the `ApifyClient`'s user agent ([#335](https://github.com/apify/apify-sdk-js/issues/335)) ([9c069a1](https://github.com/apify/apify-sdk-js/commit/9c069a1643f0a5f417765e9391550ae06c50160f)), closes [#331](https://github.com/apify/apify-sdk-js/issues/331) * add standby URL, change default port ([#328](https://github.com/apify/apify-sdk-js/issues/328)) ([7d265f3](https://github.com/apify/apify-sdk-js/commit/7d265f3e2a7dfdda232e0bbf7c6bb73736112950)) ## [3.2.6](https://github.com/apify/apify-sdk-js/compare/apify@3.2.5...apify@3.2.6) (2024-10-07)[​](#326-2024-10-07 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-6 "Direct link to heading") * decode special characters in proxy `username` and `password` ([#326](https://github.com/apify/apify-sdk-js/issues/326)) ([9a7a4d0](https://github.com/apify/apify-sdk-js/commit/9a7a4d0ecc30f21d2be607840ce28903dbf1d191)) ## [3.2.5](https://github.com/apify/apify-sdk-js/compare/apify@3.2.4...apify@3.2.5) (2024-08-14)[​](#325-2024-08-14 "Direct link to heading") ### Features[​](#features-2 "Direct link to heading") * add `metaOrigin` to `Actor.config` ([#320](https://github.com/apify/apify-sdk-js/issues/320)) ([5a4d2eb](https://github.com/apify/apify-sdk-js/commit/5a4d2ebb9218bd342438f740d035a563444037d3)) ## [3.2.4](https://github.com/apify/apify-sdk-js/compare/apify@3.2.3...apify@3.2.4) (2024-07-04)[​](#324-2024-07-04 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-7 "Direct link to heading") * add `standbyPort` to `ConfigurationOptions` ([#311](https://github.com/apify/apify-sdk-js/issues/311)) ([530b8a1](https://github.com/apify/apify-sdk-js/commit/530b8a133f2808c61e079449156e5ed1fe73ce64)) ### Features[​](#features-3 "Direct link to heading") * add standby port to configuration ([#310](https://github.com/apify/apify-sdk-js/issues/310)) ([cc26098](https://github.com/apify/apify-sdk-js/commit/cc26098da862a7338fdf776956b904d5672a5daf)) ## [3.2.3](https://github.com/apify/apify-sdk-js/compare/apify@3.2.2...apify@3.2.3) (2024-06-03)[​](#323-2024-06-03 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-8 "Direct link to heading") * respect `KVS.getPublicUrl()` from core when not on apify platform ([#302](https://github.com/apify/apify-sdk-js/issues/302)) ([a4d80bb](https://github.com/apify/apify-sdk-js/commit/a4d80bbeee2fd2db145638b17757fa5f673e7452)) ## [3.2.2](https://github.com/apify/apify-sdk-js/compare/apify@3.2.1...apify@3.2.2) (2024-05-23)[​](#322-2024-05-23 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-9 "Direct link to heading") * dont fail on resolution of not installed packages ([0cea251](https://github.com/apify/apify-sdk-js/commit/0cea251b35c652d529320a0570d6b283f52f0ac1)) ## [3.2.1](https://github.com/apify/apify-sdk-js/compare/apify@3.2.0...apify@3.2.1) (2024-05-23)[​](#321-2024-05-23 "Direct link to heading") ### Features[​](#features-4 "Direct link to heading") * validate crawlee versions in `Actor.init` 
([#301](https://github.com/apify/apify-sdk-js/issues/301)) ([66ff6a9](https://github.com/apify/apify-sdk-js/commit/66ff6a9090e9e3321a217e14019e8c3001e3df4d)), closes [#237](https://github.com/apify/apify-sdk-js/issues/237) # [3.2.0](https://github.com/apify/apify-sdk-js/compare/apify@3.1.16...apify@3.2.0) (2024-04-11) ### Features[​](#features-5 "Direct link to heading") * support for proxy tiers ([#290](https://github.com/apify/apify-sdk-js/issues/290)) ([fff3a66](https://github.com/apify/apify-sdk-js/commit/fff3a66d3a0fe5080121cc083e27f59db3d979b5)) ## [3.1.16](https://github.com/apify/apify-sdk-js/compare/apify@3.1.15...apify@3.1.16) (2024-02-23)[​](#3116-2024-02-23 "Direct link to heading") **Note:** Version bump only for package apify ## [3.1.15](https://github.com/apify/apify-sdk-js/compare/apify@3.1.14...apify@3.1.15) (2024-01-08)[​](#3115-2024-01-08 "Direct link to heading") ### Features[​](#features-6 "Direct link to heading") * ignore proxy configuration locally if no valid token or password is found ([#272](https://github.com/apify/apify-sdk-js/issues/272)) ([0931c2e](https://github.com/apify/apify-sdk-js/commit/0931c2e27e48425bfc58c5df80cd42ed66b9395d)), closes [#262](https://github.com/apify/apify-sdk-js/issues/262) ## [3.1.14](https://github.com/apify/apify-sdk-js/compare/apify@3.1.13...apify@3.1.14) (2023-11-27)[​](#3114-2023-11-27 "Direct link to heading") **Note:** Version bump only for package apify ## [3.1.13](https://github.com/apify/apify-sdk-js/compare/apify@3.1.12...apify@3.1.13) (2023-11-15)[​](#3113-2023-11-15 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-10 "Direct link to heading") * **apify:** declare got-scraping as dependency ([#252](https://github.com/apify/apify-sdk-js/issues/252)) ([a6bcf1d](https://github.com/apify/apify-sdk-js/commit/a6bcf1d578a7c7ebbb23b3768e8bbf9e94e2b404)) ## [3.1.12](https://github.com/apify/apify-sdk-js/compare/apify@3.1.11...apify@3.1.12) (2023-10-05)[​](#3112-2023-10-05 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-11 "Direct link to heading") * add more logging to `Actor.init` and `Actor.exit` ([#236](https://github.com/apify/apify-sdk-js/issues/236)) ([b7e01fc](https://github.com/apify/apify-sdk-js/commit/b7e01fc649de84d6f1391bf95e0f349f7ca32536)) ## [3.1.11](https://github.com/apify/apify-sdk-js/compare/apify@3.1.10...apify@3.1.11) (2023-10-04)[​](#3111-2023-10-04 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-12 "Direct link to heading") * run the whole `Actor.exit()` code inside a timeout handler ([#235](https://github.com/apify/apify-sdk-js/issues/235)) ([c8aabae](https://github.com/apify/apify-sdk-js/commit/c8aabaee5f2de1ab40947f47f95f54ccff37cad0)) ### Features[​](#features-7 "Direct link to heading") * Use `.reboot()` instead of `.metamorph()` for reboot ([#227](https://github.com/apify/apify-sdk-js/issues/227)) ([8c0bff5](https://github.com/apify/apify-sdk-js/commit/8c0bff5a8d3ea65e532b3700b34b9c563856158a)) ## [3.1.10](https://github.com/apify/apify-sdk-js/compare/apify@3.1.9...apify@3.1.10) (2023-09-07)[​](#3110-2023-09-07 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-13 "Direct link to heading") * require newer version of crawlee to fix possible issues with `purgeDefaultStorages` ([#226](https://github.com/apify/apify-sdk-js/issues/226)) ([95cf31f](https://github.com/apify/apify-sdk-js/commit/95cf31f3d1d054a1c8e3daac89f41bbb0aaddbba)) ## [3.1.9](https://github.com/apify/apify-sdk-js/compare/apify@3.1.8...apify@3.1.9) (2023-09-06)[​](#319-2023-09-06 "Direct link to heading") **Note:** 
Version bump only for package apify ## [3.1.8](https://github.com/apify/apify-sdk-js/compare/apify@3.1.7...apify@3.1.8) (2023-07-20)[​](#318-2023-07-20 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-14 "Direct link to heading") * require newer version of apify-client and other packages ([24a3a4b](https://github.com/apify/apify-sdk-js/commit/24a3a4b5bf2f61e690348727e7f24c06c45a0999)) ### Features[​](#features-8 "Direct link to heading") * Use Actor env vars ([#216](https://github.com/apify/apify-sdk-js/issues/216)) ([11ff740](https://github.com/apify/apify-sdk-js/commit/11ff740ad3d2bdd37fce011d94b64ea01413b0d9)) ## [3.1.7](https://github.com/apify/apify-sdk-js/compare/apify@3.1.6...apify@3.1.7) (2023-06-09)[​](#317-2023-06-09 "Direct link to heading") **Note:** Version bump only for package apify ## [3.1.6](https://github.com/apify/apify-sdk-js/compare/apify@3.1.5...apify@3.1.6) (2023-06-09)[​](#316-2023-06-09 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-15 "Direct link to heading") * only print status message when explicitly provided in `Actor.exit()` ([#203](https://github.com/apify/apify-sdk-js/issues/203)) ([85159e4](https://github.com/apify/apify-sdk-js/commit/85159e499984c78eee90b6d92332ea63b9f46c8c)) ## [3.1.5](https://github.com/apify/apify-sdk-js/compare/apify@3.1.4...apify@3.1.5) (2023-05-31)[​](#315-2023-05-31 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-16 "Direct link to heading") * add missing `options` parameter to `Actor.setStatusMessage()` ([712e8c6](https://github.com/apify/apify-sdk-js/commit/712e8c66755ac8baeb35fcc1ad000487da8b2c48)) ### Features[​](#features-9 "Direct link to heading") * add `Actor.getInputOrThrow()` method ([#198](https://github.com/apify/apify-sdk-js/issues/198)) ([5fbbfe4](https://github.com/apify/apify-sdk-js/commit/5fbbfe4960a79fbbd23f4fdd7d07a1a5063820f4)) ## [3.1.4](https://github.com/apify/apify-sdk-js/compare/apify@3.1.3...apify@3.1.4) (2023-03-23)[​](#314-2023-03-23 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-17 "Direct link to heading") * log status message only once and without prefix ([#179](https://github.com/apify/apify-sdk-js/issues/179)) ([1f11a6a](https://github.com/apify/apify-sdk-js/commit/1f11a6ad8ebc8a0cfaef58be47ba8b12c75018f1)) ## [3.1.3](https://github.com/apify/apify-sdk-js/compare/apify@3.1.2...apify@3.1.3) (2023-03-22)[​](#313-2023-03-22 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-18 "Direct link to heading") * `call/callTask` accept `waitSecs` instead of `waitForFinish` ([#176](https://github.com/apify/apify-sdk-js/issues/176)) ([f0c73d8](https://github.com/apify/apify-sdk-js/commit/f0c73d8765091212f2abb4b4faaf109f9447d90a)) ### Features[​](#features-10 "Direct link to heading") * terminal message on Actor.exit() ([#172](https://github.com/apify/apify-sdk-js/issues/172)) ([e0feca8](https://github.com/apify/apify-sdk-js/commit/e0feca895766af0d92fbf78ca4c2d7b49bd2acff)) ## [3.1.2](https://github.com/apify/apify-sdk-js/compare/apify@3.1.1...apify@3.1.2) (2023-02-07)[​](#312-2023-02-07 "Direct link to heading") ### Bug Fixes[​](#bug-fixes-19 "Direct link to heading") * declare missing dependency on tslib ([bc27118](https://github.com/apify/apify-sdk-js/commit/bc27118daab211857305f7617b1ee1433da13d4a)) * remove unused export of `QueueOperationInfoOptions` ([b29fe48](https://github.com/apify/apify-sdk-js/commit/b29fe4853d637ab527a7f7e3e53c7a5b0fe27a32)) ## [3.1.1](https://github.com/apify/apify-sdk-js/compare/apify@3.1.0...apify@3.1.1) (2022-11-13)[​](#311-2022-11-13 "Direct link to 
heading") ### Features[​](#features-11 "Direct link to heading") * add `statusMessage` to `AbortOptions` ([fb10bb6](https://github.com/apify/apify-sdk-js/commit/fb10bb60c12c0af97e41ae88adcf0b2000286235)) * warn about Actor not being initialized before using storage methods ([#126](https://github.com/apify/apify-sdk-js/issues/126)) ([91cd246](https://github.com/apify/apify-sdk-js/commit/91cd2467d111de19490a6bf47b4a9138f26a37d4)) # 3.1.0 (2022-10-13) ### Bug Fixes[​](#bug-fixes-20 "Direct link to heading") * **apify:** add `@apify/timeout` to dependencies ([#76](https://github.com/apify/apify-sdk-js/issues/76)) ([1d64a1f](https://github.com/apify/apify-sdk-js/commit/1d64a1fa8f0e88a96eb82c2669e85b09dd4f372d)) * use correct event manager for Actor methods ([#49](https://github.com/apify/apify-sdk-js/issues/49)) ([ef3a0c5](https://github.com/apify/apify-sdk-js/commit/ef3a0c54359be64c89e76b0cac600cd780281321)) * wait for memory storage to write changes before `Actor.exit` exists the process ([c721d98](https://github.com/apify/apify-sdk-js/commit/c721d988141cf5b7aa170fddeffb792ded769622)) ### Features[​](#features-12 "Direct link to heading") * add `Actor.useState()` helper ([#98](https://github.com/apify/apify-sdk-js/issues/98)) ([27dc413](https://github.com/apify/apify-sdk-js/commit/27dc4139caa0a2d94c570edac2cb628f6b3f747c)) * **apify:** add decryption for input secrets ([#83](https://github.com/apify/apify-sdk-js/issues/83)) ([78bb990](https://github.com/apify/apify-sdk-js/commit/78bb990817c01254de19c828937181c1263e21eb)) * re-export the logger in Actor sdk ([#54](https://github.com/apify/apify-sdk-js/issues/54)) ([c78d8a4](https://github.com/apify/apify-sdk-js/commit/c78d8a44d7af5de7fda7bf2e436fefda752a4b1a)) * update @apify/scraper-tools ([#37](https://github.com/apify/apify-sdk-js/issues/37)) ([788913e](https://github.com/apify/apify-sdk-js/commit/788913e0cc669b15b35359df30202a449b881b5f)) * update the scrapers ([#70](https://github.com/apify/apify-sdk-js/issues/70)) ([efbfc44](https://github.com/apify/apify-sdk-js/commit/efbfc442bc8be4f07b5f2432a750cb861d7f05e8)) ## [3.0.0](https://github.com/apify/apify-sdk-js/compare/v2.3.2...v3.0.0) (2022-07-13)[​](#300-2022-07-13 "Direct link to heading") This section summarizes most of the breaking changes between Crawlee (v3) and Apify SDK (v2). Crawlee is the spiritual successor to Apify SDK, so we decided to keep the versioning and release Crawlee as v3. ### Crawlee vs Apify SDK[​](#crawlee-vs-apify-sdk "Direct link to heading") Up until version 3 of `apify`, the package contained both scraping related tools and Apify platform related helper methods. 
With v3 we are splitting the whole project into two main parts: * Crawlee, the new web-scraping library, available as `crawlee` package on NPM * Apify SDK, helpers for the Apify platform, available as `apify` package on NPM Moreover, the Crawlee library is published as several packages under `@crawlee` namespace: * `@crawlee/core`: the base for all the crawler implementations, also contains things like `Request`, `RequestQueue`, `RequestList` or `Dataset` classes * `@crawlee/basic`: exports `BasicCrawler` * `@crawlee/cheerio`: exports `CheerioCrawler` * `@crawlee/browser`: exports `BrowserCrawler` (which is used for creating `@crawlee/playwright` and `@crawlee/puppeteer`) * `@crawlee/playwright`: exports `PlaywrightCrawler` * `@crawlee/puppeteer`: exports `PuppeteerCrawler` * `@crawlee/memory-storage`: `@apify/storage-local` alternative * `@crawlee/browser-pool`: previously `browser-pool` package * `@crawlee/utils`: utility methods * `@crawlee/types`: holds TS interfaces mainly about the `StorageClient` #### Installing Crawlee[​](#installing-crawlee "Direct link to heading") > As Crawlee is not yet released as `latest`, we need to install from the `next` distribution tag! Most of the Crawlee packages are extending and reexporting each other, so it's enough to install just the one you plan on using, e.g. `@crawlee/playwright` if you plan on using `playwright` - it already contains everything from the `@crawlee/browser` package, which includes everything from `@crawlee/basic`, which includes everything from `@crawlee/core`. ``` npm install crawlee@next ``` Or if all we need is cheerio support, we can install only @crawlee/cheerio ``` npm install @crawlee/cheerio@next ``` When using `playwright` or `puppeteer`, we still need to install those dependencies explicitly - this allows the users to be in control of which version will be used. ``` npm install crawlee@next playwright # or npm install @crawlee/playwright@next playwright ``` Alternatively we can also use the `crawlee` meta-package which contains (re-exports) most of the `@crawlee/*` packages, and therefore contains all the crawler classes. > Sometimes you might want to use some utility methods from `@crawlee/utils`, so you might want to install that as well. This package contains some utilities that were previously available under `Apify.utils`. Browser related utilities can be also found in the crawler packages (e.g. `@crawlee/playwright`). ### Full TypeScript support[​](#full-typescript-support "Direct link to heading") Both Crawlee and Apify SDK are full TypeScript rewrite, so they include up-to-date types in the package. For your TypeScript crawlers we recommend using our predefined TypeScript configuration from `@apify/tsconfig` package. Don't forget to set the `module` and `target` to `ES2022` or above to be able to use top level await. > The `@apify/tsconfig` config has [`noImplicitAny`](https://www.typescriptlang.org/tsconfig#noImplicitAny) enabled, you might want to disable it during the initial development as it will cause build failures if you left some unused local variables in your code. 
tsconfig.json ``` { "extends": "@apify/tsconfig", "compilerOptions": { "module": "ES2022", "target": "ES2022", "outDir": "dist", "lib": ["DOM"] }, "include": ["./src/**/*"] } ``` #### Docker build[​](#docker-build "Direct link to heading") For `Dockerfile` we recommend using multi-stage build, so you don't install the dev dependencies like TypeScript in your final image: Dockerfile ``` # using multistage build, as we need dev deps to build the TS source code FROM apify/actor-node:16 AS builder # copy all files, install all dependencies (including dev deps) and build the project COPY . ./ RUN npm install --include=dev \ && npm run build # create final image FROM apify/actor-node:16 # copy only necessary files COPY --from=builder /usr/src/app/package*.json ./ COPY --from=builder /usr/src/app/README.md ./ COPY --from=builder /usr/src/app/dist ./dist COPY --from=builder /usr/src/app/apify.json ./apify.json COPY --from=builder /usr/src/app/INPUT_SCHEMA.json ./INPUT_SCHEMA.json # install only prod deps RUN npm --quiet set progress=false \ && npm install --only=prod --no-optional \ && echo "Installed NPM packages:" \ && (npm list --only=prod --no-optional --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version # run compiled code CMD npm run start:prod ``` ### Browser fingerprints[​](#browser-fingerprints "Direct link to heading") Previously we had a magical `stealth` option in the puppeteer crawler that enabled several tricks aiming to mimic the real users as much as possible. While this worked to a certain degree, we decided to replace it with generated browser fingerprints. In case we don't want to have dynamic fingerprints, we can disable this behaviour via `useFingerprints` in `browserPoolOptions`: ``` const crawler = new PlaywrightCrawler({ browserPoolOptions: { useFingerprints: false, }, }); ``` ### Session cookie method renames[​](#session-cookie-method-renames "Direct link to heading") Previously, if we wanted to get or add cookies for the session that would be used for the request, we had to call `session.getPuppeteerCookies()` or `session.setPuppeteerCookies()`. Since this method could be used for any of our crawlers, not just `PuppeteerCrawler`, the methods have been renamed to `session.getCookies()` and `session.setCookies()` respectively. Otherwise, their usage is exactly the same! ### Memory storage[​](#memory-storage "Direct link to heading") When we store some data or intermediate state (like the one `RequestQueue` holds), we now use `@crawlee/memory-storage` by default. It is an alternative to the `@apify/storage-local`, that stores the state inside memory (as opposed to SQLite database used by `@apify/storage-local`). While the state is stored in memory, it also dumps it to the file system, so we can observe it, as well as respects the existing data stored in KeyValueStore (e.g. the `INPUT.json` file). When we want to run the crawler on Apify platform, we need to use `Actor.init` or `Actor.main`, which will automatically switch the storage client to `ApifyClient` when on the Apify platform. 
We can still use `@apify/storage-local`. To do that, first install it and then pass it to the `Actor.init` or `Actor.main` options:

> `@apify/storage-local` v2.1.0+ is required for Crawlee

```
import { Actor } from 'apify';
import { ApifyStorageLocal } from '@apify/storage-local';

const storage = new ApifyStorageLocal(/* options like `enableWalMode` belong here */);
await Actor.init({ storage });
```

### Purging of the default storage[​](#purging-of-the-default-storage "Direct link to heading")

Previously, the state was preserved between local runs, and we had to use the `--purge` argument of the `apify-cli`. With Crawlee, this is now the default behaviour: we purge the storage automatically on the `Actor.init/main` call. We can opt out of it via `purge: false` in the `Actor.init` options.

### Renamed crawler options and interfaces[​](#renamed-crawler-options-and-interfaces "Direct link to heading")

Some options were renamed to better reflect what they do. We still support all the old parameter names too, but not at the TS level.

* `handleRequestFunction` -> `requestHandler`
* `handlePageFunction` -> `requestHandler`
* `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs`
* `handlePageTimeoutSecs` -> `requestHandlerTimeoutSecs`
* `requestTimeoutSecs` -> `navigationTimeoutSecs`
* `handleFailedRequestFunction` -> `failedRequestHandler`

We also renamed the crawling context interfaces, so they follow the same convention and are more meaningful:

* `CheerioHandlePageInputs` -> `CheerioCrawlingContext`
* `PlaywrightHandlePageFunction` -> `PlaywrightCrawlingContext`
* `PuppeteerHandlePageFunction` -> `PuppeteerCrawlingContext`

### Context aware helpers[​](#context-aware-helpers "Direct link to heading")

Some utilities previously available under the `Apify.utils` namespace have been moved to the crawling context and are *context aware*. This means they have some parameters automatically filled in from the context, like the current `Request` instance or current `Page` object, or the `RequestQueue` bound to the crawler.

#### Enqueuing links[​](#enqueuing-links "Direct link to heading")

One common helper that received more attention is `enqueueLinks`. As mentioned above, it is context aware - we no longer need to pass in the `requestQueue` or `page` arguments (or the cheerio handle `$`). In addition to that, it now offers 3 enqueuing strategies:

* `EnqueueStrategy.All` (`'all'`): Matches any URLs found
* `EnqueueStrategy.SameHostname` (`'same-hostname'`): Matches any URLs that have the same subdomain as the base URL (default)
* `EnqueueStrategy.SameDomain` (`'same-domain'`): Matches any URLs that have the same domain name. For example, `https://wow.an.example.com` and `https://example.com` will both be matched for a base URL of `https://example.com`.

This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on the current page and filter only those targeting the same subdomain. Moreover, we can specify patterns the URL should match via globs:

```
const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks({
            globs: ['https://apify.com/*/*'],
            // we can also use `regexps` and `pseudoUrls` keys here
        });
    },
});
```

### Implicit `RequestQueue` instance[​](#implicit-requestqueue-instance "Direct link to heading")

All crawlers now have the `RequestQueue` instance automatically available via the `crawler.getRequestQueue()` method. It will create the instance for you if it does not exist yet.
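For illustration, a minimal sketch of working with the implicit queue (a `CheerioCrawler` is used here only as an example; any crawler class behaves the same way):

```
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
});

// lazily creates (or returns) the request queue bound to this crawler
const requestQueue = await crawler.getRequestQueue();
await requestQueue.addRequest({ url: 'https://crawlee.dev' });

await crawler.run();
```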
This means we no longer need to create the `RequestQueue` instance manually, and we can just use the `crawler.addRequests()` method described below.

> We can still create the `RequestQueue` explicitly; the `crawler.getRequestQueue()` method will respect that and return the instance provided via crawler options.

### `crawler.addRequests()`[​](#crawleraddrequests "Direct link to heading")

We can now add multiple requests in batches. The newly added `addRequests` method will handle everything for us. It enqueues the first 1000 requests and resolves, while continuing with the rest in the background, again in smaller batches of 1000 items, so we don't run into any API rate limits. This means the crawling will start almost immediately (within a few seconds at most), something previously possible only with a combination of `RequestQueue` and `RequestList`.

```
// will resolve right after the initial batch of 1000 requests is added
const result = await crawler.addRequests([
    /* many requests, can be even millions */
]);

// if we want to wait for all the requests to be added, we can await the `waitForAllRequestsToBeAdded` promise
await result.waitForAllRequestsToBeAdded;
```

### Less verbose error logging[​](#less-verbose-error-logging "Direct link to heading")

Previously, an error thrown from inside the request handler resulted in the full error object being logged. With Crawlee, we log only the error message as a warning as long as we know the request will be retried. If you want to enable verbose logging like in v2, use the `CRAWLEE_VERBOSE_LOG` env var.

### Removal of `requestAsBrowser`[​](#removal-of-requestasbrowser "Direct link to heading")

In v1 we replaced the underlying implementation of `requestAsBrowser` so that it became just a proxy over calling [`got-scraping`](https://github.com/apify/got-scraping) - our custom extension to `got` that tries to mimic real browsers as much as possible. With v3, we are removing `requestAsBrowser` and encouraging the use of [`got-scraping`](https://github.com/apify/got-scraping) directly. For easier migration, we also added a `context.sendRequest()` helper that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping):

```
const crawler = new BasicCrawler({
    async requestHandler({ sendRequest, log }) {
        // we can use the options parameter to override gotScraping options
        const res = await sendRequest({ responseType: 'json' });
        log.info('received body', res.body);
    },
});
```

#### How to use `sendRequest()`?[​](#how-to-use-sendrequest "Direct link to heading")

See [the Got Scraping guide](https://crawlee.dev/docs/guides/got-scraping).

#### Removed options[​](#removed-options "Direct link to heading")

The `useInsecureHttpParser` option has been removed. It's permanently set to `true` in order to better mimic browsers' behavior.

Got Scraping automatically performs protocol negotiation, hence we removed the `useHttp2` option. It's set to `true`, since all modern browsers are capable of HTTP/2 requests and more and more of the web uses it too.

#### Renamed options[​](#renamed-options "Direct link to heading")

In the `requestAsBrowser` approach, some of the options were named differently. Here's a list of the renamed options:

##### `payload`[​](#payload "Direct link to heading")

This option represents the body to send. It could be a `string` or a `Buffer`. However, there is no `payload` option anymore. You need to use `body` instead. Or, if you wish to send JSON, `json`.
Here's an example:

```
// Before:
await Apify.utils.requestAsBrowser({ …, payload: 'Hello, world!' });
await Apify.utils.requestAsBrowser({ …, payload: Buffer.from('c0ffe', 'hex') });
await Apify.utils.requestAsBrowser({ …, json: { hello: 'world' } });

// After:
await gotScraping({ …, body: 'Hello, world!' });
await gotScraping({ …, body: Buffer.from('c0ffe', 'hex') });
await gotScraping({ …, json: { hello: 'world' } });
```

##### `ignoreSslErrors`[​](#ignoresslerrors "Direct link to heading")

It has been renamed to `https.rejectUnauthorized`. By default, it's set to `false` for convenience. However, if you want to make sure the connection is secure, you can do the following:

```
// Before:
await Apify.utils.requestAsBrowser({ …, ignoreSslErrors: false });

// After:
await gotScraping({ …, https: { rejectUnauthorized: true } });
```

Please note: the meanings are opposite! So we needed to invert the values as well.

##### `header-generator` options[​](#header-generator-options "Direct link to heading")

`useMobileVersion`, `languageCode` and `countryCode` no longer exist. Instead, you need to use `headerGeneratorOptions` directly:

```
// Before:
await Apify.utils.requestAsBrowser({
    …,
    useMobileVersion: true,
    languageCode: 'en',
    countryCode: 'US',
});

// After:
await gotScraping({
    …,
    headerGeneratorOptions: {
        devices: ['mobile'], // or ['desktop']
        locales: ['en-US'],
    },
});
```

##### `timeoutSecs`[​](#timeoutsecs "Direct link to heading")

In order to set a timeout, use `timeout.request` (which is in **milliseconds** now).

```
// Before:
await Apify.utils.requestAsBrowser({
    …,
    timeoutSecs: 30,
});

// After:
await gotScraping({
    …,
    timeout: {
        request: 30 * 1000,
    },
});
```

##### `throwOnHttpErrors`[​](#throwonhttperrors "Direct link to heading")

`throwOnHttpErrors` → `throwHttpErrors`. This option makes the request throw on unsuccessful HTTP status codes, for example `404`. By default, it's set to `false`.

##### `decodeBody`[​](#decodebody "Direct link to heading")

`decodeBody` → `decompress`. This option decompresses the body. It defaults to `true` - please do not change this or websites will break (unless you know what you're doing!).

##### `abortFunction`[​](#abortfunction "Direct link to heading")

This function used to make the promise throw on specific responses if it returned `true`. However, it wasn't that useful. You probably want to cancel the request instead, which you can do in the following way:

```
const promise = gotScraping(…);

promise.on('request', request => {
    // Please note this is not a Got Request instance, but a ClientRequest one.
    // https://nodejs.org/api/http.html#class-httpclientrequest
    if (request.protocol !== 'https:') {
        // Insecure request, abort.
        promise.cancel();

        // If you set `isStream` to `true`, please use `stream.destroy()` instead.
    }
});

const response = await promise;
```

### Removal of browser pool plugin mixing[​](#removal-of-browser-pool-plugin-mixing "Direct link to heading")

Previously, you were able to have a browser pool that would mix Puppeteer and Playwright plugins (or even your own custom plugins, if you had built any). As of this version, that is no longer allowed, and creating such a browser pool will cause an error to be thrown (it's expected that all plugins that will be used are of the same type).

### Handling requests outside of browser[​](#handling-requests-outside-of-browser "Direct link to heading")

One small feature worth mentioning is the ability to handle requests with browser crawlers outside the browser.
To do that, we can use a combination of `Request.skipNavigation` and `context.sendRequest()`. Take a look at how to achieve this by checking out the [Skipping navigation for certain requests](https://crawlee.dev/docs/examples/skip-navigation) example!

### Logging[​](#logging "Direct link to heading")

Crawlee exports the default `log` instance directly as a named export. We also have a scoped `log` instance provided in the crawling context - this one will log messages prefixed with the crawler name and should be preferred for logging inside the request handler.

```
const crawler = new CheerioCrawler({
    async requestHandler({ log, request }) {
        log.info(`Opened ${request.loadedUrl}`);
    },
});
```

### Auto-saved crawler state[​](#auto-saved-crawler-state "Direct link to heading")

Every crawler instance now has a `useState()` method that will return a state object we can use. It will be automatically saved when the `persistState` event occurs. The value is cached, so we can freely call this method multiple times and get the exact same reference. No need to worry about saving the value either, as it will happen automatically.

```
const crawler = new CheerioCrawler({
    async requestHandler({ crawler }) {
        const state = await crawler.useState({ foo: [] as number[] });
        // just change the value, no need to care about saving it
        state.foo.push(123);
    },
});
```

### Apify SDK[​](#apify-sdk "Direct link to heading")

The Apify platform helpers can now be found in the Apify SDK (the `apify` NPM package). It exports the `Actor` class that offers the following static helpers:

* `ApifyClient` shortcuts: `addWebhook()`, `call()`, `callTask()`, `metamorph()`
* helpers for running on the Apify platform: `init()`, `exit()`, `fail()`, `main()`, `isAtHome()`, `createProxyConfiguration()`
* storage support: `getInput()`, `getValue()`, `openDataset()`, `openKeyValueStore()`, `openRequestQueue()`, `pushData()`, `setValue()`
* events support: `on()`, `off()`
* other utilities: `getEnv()`, `newClient()`, `reboot()`

`Actor.main` is now just syntax sugar around calling `Actor.init()` at the beginning and `Actor.exit()` at the end (plus wrapping the user function in a try/catch block). All those methods are async and should be awaited - with Node.js 16 we can use top-level await for that. In other words, the following are equivalent:

```
import { Actor } from 'apify';

await Actor.init();
// your code
await Actor.exit('Crawling finished!');
```

```
import { Actor } from 'apify';

await Actor.main(
    async () => {
        // your code
    },
    { statusMessage: 'Crawling finished!' },
);
```

`Actor.init()` will conditionally set the storage implementation of Crawlee to the `ApifyClient` when running on the Apify platform, or keep the default (memory storage) implementation otherwise. It will also subscribe to the websocket events (or mimic them locally). `Actor.exit()` will handle the teardown and call `process.exit()` to ensure our process won't hang indefinitely.

#### Events[​](#events "Direct link to heading")

Apify SDK (v2) exports `Apify.events`, which is an `EventEmitter` instance. With Crawlee, the events are managed by the [`EventManager`](https://crawlee.dev/api/core/class/EventManager) class instead. We can either access it via the `Actor.eventManager` getter, or use the `Actor.on` and `Actor.off` shortcuts instead.

```
-Apify.events.on(...);
+Actor.on(...);
```

> We can also get the [`EventManager`](https://crawlee.dev/api/core/class/EventManager) instance via `Configuration.getEventManager()`.
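For example, a minimal sketch of subscribing to (and unsubscribing from) platform events with the new shortcuts might look like this (`'migrating'` is one of the existing event names; the handler body is just an illustration):

```
import { Actor } from 'apify';

await Actor.init();

// react to an imminent migration to another server
const migrationHandler = () => {
    console.log('Migration is imminent, persisting custom state...');
};
Actor.on('migrating', migrationHandler);

// listeners can be removed again via Actor.off()
Actor.off('migrating', migrationHandler);

await Actor.exit();
```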
In addition to the existing events, we now have an `exit` event fired when calling `Actor.exit()` (which is called at the end of `Actor.main()`). This event allows you to gracefully shut down any resources when `Actor.exit` is called.

### Smaller/internal breaking changes[​](#smallerinternal-breaking-changes "Direct link to heading")

* `Apify.call()` is now just a shortcut for running `ApifyClient.actor(actorId).call(input, options)`, while also taking the token from the env vars into account
* `Apify.callTask()` is now just a shortcut for running `ApifyClient.task(taskId).call(input, options)`, while also taking the token from the env vars into account
* `Apify.metamorph()` is now just a shortcut for running `ApifyClient.run(runId).metamorph(targetActorId, input, options)`, while also taking the ACTOR\_RUN\_ID from the env vars into account
* `Apify.waitForRunToFinish()` has been removed, use `ApifyClient.waitForFinish()` instead
* `Actor.main/init` purges the storage by default
* remove `purgeLocalStorage` helper, move purging to the storage class directly
* `StorageClient` interface now has an optional `purge` method
* purging happens automatically via `Actor.init()` (you can opt out via `purge: false` in the options of the `init/main` methods)
* `QueueOperationInfo.request` is no longer available
* `Request.handledAt` is now a string date in ISO format
* `Request.inProgress` and `Request.reclaimed` are now `Set`s instead of POJOs
* `injectUnderscore` from puppeteer utils has been removed
* `APIFY_MEMORY_MBYTES` is no longer taken into account, use `CRAWLEE_AVAILABLE_MEMORY_RATIO` instead
* some `AutoscaledPool` options are no longer available:
    * `cpuSnapshotIntervalSecs` and `memorySnapshotIntervalSecs` have been replaced with the top-level `systemInfoIntervalMillis` configuration
    * `maxUsedCpuRatio` has been moved to the top-level configuration
* `ProxyConfiguration.newUrlFunction` can be async. `.newUrl()` and `.newProxyInfo()` now return promises.
* `prepareRequestFunction` and `postResponseFunction` options are removed, use navigation hooks instead
* `gotoFunction` and `gotoTimeoutSecs` are removed
* removed compatibility fix for old/broken request queues with null `Request` props
* `fingerprintsOptions` renamed to `fingerprintOptions` (`fingerprints` -> `fingerprint`).
* `fingerprintOptions` now accept `useFingerprintCache` and `fingerprintCacheSize` (instead of `useFingerprintPerProxyCache` and `fingerprintPerProxyCacheSize`, which are no longer available). This is because the cached fingerprints are no longer connected to proxy URLs but to sessions.

## [2.3.2](https://github.com/apify/apify-sdk-js/compare/v2.3.1...v2.3.2) (2022-05-05)[​](#232-2022-05-05 "Direct link to heading")

* fix: use default user agent for playwright with chrome instead of the default "headless UA"
* fix: always hide webdriver of chrome browsers

## [2.3.1](https://github.com/apify/apify-sdk-js/compare/v2.3.0...v2.3.1) (2022-05-03)[​](#231-2022-05-03 "Direct link to heading")

* fix: `utils.apifyClient` early instantiation (#1330)
* feat: `utils.playwright.injectJQuery()` (#1337)
* feat: add `keyValueStore` option to `Statistics` class (#1345)
* fix: ensure failed req count is correct when using `RequestList` (#1347)
* fix: random puppeteer crawler (running in headful mode) failure (#1348)

> This should help with the `We either navigate top level or have old version of the navigated frame` bug in puppeteer.
* fix: allow returning falsy values in `RequestTransform`'s return type

## [2.3.0](https://github.com/apify/apify-sdk-js/compare/v2.2.2...v2.3.0) (2022-04-07)[​](#230-2022-04-07 "Direct link to heading")

* feat: accept more social media patterns (#1286)
* feat: add multiple click support to `enqueueLinksByClickingElements` (#1295)
* feat: instance-scoped "global" configuration (#1315)
* feat: requestList accepts proxyConfiguration for requestsFromUrls (#1317)
* feat: update `playwright` to v1.20.2
* feat: update `puppeteer` to v13.5.2

> We noticed that with this version of puppeteer, an Actor run could crash with the `We either navigate top level or have old version of the navigated frame` error (puppeteer issue [here](https://github.com/puppeteer/puppeteer/issues/7050)). It should not happen while running the browser in headless mode. In case you need to run the browser in headful mode (`headless: false`), we recommend pinning the puppeteer version to `10.4.0` in the Actor `package.json` file.

* feat: stealth deprecation (#1314)
* feat: allow passing a stream to KeyValueStore.setRecord (#1325)
* fix: use correct apify-client instance for snapshotting (#1308)
* fix: automatically reset `RequestQueue` state after 5 minutes of inactivity, closes #997
* fix: improve guessing of chrome executable path on windows (#1294)
* fix: prune CPU snapshots locally (#1313)
* fix: improve browser launcher types (#1318)

### 0 concurrency mitigation[​](#0-concurrency-mitigation "Direct link to heading")

This release should resolve the 0 concurrency bug by automatically resetting the internal `RequestQueue` state after 5 minutes of inactivity.

We now track the last activity performed on a `RequestQueue` instance:

* added new request
* started processing a request (added to `inProgress` cache)
* marked request as handled
* reclaimed request

If we don't detect one of those actions in the last 5 minutes, and we have some requests in the `inProgress` cache, we try to reset the state. We can override this limit via the `CRAWLEE_INTERNAL_TIMEOUT` env var.

This should finally resolve the 0 concurrency bug, as it was always about stuck requests in the `inProgress` cache.

## [2.2.2](https://github.com/apify/apify-sdk-js/compare/v2.2.1...v2.2.2) (2022-02-14)[​](#222-2022-02-14 "Direct link to heading")

* fix: ensure `request.headers` is set
* fix: lower `RequestQueue` API timeout to 30 seconds
* improve logging for fetching next request and timeouts

## [2.2.1](https://github.com/apify/apify-sdk-js/compare/v2.2.0...v2.2.1) (2022-01-03)[​](#221-2022-01-03 "Direct link to heading")

* fix: ignore requests that are no longer in progress (#1258)
* fix: do not use `tryCancel()` from inside sync callback (#1265)
* fix: revert to puppeteer 10.x (#1276)
* fix: wait when `body` is not available in `infiniteScroll()` from Puppeteer utils (#1238)
* fix: expose logger classes on the `utils.log` instance (#1278)

## [2.2.0](https://github.com/apify/apify-sdk-js/compare/v2.1.0...v2.2.0) (2021-12-17)[​](#220-2021-12-17 "Direct link to heading")

### Proxy per page[​](#proxy-per-page "Direct link to heading")

Up until now, browser crawlers used the same session (and therefore the same proxy) for all requests from a single browser. Now we can get a new proxy for each session. This means that with incognito pages, each page will get a new proxy, aligning the behaviour with `CheerioCrawler`.

This feature is not enabled by default.
To use it, we need to enable the `useIncognitoPages` flag under `launchContext`:

```
new Apify.PlaywrightCrawler({
    launchContext: {
        useIncognitoPages: true,
    },
    // ...
});
```

> Note that currently there is a performance overhead for using `useIncognitoPages`. Use this flag at your own discretion. We are planning to enable this feature by default in SDK v3.0.

### Abortable timeouts[​](#abortable-timeouts "Direct link to heading")

Previously, when a page function timed out, the task still kept running. This could lead to requests being processed multiple times. In v2.2 we now have abortable timeouts that will cancel the task as early as possible.

### Mitigation of zero concurrency issue[​](#mitigation-of-zero-concurrency-issue "Direct link to heading")

Several new timeouts were added to the task function, which should help mitigate the zero concurrency bug. Namely, fetching the next request information and reclaiming failed requests back to the queue are now executed with a timeout, with 3 additional retries before the task fails. The timeout is always at least 300s (5 minutes), or `requestHandlerTimeoutSecs` if that value is higher.

### Full list of changes[​](#full-list-of-changes "Direct link to heading")

* fix `RequestError: URI malformed` in cheerio crawler (#1205)
* only provide Cookie header if cookies are present (#1218)
* handle extra cases for `diffCookie` (#1217)
* add timeout for task function (#1234)
* implement proxy per page in browser crawlers (#1228)
* add fingerprinting support (#1243)
* implement abortable timeouts (#1245)
* add timeouts with retries to `runTaskFunction()` (#1250)
* automatically convert google spreadsheet URLs to CSV exports (#1255)

## [2.1.0](https://github.com/apify/apify-sdk-js/compare/v2.0.7...v2.1.0) (2021-10-07)[​](#210-2021-10-07 "Direct link to heading")

* automatically convert google docs share urls to csv download ones in request list (#1174)
* use puppeteer emulating scrolls instead of `window.scrollBy` (#1170)
* warn if apify proxy is used in proxyUrls (#1173)
* fix `YOUTUBE_REGEX_STRING` being too greedy (#1171)
* add `purgeLocalStorage` utility method (#1187)
* catch errors inside request interceptors (#1188, #1190)
* add support for cgroups v2 (#1177)
* fix incorrect offset in `fixUrl` function (#1184)
* support channel and user links in YouTube regex (#1178)
* fix: allow passing `requestsFromUrl` to `RequestListOptions` in TS (#1191)
* allow passing `forceCloud` down to the KV store (#1186), closes #752
* merge cookies from session with user provided ones (#1201), closes #1197
* use `ApifyClient` v2 (full rewrite to TS)

## [2.0.7](https://github.com/apify/apify-sdk-js/compare/v2.0.6...v2.0.7) (2021-09-08)[​](#207-2021-09-08 "Direct link to heading")

* Fix casting of int/bool environment variables (e.g. `APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE`), closes #956
* Fix incognito pages and user data dir (#1145)
* Add `@ts-ignore` comments to imports of optional peer dependencies (#1152)
* Use config instance in `sdk.openSessionPool()` (#1154)
* Add a breaking callback to `infiniteScroll` (#1140)

## [2.0.6](https://github.com/apify/apify-sdk-js/compare/v2.0.5...v2.0.6) (2021-08-27)[​](#206-2021-08-27 "Direct link to heading")

* Fix deprecation messages logged from `ProxyConfiguration` and `CheerioCrawler`.
* Update `got-scraping` to receive multiple improvements.
## [2.0.5](https://github.com/apify/apify-sdk-js/compare/v2.0.4...v2.0.5) (2021-08-24)[​](#205-2021-08-24 "Direct link to heading")

* Fix error handling in puppeteer crawler

## [2.0.4](https://github.com/apify/apify-sdk-js/compare/v2.0.3...v2.0.4) (2021-08-23)[​](#204-2021-08-23 "Direct link to heading")

* Use `sessionToken` with `got-scraping`

## [2.0.3](https://github.com/apify/apify-sdk-js/compare/v2.0.2...v2.0.3) (2021-08-20)[​](#203-2021-08-20 "Direct link to heading")

* **BREAKING IN EDGE CASES** - We removed `forceUrlEncoding` in `requestAsBrowser` because we found out that recent versions of the underlying HTTP client `got` already encode URLs and `forceUrlEncoding` could lead to weird behavior. We think of this as fixing a bug, so we're not bumping the major version.
* Limit `handleRequestTimeoutMillis` to max valid value to prevent Node.js fallback to `1`.
* Use `got-scraping@^3.0.1`
* Disable SSL validation on MITM proxies
* Limit `handleRequestTimeoutMillis` to max valid value

## [2.0.2](https://github.com/apify/apify-sdk-js/compare/v2.0.1...v2.0.2) (2021-08-12)[​](#202-2021-08-12 "Direct link to heading")

* Fix serialization issues in `CheerioCrawler` caused by parser conflicts in recent versions of `cheerio`.

## [2.0.1](https://github.com/apify/apify-sdk-js/compare/v2.0.0...v2.0.1) (2021-08-06)[​](#201-2021-08-06 "Direct link to heading")

* Use `got-scraping` 2.0.1 until fully compatible.

## [2.0.0](https://github.com/apify/apify-sdk-js/compare/v1.3.4...v2.0.0) (2021-08-05)[​](#200-2021-08-05 "Direct link to heading")

* **BREAKING**: Require Node.js >=15.10.0 because HTTP2 support on lower Node.js versions is very buggy.
* **BREAKING**: Bump `cheerio` to `1.0.0-rc.10` from `rc.3`. There were breaking changes in `cheerio` between the versions, so this bump might be breaking for you as well.
* Remove `LiveViewServer`, which was deprecated before the release of SDK v1.

---

# Actor

The `Actor` class serves as an alternative approach to the static helpers exported from the package. It allows you to pass a configuration that will be used by the instance methods. Environment variables will have precedence over this configuration. See [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) for details about what can be configured and what the default values are.
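For illustration, a minimal sketch of this instance-based usage (the `defaultKeyValueStoreId` option is only an assumed example of a `ConfigurationOptions` field; use whichever options you actually need):

```
import { Actor } from 'apify';

// An Actor instance with its own configuration. Environment variables
// still take precedence over the values passed here.
const actor = new Actor({ defaultKeyValueStoreId: 'my-other-store' });

// Instance methods then use that configuration, e.g.:
const input = await actor.getInputOrThrow();
```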
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**initialized](#initialized) ### Accessors * [**apifyClient](#apifyClient) * [**config](#config) ### Methods * [**getInputOrThrow](#getInputOrThrow) * [**useState](#useState) * [**abort](#abort) * [**addWebhook](#addWebhook) * [**call](#call) * [**callTask](#callTask) * [**createProxyConfiguration](#createProxyConfiguration) * [**exit](#exit) * [**fail](#fail) * [**getEnv](#getEnv) * [**getChargingManager](#getChargingManager) * [**getInput](#getInput) * [**getInputOrThrow](#getInputOrThrow) * [**getValue](#getValue) * [**charge](#charge) * [**init](#init) * [**isAtHome](#isAtHome) * [**main](#main) * [**metamorph](#metamorph) * [**newClient](#newClient) * [**off](#off) * [**on](#on) * [**openDataset](#openDataset) * [**openKeyValueStore](#openKeyValueStore) * [**openRequestQueue](#openRequestQueue) * [**pushData](#pushData) * [**reboot](#reboot) * [**setStatusMessage](#setStatusMessage) * [**setValue](#setValue) * [**start](#start) * [**useState](#useState) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L93)constructor * ****new Actor**(options): [Actor](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md)\ - #### Parameters * ##### options: ConfigurationOptions = {} #### Returns [Actor](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md)\ ## Properties[**](#Properties) ### [**](#initialized)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L78)initialized **initialized: boolean = false Whether the Actor instance was initialized. This is set by calling [Actor.init](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#init). ## Accessors[**](#Accessors) ### [**](#apifyClient)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1708)staticapifyClient * **get apifyClient(): [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) - Default [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) instance. *** #### Returns [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) ### [**](#config)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1713)staticconfig * **get config(): [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) - Default [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) instance. *** #### Returns [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) ## Methods[**](#Methods) ### [**](#getInputOrThrow)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L846)getInputOrThrow * ****getInputOrThrow**(): Promise\ - Gets the Actor input value just like the [Actor.getInput](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#getInput) method, but throws if it is not found. *** #### Returns Promise\ ### [**](#useState)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1079)useState * ****useState**(name, defaultValue, options): Promise\ - Easily create and manage state values. All state values are automatically persisted. Values can be modified by simply using the assignment operator. *** #### Parameters * ##### optionalname: string The name of the store to use. * ##### defaultValue: State = ... 
If the store does not yet have a value in it, the value will be initialized with the `defaultValue` you provide. * ##### optionaloptions: UseStateOptions An optional object parameter where a custom `keyValueStoreName` and `config` can be passed in. #### Returns Promise\ ### [**](#abort)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1320)staticabort * ****abort**(runId, options): Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> - Aborts given Actor run on the Apify platform using the current user account (determined by the `APIFY_TOKEN` environment variable). The result of the function is an [ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md) object that contains details about the Actor run. For more information about Actors, read the [documentation](https://docs.apify.com/actor). **Example usage:** ``` const run = await Actor.abort(runId); ``` *** #### Parameters * ##### runId: string * ##### options: [AbortOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/AbortOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> ### [**](#addWebhook)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1361)staticaddWebhook * ****addWebhook**(options): Promise\ - Creates an ad-hoc webhook for the current Actor run, which lets you receive a notification when the Actor run finished or failed. For more information about Apify Actor webhooks, please see the [documentation](https://docs.apify.com/webhooks). Note that webhooks are only supported for Actors running on the Apify platform. In local environment, the function will print a warning and have no effect. *** #### Parameters * ##### options: [WebhookOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/WebhookOptions.md) #### Returns Promise\ The return value is the Webhook object. For more information, see the [Get webhook](https://apify.com/docs/api/v2#/reference/webhooks/webhook-object/get-webhook) API endpoint. ### [**](#call)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1246)staticcall * ****call**(actorId, input, options): Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> - Runs an Actor on the Apify platform using the current user account (determined by the `APIFY_TOKEN` environment variable). The result of the function is an [ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md) object that contains details about the Actor run. If you want to run an Actor task rather than an Actor, please use the [Actor.callTask](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#callTask) function instead. For more information about Actors, read the [documentation](https://docs.apify.com/actor). **Example usage:** ``` const run = await Actor.call('apify/hello-world', { myInput: 123 }); ``` *** #### Parameters * ##### actorId: string Allowed formats are `username/actor-name`, `userId/actor-name` or Actor ID. * ##### optionalinput: unknown Input for the Actor. If it is an object, it will be stringified to JSON and its content type set to `application/json; charset=utf-8`. Otherwise the `options.contentType` parameter must be provided. 
* ##### optionaloptions: [CallOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/CallOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> ### [**](#callTask)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1275)staticcallTask * ****callTask**(taskId, input, options): Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> - Runs an Actor task on the Apify platform using the current user account (determined by the `APIFY_TOKEN` environment variable). The result of the function is an [ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md) object that contains details about the Actor run. Note that an Actor task is a saved input configuration and options for an Actor. If you want to run an Actor directly rather than an Actor task, please use the [Actor.call](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#call) function instead. For more information about Actor tasks, read the [documentation](https://docs.apify.com/tasks). **Example usage:** ``` const run = await Actor.callTask('bob/some-task'); ``` *** #### Parameters * ##### taskId: string Allowed formats are `username/task-name`, `userId/task-name` or task ID. * ##### optionalinput: Dictionary Input overrides for the Actor task. If it is an object, it will be stringified to JSON and its content type set to `application/json; charset=utf-8`. Provided input will be merged with Actor task input. * ##### optionaloptions: [CallTaskOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/CallTaskOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> ### [**](#createProxyConfiguration)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1656)staticcreateProxyConfiguration * ****createProxyConfiguration**(proxyConfigurationOptions): Promise\ - Creates a proxy configuration and returns a promise resolving to an instance of the [ProxyConfiguration](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md) class that is already initialized. Configures connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. For more details and code examples, see the [ProxyConfiguration](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md) class. ``` // Returns initialized proxy configuration class const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['GROUP1', 'GROUP2'] // List of Apify proxy groups countryCode: 'US' }); const crawler = new CheerioCrawler({ // ... proxyConfiguration, requestHandler({ proxyInfo }) { const usedProxyUrl = proxyInfo.url; // Getting the proxy URL } }) ``` For compatibility with existing Actor Input UI (Input Schema), this function returns `undefined` when the following object is passed as `proxyConfigurationOptions`. 
``` { useApifyProxy: false } ``` *** #### Parameters * ##### proxyConfigurationOptions: [ProxyConfigurationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md) & { useApifyProxy?: boolean } = {} #### Returns Promise\ ### [**](#exit)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1202)staticexit * ****exit**(messageOrOptions, options): Promise\ - Gracefully exits the Actor run with the provided status message and exit code. *** #### Parameters * ##### optionalmessageOrOptions: string | [ExitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ExitOptions.md) First parameter accepts either a string (a terminal status message) or an `ExitOptions` object. * ##### options: [ExitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ExitOptions.md) = {} Second parameter accepts an `ExitOptions` object. #### Returns Promise\ ### [**](#fail)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1211)staticfail * ****fail**(messageOrOptions, options): Promise\ - Calls `Actor.exit()` with `options.exitCode` set to `1`. *** #### Parameters * ##### optionalmessageOrOptions: string | [ExitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ExitOptions.md) First parameter accepts either a string (a terminal status message) or an `ExitOptions` object. * ##### options: [ExitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ExitOptions.md) = {} Second parameter accepts an `ExitOptions` object. #### Returns Promise\ ### [**](#getEnv)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1685)staticgetEnv * ****getEnv**(): [ApifyEnv](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ApifyEnv.md) - Returns a new [ApifyEnv](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ApifyEnv.md) object which contains information parsed from all the Apify environment variables. For the list of the Apify environment variables, see [Actor documentation](https://docs.apify.com/actor/run#environment-variables). If some of the variables are not defined or are invalid, the corresponding value in the resulting object will be null. *** #### Returns [ApifyEnv](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ApifyEnv.md) ### [**](#getChargingManager)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1674)staticgetChargingManager * ****getChargingManager**(): [ChargingManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/ChargingManager.md) - Retrieve the charging manager to access granular pricing information. *** #### Returns [ChargingManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/ChargingManager.md) ### [**](#getInput)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1571)staticgetInput * ****getInput**(): Promise\ - Gets the Actor input value from the default [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) associated with the current Actor run. This is just a convenient shortcut for [`keyValueStore.getValue('INPUT')`](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#getValue). For example, calling the following code: ``` const input = await Actor.getInput(); ``` is equivalent to: ``` const store = await Actor.openKeyValueStore(); await store.getValue('INPUT'); ``` Note that the `getInput()` function does not cache the value read from the key-value store. 
If you need to use the input multiple times in your Actor, it is far more efficient to read it once and store it locally. For more information, see [Actor.openKeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#openKeyValueStore) and [KeyValueStore.getValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#getValue). *** #### Returns Promise\ Returns a promise that resolves to an object, string or [`Buffer`](https://nodejs.org/api/buffer.html), depending on the MIME content type of the record, or `null` if the record is missing. ### [**](#getInputOrThrow)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1579)staticgetInputOrThrow * ****getInputOrThrow**(): Promise\ - Gets the Actor input value just like the [Actor.getInput](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#getInput) method, but throws if it is not found. *** #### Returns Promise\ ### [**](#getValue)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1506)staticgetValue * ****getValue**(key): Promise\ - Gets a value from the default [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) associated with the current Actor run. This is just a convenient shortcut for [KeyValueStore.getValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#getValue). For example, calling the following code: ``` const value = await Actor.getValue('my-key'); ``` is equivalent to: ``` const store = await Actor.openKeyValueStore(); const value = await store.getValue('my-key'); ``` To store the value to the default key-value store, you can use the [Actor.setValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#setValue) function. For more information, see [Actor.openKeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#openKeyValueStore) and [KeyValueStore.getValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#getValue). *** #### Parameters * ##### key: string Unique record key. #### Returns Promise\ Returns a promise that resolves to an object, string or [`Buffer`](https://nodejs.org/api/buffer.html), depending on the MIME content type of the record, or `null` if the record is missing. ### [**](#charge)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1667)staticcharge * ****charge**(options): Promise<[ChargeResult](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeResult.md)> - Charge for a specified number of events - sub-operations of the Actor. *** #### Parameters * ##### options: [ChargeOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeOptions.md) The name of the event to charge for and the number of events to be charged. #### Returns Promise<[ChargeResult](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeResult.md)> ### [**](#init)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1193)staticinit * ****init**(options): Promise\ - Initializes the Actor, enabling support for the [Apify platform](https://apify.com/actors) dynamically based on `APIFY_IS_AT_HOME` env var. If you are not running the code on Apify, you don't need to use it. The method will switch storage client implementation automatically, so when you run on the Apify platform, it will use its API instead of the default memory storage. It also increases the available memory ratio from 25% to 100% on the platform. 
Calling `Actor.exit()` is required if you use the `Actor.init()` method, since it opens websocket connection (see Actor.events for details), which needs to be terminated for the code to finish. ``` import { gotScraping } from 'got-scraping'; await Actor.init(); const html = await gotScraping('http://www.example.com'); console.log(html); await Actor.exit(); ``` *** #### Parameters * ##### options: [InitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/InitOptions.md) = {} #### Returns Promise\ ### [**](#isAtHome)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1703)staticisAtHome * ****isAtHome**(): boolean - Returns `true` when code is running on Apify platform and `false` otherwise (for example locally). *** #### Returns boolean ### [**](#main)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1166)staticmain * ****main**(userFunc, options): Promise\ - Runs the main user function that performs the job of the Actor and terminates the process when the user function finishes. **The `Actor.main()` function is optional** and is provided merely for your convenience. It is mainly useful when you're running your code as an Actor on the [Apify platform](https://apify.com/actors). However, if you want to use Apify SDK tools directly inside your existing projects, e.g. running in an [Express](https://expressjs.com/) server, on [Google Cloud functions](https://cloud.google.com/functions) or [AWS Lambda](https://aws.amazon.com/lambda/), it's better to avoid it since the function terminates the main process when it finishes! The `Actor.main()` function performs the following actions: * When running on the Apify platform (i.e. `APIFY_IS_AT_HOME` environment variable is set), it sets up a connection to listen for platform events. For example, to get a notification about an imminent migration to another server. See Actor.events for details. * It invokes the user function passed as the `userFunc` parameter. * If the user function returned a promise, waits for it to resolve. * If the user function throws an exception or some other error is encountered, prints error details to console so that they are stored to the log. * Exits the Node.js process, with zero exit code on success and non-zero on errors. The user function can be synchronous: ``` await Actor.main(() => { // My synchronous function that returns immediately console.log('Hello world from Actor!'); }); ``` If the user function returns a promise, it is considered asynchronous: ``` import { gotScraping } from 'got-scraping'; await Actor.main(() => { // My asynchronous function that returns a promise return gotScraping('http://www.example.com').then((html) => { console.log(html); }); }); ``` To simplify your code, you can take advantage of the `async`/`await` keywords: ``` import { gotScraping } from 'got-scraping'; await Actor.main(async () => { // My asynchronous function const html = await gotScraping('http://www.example.com'); console.log(html); }); ``` *** #### Parameters * ##### userFunc: [UserFunc](https://docs.apify.com/sdk/js/sdk/js/reference.md#UserFunc)\ User function to be executed. If it returns a promise, the promise will be awaited. The user function is called with no arguments. 
* ##### optionaloptions: [MainOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/MainOptions.md) #### Returns Promise\ ### [**](#metamorph)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1337)staticmetamorph * ****metamorph**(targetActorId, input, options): Promise\ - Transforms this Actor run to an Actor run of a given Actor. The system stops the current container and starts the new container instead. All the default storages are preserved and the new input is stored under the `INPUT-METAMORPH-1` key in the same default key-value store. *** #### Parameters * ##### targetActorId: string Either `username/actor-name` or Actor ID of an Actor to which we want to metamorph. * ##### optionalinput: unknown Input for the Actor. If it is an object, it will be stringified to JSON and its content type set to `application/json; charset=utf-8`. Otherwise, the `options.contentType` parameter must be provided. * ##### optionaloptions: [MetamorphOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/MetamorphOptions.md) = {} #### Returns Promise\ ### [**](#newClient)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1696)staticnewClient * ****newClient**(options): [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) - Returns a new instance of the Apify API client. The `ApifyClient` class is provided by the [apify-client](https://www.npmjs.com/package/apify-client) NPM package, and it is automatically configured using the `APIFY_API_BASE_URL`, and `APIFY_TOKEN` environment variables. You can override the token via the available options. That's useful if you want to use the client as a different Apify user than the SDK internals are using. *** #### Parameters * ##### options: [ApifyClientOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ApifyClientOptions.md) = {} #### Returns [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) ### [**](#off)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1219)staticoff * ****off**(event, listener): void - #### Parameters * ##### event: EventTypeName * ##### optionallistener: (...args) => any #### Returns void ### [**](#on)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1215)staticon * ****on**(event, listener): void - #### Parameters * ##### event: EventTypeName * ##### listener: (...args) => any #### Returns void ### [**](#openDataset)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1473)staticopenDataset * ****openDataset**(datasetIdOrName, options): Promise<[Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md)\> - Opens a dataset and returns a promise resolving to an instance of the [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) class. Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the cloud. For more details and code examples, see the [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) class. *** #### Parameters * ##### optionaldatasetIdOrName: null | string ID or name of the dataset to be opened. If `null` or `undefined`, the function returns the default dataset associated with the Actor run. 
* ##### optionaloptions: [OpenStorageOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/OpenStorageOptions.md) = {} #### Returns Promise<[Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md)\> ### [**](#openKeyValueStore)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1597)staticopenKeyValueStore * ****openKeyValueStore**(storeIdOrName, options): Promise<[KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md)> - Opens a key-value store and returns a promise resolving to an instance of the [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) class. Key-value stores are used to store records or files, along with their MIME content type. The records are stored and retrieved using a unique key. The actual data is stored either on a local filesystem or in the Apify cloud. For more details and code examples, see the [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) class. *** #### Parameters * ##### optionalstoreIdOrName: null | string ID or name of the key-value store to be opened. If `null` or `undefined`, the function returns the default key-value store associated with the Actor run. * ##### optionaloptions: [OpenStorageOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/OpenStorageOptions.md) = {} #### Returns Promise<[KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md)> ### [**](#openRequestQueue)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1617)staticopenRequestQueue * ****openRequestQueue**(queueIdOrName, options): Promise<[RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md)> - Opens a request queue and returns a promise resolving to an instance of the [RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) class. [RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) represents a queue of URLs to crawl, which is stored either on local filesystem or in the cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. For more details and code examples, see the [RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) class. *** #### Parameters * ##### optionalqueueIdOrName: null | string ID or name of the request queue to be opened. If `null` or `undefined`, the function returns the default request queue associated with the Actor run. 
* ##### optionaloptions: [OpenStorageOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/OpenStorageOptions.md) = {} #### Returns Promise<[RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md)> ### [**](#pushData)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1402)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1427)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1452)staticpushData * ****pushData**(item): Promise\ * ****pushData**(item, eventName): Promise<[ChargeResult](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeResult.md)> - Stores an object or an array of objects to the default [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) of the current Actor run. This is just a convenient shortcut for [Dataset.pushData](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#pushData). For example, calling the following code: ``` await Actor.pushData({ myValue: 123 }); ``` is equivalent to: ``` const dataset = await Actor.openDataset(); await dataset.pushData({ myValue: 123 }); ``` For more information, see [Actor.openDataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#openDataset) and [Dataset.pushData](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#pushData) **IMPORTANT**: Make sure to use the `await` keyword when calling `pushData()`, otherwise the Actor process might finish before the data are stored! *** #### Parameters * ##### item: Data | Data\[] Object or array of objects containing data to be stored in the default dataset. The objects must be serializable to JSON and the JSON representation of each object must be smaller than 9MB. #### Returns Promise\ ### [**](#reboot)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1346)staticreboot * ****reboot**(options): Promise\ - Internally reboots this Actor run. The system stops the current container and starts a new container with the same run id. This can be used to get the Actor out of irrecoverable error state and continue where it left off. *** #### Parameters * ##### options: [RebootOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RebootOptions.md) = {} #### Returns Promise\ ### [**](#setStatusMessage)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1375)staticsetStatusMessage * ****setStatusMessage**(statusMessage, options): Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> - Sets the status message for the current Actor run. *** #### Parameters * ##### statusMessage: string The status message to set. * ##### optionaloptions: SetStatusMessageOptions #### Returns Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> The return value is the Run object. When run locally, this method returns empty object (`{}`). For more information, see the [Actor Runs](https://docs.apify.com/api/v2#/reference/actor-runs/) API endpoints. ### [**](#setValue)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1540)staticsetValue * ****setValue**(key, value, options): Promise\ - Stores or deletes a value in the default [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) associated with the current Actor run. 
This is just a convenient shortcut for [KeyValueStore.setValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#setValue). For example, calling the following code: ``` await Actor.setValue('OUTPUT', { foo: "bar" }); ``` is equivalent to: ``` const store = await Actor.openKeyValueStore(); await store.setValue('OUTPUT', { foo: "bar" }); ``` To get a value from the default key-value store, you can use the [Actor.getValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#getValue) function. For more information, see [Actor.openKeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#openKeyValueStore) and [KeyValueStore.getValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#getValue). *** #### Parameters * ##### key: string Unique record key. * ##### value: null | T Record data, which can be one of the following values: * If `null`, the record in the key-value store is deleted. * If no `options.contentType` is specified, `value` can be any JavaScript object, and it will be stringified to JSON. * If `options.contentType` is set, `value` is taken as is, and it must be a `String` or [`Buffer`](https://nodejs.org/api/buffer.html). For any other value an error will be thrown. * ##### optionaloptions: [RecordOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RecordOptions.md) = {} #### Returns Promise\ ### [**](#start)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1302)staticstart * ****start**(actorId, input, options): Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> - Runs an Actor on the Apify platform using the current user account (determined by the `APIFY_TOKEN` environment variable), unlike `Actor.call`, this method just starts the run without waiting for finish. The result of the function is an [ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md) object that contains details about the Actor run. For more information about Actors, read the [documentation](https://docs.apify.com/actor). **Example usage:** ``` const run = await Actor.start('apify/hello-world', { myInput: 123 }); ``` *** #### Parameters * ##### actorId: string Allowed formats are `username/actor-name`, `userId/actor-name` or Actor ID. * ##### optionalinput: Dictionary Input for the Actor. If it is an object, it will be stringified to JSON and its content type set to `application/json; charset=utf-8`. Otherwise the `options.contentType` parameter must be provided. * ##### optionaloptions: [CallOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/CallOptions.md) = {} #### Returns Promise<[ActorRun](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorRun.md)> ### [**](#useState)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1097)staticuseState * ****useState**(name, defaultValue, options): Promise\ - Easily create and manage state values. All state values are automatically persisted. Values can be modified by simply using the assignment operator. *** #### Parameters * ##### optionalname: string The name of the store to use. * ##### defaultValue: State = ... If the store does not yet have a value in it, the value will be initialized with the `defaultValue` you provide. * ##### optionaloptions: UseStateOptions An optional object parameter where a custom `keyValueStoreName` and `config` can be passed in. 
#### Returns Promise\ --- # externalApifyClient ApifyClient is the official library to access [Apify API](https://docs.apify.com/api/v2) from your JavaScript applications. It runs both in Node.js and browser. ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**baseUrl](#baseUrl) * [**httpClient](#httpClient) * [**logger](#logger) * [**stats](#stats) * [**token](#token) ### Methods * [**actor](#actor) * [**actors](#actors) * [**build](#build) * [**builds](#builds) * [**dataset](#dataset) * [**datasets](#datasets) * [**keyValueStore](#keyValueStore) * [**keyValueStores](#keyValueStores) * [**log](#log) * [**requestQueue](#requestQueue) * [**requestQueues](#requestQueues) * [**run](#run) * [**runs](#runs) * [**setStatusMessage](#setStatusMessage) * [**schedule](#schedule) * [**schedules](#schedules) * [**store](#store) * [**task](#task) * [**tasks](#tasks) * [**user](#user) * [**webhook](#webhook) * [**webhookDispatch](#webhookDispatch) * [**webhookDispatches](#webhookDispatches) * [**webhooks](#webhooks) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L40)externalconstructor * ****new ApifyClient**(options): [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) - #### Parameters * ##### externaloptionaloptions: [ApifyClientOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ApifyClientOptions.md) #### Returns [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L35)externalbaseUrl **baseUrl: string ### [**](#httpClient)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L39)externalhttpClient **httpClient: HttpClient ### [**](#logger)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L38)externallogger **logger: [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) ### [**](#stats)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L37)externalstats **stats: Statistics ### [**](#token)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L36)externaloptionaltoken **token? 
: string ## Methods[**](#Methods) ### [**](#actor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L49)externalactor * ****actor**(id): ActorClient - *** #### Parameters * ##### externalid: string #### Returns ActorClient ### [**](#actors)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L45)externalactors * ****actors**(): ActorCollectionClient - *** #### Returns ActorCollectionClient ### [**](#build)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L57)externalbuild * ****build**(id): BuildClient - *** #### Parameters * ##### externalid: string #### Returns BuildClient ### [**](#builds)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L53)externalbuilds * ****builds**(): BuildCollectionClient - *** #### Returns BuildCollectionClient ### [**](#dataset)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L65)externaldataset * ****dataset**(id): DatasetClient\ - *** #### Parameters * ##### externalid: string #### Returns DatasetClient\ ### [**](#datasets)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L61)externaldatasets * ****datasets**(): DatasetCollectionClient - *** #### Returns DatasetCollectionClient ### [**](#keyValueStore)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L73)externalkeyValueStore * ****keyValueStore**(id): KeyValueStoreClient - *** #### Parameters * ##### externalid: string #### Returns KeyValueStoreClient ### [**](#keyValueStores)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L69)externalkeyValueStores * ****keyValueStores**(): KeyValueStoreCollectionClient - *** #### Returns KeyValueStoreCollectionClient ### [**](#log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L77)externallog * ****log**(buildOrRunId): LogClient - *** #### Parameters * ##### externalbuildOrRunId: string #### Returns LogClient ### [**](#requestQueue)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L85)externalrequestQueue * ****requestQueue**(id, options): RequestQueueClient - *** #### Parameters * ##### externalid: string * ##### externaloptionaloptions: RequestQueueUserOptions #### Returns RequestQueueClient ### [**](#requestQueues)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L81)externalrequestQueues * ****requestQueues**(): RequestQueueCollectionClient - *** #### Returns RequestQueueCollectionClient ### [**](#run)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L93)externalrun * ****run**(id): RunClient - *** #### Parameters * ##### externalid: string #### Returns RunClient ### [**](#runs)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L89)externalruns * ****runs**(): RunCollectionClient - *** #### Returns RunCollectionClient ### [**](#setStatusMessage)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L134)externalsetStatusMessage * ****setStatusMessage**(message, options): Promise\ - #### Parameters * ##### externalmessage: string * ##### 
externaloptionaloptions: SetStatusMessageOptions #### Returns Promise\ ### [**](#schedule)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L109)externalschedule * ****schedule**(id): ScheduleClient - *** #### Parameters * ##### externalid: string #### Returns ScheduleClient ### [**](#schedules)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L105)externalschedules * ****schedules**(): ScheduleCollectionClient - *** #### Returns ScheduleCollectionClient ### [**](#store)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L133)externalstore * ****store**(): StoreCollectionClient - *** #### Returns StoreCollectionClient ### [**](#task)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L101)externaltask * ****task**(id): TaskClient - *** #### Parameters * ##### externalid: string #### Returns TaskClient ### [**](#tasks)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L97)externaltasks * ****tasks**(): TaskCollectionClient - *** #### Returns TaskCollectionClient ### [**](#user)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L113)externaluser * ****user**(id): UserClient - *** #### Parameters * ##### externaloptionalid: string #### Returns UserClient ### [**](#webhook)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L121)externalwebhook * ****webhook**(id): WebhookClient - *** #### Parameters * ##### externalid: string #### Returns WebhookClient ### [**](#webhookDispatch)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L129)externalwebhookDispatch * ****webhookDispatch**(id): WebhookDispatchClient - *** #### Parameters * ##### externalid: string #### Returns WebhookDispatchClient ### [**](#webhookDispatches)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L125)externalwebhookDispatches * ****webhookDispatches**(): WebhookDispatchCollectionClient - *** #### Returns WebhookDispatchCollectionClient ### [**](#webhooks)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L117)externalwebhooks * ****webhooks**(): WebhookCollectionClient - *** #### Returns WebhookCollectionClient --- # ChargingManager Handles pay-per-event charging. 
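The following is a minimal, illustrative sketch of the charging flow, not a definitive implementation: it assumes that `ChargingManager` is exported from the `apify` package and uses a hypothetical pay-per-event name `'result-item'`; on the Apify platform the SDK normally constructs and initializes the manager for you (see also the `Actor.pushData(item, eventName)` overload above).

```
import { ApifyClient, ChargingManager, Configuration } from 'apify';
// Assumption: ChargingManager is exported from the `apify` package;
// in a real Actor run the SDK creates and initializes this instance itself.

const chargingManager = new ChargingManager(Configuration.getGlobalConfig(), new ApifyClient());
await chargingManager.init(); // loads pricing info and charging state via the Apify API

// 'result-item' is a hypothetical event name defined in the Actor's pay-per-event pricing.
const remaining = chargingManager.calculateMaxEventChargeCountWithinLimit('result-item');
if (remaining > 0) {
    // Field names follow ChargeOptions: the event name and the number of events to charge.
    await chargingManager.charge({ eventName: 'result-item', count: 1 });
}

console.log(`Charged so far: ${chargingManager.getChargedEventCount('result-item')}`);
console.log(`Maximum total charge (USD): ${chargingManager.getMaxTotalChargeUsd()}`);
```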
## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Methods * [**calculateMaxEventChargeCountWithinLimit](#calculateMaxEventChargeCountWithinLimit) * [**getChargedEventCount](#getChargedEventCount) * [**getMaxTotalChargeUsd](#getMaxTotalChargeUsd) * [**getPricingInfo](#getPricingInfo) * [**charge](#charge) * [**init](#init) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L28)constructor * ****new ChargingManager**(configuration, apifyClient): [ChargingManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/ChargingManager.md) - #### Parameters * ##### configuration: [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) * ##### apifyClient: [ApifyClient](https://docs.apify.com/sdk/js/sdk/js/reference/class/ApifyClient.md) #### Returns [ChargingManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/ChargingManager.md) ## Methods[**](#Methods) ### [**](#calculateMaxEventChargeCountWithinLimit)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L264)calculateMaxEventChargeCountWithinLimit * ****calculateMaxEventChargeCountWithinLimit**(eventName): number - How many events of a given type can still be charged for before reaching the limit; If the event is not registered, returns Infinity (free of charge) *** #### Parameters * ##### eventName: string #### Returns number ### [**](#getChargedEventCount)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L230)getChargedEventCount * ****getChargedEventCount**(eventName): number - Get the number of events with given name that the Actor has charged for so far. *** #### Parameters * ##### eventName: string #### Returns number ### [**](#getMaxTotalChargeUsd)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L241)getMaxTotalChargeUsd * ****getMaxTotalChargeUsd**(): number - Get the maximum amount of money that the Actor is allowed to charge. *** #### Returns number ### [**](#getPricingInfo)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L129)getPricingInfo * ****getPricingInfo**(): [ActorPricingInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorPricingInfo.md) - Get information about the pricing for this Actor. *** #### Returns [ActorPricingInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ActorPricingInfo.md) ### [**](#charge)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L149)charge * ****charge**(options): Promise<[ChargeResult](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeResult.md)> - Charge for a specified number of events - sub-operations of the Actor. *** #### Parameters * ##### options: [ChargeOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeOptions.md) The name of the event to charge for and the number of events to be charged. #### Returns Promise<[ChargeResult](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ChargeResult.md)> ### [**](#init)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L57)init * ****init**(): Promise\ - Initialize the ChargingManager by loading pricing information and charging state via Apify API. *** #### Returns Promise\ --- # Configuration `Configuration` is a value object holding the SDK configuration. We can use it in two ways: 1. 
When using `Actor` class, we can get the instance configuration via `sdk.config` ``` import { Actor } from 'apify'; import { BasicCrawler } from 'crawlee'; const sdk = new Actor({ token: '123' }); console.log(sdk.config.get('token')); // '123' const crawler = new BasicCrawler({ // ... crawler options }, sdk.config); ``` 2. To get the global configuration (singleton instance). It will respect the environment variables. ``` import { BasicCrawler, Configuration } from 'crawlee'; // Get the global configuration const config = Configuration.getGlobalConfig(); // Set the 'persistStateIntervalMillis' option // of global configuration to 30 seconds config.set('persistStateIntervalMillis', 30_000); // No need to pass the configuration to the crawler, // as it's using the global configuration by default const crawler = new BasicCrawler(); ``` ## Supported Configuration Options | Key | Environment Variable | Default Value | | :--------------------------- | :------------------------------------ | :------------ | | `memoryMbytes` | `ACTOR_MEMORY_MBYTES` | - | | `headless` | `APIFY_HEADLESS` | - | | `persistStateIntervalMillis` | `APIFY_PERSIST_STATE_INTERVAL_MILLIS` | `60e3` | | `token` | `APIFY_TOKEN` | - | | `isAtHome` | `APIFY_IS_AT_HOME` | - | | `defaultDatasetId` | `ACTOR_DEFAULT_DATASET_ID` | `'default'` | | `defaultKeyValueStoreId` | `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | `'default'` | | `defaultRequestQueueId` | `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | `'default'` | ## Advanced Configuration Options | Key | Environment Variable | Default Value | | :-------------------------- | :----------------------------------- | :------------------------- | | `actorEventsWsUrl` | `ACTOR_EVENTS_WEBSOCKET_URL` | - | | `actorId` | `ACTOR_ID` | - | | `actorRunId` | `ACTOR_RUN_ID` | - | | `actorTaskId` | `ACTOR_TASK_ID` | - | | `apiBaseUrl` | `APIFY_API_BASE_URL` | `'https://api.apify.com'` | | `containerPort` | `ACTOR_WEB_SERVER_PORT` | `4321` | | `containerUrl` | `ACTOR_WEB_SERVER_URL` | `'http://localhost:4321'` | | `inputKey` | `ACTOR_INPUT_KEY` | `'INPUT'` | | `metamorphAfterSleepMillis` | `APIFY_METAMORPH_AFTER_SLEEP_MILLIS` | `300e3` | | `metaOrigin` | `APIFY_META_ORIGIN` | - | | `proxyHostname` | `APIFY_PROXY_HOSTNAME` | `'proxy.apify.com'` | | `proxyPassword` | `APIFY_PROXY_PASSWORD` | - | | `proxyPort` | `APIFY_PROXY_PORT` | `8000` | | `proxyStatusUrl` | `APIFY_PROXY_STATUS_URL` | `'http://proxy.apify.com'` | | `userId` | `APIFY_USER_ID` | - | | `xvfb` | `APIFY_XVFB` | - | | `standbyPort` | `ACTOR_STANDBY_PORT` | `4321` | | `standbyUrl` | `ACTOR_STANDBY_URL` | - | | `chromeExecutablePath` | `APIFY_CHROME_EXECUTABLE_PATH` | - | | `defaultBrowserPath` | `APIFY_DEFAULT_BROWSER_PATH` | - | ### Hierarchy * Configuration * *Configuration* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**storageManagers](#storageManagers) * [**globalConfig](#globalConfig) ### Methods * [**get](#get) * [**getEventManager](#getEventManager) * [**set](#set) * [**useEventManager](#useEventManager) * [**useStorageClient](#useStorageClient) * [**getEventManager](#getEventManager) * [**getGlobalConfig](#getGlobalConfig) * [**getStorageClient](#getStorageClient) * [**resetGlobalState](#resetGlobalState) * [**set](#set) * [**useStorageClient](#useStorageClient) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L241)externalconstructor * ****new Configuration**(options): 
[Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) - Inherited from CoreConfiguration.constructor Creates new `Configuration` instance with provided options. Env vars will have precedence over those. *** #### Parameters * ##### externaloptionaloptions: ConfigurationOptions #### Returns [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) ## Properties[**](#Properties) ### [**](#storageManagers)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L237)externalreadonlyinheritedstorageManagers **storageManagers: Map\> Inherited from CoreConfiguration.storageManagers ### [**](#globalConfig)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L116)staticoptionalglobalConfig **globalConfig? : [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) Overrides CoreConfiguration.globalConfig * **@inheritDoc** ## Methods[**](#Methods) ### [**](#get)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L206)get * ****get**(key, defaultValue): U - Overrides CoreConfiguration.get * **@inheritDoc** *** #### Parameters * ##### key: T * ##### optionaldefaultValue: U #### Returns U ### [**](#getEventManager)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L269)externalinheritedgetEventManager * ****getEventManager**(): EventManager - Inherited from CoreConfiguration.getEventManager #### Returns EventManager ### [**](#set)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L213)set * ****set**(key, value): void - Overrides CoreConfiguration.set * **@inheritDoc** *** #### Parameters * ##### key: keyof [ConfigurationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ConfigurationOptions.md) * ##### optionalvalue: any #### Returns void ### [**](#useEventManager)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L277)externalinheriteduseEventManager * ****useEventManager**(events): void - Inherited from CoreConfiguration.useEventManager #### Parameters * ##### externalevents: EventManager #### Returns void ### [**](#useStorageClient)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L275)externalinheriteduseStorageClient * ****useStorageClient**(client): void - Inherited from CoreConfiguration.useStorageClient #### Parameters * ##### externalclient: StorageClient #### Returns void ### [**](#getEventManager)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L289)staticexternalinheritedgetEventManager * ****getEventManager**(): EventManager - Inherited from CoreConfiguration.getEventManager Gets default EventManager instance. 
*** #### Returns EventManager ### [**](#getGlobalConfig)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L220)staticgetGlobalConfig * ****getGlobalConfig**(): [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) - Overrides CoreConfiguration.getGlobalConfig * **@inheritDoc** *** #### Returns [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) ### [**](#getStorageClient)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L285)staticexternalinheritedgetStorageClient * ****getStorageClient**(): StorageClient - Inherited from CoreConfiguration.getStorageClient Gets default StorageClient instance. *** #### Returns StorageClient ### [**](#resetGlobalState)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L233)staticresetGlobalState * ****resetGlobalState**(): void - Overrides CoreConfiguration.resetGlobalState Resets global configuration instance. The default instance holds configuration based on env vars, if we want to change them, we need to first reset the global state. Used mainly for testing purposes. *** #### Returns void ### [**](#set)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L258)staticexternalinheritedset * ****set**(key, value): void - Inherited from CoreConfiguration.set Sets value for given option. Only affects the global `Configuration` instance, the value will not be propagated down to the env var. To reset a value, we can omit the `value` argument or pass `undefined` there. *** #### Parameters * ##### externalkey: keyof ConfigurationOptions * ##### externaloptionalvalue: any #### Returns void ### [**](#useStorageClient)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L276)staticexternalinheriteduseStorageClient * ****useStorageClient**(client): void - Inherited from CoreConfiguration.useStorageClient #### Parameters * ##### externalclient: StorageClient #### Returns void --- # externalDataset \ The `Dataset` class represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage - you can only add new records to it but you cannot modify or remove existing records. Typically it is used to store crawling results. Do not instantiate this class directly, use the [Dataset.open](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#open) function instead. `Dataset` stores its data either on local disk or in the Apify cloud, depending on whether the `APIFY_LOCAL_STORAGE_DIR` or `APIFY_TOKEN` environment variables are set. If the `APIFY_LOCAL_STORAGE_DIR` environment variable is set, the data is stored in the local directory in the following files: ``` {APIFY_LOCAL_STORAGE_DIR}/datasets/{DATASET_ID}/{INDEX}.json ``` Note that `{DATASET_ID}` is the name or ID of the dataset. The default dataset has ID: `default`, unless you override it by setting the `APIFY_DEFAULT_DATASET_ID` environment variable. Each dataset item is stored as a separate JSON file, where `{INDEX}` is a zero-based index of the item in the dataset. 
If the `APIFY_TOKEN` environment variable is set but `APIFY_LOCAL_STORAGE_DIR` not, the data is stored in the [Apify Dataset](https://docs.apify.com/storage/dataset) cloud storage. Note that you can force usage of the cloud storage also by passing the `forceCloud` option to [Dataset.open](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md#open) function, even if the `APIFY_LOCAL_STORAGE_DIR` variable is set. **Example usage:** ``` // Write a single row to the default dataset await Dataset.pushData({ col1: 123, col2: 'val2' }); // Open a named dataset const dataset = await Dataset.open('some-name'); // Write a single row await dataset.pushData({ foo: 'bar' }); // Write multiple rows await dataset.pushData([ { foo: 'bar2', col2: 'val2' }, { col3: 123 }, ]); // Export the entirety of the dataset to one file in the key-value store await dataset.exportToCSV('MY-DATA'); ``` ## Index[**](#Index) ### Properties * [**client](#client) * [**config](#config) * [**id](#id) * [**log](#log) * [**name](#name) ### Methods * [**drop](#drop) * [**export](#export) * [**exportTo](#exportTo) * [**exportToCSV](#exportToCSV) * [**exportToJSON](#exportToJSON) * [**forEach](#forEach) * [**getData](#getData) * [**getInfo](#getInfo) * [**map](#map) * [**pushData](#pushData) * [**reduce](#reduce) * [**exportToCSV](#exportToCSV) * [**exportToJSON](#exportToJSON) * [**getData](#getData) * [**open](#open) ## Properties[**](#Properties) ### [**](#client)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L147)externalclient **client: DatasetClient\ ### [**](#config)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L144)externalreadonlyconfig **config: Configuration ### [**](#id)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L145)externalid **id: string ### [**](#log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L148)externallog **log: [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) ### [**](#name)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L146)externaloptionalname **name? : string ## Methods[**](#Methods) ### [**](#drop)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L333)externaldrop * ****drop**(): Promise\ - Removes the dataset either from the Apify cloud storage or from the local directory, depending on the mode of operation. *** #### Returns Promise\ ### [**](#export)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L186)externalexport * ****export**(options): Promise\ - Returns all the data from the dataset. This will iterate through the whole dataset via the `listItems()` client method, which gives you only paginated results. *** #### Parameters * ##### externaloptionaloptions: DatasetExportOptions #### Returns Promise\ ### [**](#exportTo)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L194)externalexportTo * ****exportTo**(key, options, contentType): Promise\ - Save the entirety of the dataset's contents into one file within a key-value store. *** #### Parameters * ##### externalkey: string The name of the value to save the data in. 
* ##### externaloptionaloptions: DatasetExportToOptions An optional options object where you can provide the dataset and target KVS name. * ##### externaloptionalcontentType: string Only JSON and CSV are supported currently, defaults to JSON. #### Returns Promise\ ### [**](#exportToCSV)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L208)externalexportToCSV * ****exportToCSV**(key, options): Promise\ - Save entire default dataset's contents into one CSV file within a key-value store. *** #### Parameters * ##### externalkey: string The name of the value to save the data in. * ##### externaloptionaloptions: Omit\ An optional options object where you can provide the target KVS name. #### Returns Promise\ ### [**](#exportToJSON)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L201)externalexportToJSON * ****exportToJSON**(key, options): Promise\ - Save entire default dataset's contents into one JSON file within a key-value store. *** #### Parameters * ##### externalkey: string The name of the value to save the data in. * ##### externaloptionaloptions: Omit\ An optional options object where you can provide the target KVS name. #### Returns Promise\ ### [**](#forEach)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L266)externalforEach * ****forEach**(iteratee, options, index): Promise\ - Iterates over dataset items, yielding each in turn to an `iteratee` function. Each invocation of `iteratee` is called with two arguments: `(item, index)`. If the `iteratee` function returns a Promise then it is awaited before the next call. If it throws an error, the iteration is aborted and the `forEach` function throws the error. **Example usage** ``` const dataset = await Dataset.open('my-results'); await dataset.forEach(async (item, index) => { console.log(`Item at ${index}: ${JSON.stringify(item)}`); }); ``` *** #### Parameters * ##### externaliteratee: [DatasetConsumer](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetConsumer.md)\ A function that is called for every item in the dataset. * ##### externaloptionaloptions: [DatasetIteratorOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetIteratorOptions.md) All `forEach()` parameters. * ##### externaloptionalindex: number Specifies the initial index number passed to the `iteratee` function. #### Returns Promise\ ### [**](#getData)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L181)externalgetData * ****getData**(options): Promise<[DatasetContent](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetContent.md)\> - Returns [DatasetContent](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetContent.md) object holding the items in the dataset based on the provided parameters. *** #### Parameters * ##### externaloptionaloptions: [DatasetDataOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetDataOptions.md) #### Returns Promise<[DatasetContent](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetContent.md)\> ### [**](#getInfo)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L245)externalgetInfo * ****getInfo**(): Promise\ - Returns an object containing general information about the dataset. 
The function returns the same object as the Apify API Client's [getDataset](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-datasets-getDataset) function, which in turn calls the [Get dataset](https://apify.com/docs/api/v2#/reference/datasets/dataset/get-dataset) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-dataset", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), itemCount: 14, } ``` *** #### Returns Promise\ ### [**](#map)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L276)externalmap * ****map**(iteratee, options): Promise\ - Produces a new array of values by mapping each value in list through a transformation function `iteratee()`. Each invocation of `iteratee()` is called with two arguments: `(element, index)`. If `iteratee` returns a `Promise` then it's awaited before a next call. *** #### Parameters * ##### externaliteratee: [DatasetMapper](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetMapper.md)\ * ##### externaloptionaloptions: [DatasetIteratorOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetIteratorOptions.md) All `map()` parameters. #### Returns Promise\ ### [**](#pushData)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L177)externalpushData * ****pushData**(data): Promise\ - Stores an object or an array of objects to the dataset. The function returns a promise that resolves when the operation finishes. It has no result, but throws on invalid args or other errors. **IMPORTANT**: Make sure to use the `await` keyword when calling `pushData()`, otherwise the crawler process might finish before the data is stored! The size of the data is limited by the receiving API and therefore `pushData()` will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size. The function internally chunks the array into separate items and pushes them sequentially. The chunking process is stable (keeps order of data), but it does not provide a transaction safety mechanism. Therefore, in the event of an uploading error (after several automatic retries), the function's Promise will reject and the dataset will be left in a state where some of the items have already been saved to the dataset while other items from the source array were not. To overcome this limitation, the developer may, for example, read the last item saved in the dataset and re-attempt the save of the data from this item onwards to prevent duplicates. *** #### Parameters * ##### externaldata: Data | Data\[] Object or array of objects containing data to be stored in the default dataset. The objects must be serializable to JSON and the JSON representation of each object must be smaller than 9MB. 
#### Returns Promise\ ### [**](#reduce)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L294)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L314)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L328)externalreduce * ****reduce**(iteratee): Promise\ * ****reduce**(iteratee, memo, options): Promise\ * ****reduce**(iteratee, memo, options): Promise\ - Reduces a list of values down to a single value. If no `memo` is given, the first element of the dataset is used as the initial value, and each successive reduction should be returned by `iteratee()`. The `iteratee()` is passed three arguments: the `memo`, `value` and `index` of the current element being folded into the reduction. The `iteratee` is first invoked on the second element of the list (`index = 1`), with the first element given as the memo parameter. After that, the rest of the elements in the dataset are passed to `iteratee`, with the result of the previous invocation as the memo. If `iteratee()` returns a `Promise` it's awaited before the next call. If the dataset is empty, `reduce` will return `undefined`. *** #### Parameters * ##### externaliteratee: [DatasetReducer](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetReducer.md)\ #### Returns Promise\ ### [**](#exportToCSV)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L222)staticexternalexportToCSV * ****exportToCSV**(key, options): Promise\ - Save entire default dataset's contents into one CSV file within a key-value store. *** #### Parameters * ##### externalkey: string The name of the value to save the data in. * ##### externaloptionaloptions: DatasetExportToOptions An optional options object where you can provide the dataset and target KVS name. #### Returns Promise\ ### [**](#exportToJSON)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L215)staticexternalexportToJSON * ****exportToJSON**(key, options): Promise\ - Save entire default dataset's contents into one JSON file within a key-value store. *** #### Parameters * ##### externalkey: string The name of the value to save the data in. * ##### externaloptionaloptions: DatasetExportToOptions An optional options object where you can provide the dataset and target KVS name. #### Returns Promise\ ### [**](#getData)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L377)staticexternalgetData * ****getData**(options): Promise<[DatasetContent](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetContent.md)\> - Returns [DatasetContent](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetContent.md) object holding the items in the dataset based on the provided parameters.
*** #### Parameters * ##### externaloptionaloptions: [DatasetDataOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetDataOptions.md) #### Returns Promise<[DatasetContent](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetContent.md)\> ### [**](#open)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L348)staticexternalopen * ****open**(datasetIdOrName, options): Promise<[Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md)\> - Opens a dataset and returns a promise resolving to an instance of the [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) class. Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the cloud. For more details and code examples, see the [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) class. *** #### Parameters * ##### externaloptionaldatasetIdOrName: null | string ID or name of the dataset to be opened. If `null` or `undefined`, the function returns the default dataset associated with the crawler run. * ##### externaloptionaloptions: StorageManagerOptions Storage manager options. #### Returns Promise<[Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md)\> --- # KeyValueStore * **@inheritDoc** ### Hierarchy * KeyValueStore * *KeyValueStore* ## Index[**](#Index) ### Properties * [**config](#config) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ### Methods * [**drop](#drop) * [**forEachKey](#forEachKey) * [**getAutoSavedValue](#getAutoSavedValue) * [**getPublicUrl](#getPublicUrl) * [**getValue](#getValue) * [**recordExists](#recordExists) * [**setValue](#setValue) * [**getAutoSavedValue](#getAutoSavedValue) * [**open](#open) * [**recordExists](#recordExists) ## Properties[**](#Properties) ### [**](#config)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L70)externalreadonlyinheritedconfig **config: Configuration Inherited from CoreKeyValueStore.config ### [**](#id)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L71)externalreadonlyinheritedid **id: string Inherited from CoreKeyValueStore.id ### [**](#name)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L72)externaloptionalreadonlyinheritedname **name? : string Inherited from CoreKeyValueStore.name ### [**](#storageObject)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L73)externaloptionalreadonlyinheritedstorageObject **storageObject? : Record\ Inherited from CoreKeyValueStore.storageObject ## Methods[**](#Methods) ### [**](#drop)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L203)externalinheriteddrop * ****drop**(): Promise\ - Inherited from CoreKeyValueStore.drop Removes the key-value store either from the Apify cloud storage or from the local directory, depending on the mode of operation. 
*** #### Returns Promise\ ### [**](#forEachKey)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L227)externalinheritedforEachKey * ****forEachKey**(iteratee, options): Promise\ - Inherited from CoreKeyValueStore.forEachKey Iterates over key-value store keys, yielding each in turn to an `iteratee` function. Each invocation of `iteratee` is called with three arguments: `(key, index, info)`, where `key` is the record key, `index` is a zero-based index of the key in the current iteration (regardless of `options.exclusiveStartKey`) and `info` is an object that contains a single property `size` indicating size of the record in bytes. If the `iteratee` function returns a Promise then it is awaited before the next call. If it throws an error, the iteration is aborted and the `forEachKey` function throws the error. **Example usage** ``` const keyValueStore = await KeyValueStore.open(); await keyValueStore.forEachKey(async (key, index, info) => { console.log(`Key at ${index}: ${key} has size ${info.size}`); }); ``` *** #### Parameters * ##### externaliteratee: [KeyConsumer](https://docs.apify.com/sdk/js/sdk/js/reference/interface/KeyConsumer.md) A function that is called for every key in the key-value store. * ##### externaloptionaloptions: [KeyValueStoreIteratorOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/KeyValueStoreIteratorOptions.md) All `forEachKey()` parameters. #### Returns Promise\ ### [**](#getAutoSavedValue)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L153)externalinheritedgetAutoSavedValue * ****getAutoSavedValue**(key, defaultValue): Promise\ - Inherited from CoreKeyValueStore.getAutoSavedValue #### Parameters * ##### externalkey: string * ##### externaloptionaldefaultValue: T #### Returns Promise\ ### [**](#getPublicUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L18)getPublicUrl * ****getPublicUrl**(key): string - Overrides CoreKeyValueStore.getPublicUrl Returns a URL for the given key that may be used to publicly access the value in the remote key-value store. *** #### Parameters * ##### key: string #### Returns string ### [**](#getValue)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L112)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L145)externalinheritedgetValue * ****getValue**(key): Promise\ * ****getValue**(key, defaultValue): Promise\ - Inherited from CoreKeyValueStore.getValue Gets a value from the key-value store. The function returns a `Promise` that resolves to the record value, whose JavaScript type depends on the MIME content type of the record. Records with the `application/json` content type are automatically parsed and returned as a JavaScript object. Similarly, records with `text/plain` content types are returned as a string. For all other content types, the value is returned as a raw [`Buffer`](https://nodejs.org/api/buffer.html) instance. If the record does not exist, the function resolves to `null`. To save or delete a value in the key-value store, use the [KeyValueStore.setValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#setValue) function. 
**Example usage:** ``` const store = await KeyValueStore.open(); const buffer = await store.getValue('screenshot1.png'); ``` *** #### Parameters * ##### externalkey: string Unique key of the record. It can be at most 256 characters long and only consist of the following characters: `a`-`z`, `A`-`Z`, `0`-`9` and `!-_.'()` #### Returns Promise\ Returns a promise that resolves to an object, string or [`Buffer`](https://nodejs.org/api/buffer.html), depending on the MIME content type of the record. ### [**](#recordExists)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L152)externalinheritedrecordExists * ****recordExists**(key): Promise\ - Inherited from CoreKeyValueStore.recordExists Tests whether a record with the given key exists in the key-value store without retrieving its value. *** #### Parameters * ##### externalkey: string The queried record key. #### Returns Promise\ `true` if the record exists, `false` if it does not. ### [**](#setValue)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L198)externalinheritedsetValue * ****setValue**(key, value, options): Promise\ - Inherited from CoreKeyValueStore.setValue Saves or deletes a record in the key-value store. The function returns a promise that resolves once the record has been saved or deleted. **Example usage:** ``` const store = await KeyValueStore.open(); await store.setValue('OUTPUT', { foo: 'bar' }); ``` Beware that the key can be at most 256 characters long and only contain the following characters: `a-zA-Z0-9!-_.'()` By default, `value` is converted to JSON and stored with the `application/json; charset=utf-8` MIME content type. To store the value with another content type, pass it in the options as follows: ``` const store = await KeyValueStore.open('my-text-store'); await store.setValue('RESULTS', 'my text data', { contentType: 'text/plain' }); ``` If you set custom content type, `value` must be either a string or [`Buffer`](https://nodejs.org/api/buffer.html), otherwise an error will be thrown. If `value` is `null`, the record is deleted instead. Note that the `setValue()` function succeeds regardless whether the record existed or not. To retrieve a value from the key-value store, use the [KeyValueStore.getValue](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#getValue) function. **IMPORTANT:** Always make sure to use the `await` keyword when calling `setValue()`, otherwise the crawler process might finish before the value is stored! *** #### Parameters * ##### externalkey: string Unique key of the record. It can be at most 256 characters long and only consist of the following characters: `a`-`z`, `A`-`Z`, `0`-`9` and `!-_.'()` * ##### externalvalue: null | T Record data, which can be one of the following values: * If `null`, the record in the key-value store is deleted. * If no `options.contentType` is specified, `value` can be any JavaScript object and it will be stringified to JSON. * If `options.contentType` is set, `value` is taken as is and it must be a `String` or [`Buffer`](https://nodejs.org/api/buffer.html). For any other value an error will be thrown. * ##### externaloptionaloptions: [RecordOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RecordOptions.md) Record options. 
#### Returns Promise\ ### [**](#getAutoSavedValue)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L312)staticexternalinheritedgetAutoSavedValue * ****getAutoSavedValue**(key, defaultValue): Promise\ - Inherited from CoreKeyValueStore.getAutoSavedValue #### Parameters * ##### externalkey: string * ##### externaloptionaldefaultValue: T #### Returns Promise\ ### [**](#open)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L36)staticopen * ****open**(storeIdOrName, options): Promise<[KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md)> - Overrides CoreKeyValueStore.open * **@inheritDoc** *** #### Parameters * ##### optionalstoreIdOrName: null | string * ##### options: StorageManagerOptions = {} #### Returns Promise<[KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md)> ### [**](#recordExists)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L311)staticexternalinheritedrecordExists * ****recordExists**(key): Promise\ - Inherited from CoreKeyValueStore.recordExists Tests whether a record with the given key exists in the default [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) associated with the current crawler run. *** #### Parameters * ##### externalkey: string The queried record key. #### Returns Promise\ `true` if the record exists, `false` if it does not. --- # externalLog The log instance enables level aware logging of messages and we advise to use it instead of `console.log()` and its aliases in most development scenarios. A very useful use case for `log` is using `log.debug` liberally throughout the codebase to get useful logging messages only when appropriate log level is set and keeping the console tidy in production environments. The available logging levels are, in this order: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `OFF` and can be referenced from the `log.LEVELS` constant, such as `log.LEVELS.ERROR`. To log messages to the system console, use the `log.level(message)` invocation, such as `log.debug('this is a debug message')`. To prevent writing of messages above a certain log level to the console, simply set the appropriate level. The default log level is `INFO`, which means that `DEBUG` messages will not be printed, unless enabled. **Example:** ``` import log from '@apify/log'; // importing from the Apify SDK or Crawlee is also supported: // import { log } from 'apify'; // import { log } from 'crawlee'; log.info('Information message', { someData: 123 }); // prints message log.debug('Debug message', { debugData: 'hello' }); // doesn't print anything log.setLevel(log.LEVELS.DEBUG); log.debug('Debug message'); // prints message log.setLevel(log.LEVELS.ERROR); log.debug('Debug message'); // doesn't print anything log.info('Info message'); // doesn't print anything log.error('Error message', { errorDetails: 'This is bad!' }); // prints message try { throw new Error('Not good!'); } catch (e) { log.exception(e, 'Exception occurred', { errorDetails: 'This is really bad!' 
}); // prints message } log.setOptions({ prefix: 'My actor' }); log.info('I am running!'); // prints "My actor: I am running" const childLog = log.child({ prefix: 'Crawler' }); childLog.info('I am crawling!'); // prints "My actor:Crawler: I am crawling" ``` Another very useful way of setting the log level is by setting the `APIFY_LOG_LEVEL` environment variable, such as `APIFY_LOG_LEVEL=DEBUG`. This way, no code changes are necessary to turn on your debug messages and start debugging right away. To add timestamps to your logs, you can override the default logger settings: ``` log.setOptions({ logger: new log.LoggerText({ skipTime: false }), }); ``` You can customize your logging further by extending or replacing the default logger instances with your own implementations. ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**LEVELS](#LEVELS) ### Methods * [**debug](#debug) * [**deprecated](#deprecated) * [**error](#error) * [**exception](#exception) * [**getLevel](#getLevel) * [**getOptions](#getOptions) * [**child](#child) * [**info](#info) * [**internal](#internal) * [**perf](#perf) * [**setLevel](#setLevel) * [**setOptions](#setOptions) * [**softFail](#softFail) * [**warning](#warning) * [**warningOnce](#warningOnce) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L136)externalconstructor * ****new Log**(options): [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) - #### Parameters * ##### externaloptionaloptions: Partial<[LoggerOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/LoggerOptions.md)> #### Returns [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) ## Properties[**](#Properties) ### [**](#LEVELS)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L133)externalreadonlyLEVELS **LEVELS: typeof [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) Map of available log levels that's useful for easy setting of appropriate log levels. Each log level is represented internally by a number. E.g. `log.LEVELS.DEBUG === 5`. ## Methods[**](#Methods) ### [**](#debug)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L195)externaldebug * ****debug**(message, data): void - Logs a `DEBUG` message. By default, it will not be written to the console. To see `DEBUG` messages in the console, set the log level to `DEBUG` either using the `log.setLevel(log.LEVELS.DEBUG)` method or using the environment variable `APIFY_LOG_LEVEL=DEBUG`. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#deprecated)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L204)externaldeprecated * ****deprecated**(message): void - Logs the given message only once as a `WARNING`. It's used to warn the user that a feature they are using has been deprecated. *** #### Parameters * ##### externalmessage: string #### Returns void ### [**](#error)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L173)externalerror * ****error**(message, data): void - Logs an `ERROR` message. Use this method to log error messages that are not directly connected to an exception. For logging exceptions, use the `log.exception` method.
*** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#exception)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L178)externalexception * ****exception**(exception, message, data): void - Logs an `ERROR` level message with a nicely formatted exception. Note that the exception is the first parameter here and an additional message is only optional. *** #### Parameters * ##### externalexception: Error * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#getLevel)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L144)externalgetLevel * ****getLevel**(): number - Returns the currently selected logging level. This is useful for checking whether a message will actually be printed to the console before one actually performs a resource intensive operation to construct the message, such as querying a DB for some metadata that need to be added. If the log level is not high enough at the moment, it doesn't make sense to execute the query. *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L164)externalgetOptions * ****getOptions**(): Required<[LoggerOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/LoggerOptions.md)> - Returns the logger configuration. *** #### Returns Required<[LoggerOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/LoggerOptions.md)> ### [**](#child)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L168)externalchild * ****child**(options): [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) - Creates a new instance of logger that inherits settings from a parent logger. *** #### Parameters * ##### externaloptions: Partial<[LoggerOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/LoggerOptions.md)> #### Returns [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) ### [**](#info)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L188)externalinfo * ****info**(message, data): void - Logs an `INFO` message. `INFO` is the default log level so info messages will be always logged, unless the log level is changed. Data are stringified and appended to the message. 
*** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#internal)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L156)externalinternal * ****internal**(level, message, data, exception): void - #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: any #### Returns void ### [**](#perf)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L196)externalperf * ****perf**(message, data): void - #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#setLevel)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L155)externalsetLevel * ****setLevel**(level): void - Sets the log level to the given value, preventing messages from less important log levels from being printed to the console. Use in conjunction with the `log.LEVELS` constants such as ``` log.setLevel(log.LEVELS.DEBUG); ``` The default log level is `INFO`. *** #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) #### Returns void ### [**](#setOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L160)externalsetOptions * ****setOptions**(options): void - Configures the logger. *** #### Parameters * ##### externaloptions: Partial<[LoggerOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/LoggerOptions.md)> #### Returns void ### [**](#softFail)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L179)externalsoftFail * ****softFail**(message, data): void - #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#warning)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L183)externalwarning * ****warning**(message, data): void - Logs a `WARNING` level message. Data are stringified and appended to the message. *** #### Parameters * ##### externalmessage: string * ##### externaloptionaldata: AdditionalData #### Returns void ### [**](#warningOnce)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L200)externalwarningOnce * ****warningOnce**(message): void - Logs a `WARNING` level message only once. *** #### Parameters * ##### externalmessage: string #### Returns void --- # externalLogger This is an abstract class that should be extended by custom logger classes. The `this._log()` method must be implemented by them.
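As an illustrative sketch only (not the library's reference implementation), a custom logger could override `_log()` and delegate output to the inherited `_outputWithConsole()` helper; this assumes `Logger` and `LogLevel` can be imported as named exports alongside the default `log` instance from `@apify/log`, as in the examples above.

```
import log, { Logger, LogLevel } from '@apify/log';

// Hypothetical custom logger that prefixes every line with the level name.
class PrefixedLogger extends Logger {
    _log(level, message, data, exception) {
        const suffix = data ? ` ${JSON.stringify(data)}` : '';
        const line = `[${LogLevel[level]}] ${message}${suffix}`;
        this._outputWithConsole(level, line); // inherited helper that writes the line to the console
        if (exception) this._outputWithConsole(level, String(exception));
    }
}

// Replace the default logger implementation (see log.setOptions() above).
log.setOptions({ logger: new PrefixedLogger({}) });
log.info('Hello from the custom logger!', { some: 'data' });
```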
### Hierarchy * EventEmitter * *Logger* * [LoggerJson](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerJson.md) * [LoggerText](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerText.md) ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L33)externalconstructor * ****new Logger**(options): [Logger](https://docs.apify.com/sdk/js/sdk/js/reference/class/Logger.md) - Overrides EventEmitter.constructor #### Parameters * ##### externaloptions: Record\ #### Returns [Logger](https://docs.apify.com/sdk/js/sdk/js/reference/class/Logger.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L459)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from EventEmitter.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L452)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from EventEmitter.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L498)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from EventEmitter.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. 
However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L445)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from EventEmitter.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L37)external\_log * ****\_log**(level, message, data, exception, opts): void - #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns void ### [**](#_outputWithConsole)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L36)external\_outputWithConsole * ****\_outputWithConsole**(level, line): void - #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**(error, event, ...args): void - Inherited from EventEmitter.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L597)externalinheritedaddListener * ****addListener**(eventName, listener): this - Inherited from EventEmitter.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L859)externalinheritedemit * ****emit**(eventName, ...args): boolean - Inherited from EventEmitter.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L922)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from EventEmitter.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L774)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from EventEmitter.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalgetOptions * ****getOptions**(): Record\ - #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L868)externalinheritedlistenerCount * ****listenerCount**(eventName, listener): number - Inherited from EventEmitter.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. 
* **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L787)externalinheritedlisteners * ****listeners**(eventName): Function\[] - Inherited from EventEmitter.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externallog * ****log**(level, message, ...args): void - #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L747)externalinheritedoff * ****off**(eventName, listener): this - Inherited from EventEmitter.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L629)externalinheritedon * ****on**(eventName, listener): this - Inherited from EventEmitter.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L659)externalinheritedonce * ****once**(eventName, listener): this - Inherited from EventEmitter.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. 
The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L886)externalinheritedprependListener * ****prependListener**(eventName, listener): this - Inherited from EventEmitter.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L902)externalinheritedprependOnceListener * ****prependOnceListener**(eventName, listener): this - Inherited from EventEmitter.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L818)externalinheritedrawListeners * ****rawListeners**(eventName): Function\[] - Inherited from EventEmitter.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). 
``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L758)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from EventEmitter.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L742)externalinheritedremoveListener * ****removeListener**(eventName, listener): this - Inherited from EventEmitter.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. 
This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L768)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from EventEmitter.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalsetOptions * ****setOptions**(options): void - #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L437)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from EventEmitter.addAbortListener experimental Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. 
### [**](#getEventListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from EventEmitter.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from EventEmitter.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from EventEmitter.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. 
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from EventEmitter.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from EventEmitter.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from EventEmitter.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # externalLoggerJson This is an abstract class that should be extended by custom logger classes. this.\_log() method must be implemented by them. 
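A minimal sketch of switching the shared `log` instance to structured output with `LoggerJson`, under the same assumptions as above (imports from `@apify/log`, a `logger` field accepted by `setOptions()`); the exact field names in the emitted JSON may differ between versions.

```
import log, { LoggerJson } from '@apify/log';

// Emit each log record as a single JSON object per line instead of formatted text.
log.setOptions({ logger: new LoggerJson() });

log.info('Crawler started', { startUrls: 3 });
// Output is one JSON object per record, roughly:
// {"time":"2024-01-01T00:00:00.000Z","level":"INFO","msg":"Crawler started","startUrls":3}
```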
### Hierarchy * [Logger](https://docs.apify.com/sdk/js/sdk/js/reference/class/Logger.md) * *LoggerJson* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L241)externalconstructor * ****new LoggerJson**(options): [LoggerJson](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerJson.md) - Overrides Logger.constructor #### Parameters * ##### externaloptionaloptions: {} #### Returns [LoggerJson](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerJson.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L459)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from Logger.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L452)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from Logger.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L498)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from Logger.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. 
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L445)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from Logger.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L242)external\_log * ****\_log**(level, message, data, exception, opts): string - Overrides Logger.\_log #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns string ### [**](#_outputWithConsole)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L36)externalinherited\_outputWithConsole * ****\_outputWithConsole**(level, line): void - Inherited from Logger.\_outputWithConsole #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**(error, event, ...args): void - Inherited from Logger.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L597)externalinheritedaddListener * ****addListener**(eventName, listener): this - Inherited from Logger.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L859)externalinheritedemit * ****emit**(eventName, ...args): boolean - Inherited from Logger.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L922)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from Logger.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L774)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from Logger.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalinheritedgetOptions * ****getOptions**(): Record\ - Inherited from Logger.getOptions #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L868)externalinheritedlistenerCount * ****listenerCount**(eventName, listener): number - Inherited from Logger.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. 
* **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L787)externalinheritedlisteners * ****listeners**(eventName): Function\[] - Inherited from Logger.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externalinheritedlog * ****log**(level, message, ...args): void - Inherited from Logger.log #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L747)externalinheritedoff * ****off**(eventName, listener): this - Inherited from Logger.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L629)externalinheritedon * ****on**(eventName, listener): this - Inherited from Logger.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L659)externalinheritedonce * ****once**(eventName, listener): this - Inherited from Logger.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. 
The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L886)externalinheritedprependListener * ****prependListener**(eventName, listener): this - Inherited from Logger.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L902)externalinheritedprependOnceListener * ****prependOnceListener**(eventName, listener): this - Inherited from Logger.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L818)externalinheritedrawListeners * ****rawListeners**(eventName): Function\[] - Inherited from Logger.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). 
``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L758)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from Logger.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L742)externalinheritedremoveListener * ****removeListener**(eventName, listener): this - Inherited from Logger.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. 
This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L768)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from Logger.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalinheritedsetOptions * ****setOptions**(options): void - Inherited from Logger.setOptions #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L437)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from Logger.addAbortListener experimental Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. 
### [**](#getEventListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from Logger.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from Logger.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from Logger.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. 
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from Logger.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from Logger.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from Logger.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # externalLoggerText This is an abstract class that should be extended by custom logger classes. this.\_log() method must be implemented by them. 
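Because this is an abstract class, a concrete logger must supply its own `_log()`. Below is a minimal, hypothetical sketch of such a subclass, assuming only what this reference documents: `_log(level, message, data, exception, opts)` returns the formatted line, and the inherited `_outputWithConsole(level, line)` writes it to the console. The import path `@apify/log` is inferred from the source links above; adjust it to your setup.

```
import { Logger, LogLevel } from '@apify/log';

// Hypothetical logger that emits one JSON object per log record.
class JsonLinesLogger extends Logger {
    _log(level, message, data, exception, opts) {
        // Serialize the record; LogLevel values are numbers (see the LogLevel enum below).
        const line = JSON.stringify({ level, message, data, exception, opts });
        // Reuse the inherited helper to write the line to the console.
        this._outputWithConsole(level, line);
        return line;
    }
}

const logger = new JsonLinesLogger();
logger.log(LogLevel.INFO, 'Hello from a custom logger');
```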
### Hierarchy * [Logger](https://docs.apify.com/sdk/js/sdk/js/reference/class/Logger.md) * *LoggerText* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**captureRejections](#captureRejections) * [**captureRejectionSymbol](#captureRejectionSymbol) * [**defaultMaxListeners](#defaultMaxListeners) * [**errorMonitor](#errorMonitor) ### Methods * [**\_log](#_log) * [**\_outputWithConsole](#_outputWithConsole) * [**\[captureRejectionSymbol\]](#\[captureRejectionSymbol]) * [**addListener](#addListener) * [**emit](#emit) * [**eventNames](#eventNames) * [**getMaxListeners](#getMaxListeners) * [**getOptions](#getOptions) * [**listenerCount](#listenerCount) * [**listeners](#listeners) * [**log](#log) * [**off](#off) * [**on](#on) * [**once](#once) * [**prependListener](#prependListener) * [**prependOnceListener](#prependOnceListener) * [**rawListeners](#rawListeners) * [**removeAllListeners](#removeAllListeners) * [**removeListener](#removeListener) * [**setMaxListeners](#setMaxListeners) * [**setOptions](#setOptions) * [**addAbortListener](#addAbortListener) * [**getEventListeners](#getEventListeners) * [**getMaxListeners](#getMaxListeners) * [**listenerCount](#listenerCount) * [**on](#on) * [**once](#once) * [**setMaxListeners](#setMaxListeners) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L246)externalconstructor * ****new LoggerText**(options): [LoggerText](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerText.md) - Overrides Logger.constructor #### Parameters * ##### externaloptionaloptions: {} #### Returns [LoggerText](https://docs.apify.com/sdk/js/sdk/js/reference/class/LoggerText.md) ## Properties[**](#Properties) ### [**](#captureRejections)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L459)staticexternalinheritedcaptureRejections **captureRejections: boolean Inherited from Logger.captureRejections Value: [boolean](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Boolean_type) Change the default `captureRejections` option on all new `EventEmitter` objects. * **@since** v13.4.0, v12.16.0 ### [**](#captureRejectionSymbol)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L452)staticexternalreadonlyinheritedcaptureRejectionSymbol **captureRejectionSymbol: typeof captureRejectionSymbol Inherited from Logger.captureRejectionSymbol Value: `Symbol.for('nodejs.rejection')` See how to write a custom `rejection handler`. * **@since** v13.4.0, v12.16.0 ### [**](#defaultMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L498)staticexternalinheriteddefaultMaxListeners **defaultMaxListeners: number Inherited from Logger.defaultMaxListeners By default, a maximum of `10` listeners can be registered for any single event. This limit can be changed for individual `EventEmitter` instances using the `emitter.setMaxListeners(n)` method. To change the default for *all*`EventEmitter` instances, the `events.defaultMaxListeners` property can be used. If this value is not a positive number, a `RangeError` is thrown. Take caution when setting the `events.defaultMaxListeners` because the change affects *all* `EventEmitter` instances, including those created before the change is made. However, calling `emitter.setMaxListeners(n)` still has precedence over `events.defaultMaxListeners`. This is not a hard limit. 
The `EventEmitter` instance will allow more listeners to be added but will output a trace warning to stderr indicating that a "possible EventEmitter memory leak" has been detected. For any single `EventEmitter`, the `emitter.getMaxListeners()` and `emitter.setMaxListeners()` methods can be used to temporarily avoid this warning: ``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.setMaxListeners(emitter.getMaxListeners() + 1); emitter.once('event', () => { // do stuff emitter.setMaxListeners(Math.max(emitter.getMaxListeners() - 1, 0)); }); ``` The `--trace-warnings` command-line flag can be used to display the stack trace for such warnings. The emitted warning can be inspected with `process.on('warning')` and will have the additional `emitter`, `type`, and `count` properties, referring to the event emitter instance, the event's name and the number of attached listeners, respectively. Its `name` property is set to `'MaxListenersExceededWarning'`. * **@since** v0.11.2 ### [**](#errorMonitor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L445)staticexternalreadonlyinheritederrorMonitor **errorMonitor: typeof errorMonitor Inherited from Logger.errorMonitor This symbol shall be used to install a listener for only monitoring `'error'` events. Listeners installed using this symbol are called before the regular `'error'` listeners are called. Installing a listener using this symbol does not change the behavior once an `'error'` event is emitted. Therefore, the process will still crash if no regular `'error'` listener is installed. * **@since** v13.6.0, v12.17.0 ## Methods[**](#Methods) ### [**](#_log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L247)external\_log * ****\_log**(level, message, data, exception, opts): string - Overrides Logger.\_log #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalmessage: string * ##### externaloptionaldata: any * ##### externaloptionalexception: unknown * ##### externaloptionalopts: Record\ #### Returns string ### [**](#_outputWithConsole)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L36)externalinherited\_outputWithConsole * ****\_outputWithConsole**(level, line): void - Inherited from Logger.\_outputWithConsole #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalline: string #### Returns void ### [**](#\[captureRejectionSymbol])[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L136)externaloptionalinherited\[captureRejectionSymbol] * ****\[captureRejectionSymbol]**(error, event, ...args): void - Inherited from Logger.\[captureRejectionSymbol] #### Parameters * ##### externalerror: Error * ##### externalevent: string | symbol * ##### externalrest...args: AnyRest #### Returns void ### [**](#addListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L597)externalinheritedaddListener * ****addListener**(eventName, listener): this - Inherited from Logger.addListener Alias for `emitter.on(eventName, listener)`. 
* **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#emit)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L859)externalinheritedemit * ****emit**(eventName, ...args): boolean - Inherited from Logger.emit Synchronously calls each of the listeners registered for the event named `eventName`, in the order they were registered, passing the supplied arguments to each. Returns `true` if the event had listeners, `false` otherwise. ``` import { EventEmitter } from 'node:events'; const myEmitter = new EventEmitter(); // First listener myEmitter.on('event', function firstListener() { console.log('Helloooo! first listener'); }); // Second listener myEmitter.on('event', function secondListener(arg1, arg2) { console.log(`event with parameters ${arg1}, ${arg2} in second listener`); }); // Third listener myEmitter.on('event', function thirdListener(...args) { const parameters = args.join(', '); console.log(`event with parameters ${parameters} in third listener`); }); console.log(myEmitter.listeners('event')); myEmitter.emit('event', 1, 2, 3, 4, 5); // Prints: // [ // [Function: firstListener], // [Function: secondListener], // [Function: thirdListener] // ] // Helloooo! first listener // event with parameters 1, 2 in second listener // event with parameters 1, 2, 3, 4, 5 in third listener ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externalrest...args: AnyRest #### Returns boolean ### [**](#eventNames)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L922)externalinheritedeventNames * ****eventNames**(): (string | symbol)\[] - Inherited from Logger.eventNames Returns an array listing the events for which the emitter has registered listeners. The values in the array are strings or `Symbol`s. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => {}); myEE.on('bar', () => {}); const sym = Symbol('symbol'); myEE.on(sym, () => {}); console.log(myEE.eventNames()); // Prints: [ 'foo', 'bar', Symbol(symbol) ] ``` * **@since** v6.0.0 *** #### Returns (string | symbol)\[] ### [**](#getMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L774)externalinheritedgetMaxListeners * ****getMaxListeners**(): number - Inherited from Logger.getMaxListeners Returns the current max listener value for the `EventEmitter` which is either set by `emitter.setMaxListeners(n)` or defaults to defaultMaxListeners. * **@since** v1.0.0 *** #### Returns number ### [**](#getOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L35)externalinheritedgetOptions * ****getOptions**(): Record\ - Inherited from Logger.getOptions #### Returns Record\ ### [**](#listenerCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L868)externalinheritedlistenerCount * ****listenerCount**(eventName, listener): number - Inherited from Logger.listenerCount Returns the number of listeners listening for the event named `eventName`. If `listener` is provided, it will return how many times the listener is found in the list of the listeners of the event. 
* **@since** v3.2.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event being listened for * ##### externaloptionallistener: Function The event handler function #### Returns number ### [**](#listeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L787)externalinheritedlisteners * ****listeners**(eventName): Function\[] - Inherited from Logger.listeners Returns a copy of the array of listeners for the event named `eventName`. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); console.log(util.inspect(server.listeners('connection'))); // Prints: [ [Function] ] ``` * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L38)externalinheritedlog * ****log**(level, message, ...args): void - Inherited from Logger.log #### Parameters * ##### externallevel: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) * ##### externalmessage: string * ##### externalrest...args: any\[] #### Returns void ### [**](#off)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L747)externalinheritedoff * ****off**(eventName, listener): this - Inherited from Logger.off Alias for `emitter.removeListener()`. * **@since** v10.0.0 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#on)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L629)externalinheritedon * ****on**(eventName, listener): this - Inherited from Logger.on Adds the `listener` function to the end of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.on('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. The `emitter.prependListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.on('foo', () => console.log('a')); myEE.prependListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.1.101 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#once)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L659)externalinheritedonce * ****once**(eventName, listener): this - Inherited from Logger.once Adds a **one-time** `listener` function for the event named `eventName`. The next time `eventName` is triggered, this listener is removed and then invoked. ``` server.once('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. By default, event listeners are invoked in the order they are added. 
The `emitter.prependOnceListener()` method can be used as an alternative to add the event listener to the beginning of the listeners array. ``` import { EventEmitter } from 'node:events'; const myEE = new EventEmitter(); myEE.once('foo', () => console.log('a')); myEE.prependOnceListener('foo', () => console.log('b')); myEE.emit('foo'); // Prints: // b // a ``` * **@since** v0.3.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L886)externalinheritedprependListener * ****prependListener**(eventName, listener): this - Inherited from Logger.prependListener Adds the `listener` function to the *beginning* of the listeners array for the event named `eventName`. No checks are made to see if the `listener` has already been added. Multiple calls passing the same combination of `eventName` and `listener` will result in the `listener` being added, and called, multiple times. ``` server.prependListener('connection', (stream) => { console.log('someone connected!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#prependOnceListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L902)externalinheritedprependOnceListener * ****prependOnceListener**(eventName, listener): this - Inherited from Logger.prependOnceListener Adds a **one-time**`listener` function for the event named `eventName` to the *beginning* of the listeners array. The next time `eventName` is triggered, this listener is removed, and then invoked. ``` server.prependOnceListener('connection', (stream) => { console.log('Ah, we have our first user!'); }); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v6.0.0 *** #### Parameters * ##### externaleventName: string | symbol The name of the event. * ##### externallistener: (...args) => void The callback function #### Returns this ### [**](#rawListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L818)externalinheritedrawListeners * ****rawListeners**(eventName): Function\[] - Inherited from Logger.rawListeners Returns a copy of the array of listeners for the event named `eventName`, including any wrappers (such as those created by `.once()`). 
``` import { EventEmitter } from 'node:events'; const emitter = new EventEmitter(); emitter.once('log', () => console.log('log once')); // Returns a new Array with a function `onceWrapper` which has a property // `listener` which contains the original listener bound above const listeners = emitter.rawListeners('log'); const logFnWrapper = listeners[0]; // Logs "log once" to the console and does not unbind the `once` event logFnWrapper.listener(); // Logs "log once" to the console and removes the listener logFnWrapper(); emitter.on('log', () => console.log('log persistently')); // Will return a new Array with a single function bound by `.on()` above const newListeners = emitter.rawListeners('log'); // Logs "log persistently" twice newListeners[0](); emitter.emit('log'); ``` * **@since** v9.4.0 *** #### Parameters * ##### externaleventName: string | symbol #### Returns Function\[] ### [**](#removeAllListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L758)externalinheritedremoveAllListeners * ****removeAllListeners**(eventName): this - Inherited from Logger.removeAllListeners Removes all listeners, or those of the specified `eventName`. It is bad practice to remove listeners added elsewhere in the code, particularly when the `EventEmitter` instance was created by some other component or module (e.g. sockets or file streams). Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaloptionaleventName: string | symbol #### Returns this ### [**](#removeListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L742)externalinheritedremoveListener * ****removeListener**(eventName, listener): this - Inherited from Logger.removeListener Removes the specified `listener` from the listener array for the event named `eventName`. ``` const callback = (stream) => { console.log('someone connected!'); }; server.on('connection', callback); // ... server.removeListener('connection', callback); ``` `removeListener()` will remove, at most, one instance of a listener from the listener array. If any single listener has been added multiple times to the listener array for the specified `eventName`, then `removeListener()` must be called multiple times to remove each instance. Once an event is emitted, all listeners attached to it at the time of emitting are called in order. This implies that any `removeListener()` or `removeAllListeners()` calls *after* emitting and *before* the last listener finishes execution will not remove them from`emit()` in progress. Subsequent events behave as expected. ``` import { EventEmitter } from 'node:events'; class MyEmitter extends EventEmitter {} const myEmitter = new MyEmitter(); const callbackA = () => { console.log('A'); myEmitter.removeListener('event', callbackB); }; const callbackB = () => { console.log('B'); }; myEmitter.on('event', callbackA); myEmitter.on('event', callbackB); // callbackA removes listener callbackB but it will still be called. // Internal listener array at time of emit [callbackA, callbackB] myEmitter.emit('event'); // Prints: // A // B // callbackB is now removed. // Internal listener array [callbackA] myEmitter.emit('event'); // Prints: // A ``` Because listeners are managed using an internal array, calling this will change the position indices of any listener registered *after* the listener being removed. 
This will not impact the order in which listeners are called, but it means that any copies of the listener array as returned by the `emitter.listeners()` method will need to be recreated. When a single function has been added as a handler multiple times for a single event (as in the example below), `removeListener()` will remove the most recently added instance. In the example the `once('ping')` listener is removed: ``` import { EventEmitter } from 'node:events'; const ee = new EventEmitter(); function pong() { console.log('pong'); } ee.on('ping', pong); ee.once('ping', pong); ee.removeListener('ping', pong); ee.emit('ping'); ee.emit('ping'); ``` Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.1.26 *** #### Parameters * ##### externaleventName: string | symbol * ##### externallistener: (...args) => void #### Returns this ### [**](#setMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L768)externalinheritedsetMaxListeners * ****setMaxListeners**(n): this - Inherited from Logger.setMaxListeners By default `EventEmitter`s will print a warning if more than `10` listeners are added for a particular event. This is a useful default that helps finding memory leaks. The `emitter.setMaxListeners()` method allows the limit to be modified for this specific `EventEmitter` instance. The value can be set to `Infinity` (or `0`) to indicate an unlimited number of listeners. Returns a reference to the `EventEmitter`, so that calls can be chained. * **@since** v0.3.5 *** #### Parameters * ##### externaln: number #### Returns this ### [**](#setOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L34)externalinheritedsetOptions * ****setOptions**(options): void - Inherited from Logger.setOptions #### Parameters * ##### externaloptions: Record\ #### Returns void ### [**](#addAbortListener)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L437)staticexternalinheritedaddAbortListener * ****addAbortListener**(signal, resource): Disposable - Inherited from Logger.addAbortListener experimental Listens once to the `abort` event on the provided `signal`. Listening to the `abort` event on abort signals is unsafe and may lead to resource leaks since another third party with the signal can call `e.stopImmediatePropagation()`. Unfortunately Node.js cannot change this since it would violate the web standard. Additionally, the original API makes it easy to forget to remove listeners. This API allows safely using `AbortSignal`s in Node.js APIs by solving these two issues by listening to the event such that `stopImmediatePropagation` does not prevent the listener from running. Returns a disposable so that it may be unsubscribed from more easily. ``` import { addAbortListener } from 'node:events'; function example(signal) { let disposable; try { signal.addEventListener('abort', (e) => e.stopImmediatePropagation()); disposable = addAbortListener(signal, (e) => { // Do something when signal is aborted. }); } finally { disposable?.[Symbol.dispose](); } } ``` * **@since** v20.5.0 *** #### Parameters * ##### externalsignal: AbortSignal * ##### externalresource: (event) => void #### Returns Disposable Disposable that removes the `abort` listener. 
### [**](#getEventListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L358)staticexternalinheritedgetEventListeners * ****getEventListeners**(emitter, name): Function\[] - Inherited from Logger.getEventListeners Returns a copy of the array of listeners for the event named `eventName`. For `EventEmitter`s this behaves exactly the same as calling `.listeners` on the emitter. For `EventTarget`s this is the only way to get the event listeners for the event target. This is useful for debugging and diagnostic purposes. ``` import { getEventListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); const listener = () => console.log('Events are fun'); ee.on('foo', listener); console.log(getEventListeners(ee, 'foo')); // [ [Function: listener] ] } { const et = new EventTarget(); const listener = () => console.log('Events are fun'); et.addEventListener('foo', listener); console.log(getEventListeners(et, 'foo')); // [ [Function: listener] ] } ``` * **@since** v15.2.0, v14.17.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget * ##### externalname: string | symbol #### Returns Function\[] ### [**](#getMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L387)staticexternalinheritedgetMaxListeners * ****getMaxListeners**(emitter): number - Inherited from Logger.getMaxListeners Returns the currently set max amount of listeners. For `EventEmitter`s this behaves exactly the same as calling `.getMaxListeners` on the emitter. For `EventTarget`s this is the only way to get the max event listeners for the event target. If the number of event handlers on a single EventTarget exceeds the max set, the EventTarget will print a warning. ``` import { getMaxListeners, setMaxListeners, EventEmitter } from 'node:events'; { const ee = new EventEmitter(); console.log(getMaxListeners(ee)); // 10 setMaxListeners(11, ee); console.log(getMaxListeners(ee)); // 11 } { const et = new EventTarget(); console.log(getMaxListeners(et)); // 10 setMaxListeners(11, et); console.log(getMaxListeners(et)); // 11 } ``` * **@since** v19.9.0 *** #### Parameters * ##### externalemitter: EventEmitter\ | EventTarget #### Returns number ### [**](#listenerCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L330)staticexternalinheritedlistenerCount * ****listenerCount**(emitter, eventName): number - Inherited from Logger.listenerCount A class method that returns the number of listeners for the given `eventName` registered on the given `emitter`. ``` import { EventEmitter, listenerCount } from 'node:events'; const myEmitter = new EventEmitter(); myEmitter.on('event', () => {}); myEmitter.on('event', () => {}); console.log(listenerCount(myEmitter, 'event')); // Prints: 2 ``` * **@since** v0.9.12 * **@deprecated** Since v3.2.0 - Use `listenerCount` instead. 
*** #### Parameters * ##### externalemitter: EventEmitter\ The emitter to query * ##### externaleventName: string | symbol The event name #### Returns number ### [**](#on)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L303)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L308)staticexternalinheritedon * ****on**(emitter, eventName, options): AsyncIterator\ * ****on**(emitter, eventName, options): AsyncIterator\ - Inherited from Logger.on ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo')) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here ``` Returns an `AsyncIterator` that iterates `eventName` events. It will throw if the `EventEmitter` emits `'error'`. It removes all listeners when exiting the loop. The `value` returned by each iteration is an array composed of the emitted event arguments. An `AbortSignal` can be used to cancel waiting on events: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ac = new AbortController(); (async () => { const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); }); for await (const event of on(ee, 'foo', { signal: ac.signal })) { // The execution of this inner block is synchronous and it // processes one event at a time (even with await). Do not use // if concurrent execution is required. console.log(event); // prints ['bar'] [42] } // Unreachable here })(); process.nextTick(() => ac.abort()); ``` Use the `close` option to specify an array of event names that will end the iteration: ``` import { on, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); // Emit later on process.nextTick(() => { ee.emit('foo', 'bar'); ee.emit('foo', 42); ee.emit('close'); }); for await (const event of on(ee, 'foo', { close: ['close'] })) { console.log(event); // prints ['bar'] [42] } // the loop will exit after 'close' is emitted console.log('done'); // prints 'done' ``` * **@since** v13.6.0, v12.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterIteratorOptions #### Returns AsyncIterator\ An `AsyncIterator` that iterates `eventName` events emitted by the `emitter` ### [**](#once)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L217)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L222)staticexternalinheritedonce * ****once**(emitter, eventName, options): Promise\ * ****once**(emitter, eventName, options): Promise\ - Inherited from Logger.once Creates a `Promise` that is fulfilled when the `EventEmitter` emits the given event or that is rejected if the `EventEmitter` emits `'error'` while waiting. The `Promise` will resolve with an array of all the arguments emitted to the given event. 
This method is intentionally generic and works with the web platform [EventTarget](https://dom.spec.whatwg.org/#interface-eventtarget) interface, which has no special`'error'` event semantics and does not listen to the `'error'` event. ``` import { once, EventEmitter } from 'node:events'; import process from 'node:process'; const ee = new EventEmitter(); process.nextTick(() => { ee.emit('myevent', 42); }); const [value] = await once(ee, 'myevent'); console.log(value); const err = new Error('kaboom'); process.nextTick(() => { ee.emit('error', err); }); try { await once(ee, 'myevent'); } catch (err) { console.error('error happened', err); } ``` The special handling of the `'error'` event is only used when `events.once()` is used to wait for another event. If `events.once()` is used to wait for the '`error'` event itself, then it is treated as any other kind of event without special handling: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); once(ee, 'error') .then(([err]) => console.log('ok', err.message)) .catch((err) => console.error('error', err.message)); ee.emit('error', new Error('boom')); // Prints: ok boom ``` An `AbortSignal` can be used to cancel waiting for the event: ``` import { EventEmitter, once } from 'node:events'; const ee = new EventEmitter(); const ac = new AbortController(); async function foo(emitter, event, signal) { try { await once(emitter, event, { signal }); console.log('event emitted!'); } catch (error) { if (error.name === 'AbortError') { console.error('Waiting for the event was canceled!'); } else { console.error('There was an error', error.message); } } } foo(ee, 'foo', ac.signal); ac.abort(); // Abort waiting for the event ee.emit('foo'); // Prints: Waiting for the event was canceled! ``` * **@since** v11.13.0, v10.16.0 *** #### Parameters * ##### externalemitter: EventEmitter\ * ##### externaleventName: string | symbol * ##### externaloptionaloptions: StaticEventEmitterOptions #### Returns Promise\ ### [**](#setMaxListeners)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@types/node/events.d.ts#L402)staticexternalinheritedsetMaxListeners * ****setMaxListeners**(n, ...eventTargets): void - Inherited from Logger.setMaxListeners ``` import { setMaxListeners, EventEmitter } from 'node:events'; const target = new EventTarget(); const emitter = new EventEmitter(); setMaxListeners(5, target, emitter); ``` * **@since** v15.4.0 *** #### Parameters * ##### externaloptionaln: number A non-negative number. The maximum number of listeners per `EventTarget` event. * ##### externalrest...eventTargets: (EventEmitter\ | EventTarget)\[] Zero or more {EventTarget} or {EventEmitter} instances. If none are specified, `n` is set as the default max for all newly created {EventTarget} and {EventEmitter} objects. #### Returns void --- # PlatformEventManager Gets an instance of a Node.js' [EventEmitter](https://nodejs.org/api/events.html#events_class_eventemitter) class that emits various events from the SDK or the Apify platform. The event emitter is initialized by calling the [Actor.main](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#main) function. **Example usage:** ``` Actor.on('cpuInfo', (data) => { if (data.isCpuOverloaded) console.log('Oh no, the CPU is overloaded!'); }); ``` The following events are emitted: * `cpuInfo`: `{ "isCpuOverloaded": Boolean }` The event is emitted approximately every second and it indicates whether the Actor is using the maximum of available CPU resources. 
If that's the case, the Actor should not add more workload. For example, this event is used by the AutoscaledPool class. * `migrating`: `void` Emitted when the Actor running on the Apify platform is going to be migrated to another worker server soon. You can use it to persist the state of the Actor and gracefully stop your in-progress tasks, so that they are not interrupted by the migration. For example, this is used by the RequestList class. * `aborting`: `void` When a user aborts an Actor run on the Apify platform, they can choose to abort gracefully to allow the Actor some time before getting killed. This graceful abort emits the `aborting` event which the SDK uses to gracefully stop running crawls and you can use it to do your own cleanup as well. * `persistState`: `{ "isMigrating": Boolean }` Emitted in regular intervals (by default 60 seconds) to notify all components of Apify SDK that it is time to persist their state, in order to avoid repeating all work when the Actor restarts. This event is automatically emitted together with the `migrating` event, in which case the `isMigrating` flag is set to `true`. Otherwise the flag is `false`. Note that the `persistState` event is provided merely for user convenience, you can achieve the same effect using `setInterval()` and listening for the `migrating` event. ### Hierarchy * EventManager * *PlatformEventManager* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**config](#config) ### Methods * [**close](#close) * [**emit](#emit) * [**init](#init) * [**isInitialized](#isInitialized) * [**off](#off) * [**on](#on) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/platform_event_manager.ts#L50)constructor * ****new PlatformEventManager**(config): [PlatformEventManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/PlatformEventManager.md) - Overrides EventManager.constructor #### Parameters * ##### config: [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) = ... #### Returns [PlatformEventManager](https://docs.apify.com/sdk/js/sdk/js/reference/class/PlatformEventManager.md) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/platform_event_manager.ts#L50)readonlyinheritedconfig **config: [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) = ... Inherited from EventManager.config ## Methods[**](#Methods) ### [**](#close)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/platform_event_manager.ts#L112)close * ****close**(): Promise\ - Overrides EventManager.close Closes websocket providing events from Actor infrastructure and also stops sending internal events of Apify package such as `persistState`. This is automatically called at the end of `Actor.main()`. 
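Event delivery stops once `close()` runs (normally at the end of `Actor.main()`); until then, handlers registered via `Actor.on` receive the events listed above. A minimal sketch of such handlers, using `Actor.init()`/`Actor.exit()` in place of `Actor.main()` (the `STATE` key and the state shape are illustrative, not part of the SDK):

```
import { Actor } from 'apify';

await Actor.init();

// Illustrative progress data; a real Actor would track its own state here.
const state = { processed: 0 };

// Persist state in regular intervals and right before migration.
Actor.on('persistState', async ({ isMigrating }) => {
    await Actor.setValue('STATE', { ...state, isMigrating });
});

// Wind down in-progress work when the platform migrates or gracefully aborts the run.
Actor.on('migrating', () => console.log('Migration imminent, wrapping up...'));
Actor.on('aborting', () => console.log('Graceful abort requested...'));

// ... Actor work updating `state` goes here ...

await Actor.exit();
```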
*** #### Returns Promise\ ### [**](#emit)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/events/event_manager.d.ts#L36)externalinheritedemit * ****emit**(event, ...args): void - Inherited from EventManager.emit #### Parameters * ##### externalevent: EventTypeName * ##### externalrest...args: unknown\[] #### Returns void ### [**](#init)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/platform_event_manager.ts#L58)init * ****init**(): Promise\ - Overrides EventManager.init Initializes `Actor.events` event emitter by creating a connection to a websocket that provides them. This is an internal function that is automatically called by `Actor.main()`. *** #### Returns Promise\ ### [**](#isInitialized)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/events/event_manager.d.ts#L37)externalinheritedisInitialized * ****isInitialized**(): boolean - Inherited from EventManager.isInitialized #### Returns boolean ### [**](#off)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/events/event_manager.d.ts#L35)externalinheritedoff * ****off**(event, listener): void - Inherited from EventManager.off #### Parameters * ##### externalevent: EventTypeName * ##### externaloptionallistener: (...args) => any #### Returns void ### [**](#on)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/events/event_manager.d.ts#L34)externalinheritedon * ****on**(event, listener): void - Inherited from EventManager.on #### Parameters * ##### externalevent: EventTypeName * ##### externallistener: (...args) => any #### Returns void --- # ProxyConfiguration Configures connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. Setting proxy configuration in your crawlers automatically configures them to use the selected proxies for all connections. You can get information about the currently used proxy by inspecting the [ProxyInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyInfo.md) property in your crawler's page function. There, you can inspect the proxy's URL and other attributes. The proxy servers are managed by [Apify Proxy](https://docs.apify.com/proxy). To be able to use Apify Proxy, you need an Apify account and access to the selected proxies. If you provide no configuration option, the proxies will be managed automatically using a smart algorithm. If you want to use your own proxies, use the [ProxyConfigurationOptions.proxyUrls](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md#proxyUrls) option. Your list of proxy URLs will be rotated by the configuration if this option is provided. **Example usage:** ``` const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['GROUP1', 'GROUP2'] // List of Apify Proxy groups countryCode: 'US', }); const crawler = new CheerioCrawler({ // ... 
proxyConfiguration, requestHandler({ proxyInfo }) { const usedProxyUrl = proxyInfo.url; // Getting the proxy URL } }) ``` ### Hierarchy * ProxyConfiguration * *ProxyConfiguration* ## Index[**](#Index) ### Properties * [**config](#config) * [**isManInTheMiddle](#isManInTheMiddle) ### Methods * [**initialize](#initialize) * [**newProxyInfo](#newProxyInfo) * [**newUrl](#newUrl) ## Properties[**](#Properties) ### [**](#config)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L171)readonlyconfig **config: [Configuration](https://docs.apify.com/sdk/js/sdk/js/reference/class/Configuration.md) = ... ### [**](#isManInTheMiddle)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L157)externalinheritedisManInTheMiddle **isManInTheMiddle: boolean Inherited from CoreProxyConfiguration.isManInTheMiddle ## Methods[**](#Methods) ### [**](#initialize)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L234)initialize * ****initialize**(): Promise\ - Loads proxy password if token is provided and checks access to Apify Proxy and provided proxy groups if Apify Proxy configuration is used. Also checks if country has access to Apify Proxy groups if the country code is provided. You should use the createProxyConfiguration function to create a pre-initialized `ProxyConfiguration` instance instead of calling this manually. *** #### Returns Promise\ ### [**](#newProxyInfo)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L260)newProxyInfo * ****newProxyInfo**(sessionId, options): Promise\ - Overrides CoreProxyConfiguration.newProxyInfo This function creates a new [ProxyInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyInfo.md) info object. It is used by CheerioCrawler and PuppeteerCrawler to generate proxy URLs and also to allow the user to inspect the currently used proxy via the requestHandler parameter `proxyInfo`. Use it if you want to work with a rich representation of a proxy URL. If you need the URL string only, use [ProxyConfiguration.newUrl](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md#newUrl). *** #### Parameters * ##### optionalsessionId: string | number Represents the identifier of user Session that can be managed by the SessionPool or you can use the Apify Proxy [Session](https://docs.apify.com/proxy#sessions) identifier. When the provided sessionId is a number, it's converted to a string. Property sessionId of [ProxyInfo](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyInfo.md) is always returned as a type string. All the HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier must not be longer than 50 characters and include only the following: `0-9`, `a-z`, `A-Z`, `"."`, `"_"` and `"~"`. * ##### optionaloptions: TieredProxyOptions #### Returns Promise\ Represents information about used proxy and its configuration. ### [**](#newUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L294)newUrl * ****newUrl**(sessionId, options): Promise\ - Overrides CoreProxyConfiguration.newUrl Returns a new proxy URL based on provided configuration options and the `sessionId` parameter. 
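For instance, a minimal sketch of generating a session-bound proxy URL outside of a crawler (the session name `my_session_1` is illustrative):

```
const proxyConfiguration = await Actor.createProxyConfiguration();

// Requests made through URLs with the same session identifier
// are routed through the same target proxy server (same IP).
const proxyUrl = await proxyConfiguration.newUrl('my_session_1');
// e.g. 'http://bob:password123@proxy.example.com:8000'
```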
*** #### Parameters * ##### optionalsessionId: string | number Represents the identifier of user Session that can be managed by the SessionPool or you can use the Apify Proxy [Session](https://docs.apify.com/proxy#sessions) identifier. When the provided sessionId is a number, it's converted to a string. All the HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier must not be longer than 50 characters and include only the following: `0-9`, `a-z`, `A-Z`, `"."`, `"_"` and `"~"`. * ##### optionaloptions: TieredProxyOptions #### Returns Promise\ A string with a proxy URL, including authentication credentials and port number. For example, `http://bob:password123@proxy.example.com:8000` --- # externalRequestQueue Represents a queue of URLs to crawl, which is used for deep crawling of websites where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. Each URL is represented using an instance of the Request class. The queue can only contain unique URLs. More precisely, it can only contain Request instances with distinct `uniqueKey` properties. By default, `uniqueKey` is generated from the URL, but it can also be overridden. To add a single URL multiple times to the queue, corresponding Request objects will need to have different `uniqueKey` properties. Do not instantiate this class directly, use the [RequestQueue.open](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md#open) function instead. `RequestQueue` is used by BasicCrawler, CheerioCrawler, PuppeteerCrawler and PlaywrightCrawler as a source of URLs to crawl. Unlike RequestList, `RequestQueue` supports dynamic adding and removing of requests. On the other hand, the queue is not optimized for operations that add or remove a large number of URLs in a batch. 
**Example usage:** ``` // Open the default request queue associated with the crawler run const queue = await RequestQueue.open(); // Open a named request queue const queueWithName = await RequestQueue.open('some-name'); // Enqueue few requests await queue.addRequest({ url: 'http://example.com/aaa' }); await queue.addRequest({ url: 'http://example.com/bbb' }); await queue.addRequest({ url: 'http://example.com/foo/bar' }, { forefront: true }); ``` ### Hierarchy * RequestProvider * *RequestQueue* ## Index[**](#Index) ### Constructors * [**constructor](#constructor) ### Properties * [**assumedHandledCount](#assumedHandledCount) * [**assumedTotalCount](#assumedTotalCount) * [**client](#client) * [**clientKey](#clientKey) * [**config](#config) * [**id](#id) * [**internalTimeoutMillis](#internalTimeoutMillis) * [**log](#log) * [**name](#name) * [**requestLockSecs](#requestLockSecs) * [**timeoutSecs](#timeoutSecs) ### Methods * [**addRequest](#addRequest) * [**addRequests](#addRequests) * [**addRequestsBatched](#addRequestsBatched) * [**drop](#drop) * [**fetchNextRequest](#fetchNextRequest) * [**getInfo](#getInfo) * [**getRequest](#getRequest) * [**getTotalCount](#getTotalCount) * [**handledCount](#handledCount) * [**isEmpty](#isEmpty) * [**isFinished](#isFinished) * [**markRequestHandled](#markRequestHandled) * [**reclaimRequest](#reclaimRequest) * [**open](#open) ## Constructors[**](#Constructors) ### [**](#constructor)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L45)externalconstructor * ****new RequestQueue**(options, config): [RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) - Overrides RequestProvider.constructor #### Parameters * ##### externaloptions: RequestProviderOptions * ##### externaloptionalconfig: Configuration #### Returns [RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) ## Properties[**](#Properties) ### [**](#assumedHandledCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L21)externalinheritedassumedHandledCount **assumedHandledCount: number Inherited from RequestProvider.assumedHandledCount ### [**](#assumedTotalCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L20)externalinheritedassumedTotalCount **assumedTotalCount: number Inherited from RequestProvider.assumedTotalCount ### [**](#client)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L15)externalinheritedclient **client: RequestQueueClient Inherited from RequestProvider.client ### [**](#clientKey)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L14)externalinheritedclientKey **clientKey: string Inherited from RequestProvider.clientKey ### [**](#config)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L10)externalreadonlyinheritedconfig **config: Configuration Inherited from RequestProvider.config ### [**](#id)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L11)externalinheritedid **id: string Inherited from RequestProvider.id ### 
[**](#internalTimeoutMillis)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L18)externalinheritedinternalTimeoutMillis **internalTimeoutMillis: number Inherited from RequestProvider.internalTimeoutMillis ### [**](#log)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L17)externalinheritedlog **log: [Log](https://docs.apify.com/sdk/js/sdk/js/reference/class/Log.md) Inherited from RequestProvider.log ### [**](#name)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L12)externaloptionalinheritedname **name? : string Inherited from RequestProvider.name ### [**](#requestLockSecs)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L19)externalinheritedrequestLockSecs **requestLockSecs: number Inherited from RequestProvider.requestLockSecs ### [**](#timeoutSecs)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L13)externalinheritedtimeoutSecs **timeoutSecs: number Inherited from RequestProvider.timeoutSecs ## Methods[**](#Methods) ### [**](#addRequest)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L53)externaladdRequest * ****addRequest**(requestLike, options): Promise\ - Overrides RequestProvider.addRequest * **@inheritDoc** *** #### Parameters * ##### externalrequestLike: Source * ##### externaloptionaloptions: [RequestQueueOperationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RequestQueueOperationOptions.md) #### Returns Promise\ ### [**](#addRequests)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L57)externaladdRequests * ****addRequests**(requestsLike, options): Promise\ - Overrides RequestProvider.addRequests * **@inheritDoc** *** #### Parameters * ##### externalrequestsLike: Source\[] * ##### externaloptionaloptions: [RequestQueueOperationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RequestQueueOperationOptions.md) #### Returns Promise\ ### [**](#addRequestsBatched)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L76)externalinheritedaddRequestsBatched * ****addRequestsBatched**(requests, options): Promise\ - Inherited from RequestProvider.addRequestsBatched Adds requests to the queue in batches. By default, it will resolve after the initial batch is added, and continue adding the rest in the background. You can configure the batch size via `batchSize` option and the sleep time in between the batches via `waitBetweenBatchesMillis`. If you want to wait for all batches to be added to the queue, you can use the `waitForAllRequestsToBeAdded` promise you get in the response object. *** #### Parameters * ##### externalrequests: (string | Source)\[] The requests to add * ##### externaloptionaloptions: AddRequestsBatchedOptions Options for the request queue #### Returns Promise\ ### [**](#drop)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L144)externalinheriteddrop * ****drop**(): Promise\ - Inherited from RequestProvider.drop Removes the queue either from the Apify Cloud storage or from the local database, depending on the mode of operation. 
*** #### Returns Promise\ ### [**](#fetchNextRequest)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L61)externalfetchNextRequest * ****fetchNextRequest**(): Promise\> - Overrides RequestProvider.fetchNextRequest * **@inheritDoc** *** #### Returns Promise\> ### [**](#getInfo)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L179)externalinheritedgetInfo * ****getInfo**(): Promise\ - Inherited from RequestProvider.getInfo Returns an object containing general information about the request queue. The function returns the same object as the Apify API Client's [getQueue](https://docs.apify.com/api/apify-client-js/latest#ApifyClient-requestQueues) function, which in turn calls the [Get request queue](https://apify.com/docs/api/v2#/reference/request-queues/queue/get-request-queue) API endpoint. **Example:** ``` { id: "WkzbQMuFYuamGv3YF", name: "my-queue", userId: "wRsJZtadYvn4mBZmm", createdAt: new Date("2015-12-12T07:34:14.202Z"), modifiedAt: new Date("2015-12-13T08:36:13.202Z"), accessedAt: new Date("2015-12-14T08:36:13.202Z"), totalRequestCount: 25, handledRequestCount: 5, pendingRequestCount: 20, } ``` *** #### Returns Promise\ ### [**](#getRequest)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L83)externalinheritedgetRequest * ****getRequest**(id): Promise\> - Inherited from RequestProvider.getRequest Gets the request from the queue specified by ID. *** #### Parameters * ##### externalid: string ID of the request. #### Returns Promise\> Returns the request object, or `null` if it was not found. ### [**](#getTotalCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L36)externalinheritedgetTotalCount * ****getTotalCount**(): number - Inherited from RequestProvider.getTotalCount Returns an offline approximation of the total number of requests in the queue (i.e. pending + handled). Survives restarts and actor migrations. *** #### Returns number ### [**](#handledCount)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L154)externalinheritedhandledCount * ****handledCount**(): Promise\ - Inherited from RequestProvider.handledCount Returns the number of handled requests. This function is just a convenient shortcut for: ``` const { handledRequestCount } = await queue.getInfo(); ``` *** #### Returns Promise\ ### [**](#isEmpty)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L123)externalinheritedisEmpty * ****isEmpty**(): Promise\ - Inherited from RequestProvider.isEmpty Resolves to `true` if the next call to [RequestQueue.fetchNextRequest](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md#fetchNextRequest) would return `null`, otherwise it resolves to `false`. Note that even if the queue is empty, there might be some pending requests currently being processed. If you need to ensure that there is no activity in the queue, use [RequestQueue.isFinished](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md#isFinished). 
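The fetching methods above can also be driven manually. A minimal sketch of consuming the queue without a crawler (the error handling and one-second back-off are illustrative):

```
const queue = await RequestQueue.open();

while (!(await queue.isFinished())) {
    const request = await queue.fetchNextRequest();
    if (!request) {
        // The queue may be momentarily empty while other requests are still being processed.
        await new Promise((resolve) => setTimeout(resolve, 1000));
        continue;
    }
    try {
        // ... process request.url here ...
        await queue.markRequestHandled(request);
    } catch (err) {
        // Return the request to the queue so it can be retried later.
        await queue.reclaimRequest(request);
    }
}
```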
*** #### Returns Promise\ ### [**](#isFinished)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L69)externalisFinished * ****isFinished**(): Promise\ - Overrides RequestProvider.isFinished * **@inheritDoc** *** #### Returns Promise\ ### [**](#markRequestHandled)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L65)externalmarkRequestHandled * ****markRequestHandled**(request): Promise\ - Overrides RequestProvider.markRequestHandled * **@inheritDoc** *** #### Parameters * ##### externalrequest: Request\ #### Returns Promise\ ### [**](#reclaimRequest)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L73)externalreclaimRequest * ****reclaimRequest**(...args): Promise\ - Overrides RequestProvider.reclaimRequest * **@inheritDoc** *** #### Parameters * ##### externalrest...args: \[request: Request\, options?: [RequestQueueOperationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/RequestQueueOperationOptions.md)] #### Returns Promise\ ### [**](#open)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_queue_v2.d.ts#L85)staticexternalopen * ****open**(...args): Promise<[RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md)> - Overrides RequestProvider.open * **@inheritDoc** *** #### Parameters * ##### externalrest...args: \[queueIdOrName?: null | string, options?: StorageManagerOptions] #### Returns Promise<[RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md)> --- # externalLogLevel ## Index[**](#Index) ### Enumeration Members * [**DEBUG](#DEBUG) * [**ERROR](#ERROR) * [**INFO](#INFO) * [**OFF](#OFF) * [**PERF](#PERF) * [**SOFT\_FAIL](#SOFT_FAIL) * [**WARNING](#WARNING) ## Enumeration Members[**](<#Enumeration Members>) ### [**](#DEBUG)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L9)externalDEBUG **DEBUG: 5 ### [**](#ERROR)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L5)externalERROR **ERROR: 1 ### [**](#INFO)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L8)externalINFO **INFO: 4 ### [**](#OFF)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L4)externalOFF **OFF: 0 ### [**](#PERF)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L10)externalPERF **PERF: 6 ### [**](#SOFT_FAIL)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L6)externalSOFT\_FAIL **SOFT\_FAIL: 2 ### [**](#WARNING)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L7)externalWARNING **WARNING: 3 --- # AbortOptions ### Hierarchy * RunAbortOptions * *AbortOptions* ## Index[**](#Index) ### Properties * [**gracefully](#gracefully) * [**statusMessage](#statusMessage) * [**token](#token) ## Properties[**](#Properties) ### [**](#gracefully)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/run.d.ts#L89)externaloptionalinheritedgracefully **gracefully? : boolean Inherited from RunAbortOptions.gracefully ### [**](#statusMessage)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1912)optionalstatusMessage **statusMessage? 
: string Exit with given status message ### [**](#token)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1909)optionaltoken **token? : string User API token that is used to run the Actor. By default, it is taken from the `APIFY_TOKEN` environment variable. --- # ActorPricingInfo ## Index[**](#Index) ### Properties * [**isPayPerEvent](#isPayPerEvent) * [**maxTotalChargeUsd](#maxTotalChargeUsd) * [**perEventPrices](#perEventPrices) * [**pricingModel](#pricingModel) ## Properties[**](#Properties) ### [**](#isPayPerEvent)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L306)isPayPerEvent **isPayPerEvent: boolean ### [**](#maxTotalChargeUsd)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L305)maxTotalChargeUsd **maxTotalChargeUsd: number ### [**](#perEventPrices)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L307)perEventPrices **perEventPrices: Record\ ### [**](#pricingModel)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L304)optionalpricingModel **pricingModel? : PAY\_PER\_EVENT | PRICE\_PER\_DATASET\_ITEM | FLAT\_PRICE\_PER\_MONTH | FREE --- # externalActorRun ### Hierarchy * ActorRunListItem * *ActorRun* ## Index[**](#Index) ### Properties * [**actId](#actId) * [**actorTaskId](#actorTaskId) * [**buildId](#buildId) * [**buildNumber](#buildNumber) * [**containerUrl](#containerUrl) * [**defaultDatasetId](#defaultDatasetId) * [**defaultKeyValueStoreId](#defaultKeyValueStoreId) * [**defaultRequestQueueId](#defaultRequestQueueId) * [**exitCode](#exitCode) * [**finishedAt](#finishedAt) * [**gitBranchName](#gitBranchName) * [**chargedEventCounts](#chargedEventCounts) * [**id](#id) * [**isContainerServerReady](#isContainerServerReady) * [**meta](#meta) * [**options](#options) * [**pricingInfo](#pricingInfo) * [**startedAt](#startedAt) * [**stats](#stats) * [**status](#status) * [**statusMessage](#statusMessage) * [**usage](#usage) * [**usageTotalUsd](#usageTotalUsd) * [**usageUsd](#usageUsd) * [**userId](#userId) ## Properties[**](#Properties) ### [**](#actId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L180)externalinheritedactId **actId: string Inherited from ActorRunListItem.actId ### [**](#actorTaskId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L181)externaloptionalinheritedactorTaskId **actorTaskId? 
: string Inherited from ActorRunListItem.actorTaskId ### [**](#buildId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L186)externalinheritedbuildId **buildId: string Inherited from ActorRunListItem.buildId ### [**](#buildNumber)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L187)externalinheritedbuildNumber **buildNumber: string Inherited from ActorRunListItem.buildNumber ### [**](#containerUrl)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L199)externalcontainerUrl **containerUrl: string ### [**](#defaultDatasetId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L189)externalinheriteddefaultDatasetId **defaultDatasetId: string Inherited from ActorRunListItem.defaultDatasetId ### [**](#defaultKeyValueStoreId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L188)externalinheriteddefaultKeyValueStoreId **defaultKeyValueStoreId: string Inherited from ActorRunListItem.defaultKeyValueStoreId ### [**](#defaultRequestQueueId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L190)externalinheriteddefaultRequestQueueId **defaultRequestQueueId: string Inherited from ActorRunListItem.defaultRequestQueueId ### [**](#exitCode)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L198)externaloptionalexitCode **exitCode? : number ### [**](#finishedAt)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L183)externalinheritedfinishedAt **finishedAt: Date Inherited from ActorRunListItem.finishedAt ### [**](#gitBranchName)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L201)externaloptionalgitBranchName **gitBranchName? : string ### [**](#chargedEventCounts)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L205)externaloptionalchargedEventCounts **chargedEventCounts? : Record\ ### [**](#id)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L179)externalinheritedid **id: string Inherited from ActorRunListItem.id ### [**](#isContainerServerReady)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L200)externaloptionalisContainerServerReady **isContainerServerReady? : boolean ### [**](#meta)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L185)externalinheritedmeta **meta: ActorRunMeta Inherited from ActorRunListItem.meta ### [**](#options)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L197)externaloptions **options: ActorRunOptions ### [**](#pricingInfo)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L204)externaloptionalpricingInfo **pricingInfo? 
: ActorRunPricingInfo ### [**](#startedAt)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L182)externalinheritedstartedAt **startedAt: Date Inherited from ActorRunListItem.startedAt ### [**](#stats)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L196)externalstats **stats: ActorRunStats ### [**](#status)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L184)externalinheritedstatus **status: READY | RUNNING | SUCCEEDED | FAILED | ABORTING | ABORTED | TIMING-OUT | TIMED-OUT Inherited from ActorRunListItem.status ### [**](#statusMessage)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L195)externaloptionalstatusMessage **statusMessage? : string ### [**](#usage)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L202)externaloptionalusage **usage? : ActorRunUsage ### [**](#usageTotalUsd)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L191)externaloptionalinheritedusageTotalUsd **usageTotalUsd? : number Inherited from ActorRunListItem.usageTotalUsd ### [**](#usageUsd)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L203)externaloptionalusageUsd **usageUsd? : ActorRunUsage ### [**](#userId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L194)externaluserId **userId: string --- # externalApifyClientOptions ## Index[**](#Index) ### Properties * [**baseUrl](#baseUrl) * [**maxRetries](#maxRetries) * [**minDelayBetweenRetriesMillis](#minDelayBetweenRetriesMillis) * [**requestInterceptors](#requestInterceptors) * [**timeoutSecs](#timeoutSecs) * [**token](#token) * [**userAgentSuffix](#userAgentSuffix) ## Properties[**](#Properties) ### [**](#baseUrl)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L138)externaloptionalbaseUrl **baseUrl? : string = https\://api.apify.com ### [**](#maxRetries)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L140)externaloptionalmaxRetries **maxRetries? : number = 8 ### [**](#minDelayBetweenRetriesMillis)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L142)externaloptionalminDelayBetweenRetriesMillis **minDelayBetweenRetriesMillis? : number = 500 ### [**](#requestInterceptors)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L144)externaloptionalrequestInterceptors **requestInterceptors? : (undefined | null | (value) => ApifyRequestConfig | Promise\)\[] = \[] ### [**](#timeoutSecs)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L146)externaloptionaltimeoutSecs **timeoutSecs? : number = 360 ### [**](#token)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L147)externaloptionaltoken **token? : string ### [**](#userAgentSuffix)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/apify_client.d.ts#L148)externaloptionaluserAgentSuffix **userAgentSuffix? 
: string | string\[] --- # ApifyEnv Parsed representation of the Apify environment variables. This object is returned by the [Actor.getEnv](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#getEnv) function. ## Index[**](#Index) ### Properties * [**actorBuildId](#actorBuildId) * [**actorBuildNumber](#actorBuildNumber) * [**actorEventsWsUrl](#actorEventsWsUrl) * [**actorId](#actorId) * [**actorMaxPaidDatasetItems](#actorMaxPaidDatasetItems) * [**actorRunId](#actorRunId) * [**actorTaskId](#actorTaskId) * [**apiBaseUrl](#apiBaseUrl) * [**apiPublicBaseUrl](#apiPublicBaseUrl) * [**containerPort](#containerPort) * [**containerUrl](#containerUrl) * [**dedicatedCpus](#dedicatedCpus) * [**defaultDatasetId](#defaultDatasetId) * [**defaultKeyValueStoreId](#defaultKeyValueStoreId) * [**defaultRequestQueueId](#defaultRequestQueueId) * [**disableOutdatedWarning](#disableOutdatedWarning) * [**fact](#fact) * [**headless](#headless) * [**chromeExecutablePath](#chromeExecutablePath) * [**inputKey](#inputKey) * [**inputSecretsPrivateKeyFile](#inputSecretsPrivateKeyFile) * [**inputSecretsPrivateKeyPassphrase](#inputSecretsPrivateKeyPassphrase) * [**isAtHome](#isAtHome) * [**localStorageDir](#localStorageDir) * [**logFormat](#logFormat) * [**logLevel](#logLevel) * [**memoryMbytes](#memoryMbytes) * [**metaOrigin](#metaOrigin) * [**proxyHostname](#proxyHostname) * [**proxyPassword](#proxyPassword) * [**proxyPort](#proxyPort) * [**proxyStatusUrl](#proxyStatusUrl) * [**sdkLatestVersion](#sdkLatestVersion) * [**startedAt](#startedAt) * [**systemInfoIntervalMillis](#systemInfoIntervalMillis) * [**timeoutAt](#timeoutAt) * [**token](#token) * [**userId](#userId) * [**workflowKey](#workflowKey) ## Properties[**](#Properties) ### [**](#actorBuildId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1779)actorBuildId **actorBuildId: null | string ID of the Actor build used in the run. 
(ACTOR\_BUILD\_ID) ### [**](#actorBuildNumber)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1881)actorBuildNumber **actorBuildNumber: null | string ### [**](#actorEventsWsUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1882)actorEventsWsUrl **actorEventsWsUrl: null | string ### [**](#actorId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1764)actorId **actorId: null | string ID of the Actor (ACTOR\_ID) ### [**](#actorMaxPaidDatasetItems)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1883)actorMaxPaidDatasetItems **actorMaxPaidDatasetItems: null | number ### [**](#actorRunId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1769)actorRunId **actorRunId: null | string ID of the Actor run (ACTOR\_RUN\_ID) ### [**](#actorTaskId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1774)actorTaskId **actorTaskId: null | string ID of the Actor task (ACTOR\_TASK\_ID) ### [**](#apiBaseUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1844)apiBaseUrl **apiBaseUrl: null | string ### [**](#apiPublicBaseUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1845)apiPublicBaseUrl **apiPublicBaseUrl: null | string ### [**](#containerPort)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1884)containerPort **containerPort: null | number ### [**](#containerUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1885)containerUrl **containerUrl: null | string ### [**](#dedicatedCpus)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1847)dedicatedCpus **dedicatedCpus: null | string ### [**](#defaultDatasetId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1813)defaultDatasetId **defaultDatasetId: null | string ID of the dataset where input and output data of this Actor is stored (ACTOR\_DEFAULT\_DATASET\_ID) ### [**](#defaultKeyValueStoreId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1807)defaultKeyValueStoreId **defaultKeyValueStoreId: null | string ID of the key-value store where input and output data of this Actor is stored (ACTOR\_DEFAULT\_KEY\_VALUE\_STORE\_ID) ### [**](#defaultRequestQueueId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1886)defaultRequestQueueId **defaultRequestQueueId: null | string ### [**](#disableOutdatedWarning)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1848)disableOutdatedWarning **disableOutdatedWarning: null | 1 ### [**](#fact)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1849)fact **fact: null | string ### [**](#headless)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1825)headless **headless: null | string If set to "1", the web browsers inside the Actor should run in headless mode because there is no windowing system available. 
(APIFY\_HEADLESS) ### [**](#chromeExecutablePath)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1846)chromeExecutablePath **chromeExecutablePath: null | string ### [**](#inputKey)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1877)inputKey **inputKey: null | string The key of the input record in the Actor’s default key-value store (ACTOR\_INPUT\_KEY) ### [**](#inputSecretsPrivateKeyFile)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1850)inputSecretsPrivateKeyFile **inputSecretsPrivateKeyFile: null | string ### [**](#inputSecretsPrivateKeyPassphrase)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1851)inputSecretsPrivateKeyPassphrase **inputSecretsPrivateKeyPassphrase: null | string ### [**](#isAtHome)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1831)isAtHome **isAtHome: null | string Is set to "1" if the Actor is running on Apify servers. (APIFY\_IS\_AT\_HOME) ### [**](#localStorageDir)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1859)localStorageDir **localStorageDir: null | string Defines the path to a local directory where KeyValueStore, Dataset, and RequestQueue store their data. Typically, it is set to ./storage. If omitted, you should define the APIFY\_TOKEN environment variable instead. See more info on combination of this and APIFY\_TOKEN [here](https://docs.apify.com/sdk/js/sdk/js/docs/guides/environment-variables.md#combinations-of-apify_local_storage_dir-and-apify_token)(CRAWLEE\_STORAGE\_DIR) ### [**](#logFormat)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1866)logFormat **logFormat: null | string ### [**](#logLevel)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1865)logLevel **logLevel: null | string Specifies the minimum log level, which can be one of the following values (in order of severity): DEBUG, INFO, WARNING and ERROR (APIFY\_LOG\_LEVEL) ### [**](#memoryMbytes)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1819)memoryMbytes **memoryMbytes: null | number Amount of memory allocated for the Actor, in megabytes (ACTOR\_MEMORY\_MBYTES) ### [**](#metaOrigin)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1872)metaOrigin **metaOrigin: null | string Origin for the Actor run, i.e. how it was started. See [here](https://docs.apify.com/sdk/python/reference/enum/MetaOrigin) for more details. (APIFY\_META\_ORIGIN) ### [**](#proxyHostname)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1837)proxyHostname **proxyHostname: null | string ### [**](#proxyPassword)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1836)proxyPassword **proxyPassword: null | string The Apify Proxy password of the user who started the Actor. (APIFY\_PROXY\_PASSWORD) ### [**](#proxyPort)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1838)proxyPort **proxyPort: null | string ### [**](#proxyStatusUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1843)proxyStatusUrl **proxyStatusUrl: null | string You can visit this page to troubleshoot your proxy connection. 
(APIFY\_PROXY\_STATUS\_URL) ### [**](#sdkLatestVersion)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1878)sdkLatestVersion **sdkLatestVersion: null | string ### [**](#startedAt)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1796)startedAt **startedAt: null | Date Date when the Actor was started (ACTOR\_STARTED\_AT) ### [**](#systemInfoIntervalMillis)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1879)systemInfoIntervalMillis **systemInfoIntervalMillis: null | string ### [**](#timeoutAt)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1801)timeoutAt **timeoutAt: null | Date Date when the Actor will time out (ACTOR\_TIMEOUT\_AT) ### [**](#token)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1791)token **token: null | string Authentication token representing privileges given to the Actor run, it can be passed to various Apify APIs (APIFY\_TOKEN) ### [**](#userId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1785)userId **userId: null | string ID of the user who started the Actor - note that it might be different than the owner of the Actor (APIFY\_USER\_ID) ### [**](#workflowKey)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1880)workflowKey **workflowKey: null | string --- # CallOptions ### Hierarchy * ActorCallOptions * *CallOptions* ## Index[**](#Index) ### Properties * [**build](#build) * [**contentType](#contentType) * [**maxItems](#maxItems) * [**memory](#memory) * [**timeout](#timeout) * [**token](#token) * [**waitSecs](#waitSecs) * [**webhooks](#webhooks) ## Properties[**](#Properties) ### [**](#build)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L137)externaloptionalinheritedbuild **build? : string Inherited from ActorCallOptions.build Tag or number of the actor build to run (e.g. `beta` or `1.2.345`). If not provided, the run uses build tag or number from the default actor run configuration (typically `latest`). ### [**](#contentType)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L144)externaloptionalinheritedcontentType **contentType? : string Inherited from ActorCallOptions.contentType Content type for the `input`. If not specified, `input` is expected to be an object that will be stringified to JSON and content type set to `application/json; charset=utf-8`. If `options.contentType` is specified, then `input` must be a `String` or `Buffer`. ### [**](#maxItems)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L173)externaloptionalinheritedmaxItems **maxItems? : number Inherited from ActorCallOptions.maxItems Specifies maximum number of items that the actor run should return. This is used by pay per result actors to limit the maximum number of results that will be charged to customer. Value can be accessed in actor run using `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable. ### [**](#memory)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L149)externaloptionalinheritedmemory **memory? : number Inherited from ActorCallOptions.memory Memory in megabytes which will be allocated for the new actor run. If not provided, the run uses memory of the default actor run configuration. 
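Below is a minimal sketch of how these run options are typically passed to `Actor.call()`. The Actor ID, input, and option values are placeholders for illustration, not part of the reference above.

```
import { Actor } from 'apify';

await Actor.init();

// Placeholder Actor ID and input, for illustration only.
const run = await Actor.call('apify/hello-world', { message: 'Hello' }, {
    build: 'latest', // tag or number of the build to run
    memory: 1024,    // memory in megabytes allocated for the new run
    timeout: 120,    // run timeout in seconds, 0 disables the timeout
    maxItems: 50,    // cap on charged results for pay-per-result Actors
});

console.log(`Run ${run.id} finished with status ${run.status}`);

await Actor.exit();
```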
### [**](#timeout)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L154)externaloptionalinheritedtimeout **timeout? : number Inherited from ActorCallOptions.timeout Timeout for the actor run in seconds. Zero value means there is no timeout. If not provided, the run uses timeout of the default actor run configuration. ### [**](#token)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1895)optionaltoken **token? : string User API token that is used to run the Actor. By default, it is taken from the `APIFY_TOKEN` environment variable. ### [**](#waitSecs)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L176)externaloptionalinheritedwaitSecs **waitSecs? : number Inherited from ActorCallOptions.waitSecs ### [**](#webhooks)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L167)externaloptionalinheritedwebhooks **webhooks? : readonly WebhookUpdateData\[] Inherited from ActorCallOptions.webhooks Specifies optional webhooks associated with the actor run, which can be used to receive a notification e.g. when the actor finished or failed, see [ad hook webhooks documentation](https://docs.apify.com/webhooks/ad-hoc-webhooks) for detailed description. --- # CallTaskOptions ### Hierarchy * TaskCallOptions * *CallTaskOptions* ## Index[**](#Index) ### Properties * [**build](#build) * [**maxItems](#maxItems) * [**memory](#memory) * [**timeout](#timeout) * [**token](#token) * [**waitSecs](#waitSecs) * [**webhooks](#webhooks) ## Properties[**](#Properties) ### [**](#build)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L137)externaloptionalinheritedbuild **build? : string Inherited from TaskCallOptions.build Tag or number of the actor build to run (e.g. `beta` or `1.2.345`). If not provided, the run uses build tag or number from the default actor run configuration (typically `latest`). ### [**](#maxItems)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L173)externaloptionalinheritedmaxItems **maxItems? : number Inherited from TaskCallOptions.maxItems Specifies maximum number of items that the actor run should return. This is used by pay per result actors to limit the maximum number of results that will be charged to customer. Value can be accessed in actor run using `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable. ### [**](#memory)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L149)externaloptionalinheritedmemory **memory? : number Inherited from TaskCallOptions.memory Memory in megabytes which will be allocated for the new actor run. If not provided, the run uses memory of the default actor run configuration. ### [**](#timeout)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L154)externaloptionalinheritedtimeout **timeout? : number Inherited from TaskCallOptions.timeout Timeout for the actor run in seconds. Zero value means there is no timeout. If not provided, the run uses timeout of the default actor run configuration. ### [**](#token)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1902)optionaltoken **token? : string User API token that is used to run the Actor. 
By default, it is taken from the `APIFY_TOKEN` environment variable. ### [**](#waitSecs)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/task.d.ts#L87)externaloptionalinheritedwaitSecs **waitSecs? : number Inherited from TaskCallOptions.waitSecs ### [**](#webhooks)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/apify-client/src/resource_clients/actor.d.ts#L167)externaloptionalinheritedwebhooks **webhooks? : readonly WebhookUpdateData\[] Inherited from TaskCallOptions.webhooks Specifies optional webhooks associated with the actor run, which can be used to receive a notification e.g. when the actor finished or failed, see [ad hook webhooks documentation](https://docs.apify.com/webhooks/ad-hoc-webhooks) for detailed description. --- # ChargeOptions ## Index[**](#Index) ### Properties * [**count](#count) * [**eventName](#eventName) ## Properties[**](#Properties) ### [**](#count)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L294)optionalcount **count? : number ### [**](#eventName)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L293)eventName **eventName: string --- # ChargeResult ## Index[**](#Index) ### Properties * [**eventChargeLimitReached](#eventChargeLimitReached) * [**chargeableWithinLimit](#chargeableWithinLimit) * [**chargedCount](#chargedCount) ## Properties[**](#Properties) ### [**](#eventChargeLimitReached)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L298)eventChargeLimitReached **eventChargeLimitReached: boolean ### [**](#chargeableWithinLimit)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L300)chargeableWithinLimit **chargeableWithinLimit: Record\ ### [**](#chargedCount)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/charging.ts#L299)chargedCount **chargedCount: number --- # ConfigurationOptions ### Hierarchy * ConfigurationOptions * *ConfigurationOptions* ## Index[**](#Index) ### Properties * [**actorEventsWsUrl](#actorEventsWsUrl) * [**actorId](#actorId) * [**actorRunId](#actorRunId) * [**actorTaskId](#actorTaskId) * [**apiBaseUrl](#apiBaseUrl) * [**apiPublicBaseUrl](#apiPublicBaseUrl) * [**availableMemoryRatio](#availableMemoryRatio) * [**containerized](#containerized) * [**containerPort](#containerPort) * [**containerUrl](#containerUrl) * [**defaultBrowserPath](#defaultBrowserPath) * [**defaultDatasetId](#defaultDatasetId) * [**defaultKeyValueStoreId](#defaultKeyValueStoreId) * [**defaultRequestQueueId](#defaultRequestQueueId) * [**disableBrowserSandbox](#disableBrowserSandbox) * [**eventManager](#eventManager) * [**headless](#headless) * [**chromeExecutablePath](#chromeExecutablePath) * [**inputKey](#inputKey) * [**inputSecretsPrivateKeyFile](#inputSecretsPrivateKeyFile) * [**inputSecretsPrivateKeyPassphrase](#inputSecretsPrivateKeyPassphrase) * [**isAtHome](#isAtHome) * [**logLevel](#logLevel) * [**maxTotalChargeUsd](#maxTotalChargeUsd) * [**maxUsedCpuRatio](#maxUsedCpuRatio) * [**memoryMbytes](#memoryMbytes) * [**metamorphAfterSleepMillis](#metamorphAfterSleepMillis) * [**metaOrigin](#metaOrigin) * [**persistStateIntervalMillis](#persistStateIntervalMillis) * [**persistStorage](#persistStorage) * [**proxyHostname](#proxyHostname) * [**proxyPassword](#proxyPassword) * [**proxyPort](#proxyPort) * [**proxyStatusUrl](#proxyStatusUrl) * [**purgeOnStart](#purgeOnStart) * [**standbyPort](#standbyPort) * [**standbyUrl](#standbyUrl) 
* [**storageClient](#storageClient) * [**storageClientOptions](#storageClientOptions) * [**systemInfoIntervalMillis](#systemInfoIntervalMillis) * [**systemInfoV2](#systemInfoV2) * [**testPayPerEvent](#testPayPerEvent) * [**token](#token) * [**useChargingLogDataset](#useChargingLogDataset) * [**userId](#userId) * [**xvfb](#xvfb) ## Properties[**](#Properties) ### [**](#actorEventsWsUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L13)optionalactorEventsWsUrl **actorEventsWsUrl? : string ### [**](#actorId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L15)optionalactorId **actorId? : string ### [**](#actorRunId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L16)optionalactorRunId **actorRunId? : string ### [**](#actorTaskId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L17)optionalactorTaskId **actorTaskId? : string ### [**](#apiBaseUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L18)optionalapiBaseUrl **apiBaseUrl? : string ### [**](#apiPublicBaseUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L21)optionalapiPublicBaseUrl **apiPublicBaseUrl? : string ### [**](#availableMemoryRatio)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L66)externaloptionalinheritedavailableMemoryRatio **availableMemoryRatio? : number = 0.25 Inherited from CoreConfigurationOptions.availableMemoryRatio Sets the ratio, defining the amount of system memory that could be used by the AutoscaledPool. When the memory usage is more than the provided ratio, the memory is considered overloaded. Alternative to `CRAWLEE_AVAILABLE_MEMORY_RATIO` environment variable. ### [**](#containerized)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L150)externaloptionalinheritedcontainerized **containerized? : boolean Inherited from CoreConfigurationOptions.containerized Used in place of `isContainerized()` when collecting system metrics. Alternative to `CRAWLEE_CONTAINERIZED` environment variable. ### [**](#containerPort)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L22)optionalcontainerPort **containerPort? : number ### [**](#containerUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L23)optionalcontainerUrl **containerUrl? : string ### [**](#defaultBrowserPath)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L119)externaloptionalinheriteddefaultBrowserPath **defaultBrowserPath? : string Inherited from CoreConfigurationOptions.defaultBrowserPath Defines a path to default browser executable. Alternative to `CRAWLEE_DEFAULT_BROWSER_PATH` environment variable. ### [**](#defaultDatasetId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L31)externaloptionalinheriteddefaultDatasetId **defaultDatasetId? : string = ‘default’ Inherited from CoreConfigurationOptions.defaultDatasetId Default dataset id. Alternative to `CRAWLEE_DEFAULT_DATASET_ID` environment variable. ### [**](#defaultKeyValueStoreId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L45)externaloptionalinheriteddefaultKeyValueStoreId **defaultKeyValueStoreId? 
: string = ‘default’ Inherited from CoreConfigurationOptions.defaultKeyValueStoreId Default key-value store id. Alternative to `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` environment variable. ### [**](#defaultRequestQueueId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L52)externaloptionalinheriteddefaultRequestQueueId **defaultRequestQueueId? : string = ‘default’ Inherited from CoreConfigurationOptions.defaultRequestQueueId Default request queue id. Alternative to `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` environment variable. ### [**](#disableBrowserSandbox)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L125)externaloptionalinheriteddisableBrowserSandbox **disableBrowserSandbox? : boolean Inherited from CoreConfigurationOptions.disableBrowserSandbox Defines whether to disable browser sandbox by adding `--no-sandbox` flag to `launchOptions`. Alternative to `CRAWLEE_DISABLE_BROWSER_SANDBOX` environment variable. ### [**](#eventManager)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L19)externaloptionalinheritedeventManager **eventManager? : EventManager = EventManager Inherited from CoreConfigurationOptions.eventManager Defines the Event Manager to be used. ### [**](#headless)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L100)externaloptionalinheritedheadless **headless? : boolean = true Inherited from CoreConfigurationOptions.headless Defines whether web browsers launched by Crawlee will run in the headless mode. Alternative to `CRAWLEE_HEADLESS` environment variable. ### [**](#chromeExecutablePath)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L113)externaloptionalinheritedchromeExecutablePath **chromeExecutablePath? : string Inherited from CoreConfigurationOptions.chromeExecutablePath Defines a path to Chrome executable. Alternative to `CRAWLEE_CHROME_EXECUTABLE_PATH` environment variable. ### [**](#inputKey)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L93)externaloptionalinheritedinputKey **inputKey? : string = ‘INPUT’ Inherited from CoreConfigurationOptions.inputKey Defines the default input key, i.e. the key that is used to get the crawler input value from the default [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) associated with the current crawler run. Alternative to `CRAWLEE_INPUT_KEY` environment variable. ### [**](#inputSecretsPrivateKeyFile)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L36)optionalinputSecretsPrivateKeyFile **inputSecretsPrivateKeyFile? : string ### [**](#inputSecretsPrivateKeyPassphrase)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L35)optionalinputSecretsPrivateKeyPassphrase **inputSecretsPrivateKeyPassphrase? : string ### [**](#isAtHome)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L33)optionalisAtHome **isAtHome? : boolean ### [**](#logLevel)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L132)externaloptionalinheritedlogLevel **logLevel? 
: [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) | (radix) => string | (fractionDigits) => string | (fractionDigits) => string | (precision) => string | () => number | ({ (locales, options): string; (locales, options): string }) = [LogLevel](https://docs.apify.com/sdk/js/sdk/js/reference/enum/LogLevel.md) | (radix) => string | (fractionDigits) => string | (fractionDigits) => string | (precision) => string | () => number | ({ (locales, options): string; (locales, options): string }) Inherited from CoreConfigurationOptions.logLevel Sets the log level to the given value. Alternative to `CRAWLEE_LOG_LEVEL` environment variable. ### [**](#maxTotalChargeUsd)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L37)optionalmaxTotalChargeUsd **maxTotalChargeUsd? : number ### [**](#maxUsedCpuRatio)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L58)externaloptionalinheritedmaxUsedCpuRatio **maxUsedCpuRatio? : number = 0.95 Inherited from CoreConfigurationOptions.maxUsedCpuRatio Sets the ratio, defining the maximum CPU usage. When the CPU usage is higher than the provided ratio, the CPU is considered overloaded. ### [**](#memoryMbytes)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L73)externaloptionalinheritedmemoryMbytes **memoryMbytes? : number Inherited from CoreConfigurationOptions.memoryMbytes Sets the amount of system memory in megabytes to be used by the AutoscaledPool. By default, the maximum memory is set to one quarter of total system memory. Alternative to `CRAWLEE_MEMORY_MBYTES` environment variable. ### [**](#metamorphAfterSleepMillis)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L12)optionalmetamorphAfterSleepMillis **metamorphAfterSleepMillis? : number ### [**](#metaOrigin)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L38)optionalmetaOrigin **metaOrigin? : DEVELOPMENT | WEB | API | SCHEDULER | TEST | WEBHOOK | ACTOR | CLI | STANDBY ### [**](#persistStateIntervalMillis)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L80)externaloptionalinheritedpersistStateIntervalMillis **persistStateIntervalMillis? : number = 60\_000 Inherited from CoreConfigurationOptions.persistStateIntervalMillis Defines the interval of emitting the `persistState` event. Alternative to `CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS` environment variable. ### [**](#persistStorage)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L138)externaloptionalinheritedpersistStorage **persistStorage? : boolean Inherited from CoreConfigurationOptions.persistStorage Defines whether the storage client used should persist the data it stores. Alternative to `CRAWLEE_PERSIST_STORAGE` environment variable. ### [**](#proxyHostname)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L24)optionalproxyHostname **proxyHostname? : string ### [**](#proxyPassword)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L25)optionalproxyPassword **proxyPassword? : string ### [**](#proxyPort)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L26)optionalproxyPort **proxyPort? 
: number ### [**](#proxyStatusUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L27)optionalproxyStatusUrl **proxyStatusUrl? : string ### [**](#purgeOnStart)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L38)externaloptionalinheritedpurgeOnStart **purgeOnStart? : boolean = true Inherited from CoreConfigurationOptions.purgeOnStart Defines whether to purge the default storage folders before starting the crawler run. Alternative to `CRAWLEE_PURGE_ON_START` environment variable. ### [**](#standbyPort)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L31)optionalstandbyPort **standbyPort? : number * **@deprecated** use `containerPort` instead ### [**](#standbyUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L32)optionalstandbyUrl **standbyUrl? : string ### [**](#storageClient)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L14)externaloptionalinheritedstorageClient **storageClient? : StorageClient = StorageClient Inherited from CoreConfigurationOptions.storageClient Defines storage client to be used. ### [**](#storageClientOptions)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L24)externaloptionalinheritedstorageClientOptions **storageClientOptions? : Dictionary Inherited from CoreConfigurationOptions.storageClientOptions Could be used to adjust the storage client behavior e.g. MemoryStorageOptions could be used to adjust the MemoryStorage behavior. ### [**](#systemInfoIntervalMillis)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L85)externaloptionalinheritedsystemInfoIntervalMillis **systemInfoIntervalMillis? : number = 1\_000 Inherited from CoreConfigurationOptions.systemInfoIntervalMillis Defines the interval of emitting the `systemInfo` event. ### [**](#systemInfoV2)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L144)externaloptionalinheritedsystemInfoV2 **systemInfoV2? : boolean Inherited from CoreConfigurationOptions.systemInfoV2 Defines whether to use the systemInfoV2 metric collection experiment. Alternative to `CRAWLEE_SYSTEM_INFO_V2` environment variable. ### [**](#testPayPerEvent)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L39)optionaltestPayPerEvent **testPayPerEvent? : boolean ### [**](#token)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L14)optionaltoken **token? : string ### [**](#useChargingLogDataset)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L40)optionaluseChargingLogDataset **useChargingLogDataset? : boolean ### [**](#userId)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/configuration.ts#L34)optionaluserId **userId? : string ### [**](#xvfb)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/configuration.d.ts#L107)externaloptionalinheritedxvfb **xvfb? : boolean = false Inherited from CoreConfigurationOptions.xvfb Defines whether to run X virtual framebuffer on the web browsers launched by Crawlee. Alternative to `CRAWLEE_XVFB` environment variable. --- # externalDatasetConsumer \ User-function used in the `Dataset.forEach()` API. 
### Callable * ****DatasetConsumer**(item, index): Awaitable\ *** * #### Parameters * ##### externalitem: Data Current [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) entry being processed. * ##### externalindex: number Position of current [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) entry. #### Returns Awaitable\ --- # externalDatasetContent \ ## Index[**](#Index) ### Properties * [**count](#count) * [**desc](#desc) * [**items](#items) * [**limit](#limit) * [**offset](#offset) * [**total](#total) ## Properties[**](#Properties) ### [**](#count)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L420)externalcount **count: number Count of dataset entries returned in this set. ### [**](#desc)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L428)externaloptionaldesc **desc? : boolean Should the results be in descending order. ### [**](#items)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L426)externalitems **items: Data\[] Dataset entries based on chosen format parameter. ### [**](#limit)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L424)externallimit **limit: number Maximum number of dataset entries requested. ### [**](#offset)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L422)externaloffset **offset: number Position of the first returned entry in the dataset. ### [**](#total)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L418)externaltotal **total: number Total count of entries in the dataset. --- # externalDatasetDataOptions ## Index[**](#Index) ### Properties * [**clean](#clean) * [**desc](#desc) * [**fields](#fields) * [**limit](#limit) * [**offset](#offset) * [**skipEmpty](#skipEmpty) * [**skipHidden](#skipHidden) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#clean)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L56)externaloptionalclean **clean? : boolean = false If `true` then the function returns only non-empty items and skips hidden fields (i.e. fields starting with `#` character). Note that the `clean` parameter is a shortcut for `skipHidden: true` and `skipEmpty: true` options. ### [**](#desc)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L41)externaloptionaldesc **desc? : boolean = false If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L45)externaloptionalfields **fields? : string\[] An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#limit)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L35)externaloptionallimit **limit? : number = 250000 Maximum number of array elements to return. ### [**](#offset)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L30)externaloptionaloffset **offset? : number = 0 Number of array elements that should be skipped at the start. 
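For illustration, a minimal sketch of passing the options above to `Dataset.getData()`. The field names and option values are placeholders.

```
import { Actor } from 'apify';

await Actor.init();

const dataset = await Actor.openDataset();

// 'url' and 'title' are placeholder field names, for illustration only.
const { items, total } = await dataset.getData({
    clean: true,              // skip empty items and hidden (#-prefixed) fields
    fields: ['url', 'title'], // include only these fields in the results
    desc: true,               // newest items first
    offset: 0,
    limit: 100,
});

console.log(`Fetched ${items.length} of ${total} items`);

await Actor.exit();
```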
### [**](#skipEmpty)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L67)externaloptionalskipEmpty **skipEmpty? : boolean = false If `true` then the function doesn't return empty items. Note that in this case the returned number of items might be lower than limit parameter and pagination must be done using the `limit` value. ### [**](#skipHidden)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L61)externaloptionalskipHidden **skipHidden? : boolean = false If `true` then the function doesn't return hidden fields (fields starting with "#" character). ### [**](#unwind)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L50)externaloptionalunwind **unwind? : string Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # externalDatasetIteratorOptions ### Hierarchy * Omit<[DatasetDataOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/DatasetDataOptions.md), offset | limit | clean | skipHidden | skipEmpty> * *DatasetIteratorOptions* ## Index[**](#Index) ### Properties * [**desc](#desc) * [**fields](#fields) * [**unwind](#unwind) ## Properties[**](#Properties) ### [**](#desc)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L41)externaloptionalinheriteddesc **desc? : boolean = false Inherited from Omit.desc If `true` then the objects are sorted by `createdAt` in descending order. Otherwise they are sorted in ascending order. ### [**](#fields)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L45)externaloptionalinheritedfields **fields? : string\[] Inherited from Omit.fields An array of field names that will be included in the result. If omitted, all fields are included in the results. ### [**](#unwind)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L50)externaloptionalinheritedunwind **unwind? : string Inherited from Omit.unwind Specifies a name of the field in the result objects that will be used to unwind the resulting objects. By default, the results are returned as they are. --- # externalDatasetMapper \ User-function used in the `Dataset.map()` API. ### Callable * ****DatasetMapper**(item, index): Awaitable\ *** * User-function used in the `Dataset.map()` API. *** #### Parameters * ##### externalitem: Data Current [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) entry being processed. * ##### externalindex: number Position of current [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) entry. #### Returns Awaitable\ --- # externalDatasetOptions ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) ## Properties[**](#Properties) ### [**](#client)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L414)externalclient **client: StorageClient ### [**](#id)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L412)externalid **id: string ### [**](#name)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/dataset.d.ts#L413)externaloptionalname **name? : string --- # externalDatasetReducer \ User-function used in the `Dataset.reduce()` API. 
### Callable * ****DatasetReducer**(memo, item, index): Awaitable\ *** * #### Parameters * ##### externalmemo: T Previous state of the reduction. * ##### externalitem: Data Current [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) entry being processed. * ##### externalindex: number Position of current [Dataset](https://docs.apify.com/sdk/js/sdk/js/reference/class/Dataset.md) entry. #### Returns Awaitable\ --- # ExitOptions ### Hierarchy * *ExitOptions* * [MainOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/MainOptions.md) ## Index[**](#Index) ### Properties * [**exit](#exit) * [**exitCode](#exitCode) * [**statusMessage](#statusMessage) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#exit)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1981)optionalexit **exit? : boolean Call `process.exit()`? Defaults to true ### [**](#exitCode)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1979)optionalexitCode **exitCode? : number Exit code, defaults to 0 ### [**](#statusMessage)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1972)optionalstatusMessage **statusMessage? : string Exit with given status message ### [**](#timeoutSecs)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1977)optionaltimeoutSecs **timeoutSecs? : number = 30 Amount of time, in seconds, to wait for all event handlers to finish before exiting the process. --- # InitOptions ### Hierarchy * *InitOptions* * [MainOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/MainOptions.md) ## Index[**](#Index) ### Properties * [**storage](#storage) ## Properties[**](#Properties) ### [**](#storage)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1748)optionalstorage **storage? : StorageClient --- # externalKeyConsumer User-function used in the [KeyValueStore.forEachKey](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md#forEachKey) method. ### Callable * ****KeyConsumer**(key, index, info): Awaitable\ *** * #### Parameters * ##### externalkey: string Current [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) key being processed. * ##### externalindex: number Position of the current key in [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md). * ##### externalinfo: { size: number } Information about the current [KeyValueStore](https://docs.apify.com/sdk/js/sdk/js/reference/class/KeyValueStore.md) entry. * ##### externalsize: number Size of the value associated with the current key in bytes. #### Returns Awaitable\ --- # externalKeyValueStoreIteratorOptions ## Index[**](#Index) ### Properties * [**exclusiveStartKey](#exclusiveStartKey) ## Properties[**](#Properties) ### [**](#exclusiveStartKey)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L396)externaloptionalexclusiveStartKey **exclusiveStartKey? : string All keys up to this one (including) are skipped from the result. 
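As an illustration, a minimal sketch of a `KeyConsumer` passed to `KeyValueStore.forEachKey()` together with the `exclusiveStartKey` option described above. The key name is a placeholder.

```
import { Actor } from 'apify';

await Actor.init();

const store = await Actor.openKeyValueStore();

// The KeyConsumer receives the key, its position, and the value size in bytes.
// 'last-processed-key' is a placeholder for a previously persisted key.
await store.forEachKey(async (key, index, { size }) => {
    console.log(`#${index}: ${key} (${size} bytes)`);
}, { exclusiveStartKey: 'last-processed-key' });

await Actor.exit();
```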
--- # externalKeyValueStoreOptions ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**storageObject](#storageObject) ## Properties[**](#Properties) ### [**](#client)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L383)externalclient **client: StorageClient ### [**](#id)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L381)externalid **id: string ### [**](#name)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L382)externaloptionalname **name? : string ### [**](#storageObject)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L384)externaloptionalstorageObject **storageObject? : Record\ --- # externalLoggerOptions ## Index[**](#Index) ### Properties * [**data](#data) * [**level](#level) * [**logger](#logger) * [**maxDepth](#maxDepth) * [**maxStringLength](#maxStringLength) * [**prefix](#prefix) * [**suffix](#suffix) ## Properties[**](#Properties) ### [**](#data)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L61)externaloptionaldata **data? : Record\ Additional data to be added to each log line. ### [**](#level)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L46)externaloptionallevel **level? : number Sets the log level to the given value, preventing messages from less important log levels from being printed to the console. Use in conjunction with the `log.LEVELS` constants. ### [**](#logger)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L59)externaloptionallogger **logger? : [Logger](https://docs.apify.com/sdk/js/sdk/js/reference/class/Logger.md) Logger implementation to be used. The default is log.LoggerText, which logs messages as easily readable strings. Optionally you can use `log.LoggerJson`, which formats each log line as JSON. ### [**](#maxDepth)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L48)externaloptionalmaxDepth **maxDepth? : number Max depth of the data object that will be logged. Anything deeper than the limit will be stripped off. ### [**](#maxStringLength)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L50)externaloptionalmaxStringLength **maxStringLength? : number Max length of the string to be logged. Longer strings will be truncated. ### [**](#prefix)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L52)externaloptionalprefix **prefix? : null | string Prefix to be prepended to each logged line. ### [**](#suffix)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@apify/log/src/index.d.ts#L54)externaloptionalsuffix **suffix? : null | string Suffix that will be appended to each logged line. 
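A short sketch of applying the options above to the shared logger, assuming the `log` and `LogLevel` exports of the `apify` package. The prefix and data values are illustrative.

```
import { Actor, log, LogLevel } from 'apify';

await Actor.init();

log.setOptions({
    level: LogLevel.DEBUG,          // also print DEBUG messages
    prefix: 'MyActor',              // illustrative prefix prepended to each line
    maxDepth: 4,                    // limit how deep nested data objects are logged
    data: { component: 'scraper' }, // extra fields added to every log line
});

log.debug('Fetched page', { url: 'https://example.com', status: 200 });

await Actor.exit();
```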
--- # MainOptions ### Hierarchy * [ExitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ExitOptions.md) * [InitOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/InitOptions.md) * *MainOptions* ## Index[**](#Index) ### Properties * [**exit](#exit) * [**exitCode](#exitCode) * [**statusMessage](#statusMessage) * [**storage](#storage) * [**timeoutSecs](#timeoutSecs) ## Properties[**](#Properties) ### [**](#exit)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1981)optionalinheritedexit **exit? : boolean Inherited from ExitOptions.exit Call `process.exit()`? Defaults to true ### [**](#exitCode)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1979)optionalinheritedexitCode **exitCode? : number Inherited from ExitOptions.exitCode Exit code, defaults to 0 ### [**](#statusMessage)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1972)optionalinheritedstatusMessage **statusMessage? : string Inherited from ExitOptions.statusMessage Exit with given status message ### [**](#storage)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1748)optionalinheritedstorage **storage? : StorageClient Inherited from InitOptions.storage ### [**](#timeoutSecs)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1977)optionalinheritedtimeoutSecs **timeoutSecs? : number = 30 Inherited from ExitOptions.timeoutSecs Amount of time, in seconds, to wait for all event handlers to finish before exiting the process. --- # MetamorphOptions ## Index[**](#Index) ### Properties * [**build](#build) * [**contentType](#contentType) ## Properties[**](#Properties) ### [**](#build)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1959)optionalbuild **build? : string Tag or number of the target Actor build to metamorph into (e.g. `beta` or `1.2.345`). If not provided, the run uses build tag or number from the default Actor run configuration (typically `latest`). ### [**](#contentType)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1953)optionalcontentType **contentType? : string Content type for the `input`. If not specified, `input` is expected to be an object that will be stringified to JSON and content type set to `application/json; charset=utf-8`. If `options.contentType` is specified, then `input` must be a `String` or `Buffer`. --- # OpenStorageOptions ## Index[**](#Index) ### Properties * [**forceCloud](#forceCloud) ## Properties[**](#Properties) ### [**](#forceCloud)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1990)optionalforceCloud **forceCloud? : boolean = false If set to `true` then the cloud storage is used even if the `CRAWLEE_STORAGE_DIR` environment variable is set. This way it is possible to combine local and cloud storage. 
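For illustration, a minimal sketch of how `MainOptions` and `OpenStorageOptions` might be used together (the dataset name is illustrative):

```
import { Actor } from 'apify';

// Actor.main() wraps Actor.init() and Actor.exit(); MainOptions combines
// the InitOptions and ExitOptions described above.
await Actor.main(async () => {
    // forceCloud makes the SDK use Apify cloud storage even when running locally.
    const dataset = await Actor.openDataset('my-results', { forceCloud: true });
    await dataset.pushData({ hello: 'world' });
}, {
    statusMessage: 'Finished successfully', // ExitOptions.statusMessage
    timeoutSecs: 10,                        // wait up to 10 s for event handlers
});
```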
--- # ProxyConfigurationOptions ### Hierarchy * ProxyConfigurationOptions * *ProxyConfigurationOptions* ## Index[**](#Index) ### Properties * [**apifyProxyCountry](#apifyProxyCountry) * [**apifyProxyGroups](#apifyProxyGroups) * [**countryCode](#countryCode) * [**groups](#groups) * [**newUrlFunction](#newUrlFunction) * [**password](#password) * [**proxyUrls](#proxyUrls) * [**tieredProxyConfig](#tieredProxyConfig) * [**tieredProxyUrls](#tieredProxyUrls) ## Properties[**](#Properties) ### [**](#apifyProxyCountry)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L59)optionalapifyProxyCountry **apifyProxyCountry? : string Same option as `countryCode`, which can be used to configure the proxy via the UI input schema. You should use the `countryCode` option in your crawler code. ### [**](#apifyProxyGroups)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L53)optionalapifyProxyGroups **apifyProxyGroups? : string\[] Same option as `groups`, which can be used to configure the proxy via the UI input schema. You should use the `groups` option in your crawler code. ### [**](#countryCode)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L47)optionalcountryCode **countryCode? : string If set and relevant proxies are available in your Apify account, all proxied requests will use IP addresses that are geolocated to the specified country. For example `GB` for IPs from Great Britain. Note that online services often have their own rules for handling geolocation and thus the country selection is a best attempt at geolocation, rather than a guaranteed hit. This parameter is optional; by default, each proxied request is assigned an IP address from a random country. The country code needs to be a two-letter ISO country code. See the [full list of available country codes](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements). By default, the proxy uses all available proxy servers from all countries. ### [**](#groups)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L34)optionalgroups **groups? : string\[] An array of proxy groups to be used by the [Apify Proxy](https://docs.apify.com/proxy). If not provided, the proxy will select the groups automatically. ### [**](#newUrlFunction)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L20)externaloptionalinheritednewUrlFunction **newUrlFunction? : ProxyConfigurationFunction Inherited from CoreProxyConfigurationOptions.newUrlFunction Custom function that allows you to generate the new proxy URL dynamically. It gets the `sessionId` as a parameter and an optional parameter with the `Request` object when applicable. Can return either a stringified proxy URL or `null` if the proxy should not be used. Can be asynchronous. This function is used to generate the URL when [ProxyConfiguration.newUrl](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md#newUrl) or [ProxyConfiguration.newProxyInfo](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md#newProxyInfo) is called. ### [**](#password)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L28)optionalpassword **password? : string User's password for the proxy.
By default, it is taken from the `APIFY_PROXY_PASSWORD` environment variable, which is automatically set by the system when running the Actors. ### [**](#proxyUrls)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L13)externaloptionalinheritedproxyUrls **proxyUrls? : string\[] Inherited from CoreProxyConfigurationOptions.proxyUrls An array of custom proxy URLs to be rotated. Custom proxies are not compatible with Apify Proxy and an attempt to use both configuration options will cause an error to be thrown on initialization. ### [**](#tieredProxyConfig)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L65)optionaltieredProxyConfig **tieredProxyConfig? : Omit<[ProxyConfigurationOptions](https://docs.apify.com/sdk/js/sdk/js/reference/interface/ProxyConfigurationOptions.md), keyof ProxyConfigurationOptions | tieredProxyConfig>\[] Multiple different ProxyConfigurationOptions stratified into tiers. Crawlee crawlers will switch between those tiers based on the blocked request statistics. ### [**](#tieredProxyUrls)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L32)externaloptionalinheritedtieredProxyUrls **tieredProxyUrls? : (null | string)\[]\[] Inherited from CoreProxyConfigurationOptions.tieredProxyUrls An array of custom proxy URLs to be rotated stratified in tiers. This is a more advanced version of `proxyUrls` that allows you to define a hierarchy of proxy URLs. If everything goes well, all the requests will be sent through the first proxy URL in the list. Whenever the crawler encounters a problem with the current proxy on the given domain, it will switch to the higher tier for this domain. The crawler probes lower-level proxies at intervals to check if it can make the tier downshift. This feature is useful when you have a set of proxies with different performance characteristics (speed, price, antibot performance etc.) and you want to use the best one for each domain. Use `null` as a proxy URL to disable the proxy for the given tier. --- # ProxyInfo The main purpose of the ProxyInfo object is to provide information about the current proxy connection used by the crawler for the request. Outside of crawlers, you can get this object by calling [ProxyConfiguration.newProxyInfo](https://docs.apify.com/sdk/js/sdk/js/reference/class/ProxyConfiguration.md#newProxyInfo). **Example usage:**

```
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['GROUP1', 'GROUP2'], // List of Apify Proxy groups
  countryCode: 'US',
});

// Getting proxyInfo object by calling class method directly
const proxyInfo = await proxyConfiguration.newProxyInfo();

// In crawler
const crawler = new CheerioCrawler({
  // ...
  proxyConfiguration,
  requestHandler({ proxyInfo }) {
    // Getting used proxy URL
    const proxyUrl = proxyInfo.url;

    // Getting ID of used Session
    const sessionIdentifier = proxyInfo.sessionId;
  }
})
```

### Hierarchy * ProxyInfo * *ProxyInfo* ## Index[**](#Index) ### Properties * [**countryCode](#countryCode) * [**groups](#groups) * [**hostname](#hostname) * [**password](#password) * [**port](#port) * [**proxyTier](#proxyTier) * [**sessionId](#sessionId) * [**url](#url) * [**username](#username) ## Properties[**](#Properties) ### [**](#countryCode)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L117)optionalcountryCode **countryCode?
: string If set and relevant proxies are available in your Apify account, all proxied requests will use IP addresses that are geolocated to the specified country. For example `GB` for IPs from Great Britain. Note that online services often have their own rules for handling geolocation and thus the country selection is a best attempt at geolocation, rather than a guaranteed hit. This parameter is optional, by default, each proxied request is assigned an IP address from a random country. The country code needs to be a two letter ISO country code. See the [full list of available country codes](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements). This parameter is optional, by default, the proxy uses all available proxy servers from all countries. ### [**](#groups)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L105)groups **groups: string\[] An array of proxy groups to be used by the [Apify Proxy](https://docs.apify.com/proxy). If not provided, the proxy will select the groups automatically. ### [**](#hostname)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L88)externalinheritedhostname **hostname: string Inherited from CoreProxyInfo.hostname Hostname of your proxy. ### [**](#password)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/proxy_configuration.ts#L124)password **password: string Overrides CoreProxyInfo.password User's password for the proxy. By default, it is taken from the `APIFY_PROXY_PASSWORD` environment variable, which is automatically set by the system when running the Actors on the Apify cloud, or when using the [Apify CLI](https://github.com/apify/apify-cli). ### [**](#port)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L92)externalinheritedport **port: string | number Inherited from CoreProxyInfo.port Proxy port. ### [**](#proxyTier)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L96)externaloptionalinheritedproxyTier **proxyTier? : number Inherited from CoreProxyInfo.proxyTier Proxy tier for the current proxy, if applicable (only for `tieredProxyUrls`). ### [**](#sessionId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L72)externaloptionalinheritedsessionId **sessionId? : string Inherited from CoreProxyInfo.sessionId The identifier of used Session, if used. ### [**](#url)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L76)externalinheritedurl **url: string Inherited from CoreProxyInfo.url The URL of the proxy. ### [**](#username)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/proxy_configuration.d.ts#L80)externaloptionalinheritedusername **username? : string Inherited from CoreProxyInfo.username Username for the proxy. --- # externalQueueOperationInfo A helper class that is used to report results from various [RequestQueue](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md) functions as well as enqueueLinks. 
## Index[**](#Index) ### Properties * [**requestId](#requestId) * [**wasAlreadyHandled](#wasAlreadyHandled) * [**wasAlreadyPresent](#wasAlreadyPresent) ## Properties[**](#Properties) ### [**](#requestId)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/types/storages.d.ts#L12)externalrequestId **requestId: string The ID of the added request ### [**](#wasAlreadyHandled)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/types/storages.d.ts#L10)externalwasAlreadyHandled **wasAlreadyHandled: boolean Indicates if request was already marked as handled. ### [**](#wasAlreadyPresent)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/types/storages.d.ts#L8)externalwasAlreadyPresent **wasAlreadyPresent: boolean Indicates if request was already present in the queue. --- # RebootOptions --- # externalRecordOptions ## Index[**](#Index) ### Properties * [**contentType](#contentType) ## Properties[**](#Properties) ### [**](#contentType)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/key_value_store.d.ts#L390)externaloptionalcontentType **contentType? : string Specifies a custom MIME content type of the record. --- # externalRequestQueueOperationOptions ## Index[**](#Index) ### Properties * [**forefront](#forefront) ## Properties[**](#Properties) ### [**](#forefront)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L256)externaloptionalforefront **forefront? : boolean = false If set to `true`: * while adding the request to the queue: the request will be added to the foremost position in the queue. * while reclaiming the request: the request will be placed to the beginning of the queue, so that it's returned in the next call to [RequestQueue.fetchNextRequest](https://docs.apify.com/sdk/js/sdk/js/reference/class/RequestQueue.md#fetchNextRequest). By default, it's put to the end of the queue. In case the request is already present in the queue, this option has no effect. If more requests are added with this option at once, their order in the following `fetchNextRequest` call is arbitrary. --- # externalRequestQueueOptions * **@deprecated** Use RequestProviderOptions instead. ### Hierarchy * RequestProviderOptions * *RequestQueueOptions* ## Index[**](#Index) ### Properties * [**client](#client) * [**id](#id) * [**name](#name) * [**proxyConfiguration](#proxyConfiguration) ## Properties[**](#Properties) ### [**](#client)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L221)externalinheritedclient **client: StorageClient Inherited from RequestProviderOptions.client ### [**](#id)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L219)externalinheritedid **id: string Inherited from RequestProviderOptions.id ### [**](#name)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L220)externaloptionalinheritedname **name? : string Inherited from RequestProviderOptions.name ### [**](#proxyConfiguration)[**](https://undefined/apify/apify-sdk-js/blob/master/node_modules/@crawlee/core/storages/request_provider.d.ts#L227)externaloptionalinheritedproxyConfiguration **proxyConfiguration? : ProxyConfiguration Inherited from RequestProviderOptions.proxyConfiguration Used to pass the proxy configuration for the `requestsFromUrl` objects. 
Takes advantage of the internal address rotation and authentication process. If undefined, the `requestsFromUrl` requests will be made without proxy. --- # WebhookOptions ## Index[**](#Index) ### Properties * [**eventTypes](#eventTypes) * [**idempotencyKey](#idempotencyKey) * [**payloadTemplate](#payloadTemplate) * [**requestUrl](#requestUrl) ## Properties[**](#Properties) ### [**](#eventTypes)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1920)eventTypes **eventTypes: readonly WebhookEventType\[] Array of event types that you can set for the Actor run; see the [Actor run events](https://docs.apify.com/webhooks/events#actor-run) in the Apify docs. ### [**](#idempotencyKey)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1943)optionalidempotencyKey **idempotencyKey? : string Idempotency key enables you to ensure that a webhook will not be added multiple times in case of an Actor restart or other situation that would cause the `addWebhook()` function to be called again. We suggest using the Actor run ID as the idempotency key. You can get the run ID by calling the [Actor.getEnv](https://docs.apify.com/sdk/js/sdk/js/reference/class/Actor.md#getEnv) function. ### [**](#payloadTemplate)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1935)optionalpayloadTemplate **payloadTemplate? : string Payload template is a JSON-like string that describes the structure of the webhook POST request payload. It uses JSON syntax, extended with a double curly braces syntax for injecting variables `{{variable}}`. Those variables are resolved at the time of the webhook's dispatch, and a list of available variables with their descriptions is available in the [Apify webhook documentation](https://docs.apify.com/webhooks). If `payloadTemplate` is omitted, the default payload template is used ([view docs](https://docs.apify.com/webhooks/actions#payload-template)). ### [**](#requestUrl)[**](https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/actor.ts#L1925)requestUrl **requestUrl: string URL that will be requested with an HTTP POST request when the Actor run reaches the set event type.
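A minimal sketch of registering such a webhook with `Actor.addWebhook()`, following the run-ID suggestion above (the handler URL is illustrative):

```
import { Actor } from 'apify';

await Actor.init();

// Use the run ID as the idempotency key, so restarts don't add the webhook twice.
const { actorRunId } = Actor.getEnv();

await Actor.addWebhook({
    eventTypes: ['ACTOR.RUN.SUCCEEDED'],
    requestUrl: 'https://example.com/my-webhook-handler', // illustrative endpoint
    idempotencyKey: actorRunId ?? undefined,
});

await Actor.exit();
```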
--- # Apify SDK for JavaScript and Node.js ## Toolkit for building [Actors](https://docs.apify.com/actors)—serverless microservices running (not only) on the Apify platform. [Get started](https://docs.apify.com/sdk/js/sdk/js/docs/guides/apify-platform.md) | [GitHub](https://github.com/apify/apify-sdk-js)

```
npx apify-cli create my-crawler
```

## Apify SDK v3 is out 🚀
What's new? Read below 👇 Four years ago, Apify released its **open-source Node.js** library for web scraping and automation, **Apify SDK**. It became popular among the community, but there was a problem. Despite being open-source, **the library's name caused users to think its features were restricted to the Apify platform**, which was never the case. With this in mind, we decided to split Apify SDK into two libraries, [**Crawlee**](https://crawlee.dev) and **Apify SDK v3**. **Crawlee** will retain all the **crawling and scraping-related tools** and will always strive to be the best web scraping library for its community. At the same time, **Apify SDK** will continue to exist, but keep only the Apify-specific features related to **building actors** on the [**Apify platform**](https://apify.com). ![](/sdk/js/img/sdk-split-light.png)![](/sdk/js/img/sdk-split-dark.png) ## How it works now ### Outside of the Apify platform If you want to use the **crawling functionality** of Apify SDK v2 outside of the Apify platform, head to the [**Crawlee documentation**](https://crawlee.dev) to get started. The interface is almost exactly the same as the original SDK, but we've made a lot of improvements under the hood to improve the developer experience.

```
npm install crawlee
```

### On the Apify platform In **Apify SDK v2**, both the **crawling and actor building logic were mixed** together. This made it easy to build crawlers on the Apify platform, but confusing to build anything else. **Apify SDK v3 includes only the Apify platform-specific functionality**. To build crawlers on the Apify platform, you need to combine it with [**Crawlee**](https://crawlee.dev). Or you can use it standalone for other projects. ### Build a crawler like you're used to The following example shows how to build an **SDK-v2-like crawler on the Apify platform**. To use `PlaywrightCrawler`, you need to install three libraries: Apify SDK v3, Crawlee, and Playwright. In v2, you only needed to install Apify SDK v2 and Playwright.

```
npm install apify crawlee playwright
```

Don't forget about module imports: To run the example, add a `"type": "module"` clause into your `package.json` or copy it into a file with an `.mjs` suffix. This enables `import` statements in Node.js. See [Node.js docs](https://nodejs.org/dist/latest-v16.x/docs/api/esm.html#enabling) for more information.

```
// Apify SDK v3 uses named exports instead of the Apify object.
// You can import Dataset, KeyValueStore and more.
import { Actor } from 'apify';
// We moved all the crawling components to Crawlee.
// See the documentation on https://crawlee.dev
import { PlaywrightCrawler } from 'crawlee';

// Initialize the actor on the platform. This function connects your
// actor to platform events, storages and API. It replaces Apify.main()
await Actor.init();

const crawler = new PlaywrightCrawler({
    // handle(Page|Request)Functions of all Crawlers
    // are now simply called a requestHandler.
    async requestHandler({ request, page, enqueueLinks }) {
        const title = await page.title();
        console.log(`Title of ${request.loadedUrl} is '${title}'`);

        // Use Actor instead of the Apify object to save data.
        await Actor.pushData({ title, url: request.loadedUrl });

        // We simplified enqueuing links a lot, see the docs.
        // This way the function adds only links to same hostname.
        await enqueueLinks();
    }
});

// You can now add requests to the queue directly from the run function.
// No need to create an instance of the queue separately.
await crawler.run(['https://crawlee.dev']);

// This function disconnects the actor from the platform
// and optionally sends an exit message.
await Actor.exit();
```

Upgrading guide: For more information, see the [upgrading guide](https://docs.apify.com/sdk/js/docs/upgrading/upgrading-to-v3) that explains all the changes in great detail. ### Build an actor without Crawlee If your actors are not crawlers, or you want to simply wrap existing code and turn it into an actor on the Apify platform, you can do that with standalone Apify SDK v3.

```
npm install apify
```

```
import { Actor } from 'apify';

// Initialize the actor on the platform. This function connects your
// actor to platform events, storages and API. It replaces Apify.main()
await Actor.init();

const input = await Actor.getInput();

// Do something with the input in your own code.
const output = await magicallyCreateOutput(input);

await Actor.setValue('my-output', output);

// This function disconnects the actor from the platform
// and optionally sends an exit message.
await Actor.exit();
```

--- # SDK for Python | Apify Documentation ## sdk - [Search the documentation](https://docs.apify.com/sdk/python/search.md) - [Changelog](https://docs.apify.com/sdk/python/docs/changelog.md): All notable changes to this project will be documented in this file. - [Accessing Apify API](https://docs.apify.com/sdk/python/docs/concepts/access-apify-api.md): The Apify SDK contains many useful features for making Actor development easier. However, it does not cover all the features the Apify API offers. - [Actor configuration](https://docs.apify.com/sdk/python/docs/concepts/actor-configuration.md): The Actor class gets configured using the Configuration class, which initializes itself based on the provided environment variables. - [Actor events & state persistence](https://docs.apify.com/sdk/python/docs/concepts/actor-events.md): During its runtime, the Actor receives Actor events sent by the Apify platform or generated by the Apify SDK itself. - [Actor input](https://docs.apify.com/sdk/python/docs/concepts/actor-input.md): The Actor gets its input from the input record in its default key-value store. - [Actor lifecycle](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle.md): In this guide, we will show you how to manage the lifecycle of an Apify Actor. - [Interacting with other Actors](https://docs.apify.com/sdk/python/docs/concepts/interacting-with-other-actors.md): There are several methods that interact with other Actors and Actor tasks on the Apify platform. - [Logging](https://docs.apify.com/sdk/python/docs/concepts/logging.md): The Apify SDK logs useful information through the logging module from Python's standard library, into the logger with the name apify.
- [Pay-per-event monetization](https://docs.apify.com/sdk/python/docs/concepts/pay-per-event.md): Monetize your Actors using the pay-per-event pricing model - [Proxy management](https://docs.apify.com/sdk/python/docs/concepts/proxy-management.md): IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy to use but powerful tools which can work around IP blocking. The most powerful weapon in your anti IP blocking arsenal is a proxy server. - [Running webserver in your Actor](https://docs.apify.com/sdk/python/docs/concepts/running-webserver.md): Each Actor run on the Apify platform is assigned a unique hard-to-guess URL (for example https://8segt5i81sokzm.runs.apify.net), which enables HTTP access to an optional web server running inside the Actor run's container. - [Working with storages](https://docs.apify.com/sdk/python/docs/concepts/storages.md): The Actor class provides methods to work either with the default storages of the Actor, or with any other storage, named or unnamed. - [Creating webhooks](https://docs.apify.com/sdk/python/docs/concepts/webhooks.md): Webhooks allow you to configure the Apify platform to perform an action when a certain event occurs. For example, you can use them to start another Actor when the current run finishes or fails. - [Using BeautifulSoup with HTTPX](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx.md): In this guide, you'll learn how to use the BeautifulSoup library with the HTTPX library in your Apify Actors. - [Using Crawlee](https://docs.apify.com/sdk/python/docs/guides/crawlee.md): In this guide you'll learn how to use the Crawlee library in your Apify Actors. - [Using Playwright](https://docs.apify.com/sdk/python/docs/guides/playwright.md): Playwright is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would. - [Using Scrapy](https://docs.apify.com/sdk/python/docs/guides/scrapy.md): Scrapy is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify Actors, integrated with Apify storages, and executed on the Apify platform. - [Using Selenium](https://docs.apify.com/sdk/python/docs/guides/selenium.md): Selenium is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would. - [Actor structure](https://docs.apify.com/sdk/python/docs/overview/actor-structure.md): All Python Actor templates follow the same structure. - [Introduction](https://docs.apify.com/sdk/python/docs/overview/introduction.md): The Apify SDK for Python is the official library for creating Apify Actors using Python. - [Running Actors locally](https://docs.apify.com/sdk/python/docs/overview/running-actors-locally.md): In this page, you'll learn how to create and run Apify Actors locally on your computer. - [Upgrading to v2](https://docs.apify.com/sdk/python/docs/upgrading/upgrading-to-v2.md): This page summarizes most of the breaking changes between Apify Python SDK v1.x and v2.0. 
- [apify-sdk-python](https://docs.apify.com/sdk/python/reference.md) - [_RequestDetails](https://docs.apify.com/sdk/python/reference/class/_RequestDetails.md) - [_RequestsFromUrlInput](https://docs.apify.com/sdk/python/reference/class/_RequestsFromUrlInput.md) - [_SimpleUrlInput](https://docs.apify.com/sdk/python/reference/class/_SimpleUrlInput.md) - [AbortingEvent](https://docs.apify.com/sdk/python/reference/class/AbortingEvent.md) - [Actor](https://docs.apify.com/sdk/python/reference/class/Actor.md): The core class for building Actors on the Apify platform. Actors are serverless programs running in the cloud that can perform anything from simple actions (such as filling out a web form or sending an email) to complex operations (such as crawling an entire website or removing duplicates from a large dataset). They are packaged as Docker containers which accept well-defined JSON input, perform an action, and optionally produce well-defined output. ### References - Apify platform documentation: https://docs.apify.com/platform/actors - Actor whitepaper: https://whitepaper.actor/ ### Usage

```python
import asyncio

import httpx
from apify import Actor
from bs4 import BeautifulSoup


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input()
        async with httpx.AsyncClient() as client:
            response = await client.get(actor_input['url'])
        soup = BeautifulSoup(response.content, 'html.parser')
        data = {
            'url': actor_input['url'],
            'title': soup.title.string if soup.title else None,
        }
        await Actor.push_data(data)


if __name__ == '__main__':
    asyncio.run(main())
```

- [ActorChargeEvent](https://docs.apify.com/sdk/python/reference/class/ActorChargeEvent.md) - [ActorDatasetPushPipeline](https://docs.apify.com/sdk/python/reference/class/ActorDatasetPushPipeline.md): A Scrapy pipeline for pushing items to an Actor's default dataset. This pipeline is designed to be enabled only when the Scrapy project is run as an Actor. - [ActorLogFormatter](https://docs.apify.com/sdk/python/reference/class/ActorLogFormatter.md) - [ActorPricingInfo](https://docs.apify.com/sdk/python/reference/class/ActorPricingInfo.md): Result of the `ChargingManager.get_pricing_info` method. - [ActorRun](https://docs.apify.com/sdk/python/reference/class/ActorRun.md) - [ActorRunMeta](https://docs.apify.com/sdk/python/reference/class/ActorRunMeta.md) - [ActorRunOptions](https://docs.apify.com/sdk/python/reference/class/ActorRunOptions.md) - [ActorRunStats](https://docs.apify.com/sdk/python/reference/class/ActorRunStats.md) - [ActorRunUsage](https://docs.apify.com/sdk/python/reference/class/ActorRunUsage.md) - [AddRequestsResponse](https://docs.apify.com/sdk/python/reference/class/AddRequestsResponse.md): Model for a response to add requests to a queue. Contains detailed information about the processing results when adding multiple requests to a queue. This includes which requests were successfully processed and which ones encountered issues during processing. - [ApifyCacheStorage](https://docs.apify.com/sdk/python/reference/class/ApifyCacheStorage.md): A Scrapy cache storage that uses the Apify `KeyValueStore` to store responses. It can be set as a storage for Scrapy's built-in `HttpCacheMiddleware`, which caches responses to requests. See HTTPCache middleware settings (prefixed with `HTTPCACHE_`) in the Scrapy documentation for more information. Requires the asyncio Twisted reactor to be installed.
- [ApifyHttpProxyMiddleware](https://docs.apify.com/sdk/python/reference/class/ApifyHttpProxyMiddleware.md): Apify HTTP proxy middleware for Scrapy. This middleware enhances request processing by adding a 'proxy' field to the request's meta and an authentication header. It draws inspiration from the `HttpProxyMiddleware` included by default in Scrapy projects. The proxy URL is sourced from the settings under the `APIFY_PROXY_SETTINGS` key. The value of this key, a dictionary, should be provided by the Actor input. An example of the proxy settings: proxy_settings = {'useApifyProxy': true, 'apifyProxyGroups': []} - [ApifyScheduler](https://docs.apify.com/sdk/python/reference/class/ApifyScheduler.md): A Scrapy scheduler that uses the Apify `RequestQueue` to manage requests. This scheduler requires the asyncio Twisted reactor to be installed. - [ApifyStorageClient](https://docs.apify.com/sdk/python/reference/class/ApifyStorageClient.md): A storage client implementation based on the Apify platform storage. - [AsyncThread](https://docs.apify.com/sdk/python/reference/class/AsyncThread.md): Class for running an asyncio event loop in a separate thread. This allows running asynchronous coroutines from synchronous code by executingthem on an event loop that runs in its own dedicated thread. - [ChargeResult](https://docs.apify.com/sdk/python/reference/class/ChargeResult.md): Result of the `ChargingManager.charge` method. - [ChargingManager](https://docs.apify.com/sdk/python/reference/class/ChargingManager.md): Provides fine-grained access to pay-per-event functionality. The ChargingManager allows you to charge for specific events in your Actor when using the pay-per-event pricing model. This enables precise cost control and transparent billing for different operations within your Actor. ### References - Apify platform documentation: https://docs.apify.com/platform/actors/publishing/monetize - [ChargingManagerImplementation](https://docs.apify.com/sdk/python/reference/class/ChargingManagerImplementation.md): Implementation of the `ChargingManager` Protocol - this is only meant to be instantiated internally. - [ChargingStateItem](https://docs.apify.com/sdk/python/reference/class/ChargingStateItem.md) - [Configuration](https://docs.apify.com/sdk/python/reference/class/Configuration.md): A class for specifying the configuration of an Actor. Can be used either globally via `Configuration.get_global_configuration()`, or it can be specific to each `Actor` instance on the `actor.config` property. - [Dataset](https://docs.apify.com/sdk/python/reference/class/Dataset.md): Dataset is a storage for managing structured tabular data. The dataset class provides a high-level interface for storing and retrieving structured data with consistent schema, similar to database tables or spreadsheets. It abstracts the underlying storage implementation details, offering a consistent API regardless of where the data is physically stored. Dataset operates in an append-only mode, allowing new records to be added but not modified or deleted after creation. This makes it particularly suitable for storing crawling results and other data that should be immutable once collected. The class provides methods for adding data, retrieving data with various filtering options, and exporting data to different formats. You can create a dataset using the `open` class method, specifying either a name or ID. The underlying storage implementation is determined by the configured storage client. 
### Usage ```python from crawlee.storages import Dataset # Open a dataset dataset = await Dataset.open(name='my_dataset') # Add data await dataset.push_data({'title': 'Example Product', 'price': 99.99}) # Retrieve filtered data results = await dataset.get_data(limit=10, desc=True) # Export data await dataset.export_to('results.json', content_type='json') ``` - [DatasetClient](https://docs.apify.com/sdk/python/reference/class/DatasetClient.md): Dataset resource client implementation based on the Apify platform storage. - [DatasetCollectionClient](https://docs.apify.com/sdk/python/reference/class/DatasetCollectionClient.md): Dataset collection resource client implementation based on the Apify platform storage. - [DatasetItemsListPage](https://docs.apify.com/sdk/python/reference/class/DatasetItemsListPage.md): Model for a single page of dataset items returned from a collection list method. - [DatasetMetadata](https://docs.apify.com/sdk/python/reference/class/DatasetMetadata.md): Model for a dataset metadata. - [DeprecatedEvent](https://docs.apify.com/sdk/python/reference/class/DeprecatedEvent.md) - [EventAbortingData](https://docs.apify.com/sdk/python/reference/class/EventAbortingData.md): Data for the aborting event. - [EventExitData](https://docs.apify.com/sdk/python/reference/class/EventExitData.md): Data for the exit event. - [EventManager](https://docs.apify.com/sdk/python/reference/class/EventManager.md): Manage events and their listeners, enabling registration, emission, and execution control. It allows for registering event listeners, emitting events, and ensuring all listeners complete their execution. Built on top of `pyee.asyncio.AsyncIOEventEmitter`. It implements additional features such as waiting for all listeners to complete and emitting `PersistState` events at regular intervals. - [EventMigratingData](https://docs.apify.com/sdk/python/reference/class/EventMigratingData.md): Data for the migrating event. - [EventPersistStateData](https://docs.apify.com/sdk/python/reference/class/EventPersistStateData.md): Data for the persist state event. - [EventSystemInfoData](https://docs.apify.com/sdk/python/reference/class/EventSystemInfoData.md): Data for the system info event. - [EventWithoutData](https://docs.apify.com/sdk/python/reference/class/EventWithoutData.md) - [ExitEvent](https://docs.apify.com/sdk/python/reference/class/ExitEvent.md) - [FileSystemStorageClient](https://docs.apify.com/sdk/python/reference/class/FileSystemStorageClient.md): File system implementation of the storage client. This storage client provides access to datasets, key-value stores, and request queues that persist data to the local file system. Each storage type is implemented with its own specific file system client that stores data in a structured directory hierarchy. Data is stored in JSON format in predictable file paths, making it easy to inspect and manipulate the stored data outside of the Crawlee application if needed. All data persists between program runs but is limited to access from the local machine where the files are stored. Warning: This storage client is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time. 
- [FlatPricePerMonthActorPricingInfo](https://docs.apify.com/sdk/python/reference/class/FlatPricePerMonthActorPricingInfo.md) - [FreeActorPricingInfo](https://docs.apify.com/sdk/python/reference/class/FreeActorPricingInfo.md) - [KeyValueStore](https://docs.apify.com/sdk/python/reference/class/KeyValueStore.md): Key-value store is a storage for reading and writing data records with unique key identifiers. The key-value store class acts as a high-level interface for storing, retrieving, and managing data records identified by unique string keys. It abstracts away the underlying storage implementation details, allowing you to work with the same API regardless of whether data is stored in memory, on disk, or in the cloud. Each data record is associated with a specific MIME content type, allowing storage of various data formats such as JSON, text, images, HTML snapshots or any binary data. This class is commonly used to store inputs, outputs, and other artifacts of crawler operations. You can instantiate a key-value store using the `open` class method, which will create a store with the specified name or id. The underlying storage implementation is determined by the configured storage client. ### Usage ```python from crawlee.storages import KeyValueStore # Open a named key-value store kvs = await KeyValueStore.open(name='my-store') # Store and retrieve data await kvs.set_value('product-1234.json', [{'name': 'Smartphone', 'price': 799.99}]) product = await kvs.get_value('product-1234') ``` - [KeyValueStoreClient](https://docs.apify.com/sdk/python/reference/class/KeyValueStoreClient.md): Key-value store resource client implementation based on the Apify platform storage. - [KeyValueStoreCollectionClient](https://docs.apify.com/sdk/python/reference/class/KeyValueStoreCollectionClient.md): Key-value store collection resource client implementation based on the Apify platform storage. - [KeyValueStoreMetadata](https://docs.apify.com/sdk/python/reference/class/KeyValueStoreMetadata.md): Model for a key-value store metadata. - [KeyValueStoreRecord](https://docs.apify.com/sdk/python/reference/class/KeyValueStoreRecord.md): Model for a key-value store record. - [KeyValueStoreRecordMetadata](https://docs.apify.com/sdk/python/reference/class/KeyValueStoreRecordMetadata.md): Model for a key-value store record metadata. - [LocalEventManager](https://docs.apify.com/sdk/python/reference/class/LocalEventManager.md): Event manager for local environments. It extends the `EventManager` to emit `SystemInfo` events at regular intervals. The `LocalEventManager` is intended to be used in local environments, where the system metrics are required managing the `Snapshotter` and `AutoscaledPool`. - [MemoryStorageClient](https://docs.apify.com/sdk/python/reference/class/MemoryStorageClient.md): Memory implementation of the storage client. This storage client provides access to datasets, key-value stores, and request queues that store all data in memory using Python data structures (lists and dictionaries). No data is persisted between process runs, meaning all stored data is lost when the program terminates. The memory implementation provides fast access to data but is limited by available memory and does not support data sharing across different processes. All storage operations happen entirely in memory with no disk operations. The memory storage client is useful for testing and development environments, or short-lived crawler operations where persistence is not required. 
- [MigratingEvent](https://docs.apify.com/sdk/python/reference/class/MigratingEvent.md) - [PayPerEventActorPricingInfo](https://docs.apify.com/sdk/python/reference/class/PayPerEventActorPricingInfo.md) - [PersistStateEvent](https://docs.apify.com/sdk/python/reference/class/PersistStateEvent.md) - [PlatformEventManager](https://docs.apify.com/sdk/python/reference/class/PlatformEventManager.md): A class for managing Actor events. You shouldn't use this class directly, but instead use it via the `Actor.on()` and `Actor.off()` methods. - [PricePerDatasetItemActorPricingInfo](https://docs.apify.com/sdk/python/reference/class/PricePerDatasetItemActorPricingInfo.md) - [PricingInfoItem](https://docs.apify.com/sdk/python/reference/class/PricingInfoItem.md) - [PricingPerEvent](https://docs.apify.com/sdk/python/reference/class/PricingPerEvent.md) - [ProcessedRequest](https://docs.apify.com/sdk/python/reference/class/ProcessedRequest.md): Represents a processed request. - [ProxyConfiguration](https://docs.apify.com/sdk/python/reference/class/ProxyConfiguration.md): Configures a connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. The default servers used by this class are managed by [Apify Proxy](https://docs.apify.com/proxy). To be able to use Apify Proxy, you need an Apify account and access to the selected proxies. If you provide no configuration option, the proxies will be managed automatically using a smart algorithm. If you want to use your own proxies, use the `proxy_urls` or `new_url_function` constructor options. Your list of proxy URLs will be rotated by the configuration, if this option is provided. - [ProxyInfo](https://docs.apify.com/sdk/python/reference/class/ProxyInfo.md): Provides information about a proxy connection that is used for requests. - [Request](https://docs.apify.com/sdk/python/reference/class/Request.md): Represents a request in the Crawlee framework, containing the necessary information for crawling operations. The `Request` class is one of the core components in Crawlee, utilized by various components such as request providers, HTTP clients, crawlers, and more. It encapsulates the essential data for executing web requests, including the URL, HTTP method, headers, payload, and user data. The user data allows custom information to be stored and persisted throughout the request lifecycle, including its retries. Key functionalities include managing the request's identifier (`id`), unique key (`unique_key`) that is used for request deduplication, controlling retries, handling state management, and enabling configuration for session rotation and proxy handling. The recommended way to create a new instance is by using the `Request.from_url` constructor, which automatically generates a unique key and identifier based on the URL and request parameters. ### Usage ```python from crawlee import Request request = Request.from_url('https://crawlee.dev') ``` - [RequestList](https://docs.apify.com/sdk/python/reference/class/RequestList.md): Extends crawlee RequestList. Method open is used to create RequestList from actor's requestListSources input. - [RequestLoader](https://docs.apify.com/sdk/python/reference/class/RequestLoader.md): An abstract class defining the interface for classes that provide access to a read-only stream of requests. Request loaders are used to manage and provide access to a storage of crawling requests. 
Key responsibilities: - Fetching the next request to be processed. - Marking requests as successfully handled after processing. - Managing state information such as the total and handled request counts. - [RequestManager](https://docs.apify.com/sdk/python/reference/class/RequestManager.md): Base class that extends `RequestLoader` with the capability to enqueue new requests and reclaim failed ones. - [RequestManagerTandem](https://docs.apify.com/sdk/python/reference/class/RequestManagerTandem.md): Implements a tandem behaviour for a pair of `RequestLoader` and `RequestManager`. In this scenario, the contents of the "loader" get transferred into the "manager", allowing processing the requests from both sources and also enqueueing new requests (not possible with plain `RequestManager`). - [RequestQueue](https://docs.apify.com/sdk/python/reference/class/RequestQueue.md): Request queue is a storage for managing HTTP requests. The request queue class serves as a high-level interface for organizing and managing HTTP requests during web crawling. It provides methods for adding, retrieving, and manipulating requests throughout the crawling lifecycle, abstracting away the underlying storage implementation details. Request queue maintains the state of each URL to be crawled, tracking whether it has been processed, is currently being handled, or is waiting in the queue. Each URL in the queue is uniquely identified by a `unique_key` property, which prevents duplicate processing unless explicitly configured otherwise. The class supports both breadth-first and depth-first crawling strategies through its `forefront` parameter when adding requests. It also provides mechanisms for error handling and request reclamation when processing fails. You can open a request queue using the `open` class method, specifying either a name or ID to identify the queue. The underlying storage implementation is determined by the configured storage client. ### Usage ```python from crawlee.storages import RequestQueue # Open a request queue rq = await RequestQueue.open(name='my_queue') # Add a request await rq.add_request('https://example.com') # Process requests request = await rq.fetch_next_request() if request: try: # Process the request # ... await rq.mark_request_as_handled(request) except Exception: await rq.reclaim_request(request) ``` - [RequestQueueClient](https://docs.apify.com/sdk/python/reference/class/RequestQueueClient.md): Request queue resource client implementation based on the Apify platform storage. - [RequestQueueCollectionClient](https://docs.apify.com/sdk/python/reference/class/RequestQueueCollectionClient.md): Request queue collection resource client implementation based on the Apify platform storage. - [RequestQueueMetadata](https://docs.apify.com/sdk/python/reference/class/RequestQueueMetadata.md): Model for a request queue metadata. - [SitemapRequestLoader](https://docs.apify.com/sdk/python/reference/class/SitemapRequestLoader.md): A request loader that reads URLs from sitemap(s). The loader fetches and parses sitemaps in the background, allowing crawling to start before all URLs are loaded. It supports filtering URLs using glob and regex patterns. - [Storage](https://docs.apify.com/sdk/python/reference/class/Storage.md): Base class for storages. - [StorageClient](https://docs.apify.com/sdk/python/reference/class/StorageClient.md): Base class for storage clients. 
The `StorageClient` serves as an abstract base class that defines the interface for accessing Crawlee's storage types: datasets, key-value stores, and request queues. It provides methods to open clients for each of these storage types and handles common functionality. Storage clients implementations can be provided for various backends (file system, memory, databases, various cloud providers, etc.) to support different use cases from development to production environments. Each storage client implementation is responsible for ensuring proper initialization, data persistence (where applicable), and consistent access patterns across all storage types it supports. - [StorageMetadata](https://docs.apify.com/sdk/python/reference/class/StorageMetadata.md): Represents the base model for storage metadata. It contains common fields shared across all specific storage types. - [SystemInfoEvent](https://docs.apify.com/sdk/python/reference/class/SystemInfoEvent.md) - [SystemInfoEventData](https://docs.apify.com/sdk/python/reference/class/SystemInfoEventData.md) - [UnknownEvent](https://docs.apify.com/sdk/python/reference/class/UnknownEvent.md) - [Webhook](https://docs.apify.com/sdk/python/reference/class/Webhook.md) - [Event](https://docs.apify.com/sdk/python/reference/enum/Event.md): Names of all possible events that can be emitted using an `EventManager`. - [Apify SDK for Python is a toolkit for building Actors](https://docs.apify.com/sdk/python/index.md) --- # Full Documentation Content
[Discord](https://discord.com/invite/jyEM2PRvMU) * [Trust Center](https://trust.apify.com) [](https://apify.com) --- # Changelog All notable changes to this project will be documented in this file. ### 2.7.1 - **not yet released**[​](#271---not-yet-released "Direct link to 271---not-yet-released") #### 🐛 Bug Fixes[​](#-bug-fixes "Direct link to 🐛 Bug Fixes") * Restrict apify-shared and apify-client versions ([#523](https://github.com/apify/apify-sdk-python/pull/523)) ([b3ae5a9](https://github.com/apify/apify-sdk-python/commit/b3ae5a972a65454a4998eda59c9fcc3f6b7e8579)) by [@vdusek](https://github.com/vdusek) * Expose `APIFY_USER_IS_PAYING` env var to the configuration ([#507](https://github.com/apify/apify-sdk-python/pull/507)) ([0801e54](https://github.com/apify/apify-sdk-python/commit/0801e54887317c1280cc6828ecd3f2cc53287e76)) by [@stepskop](https://github.com/stepskop) ### [2.7.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.7.0) (2025-07-14)[​](#270-2025-07-14 "Direct link to 270-2025-07-14") #### 🚀 Features[​](#-features "Direct link to 🚀 Features") * **crypto:** Decrypt secret objects ([#482](https://github.com/apify/apify-sdk-python/pull/482)) ([ce9daf7](https://github.com/apify/apify-sdk-python/commit/ce9daf7381212b8dc194e8a643e5ca0dedbc0078)) by [@MFori](https://github.com/MFori) #### 🐛 Bug Fixes[​](#-bug-fixes-1 "Direct link to 🐛 Bug Fixes") * Sync `@docusaurus` theme version \[internal] ([#500](https://github.com/apify/apify-sdk-python/pull/500)) ([a7485e7](https://github.com/apify/apify-sdk-python/commit/a7485e7d2276fde464ce862573d5b95e7d4d836a)) by [@katzino](https://github.com/katzino) * Tagline overlap ([#501](https://github.com/apify/apify-sdk-python/pull/501)) ([bae8340](https://github.com/apify/apify-sdk-python/commit/bae8340c46fea756ea35ea4d591da84c09d478e2)) by [@katzino](https://github.com/katzino) ### [2.7.3](https://github.com/apify/apify-sdk-python/releases/tag/v2.7.3) (2025-08-11)[​](#273-2025-08-11 "Direct link to 273-2025-08-11") #### 🐛 Bug Fixes[​](#-bug-fixes-2 "Direct link to 🐛 Bug Fixes") * Expose `APIFY_USER_IS_PAYING` env var to the configuration (#507) ([0de022c](https://github.com/apify/apify-sdk-python/commit/0de022c3435f24c821053c771e7b659433e3fb6e)) ### [2.7.2](https://github.com/apify/apify-sdk-python/releases/tag/v2.7.2) (2025-07-30)[​](#272-2025-07-30 "Direct link to 272-2025-07-30") #### 🐛 Bug Fixes[​](#-bug-fixes-3 "Direct link to 🐛 Bug Fixes") * Restrict apify-shared and apify-client versions ([#523](https://github.com/apify/apify-sdk-python/pull/523)) ([581ebae](https://github.com/apify/apify-sdk-python/commit/581ebae5752a984a34cbabc02c49945ae392db00)) by [@vdusek](https://github.com/vdusek) ### [2.7.1](https://github.com/apify/apify-sdk-python/releases/tag/v2.7.1) (2025-07-24)[​](#271-2025-07-24 "Direct link to 271-2025-07-24") #### 🐛 Bug Fixes[​](#-bug-fixes-4 "Direct link to 🐛 Bug Fixes") * Add back support for Python 3.9. 
### [2.7.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.7.0) (2025-07-14)[​](#270-2025-07-14-1 "Direct link to 270-2025-07-14-1") #### 🚀 Features[​](#-features-1 "Direct link to 🚀 Features") * Expose `logger` argument on `Actor.call` to control log redirection from started Actor run ([#487](https://github.com/apify/apify-sdk-python/pull/487)) ([aa6fa47](https://github.com/apify/apify-sdk-python/commit/aa6fa4750ea1bc7909be1191c0d276a2046930c2)) by [@Pijukatel](https://github.com/Pijukatel) * **crypto:** Decrypt secret objects ([#482](https://github.com/apify/apify-sdk-python/pull/482)) ([ce9daf7](https://github.com/apify/apify-sdk-python/commit/ce9daf7381212b8dc194e8a643e5ca0dedbc0078)) by [@MFori](https://github.com/MFori) ### [2.6.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.6.0) (2025-06-09)[​](#260-2025-06-09 "Direct link to 260-2025-06-09") #### 🚀 Features[​](#-features-2 "Direct link to 🚀 Features") * Add `RemainingTime` option for `timeout` argument of `Actor.call` and `Actor.start` ([#473](https://github.com/apify/apify-sdk-python/pull/473)) ([ba7f757](https://github.com/apify/apify-sdk-python/commit/ba7f757a82661a5a181d9bd767950d09557409f9)) by [@Pijukatel](https://github.com/Pijukatel), closes [#472](https://github.com/apify/apify-sdk-python/issues/472) #### 🐛 Bug Fixes[​](#-bug-fixes-5 "Direct link to 🐛 Bug Fixes") * Fix duplicate logs from apify logger in Scrapy integration ([#457](https://github.com/apify/apify-sdk-python/pull/457)) ([2745ee6](https://github.com/apify/apify-sdk-python/commit/2745ee6529deecb4f2838c764b9bb3fb6606762b)) by [@vdusek](https://github.com/vdusek), closes [#391](https://github.com/apify/apify-sdk-python/issues/391) * Prefer proxy password from env var ([#468](https://github.com/apify/apify-sdk-python/pull/468)) ([1c4ad9b](https://github.com/apify/apify-sdk-python/commit/1c4ad9bcfbf6ac404f942d7d2d249b036c2e7f54)) by [@stepskop](https://github.com/stepskop) ### [2.5.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.5.0) (2025-03-27)[​](#250-2025-03-27 "Direct link to 250-2025-03-27") #### 🚀 Features[​](#-features-3 "Direct link to 🚀 Features") * Implement Scrapy HTTP cache backend ([#403](https://github.com/apify/apify-sdk-python/pull/403)) ([137e3c8](https://github.com/apify/apify-sdk-python/commit/137e3c8d5c6b28cf6935cfb742b5f072cd2e0a02)) by [@honzajavorek](https://github.com/honzajavorek) #### 🐛 Bug Fixes[​](#-bug-fixes-6 "Direct link to 🐛 Bug Fixes") * Fix calculation of CPU utilization from SystemInfo events ([#447](https://github.com/apify/apify-sdk-python/pull/447)) ([eb4c8e4](https://github.com/apify/apify-sdk-python/commit/eb4c8e4e498e23f573b9e2d4c7dbd8e2ecc277d9)) by [@janbuchar](https://github.com/janbuchar) ### [2.4.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.4.0) (2025-03-07)[​](#240-2025-03-07 "Direct link to 240-2025-03-07") #### 🚀 Features[​](#-features-4 "Direct link to 🚀 Features") * Update to Crawlee v0.6 ([#420](https://github.com/apify/apify-sdk-python/pull/420)) ([9be4336](https://github.com/apify/apify-sdk-python/commit/9be433667231cc5739861fa693d7a726860d6aca)) by [@vdusek](https://github.com/vdusek) * Add Actor `exit_process` option ([#424](https://github.com/apify/apify-sdk-python/pull/424)) ([994c832](https://github.com/apify/apify-sdk-python/commit/994c8323b994e009db0ccdcb624891a2fef97070)) by [@vdusek](https://github.com/vdusek), closes [#396](https://github.com/apify/apify-sdk-python/issues/396), [#401](https://github.com/apify/apify-sdk-python/issues/401) * Upgrade 
websockets to v14 to adapt to library API changes ([#425](https://github.com/apify/apify-sdk-python/pull/425)) ([5f49275](https://github.com/apify/apify-sdk-python/commit/5f49275ca1177e5ba56856ffe3860f6b97bee9ee)) by [@Mantisus](https://github.com/Mantisus), closes [#325](https://github.com/apify/apify-sdk-python/issues/325) * Add signing of public URL ([#407](https://github.com/apify/apify-sdk-python/pull/407)) ([a865461](https://github.com/apify/apify-sdk-python/commit/a865461c703aea01d91317f4fdf38c1bedd35f00)) by [@danpoletaev](https://github.com/danpoletaev) ### [2.3.1](https://github.com/apify/apify-sdk-python/releases/tag/v2.3.1) (2025-02-25)[​](#231-2025-02-25 "Direct link to 231-2025-02-25") #### 🐛 Bug Fixes[​](#-bug-fixes-7 "Direct link to 🐛 Bug Fixes") * Allow None value in 'inputBodyLen' in ActorRunStats ([#413](https://github.com/apify/apify-sdk-python/pull/413)) ([1cf37f1](https://github.com/apify/apify-sdk-python/commit/1cf37f13f8db1313ac82276d13200af4aa2bf773)) by [@janbuchar](https://github.com/janbuchar) ### [2.3.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.3.0) (2025-02-19)[​](#230-2025-02-19 "Direct link to 230-2025-02-19") #### 🚀 Features[​](#-features-5 "Direct link to 🚀 Features") * Add `rate_limit_errors` property for `ApifyStorageClient` ([#387](https://github.com/apify/apify-sdk-python/pull/387)) ([89c230a](https://github.com/apify/apify-sdk-python/commit/89c230a21a1a8698159975f97c73a724b9063278)) by [@Mantisus](https://github.com/Mantisus), closes [#318](https://github.com/apify/apify-sdk-python/issues/318) * Unify Apify and Scrapy to use single event loop & remove `nest-asyncio` ([#390](https://github.com/apify/apify-sdk-python/pull/390)) ([96949be](https://github.com/apify/apify-sdk-python/commit/96949be4f7687ac9285992d1fb02ac6172307bdb)) by [@vdusek](https://github.com/vdusek), closes [#148](https://github.com/apify/apify-sdk-python/issues/148), [#176](https://github.com/apify/apify-sdk-python/issues/176), [#392](https://github.com/apify/apify-sdk-python/issues/392) * Support pay-per-event via `Actor.charge` ([#393](https://github.com/apify/apify-sdk-python/pull/393)) ([78888c4](https://github.com/apify/apify-sdk-python/commit/78888c4d6258211cdbc5fd5b5cbadbf23c39d818)) by [@janbuchar](https://github.com/janbuchar), closes [#374](https://github.com/apify/apify-sdk-python/issues/374) #### 🐛 Bug Fixes[​](#-bug-fixes-8 "Direct link to 🐛 Bug Fixes") * Fix RQ usage in Scrapy scheduler ([#385](https://github.com/apify/apify-sdk-python/pull/385)) ([3363478](https://github.com/apify/apify-sdk-python/commit/3363478dbf6eb35e45c237546fe0df5c104166f6)) by [@vdusek](https://github.com/vdusek) * Make sure that Actor instances with non-default configurations are also accessible through the global Actor proxy after initialization ([#402](https://github.com/apify/apify-sdk-python/pull/402)) ([b956a02](https://github.com/apify/apify-sdk-python/commit/b956a02d0ba59e0cfde489cc13ca92d7f8f8c84f)) by [@janbuchar](https://github.com/janbuchar), closes [#397](https://github.com/apify/apify-sdk-python/issues/397) ### [2.2.1](https://github.com/apify/apify-sdk-python/releases/tag/v2.2.1) (2025-01-17)[​](#221-2025-01-17 "Direct link to 221-2025-01-17") #### 🐛 Bug Fixes[​](#-bug-fixes-9 "Direct link to 🐛 Bug Fixes") * Better event listener type definitions ([#354](https://github.com/apify/apify-sdk-python/pull/354)) ([52a6dee](https://github.com/apify/apify-sdk-python/commit/52a6dee92cc0cc4fa032dfc8c312545bc5e07206)) by [@janbuchar](https://github.com/janbuchar), closes 
[#344](https://github.com/apify/apify-sdk-python/issues/344) ### [2.2.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.2.0) (2025-01-10)[​](#220-2025-01-10 "Direct link to 220-2025-01-10") #### 🚀 Features[​](#-features-6 "Direct link to 🚀 Features") * Add new config variables to `Actor.config` ([#351](https://github.com/apify/apify-sdk-python/pull/351)) ([7b6478c](https://github.com/apify/apify-sdk-python/commit/7b6478c3fc239b454f733fbd98348dab7b3a1766)) by [@fnesveda](https://github.com/fnesveda) * Upgrade to Crawlee v0.5 ([#355](https://github.com/apify/apify-sdk-python/pull/355)) ([826f4db](https://github.com/apify/apify-sdk-python/commit/826f4dbcc8cfd693d97e40c17faf91d225d7ffaf)) by [@vdusek](https://github.com/vdusek) #### 🐛 Bug Fixes[​](#-bug-fixes-10 "Direct link to 🐛 Bug Fixes") * Better error message when attempting to use force\_cloud without an Apify token ([#356](https://github.com/apify/apify-sdk-python/pull/356)) ([33245ce](https://github.com/apify/apify-sdk-python/commit/33245ceddb1fa0ed39548181fb57fb3e6b98f954)) by [@janbuchar](https://github.com/janbuchar) * Allow calling `Actor.reboot()` from migrating handler, align reboot behavior with JS SDK ([#361](https://github.com/apify/apify-sdk-python/pull/361)) ([7ba0221](https://github.com/apify/apify-sdk-python/commit/7ba022121fe7b65470fec901295f74cebce72610)) by [@fnesveda](https://github.com/fnesveda) ### [2.1.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.1.0) (2024-12-03)[​](#210-2024-12-03 "Direct link to 210-2024-12-03") #### 🚀 Features[​](#-features-7 "Direct link to 🚀 Features") * Handle request list user input ([#326](https://github.com/apify/apify-sdk-python/pull/326)) ([c14fb9a](https://github.com/apify/apify-sdk-python/commit/c14fb9a9527c8b699e32ed49d39ce0a69447f87c)) by [@Pijukatel](https://github.com/Pijukatel), closes [#310](https://github.com/apify/apify-sdk-python/issues/310) #### 🐛 Bug Fixes[​](#-bug-fixes-11 "Direct link to 🐛 Bug Fixes") * Add upper bound of HTTPX version ([#347](https://github.com/apify/apify-sdk-python/pull/347)) ([e86dbce](https://github.com/apify/apify-sdk-python/commit/e86dbce69f6978cf2c15910213655e5d80f62a23)) by [@vdusek](https://github.com/vdusek) ### [2.0.2](https://github.com/apify/apify-sdk-python/releases/tag/v2.0.2) (2024-11-12)[​](#202-2024-11-12 "Direct link to 202-2024-11-12") #### 🐛 Bug Fixes[​](#-bug-fixes-12 "Direct link to 🐛 Bug Fixes") * Fix CPU usage calculation ([#315](https://github.com/apify/apify-sdk-python/pull/315)) ([0521d91](https://github.com/apify/apify-sdk-python/commit/0521d911afbb8029ad29949f69c4f19166a01fc0)) by [@janbuchar](https://github.com/janbuchar) * Set version constraint of the `websockets` dependency to <14.0.0 ([#322](https://github.com/apify/apify-sdk-python/pull/322)) ([15ad055](https://github.com/apify/apify-sdk-python/commit/15ad0550e7a5508adff3eb35511248c611a0f595)) by [@Pijukatel](https://github.com/Pijukatel) * Fix Dataset.iter\_items for apify\_storage ([#321](https://github.com/apify/apify-sdk-python/pull/321)) ([2db1beb](https://github.com/apify/apify-sdk-python/commit/2db1beb2d56a7e7954cd76023d1273c7546d7cbf)) by [@Pijukatel](https://github.com/Pijukatel), closes [#320](https://github.com/apify/apify-sdk-python/issues/320) ### [2.0.1](https://github.com/apify/apify-sdk-python/releases/tag/v2.0.1) (2024-10-25)[​](#201-2024-10-25 "Direct link to 201-2024-10-25") #### 🚀 Features[​](#-features-8 "Direct link to 🚀 Features") * Add standby URL, change default standby port 
([#287](https://github.com/apify/apify-sdk-python/pull/287)) ([8cd2f2c](https://github.com/apify/apify-sdk-python/commit/8cd2f2cb9d1191dbc93bf1b8a2d70189881c64ad)) by [@jirimoravcik](https://github.com/jirimoravcik) * Add crawlee version to system info print ([#304](https://github.com/apify/apify-sdk-python/pull/304)) ([c28f38f](https://github.com/apify/apify-sdk-python/commit/c28f38f4e205515e1b5d1ce97a2072be3a09d338)) by [@vdusek](https://github.com/vdusek) #### 🐛 Bug Fixes[​](#-bug-fixes-13 "Direct link to 🐛 Bug Fixes") * Adjust tests of scrapy user data ([#284](https://github.com/apify/apify-sdk-python/pull/284)) ([26ffb15](https://github.com/apify/apify-sdk-python/commit/26ffb15797effcfad1a25c840dd3d17663e26ea3)) by [@janbuchar](https://github.com/janbuchar) * Use HttpHeaders type in Scrapy integration ([#289](https://github.com/apify/apify-sdk-python/pull/289)) ([3e33e91](https://github.com/apify/apify-sdk-python/commit/3e33e9147bfd60554b9da41b032c0451f91ba27b)) by [@vdusek](https://github.com/vdusek) * Allow empty timeout\_at env variable ([#303](https://github.com/apify/apify-sdk-python/pull/303)) ([b67ec98](https://github.com/apify/apify-sdk-python/commit/b67ec989dfcc21756cc976c52edc25735a3f0501)) by [@janbuchar](https://github.com/janbuchar), closes [#596](https://github.com/apify/apify-sdk-python/issues/596) ### [2.0.0](https://github.com/apify/apify-sdk-python/releases/tag/v2.0.0) (2024-09-10)[​](#200-2024-09-10 "Direct link to 200-2024-09-10") #### 🚀 Features[​](#-features-9 "Direct link to 🚀 Features") * Better Actor API typing ([#256](https://github.com/apify/apify-sdk-python/pull/256)) ([abb87e7](https://github.com/apify/apify-sdk-python/commit/abb87e7f3c272f88a9a76292d8394fe93b98428a)) by [@janbuchar](https://github.com/janbuchar), closes [#243](https://github.com/apify/apify-sdk-python/issues/243) * Expose Request from Crawlee ([#266](https://github.com/apify/apify-sdk-python/pull/266)) ([1f01278](https://github.com/apify/apify-sdk-python/commit/1f01278c77f261500bc74efd700c0583ac45fd82)) by [@vdusek](https://github.com/vdusek) * Automatically configure logging ([#271](https://github.com/apify/apify-sdk-python/pull/271)) ([1906bb2](https://github.com/apify/apify-sdk-python/commit/1906bb216b8a3f1c2ad740c551ee019c2ba0696f)) by [@janbuchar](https://github.com/janbuchar) #### 🐛 Bug Fixes[​](#-bug-fixes-14 "Direct link to 🐛 Bug Fixes") * Make apify.log public again ([#249](https://github.com/apify/apify-sdk-python/pull/249)) ([22677f5](https://github.com/apify/apify-sdk-python/commit/22677f57b2aff6c9bddbee305e5a62e39bbf5915)) by [@janbuchar](https://github.com/janbuchar) * Dataset list response handling ([#257](https://github.com/apify/apify-sdk-python/pull/257)) ([0ea57d7](https://github.com/apify/apify-sdk-python/commit/0ea57d7c4788bff31f215c447c1881e56d6508bb)) by [@janbuchar](https://github.com/janbuchar) * Ignore deprecated platform events ([#258](https://github.com/apify/apify-sdk-python/pull/258)) ([ed5ab3b](https://github.com/apify/apify-sdk-python/commit/ed5ab3b80c851a817aa87806c39cd8ef3e86fde5)) by [@janbuchar](https://github.com/janbuchar) * Possible infinity loop in Apify-Scrapy proxy middleware ([#259](https://github.com/apify/apify-sdk-python/pull/259)) ([8647a94](https://github.com/apify/apify-sdk-python/commit/8647a94289423528f2940d9f7174f81682fbb407)) by [@vdusek](https://github.com/vdusek) * Hotfix for batch\_add\_requests batch size limit ([#261](https://github.com/apify/apify-sdk-python/pull/261)) 
([61d7a39](https://github.com/apify/apify-sdk-python/commit/61d7a392d182a752c91193170dca351f4cb0fbf3)) by [@janbuchar](https://github.com/janbuchar) #### Refactor[​](#refactor "Direct link to Refactor") * \[**breaking**] Preparation for v2 release ([#210](https://github.com/apify/apify-sdk-python/pull/210)) ([2f9dcc5](https://github.com/apify/apify-sdk-python/commit/2f9dcc559414f31e3f4fc87e72417a36494b9c84)) by [@janbuchar](https://github.com/janbuchar), closes [#135](https://github.com/apify/apify-sdk-python/issues/135), [#137](https://github.com/apify/apify-sdk-python/issues/137), [#138](https://github.com/apify/apify-sdk-python/issues/138), [#147](https://github.com/apify/apify-sdk-python/issues/147), [#149](https://github.com/apify/apify-sdk-python/issues/149), [#237](https://github.com/apify/apify-sdk-python/issues/237) #### Chore[​](#chore "Direct link to Chore") * \[**breaking**] Drop support for Python 3.8 ### [1.7.2](https://github.com/apify/apify-sdk-python/releases/tag/v1.7.2) (2024-07-08)[​](#172-2024-07-08 "Direct link to 172-2024-07-08") * Add Actor Standby port ### [1.7.1](https://github.com/apify/apify-sdk-python/releases/tag/v1.7.1) (2024-05-23)[​](#171-2024-05-23 "Direct link to 171-2024-05-23") #### 🐛 Bug Fixes[​](#-bug-fixes-15 "Direct link to 🐛 Bug Fixes") * Set a timeout for Actor cleanup ### [1.7.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.7.0) (2024-03-12)[​](#170-2024-03-12 "Direct link to 170-2024-03-12") #### 🚀 Features[​](#-features-10 "Direct link to 🚀 Features") * Add a new way of generating the `uniqueKey` field of the request, aligning it with the Crawlee. #### 🐛 Bug Fixes[​](#-bug-fixes-16 "Direct link to 🐛 Bug Fixes") * Improve error handling for `to_apify_request` serialization failures * Scrapy's `Request.dont_filter` works. ### [1.6.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.6.0) (2024-02-23)[​](#160-2024-02-23 "Direct link to 160-2024-02-23") #### 🐛 Bug Fixes[​](#-bug-fixes-17 "Direct link to 🐛 Bug Fixes") * Update of Scrapy integration, fixes in `ApifyScheduler`, `to_apify_request` and `apply_apify_settings`. 
#### Chore[​](#chore-1 "Direct link to Chore") * Remove `ApifyRetryMiddleware` and stay with the Scrapy's default one ### [1.5.5](https://github.com/apify/apify-sdk-python/releases/tag/v1.5.5) (2024-02-01)[​](#155-2024-02-01 "Direct link to 155-2024-02-01") #### 🐛 Bug Fixes[​](#-bug-fixes-18 "Direct link to 🐛 Bug Fixes") * Fix conversion of `headers` fields in Apify <--> Scrapy request translation ### [1.5.4](https://github.com/apify/apify-sdk-python/releases/tag/v1.5.4) (2024-01-24)[​](#154-2024-01-24 "Direct link to 154-2024-01-24") #### 🐛 Bug Fixes[​](#-bug-fixes-19 "Direct link to 🐛 Bug Fixes") * Fix conversion of `userData` and `headers` fields in Apify <--> Scrapy request translation ### [1.5.3](https://github.com/apify/apify-sdk-python/releases/tag/v1.5.3) (2024-01-23)[​](#153-2024-01-23 "Direct link to 153-2024-01-23") #### 🚀 Features[​](#-features-11 "Direct link to 🚀 Features") * Add `apply_apify_settings` function to Scrapy subpackage ### [1.5.2](https://github.com/apify/apify-sdk-python/releases/tag/v1.5.2) (2024-01-19)[​](#152-2024-01-19 "Direct link to 152-2024-01-19") #### 🐛 Bug Fixes[​](#-bug-fixes-20 "Direct link to 🐛 Bug Fixes") * Add missing import check to `ApifyHttpProxyMiddleware` #### Chore[​](#chore-2 "Direct link to Chore") * Create a new subpackage for Scrapy pipelines * Remove some noqas thanks to the new Ruff release * Replace relative imports with absolute imports * Replace asserts with custom checks in Scrapy subpackage ### [1.5.1](https://github.com/apify/apify-sdk-python/releases/tag/v1.5.1) (2024-01-10)[​](#151-2024-01-10 "Direct link to 151-2024-01-10") #### Chore[​](#chore-3 "Direct link to Chore") * Allowed running integration tests from PRs from forks, after maintainer approval * Do not close `nested_event_loop` in the `Scheduler.__del__` ### [1.5.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.5.0) (2024-01-03)[​](#150-2024-01-03 "Direct link to 150-2024-01-03") #### 🚀 Features[​](#-features-12 "Direct link to 🚀 Features") * Add `ApifyHttpProxyMiddleware` ### [1.4.1](https://github.com/apify/apify-sdk-python/releases/tag/v1.4.1) (2023-12-21)[​](#141-2023-12-21 "Direct link to 141-2023-12-21") #### 🐛 Bug Fixes[​](#-bug-fixes-21 "Direct link to 🐛 Bug Fixes") * Resolve issue in `ApifyRetryMiddleware.process_exception()`, where requests were getting stuck in the request queue #### Chore[​](#chore-4 "Direct link to Chore") * Fix type hint problems for resource clients ### [1.4.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.4.0) (2023-12-05)[​](#140-2023-12-05 "Direct link to 140-2023-12-05") #### Chore[​](#chore-5 "Direct link to Chore") * Migrate from Autopep8 and Flake8 to Ruff ### [1.3.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.3.0) (2023-11-15)[​](#130-2023-11-15 "Direct link to 130-2023-11-15") #### 🚀 Features[​](#-features-13 "Direct link to 🚀 Features") * Add `scrapy` extra ### [1.2.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.2.0) (2023-10-23)[​](#120-2023-10-23 "Direct link to 120-2023-10-23") #### 🚀 Features[​](#-features-14 "Direct link to 🚀 Features") * Add support for Python 3.12 #### Chore[​](#chore-6 "Direct link to Chore") * Fix lint error (E721) in unit tests (for instance checks use `isinstance()`) ### [1.1.5](https://github.com/apify/apify-sdk-python/releases/tag/v1.1.5) (2023-10-03)[​](#115-2023-10-03 "Direct link to 115-2023-10-03") #### 🚀 Features[​](#-features-15 "Direct link to 🚀 Features") * Update the Apify log formatter to contain an option for adding the logger name 
#### Chore[​](#chore-7 "Direct link to Chore") * Rewrite documentation publication to use Docusaurus * Remove PR Toolkit workflow ### [1.1.4](https://github.com/apify/apify-sdk-python/releases/tag/v1.1.4) (2023-09-06)[​](#114-2023-09-06 "Direct link to 114-2023-09-06") #### 🐛 Bug Fixes[​](#-bug-fixes-22 "Direct link to 🐛 Bug Fixes") * Resolve issue with querying request queue head multiple times in parallel #### Chore[​](#chore-8 "Direct link to Chore") * Fix integration tests for Actor logger * Remove `pytest-randomly` Pytest plugin * Unpin `apify-client` and `apify-shared` to improve compatibility with their newer versions ### [1.1.3](https://github.com/apify/apify-sdk-python/releases/tag/v1.1.3) (2023-08-25)[​](#113-2023-08-25 "Direct link to 113-2023-08-25") #### Chore[​](#chore-9 "Direct link to Chore") * Unify indentation in configuration files * Update the `Actor.reboot` method to use the new reboot endpoint ### [1.1.2](https://github.com/apify/apify-sdk-python/releases/tag/v1.1.2) (2023-08-02)[​](#112-2023-08-02 "Direct link to 112-2023-08-02") #### Chore[​](#chore-10 "Direct link to Chore") * Start importing general constants and utilities from the `apify-shared` library * Simplify code via `flake8-simplify` * Start using environment variables with prefix `ACTOR_` instead of some with prefix `APIFY_` * Pin `apify-client` and `apify-shared` to prevent their implicit updates from breaking SDK ### [1.1.1](https://github.com/apify/apify-sdk-python/releases/tag/v1.1.1) (2023-05-23)[​](#111-2023-05-23 "Direct link to 111-2023-05-23") #### 🐛 Bug Fixes[​](#-bug-fixes-23 "Direct link to 🐛 Bug Fixes") * Relax dependency requirements to improve compatibility with other libraries ### [1.1.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.1.0) (2023-05-23)[​](#110-2023-05-23 "Direct link to 110-2023-05-23") #### 🚀 Features[​](#-features-16 "Direct link to 🚀 Features") * Add option to add event handlers which accept no arguments * Add support for `is_terminal` flag in status message update * Add option to set status message along with `Actor.exit()` #### 🐛 Bug Fixes[​](#-bug-fixes-24 "Direct link to 🐛 Bug Fixes") * Start enforcing local storage to always use the UTF-8 encoding * Fix saving key-value store values to local storage with the right extension for a given content type #### Chore[​](#chore-11 "Direct link to Chore") * Switch from `setup.py` to `pyproject.toml` for specifying project setup ### [1.0.0](https://github.com/apify/apify-sdk-python/releases/tag/v1.0.0) (2023-03-13)[​](#100-2023-03-13 "Direct link to 100-2023-03-13") #### 🐛 Bug Fixes[​](#-bug-fixes-25 "Direct link to 🐛 Bug Fixes") * Fix `RequestQueue` not loading requests from an existing queue properly #### Chore[​](#chore-12 "Direct link to Chore") * Update to `apify-client` 1.0.0 * Start triggering base Docker image builds when releasing a new version ### [0.2.0](https://github.com/apify/apify-sdk-python/releases/tag/v0.2.0) (2023-03-06)[​](#020-2023-03-06 "Direct link to 020-2023-03-06") #### 🚀 Features[​](#-features-17 "Direct link to 🚀 Features") * Add chunking mechanism to push\_data, cleanup TODOs ([#67](https://github.com/apify/apify-sdk-python/pull/67)) ([5f38d51](https://github.com/apify/apify-sdk-python/commit/5f38d51a57912071439ac88405311d2cb7044190)) by [@jirimoravcik](https://github.com/jirimoravcik) ### [0.1.0](https://github.com/apify/apify-sdk-python/releases/tag/v0.1.0) (2023-02-09)[​](#010-2023-02-09 "Direct link to 010-2023-02-09") #### 🚀 Features[​](#-features-18 "Direct link to 🚀 Features") * 
Implement MemoryStorage and local storage clients ([#15](https://github.com/apify/apify-sdk-python/pull/15)) ([b7c9886](https://github.com/apify/apify-sdk-python/commit/b7c98869bdc749feadc7b5a0d105fce041506011)) by [@jirimoravcik](https://github.com/jirimoravcik) * Implement Dataset, KeyValueStore classes, create storage management logic ([#21](https://github.com/apify/apify-sdk-python/pull/21)) ([d1b357c](https://github.com/apify/apify-sdk-python/commit/d1b357cd02f7357137fd9413b105a8ac48b1796b)) by [@jirimoravcik](https://github.com/jirimoravcik) * Implement RequestQueue class ([#25](https://github.com/apify/apify-sdk-python/pull/25)) ([c6cad34](https://github.com/apify/apify-sdk-python/commit/c6cad3442d1a9a37c3eb3991cf45daed03e74ff5)) by [@jirimoravcik](https://github.com/jirimoravcik) * Add test for get\_env and is\_at\_home ([#29](https://github.com/apify/apify-sdk-python/pull/29)) ([cc45afb](https://github.com/apify/apify-sdk-python/commit/cc45afbf848db3626054c599cb3a5a2972a48748)) by [@drobnikj](https://github.com/drobnikj) * Updating pull request toolkit config \[INTERNAL] ([387143c](https://github.com/apify/apify-sdk-python/commit/387143ccf2c32a99c95e9931e5649e558d35daeb)) by [@mtrunkat](https://github.com/mtrunkat) * Add documentation for `StorageManager` and `StorageClientManager`, open\_\* methods in `Actor` ([#34](https://github.com/apify/apify-sdk-python/pull/34)) ([3f6b942](https://github.com/apify/apify-sdk-python/commit/3f6b9426dc03fea40d80af2e4c8f04ecf2620e8a)) by [@jirimoravcik](https://github.com/jirimoravcik) * Add tests for actor lifecycle ([#35](https://github.com/apify/apify-sdk-python/pull/35)) ([4674728](https://github.com/apify/apify-sdk-python/commit/4674728905be5076283ff3795332866e8bef6ee8)) by [@drobnikj](https://github.com/drobnikj) * Add docs for `Dataset`, `KeyValueStore`, and `RequestQueue` ([#37](https://github.com/apify/apify-sdk-python/pull/37)) ([174548e](https://github.com/apify/apify-sdk-python/commit/174548e952b47ee519d1a05c0821a2c42c2fddf6)) by [@jirimoravcik](https://github.com/jirimoravcik) * Docs string for memory storage clients ([#31](https://github.com/apify/apify-sdk-python/pull/31)) ([8f55d46](https://github.com/apify/apify-sdk-python/commit/8f55d463394307b004193efc43b67b44d030f6de)) by [@drobnikj](https://github.com/drobnikj) * Add test for storage actor methods ([#39](https://github.com/apify/apify-sdk-python/pull/39)) ([b89bbcf](https://github.com/apify/apify-sdk-python/commit/b89bbcfdcae4f436a68e92f1f60628aea1036dde)) by [@drobnikj](https://github.com/drobnikj) * Various fixes and improvements ([#41](https://github.com/apify/apify-sdk-python/pull/41)) ([5bae238](https://github.com/apify/apify-sdk-python/commit/5bae238821b3b63c73d0cbadf4b478511cb045d2)) by [@jirimoravcik](https://github.com/jirimoravcik) * Add the rest unit tests for actor ([#40](https://github.com/apify/apify-sdk-python/pull/40)) ([72d92ea](https://github.com/apify/apify-sdk-python/commit/72d92ea080670ceecc234c149058d2ebe763e3a8)) by [@drobnikj](https://github.com/drobnikj) * Decrypt input secrets if there are some ([#45](https://github.com/apify/apify-sdk-python/pull/45)) ([6eb1630](https://github.com/apify/apify-sdk-python/commit/6eb163077341218a3f9dcf566986d7464f6ab09e)) by [@drobnikj](https://github.com/drobnikj) * Add a few integration tests ([#48](https://github.com/apify/apify-sdk-python/pull/48)) ([1843f48](https://github.com/apify/apify-sdk-python/commit/1843f48845e724e1c2682b8d09a6b5c48c57d9ec)) by [@drobnikj](https://github.com/drobnikj) * Add integration 
tests for storages, proxy configuration ([#49](https://github.com/apify/apify-sdk-python/pull/49)) ([fd0566e](https://github.com/apify/apify-sdk-python/commit/fd0566ed3b8c85c7884f8bba3cf7394215fabed0)) by [@jirimoravcik](https://github.com/jirimoravcik) * Unify datetime handling, remove utcnow() ([#52](https://github.com/apify/apify-sdk-python/pull/52)) ([09dd8ac](https://github.com/apify/apify-sdk-python/commit/09dd8ac9dc26afee777f497ed1d2733af1eef848)) by [@jirimoravcik](https://github.com/jirimoravcik) * Separate ID and name params for `Actor.open_xxx` ([#56](https://github.com/apify/apify-sdk-python/pull/56)) ([a1e962e](https://github.com/apify/apify-sdk-python/commit/a1e962ebe74384baabb96fdbb4f0e0ed2f92e454)) by [@jirimoravcik](https://github.com/jirimoravcik) #### 🐛 Bug Fixes[​](#-bug-fixes-26 "Direct link to 🐛 Bug Fixes") * Key error for storage name ([#28](https://github.com/apify/apify-sdk-python/pull/28)) ([83b30a9](https://github.com/apify/apify-sdk-python/commit/83b30a90df4d3b173302f1c6006b346091fced60)) by [@drobnikj](https://github.com/drobnikj) --- # Accessing Apify API The Apify SDK contains many useful features for making Actor development easier. However, it does not cover all the features the Apify API offers. For working with the Apify API directly, you can use the provided instance of the [Apify API Client](https://docs.apify.com/api/client/python) library. ## Actor client[​](#actor-client "Direct link to Actor client") To access the provided instance of [`ApifyClientAsync`](https://docs.apify.com/api/client/python/reference/class/ApifyClientAsync), you can use the [`Actor.apify_client`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#apify_client) property. For example, to get the details of your user, you can use this snippet: ``` from apify import Actor async def main() -> None: async with Actor: # Create a new user client. user_client = Actor.apify_client.user('me') # Get information about the current user. me = await user_client.get() Actor.log.info(f'User: {me}') ``` ## Actor new client[​](#actor-new-client "Direct link to Actor new client") If you want to create a completely new instance of the client, for example, to get a client for a different user or change the configuration of the client, you can use the [`Actor.new_client`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#new_client) method: ``` from apify import Actor TOKEN = 'ANOTHER_USERS_TOKEN' async def main() -> None: async with Actor: # Create a new user client with a custom token. apify_client = Actor.new_client(token=TOKEN, max_retries=2) user_client = apify_client.user('me') # Get information about the other user. them = await user_client.get() Actor.log.info(f'Another user: {them}') ``` --- # Actor configuration The [`Actor`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md) class gets configured using the [`Configuration`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) class, which initializes itself based on the provided environment variables. If you're using the Apify SDK in your Actors on the Apify platform, or Actors running locally through the Apify CLI, you don't need to configure the `Actor` class manually, unless you have some specific requirements; everything will get configured automatically. If you need some special configuration, you can adjust it either through the `Configuration` class directly, or by setting environment variables when running the Actor locally.
To see the full list of configuration options, check the `Configuration` class or the list of environment variables that the Actor understands. ## Configuring from code[​](#configuring-from-code "Direct link to Configuring from code") This will cause the Actor to persist its state every 10 seconds: ``` from datetime import timedelta from apify import Actor, Configuration, Event async def main() -> None: global_config = Configuration.get_global_configuration() global_config.persist_state_interval = timedelta(seconds=10) async with Actor: # Define a handler that will be called for every persist state event. async def save_state() -> None: await Actor.set_value('STATE', 'Hello, world!') # The save_state handler will be called every 10 seconds now. Actor.on(Event.PERSIST_STATE, save_state) ``` ## Configuring via environment variables[​](#configuring-via-environment-variables "Direct link to Configuring via environment variables") All the configuration options can be set via environment variables. The environment variables are prefixed with `APIFY_`, and the configuration options are in uppercase, with underscores as separators. See the [`Configuration`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) API reference for the full list of configuration options. This Actor run will not persist its local storages to the filesystem: ``` APIFY_PERSIST_STORAGE=0 apify run ``` --- # Actor events & state persistence During its runtime, the Actor receives Actor events sent by the Apify platform or generated by the Apify SDK itself. ## Event types[​](#event-types "Direct link to Event types") | Event | Data | Description | | --------------- | ------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `SYSTEM_INFO` | ``` { "created_at": datetime, "cpu_current_usage": float, "mem_current_bytes": int, "is_cpu_overloaded": bool } ``` | This event is emitted regularly and it indicates the current resource usage of the Actor.The `is_cpu_overloaded` argument indicates whether the current CPU usage is higher than `Config.max_used_cpu_ratio` | | `MIGRATING` | `None` | Emitted when the Actor running on the Apify platform is going to be [migrated](https://docs.apify.com/platform/actors/development/state-persistence#what-is-a-migration) to another worker server soon.You can use it to persist the state of the Actor so that once it is executed again on the new server, it doesn't have to start over from the beginning. Once you have persisted the state of your Actor, you can call [`Actor.reboot`](https://docs.apify.com/sdk/python/../../reference/class/Actor#reboot) to reboot the Actor and trigger the migration immediately, to speed up the process. | | `ABORTING` | `None` | When a user aborts an Actor run on the Apify platform, they can choose to abort gracefully to allow the Actor some time before getting killed. 
This graceful abort emits the `ABORTING` event which you can use to finish all running tasks and do cleanup. | | `PERSIST_STATE` | ``` { "is_migrating": bool } ``` | Emitted in regular intervals (by default 60 seconds) to notify the Actor that it should persist its state, in order to avoid repeating all work when the Actor restarts.This event is also emitted automatically when the `MIGRATING` event happens, in which case the `is_migrating` flag is set to `True`.Note that the `PERSIST_STATE` event is provided merely for user convenience, you can achieve the same effect by persisting the state regularly in an interval and listening for the migrating event. | ## Adding handlers to events[​](#adding-handlers-to-events "Direct link to Adding handlers to events") To add handlers to these events, you use the [`Actor.on`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#on) method, and to remove them, you use the [`Actor.off`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#off) method. ``` import asyncio from typing import Any from apify import Actor, Event async def main() -> None: async with Actor: total_items = 1000 # Load the state if it's saved from some previous execution processed_items = 0 actor_state = await Actor.get_value('STATE') if actor_state is not None: processed_items = actor_state # Save the state when the `PERSIST_STATE` event happens async def save_state(event_data: Any) -> None: nonlocal processed_items Actor.log.info('Saving Actor state', extra=event_data) await Actor.set_value('STATE', processed_items) Actor.on(Event.PERSIST_STATE, save_state) # Do some fake work for i in range(processed_items, total_items): Actor.log.info(f'Processing item {i}...') processed_items = i await asyncio.sleep(0.1) # Suppose we can stop saving the state now Actor.off(Event.PERSIST_STATE, save_state) # Do some more fake work, this time something that can't be restarted, # so no point persisting the state for j in range(10): Actor.log.info(f'Processing item {j} of another kind...') await asyncio.sleep(1) ``` --- # Actor input The Actor gets its [input](https://docs.apify.com/platform/actors/running/input) from the input record in its default [key-value store](https://docs.apify.com/platform/storage/key-value-store). To access it, instead of reading the record manually, you can use the [`Actor.get_input`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_input) convenience method. It will get the input record key from the Actor configuration, read the record from the default key-value store,and decrypt any [secret input fields](https://docs.apify.com/platform/actors/development/secret-input). For example, if an Actor received a JSON input with two fields, `{ "firstNumber": 1, "secondNumber": 2 }`, this is how you might process it: ``` from apify import Actor async def main() -> None: async with Actor: actor_input = await Actor.get_input() or {} first_number = actor_input.get('firstNumber', 0) second_number = actor_input.get('secondNumber', 0) Actor.log.info('Sum: %s', first_number + second_number) ``` --- # Actor lifecycle In this guide, we will show you how to manage the lifecycle of an Apify Actor. ## Initialization and cleanup[​](#initialization-and-cleanup "Direct link to Initialization and cleanup") At the start of its runtime, the Actor needs to initialize itself, its event manager and its storages, and at the end of the runtime it needs to close these cleanly. The Apify SDK provides several options on how to manage this. 
The [`Actor.init`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#init) method initializes the Actor, the event manager which processes the Actor events from the platform event websocket, and the storage client used in the execution environment. It should be called before performing any other Actor operations. The [`Actor.exit`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#exit) method then exits the Actor cleanly, tearing down the event manager and the storage client. There is also the [`Actor.fail`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#fail) method, which exits the Actor while marking it as failed. ``` from apify import Actor async def main() -> None: await Actor.init() try: Actor.log.info('Actor input:', await Actor.get_input()) await Actor.set_value('OUTPUT', 'Hello, world!') raise RuntimeError('Ouch!') except Exception as exc: Actor.log.exception('Error while running Actor') await Actor.fail(exit_code=91, exception=exc) await Actor.exit() ``` ### Context manager[​](#context-manager "Direct link to Context manager") So that you don't have to call the lifecycle methods manually, the [`Actor`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md) class provides a context manager, which calls the [`Actor.init`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#init) method on enter, the [`Actor.exit`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#exit) method on a clean exit, and the [`Actor.fail`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#fail) method when there is an exception during the run of the Actor. This is the recommended way to work with the `Actor` class. ``` from apify import Actor async def main() -> None: async with Actor: actor_input = await Actor.get_input() Actor.log.info('Actor input: %s', actor_input) await Actor.set_value('OUTPUT', 'Hello, world!') raise RuntimeError('Ouch!') ``` ## Rebooting an Actor[​](#rebooting-an-actor "Direct link to Rebooting an Actor") Sometimes, you want to restart your Actor to make it run from the beginning again. To do that, you can use the [`Actor.reboot`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#reboot) method. When you call it, the Apify platform stops the container of the run, and starts a new container of the same Actor with the same run ID and storages. Don't do it unconditionally, or you might get the Actor in a reboot loop. ``` from apify import Actor async def main() -> None: async with Actor: # ... your code here ... await Actor.reboot() ``` ## Actor status message[​](#actor-status-message "Direct link to Actor status message") To inform you or the users running your Actors about the progress of their runs, you can set the status message for the run, which will then be visible in the run detail in Apify Console, or accessible through the Apify API. To set the status message for the Actor run, you can use the [`Actor.set_status_message`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#set_status_message) method. ``` from apify import Actor async def main() -> None: async with Actor: await Actor.set_status_message('Here we go!') # Do some work... await Actor.set_status_message('So far so good...') # Do some more work... await Actor.set_status_message('Steady as she goes...') # Do even more work... await Actor.set_status_message('Almost there...') # Finish the job await Actor.set_status_message('Phew! 
That was not that hard!') ``` --- # Interacting with other Actors There are several methods that interact with other Actors and Actor tasks on the Apify platform. ## Actor start[​](#actor-start "Direct link to Actor start") The [`Actor.start`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#start) method starts another Actor on the Apify platform, and immediately returns the details of the started Actor run. ``` from apify import Actor async def main() -> None: async with Actor: # Start your own Actor named 'my-fancy-actor'. actor_run = await Actor.start( actor_id='~my-fancy-actor', run_input={'foo': 'bar'}, ) # Log the Actor run ID. Actor.log.info(f'Actor run ID: {actor_run.id}') ``` ## Actor call[​](#actor-call "Direct link to Actor call") The [`Actor.call`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#call) method starts another Actor on the Apify platform, and waits for the started Actor run to finish. ``` from apify import Actor async def main() -> None: async with Actor: # Start the apify/screenshot-url Actor. actor_run = await Actor.call( actor_id='apify/screenshot-url', run_input={'url': 'http://example.com', 'delay': 10000}, ) if actor_run is None: raise RuntimeError('Actor task failed to start.') # Wait for the Actor run to finish. run_client = Actor.apify_client.run(actor_run.id) await run_client.wait_for_finish() # Get the Actor output from the key-value store. kvs_client = run_client.key_value_store() output = await kvs_client.get_record('OUTPUT') Actor.log.info(f'Actor output: {output}') ``` ## Actor call task[​](#actor-call-task "Direct link to Actor call task") The [`Actor.call_task`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#call_task) method starts an [Actor task](https://docs.apify.com/platform/actors/tasks) on the Apify platform, and waits for the started Actor run to finish. ``` from apify import Actor async def main() -> None: async with Actor: # Start the Actor task by its ID. actor_run = await Actor.call_task(task_id='Z3m6FPSj0GYZ25rQc') if actor_run is None: raise RuntimeError('Actor task failed to start.') # Wait for the task run to finish. run_client = Actor.apify_client.run(actor_run.id) await run_client.wait_for_finish() # Get the task run dataset items dataset_client = run_client.dataset() items = await dataset_client.list_items() Actor.log.info(f'Task run dataset items: {items}') ``` ## Actor metamorph[​](#actor-metamorph "Direct link to Actor metamorph") The [`Actor.metamorph`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#metamorph) operation transforms an Actor run into a run of another Actor with a new input. This feature is useful if you want to use another Actor to finish the work of your current Actor, instead of internally starting a new Actor run and waiting for its finish. With metamorph, you can easily create new Actors on top of existing ones, and give your users nicer input structure and user interface for the final Actor. For the users of your Actors, the metamorph operation is completely transparent; they will just see your Actor got the work done. Internally, the system stops the container corresponding to the original Actor run and starts a new container using a different container image. All the default storages are preserved,and the new Actor input is stored under the `INPUT-METAMORPH-1` key in the same default key-value store. 
To make your Actor compatible with the metamorph operation, use [`Actor.get_input`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_input) instead of [`Actor.get_value('INPUT')`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_value) to read your Actor input. This method will fetch the input using the right key in the case of a metamorphed run. For example, imagine you have an Actor that accepts a hotel URL on input, and then internally uses the [`apify/web-scraper`](https://apify.com/apify/web-scraper) public Actor to scrape all the hotel reviews. The metamorphing code would look as follows: ``` from apify import Actor async def main() -> None: async with Actor: # Get the original Actor input. actor_input = await Actor.get_input() or {} hotel_url = actor_input.get('hotel_url') # Create new input for apify/web-scraper Actor. web_scraper_input = { 'startUrls': [{'url': hotel_url}], 'pageFunction': """async function pageFunction(context) { // Here you pass the JavaScript page function // that scrapes all the reviews from the hotel's URL }""", } # Metamorph the Actor run to `apify/web-scraper` with the new input. await Actor.metamorph('apify/web-scraper', web_scraper_input) # This code will not be called, since the `metamorph` action terminates # the current Actor run container. Actor.log.info('You will not see this!') ``` --- # Logging The Apify SDK logs useful information through the [`logging`](https://docs.python.org/3/library/logging.html) module from Python's standard library, into the logger with the name `apify`. ## Automatic configuration[​](#automatic-configuration "Direct link to Automatic configuration") When you create an Actor from an Apify-provided template, either in Apify Console or through the Apify CLI, you do not have to configure the logger yourself. The template already contains initialization code for the logger, which sets the logger level to `DEBUG` and the log formatter to [`ActorLogFormatter`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorLogFormatter.md). ## Manual configuration[​](#manual-configuration "Direct link to Manual configuration") ### Configuring the log level[​](#configuring-the-log-level "Direct link to Configuring the log level") In Python's default behavior, if you don't configure the logger otherwise, only logs with level `WARNING` or higher are printed out to the standard output, without any formatting. To also have logs with `DEBUG` and `INFO` levels printed out, you need to call the [`Logger.setLevel`](https://docs.python.org/3/library/logging.html#logging.Logger.setLevel) method on the logger, with the desired minimum level as an argument. ### Configuring the log formatting[​](#configuring-the-log-formatting "Direct link to Configuring the log formatting") By default, only the log message is printed out to the output, without any formatting. To have a nicer output, with the log level printed in color, the messages nicely aligned, and extra log fields printed out, you can use the [`ActorLogFormatter`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorLogFormatter.md) class from the `apify.log` module.
### Example log configuration[​](#example-log-configuration "Direct link to Example log configuration") To configure and test the logger, you can use this snippet: ``` import logging from apify.log import ActorLogFormatter async def main() -> None: handler = logging.StreamHandler() handler.setFormatter(ActorLogFormatter()) apify_logger = logging.getLogger('apify') apify_logger.setLevel(logging.DEBUG) apify_logger.addHandler(handler) ``` This configuration will cause all levels of messages to be printed to the standard output, with some pretty formatting. ## Logger usage[​](#logger-usage "Direct link to Logger usage") Here you can see what all the log levels look like. You can use the `extra` argument for all log levels; it's not specific to the warning level. When you use `Logger.exception`, there is no need to pass the Exception object to the log manually; it will automatically infer it from the current execution context and print the exception details. ``` import logging from apify import Actor from apify.log import ActorLogFormatter async def main() -> None: handler = logging.StreamHandler() handler.setFormatter(ActorLogFormatter()) apify_logger = logging.getLogger('apify') apify_logger.setLevel(logging.DEBUG) apify_logger.addHandler(handler) async with Actor: Actor.log.debug('This is a debug message') Actor.log.info('This is an info message') Actor.log.warning('This is a warning message', extra={'reason': 'Bad Actor!'}) Actor.log.error('This is an error message') try: raise RuntimeError('Ouch!') except RuntimeError: Actor.log.exception('This is an exceptional message') ``` Result: ``` DEBUG This is a debug message INFO This is an info message WARN This is a warning message ({"reason": "Bad Actor!"}) ERROR This is an error message ERROR This is an exceptional message Traceback (most recent call last): File "main.py", line 6, in raise RuntimeError('Ouch!') RuntimeError: Ouch! ``` ## Redirect logs from other Actor runs[​](#redirect-logs-from-other-actor-runs "Direct link to Redirect logs from other Actor runs") In some situations, one Actor is going to start one or more other Actors and wait for them to finish and produce some results. In such cases, you might want to redirect the logs and status messages of the started Actor runs back to the parent Actor run, so that you can see the progress of the started Actors' runs in the parent Actor's logs. This guide shows the available ways to do it. ### Redirecting logs from Actor.call[​](#redirecting-logs-from-actorcall "Direct link to Redirecting logs from Actor.call") A typical use case for log redirection is calling another Actor using the [`Actor.call`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#call) method. This method has an optional `logger` argument, which is by default set to the `default` literal. This means that the logs of the called Actor will be automatically redirected to the parent Actor's logs with default formatting and filtering. If you set the `logger` argument to `None`, then no log redirection happens. The third option is to pass your own `Logger` instance with the possibility to define your own formatter, filter, and handler. Below you can see those three possible ways of log redirection when starting another Actor run through [`Actor.call`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#call).
``` import logging from apify import Actor async def main() -> None: async with Actor: # Default redirect logger await Actor.call(actor_id='some_actor_id') # No redirect logger await Actor.call(actor_id='some_actor_id', logger=None) # Custom redirect logger await Actor.call( actor_id='some_actor_id', logger=logging.getLogger('custom_logger') ) ``` Each default redirect logger log entry will have a specific format. After the timestamp, it will contain cyan colored text that will contain the redirect information - the other actor's name and the run ID. The rest of the log message will be printed in the same manner as the parent Actor's logger is configured. The log redirection can be deep, meaning that if the other actor also starts another actor and is redirecting logs from it, then in the top-level Actor, you can see it as well. See the following example screenshot of the Apify log console when one actor recursively starts itself (there are 2 levels of recursion in the example). ![Console with redirected logs](/sdk/python/assets/images/redirected_logs_example-56d852dcd17849fecc65a2eb72cab7e3.webp "Example of console with redirected logs from recursively started actor.") ### Redirecting logs from already running Actor run[​](#redirecting-logs-from-already-running-actor-run "Direct link to Redirecting logs from already running Actor run") In some cases, you might want to connect to an already running Actor run and redirect its logs to your current Actor run. This can be done using the [ApifyClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#apify_client) and getting the streamed log from a specific Actor run. You can then use it as a context manager, and the log redirection will be active in the context, or you can control the log redirection manually by explicitly calling `start` and `stop` methods. You can further decide whether you want to redirect just new logs of the ongoing Actor run, or if you also want to redirect historical logs from that Actor's run, so all logs it has produced since it was started. Both options are shown in the example code below. ``` import asyncio from apify import Actor async def main() -> None: async with Actor: # Lifecycle of redirected logs is handled by the context manager. async with await Actor.apify_client.run('some_actor_id').get_streamed_log( # Redirect all logs from the start of that run, even the logs from past. from_start=True ): await asyncio.sleep(5) # Logging will stop out of context # Lifecycle of redirected logs can be handled manually. streamed_log = await Actor.apify_client.run('some_id').get_streamed_log( # Do not redirect historical logs from this actor run. # Redirect only new logs from now on. from_start=False ) streamed_log.start() await asyncio.sleep(5) await streamed_log.stop() ``` --- # Pay-per-event monetization Apify provides several [pricing models](https://docs.apify.com/platform/actors/publishing/monetize) for monetizing your Actors. The most recent and most flexible one is [pay-per-event](https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event), which lets you charge your users programmatically directly from your Actor. As the name suggests, you may charge the users each time a specific event occurs, for example a call to an external API or when you return a result. To use the pay-per-event pricing model, you first need to [set it up](https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event) for your Actor in the Apify console. 
After that, you're free to start charging for events. ## Charging for events[​](#charging-for-events "Direct link to Charging for events") After monetization is set in the Apify console, you can add [`Actor.charge`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#charge) calls to your code and start monetizing! ``` from apify import Actor async def main() -> None: async with Actor: # Charge for a single occurrence of an event await Actor.charge(event_name='init') # Prepare some mock results result = [ {'word': 'Lorem'}, {'word': 'Ipsum'}, {'word': 'Dolor'}, {'word': 'Sit'}, {'word': 'Amet'}, ] # Shortcut for charging for each pushed dataset item await Actor.push_data(result, 'result-item') # Or you can charge for a given number of events manually await Actor.charge( event_name='result-item', count=len(result), ) ``` Then you just push your code to Apify and that's it! The SDK will even keep track of the max total charge setting for you, so you will not provide more value than what the user chose to pay for. If you need finer control over charging, you can call [`Actor.get_charging_manager()`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_charging_manager) to access the [`ChargingManager`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md), which can provide more detailed information, for example how many events of each type can be charged before reaching the configured limit. ## Transitioning from a different pricing model[​](#transitioning-from-a-different-pricing-model "Direct link to Transitioning from a different pricing model") When you plan to start using the pay-per-event pricing model for an Actor that is already monetized with a different pricing model, your source code will need to support both pricing models during the transition period enforced by the Apify platform. Arguably the most frequent case is the transition from the pay-per-result model, which utilizes the `ACTOR_MAX_PAID_DATASET_ITEMS` environment variable to prevent returning unpaid dataset items. The following is an example of how to handle such scenarios. The key part is the [`ChargingManager.get_pricing_info()`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#get_pricing_info) method which returns information about the current pricing model. ``` from apify import Actor async def main() -> None: async with Actor: # Check the dataset because there might already be items # if the run migrated or was restarted default_dataset = await Actor.open_dataset() dataset_info = await default_dataset.get_info() charged_items = dataset_info.item_count if dataset_info else 0 if Actor.get_charging_manager().get_pricing_info().is_pay_per_event: await Actor.push_data({'hello': 'world'}, 'dataset-item') elif charged_items < (Actor.config.max_paid_dataset_items or 0): await Actor.push_data({'hello': 'world'}) charged_items += 1 ``` ## Local development[​](#local-development "Direct link to Local development") You are encouraged to test your monetization code on your machine before releasing it to the public. To tell your Actor that it should work in pay-per-event mode, pass it the `ACTOR_TEST_PAY_PER_EVENT` environment variable: ``` ACTOR_TEST_PAY_PER_EVENT=true python -m youractor ``` If you also wish to see a log of all the events charged throughout the run, the Apify SDK keeps a log of charged events in a so-called charging dataset.
Your charging dataset can be found under the `charging_log` name (unless you change your storage settings, this dataset is stored in `storage/datasets/charging_log/`). Please note that this log is not available when running the Actor in production on the Apify platform. Because the pricing configuration is stored by the Apify platform and is not available locally, all events will have a default price of $1. --- # Proxy management [IP address blocking](https://en.wikipedia.org/wiki/IP_address_blocking) is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy-to-use but powerful tools which can work around IP blocking. The most powerful weapon in your anti-IP-blocking arsenal is a [proxy server](https://en.wikipedia.org/wiki/Proxy_server). With the Apify SDK, you can use your own proxy servers, proxy servers acquired from third-party providers, or you can rely on [Apify Proxy](https://apify.com/proxy) for your scraping needs. ## Quick start[​](#quick-start "Direct link to Quick start") If you want to use Apify Proxy locally, make sure that you run your Actors via the Apify CLI and that you are [logged in](https://docs.apify.com/cli/docs/installation#login-with-your-apify-account) with your Apify account in the CLI. ### Using Apify proxy[​](#using-apify-proxy "Direct link to Using Apify proxy") ``` from apify import Actor async def main() -> None: async with Actor: proxy_configuration = await Actor.create_proxy_configuration() if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url = await proxy_configuration.new_url() Actor.log.info(f'Using proxy URL: {proxy_url}') ``` ### Using your own proxies[​](#using-your-own-proxies "Direct link to Using your own proxies") ``` from apify import Actor async def main() -> None: async with Actor: proxy_configuration = await Actor.create_proxy_configuration( proxy_urls=[ 'http://proxy-1.com', 'http://proxy-2.com', ], ) if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url = await proxy_configuration.new_url() Actor.log.info(f'Using proxy URL: {proxy_url}') ``` ## Proxy configuration[​](#proxy-configuration "Direct link to Proxy configuration") All your proxy needs are managed by the [`ProxyConfiguration`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md) class. You create an instance using the [`Actor.create_proxy_configuration()`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#create_proxy_configuration) method. Then you generate proxy URLs using the [`ProxyConfiguration.new_url()`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md#new_url) method. ### Apify proxy vs. your own proxies[​](#apify-proxy-vs-your-own-proxies "Direct link to Apify proxy vs. your own proxies") The `ProxyConfiguration` class covers both Apify Proxy and custom proxy URLs, so that you can easily switch between proxy providers. However, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy. It's not a single proxy server, but an API endpoint that allows connection through millions of different IP addresses. So the class essentially has two modes: Apify Proxy or Your proxy. The difference is easy to remember. Using the `proxy_urls` or `new_url_function` arguments enables use of your custom proxy URLs, whereas all the other options are there to configure Apify Proxy.
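To make the two modes concrete, the following minimal sketch only recombines the `proxy_urls`, `groups` and `country_code` arguments already shown in the examples above; it creates one configuration of each kind side by side:

```
from apify import Actor


async def main() -> None:
    async with Actor:
        # "Your proxy" mode: only the URLs you supply are used and rotated.
        own_proxy_configuration = await Actor.create_proxy_configuration(
            proxy_urls=[
                'http://proxy-1.com',
                'http://proxy-2.com',
            ],
        )

        # "Apify Proxy" mode: all the other options configure Apify Proxy.
        apify_proxy_configuration = await Actor.create_proxy_configuration(
            groups=['RESIDENTIAL'],
            country_code='US',
        )
```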
Visit the [Apify Proxy docs](https://docs.apify.com/proxy) for more info on how these parameters work. ### IP rotation and session management[​](#ip-rotation-and-session-management "Direct link to IP rotation and session management") `ProxyConfiguration.new_url` allows you to pass a `session_id` parameter. It will then be used to create a `session_id`-`proxy_url` pair, and subsequent `new_url()` calls with the same `session_id` will always return the same `proxy_url`. This is extremely useful in scraping, because you want to create the impression of a real user. When no `session_id` is provided, your custom proxy URLs are rotated round-robin, whereas Apify Proxy manages their rotation using black magic to get the best performance. ``` from apify import Actor async def main() -> None: async with Actor: proxy_configuration = await Actor.create_proxy_configuration( proxy_urls=[ 'http://proxy-1.com', 'http://proxy-2.com', ], ) if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url = await proxy_configuration.new_url() # http://proxy-1.com proxy_url = await proxy_configuration.new_url() # http://proxy-2.com proxy_url = await proxy_configuration.new_url() # http://proxy-1.com proxy_url = await proxy_configuration.new_url() # http://proxy-2.com proxy_url = await proxy_configuration.new_url( session_id='a' ) # http://proxy-1.com proxy_url = await proxy_configuration.new_url( session_id='b' ) # http://proxy-2.com proxy_url = await proxy_configuration.new_url( session_id='b' ) # http://proxy-2.com proxy_url = await proxy_configuration.new_url( session_id='a' ) # http://proxy-1.com ``` ### Apify proxy configuration[​](#apify-proxy-configuration "Direct link to Apify proxy configuration") With Apify Proxy, you can select specific proxy groups to use, or countries to connect from. This allows you to get better proxy performance after some initial research. ``` from apify import Actor async def main() -> None: async with Actor: proxy_configuration = await Actor.create_proxy_configuration( groups=['RESIDENTIAL'], country_code='US', ) if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url = await proxy_configuration.new_url() Actor.log.info(f'Proxy URL: {proxy_url}') ``` Now your connections using proxy\_url will use only Residential proxies from the US. Note that you must first get access to a proxy group before you are able to use it. You can find your available proxy groups in the [proxy dashboard](https://console.apify.com/proxy). If you don't specify any proxy groups, automatic proxy selection will be used. ### Your own proxy configuration[​](#your-own-proxy-configuration "Direct link to Your own proxy configuration") There are two options how to make `ProxyConfiguration` work with your own proxies. 
Either you can pass it a list of your own proxy servers: ``` from apify import Actor async def main() -> None: async with Actor: proxy_configuration = await Actor.create_proxy_configuration( proxy_urls=[ 'http://proxy-1.com', 'http://proxy-2.com', ], ) if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url = await proxy_configuration.new_url() Actor.log.info(f'Using proxy URL: {proxy_url}') ``` Or you can pass it a method (accepting one optional argument, the session ID), to generate proxy URLs automatically: ``` from __future__ import annotations from apify import Actor, Request async def custom_new_url_function( session_id: str | None = None, _: Request | None = None, ) -> str | None: if session_id is not None: return f'http://my-custom-proxy-supporting-sessions.com?session-id={session_id}' return 'http://my-custom-proxy-not-supporting-sessions.com' async def main() -> None: async with Actor: proxy_configuration = await Actor.create_proxy_configuration( new_url_function=custom_new_url_function, # type: ignore[arg-type] ) if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url_with_session = await proxy_configuration.new_url('a') Actor.log.info(f'Using proxy URL: {proxy_url_with_session}') proxy_url_without_session = await proxy_configuration.new_url() Actor.log.info(f'Using proxy URL: {proxy_url_without_session}') ``` ### Configuring proxy based on Actor input[​](#configuring-proxy-based-on-actor-input "Direct link to Configuring proxy based on Actor input") To make selecting the proxies that the Actor uses easier, you can use an input field with the editor [`proxy` in your input schema](https://docs.apify.com/platform/actors/development/input-schema#object). This input will then be filled with a dictionary containing the proxy settings you or the users of your Actor selected for the Actor run. 
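Before the example, here is roughly what the filled-in proxy input tends to look like. The field names below (`useApifyProxy`, `apifyProxyGroups`, `apifyProxyCountry`, `proxyUrls`) are the usual ones produced by the proxy editor, but treat this as an illustrative sketch rather than a complete specification:

```
# A rough sketch of the dictionary the proxy input editor produces.

# When the user selects Apify Proxy:
proxy_settings_with_apify_proxy = {
    'useApifyProxy': True,
    'apifyProxyGroups': ['RESIDENTIAL'],
    'apifyProxyCountry': 'US',
}

# When the user supplies their own proxy URLs instead:
proxy_settings_with_own_proxies = {
    'useApifyProxy': False,
    'proxyUrls': ['http://proxy-1.com', 'http://proxy-2.com'],
}
```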
You can then use that input to create the proxy configuration: ``` from apify import Actor async def main() -> None: async with Actor: actor_input = await Actor.get_input() or {} proxy_settings = actor_input.get('proxySettings') proxy_configuration = await Actor.create_proxy_configuration( actor_proxy_input=proxy_settings ) if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url = await proxy_configuration.new_url() Actor.log.info(f'Using proxy URL: {proxy_url}') ``` ## Using the generated proxy URLs[​](#using-the-generated-proxy-urls "Direct link to Using the generated proxy URLs") ### HTTPX[​](#httpx "Direct link to HTTPX") To use the generated proxy URLs with the `httpx` library, pass them via the [`proxy`](https://www.python-httpx.org/advanced/#http-proxying) argument: ``` import httpx from apify import Actor async def main() -> None: async with Actor: proxy_configuration = await Actor.create_proxy_configuration( proxy_urls=[ 'http://proxy-1.com', 'http://proxy-2.com', ], ) if not proxy_configuration: raise RuntimeError('No proxy configuration available.') proxy_url = await proxy_configuration.new_url() async with httpx.AsyncClient(proxy=proxy_url) as httpx_client: response = await httpx_client.get('http://example.com') Actor.log.info(f'Response: {response}') ``` Make sure you have the `httpx` library installed: ``` pip install httpx ``` --- # Running webserver in your Actor Each Actor run on the Apify platform is assigned a unique hard-to-guess URL (for example `https://8segt5i81sokzm.runs.apify.net`), which enables HTTP access to an optional web server running inside the Actor run's container. The URL is available in the following places: * In Apify Console, on the Actor run details page as the **Container URL** field. * In the API as the `container_url` property of the [Run object](https://docs.apify.com/api/v2#/reference/actors/run-object/get-run). * In the Actor as the `Actor.config.container_url` property. The web server running inside the container must listen at the port defined by the `Actor.config.web_server_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`. ## Example[​](#example "Direct link to Example") The following example demonstrates how to start a simple web server in your Actor, which will respond to every GET request with the number of items that the Actor has processed so far: ``` import asyncio from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer from apify import Actor processed_items = 0 http_server = None # Just a simple handler that will print the number of processed items so far # on every GET request. class RequestHandler(BaseHTTPRequestHandler): def do_GET(self) -> None: self.log_request() self.send_response(200) self.end_headers() self.wfile.write(bytes(f'Processed items: {processed_items}', encoding='utf-8')) def run_server() -> None: # Start the HTTP server on the provided port, # and save a reference to the server. global http_server with ThreadingHTTPServer( ('', Actor.config.web_server_port), RequestHandler ) as server: Actor.log.info(f'Server running on {Actor.config.web_server_port}') http_server = server server.serve_forever() async def main() -> None: global processed_items async with Actor: # Start the HTTP server in a separate thread. run_server_task = asyncio.get_running_loop().run_in_executor(None, run_server) # Simulate doing some work.
for _ in range(100): await asyncio.sleep(1) processed_items += 1 Actor.log.info(f'Processed items: {processed_items}') if http_server is None: raise RuntimeError('HTTP server not started') # Signal the HTTP server to shut down, and wait for it to finish. http_server.shutdown() await run_server_task ``` --- # Working with storages The `Actor` class provides methods to work either with the default storages of the Actor, or with any other storage, named or unnamed. ## Types of storages[​](#types-of-storages "Direct link to Types of storages") There are three types of storages available to Actors. First are [datasets](https://docs.apify.com/platform/storage/dataset), which are append-only tables for storing the results of your Actors. You can open a dataset through the [`Actor.open_dataset`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_dataset) method, and work with it through the resulting [`Dataset`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md) class instance. Next there are [key-value stores](https://docs.apify.com/platform/storage/key-value-store), which function as a read/write storage for storing file-like objects, typically the Actor state or binary results. You can open a key-value store through the [`Actor.open_key_value_store`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_key_value_store) method, and work with it through the resulting [`KeyValueStore`](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md) class instance. Finally, there are [request queues](https://docs.apify.com/platform/storage/request-queue). These are queues into which you can put the URLs you want to scrape, and from which the Actor can dequeue them and process them. You can open a request queue through the [`Actor.open_request_queue`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_request_queue) method, and work with it through the resulting [`RequestQueue`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md) class instance. Each Actor run has its default dataset, default key-value store and default request queue. ## Local storage emulation[​](#local-storage-emulation "Direct link to Local storage emulation") To be able to develop Actors locally, the storages that the Apify platform provides are emulated on the local filesystem. The storage contents are loaded from and saved to the `storage` folder in the Actor's main folder. Each storage type is stored in its own subfolder, so for example datasets are stored in the `storage/datasets` folder. Each storage is then stored in its own folder, named after the storage, or called `default` if it's the default storage. For example, a request queue with the name `my-queue` would be stored in `storage/request_queues/my-queue`. Each dataset item, key-value store record, or request in a request queue is then stored in its own file in the storage folder. Dataset items and request queue requests are always JSON files, and key-value store records can be any file type, based on its content type. For example, the Actor input is typically stored in `storage/key_value_stores/default/INPUT.json`. ## Local Actor run with remote storage[​](#local-actor-run-with-remote-storage "Direct link to Local Actor run with remote storage") When developing locally, opening any storage will by default use local storage. 
To change this behavior and use remote storage, you have to use the `force_cloud=True` argument in [`Actor.open_dataset`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_dataset), [`Actor.open_request_queue`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_request_queue) or [`Actor.open_key_value_store`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_key_value_store). Proper use of this argument allows you to work with both local and remote storages. Calling another Actor and accessing its default storage, which lives on the Apify platform, is a typical use case for the `force_cloud=True` argument. ### Local storage persistence[​](#local-storage-persistence "Direct link to Local storage persistence") By default, the storage contents are persisted across multiple Actor runs. To clean up the Actor storages before running the Actor, use the `--purge` flag of the [`apify run`](https://docs.apify.com/cli/docs/reference#apify-run) command of the Apify CLI. ``` apify run --purge ``` ## Convenience methods for working with default storages[​](#convenience-methods-for-working-with-default-storages "Direct link to Convenience methods for working with default storages") There are several methods for directly working with the default key-value store or default dataset of the Actor. * [`Actor.get_value('my-record')`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_value) reads a record from the default key-value store of the Actor. * [`Actor.set_value('my-record', 'my-value')`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#set_value) saves a new value to the record in the default key-value store. * [`Actor.get_input`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_input) reads the Actor input from the default key-value store of the Actor. * [`Actor.push_data([{'result': 'Hello, world!'}, ...])`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#push_data) saves results to the default dataset of the Actor. A short sketch showing these convenience methods together follows after the next paragraph. ## Opening named and unnamed storages[​](#opening-named-and-unnamed-storages "Direct link to Opening named and unnamed storages") The [`Actor.open_dataset`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_dataset), [`Actor.open_key_value_store`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_key_value_store) and [`Actor.open_request_queue`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_request_queue) methods can be used to open any storage for reading and writing. You can either use them without arguments to open the default storages, or you can pass a storage ID or name to open another storage.
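Here is the short sketch of the convenience methods mentioned above. It only combines the calls from the bullet list into one Actor, so you can see them together before moving on to the more general `open_*` methods shown in the next example:

```
from apify import Actor


async def main() -> None:
    async with Actor:
        # Read the Actor input from the default key-value store.
        actor_input = await Actor.get_input() or {}
        Actor.log.info(f'Actor input: {actor_input}')

        # Write and read a record in the default key-value store.
        await Actor.set_value('my-record', 'my-value')
        my_record = await Actor.get_value('my-record')
        Actor.log.info(f'my-record: {my_record}')

        # Save results to the default dataset.
        await Actor.push_data([{'result': 'Hello, world!'}])
```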
``` from apify import Actor, Request async def main() -> None: async with Actor: # Work with the default dataset of the Actor dataset = await Actor.open_dataset() await dataset.push_data({'result': 'Hello, world!'}) # Work with the key-value store with ID 'mIJVZsRQrDQf4rUAf' key_value_store = await Actor.open_key_value_store(id='mIJVZsRQrDQf4rUAf') await key_value_store.set_value('record', 'Hello, world!') # Work with the request queue with the name 'my-queue' request_queue = await Actor.open_request_queue(name='my-queue') await request_queue.add_request(Request.from_url('https://apify.com')) ``` ## Deleting storages[​](#deleting-storages "Direct link to Deleting storages") To delete a storage, you can use the [`Dataset.drop`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#drop), [`KeyValueStore.drop`](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#drop) or [`RequestQueue.drop`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#drop) methods. ``` from apify import Actor async def main() -> None: async with Actor: # Open a key-value store with the name 'my-cool-store' key_value_store = await Actor.open_key_value_store(name='my-cool-store') await key_value_store.set_value('record', 'Hello, world!') # Do something ... # Now we don't want it anymore await key_value_store.drop() ``` ## Working with datasets[​](#working-with-datasets "Direct link to Working with datasets") In this section we will show you how to work with [datasets](https://docs.apify.com/platform/storage/dataset). ### Reading & writing items[​](#reading--writing-items "Direct link to Reading & writing items") To write data into a dataset, you can use the [`Dataset.push_data`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#push_data) method. To read data from a dataset, you can use the [`Dataset.get_data`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#get_data) method. To get an iterator of the data, you can use the [`Dataset.iterate_items`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#iterate_items) method. ``` from apify import Actor async def main() -> None: async with Actor: # Open a dataset and write some data in it dataset = await Actor.open_dataset(name='my-cool-dataset') await dataset.push_data([{'itemNo': i} for i in range(1000)]) # Read back the first half of the data first_half = await dataset.get_data(limit=500) Actor.log.info(f'The first half of items = {first_half.items}') # Iterate over the second half second_half = [item async for item in dataset.iterate_items(offset=500)] Actor.log.info(f'The second half of items = {second_half}') ``` ### Exporting items[​](#exporting-items "Direct link to Exporting items") You can also export the dataset items into a key-value store, as either a CSV or a JSON record, using the [`Dataset.export_to_csv`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#export_to_csv) or [`Dataset.export_to_json`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#export_to_json) method. 
``` from apify import Actor async def main() -> None: async with Actor: # Open a dataset and write some data in it dataset = await Actor.open_dataset(name='my-cool-dataset') await dataset.push_data([{'itemNo': i} for i in range(1000)]) # Export the data as CSV await dataset.export_to( content_type='csv', key='data.csv', to_key_value_store_name='my-cool-key-value-store', ) # Export the data as JSON await dataset.export_to( content_type='json', key='data.json', to_key_value_store_name='my-cool-key-value-store', ) # Print the exported records store = await Actor.open_key_value_store(name='my-cool-key-value-store') csv_data = await store.get_value('data.csv') Actor.log.info(f'CSV data: {csv_data}') json_data = await store.get_value('data.json') Actor.log.info(f'JSON data: {json_data}') ``` ## Working with key-value stores[​](#working-with-key-value-stores "Direct link to Working with key-value stores") In this section we will show you how to work with [key-value stores](https://docs.apify.com/platform/storage/key-value-store). ### Reading and writing records[​](#reading-and-writing-records "Direct link to Reading and writing records") To read records from a key-value store, you can use the [`KeyValueStore.get_value`](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#get_value) method. To write records into a key-value store, you can use the [`KeyValueStore.set_value`](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#set_value) method. You can set the content type of a record with the `content_type` argument. To delete a record, set its value to `None`. ``` from apify import Actor async def main() -> None: async with Actor: # Open a named key-value store kvs = await Actor.open_key_value_store(name='my-cool-key-value-store') # Write some data to it await kvs.set_value('automatic_text', 'abcd') await kvs.set_value('automatic_json', {'ab': 'cd'}) await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv') # Get the values and log them automatic_text = await kvs.get_value('automatic_text') Actor.log.info(f'Automatic text: {automatic_text}') automatic_json = await kvs.get_value('automatic_json') Actor.log.info(f'Automatic JSON: {automatic_json}') explicit_csv = await kvs.get_value('explicit_csv') Actor.log.info(f'Explicit CSV: {explicit_csv}') # Delete the `automatic_text` value await kvs.set_value('automatic_text', None) ``` ### Iterating keys[​](#iterating-keys "Direct link to Iterating keys") To get an iterator of the key-value store record keys, you can use the [`KeyValueStore.iterate_keys`](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#iterate_keys) method. ``` from apify import Actor async def main() -> None: async with Actor: # Open a named key-value store kvs = await Actor.open_key_value_store(name='my-cool-key-value-store') # Write some data to it await kvs.set_value('automatic_text', 'abcd') await kvs.set_value('automatic_json', {'ab': 'cd'}) await kvs.set_value('explicit_csv', 'a,b\nc,d', content_type='text/csv') # Print the info for each record Actor.log.info('Records in store:') async for key, info in kvs.iterate_keys(): Actor.log.info(f'key={key}, info={info}') ``` ### Public URLs of records[​](#public-urls-of-records "Direct link to Public URLs of records") To get a publicly accessible URL of a key-value store record, you can use the [`KeyValueStore.get_public_url`](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#get_public_url) method. 
``` from apify import Actor async def main() -> None: async with Actor: # Open a named key-value store store = await Actor.open_key_value_store(name='my-cool-key-value-store') # Get the public URL of a record my_record_url = await store.get_public_url('my_record') Actor.log.info(f'URL of "my_record": {my_record_url}') ``` ## Working with request queues[​](#working-with-request-queues "Direct link to Working with request queues") In this section we will show you how to work with [request queues](https://docs.apify.com/platform/storage/request-queue). ### Adding requests to a queue[​](#adding-requests-to-a-queue "Direct link to Adding requests to a queue") To add a request into the queue, you can use the [`RequestQueue.add_request`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#add_request) method. You can use the `forefront` boolean argument to specify whether the request should go to the beginning of the queue, or to the end. You can use the `unique_key` of the request to uniquely identify a request. If you try to add more requests with the same unique key, only the first one will be added. Check out the [`Request`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md) for more information on how to create requests and what properties they have. ### Reading requests[​](#reading-requests "Direct link to Reading requests") To fetch the next request from the queue for processing, you can use the [`RequestQueue.fetch_next_request`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#fetch_next_request) method. To get info about a specific request from the queue, you can use the [`RequestQueue.get_request`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#get_request) method. ### Handling requests[​](#handling-requests "Direct link to Handling requests") To mark a request as handled, you can use the [`RequestQueue.mark_request_as_handled`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#mark_request_as_handled) method. To mark a request as not handled, so that it gets retried, you can use the [`RequestQueue.reclaim_request`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#reclaim_request) method. To check if all the requests in the queue are handled, you can use the [`RequestQueue.is_finished`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#is_finished) method. 
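Before the full example below, here is a minimal sketch of the `forefront` and `unique_key` behavior described above. The `was_already_present` field on the returned object is an assumption about the shape of the [`RequestQueue.add_request`](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#add_request) result, so treat that part as illustrative:

```
from apify import Actor, Request


async def main() -> None:
    async with Actor:
        queue = await Actor.open_request_queue()

        # Add a request to the end of the queue...
        await queue.add_request(Request.from_url('http://example.com/1'))

        # ...and another one to the beginning of the queue.
        await queue.add_request(
            Request.from_url('http://example.com/0'), forefront=True
        )

        # Adding a request with the same unique key (by default derived from
        # the URL) again does nothing; the returned info reflects that.
        # Note: `was_already_present` is assumed here for illustration.
        info = await queue.add_request(Request.from_url('http://example.com/1'))
        Actor.log.info(f'Was already present: {info.was_already_present}')
```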
### Full example[​](#full-example "Direct link to Full example") ``` import asyncio import random from apify import Actor, Request FAILURE_RATE = 0.3 async def main() -> None: async with Actor: # Open the queue queue = await Actor.open_request_queue() # Add some requests to the queue for i in range(1, 10): await queue.add_request(Request.from_url(f'http://example.com/{i}')) # Add a request to the start of the queue, for priority processing await queue.add_request(Request.from_url('http://example.com/0'), forefront=True) # If you try to add an existing request again, it will not do anything add_request_info = await queue.add_request( Request.from_url('http://example.com/5') ) Actor.log.info(f'Add request info: {add_request_info}') processed_request = await queue.get_request(add_request_info.id) Actor.log.info(f'Processed request: {processed_request}') # Finally, process the queue until all requests are handled while not await queue.is_finished(): # Fetch the next unhandled request in the queue request = await queue.fetch_next_request() # This can happen due to the eventual consistency of the underlying request # queue storage; the best solution is just to sleep a bit. if request is None: await asyncio.sleep(1) continue Actor.log.info(f'Processing request {request.unique_key}...') Actor.log.info(f'Scraping URL {request.url}...') # Do some fake work, which fails 30% of the time await asyncio.sleep(1) if random.random() > FAILURE_RATE: # If processing the request was successful, mark it as handled Actor.log.info('Request successful.') await queue.mark_request_as_handled(request) else: # If processing the request was unsuccessful, reclaim it so it can be # processed again. Actor.log.warning('Request failed, will retry!') await queue.reclaim_request(request) ``` --- # Creating webhooks Webhooks allow you to configure the Apify platform to perform an action when a certain event occurs. For example, you can use them to start another Actor when the current run finishes or fails. You can learn more in the [documentation for webhooks](https://docs.apify.com/platform/integrations/webhooks). ## Creating an ad-hoc webhook dynamically[​](#creating-an-ad-hoc-webhook-dynamically "Direct link to Creating an ad-hoc webhook dynamically") Besides creating webhooks manually in Apify Console, or through the Apify API, you can also create [ad-hoc webhooks](https://docs.apify.com/platform/integrations/webhooks/ad-hoc-webhooks) dynamically from the code of your Actor using the [`Actor.add_webhook`](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#add_webhook) method: ``` from apify import Actor, Webhook async def main() -> None: async with Actor: # Create a webhook that will be triggered when the Actor run fails. webhook = Webhook( event_types=['ACTOR.RUN.FAILED'], request_url='https://example.com/run-failed', ) # Add the webhook to the Actor. await Actor.add_webhook(webhook) # Raise an error to simulate a failed run. raise RuntimeError('I am an error and I know it!') ``` Note that webhooks are only supported when running on the Apify platform. When running the Actor locally, the method will print a warning and have no effect. ## Preventing duplicate webhooks[​](#preventing-duplicate-webhooks "Direct link to Preventing duplicate webhooks") To ensure that duplicate ad-hoc webhooks won't get created in case of an Actor restart, you can use the `idempotency_key` parameter.
The idempotency key must be unique across all the webhooks of a user so that only one webhook gets created for a given value. You can use, for example, the Actor run ID as the idempotency key: ``` from apify import Actor, Webhook async def main() -> None: async with Actor: # Create a webhook that will be triggered when the Actor run fails. webhook = Webhook( event_types=['ACTOR.RUN.FAILED'], request_url='https://example.com/run-failed', idempotency_key=Actor.config.actor_run_id, ) # Add the webhook to the Actor. await Actor.add_webhook(webhook) # Raise an error to simulate a failed run. raise RuntimeError('I am an error and I know it!') ``` --- # Using BeautifulSoup with HTTPX In this guide, you'll learn how to use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library with the [HTTPX](https://www.python-httpx.org/) library in your Apify Actors. ## Introduction[​](#introduction "Direct link to Introduction") `BeautifulSoup` is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction. `HTTPX` is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests. To create an `Actor` which uses those libraries, start from the [BeautifulSoup & Python](https://apify.com/templates/categories/python) Actor template. This template includes the `BeautifulSoup` and `HTTPX` libraries preinstalled, allowing you to begin development immediately. ## Example Actor[​](#example-actor "Direct link to Example Actor") Below is a simple Actor that recursively scrapes titles from all linked websites, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses `HTTPX` for fetching pages and `BeautifulSoup` for parsing their content to extract titles and links to other pages. ``` from __future__ import annotations from urllib.parse import urljoin from bs4 import BeautifulSoup from httpx import AsyncClient from apify import Actor, Request async def main() -> None: # Enter the context of the Actor. async with Actor: # Retrieve the Actor input, and use default values if not provided. actor_input = await Actor.get_input() or {} start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) max_depth = actor_input.get('max_depth', 1) # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() # Open the default request queue for handling URLs to be processed. request_queue = await Actor.open_request_queue() # Enqueue the start URLs with an initial crawl depth of 0. for start_url in start_urls: url = start_url.get('url') Actor.log.info(f'Enqueuing {url} ...') new_request = Request.from_url(url, user_data={'depth': 0}) await request_queue.add_request(new_request) # Create an HTTPX client to fetch the HTML content of the URLs. async with AsyncClient() as client: # Process the URLs from the request queue. while request := await request_queue.fetch_next_request(): url = request.url if not isinstance(request.user_data['depth'], (str, int)): raise TypeError('Request.depth is an enexpected type.') depth = int(request.user_data['depth']) Actor.log.info(f'Scraping {url} (depth={depth}) ...') try: # Fetch the HTTP response from the specified URL using HTTPX. 
response = await client.get(url, follow_redirects=True) # Parse the HTML content using Beautiful Soup. soup = BeautifulSoup(response.content, 'html.parser') # If the current depth is less than max_depth, find nested links # and enqueue them. if depth < max_depth: for link in soup.find_all('a'): link_href = link.get('href') link_url = urljoin(url, link_href) if link_url.startswith(('http://', 'https://')): Actor.log.info(f'Enqueuing {link_url} ...') new_request = Request.from_url( link_url, user_data={'depth': depth + 1}, ) await request_queue.add_request(new_request) # Extract the desired data. data = { 'url': url, 'title': soup.title.string if soup.title else None, 'h1s': [h1.text for h1 in soup.find_all('h1')], 'h2s': [h2.text for h2 in soup.find_all('h2')], 'h3s': [h3.text for h3 in soup.find_all('h3')], } # Store the extracted data to the default dataset. await Actor.push_data(data) except Exception: Actor.log.exception(f'Cannot extract data from {url}.') finally: # Mark the request as handled to ensure it is not processed again. await request_queue.mark_request_as_handled(request) ``` ## Conclusion[​](#conclusion "Direct link to Conclusion") In this guide, you learned how to use `BeautifulSoup` with `HTTPX` in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! --- # Using Crawlee In this guide you'll learn how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. ## Introduction[​](#introduction "Direct link to Introduction") `Crawlee` is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to dynamic JavaScript-rendered content handling. Crawlee offers a range of crawlers, including HTTP-based crawlers like [`HttpCrawler`](https://crawlee.dev/python/api/class/HttpCrawler), [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) and [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and browser-based crawlers like [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler), to suit different scraping needs. In this guide, you'll learn how to use Crawlee with `BeautifulSoupCrawler` and `PlaywrightCrawler` to build Apify Actors for web scraping. ## Actor with BeautifulSoupCrawler[​](#actor-with-beautifulsoupcrawler "Direct link to Actor with BeautifulSoupCrawler") The `BeautifulSoupCrawler` is ideal for extracting data from static HTML pages. It uses `BeautifulSoup` for parsing and [`HttpxHttpClient`](https://crawlee.dev/python/api/class/HttpxHttpClient) for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, `BeautifulSoupCrawler` is a great choice for your scraping tasks. Below is an example of how to use `BeautifulSoupCrawler` in an Apify Actor.
``` from __future__ import annotations from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext from apify import Actor async def main() -> None: # Enter the context of the Actor. async with Actor: # Retrieve the Actor input, and use default values if not provided. actor_input = await Actor.get_input() or {} start_urls = [ url.get('url') for url in actor_input.get( 'start_urls', [{'url': 'https://apify.com'}], ) ] # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() # Create a crawler. crawler = BeautifulSoupCrawler( # Limit the crawl to max requests. # Remove or increase it for crawling all links. max_requests_per_crawl=50, ) # Define a request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Extract the desired data. data = { 'url': context.request.url, 'title': context.soup.title.string if context.soup.title else None, 'h1s': [h1.text for h1 in context.soup.find_all('h1')], 'h2s': [h2.text for h2 in context.soup.find_all('h2')], 'h3s': [h3.text for h3 in context.soup.find_all('h3')], } # Store the extracted data to the default dataset. await context.push_data(data) # Enqueue additional links found on the current page. await context.enqueue_links() # Run the crawler with the starting requests. await crawler.run(start_urls) ``` ## Actor with PlaywrightCrawler[​](#actor-with-playwrightcrawler "Direct link to Actor with PlaywrightCrawler") The `PlaywrightCrawler` is built for handling dynamic web pages that rely on JavaScript for content generation. Using the [Playwright](https://playwright.dev/) library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use `PlaywrightCrawler` in an Apify Actor. ``` from __future__ import annotations from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext from apify import Actor async def main() -> None: # Enter the context of the Actor. async with Actor: # Retrieve the Actor input, and use default values if not provided. actor_input = await Actor.get_input() or {} start_urls = [ url.get('url') for url in actor_input.get( 'start_urls', [{'url': 'https://apify.com'}], ) ] # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() # Create a crawler. crawler = PlaywrightCrawler( # Limit the crawl to max requests. # Remove or increase it for crawling all links. max_requests_per_crawl=50, headless=True, browser_launch_options={ 'args': ['--disable-gpu'], }, ) # Define a request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Extract the desired data. data = { 'url': context.request.url, 'title': await context.page.title(), 'h1s': [ await h1.text_content() for h1 in await context.page.locator('h1').all() ], 'h2s': [ await h2.text_content() for h2 in await context.page.locator('h2').all() ], 'h3s': [ await h3.text_content() for h3 in await context.page.locator('h3').all() ], } # Store the extracted data to the default dataset. await context.push_data(data) # Enqueue additional links found on the current page. 
await context.enqueue_links() # Run the crawler with the starting requests. await crawler.run(start_urls) ``` ## Conclusion[​](#conclusion "Direct link to Conclusion") In this guide, you learned how to use the `Crawlee` library in your Apify Actors. By using the `BeautifulSoupCrawler` and `PlaywrightCrawler` crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! --- # Using Playwright [Playwright](https://playwright.dev) is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would. Some of the key features of Playwright for web scraping include: * **Cross-browser support** - Playwright supports the latest versions of major browsers like Chrome, Firefox, and Safari, so you can choose the one that suits your needs the best. * **Headless mode** - Playwright can run in headless mode, meaning that the browser window is not visible on your screen while it is scraping, which can be useful for running scraping tasks in the background or in containers without a display. * **Powerful selectors** - Playwright provides a variety of powerful selectors that allow you to target specific elements on a web page, including CSS selectors, XPath, and text matching. * **Emulation of user interactions** - Playwright allows you to emulate user interactions like clicking, scrolling, filling out forms, and even typing in text, which can be useful for scraping websites that have dynamic content or require user input. ## Using Playwright in Actors[​](#using-playwright-in-actors "Direct link to Using Playwright in Actors") To create Actors which use Playwright, start from the [Playwright & Python](https://apify.com/templates/categories/python) Actor template. On the Apify platform, the Actor will already have Playwright and the necessary browsers preinstalled in its Docker image, including the tools and setup necessary to run browsers in headful mode. When running the Actor locally, you'll need to finish the Playwright setup yourself before you can run the Actor. * Linux / macOS * Windows ``` source .venv/bin/activate playwright install --with-deps ``` ``` .venv\Scripts\activate playwright install --with-deps ``` ## Example Actor[​](#example-actor "Direct link to Example Actor") This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input. It uses Playwright to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load. ``` from __future__ import annotations from urllib.parse import urljoin from playwright.async_api import async_playwright from apify import Actor, Request # Note: To run this Actor locally, ensure that Playwright browsers are installed. # Run `playwright install --with-deps` in the Actor's virtual environment to install them. # When running on the Apify platform, these dependencies are already included # in the Actor's Docker image. async def main() -> None: # Enter the context of the Actor. 
async with Actor: # Retrieve the Actor input, and use default values if not provided. actor_input = await Actor.get_input() or {} start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) max_depth = actor_input.get('max_depth', 1) # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in actor input, exiting...') await Actor.exit() # Open the default request queue for handling URLs to be processed. request_queue = await Actor.open_request_queue() # Enqueue the start URLs with an initial crawl depth of 0. for start_url in start_urls: url = start_url.get('url') Actor.log.info(f'Enqueuing {url} ...') new_request = Request.from_url(url, user_data={'depth': 0}) await request_queue.add_request(new_request) Actor.log.info('Launching Playwright...') # Launch Playwright and open a new browser context. async with async_playwright() as playwright: # Configure the browser to launch in headless mode as per Actor configuration. browser = await playwright.chromium.launch( headless=Actor.config.headless, args=['--disable-gpu'], ) context = await browser.new_context() # Process the URLs from the request queue. while request := await request_queue.fetch_next_request(): url = request.url if not isinstance(request.user_data['depth'], (str, int)): raise TypeError('Request.depth is an enexpected type.') depth = int(request.user_data['depth']) Actor.log.info(f'Scraping {url} (depth={depth}) ...') try: # Open a new page in the browser context and navigate to the URL. page = await context.new_page() await page.goto(url) # If the current depth is less than max_depth, find nested links # and enqueue them. if depth < max_depth: for link in await page.locator('a').all(): link_href = await link.get_attribute('href') link_url = urljoin(url, link_href) if link_url.startswith(('http://', 'https://')): Actor.log.info(f'Enqueuing {link_url} ...') new_request = Request.from_url( link_url, user_data={'depth': depth + 1}, ) await request_queue.add_request(new_request) # Extract the desired data. data = { 'url': url, 'title': await page.title(), } # Store the extracted data to the default dataset. await Actor.push_data(data) except Exception: Actor.log.exception(f'Cannot extract data from {url}.') finally: await page.close() # Mark the request as handled to ensure it is not processed again. await request_queue.mark_request_as_handled(request) ``` ## Conclusion[​](#conclusion "Direct link to Conclusion") In this guide you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool that can be used to manage browser instances and scrape websites that require JavaScript execution. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! --- # Using Scrapy [Scrapy](https://scrapy.org/) is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify [Actors](https://docs.apify.com/platform/actors), integrated with Apify [storages](https://docs.apify.com/platform/storage), and executed on the Apify [platform](https://docs.apify.com/platform). 
## Integrating Scrapy with the Apify platform[​](#integrating-scrapy-with-the-apify-platform "Direct link to Integrating Scrapy with the Apify platform") The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install the Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications. \_\_main.py\_\_: The Actor entry point ``` from __future__ import annotations from scrapy.utils.reactor import install_reactor # Install Twisted's asyncio reactor before importing any other Twisted or # Scrapy components. install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor') import os from apify.scrapy import initialize_logging, run_scrapy_actor # Import your main Actor coroutine here. from .main import main # Ensure the location to the Scrapy settings module is defined. os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings' if __name__ == '__main__': initialize_logging() run_scrapy_actor(main()) ``` In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The `apify.scrapy.run_scrapy_actor` bridges asyncio coroutines with Twisted's reactor, enabling the Actor's main coroutine, which contains the Scrapy spider, to be executed. Make sure the `SCRAPY_SETTINGS_MODULE` environment variable is set to the path of the Scrapy settings module. This variable is also used by the `Actor` class to detect that the project is a Scrapy project, triggering additional actions. main.py: The Actor main coroutine ``` from __future__ import annotations from scrapy.crawler import CrawlerRunner from scrapy.utils.defer import deferred_to_future from apify import Actor from apify.scrapy import apply_apify_settings # Import your Scrapy spider here. from .spiders import TitleSpider as Spider async def main() -> None: """Apify Actor main coroutine for executing the Scrapy spider.""" async with Actor: # Retrieve and process Actor input. actor_input = await Actor.get_input() or {} start_urls = [url['url'] for url in actor_input.get('startUrls', [])] allowed_domains = actor_input.get('allowedDomains') proxy_config = actor_input.get('proxyConfiguration') # Apply Apify settings, which will override the Scrapy project settings. settings = apply_apify_settings(proxy_config=proxy_config) # Create CrawlerRunner and execute the Scrapy spider. crawler_runner = CrawlerRunner(settings) crawl_deferred = crawler_runner.crawl( Spider, start_urls=start_urls, allowed_domains=allowed_domains, ) await deferred_to_future(crawl_deferred) ``` Within the Actor's main coroutine, the Actor's input is processed as usual. The function `apify.scrapy.apply_apify_settings` is then used to configure Scrapy settings with Apify-specific components before the spider is executed. The key components and other helper functions are described in the next section. 
## Key integration components[​](#key-integration-components "Direct link to Key integration components") The Apify SDK provides several custom components to support integration with the Apify platform: * [`apify.scrapy.ApifyScheduler`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyScheduler.md) - Replaces Scrapy's default [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) with one that uses Apify's [request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests. It manages enqueuing, dequeuing, and maintaining the state and priority of requests. * [`apify.scrapy.ActorDatasetPushPipeline`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorDatasetPushPipeline.md) - A Scrapy [item pipeline](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) that pushes scraped items to Apify's [dataset](https://docs.apify.com/platform/storage/dataset). When enabled, every item produced by the spider is sent to the dataset. * [`apify.scrapy.ApifyHttpProxyMiddleware`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyHttpProxyMiddleware.md) - A Scrapy [middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that manages proxy configurations. This middleware replaces Scrapy's default `HttpProxyMiddleware` to facilitate the use of Apify's proxy service. * [`apify.scrapy.extensions.ApifyCacheStorage`](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyCacheStorage.md) - A storage backend for Scrapy's built-in [HTTP cache middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpcache). This backend uses Apify's [key-value store](https://docs.apify.com/platform/storage/key-value-store). Make sure to set `HTTPCACHE_ENABLED` and `HTTPCACHE_EXPIRATION_SECS` in your settings, or caching won't work. Additional helper functions in the [`apify.scrapy`](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy) subpackage include: * `apply_apify_settings` - Applies Apify-specific components to Scrapy settings. * `to_apify_request` and `to_scrapy_request` - Convert between Apify and Scrapy request objects. * `initialize_logging` - Configures logging for the Actor environment. * `run_scrapy_actor` - Bridges asyncio and Twisted event loops. ## Create a new Apify-Scrapy project[​](#create-a-new-apify-scrapy-project "Direct link to Create a new Apify-Scrapy project") The simplest way to start using Scrapy in Apify Actors is to use the [Scrapy Actor template](https://apify.com/templates/python-scrapy). The template provides a pre-configured project structure and setup that includes all necessary components to run Scrapy spiders as Actors and store their output in Apify datasets. If you prefer manual setup, refer to the example Actor section below for configuration details. ## Wrapping an existing Scrapy project[​](#wrapping-an-existing-scrapy-project "Direct link to Wrapping an existing Scrapy project") The Apify CLI supports converting an existing Scrapy project into an Apify Actor with a single command. The CLI expects the project to follow the standard Scrapy layout (including a `scrapy.cfg` file in the project root). During the wrapping process, the CLI: * Creates the necessary files and directories for an Apify Actor. * Installs the Apify SDK and required dependencies. * Updates Scrapy settings to include Apify-specific components. 
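To give a rough idea of what "Updates Scrapy settings to include Apify-specific components" means in practice, the resulting configuration is conceptually similar to the sketch below. The exact setting keys, import paths, and priority values here are assumptions for illustration only; rely on `apply_apify_settings` (or the CLI wrapping) to configure the real values:

```
# Illustrative sketch only - the concrete paths and priorities are assumptions,
# and apply_apify_settings() sets up the real configuration for you.

# Store and schedule requests in the Apify request queue.
SCHEDULER = 'apify.scrapy.scheduler.ApifyScheduler'

# Push every scraped item to the Apify dataset.
ITEM_PIPELINES = {
    'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,
}

# Use Apify Proxy instead of Scrapy's default proxy middleware.
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'apify.scrapy.middlewares.ApifyHttpProxyMiddleware': 950,
}

# Optionally back Scrapy's HTTP cache with the Apify key-value store.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 7200
HTTPCACHE_STORAGE = 'apify.scrapy.extensions.ApifyCacheStorage'
```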
For further details, see the [Scrapy migration guide](https://docs.apify.com/cli/docs/integrating-scrapy). ## Example Actor[​](#example-actor "Direct link to Example Actor") The following example demonstrates a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates. * \_\_main\_\_.py * main.py * settings.py * items.py * spiders/title.py ``` from __future__ import annotations from scrapy.utils.reactor import install_reactor # Install Twisted's asyncio reactor before importing any other Twisted or # Scrapy components. install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor') import os from apify.scrapy import initialize_logging, run_scrapy_actor # Import your main Actor coroutine here. from .main import main # Ensure the location to the Scrapy settings module is defined. os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings' if __name__ == '__main__': initialize_logging() run_scrapy_actor(main()) ``` ``` from __future__ import annotations from scrapy.crawler import CrawlerRunner from scrapy.utils.defer import deferred_to_future from apify import Actor from apify.scrapy import apply_apify_settings # Import your Scrapy spider here. from .spiders import TitleSpider as Spider async def main() -> None: """Apify Actor main coroutine for executing the Scrapy spider.""" async with Actor: # Retrieve and process Actor input. actor_input = await Actor.get_input() or {} start_urls = [url['url'] for url in actor_input.get('startUrls', [])] allowed_domains = actor_input.get('allowedDomains') proxy_config = actor_input.get('proxyConfiguration') # Apply Apify settings, which will override the Scrapy project settings. settings = apply_apify_settings(proxy_config=proxy_config) # Create CrawlerRunner and execute the Scrapy spider. crawler_runner = CrawlerRunner(settings) crawl_deferred = crawler_runner.crawl( Spider, start_urls=start_urls, allowed_domains=allowed_domains, ) await deferred_to_future(crawl_deferred) ``` ``` BOT_NAME = 'titlebot' DEPTH_LIMIT = 1 LOG_LEVEL = 'INFO' NEWSPIDER_MODULE = 'src.spiders' ROBOTSTXT_OBEY = True SPIDER_MODULES = ['src.spiders'] TELNETCONSOLE_ENABLED = False # Do not change the Twisted reactor unless you really know what you are doing. TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor' HTTPCACHE_ENABLED = True HTTPCACHE_EXPIRATION_SECS = 7200 ``` ``` from __future__ import annotations from scrapy import Field, Item class TitleItem(Item): """Represents a title item scraped from a web page.""" url = Field() title = Field() ``` ``` from __future__ import annotations from typing import TYPE_CHECKING, Any from urllib.parse import urljoin from scrapy import Request, Spider from ..items import TitleItem if TYPE_CHECKING: from collections.abc import Generator from scrapy.http.response import Response class TitleSpider(Spider): """A spider that scrapes web pages to extract titles and discover new links. This spider retrieves the content of the `<title>` element from each page and queues any valid hyperlinks for further crawling. """ name = 'title_spider' # Limit the number of pages to scrape. custom_settings = {'CLOSESPIDER_PAGECOUNT': 10} def __init__( self, start_urls: list[str], allowed_domains: list[str], *args: Any, **kwargs: Any, ) -> None: """A default constructor. Args: start_urls: URLs to start the scraping from. allowed_domains: Domains that the scraper is allowed to crawl. *args: Additional positional arguments.
**kwargs: Additional keyword arguments. """ super().__init__(*args, **kwargs) self.start_urls = start_urls self.allowed_domains = allowed_domains def parse(self, response: Response) -> Generator[TitleItem | Request, None, None]: """Parse the web page response. Args: response: The web page response. Yields: Yields scraped `TitleItem` and new `Request` objects for links. """ self.logger.info('TitleSpider is parsing %s...', response) # Extract and yield the TitleItem url = response.url title = response.css('title::text').extract_first() yield TitleItem(url=url, title=title) # Extract all links from the page, create `Request` objects out of them, # and yield them. for link_href in response.css('a::attr("href")'): link_url = urljoin(response.url, link_href.get()) if link_url.startswith(('http://', 'https://')): yield Request(link_url) ``` ## Dealing with ‘imminent migration to another host’[​](#dealing-with-imminent-migration-to-another-host "Direct link to Dealing with ‘imminent migration to another host’") Under some circumstances, the platform may decide to [migrate your Actor](https://docs.apify.com/academy/expert-scraping-with-apify/migrations-maintaining-state) from one piece of infrastructure to another while its run is in progress. While [Crawlee](https://crawlee.dev/python)-based projects can pause and resume their work after a restart, achieving the same with a Scrapy-based project can be challenging. As a workaround for this issue (tracked as [apify/actor-templates#303](https://github.com/apify/actor-templates/issues/303)), turn on caching with `HTTPCACHE_ENABLED` and set `HTTPCACHE_EXPIRATION_SECS` to at least a few minutes; the exact value depends on your use case. If your Actor gets migrated and restarted, the subsequent run will hit the cache, making it fast and avoiding unnecessary resource consumption. ## Conclusion[​](#conclusion "Direct link to Conclusion") In this guide, you learned how to use Scrapy in Apify Actors. You can now start building your own web scraping projects with Scrapy and the Apify SDK, and host them on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! ## Additional resources[​](#additional-resources "Direct link to Additional resources") * [Apify CLI: Integrating Scrapy projects](https://docs.apify.com/cli/docs/integrating-scrapy) * [Apify: Run Scrapy spiders on Apify](https://apify.com/run-scrapy-in-cloud) * [Apify templates: Python Actor Scrapy template](https://apify.com/templates/python-scrapy) * [Apify store: Scrapy Books Example Actor](https://apify.com/vdusek/scrapy-books-example) * [Scrapy: Official documentation](https://docs.scrapy.org/) --- # Using Selenium [Selenium](https://www.selenium.dev/) is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would. Some of the key features of Selenium for web scraping include: * **Cross-browser support** - Selenium supports the latest versions of major browsers like Chrome, Firefox, and Safari, so you can choose the one that suits your needs best.
* **Headless mode** - Selenium can run in headless mode, meaning that the browser window is not visible on your screen while it is scraping, which can be useful for running scraping tasks in the background or in containers without a display. * **Powerful selectors** - Selenium provides a variety of powerful selectors that allow you to target specific elements on a web page, including CSS selectors, XPath, and text matching. * **Emulation of user interactions** - Selenium allows you to emulate user interactions like clicking, scrolling, filling out forms, and even typing in text, which can be useful for scraping websites that have dynamic content or require user input. ## Using Selenium in Actors[​](#using-selenium-in-actors "Direct link to Using Selenium in Actors") To create Actors which use Selenium, start from the [Selenium & Python](https://apify.com/templates/categories/python) Actor template. On the Apify platform, the Actor will already have Selenium and the necessary browsers preinstalled in its Docker image, including the tools and setup necessary to run browsers in headful mode. When running the Actor locally, you'll need to install the Selenium browser drivers yourself. Refer to the [Selenium documentation](https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/) for installation instructions. ## Example Actor[​](#example-actor "Direct link to Example Actor") This is a simple Actor that recursively scrapes titles from all linked websites, up to a maximum depth, starting from URLs in the Actor input. It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title and anchor elements after the pages load. ``` from __future__ import annotations import asyncio from urllib.parse import urljoin from selenium import webdriver from selenium.webdriver.chrome.options import Options as ChromeOptions from selenium.webdriver.common.by import By from apify import Actor, Request # To run this Actor locally, you need to have the Selenium Chromedriver installed. # Follow the installation guide at: # https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/ # When running on the Apify platform, the Chromedriver is already included # in the Actor's Docker image. async def main() -> None: # Enter the context of the Actor. async with Actor: # Retrieve the Actor input, and use default values if not provided. actor_input = await Actor.get_input() or {} start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) max_depth = actor_input.get('max_depth', 1) # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in actor input, exiting...') await Actor.exit() # Open the default request queue for handling URLs to be processed. request_queue = await Actor.open_request_queue() # Enqueue the start URLs with an initial crawl depth of 0. for start_url in start_urls: url = start_url.get('url') Actor.log.info(f'Enqueuing {url} ...') new_request = Request.from_url(url, user_data={'depth': 0}) await request_queue.add_request(new_request) # Launch a new Selenium Chrome WebDriver and configure it. Actor.log.info('Launching Chrome WebDriver...') chrome_options = ChromeOptions() if Actor.config.headless: chrome_options.add_argument('--headless') chrome_options.add_argument('--no-sandbox') chrome_options.add_argument('--disable-dev-shm-usage') driver = webdriver.Chrome(options=chrome_options) # Test WebDriver setup by navigating to an example page. 
driver.get('http://www.example.com') if driver.title != 'Example Domain': raise ValueError('Failed to open example page.') # Process the URLs from the request queue. while request := await request_queue.fetch_next_request(): url = request.url if not isinstance(request.user_data['depth'], (str, int)): raise TypeError('Request.depth is an unexpected type.') depth = int(request.user_data['depth']) Actor.log.info(f'Scraping {url} (depth={depth}) ...') try: # Navigate to the URL using Selenium WebDriver. Use asyncio.to_thread # for non-blocking execution. await asyncio.to_thread(driver.get, url) # If the current depth is less than max_depth, find nested links # and enqueue them. if depth < max_depth: for link in driver.find_elements(By.TAG_NAME, 'a'): link_href = link.get_attribute('href') link_url = urljoin(url, link_href) if link_url.startswith(('http://', 'https://')): Actor.log.info(f'Enqueuing {link_url} ...') new_request = Request.from_url( link_url, user_data={'depth': depth + 1}, ) await request_queue.add_request(new_request) # Extract the desired data. data = { 'url': url, 'title': driver.title, } # Store the extracted data to the default dataset. await Actor.push_data(data) except Exception: Actor.log.exception(f'Cannot extract data from {url}.') finally: # Mark the request as handled to ensure it is not processed again. await request_queue.mark_request_as_handled(request) driver.quit() ``` ## Conclusion[​](#conclusion "Direct link to Conclusion") In this guide, you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping! --- # Actor structure All Python Actor templates follow the same structure. The `.actor/` directory contains the [Actor configuration](https://docs.apify.com/platform/actors/development/actor-config), such as the Actor's definition and input schema, and the Dockerfile necessary to run the Actor on the Apify platform. The Actor's runtime dependencies are specified in the `requirements.txt` file, which follows the [standard requirements file format](https://pip.pypa.io/en/stable/reference/requirements-file-format/). The Actor's source code is in the `src/` folder. This folder contains two important files: `main.py`, which contains the main function of the Actor, and `__main__.py`, which is the entrypoint of the Actor package, setting up the Actor [logger](https://docs.apify.com/sdk/python/sdk/python/docs/concepts/logging.md) and executing the Actor's main function via [`asyncio.run`](https://docs.python.org/3/library/asyncio-runner.html#asyncio.run). * \_\_main\_\_.py * main.py ``` import asyncio from .main import main if __name__ == '__main__': asyncio.run(main()) ``` ``` from apify import Actor async def main() -> None: async with Actor: actor_input = await Actor.get_input() Actor.log.info('Actor input: %s', actor_input) await Actor.set_value('OUTPUT', 'Hello, world!') ``` If you want to modify the Actor structure, you need to make sure that your Actor is executable as a module, via `python -m src`, as that is the command started by `apify run` in the Apify CLI.
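For orientation, the resulting project layout typically looks like this (exact file names inside `.actor/` may vary slightly between templates):

```
my-actor/
├── .actor/
│   ├── actor.json          # Actor definition
│   ├── input_schema.json   # Actor input schema
│   └── Dockerfile          # image used to run the Actor on the Apify platform
├── requirements.txt        # runtime dependencies
└── src/
    ├── __main__.py         # entrypoint, runs main() via asyncio.run
    └── main.py             # the Actor's main coroutine
```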
We recommend keeping the entrypoint for the Actor in the `src/__main__.py` file. --- # Introduction The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) using Python. ``` import httpx from bs4 import BeautifulSoup from apify import Actor async def main() -> None: async with Actor: actor_input = await Actor.get_input() async with httpx.AsyncClient() as client: response = await client.get(actor_input['url']) soup = BeautifulSoup(response.content, 'html.parser') data = { 'url': actor_input['url'], 'title': soup.title.string if soup.title else None, } await Actor.push_data(data) ``` ## What are Actors?[​](#what-are-actors "Direct link to What are Actors?") Actors are serverless cloud programs capable of performing tasks in a web browser, similar to what a human can do. These tasks can range from simple operations, such as filling out forms or unsubscribing from services, to complex jobs like scraping and processing large numbers of web pages. Actors can be executed locally or on the [Apify platform](https://docs.apify.com/platform/), which provides features for running them at scale, monitoring, scheduling, and even publishing and monetizing them. If you're new to Apify, refer to the Apify platform documentation to learn [what Apify is](https://docs.apify.com/platform/about). ## Quick start[​](#quick-start "Direct link to Quick start") This section provides a quick start guide for creating and running Actors. ### Creating Actors[​](#creating-actors "Direct link to Creating Actors") To create and run Actors using the Apify Console, see the [Console documentation](https://docs.apify.com/platform/console). For creating and running Python Actors locally, refer to the documentation for [creating and running Python Actors locally](https://docs.apify.com/sdk/python/sdk/python/docs/overview/running-actors-locally.md). ### Guides[​](#guides "Direct link to Guides") Integrate the Apify SDK with popular web scraping libraries by following these guides: * [BeautifulSoup with HTTPX](https://docs.apify.com/sdk/python/sdk/python/docs/guides/beautifulsoup-httpx.md) * [Crawlee](https://docs.apify.com/sdk/python/sdk/python/docs/guides/crawlee.md) * [Playwright](https://docs.apify.com/sdk/python/sdk/python/docs/guides/playwright.md) * [Selenium](https://docs.apify.com/sdk/python/sdk/python/docs/guides/selenium.md) * [Scrapy](https://docs.apify.com/sdk/python/sdk/python/docs/guides/scrapy.md) ### Usage concepts[​](#usage-concepts "Direct link to Usage concepts") For a deeper understanding of the Apify SDK's features, refer to the **Usage concepts** section in the sidebar. Key topics include: * [Actor lifecycle](https://docs.apify.com/sdk/python/sdk/python/docs/concepts/actor-lifecycle.md) * [Working with storages](https://docs.apify.com/sdk/python/sdk/python/docs/concepts/storages.md) * [Handling Actor events](https://docs.apify.com/sdk/python/sdk/python/docs/concepts/actor-events.md) * [Using proxies](https://docs.apify.com/sdk/python/sdk/python/docs/concepts/proxy-management.md) ## Installing the Apify SDK separately[​](#installing-the-apify-sdk-separately "Direct link to Installing the Apify SDK separately") When creating an Actor using the Apify CLI, the Apify SDK for Python is installed automatically. 
If you want to install it independently, use the following command: ``` pip install apify ``` If your goal is not to develop Apify Actors but to interact with the Apify API from Python, consider using the [Apify API client for Python](https://docs.apify.com/api/client/python) directly. --- # Running Actors locally On this page, you'll learn how to create and run Apify Actors locally on your computer. ## Requirements[​](#requirements "Direct link to Requirements") The Apify SDK requires Python version 3.10 or above to run Python Actors locally. ## Creating your first Actor[​](#creating-your-first-actor "Direct link to Creating your first Actor") To create a new Apify Actor on your computer, you can use the [Apify CLI](https://docs.apify.com/cli), and select one of the [Python Actor templates](https://apify.com/templates/categories/python). For example, to create an Actor from the Python SDK template, you can use the [`apify create`](https://docs.apify.com/cli/docs/reference#apify-create-actorname) command. ``` apify create my-first-actor --template python-start ``` This will create a new folder called `my-first-actor`, download and extract the "Getting started with Python" Actor template there, create a virtual environment in `my-first-actor/.venv`, and install the Actor dependencies in it. ## Running the Actor[​](#running-the-actor "Direct link to Running the Actor") To run the Actor, you can use the [`apify run`](https://docs.apify.com/cli/docs/reference#apify-run) command: ``` cd my-first-actor apify run ``` This will activate the virtual environment in `.venv` (if no other virtual environment is activated yet), then start the Actor, passing the right environment variables for local execution, and configuring it to use local storages from the `storage` folder. The Actor input, for example, will be in `storage/key_value_stores/default/INPUT.json`. ## Adding dependencies[​](#adding-dependencies "Direct link to Adding dependencies") Adding dependencies to the Actor is simple. First, add them to the [`requirements.txt`](https://pip.pypa.io/en/stable/reference/requirements-file-format/) file in the Actor source folder. Then activate the virtual environment in `.venv`: * Linux / macOS * Windows ``` source .venv/bin/activate ``` ``` .venv\Scripts\activate ``` Then install the dependencies: ``` python -m pip install -r requirements.txt ``` --- # Upgrading to v2 This page summarizes most of the breaking changes between Apify Python SDK v1.x and v2.0. ## Python version support[​](#python-version-support "Direct link to Python version support") Support for Python 3.8 has been dropped. The Apify Python SDK v2.x now requires Python 3.9 or later. Make sure your environment is running a compatible version before upgrading. ## Storages[​](#storages "Direct link to Storages") * The SDK now uses [crawlee](https://github.com/apify/crawlee-python) for local storage emulation. This change should not affect intended usage (working with `Dataset`, `KeyValueStore` and `RequestQueue` classes from the `apify.storages` module or using the shortcuts exposed by the `Actor` class) in any way. * There is a difference in the `RequestQueue.add_request` method: it accepts an `apify.Request` object instead of a free-form dictionary. <!-- --> * A quick way to migrate from dict-based arguments is to wrap the dictionary in a `Request.model_validate()` call.
* The preferred way is to use the `Request.from_url` helper, which prefills the `unique_key` and `id` attributes, or to instantiate it directly, e.g., `Request(url='https://example.tld', ...)` (a short sketch appears below). * For simple use cases, `add_request` also accepts plain strings that contain a URL, e.g. `queue.add_request('https://example.tld')`. * The `StorageClientManager` class has been removed. If you need to change the storage client, use `crawlee.service_container` instead. ## Configuration[​](#configuration "Direct link to Configuration") The `apify.Configuration` class now uses `pydantic_settings` to load configuration from environment variables. This eliminates the need for the helper functions which handled environment variables in `apify._utils`. Attributes suffixed with `_millis` were renamed to drop the suffix and now have the `datetime.timedelta` type. ## Actor[​](#actor "Direct link to Actor") * The `Actor.main` method has been removed as it brings no benefits compared to using `async with Actor`. * The `Actor.add_webhook`, `Actor.start`, `Actor.call` and `Actor.start_task` methods now accept instances of the `apify.Webhook` model instead of an untyped `dict`. * `Actor.start`, `Actor.call`, `Actor.start_task`, `Actor.set_status_message` and `Actor.abort` return instances of the `ActorRun` model instead of an untyped `dict`. * Upon entering the context manager (`async with Actor`), the `Actor` puts the default logging configuration in place. This can be disabled using the `configure_logging` parameter. * The `config` parameter of `Actor` has been renamed to `configuration`. * Event handlers registered via `Actor.on` now receive Pydantic objects instead of untyped dicts. For example, where you would previously do `event['isMigrating']`, you should now use `event.is_migrating`. ## Scrapy integration[​](#scrapy-integration "Direct link to Scrapy integration") The `apify.scrapy.utils.open_queue_with_custom_client` function is no longer necessary and has been removed.
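To make the `RequestQueue.add_request` change from the Storages section above concrete, here is a minimal sketch of the v2-style usage (the URLs are placeholders):

```
from apify import Actor, Request

async def main() -> None:
    async with Actor:
        queue = await Actor.open_request_queue()

        # v1.x accepted a free-form dict such as {'url': 'https://example.com'}.
        # In v2.x, build an `apify.Request` instead, typically via `from_url`...
        await queue.add_request(Request.from_url('https://example.com'))

        # ...or, for simple cases, pass a plain URL string.
        await queue.add_request('https://example.com/other-page')
```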
## Subpackage visibility[​](#subpackage-visibility "Direct link to Subpackage visibility") The following modules were made private: * `apify.proxy_configuration` (`ProxyConfiguration` is still exported from `apify`) * `apify.config` (`Configuration` is still exported from `apify`) * `apify.actor` (`Actor` is still exported from `apify`) * `apify.event_manager` * `apify.consts` --- # apify-sdk-python<!-- --> ## Index[**](#Index) ### Actor * [**Actor](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md) * [**ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) * [**ActorRunMeta](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunMeta.md) * [**ActorRunOptions](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md) * [**ActorRunStats](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md) * [**ActorRunUsage](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md) * [**Webhook](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md) ### Charging * [**ActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md) * [**ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) * [**ChargingManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md) ### Configuration * [**Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) * [**ProxyConfiguration](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md) * [**ProxyInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyInfo.md) ### Event data * [**SystemInfoEventData](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md) * [**EventAbortingData](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventAbortingData.md) * [**EventExitData](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventExitData.md) * [**EventMigratingData](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventMigratingData.md) * [**EventPersistStateData](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventPersistStateData.md) * [**EventSystemInfoData](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventSystemInfoData.md) ### Event managers * [**PlatformEventManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/PlatformEventManager.md) * [**EventManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventManager.md) * [**LocalEventManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md) ### Events * [**AbortingEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/AbortingEvent.md) * [**DeprecatedEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/DeprecatedEvent.md) * [**EventWithoutData](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventWithoutData.md) * [**ExitEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/ExitEvent.md) * [**MigratingEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/MigratingEvent.md) * [**PersistStateEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/PersistStateEvent.md) * [**SystemInfoEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEvent.md) * [**UnknownEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/UnknownEvent.md) 
* [**Event](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md) ### Request loaders * [**RequestList](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestList.md) * [**RequestLoader](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestLoader.md) * [**RequestManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManager.md) * [**RequestManagerTandem](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md) * [**SitemapRequestLoader](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md) ### Storage clients * [**ApifyStorageClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md) * [**StorageClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageClient.md) * [**MemoryStorageClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/MemoryStorageClient.md) * [**FileSystemStorageClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/FileSystemStorageClient.md) ### Storages * [**Storage](https://docs.apify.com/sdk/python/sdk/python/reference/class/Storage.md) * [**Dataset](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md) * [**KeyValueStore](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md) * [**RequestQueue](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md) ### Storage data * [**AddRequestsResponse](https://docs.apify.com/sdk/python/sdk/python/reference/class/AddRequestsResponse.md) * [**DatasetItemsListPage](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md) * [**DatasetMetadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md) * [**KeyValueStoreMetadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreMetadata.md) * [**KeyValueStoreRecord](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecord.md) * [**KeyValueStoreRecordMetadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecordMetadata.md) * [**ProcessedRequest](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProcessedRequest.md) * [**Request](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md) * [**RequestQueueMetadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md) * [**StorageMetadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageMetadata.md) --- # \_RequestDetails<!-- --> ### Hierarchy * *\_RequestDetails* * [\_RequestsFromUrlInput](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestsFromUrlInput.md) * [\_SimpleUrlInput](https://docs.apify.com/sdk/python/sdk/python/reference/class/_SimpleUrlInput.md) ## Index[**](#Index) ### Properties * [**headers](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#headers) * [**method](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#method) * [**payload](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#payload) * [**user\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#user_data) ## Properties<!-- -->[**](#Properties) ### [**](#headers)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L26)headers **headers: dict\[str, str] ### 
[**](#method)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L24)method **method: HttpMethod ### [**](#payload)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L25)payload **payload: str ### [**](#user_data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L27)user\_data **user\_data: dict\[str, str] --- # \_RequestsFromUrlInput<!-- --> ### Hierarchy * [\_RequestDetails](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md) * *\_RequestsFromUrlInput* ## Index[**](#Index) ### Properties * [**headers](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestsFromUrlInput.md#headers) * [**method](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestsFromUrlInput.md#method) * [**payload](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestsFromUrlInput.md#payload) * [**requests\_from\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestsFromUrlInput.md#requests_from_url) * [**user\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestsFromUrlInput.md#user_data) ## Properties<!-- -->[**](#Properties) ### [**](#headers)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L26)headers **headers: dict\[str, str] Inherited from [\_RequestDetails.headers](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#headers) ### [**](#method)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L24)method **method: HttpMethod Inherited from [\_RequestDetails.method](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#method) ### [**](#payload)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L25)payload **payload: str Inherited from [\_RequestDetails.payload](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#payload) ### [**](#requests_from_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L31)requests\_from\_url **requests\_from\_url: str ### [**](#user_data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L27)user\_data **user\_data: dict\[str, str] Inherited from [\_RequestDetails.user\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#user_data) --- # \_SimpleUrlInput<!-- --> ### Hierarchy * [\_RequestDetails](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md) * *\_SimpleUrlInput* ## Index[**](#Index) ### Properties * [**headers](https://docs.apify.com/sdk/python/sdk/python/reference/class/_SimpleUrlInput.md#headers) * [**method](https://docs.apify.com/sdk/python/sdk/python/reference/class/_SimpleUrlInput.md#method) * [**payload](https://docs.apify.com/sdk/python/sdk/python/reference/class/_SimpleUrlInput.md#payload) * [**url](https://docs.apify.com/sdk/python/sdk/python/reference/class/_SimpleUrlInput.md#url) * [**user\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/_SimpleUrlInput.md#user_data) ## Properties<!-- -->[**](#Properties) ### [**](#headers)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L26)headers **headers: dict\[str, str] Inherited from 
[\_RequestDetails.headers](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#headers) ### [**](#method)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L24)method **method: HttpMethod Inherited from [\_RequestDetails.method](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#method) ### [**](#payload)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L25)payload **payload: str Inherited from [\_RequestDetails.payload](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#payload) ### [**](#url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L35)url **url: str ### [**](#user_data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L27)user\_data **user\_data: dict\[str, str] Inherited from [\_RequestDetails.user\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/_RequestDetails.md#user_data) --- # AbortingEvent<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/AbortingEvent.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/AbortingEvent.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L81)data **data: EventAbortingData ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L80)name **name: Literal\[Event.ABORTING] --- # Actor<!-- --> The core class for building Actors on the Apify platform. Actors are serverless programs running in the cloud that can perform anything from simple actions (such as filling out a web form or sending an email) to complex operations (such as crawling an entire website or removing duplicates from a large dataset). They are packaged as Docker containers which accept well-defined JSON input, perform an action, and optionally produce well-defined output. 
### References * Apify platform documentation: <https://docs.apify.com/platform/actors> * Actor whitepaper: <https://whitepaper.actor/> ### Usage ``` import asyncio import httpx from apify import Actor from bs4 import BeautifulSoup async def main() -> None: async with Actor: actor_input = await Actor.get_input() async with httpx.AsyncClient() as client: response = await client.get(actor_input['url']) soup = BeautifulSoup(response.content, 'html.parser') data = { 'url': actor_input['url'], 'title': soup.title.string if soup.title else None, } await Actor.push_data(data) if __name__ == '__main__': asyncio.run(main()) ``` ## Index[**](#Index) ### Methods * [**\_\_aenter\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#__aenter__) * [**\_\_aexit\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#__aexit__) * [**\_\_call\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#__call__) * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#__init__) * [**\_\_repr\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#__repr__) * [**abort](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#abort) * [**add\_webhook](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#add_webhook) * [**call](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#call) * [**call\_task](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#call_task) * [**charge](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#charge) * [**create\_proxy\_configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#create_proxy_configuration) * [**exit](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#exit) * [**fail](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#fail) * [**get\_charging\_manager](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_charging_manager) * [**get\_env](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_env) * [**get\_input](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_input) * [**get\_value](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#get_value) * [**init](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#init) * [**is\_at\_home](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#is_at_home) * [**metamorph](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#metamorph) * [**new\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#new_client) * [**off](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#off) * [**on](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#on) * [**open\_dataset](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_dataset) * [**open\_key\_value\_store](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_key_value_store) * [**open\_request\_queue](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#open_request_queue) * [**push\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#push_data) * [**reboot](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#reboot) * [**set\_status\_message](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#set_status_message) * 
[**set\_value](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#set_value) * [**start](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#start) ### Properties * [**apify\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#apify_client) * [**config](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#config) * [**configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#configuration) * [**event\_manager](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#event_manager) * [**log](https://docs.apify.com/sdk/python/sdk/python/reference/class/Actor.md#log) ## Methods<!-- -->[**](#Methods) ### [**](#__aenter__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L148)\_\_aenter\_\_ * **async **\_\_aenter\_\_**(): Self - Initialize the Actor. Automatically initializes the Actor instance when you use it in an `async with ...` statement. When you exit the `async with` block, the `Actor.exit()` method is called, and if any exception happens while executing the block code, the `Actor.fail` method is called. *** #### Returns Self ### [**](#__aexit__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L159)\_\_aexit\_\_ * **async **\_\_aexit\_\_**(\_exc\_type, exc\_value, \_exc\_traceback): None - Exit the Actor, handling any exceptions properly. When you exit the `async with` block, the `Actor.exit()` method is called, and if any exception happens while executing the block code, the `Actor.fail` method is called. *** #### Parameters * ##### \_exc\_type: type\[BaseException] | None * ##### exc\_value: BaseException | None * ##### \_exc\_traceback: TracebackType | None #### Returns None ### [**](#__call__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L185)\_\_call\_\_ * ****\_\_call\_\_**(configuration, \*, configure\_logging, exit\_process): Self - Make a new Actor instance with a non-default configuration. *** #### Parameters * ##### optionalconfiguration: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) | None = <!-- -->None * ##### optionalkeyword-onlyconfigure\_logging: bool = <!-- -->True * ##### optionalkeyword-onlyexit\_process: bool | None = <!-- -->None #### Returns Self ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L101)\_\_init\_\_ * ****\_\_init\_\_**(configuration, \*, configure\_logging, exit\_process): None - Create an Actor instance. Note that you don't have to do this, all the functionality is accessible using the default instance (e.g. `Actor.open_dataset()`). *** #### Parameters * ##### optionalconfiguration: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) | None = <!-- -->None The Actor configuration to be used. If not passed, a new Configuration instance will be created. * ##### optionalkeyword-onlyconfigure\_logging: bool = <!-- -->True Should the default logging configuration be configured? * ##### optionalkeyword-onlyexit\_process: bool | None = <!-- -->None Whether the Actor should call `sys.exit` when the context manager exits. The default is True except for the IPython, Pytest and Scrapy environments. 
#### Returns None ### [**](#__repr__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L179)\_\_repr\_\_ * ****\_\_repr\_\_**(): str - #### Returns str ### [**](#abort)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L805)abort * **async **abort**(run\_id, \*, token, status\_message, gracefully): [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) - Abort given Actor run on the Apify platform using the current user account. The user account is determined by the `APIFY_TOKEN` environment variable. *** #### Parameters * ##### run\_id: str The ID of the Actor run to be aborted. * ##### optionalkeyword-onlytoken: str | None = <!-- -->None The Apify API token to use for this request (defaults to the `APIFY_TOKEN` environment variable). * ##### optionalkeyword-onlystatus\_message: str | None = <!-- -->None Status message of the Actor to be set on the platform. * ##### optionalkeyword-onlygracefully: bool | None = <!-- -->None If True, the Actor run will abort gracefully. It will send `aborting` and `persistState` events into the run and force-stop the run after 30 seconds. It is helpful in cases where you plan to resurrect the run later. #### Returns [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) ### [**](#add_webhook)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L1077)add\_webhook * **async **add\_webhook**(webhook, \*, ignore\_ssl\_errors, do\_not\_retry, idempotency\_key): None - Create an ad-hoc webhook for the current Actor run. This webhook lets you receive a notification when the Actor run finished or failed. Note that webhooks are only supported for Actors running on the Apify platform. When running the Actor locally, the function will print a warning and have no effect. For more information about Apify Actor webhooks, please see the [documentation](https://docs.apify.com/webhooks). *** #### Parameters * ##### webhook: [Webhook](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md) The webhook to be added * ##### optionalkeyword-onlyignore\_ssl\_errors: bool | None = <!-- -->None Whether the webhook should ignore SSL errors returned by request\_url * ##### optionalkeyword-onlydo\_not\_retry: bool | None = <!-- -->None Whether the webhook should retry sending the payload to request\_url upon failure. * ##### optionalkeyword-onlyidempotency\_key: str | None = <!-- -->None A unique identifier of a webhook. You can use it to ensure that you won't create the same webhook multiple times. #### Returns None ### [**](#call)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L839)call * **async **call**(actor\_id, run\_input, \*, token, content\_type, build, memory\_mbytes, timeout, webhooks, wait, logger): [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) | None - Start an Actor on the Apify Platform and wait for it to finish before returning. It waits indefinitely, unless the wait argument is provided. *** #### Parameters * ##### actor\_id: str The ID of the Actor to be run. * ##### optionalrun\_input: Any = <!-- -->None The input to pass to the Actor run. * ##### optionalkeyword-onlytoken: str | None = <!-- -->None The Apify API token to use for this request (defaults to the `APIFY_TOKEN` environment variable). * ##### optionalkeyword-onlycontent\_type: str | None = <!-- -->None The content type of the input. 
* ##### optionalkeyword-onlybuild: str | None = <!-- -->None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest). * ##### optionalkeyword-onlymemory\_mbytes: int | None = <!-- -->None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor. * ##### optionalkeyword-onlytimeout: (timedelta | None) | Literal\[RemainingTime] = <!-- -->None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the default run configuration for the Actor. Using `RemainingTime` will set timeout of the other Actor to the time remaining from this Actor timeout. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[[Webhook](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md)] | None = <!-- -->None Optional webhooks (<https://docs.apify.com/webhooks>) associated with the Actor run, which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor, you do not have to add it again here. * ##### optionalkeyword-onlywait: timedelta | None = <!-- -->None The maximum number of seconds the server waits for the run to finish. If not provided, waits indefinitely. * ##### optionalkeyword-onlylogger: (logging.Logger | None) | Literal\[default] = <!-- -->'default' Logger used to redirect logs from the Actor run. Using "default" literal means that a predefined default logger will be used. Setting `None` will disable any log propagation. Passing custom logger will redirect logs to the provided logger. #### Returns [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) | None ### [**](#call_task)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L914)call\_task * **async **call\_task**(task\_id, task\_input, \*, build, memory\_mbytes, timeout, webhooks, wait, token): [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) | None - Start an Actor task on the Apify Platform and wait for it to finish before returning. It waits indefinitely, unless the wait argument is provided. Note that an Actor task is a saved input configuration and options for an Actor. If you want to run an Actor directly rather than an Actor task, please use the `Actor.call` *** #### Parameters * ##### task\_id: str The ID of the Actor to be run. * ##### optionaltask\_input: dict | None = <!-- -->None Overrides the input to pass to the Actor run. * ##### optionalkeyword-onlybuild: str | None = <!-- -->None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest). * ##### optionalkeyword-onlymemory\_mbytes: int | None = <!-- -->None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor. * ##### optionalkeyword-onlytimeout: timedelta | None = <!-- -->None Optional timeout for the run, in seconds. By default, the run uses timeout specified in the default run configuration for the Actor. 
* ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[[Webhook](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md)] | None = <!-- -->None Optional webhooks (<https://docs.apify.com/webhooks>) associated with the Actor run, which can be used to receive a notification, e.g. when the Actor finished or failed. If you already have a webhook set up for the Actor, you do not have to add it again here. * ##### optionalkeyword-onlywait: timedelta | None = <!-- -->None The maximum number of seconds the server waits for the run to finish. If not provided, waits indefinitely. * ##### optionalkeyword-onlytoken: str | None = <!-- -->None The Apify API token to use for this request (defaults to the `APIFY_TOKEN` environment variable). #### Returns [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) | None ### [**](#charge)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L601)charge * **async **charge**(event\_name, count): [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) - Charge for a specified number of events - sub-operations of the Actor. This is relevant only for the pay-per-event pricing model. *** #### Parameters * ##### event\_name: str Name of the event to be charged for. * ##### optionalcount: int = <!-- -->1 Number of events to charge for. #### Returns [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) ### [**](#create_proxy_configuration)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L1156)create\_proxy\_configuration * **async **create\_proxy\_configuration**(\*, actor\_proxy\_input, password, groups, country\_code, proxy\_urls, new\_url\_function): [ProxyConfiguration](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md) | None - Create a ProxyConfiguration object with the passed proxy configuration. Configures connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. For more details and code examples, see the `ProxyConfiguration` class. *** #### Parameters * ##### optionalkeyword-onlyactor\_proxy\_input: dict | None = <!-- -->None Proxy configuration field from the Actor input, if input has such input field. If you pass this argument, all the other arguments will be inferred from it. * ##### optionalkeyword-onlypassword: str | None = <!-- -->None Password for the Apify Proxy. If not provided, will use os.environ\['APIFY\_PROXY\_PASSWORD'], if available. * ##### optionalkeyword-onlygroups: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None Proxy groups which the Apify Proxy should use, if provided. * ##### optionalkeyword-onlycountry\_code: str | None = <!-- -->None Country which the Apify Proxy should use, if provided. * ##### optionalkeyword-onlyproxy\_urls: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str | None] | None = <!-- -->None Custom proxy server URLs which should be rotated through. * ##### optionalkeyword-onlynew\_url\_function: \_NewUrlFunction | None = <!-- -->None Function which returns a custom proxy URL to be used. 
#### Returns [ProxyConfiguration](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md) | None ### [**](#exit)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L292)exit * **async **exit**(\*, exit\_code, event\_listeners\_timeout, status\_message, cleanup\_timeout): None - Exit the Actor instance. This stops the Actor instance. It cancels all the intervals for regularly sending `PERSIST_STATE` events, sends a final `PERSIST_STATE` event, waits for all the event listeners to finish, and stops the event manager. *** #### Parameters * ##### optionalkeyword-onlyexit\_code: int = <!-- -->0 The exit code with which the Actor should fail (defaults to `0`). * ##### optionalkeyword-onlyevent\_listeners\_timeout: timedelta | None = <!-- -->EVENT\_LISTENERS\_TIMEOUT How long should the Actor wait for Actor event listeners to finish before exiting. * ##### optionalkeyword-onlystatus\_message: str | None = <!-- -->None The final status message that the Actor should display. * ##### optionalkeyword-onlycleanup\_timeout: timedelta = <!-- -->timedelta(seconds=30) How long we should wait for event listeners. #### Returns None ### [**](#fail)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L338)fail * **async **fail**(\*, exit\_code, exception, status\_message): None - Fail the Actor instance. This performs all the same steps as Actor.exit(), but it additionally sets the exit code to `1` (by default). *** #### Parameters * ##### optionalkeyword-onlyexit\_code: int = <!-- -->1 The exit code with which the Actor should fail (defaults to `1`). * ##### optionalkeyword-onlyexception: BaseException | None = <!-- -->None The exception with which the Actor failed. * ##### optionalkeyword-onlystatus\_message: str | None = <!-- -->None The final status message that the Actor should display. #### Returns None ### [**](#get_charging_manager)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L596)get\_charging\_manager * ****get\_charging\_manager**(): [ChargingManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md) - Retrieve the charging manager to access granular pricing information. *** #### Returns [ChargingManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md) ### [**](#get_env)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L694)get\_env * ****get\_env**(): dict - Return a dictionary with information parsed from all the `APIFY_XXX` environment variables. For a list of all the environment variables, see the [Actor documentation](https://docs.apify.com/actors/development/environment-variables). If some variables are not defined or are invalid, the corresponding value in the resulting dictionary will be None. *** #### Returns dict ### [**](#get_input)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L549)get\_input * **async **get\_input**(): Any - Get the Actor input value from the default key-value store associated with the current Actor run. *** #### Returns Any ### [**](#get_value)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L565)get\_value * **async **get\_value**(key, default\_value): Any - Get a value from the default key-value store associated with the current Actor run. *** #### Parameters * ##### key: str The key of the record which to retrieve. 
* ##### optionaldefault\_value: Any = <!-- -->None Default value returned in case the record does not exist. #### Returns Any ### [**](#init)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L243)init * **async **init**(): None - Initialize the Actor instance. This initializes the Actor instance. It configures the right storage client based on whether the Actor is running locally or on the Apify platform, it initializes the event manager for processing Actor events, and starts an interval for regularly sending `PERSIST_STATE` events, so that the Actor can regularly persist its state in response to these events. This method should be called immediately before performing any additional Actor actions, and it should be called only once. *** #### Returns None ### [**](#is_at_home)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L690)is\_at\_home * ****is\_at\_home**(): bool - Return `True` when the Actor is running on the Apify platform, and `False` otherwise (e.g. local run). *** #### Returns bool ### [**](#metamorph)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L975)metamorph * **async **metamorph**(target\_actor\_id, run\_input, \*, target\_actor\_build, content\_type, custom\_after\_sleep): None - Transform this Actor run to an Actor run of a different Actor. The platform stops the current Actor container and starts a new container with the new Actor instead. All the default storages are preserved, and the new input is stored under the `INPUT-METAMORPH-1` key in the same default key-value store. *** #### Parameters * ##### target\_actor\_id: str ID of the target Actor that the run should be transformed into * ##### optionalrun\_input: Any = <!-- -->None The input to pass to the new run. * ##### optionalkeyword-onlytarget\_actor\_build: str | None = <!-- -->None The build of the target Actor. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the target Actor (typically the latest build). * ##### optionalkeyword-onlycontent\_type: str | None = <!-- -->None The content type of the input. * ##### optionalkeyword-onlycustom\_after\_sleep: timedelta | None = <!-- -->None How long to sleep for after the metamorph, to wait for the container to be stopped. #### Returns None ### [**](#new_client)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L363)new\_client * ****new\_client**(\*, token, api\_url, max\_retries, min\_delay\_between\_retries, timeout): ApifyClientAsync - Return a new instance of the Apify API client. The `ApifyClientAsync` class is provided by the [apify-client](https://github.com/apify/apify-client-python) package, and it is automatically configured using the `APIFY_API_BASE_URL` and `APIFY_TOKEN` environment variables. You can override the token via the available options. That's useful if you want to use the client as a different Apify user than the SDK internals are using. *** #### Parameters * ##### optionalkeyword-onlytoken: str | None = <!-- -->None The Apify API token. * ##### optionalkeyword-onlyapi\_url: str | None = <!-- -->None The URL of the Apify API server to which to connect to. Defaults to <https://api.apify.com>. * ##### optionalkeyword-onlymax\_retries: int | None = <!-- -->None How many times to retry a failed request at most. 
* ##### optionalkeyword-onlymin\_delay\_between\_retries: timedelta | None = <!-- -->None How long will the client wait between retrying requests (increases exponentially from this value). * ##### optionalkeyword-onlytimeout: timedelta | None = <!-- -->None The socket timeout of the HTTP requests sent to the Apify API. #### Returns ApifyClientAsync ### [**](#off)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L678)off * ****off**(event\_name: Event, listener? <!-- --> : Callable | None): None * ****off**(event\_name: Literal\[Event.PERSIST\_STATE], listener: EventListener\[EventPersistStateData]): None * ****off**(event\_name: Literal\[Event.SYSTEM\_INFO], listener: EventListener\[EventSystemInfoData]): None * ****off**(event\_name: Literal\[Event.MIGRATING], listener: EventListener\[EventMigratingData]): None * ****off**(event\_name: Literal\[Event.ABORTING], listener: EventListener\[EventAbortingData]): None * ****off**(event\_name: Literal\[Event.EXIT], listener: EventListener\[EventExitData]): None * ****off**(event\_name: Event, listener: EventListener\[None]): None - Remove a listener, or all listeners, from an Actor event. *** #### Parameters * ##### event\_name: Event The Actor event for which to remove listeners. * ##### optionallistener: Callable | None = <!-- -->None The listener which is supposed to be removed. If not passed, all listeners of this event are removed. #### Returns None ### [**](#on)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L636)on * ****on**(event\_name: Event, listener: EventListener\[Any]): EventListener\[Any] * ****on**(event\_name: Literal\[Event.PERSIST\_STATE], listener: EventListener\[EventPersistStateData]): EventListener\[EventPersistStateData] * ****on**(event\_name: Literal\[Event.SYSTEM\_INFO], listener: EventListener\[EventSystemInfoData]): EventListener\[EventSystemInfoData] * ****on**(event\_name: Literal\[Event.MIGRATING], listener: EventListener\[EventMigratingData]): EventListener\[EventMigratingData] * ****on**(event\_name: Literal\[Event.ABORTING], listener: EventListener\[EventAbortingData]): EventListener\[EventAbortingData] * ****on**(event\_name: Literal\[Event.EXIT], listener: EventListener\[EventExitData]): EventListener\[EventExitData] * ****on**(event\_name: Event, listener: EventListener\[None]): EventListener\[Any] - Add an event listener to the Actor's event manager. The following events can be emitted: * `Event.SYSTEM_INFO`: Emitted every minute; the event data contains information about the Actor's resource usage. * `Event.MIGRATING`: Emitted when the Actor on the Apify platform is about to be migrated to another worker server. Use this event to persist the Actor's state and gracefully stop in-progress tasks, preventing disruption. * `Event.PERSIST_STATE`: Emitted regularly (default: 60 seconds) to notify the Actor to persist its state, preventing work repetition after a restart. This event is emitted together with the `MIGRATING` event, where the `isMigrating` flag in the event data is `True`; otherwise, the flag is `False`. This event is for convenience; the same effect can be achieved by setting an interval and listening for the `MIGRATING` event. * `Event.ABORTING`: Emitted when a user aborts an Actor run on the Apify platform, allowing the Actor time to clean up its state if the abort is graceful. *** #### Parameters * ##### event\_name: Event The Actor event to listen for. 
* ##### listener: EventListener\[Any] The function to be called when the event is emitted (can be async). #### Returns EventListener\[Any] ### [**](#open_dataset)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L401)open\_dataset * **async **open\_dataset**(\*, id, name, force\_cloud): Dataset - Open a dataset. Datasets are used to store structured data where each object stored has the same attributes, such as online store products or real estate offers. The actual data is stored either on the local filesystem or in the Apify cloud. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None ID of the dataset to be opened. If neither `id` nor `name` are provided, the method returns the default dataset associated with the Actor run. * ##### optionalkeyword-onlyname: str | None = <!-- -->None Name of the dataset to be opened. If neither `id` nor `name` are provided, the method returns the default dataset associated with the Actor run. * ##### optionalkeyword-onlyforce\_cloud: bool = <!-- -->False If set to `True` then the Apify cloud storage is always used. This way it is possible to combine local and cloud storage. #### Returns Dataset ### [**](#open_key_value_store)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L437)open\_key\_value\_store * **async **open\_key\_value\_store**(\*, id, name, force\_cloud): KeyValueStore - Open a key-value store. Key-value stores are used to store records or files, along with their MIME content type. The records are stored and retrieved using a unique key. The actual data is stored either on a local filesystem or in the Apify cloud. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None ID of the key-value store to be opened. If neither `id` nor `name` are provided, the method returns the default key-value store associated with the Actor run. * ##### optionalkeyword-onlyname: str | None = <!-- -->None Name of the key-value store to be opened. If neither `id` nor `name` are provided, the method returns the default key-value store associated with the Actor run. * ##### optionalkeyword-onlyforce\_cloud: bool = <!-- -->False If set to `True` then the Apify cloud storage is always used. This way it is possible to combine local and cloud storage. #### Returns KeyValueStore ### [**](#open_request_queue)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L471)open\_request\_queue * **async **open\_request\_queue**(\*, id, name, force\_cloud): RequestQueue - Open a request queue. Request queue represents a queue of URLs to crawl, which is stored either on local filesystem or in the Apify cloud. The queue is used for deep crawling of websites, where you start with several URLs and then recursively follow links to other pages. The data structure supports both breadth-first and depth-first crawling orders. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None ID of the request queue to be opened. If neither `id` nor `name` are provided, the method returns the default request queue associated with the Actor run. * ##### optionalkeyword-onlyname: str | None = <!-- -->None Name of the request queue to be opened. If neither `id` nor `name` are provided, the method returns the default request queue associated with the Actor run. * ##### optionalkeyword-onlyforce\_cloud: bool = <!-- -->False If set to `True` then the Apify cloud storage is always used. This way it is possible to combine local and cloud storage. 
#### Returns RequestQueue ### [**](#push_data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L512)push\_data * **async **push\_data**(data: dict | [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[dict], charged\_event\_name? <!-- --> : str | None): [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) | None * **async **push\_data**(data: dict | [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[dict]): None * **async **push\_data**(data: dict | [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[dict], charged\_event\_name: str): [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) - Store an object or a list of objects in the default dataset of the current Actor run. *** #### Parameters * ##### data: dict | [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[dict] The data to push to the default dataset. * ##### optionalcharged\_event\_name: str | None = <!-- -->None If provided and if the Actor uses the pay-per-event pricing model, the method will attempt to charge for the event for each pushed item. #### Returns [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) | None ### [**](#reboot)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L1022)reboot * **async **reboot**(\*, event\_listeners\_timeout, custom\_after\_sleep): None - Internally reboot this Actor. The system stops the current container and starts a new one, with the same run ID and default storages. *** #### Parameters * ##### optionalkeyword-onlyevent\_listeners\_timeout: timedelta | None = <!-- -->EVENT\_LISTENERS\_TIMEOUT How long the Actor should wait for Actor event listeners to finish before exiting. * ##### optionalkeyword-onlycustom\_after\_sleep: timedelta | None = <!-- -->None How long to sleep for after the reboot, to wait for the container to be stopped. #### Returns None ### [**](#set_status_message)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L1124)set\_status\_message * **async **set\_status\_message**(status\_message, \*, is\_terminal): [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) | None - Set the status message for the current Actor run. *** #### Parameters * ##### status\_message: str The status message to set for the run. * ##### optionalkeyword-onlyis\_terminal: bool | None = <!-- -->None Set this flag to `True` if this is the final status message of the Actor run. #### Returns [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) | None ### [**](#set_value)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L577)set\_value * **async **set\_value**(key, value, \*, content\_type): None - Set or delete a value in the default key-value store associated with the current Actor run. *** #### Parameters * ##### key: str The key of the record to set. * ##### value: Any The value of the record to set, or `None` if the record should be deleted. * ##### optionalkeyword-onlycontent\_type: str | None = <!-- -->None The content type that should be set for the value.
#### Returns None ### [**](#start)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L723)start * **async **start**(actor\_id, run\_input, \*, token, content\_type, build, memory\_mbytes, timeout, wait\_for\_finish, webhooks): [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) - Run an Actor on the Apify platform. Unlike `Actor.call`, this method just starts the run without waiting for it to finish. *** #### Parameters * ##### actor\_id: str The ID of the Actor to be run. * ##### optionalrun\_input: Any = <!-- -->None The input to pass to the Actor run. * ##### optionalkeyword-onlytoken: str | None = <!-- -->None The Apify API token to use for this request (defaults to the `APIFY_TOKEN` environment variable). * ##### optionalkeyword-onlycontent\_type: str | None = <!-- -->None The content type of the input. * ##### optionalkeyword-onlybuild: str | None = <!-- -->None Specifies the Actor build to run. It can be either a build tag or build number. By default, the run uses the build specified in the default run configuration for the Actor (typically latest). * ##### optionalkeyword-onlymemory\_mbytes: int | None = <!-- -->None Memory limit for the run, in megabytes. By default, the run uses a memory limit specified in the default run configuration for the Actor. * ##### optionalkeyword-onlytimeout: (timedelta | None) | Literal\[RemainingTime] = <!-- -->None Optional timeout for the run, in seconds. By default, the run uses the timeout specified in the default run configuration for the Actor. Using `RemainingTime` will set the timeout of the other Actor to the time remaining in this Actor's timeout. * ##### optionalkeyword-onlywait\_for\_finish: int | None = <!-- -->None The maximum number of seconds the server waits for the run to finish. By default, it is 0; the maximum value is 300. * ##### optionalkeyword-onlywebhooks: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[[Webhook](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md)] | None = <!-- -->None Optional ad-hoc webhooks (<https://docs.apify.com/webhooks/ad-hoc-webhooks>) associated with the Actor run, which can be used to receive a notification, e.g. when the Actor finishes or fails. If you already have a webhook set up for the Actor or task, you do not have to add it again here. #### Returns [ActorRun](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md) ## Properties<!-- -->[**](#Properties) ### [**](#apify_client)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L200)apify\_client **apify\_client: ApifyClientAsync The ApifyClientAsync instance the Actor instance uses. ### [**](#config)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L210)config **config: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) The Configuration instance the Actor instance uses. ### [**](#configuration)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L205)configuration **configuration: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) The Configuration instance the Actor instance uses. ### [**](#event_manager)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L215)event\_manager **event\_manager: EventManager The EventManager instance the Actor instance uses.
### [**](#log)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_actor.py#L220)log **log: logging.Logger The logging.Logger instance the Actor uses. --- # ActorChargeEvent<!-- --> ## Index[**](#Index) ### Properties * [**event\_description](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorChargeEvent.md#event_description) * [**event\_price\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorChargeEvent.md#event_price_usd) * [**event\_title](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorChargeEvent.md#event_title) ## Properties<!-- -->[**](#Properties) ### [**](#event_description)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L156)event\_description **event\_description: str | None ### [**](#event_price_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L154)event\_price\_usd **event\_price\_usd: Decimal ### [**](#event_title)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L155)event\_title **event\_title: str --- # ActorDatasetPushPipeline<!-- --> A Scrapy pipeline for pushing items to an Actor's default dataset. This pipeline is designed to be enabled only when the Scrapy project is run as an Actor. ## Index[**](#Index) ### Methods * [**process\_item](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorDatasetPushPipeline.md#process_item) ## Methods<!-- -->[**](#Methods) ### [**](#process_item)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/pipelines/actor_dataset_push.py#L22)process\_item * **async **process\_item**(item, spider): Item - Pushes the provided Scrapy item to the Actor's default dataset. *** #### Parameters * ##### item: Item * ##### spider: Spider #### Returns Item --- # ActorLogFormatter<!-- --> --- # ActorPricingInfo<!-- --> Result of the `ChargingManager.get_pricing_info` method. ## Index[**](#Index) ### Properties * [**is\_pay\_per\_event](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md#is_pay_per_event) * [**max\_total\_charge\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md#max_total_charge_usd) * [**per\_event\_prices](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md#per_event_prices) * [**pricing\_model](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md#pricing_model) ## Properties<!-- -->[**](#Properties) ### [**](#is_pay_per_event)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L94)is\_pay\_per\_event **is\_pay\_per\_event: bool A shortcut - true if the Actor runs with the pay-per-event pricing model. ### [**](#max_total_charge_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L91)max\_total\_charge\_usd **max\_total\_charge\_usd: Decimal A configured limit for the total charged amount - if you exceed it, you won't receive more money than this. ### [**](#per_event_prices)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L97)per\_event\_prices **per\_event\_prices: dict\[str, Decimal] Price of every known event type. ### [**](#pricing_model)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L88)pricing\_model **pricing\_model: Literal\[ 'FREE', 'FLAT\_PRICE\_PER\_MONTH', 'PRICE\_PER\_DATASET\_ITEM', 'PAY\_PER\_EVENT',] | None The currently effective pricing model. 
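For example, an Actor using the pay-per-event pricing model can combine the pricing information above with `Actor.push_data` charging. The following is only an illustrative sketch: the `Actor.get_charging_manager()` accessor and the `'item-scraped'` event name are assumptions, not part of this reference, so verify them against your SDK version and pricing configuration.

```python
from apify import Actor


async def main() -> None:
    async with Actor:
        # Inspect the effective pricing of this run (see ActorPricingInfo above).
        # The accessor name is an assumption - check your SDK version.
        pricing = Actor.get_charging_manager().get_pricing_info()
        if pricing.is_pay_per_event:
            Actor.log.info('Price per item: %s', pricing.per_event_prices.get('item-scraped'))

        items = [{'url': 'https://example.com', 'title': 'Example'}]

        # push_data attempts to charge the named event once per pushed item
        # and returns a ChargeResult when a charged_event_name is provided.
        result = await Actor.push_data(items, charged_event_name='item-scraped')
        if result and result.event_charge_limit_reached:
            # No more 'item-scraped' events fit within the configured charge limit.
            await Actor.exit()
```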
--- # ActorRun<!-- --> ## Index[**](#Index) ### Properties * [**\_\_model\_config\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#__model_config__) * [**act\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#act_id) * [**actor\_task\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#actor_task_id) * [**build\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#build_id) * [**build\_number](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#build_number) * [**charged\_event\_counts](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#charged_event_counts) * [**container\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#container_url) * [**default\_dataset\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#default_dataset_id) * [**default\_key\_value\_store\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#default_key_value_store_id) * [**default\_request\_queue\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#default_request_queue_id) * [**exit\_code](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#exit_code) * [**finished\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#finished_at) * [**git\_branch\_name](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#git_branch_name) * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#id) * [**is\_container\_server\_ready](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#is_container_server_ready) * [**is\_status\_message\_terminal](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#is_status_message_terminal) * [**meta](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#meta) * [**options](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#options) * [**pricing\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#pricing_info) * [**started\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#started_at) * [**stats](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#stats) * [**status](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#status) * [**status\_message](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#status_message) * [**usage](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#usage) * [**usage\_total\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#usage_total_usd) * [**usage\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#usage_usd) * [**user\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRun.md#user_id) ## Properties<!-- -->[**](#Properties) ### [**](#__model_config__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L97)\_\_model\_config\_\_ **\_\_model\_config\_\_: Undefined ### [**](#act_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L100)act\_id **act\_id: str ### [**](#actor_task_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L102)actor\_task\_id **actor\_task\_id: str | None ### 
[**](#build_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L111)build\_id **build\_id: str ### [**](#build_number)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L116)build\_number **build\_number: str | None ### [**](#charged_event_counts)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L131)charged\_event\_counts **charged\_event\_counts: dict\[str, int] | None ### [**](#container_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L117)container\_url **container\_url: str ### [**](#default_dataset_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L114)default\_dataset\_id **default\_dataset\_id: str ### [**](#default_key_value_store_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L113)default\_key\_value\_store\_id **default\_key\_value\_store\_id: str ### [**](#default_request_queue_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L115)default\_request\_queue\_id **default\_request\_queue\_id: str ### [**](#exit_code)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L112)exit\_code **exit\_code: int | None ### [**](#finished_at)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L104)finished\_at **finished\_at: datetime | None ### [**](#git_branch_name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L119)git\_branch\_name **git\_branch\_name: str | None ### [**](#id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L99)id **id: str ### [**](#is_container_server_ready)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L118)is\_container\_server\_ready **is\_container\_server\_ready: bool | None ### [**](#is_status_message_terminal)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L107)is\_status\_message\_terminal **is\_status\_message\_terminal: bool | None ### [**](#meta)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L108)meta **meta: [ActorRunMeta](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunMeta.md) ### [**](#options)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L110)options **options: [ActorRunOptions](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md) ### [**](#pricing_info)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L123)pricing\_info **pricing\_info: ((([FreeActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/FreeActorPricingInfo.md) | [FlatPricePerMonthActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/FlatPricePerMonthActorPricingInfo.md)) | [PricePerDatasetItemActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricePerDatasetItemActorPricingInfo.md)) | [PayPerEventActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/PayPerEventActorPricingInfo.md)) | None ### [**](#started_at)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L103)started\_at **started\_at: datetime ### [**](#stats)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L109)stats **stats: 
[ActorRunStats](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md) ### [**](#status)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L105)status **status: ActorJobStatus ### [**](#status_message)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L106)status\_message **status\_message: str | None ### [**](#usage)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L120)usage **usage: [ActorRunUsage](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md) | None ### [**](#usage_total_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L121)usage\_total\_usd **usage\_total\_usd: float | None ### [**](#usage_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L122)usage\_usd **usage\_usd: [ActorRunUsage](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md) | None ### [**](#user_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L101)user\_id **user\_id: str --- # ActorRunMeta<!-- --> ## Index[**](#Index) ### Properties * [**\_\_model\_config\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunMeta.md#__model_config__) * [**origin](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunMeta.md#origin) ## Properties<!-- -->[**](#Properties) ### [**](#__model_config__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L40)\_\_model\_config\_\_ **\_\_model\_config\_\_: Undefined ### [**](#origin)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L42)origin **origin: MetaOrigin --- # ActorRunOptions<!-- --> ## Index[**](#Index) ### Properties * [**\_\_model\_config\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md#__model_config__) * [**build](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md#build) * [**disk\_mbytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md#disk_mbytes) * [**max\_total\_charge\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md#max_total_charge_usd) * [**memory\_mbytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md#memory_mbytes) * [**timeout](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunOptions.md#timeout) ## Properties<!-- -->[**](#Properties) ### [**](#__model_config__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L68)\_\_model\_config\_\_ **\_\_model\_config\_\_: Undefined ### [**](#build)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L70)build **build: str ### [**](#disk_mbytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L73)disk\_mbytes **disk\_mbytes: int ### [**](#max_total_charge_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L74)max\_total\_charge\_usd **max\_total\_charge\_usd: Decimal | None ### [**](#memory_mbytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L72)memory\_mbytes **memory\_mbytes: int ### [**](#timeout)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L71)timeout **timeout: timedelta --- # ActorRunStats<!-- --> ## Index[**](#Index) ### Properties * 
[**\_\_model\_config\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#__model_config__) * [**compute\_units](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#compute_units) * [**cpu\_avg\_usage](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#cpu_avg_usage) * [**cpu\_current\_usage](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#cpu_current_usage) * [**cpu\_max\_usage](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#cpu_max_usage) * [**duration](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#duration) * [**input\_body\_len](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#input_body_len) * [**mem\_avg\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#mem_avg_bytes) * [**mem\_current\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#mem_current_bytes) * [**mem\_max\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#mem_max_bytes) * [**metamorph](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#metamorph) * [**net\_rx\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#net_rx_bytes) * [**net\_tx\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#net_tx_bytes) * [**restart\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#restart_count) * [**resurrect\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#resurrect_count) * [**run\_time](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunStats.md#run_time) ## Properties<!-- -->[**](#Properties) ### [**](#__model_config__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L47)\_\_model\_config\_\_ **\_\_model\_config\_\_: Undefined ### [**](#compute_units)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L63)compute\_units **compute\_units: float ### [**](#cpu_avg_usage)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L55)cpu\_avg\_usage **cpu\_avg\_usage: float | None ### [**](#cpu_current_usage)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L57)cpu\_current\_usage **cpu\_current\_usage: float | None ### [**](#cpu_max_usage)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L56)cpu\_max\_usage **cpu\_max\_usage: float | None ### [**](#duration)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L60)duration **duration: timedelta\_ms | None ### [**](#input_body_len)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L49)input\_body\_len **input\_body\_len: int | None ### [**](#mem_avg_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L52)mem\_avg\_bytes **mem\_avg\_bytes: float | None ### [**](#mem_current_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L54)mem\_current\_bytes **mem\_current\_bytes: int | None ### [**](#mem_max_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L53)mem\_max\_bytes **mem\_max\_bytes: int | None ### 
[**](#metamorph)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L62)metamorph **metamorph: int | None ### [**](#net_rx_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L58)net\_rx\_bytes **net\_rx\_bytes: int | None ### [**](#net_tx_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L59)net\_tx\_bytes **net\_tx\_bytes: int | None ### [**](#restart_count)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L50)restart\_count **restart\_count: int ### [**](#resurrect_count)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L51)resurrect\_count **resurrect\_count: int ### [**](#run_time)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L61)run\_time **run\_time: timedelta | None --- # ActorRunUsage<!-- --> ## Index[**](#Index) ### Properties * [**\_\_model\_config\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#__model_config__) * [**actor\_compute\_units](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#actor_compute_units) * [**data\_transfer\_external\_gbytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#data_transfer_external_gbytes) * [**data\_transfer\_internal\_gbytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#data_transfer_internal_gbytes) * [**dataset\_reads](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#dataset_reads) * [**dataset\_writes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#dataset_writes) * [**key\_value\_store\_lists](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#key_value_store_lists) * [**key\_value\_store\_reads](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#key_value_store_reads) * [**key\_value\_store\_writes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#key_value_store_writes) * [**proxy\_residential\_transfer\_gbytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#proxy_residential_transfer_gbytes) * [**proxy\_serps](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#proxy_serps) * [**request\_queue\_reads](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#request_queue_reads) * [**request\_queue\_writes](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorRunUsage.md#request_queue_writes) ## Properties<!-- -->[**](#Properties) ### [**](#__model_config__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L79)\_\_model\_config\_\_ **\_\_model\_config\_\_: Undefined ### [**](#actor_compute_units)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L81)actor\_compute\_units **actor\_compute\_units: float | None ### [**](#data_transfer_external_gbytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L90)data\_transfer\_external\_gbytes **data\_transfer\_external\_gbytes: float | None ### [**](#data_transfer_internal_gbytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L89)data\_transfer\_internal\_gbytes **data\_transfer\_internal\_gbytes: float | None ### 
[**](#dataset_reads)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L82)dataset\_reads **dataset\_reads: float | None ### [**](#dataset_writes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L83)dataset\_writes **dataset\_writes: float | None ### [**](#key_value_store_lists)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L86)key\_value\_store\_lists **key\_value\_store\_lists: float | None ### [**](#key_value_store_reads)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L84)key\_value\_store\_reads **key\_value\_store\_reads: float | None ### [**](#key_value_store_writes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L85)key\_value\_store\_writes **key\_value\_store\_writes: float | None ### [**](#proxy_residential_transfer_gbytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L91)proxy\_residential\_transfer\_gbytes **proxy\_residential\_transfer\_gbytes: float | None ### [**](#proxy_serps)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L92)proxy\_serps **proxy\_serps: float | None ### [**](#request_queue_reads)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L87)request\_queue\_reads **request\_queue\_reads: float | None ### [**](#request_queue_writes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L88)request\_queue\_writes **request\_queue\_writes: float | None --- # AddRequestsResponse<!-- --> Model for a response to add requests to a queue. Contains detailed information about the processing results when adding multiple requests to a queue. This includes which requests were successfully processed and which ones encountered issues during processing. ## Index[**](#Index) ### Properties * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/AddRequestsResponse.md#model_config) * [**processed\_requests](https://docs.apify.com/sdk/python/sdk/python/reference/class/AddRequestsResponse.md#processed_requests) * [**unprocessed\_requests](https://docs.apify.com/sdk/python/sdk/python/reference/class/AddRequestsResponse.md#unprocessed_requests) ## Properties<!-- -->[**](#Properties) ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L166)model\_config **model\_config: Undefined ### [**](#processed_requests)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L168)processed\_requests **processed\_requests: list\[[ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest)] Successfully processed requests, including information about whether they were already present in the queue and whether they had been handled previously. ### [**](#unprocessed_requests)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L172)unprocessed\_requests **unprocessed\_requests: list\[[UnprocessedRequest](https://crawlee.dev/python/api/class/UnprocessedRequest)] Requests that could not be processed, typically due to validation errors or other issues. --- # ApifyCacheStorage<!-- --> A Scrapy cache storage that uses the Apify `KeyValueStore` to store responses. It can be set as a storage for Scrapy's built-in `HttpCacheMiddleware`, which caches responses to requests. 
See HTTPCache middleware settings (prefixed with `HTTPCACHE_`) in the Scrapy documentation for more information. Requires the asyncio Twisted reactor to be installed. ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyCacheStorage.md#__init__) * [**close\_spider](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyCacheStorage.md#close_spider) * [**open\_spider](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyCacheStorage.md#open_spider) * [**retrieve\_response](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyCacheStorage.md#retrieve_response) * [**store\_response](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyCacheStorage.md#store_response) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/extensions/_httpcache.py#L38)\_\_init\_\_ * ****\_\_init\_\_**(settings): None - #### Parameters * ##### settings: BaseSettings #### Returns None ### [**](#close_spider)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/extensions/_httpcache.py#L65)close\_spider * ****close\_spider**(\_, current\_time): None - Close the cache storage for a spider. *** #### Parameters * ##### \_: Spider * ##### optionalcurrent\_time: int | None = <!-- -->None #### Returns None ### [**](#open_spider)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/extensions/_httpcache.py#L46)open\_spider * ****open\_spider**(spider): None - Open the cache storage for a spider. *** #### Parameters * ##### spider: Spider #### Returns None ### [**](#retrieve_response)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/extensions/_httpcache.py#L108)retrieve\_response * ****retrieve\_response**(\_, request, current\_time): Response | None - Retrieve a response from the cache storage. *** #### Parameters * ##### \_: Spider * ##### request: Request * ##### optionalcurrent\_time: int | None = <!-- -->None #### Returns Response | None ### [**](#store_response)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/extensions/_httpcache.py#L140)store\_response * ****store\_response**(\_, request, response): None - Store a response in the cache storage. *** #### Parameters * ##### \_: Spider * ##### request: Request * ##### response: Response #### Returns None --- # ApifyHttpProxyMiddleware<!-- --> Apify HTTP proxy middleware for Scrapy. This middleware enhances request processing by adding a 'proxy' field to the request's meta and an authentication header. It draws inspiration from the `HttpProxyMiddleware` included by default in Scrapy projects. The proxy URL is sourced from the settings under the `APIFY_PROXY_SETTINGS` key. The value of this key, a dictionary, should be provided by the Actor input. 
An example of the proxy settings: `proxy_settings = {'useApifyProxy': True, 'apifyProxyGroups': []}` ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyHttpProxyMiddleware.md#__init__) * [**from\_crawler](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyHttpProxyMiddleware.md#from_crawler) * [**process\_exception](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyHttpProxyMiddleware.md#process_exception) * [**process\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyHttpProxyMiddleware.md#process_request) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/middlewares/apify_proxy.py#L28)\_\_init\_\_ * ****\_\_init\_\_**(proxy\_settings): None - Create a new instance. *** #### Parameters * ##### proxy\_settings: dict Dictionary containing proxy settings, provided by the Actor input. #### Returns None ### [**](#from_crawler)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/middlewares/apify_proxy.py#L39)from\_crawler * ****from\_crawler**(crawler): [ApifyHttpProxyMiddleware](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyHttpProxyMiddleware.md) - Create an instance of ApifyHttpProxyMiddleware from a Scrapy Crawler. *** #### Parameters * ##### crawler: Crawler Scrapy Crawler object. #### Returns [ApifyHttpProxyMiddleware](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyHttpProxyMiddleware.md) ### [**](#process_exception)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/middlewares/apify_proxy.py#L89)process\_exception * ****process\_exception**(request, exception, spider): None - Process an exception that occurs during request processing. *** #### Parameters * ##### request: Request Scrapy Request object. * ##### exception: Exception Exception object. * ##### spider: Spider Scrapy Spider object. #### Returns None ### [**](#process_request)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/middlewares/apify_proxy.py#L67)process\_request * **async **process\_request**(request, spider): None - Process a Scrapy request by assigning a new proxy. *** #### Parameters * ##### request: Request Scrapy Request object. * ##### spider: Spider Scrapy Spider object. #### Returns None --- # ApifyScheduler<!-- --> A Scrapy scheduler that uses the Apify `RequestQueue` to manage requests. This scheduler requires the asyncio Twisted reactor to be installed.
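The scheduler is usually enabled together with the other Scrapy integrations described in this reference (the proxy middleware, the dataset push pipeline, and optionally the cache storage) in the project's `settings.py`. The sketch below is illustrative only: the module paths and priority numbers are assumptions inferred from the source locations shown above, so check them against your Apify Scrapy project template.

```python
# settings.py of a Scrapy project running as an Apify Actor (illustrative sketch;
# module paths and priorities are assumptions - verify against your project template).

# The asyncio Twisted reactor is required by ApifyScheduler and ApifyCacheStorage.
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

# Use the Apify RequestQueue for scheduling requests.
SCHEDULER = 'apify.scrapy.scheduler.ApifyScheduler'

# Route requests through Apify Proxy based on the APIFY_PROXY_SETTINGS Actor input.
DOWNLOADER_MIDDLEWARES = {
    'apify.scrapy.middlewares.ApifyHttpProxyMiddleware': 750,
}

# Push scraped items to the Actor's default dataset.
ITEM_PIPELINES = {
    'apify.scrapy.pipelines.ActorDatasetPushPipeline': 1000,
}

# Optionally cache HTTP responses in an Apify KeyValueStore.
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'apify.scrapy.extensions.ApifyCacheStorage'
```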
## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyScheduler.md#__init__) * [**close](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyScheduler.md#close) * [**enqueue\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyScheduler.md#enqueue_request) * [**has\_pending\_requests](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyScheduler.md#has_pending_requests) * [**next\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyScheduler.md#next_request) * [**open](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyScheduler.md#open) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/scheduler.py#L30)\_\_init\_\_ * ****\_\_init\_\_**(): None - #### Returns None ### [**](#close)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/scheduler.py#L66)close * ****close**(reason): None - Close the scheduler. Shut down the event loop and its thread gracefully. *** #### Parameters * ##### reason: str The reason for closing the spider. #### Returns None ### [**](#enqueue_request)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/scheduler.py#L104)enqueue\_request * ****enqueue\_request**(request): bool - Add a request to the scheduler. This can be called either from a spider or from a downloader middleware (e.g. redirect, retry, ...). *** #### Parameters * ##### request: Request The request to add to the scheduler. #### Returns bool ### [**](#has_pending_requests)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/scheduler.py#L87)has\_pending\_requests * ****has\_pending\_requests**(): bool - Check if the scheduler has any pending requests. *** #### Returns bool ### [**](#next_request)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/scheduler.py#L138)next\_request * ****next\_request**(): Request | None - Fetch the next request from the scheduler. *** #### Returns Request | None ### [**](#open)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/scheduler.py#L43)open * ****open**(spider): Deferred\[None] | None - Open the scheduler. *** #### Parameters * ##### spider: Spider The spider that the scheduler is associated with. #### Returns Deferred\[None] | None --- # ApifyStorageClient<!-- --> A storage client implementation based on the Apify platform storage.
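A minimal sketch of constructing the storage client and obtaining sub-clients follows. The import path is an assumption based on the source location shown in this reference, and the dataset ID is a placeholder.

```python
from apify import Configuration
from apify.apify_storage_client import ApifyStorageClient  # import path assumed

# Build the client from the global configuration (reads the APIFY_* environment variables).
config = Configuration.get_global_configuration()
storage_client = ApifyStorageClient.from_config(config)

# Obtain clients for individual storages by ID, or collection clients for listing.
dataset_client = storage_client.dataset('<DATASET_ID>')  # placeholder ID
kvs_collection = storage_client.key_value_stores()
```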
## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#__init__) * [**dataset](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#dataset) * [**datasets](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#datasets) * [**from\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#from_config) * [**get\_rate\_limit\_errors](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#get_rate_limit_errors) * [**key\_value\_store](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#key_value_store) * [**key\_value\_stores](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#key_value_stores) * [**purge\_on\_start](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#purge_on_start) * [**request\_queue](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#request_queue) * [**request\_queues](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md#request_queues) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L27)\_\_init\_\_ * ****\_\_init\_\_**(\*, configuration): None - #### Parameters * ##### keyword-onlyconfiguration: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) #### Returns None ### [**](#dataset)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L43)dataset * ****dataset**(id): [DatasetClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md) - #### Parameters * ##### id: str #### Returns [DatasetClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md) ### [**](#datasets)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L47)datasets * ****datasets**(): [DatasetCollectionClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md) - #### Returns [DatasetCollectionClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md) ### [**](#from_config)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L39)from\_config * ****from\_config**(config): [ApifyStorageClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md) - #### Parameters * ##### config: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) #### Returns [ApifyStorageClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/ApifyStorageClient.md) ### [**](#get_rate_limit_errors)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L71)get\_rate\_limit\_errors * ****get\_rate\_limit\_errors**(): dict\[int, int] - #### Returns dict\[int, int] ### [**](#key_value_store)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L51)key\_value\_store * ****key\_value\_store**(id): [KeyValueStoreClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md) - #### Parameters 
* ##### id: str #### Returns [KeyValueStoreClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md) ### [**](#key_value_stores)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L55)key\_value\_stores * ****key\_value\_stores**(): [KeyValueStoreCollectionClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreCollectionClient.md) - #### Returns [KeyValueStoreCollectionClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreCollectionClient.md) ### [**](#purge_on_start)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L67)purge\_on\_start * **async **purge\_on\_start**(): None - #### Returns None ### [**](#request_queue)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L59)request\_queue * ****request\_queue**(id): [RequestQueueClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md) - #### Parameters * ##### id: str #### Returns [RequestQueueClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md) ### [**](#request_queues)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_apify_storage_client.py#L63)request\_queues * ****request\_queues**(): [RequestQueueCollectionClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueCollectionClient.md) - #### Returns [RequestQueueCollectionClient](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueCollectionClient.md) --- # AsyncThread<!-- --> Class for running an asyncio event loop in a separate thread. This allows running asynchronous coroutines from synchronous code by executing them on an event loop that runs in its own dedicated thread. ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/AsyncThread.md#__init__) * [**close](https://docs.apify.com/sdk/python/sdk/python/reference/class/AsyncThread.md#close) * [**run\_coro](https://docs.apify.com/sdk/python/sdk/python/reference/class/AsyncThread.md#run_coro) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/_async_thread.py#L23)\_\_init\_\_ * ****\_\_init\_\_**(): None - #### Returns None ### [**](#close)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/_async_thread.py#L70)close * ****close**(timeout): None - Close the event loop and its thread gracefully. This method cancels all pending tasks, stops the event loop, and waits for the thread to exit. If the thread does not exit within the given timeout, a forced shutdown is attempted. *** #### Parameters * ##### optionaltimeout: timedelta = <!-- -->timedelta(seconds=60) The maximum number of seconds to wait for the event loop thread to exit. #### Returns None ### [**](#run_coro)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/scrapy/_async_thread.py#L33)run\_coro * ****run\_coro**(coro, timeout): Any - Run a coroutine on an event loop running in a separate thread. This method schedules the coroutine to run on the event loop and blocks until the coroutine completes or the specified timeout is reached. *** #### Parameters * ##### coro: Coroutine The coroutine to run.
* ##### optionaltimeout: timedelta = <!-- -->timedelta(seconds=60) The maximum number of seconds to wait for the coroutine to finish. #### Returns Any --- # ChargeResult<!-- --> Result of the `ChargingManager.charge` method. ## Index[**](#Index) ### Properties * [**chargeable\_within\_limit](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md#chargeable_within_limit) * [**charged\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md#charged_count) * [**event\_charge\_limit\_reached](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md#event_charge_limit_reached) ## Properties<!-- -->[**](#Properties) ### [**](#chargeable_within_limit)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L79)chargeable\_within\_limit **chargeable\_within\_limit: dict\[str, int | None] How many events of each known type can still be charged within the limit. ### [**](#charged_count)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L76)charged\_count **charged\_count: int Total amount of charged events - may be lower than the requested amount. ### [**](#event_charge_limit_reached)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L73)event\_charge\_limit\_reached **event\_charge\_limit\_reached: bool If true, no more events of this type can be charged within the limit. --- # ChargingManager<!-- --> Provides fine-grained access to pay-per-event functionality. The ChargingManager allows you to charge for specific events in your Actor when using the pay-per-event pricing model. This enables precise cost control and transparent billing for different operations within your Actor. ### References * Apify platform documentation: <https://docs.apify.com/platform/actors/publishing/monetize> ### Hierarchy * *ChargingManager* * [ChargingManagerImplementation](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md) ## Index[**](#Index) ### Methods * [**calculate\_max\_event\_charge\_count\_within\_limit](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#calculate_max_event_charge_count_within_limit) * [**calculate\_total\_charged\_amount](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#calculate_total_charged_amount) * [**charge](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#charge) * [**get\_pricing\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#get_pricing_info) ## Methods<!-- -->[**](#Methods) ### [**](#calculate_max_event_charge_count_within_limit)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L54)calculate\_max\_event\_charge\_count\_within\_limit * ****calculate\_max\_event\_charge\_count\_within\_limit**(event\_name): int | None - Overrides [ChargingManager.calculate\_max\_event\_charge\_count\_within\_limit](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#calculate_max_event_charge_count_within_limit) Calculate how many instances of an event can be charged before we reach the configured limit. *** #### Parameters * ##### event\_name: str Name of the inspected event. 
#### Returns int | None ### [**](#calculate_total_charged_amount)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L51)calculate\_total\_charged\_amount * ****calculate\_total\_charged\_amount**(): Decimal - Overrides [ChargingManager.calculate\_total\_charged\_amount](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#calculate_total_charged_amount) Calculate the total amount of money charged for pay-per-event events so far. *** #### Returns Decimal ### [**](#charge)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L41)charge * **async **charge**(event\_name, count): [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) - Overrides [ChargingManager.charge](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#charge) Charge for a specified number of events - sub-operations of the Actor. This is relevant only for the pay-per-event pricing model. *** #### Parameters * ##### event\_name: str Name of the event to be charged for. * ##### optionalcount: int = <!-- -->1 Number of events to charge for. #### Returns [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) ### [**](#get_pricing_info)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L61)get\_pricing\_info * ****get\_pricing\_info**(): [ActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md) - Overrides [ChargingManager.get\_pricing\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#get_pricing_info) Retrieve detailed information about the effective pricing of the current Actor run. This can be used for instance when your code needs to support multiple pricing models in transition periods. *** #### Returns [ActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md) --- # ChargingManagerImplementation<!-- --> Implementation of the `ChargingManager` Protocol - this is only meant to be instantiated internally. 
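Actor code should not construct this class itself; charging is normally accessed through the `ChargingManager` interface documented above. A minimal, hedged sketch follows: the `Actor.get_charging_manager()` accessor and the `'page-processed'` event name are assumptions, not part of this reference, so verify them against your SDK version.

```python
from apify import Actor


async def main() -> None:
    async with Actor:
        # Access charging through the ChargingManager interface rather than
        # instantiating ChargingManagerImplementation directly.
        # The accessor name is an assumption - check your SDK version.
        charging = Actor.get_charging_manager()

        # How many 'page-processed' events (hypothetical name) still fit within the limit?
        remaining = charging.calculate_max_event_charge_count_within_limit('page-processed')

        if remaining is None or remaining > 0:
            await charging.charge('page-processed', count=1)
            Actor.log.info('Charged so far: $%s', charging.calculate_total_charged_amount())
```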
### Hierarchy * [ChargingManager](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md) * *ChargingManagerImplementation* ## Index[**](#Index) ### Methods * [**\_\_aenter\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#__aenter__) * [**\_\_aexit\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#__aexit__) * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#__init__) * [**calculate\_max\_event\_charge\_count\_within\_limit](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#calculate_max_event_charge_count_within_limit) * [**calculate\_total\_charged\_amount](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#calculate_total_charged_amount) * [**charge](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#charge) * [**get\_pricing\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#get_pricing_info) ### Properties * [**LOCAL\_CHARGING\_LOG\_DATASET\_NAME](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManagerImplementation.md#LOCAL_CHARGING_LOG_DATASET_NAME) ## Methods<!-- -->[**](#Methods) ### [**](#__aenter__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L131)\_\_aenter\_\_ * **async **\_\_aenter\_\_**(): None - Initialize the charging manager - this is called by the `Actor` class and shouldn't be invoked manually. *** #### Returns None ### [**](#__aexit__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L174)\_\_aexit\_\_ * **async **\_\_aexit\_\_**(exc\_type, exc\_value, exc\_traceback): None - #### Parameters * ##### exc\_type: type\[BaseException] | None * ##### exc\_value: BaseException | None * ##### exc\_traceback: TracebackType | None #### Returns None ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L106)\_\_init\_\_ * ****\_\_init\_\_**(configuration, client): None - #### Parameters * ##### configuration: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) * ##### client: ApifyClientAsync #### Returns None ### [**](#calculate_max_event_charge_count_within_limit)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L283)calculate\_max\_event\_charge\_count\_within\_limit * ****calculate\_max\_event\_charge\_count\_within\_limit**(event\_name): int | None - Overrides [ChargingManager.calculate\_max\_event\_charge\_count\_within\_limit](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#calculate_max_event_charge_count_within_limit) Calculate how many instances of an event can be charged before we reach the configured limit. *** #### Parameters * ##### event\_name: str Name of the inspected event. #### Returns int | None ### [**](#calculate_total_charged_amount)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L276)calculate\_total\_charged\_amount * ****calculate\_total\_charged\_amount**(): Decimal - Overrides [ChargingManager.calculate\_total\_charged\_amount](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#calculate_total_charged_amount) Calculate the total amount of money charged for pay-per-event events so far. 
*** #### Returns Decimal ### [**](#charge)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L186)charge * **async **charge**(event\_name, count): [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) - Overrides [ChargingManager.charge](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#charge) Charge for a specified number of events - sub-operations of the Actor. This is relevant only for the pay-per-event pricing model. *** #### Parameters * ##### event\_name: str Name of the event to be charged for. * ##### optionalcount: int = <!-- -->1 Number of events to charge for. #### Returns [ChargeResult](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargeResult.md) ### [**](#get_pricing_info)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L300)get\_pricing\_info * ****get\_pricing\_info**(): [ActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md) - Overrides [ChargingManager.get\_pricing\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingManager.md#get_pricing_info) Retrieve detailed information about the effective pricing of the current Actor run. This can be used for instance when your code needs to support multiple pricing models in transition periods. *** #### Returns [ActorPricingInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorPricingInfo.md) ## Properties<!-- -->[**](#Properties) ### [**](#LOCAL_CHARGING_LOG_DATASET_NAME)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L104)LOCAL\_CHARGING\_LOG\_DATASET\_NAME **LOCAL\_CHARGING\_LOG\_DATASET\_NAME: Undefined --- # ChargingStateItem<!-- --> ## Index[**](#Index) ### Properties * [**charge\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingStateItem.md#charge_count) * [**total\_charged\_amount](https://docs.apify.com/sdk/python/sdk/python/reference/class/ChargingStateItem.md#total_charged_amount) ## Properties<!-- -->[**](#Properties) ### [**](#charge_count)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L315)charge\_count **charge\_count: int ### [**](#total_charged_amount)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L316)total\_charged\_amount **total\_charged\_amount: Decimal --- # Configuration<!-- --> A class for specifying the configuration of an Actor. Can be used either globally via `Configuration.get_global_configuration()`, or it can be specific to each `Actor` instance on the `actor.config` property. 
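As an illustration of how this configuration is typically consumed, the sketch below reads a few of the properties documented in this section from inside an Actor run; it assumes `Configuration` is importable from the `apify` package and that the run provides the usual `APIFY_*`/`ACTOR_*` environment variables:

```python
import asyncio

from apify import Actor, Configuration


async def main() -> None:
    async with Actor:
        # Configuration bound to this Actor instance.
        config = Actor.config
        Actor.log.info(f'Running on the Apify platform: {config.is_at_home}')
        Actor.log.info(f'Run ID: {config.actor_run_id}, API base URL: {config.api_base_url}')

        # The global instance is kept mainly for backwards compatibility.
        global_config = Configuration.get_global_configuration()
        Actor.log.info(f'Standby port: {global_config.standby_port}')


if __name__ == '__main__':
    asyncio.run(main())
```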
## Index[**](#Index) ### Methods * [**disable\_browser\_sandbox\_on\_platform](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#disable_browser_sandbox_on_platform) * [**get\_global\_configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#get_global_configuration) ### Properties * [**actor\_build\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_build_id) * [**actor\_build\_number](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_build_number) * [**actor\_build\_tags](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_build_tags) * [**actor\_events\_ws\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_events_ws_url) * [**actor\_full\_name](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_full_name) * [**actor\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_id) * [**actor\_run\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_run_id) * [**actor\_task\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#actor_task_id) * [**api\_base\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#api_base_url) * [**api\_public\_base\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#api_public_base_url) * [**dedicated\_cpus](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#dedicated_cpus) * [**disable\_outdated\_warning](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#disable_outdated_warning) * [**fact](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#fact) * [**input\_key](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#input_key) * [**input\_secrets\_private\_key\_file](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#input_secrets_private_key_file) * [**input\_secrets\_private\_key\_passphrase](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#input_secrets_private_key_passphrase) * [**is\_at\_home](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#is_at_home) * [**latest\_sdk\_version](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#latest_sdk_version) * [**log\_format](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#log_format) * [**max\_paid\_dataset\_items](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#max_paid_dataset_items) * [**max\_total\_charge\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#max_total_charge_usd) * [**meta\_origin](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#meta_origin) * [**metamorph\_after\_sleep](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#metamorph_after_sleep) * [**proxy\_hostname](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#proxy_hostname) * [**proxy\_password](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#proxy_password) * [**proxy\_port](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#proxy_port) * 
[**proxy\_status\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#proxy_status_url) * [**standby\_port](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#standby_port) * [**standby\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#standby_url) * [**started\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#started_at) * [**test\_pay\_per\_event](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#test_pay_per_event) * [**timeout\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#timeout_at) * [**token](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#token) * [**user\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#user_id) * [**user\_is\_paying](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#user_is_paying) * [**web\_server\_port](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#web_server_port) * [**web\_server\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#web_server_url) * [**workflow\_key](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md#workflow_key) ## Methods<!-- -->[**](#Methods) ### [**](#disable_browser_sandbox_on_platform)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L378)disable\_browser\_sandbox\_on\_platform * ****disable\_browser\_sandbox\_on\_platform**(): Self - Disable the browser sandbox mode when running on the Apify platform. When `is_at_home` is True, the code already runs inside a container, so the browser sandbox brings no benefit. It can even be undesirable: the containerized process may run as root, which crashes Chromium when it is started with the sandbox enabled. *** #### Returns Self ### [**](#get_global_configuration)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L391)get\_global\_configuration * ****get\_global\_configuration**(): [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) - Retrieve the global instance of the configuration. Kept mostly for backwards compatibility; prefer `service_locator.get_configuration()` instead.
*** #### Returns [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) ## Properties<!-- -->[**](#Properties) ### [**](#actor_build_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L67)actor\_build\_id **actor\_build\_id: str | None ### [**](#actor_build_number)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L78)actor\_build\_number **actor\_build\_number: str | None ### [**](#actor_build_tags)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L89)actor\_build\_tags **actor\_build\_tags: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None ### [**](#actor_events_ws_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L108)actor\_events\_ws\_url **actor\_events\_ws\_url: str | None ### [**](#actor_full_name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L48)actor\_full\_name **actor\_full\_name: str | None ### [**](#actor_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L36)actor\_id **actor\_id: str | None ### [**](#actor_run_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L55)actor\_run\_id **actor\_run\_id: str | None ### [**](#actor_task_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L97)actor\_task\_id **actor\_task\_id: str | None ### [**](#api_base_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L119)api\_base\_url **api\_base\_url: str ### [**](#api_public_base_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L127)api\_public\_base\_url **api\_public\_base\_url: str ### [**](#dedicated_cpus)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L135)dedicated\_cpus **dedicated\_cpus: float | None ### [**](#disable_outdated_warning)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L143)disable\_outdated\_warning **disable\_outdated\_warning: bool ### [**](#fact)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L152)fact **fact: str | None ### [**](#input_key)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L154)input\_key **input\_key: str ### [**](#input_secrets_private_key_file)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L166)input\_secrets\_private\_key\_file **input\_secrets\_private\_key\_file: str | None ### [**](#input_secrets_private_key_passphrase)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L174)input\_secrets\_private\_key\_passphrase **input\_secrets\_private\_key\_passphrase: str | None ### [**](#is_at_home)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L182)is\_at\_home **is\_at\_home: bool ### [**](#latest_sdk_version)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L190)latest\_sdk\_version **latest\_sdk\_version: str | None ### [**](#log_format)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L200)log\_format **log\_format: str | None ### 
[**](#max_paid_dataset_items)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L206)max\_paid\_dataset\_items **max\_paid\_dataset\_items: int | None ### [**](#max_total_charge_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L215)max\_total\_charge\_usd **max\_total\_charge\_usd: Decimal | None ### [**](#meta_origin)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L232)meta\_origin **meta\_origin: str | None ### [**](#metamorph_after_sleep)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L240)metamorph\_after\_sleep **metamorph\_after\_sleep: timedelta\_ms ### [**](#proxy_hostname)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L248)proxy\_hostname **proxy\_hostname: str ### [**](#proxy_password)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L256)proxy\_password **proxy\_password: str | None ### [**](#proxy_port)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L264)proxy\_port **proxy\_port: int ### [**](#proxy_status_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L272)proxy\_status\_url **proxy\_status\_url: str ### [**](#standby_port)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L303)standby\_port **standby\_port: int ### [**](#standby_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L312)standby\_url **standby\_url: str ### [**](#started_at)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L280)started\_at **started\_at: datetime | None ### [**](#test_pay_per_event)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L224)test\_pay\_per\_event **test\_pay\_per\_event: bool ### [**](#timeout_at)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L291)timeout\_at **timeout\_at: datetime | None ### [**](#token)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L321)token **token: str | None ### [**](#user_id)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L329)user\_id **user\_id: str | None ### [**](#user_is_paying)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L337)user\_is\_paying **user\_is\_paying: bool ### [**](#web_server_port)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L346)web\_server\_port **web\_server\_port: int ### [**](#web_server_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L358)web\_server\_url **web\_server\_url: str ### [**](#workflow_key)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_configuration.py#L369)workflow\_key **workflow\_key: str | None --- # Dataset<!-- --> Dataset is a storage for managing structured tabular data. The dataset class provides a high-level interface for storing and retrieving structured data with consistent schema, similar to database tables or spreadsheets. It abstracts the underlying storage implementation details, offering a consistent API regardless of where the data is physically stored. Dataset operates in an append-only mode, allowing new records to be added but not modified or deleted after creation. 
This makes it particularly suitable for storing crawling results and other data that should be immutable once collected. The class provides methods for adding data, retrieving data with various filtering options, and exporting data to different formats. You can create a dataset using the `open` class method, specifying either a name or ID. The underlying storage implementation is determined by the configured storage client. ### Usage ``` from crawlee.storages import Dataset # Open a dataset dataset = await Dataset.open(name='my_dataset') # Add data await dataset.push_data({'title': 'Example Product', 'price': 99.99}) # Retrieve filtered data results = await dataset.get_data(limit=10, desc=True) # Export data await dataset.export_to('results.json', content_type='json') ``` ### Hierarchy * [Storage](https://crawlee.dev/python/api/class/Storage) * *Dataset* ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#__init__) * [**drop](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#drop) * [**export\_to](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#export_to) * [**get\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#get_data) * [**get\_metadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#get_metadata) * [**iterate\_items](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#iterate_items) * [**list\_items](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#list_items) * [**open](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#open) * [**purge](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#purge) * [**push\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#push_data) ### Properties * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#id) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/Dataset.md#name) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L68)\_\_init\_\_ * ****\_\_init\_\_**(client, id, name): None - Initialize a new instance. Preferably use the `Dataset.open` constructor to create a new instance. *** #### Parameters * ##### client: [DatasetClient](https://crawlee.dev/python/api/class/DatasetClient) An instance of a storage client. * ##### id: str The unique identifier of the storage. * ##### name: str | None The name of the storage, if available. #### Returns None ### [**](#drop)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L118)drop * **async **drop**(): None - Overrides [Storage.drop](https://crawlee.dev/python/api/class/Storage#drop) Drop the storage, removing it from the underlying storage client and clearing the cache. *** #### Returns None ### [**](#export_to)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L313)export\_to * **async **export\_to**(key: str, content\_type? <!-- --> : Literal\[json, csv], to\_kvs\_id? <!-- --> : str | None, to\_kvs\_name? <!-- --> : str | None, to\_kvs\_storage\_client? <!-- --> : [StorageClient](https://crawlee.dev/python/api/class/StorageClient) | None, to\_kvs\_configuration? 
<!-- --> : [Configuration](https://crawlee.dev/python/api/class/Configuration) | None, kwargs: Any): None * **async **export\_to**(key: str, content\_type: Literal\[json], to\_kvs\_id? <!-- --> : str | None, to\_kvs\_name? <!-- --> : str | None, to\_kvs\_storage\_client? <!-- --> : [StorageClient](https://crawlee.dev/python/api/class/StorageClient) | None, to\_kvs\_configuration? <!-- --> : [Configuration](https://crawlee.dev/python/api/class/Configuration) | None, \*: , skipkeys: NotRequired\[bool], ensure\_ascii: NotRequired\[bool], check\_circular: NotRequired\[bool], allow\_nan: NotRequired\[bool], cls: NotRequired\[[type](https://crawlee.dev/python/api/class/SitemapSource#type)\[json.JSONEncoder]], indent: NotRequired\[int], separators: NotRequired\[tuple\[str, str]], default: NotRequired\[Callable], sort\_keys: NotRequired\[bool]): None * **async **export\_to**(key: str, content\_type: Literal\[csv], to\_kvs\_id? <!-- --> : str | None, to\_kvs\_name? <!-- --> : str | None, to\_kvs\_storage\_client? <!-- --> : [StorageClient](https://crawlee.dev/python/api/class/StorageClient) | None, to\_kvs\_configuration? <!-- --> : [Configuration](https://crawlee.dev/python/api/class/Configuration) | None, \*: , dialect: NotRequired\[str], delimiter: NotRequired\[str], doublequote: NotRequired\[bool], escapechar: NotRequired\[str], lineterminator: NotRequired\[str], quotechar: NotRequired\[str], quoting: NotRequired\[int], skipinitialspace: NotRequired\[bool], strict: NotRequired\[bool]): None - Export the entire dataset into a specified file stored under a key in a key-value store. This method consolidates all entries from a specified dataset into one file, which is then saved under a given key in a key-value store. The format of the exported file is determined by the `content_type` parameter. Either the dataset's ID or name should be specified, and similarly, either the target key-value store's ID or name should be used. *** #### Parameters * ##### key: str The key under which to save the data in the key-value store. * ##### optionalcontent\_type: Literal\[json, csv] = <!-- -->'json' The format in which to export the data. * ##### optionalto\_kvs\_id: str | None = <!-- -->None ID of the key-value store to save the exported file. Specify only one of ID or name. * ##### optionalto\_kvs\_name: str | None = <!-- -->None Name of the key-value store to save the exported file. Specify only one of ID or name. * ##### optionalto\_kvs\_storage\_client: [StorageClient](https://crawlee.dev/python/api/class/StorageClient) | None = <!-- -->None Storage client to use for the key-value store. * ##### optionalto\_kvs\_configuration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None Configuration for the key-value store. * ##### kwargs: Any Additional parameters for the export operation, specific to the chosen content type. #### Returns None ### [**](#get_data)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L140)get\_data * **async **get\_data**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden, flatten, view): [DatasetItemsListPage](https://crawlee.dev/python/api/class/DatasetItemsListPage) - Retrieve a paginated list of items from a dataset based on various filtering parameters. This method provides the flexibility to filter, sort, and modify the appearance of dataset items when listed. Each parameter modifies the result set according to its purpose. 
The method also supports pagination through 'offset' and 'limit' parameters. *** #### Parameters * ##### optionalkeyword-onlyoffset: int = <!-- -->0 Skips the specified number of items at the start. * ##### optionalkeyword-onlylimit: int | None = <!-- -->999\_999\_999\_999 The maximum number of items to retrieve. Unlimited if None. * ##### optionalkeyword-onlyclean: bool = <!-- -->False Return only non-empty items and excludes hidden fields. Shortcut for skip\_hidden and skip\_empty. * ##### optionalkeyword-onlydesc: bool = <!-- -->False Set to True to sort results in descending order. * ##### optionalkeyword-onlyfields: list\[str] | None = <!-- -->None Fields to include in each item. Sorts fields as specified if provided. * ##### optionalkeyword-onlyomit: list\[str] | None = <!-- -->None Fields to exclude from each item. * ##### optionalkeyword-onlyunwind: str | None = <!-- -->None Unwinds items by a specified array field, turning each element into a separate item. * ##### optionalkeyword-onlyskip\_empty: bool = <!-- -->False Excludes empty items from the results if True. * ##### optionalkeyword-onlyskip\_hidden: bool = <!-- -->False Excludes fields starting with '#' if True. * ##### optionalkeyword-onlyflatten: list\[str] | None = <!-- -->None Fields to be flattened in returned items. * ##### optionalkeyword-onlyview: str | None = <!-- -->None Specifies the dataset view to be used. #### Returns [DatasetItemsListPage](https://crawlee.dev/python/api/class/DatasetItemsListPage) ### [**](#get_metadata)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L93)get\_metadata * **async **get\_metadata**(): ([DatasetMetadata](https://crawlee.dev/python/api/class/DatasetMetadata) | [KeyValueStoreMetadata](https://crawlee.dev/python/api/class/KeyValueStoreMetadata)) | [RequestQueueMetadata](https://crawlee.dev/python/api/class/RequestQueueMetadata) - Overrides [Storage.get\_metadata](https://crawlee.dev/python/api/class/Storage#get_metadata) Get the storage metadata. *** #### Returns ([DatasetMetadata](https://crawlee.dev/python/api/class/DatasetMetadata) | [KeyValueStoreMetadata](https://crawlee.dev/python/api/class/KeyValueStoreMetadata)) | [RequestQueueMetadata](https://crawlee.dev/python/api/class/RequestQueueMetadata) ### [**](#iterate_items)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L191)iterate\_items * **async **iterate\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden): AsyncIterator\[dict\[str, Any]] - Iterate over items in the dataset according to specified filters and sorting. This method allows for asynchronously iterating through dataset items while applying various filters such as skipping empty items, hiding specific fields, and sorting. It supports pagination via `offset` and `limit` parameters, and can modify the appearance of dataset items using `fields`, `omit`, `unwind`, `skip_empty`, and `skip_hidden` parameters. *** #### Parameters * ##### optionalkeyword-onlyoffset: int = <!-- -->0 Skips the specified number of items at the start. * ##### optionalkeyword-onlylimit: int | None = <!-- -->999\_999\_999\_999 The maximum number of items to retrieve. Unlimited if None. * ##### optionalkeyword-onlyclean: bool = <!-- -->False Return only non-empty items and excludes hidden fields. Shortcut for skip\_hidden and skip\_empty. * ##### optionalkeyword-onlydesc: bool = <!-- -->False Set to True to sort results in descending order. 
* ##### optionalkeyword-onlyfields: list\[str] | None = <!-- -->None Fields to include in each item. Sorts fields as specified if provided. * ##### optionalkeyword-onlyomit: list\[str] | None = <!-- -->None Fields to exclude from each item. * ##### optionalkeyword-onlyunwind: str | None = <!-- -->None Unwinds items by a specified array field, turning each element into a separate item. * ##### optionalkeyword-onlyskip\_empty: bool = <!-- -->False Excludes empty items from the results if True. * ##### optionalkeyword-onlyskip\_hidden: bool = <!-- -->False Excludes fields starting with '#' if True. #### Returns AsyncIterator\[dict\[str, Any]] ### [**](#list_items)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L239)list\_items * **async **list\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden): list\[dict\[str, Any]] - Retrieve a list of all items from the dataset according to specified filters and sorting. This method collects all dataset items into a list while applying various filters such as skipping empty items, hiding specific fields, and sorting. It supports pagination via `offset` and `limit` parameters, and can modify the appearance of dataset items using `fields`, `omit`, `unwind`, `skip_empty`, and `skip_hidden` parameters. *** #### Parameters * ##### optionalkeyword-onlyoffset: int = <!-- -->0 Skips the specified number of items at the start. * ##### optionalkeyword-onlylimit: int | None = <!-- -->999\_999\_999\_999 The maximum number of items to retrieve. Unlimited if None. * ##### optionalkeyword-onlyclean: bool = <!-- -->False Return only non-empty items and excludes hidden fields. Shortcut for skip\_hidden and skip\_empty. * ##### optionalkeyword-onlydesc: bool = <!-- -->False Set to True to sort results in descending order. * ##### optionalkeyword-onlyfields: list\[str] | None = <!-- -->None Fields to include in each item. Sorts fields as specified if provided. * ##### optionalkeyword-onlyomit: list\[str] | None = <!-- -->None Fields to exclude from each item. * ##### optionalkeyword-onlyunwind: str | None = <!-- -->None Unwinds items by a specified array field, turning each element into a separate item. * ##### optionalkeyword-onlyskip\_empty: bool = <!-- -->False Excludes empty items from the results if True. * ##### optionalkeyword-onlyskip\_hidden: bool = <!-- -->False Excludes fields starting with '#' if True. #### Returns list\[dict\[str, Any]] ### [**](#open)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L98)open * **async **open**(\*, id, name, configuration, storage\_client): [Storage](https://crawlee.dev/python/api/class/Storage) - Overrides [Storage.open](https://crawlee.dev/python/api/class/Storage#open) Open a storage, either restore existing or create a new one. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None The storage ID. * ##### optionalkeyword-onlyname: str | None = <!-- -->None The storage name. * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None Configuration object used during the storage creation or restoration process. * ##### optionalkeyword-onlystorage\_client: [StorageClient](https://crawlee.dev/python/api/class/StorageClient) | None = <!-- -->None Underlying storage client to use. If not provided, the default global storage client from the service locator will be used. 
#### Returns [Storage](https://crawlee.dev/python/api/class/Storage) ### [**](#purge)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L124)purge * **async **purge**(): None - Overrides [Storage.purge](https://crawlee.dev/python/api/class/Storage#purge) Purge the storage, removing all items from the underlying storage client. This method does not remove the storage itself, i.e. it keeps the metadata, but clears all items within it. *** #### Returns None ### [**](#push_data)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L127)push\_data * **async **push\_data**(data): None - Store an object or an array of objects to the dataset. The size of the data is limited by the receiving API and therefore `push_data()` will only allow objects whose JSON representation is smaller than 9MB. When an array is passed, none of the included objects may be larger than 9MB, but the array itself may be of any size. *** #### Parameters * ##### data: list\[dict\[str, Any]] | dict\[str, Any] A JSON serializable data structure to be stored in the dataset. The JSON representation of each item must be smaller than 9MB. #### Returns None ## Properties<!-- -->[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L84)id **id: str Overrides [Storage.id](https://crawlee.dev/python/api/class/Storage#id) Get the storage ID. ### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_dataset.py#L89)name **name: str | None Overrides [Storage.name](https://crawlee.dev/python/api/class/Storage#name) Get the storage name. --- # DatasetClient<!-- --> Dataset resource client implementation based on the Apify platform storage.
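You will rarely call this client directly; when an Actor runs on the Apify platform, datasets opened through the SDK are backed by it automatically. A small sketch under that assumption (the dataset name `products` is hypothetical):

```python
import asyncio

from apify import Actor


async def main() -> None:
    async with Actor:
        # On the platform this dataset is backed by the Apify API through DatasetClient;
        # when running locally it falls back to the file-system storage client.
        dataset = await Actor.open_dataset(name='products')
        await dataset.push_data({'sku': 'A-1', 'price': 12.5})

        async for item in dataset.iterate_items():
            Actor.log.info(f'Stored item: {item}')


if __name__ == '__main__':
    asyncio.run(main())
```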
## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#__init__) * [**delete](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#delete) * [**get](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#get) * [**get\_items\_as\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#get_items_as_bytes) * [**iterate\_items](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#iterate_items) * [**list\_items](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#list_items) * [**push\_items](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#push_items) * [**stream\_items](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#stream_items) * [**update](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetClient.md#update) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L23)\_\_init\_\_ * ****\_\_init\_\_**(apify\_dataset\_client): None - #### Parameters * ##### apify\_dataset\_client: DatasetClientAsync #### Returns None ### [**](#delete)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L44)delete * **async **delete**(): None - #### Returns None ### [**](#get)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L27)get * **async **get**(): DatasetMetadata | None - #### Returns DatasetMetadata | None ### [**](#get_items_as_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L109)get\_items\_as\_bytes * **async **get\_items\_as\_bytes**(\*, item\_format, offset, limit, desc, clean, bom, delimiter, fields, omit, unwind, skip\_empty, skip\_header\_row, skip\_hidden, xml\_root, xml\_row, flatten): bytes - #### Parameters * ##### optionalkeyword-onlyitem\_format: str = <!-- -->'json' * ##### optionalkeyword-onlyoffset: int | None = <!-- -->None * ##### optionalkeyword-onlylimit: int | None = <!-- -->None * ##### optionalkeyword-onlydesc: bool = <!-- -->False * ##### optionalkeyword-onlyclean: bool = <!-- -->False * ##### optionalkeyword-onlybom: bool = <!-- -->False * ##### optionalkeyword-onlydelimiter: str | None = <!-- -->None * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyunwind: str | None = <!-- -->None * ##### optionalkeyword-onlyskip\_empty: bool = <!-- -->False * ##### optionalkeyword-onlyskip\_header\_row: bool = <!-- -->False * ##### optionalkeyword-onlyskip\_hidden: bool = <!-- -->False * ##### optionalkeyword-onlyxml\_root: str | None = <!-- -->None * ##### optionalkeyword-onlyxml\_row: str | None = <!-- -->None * ##### optionalkeyword-onlyflatten: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None #### Returns bytes ### 
[**](#iterate_items)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L82)iterate\_items * **async **iterate\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden): AsyncIterator\[dict] - #### Parameters * ##### optionalkeyword-onlyoffset: int = <!-- -->0 * ##### optionalkeyword-onlylimit: int | None = <!-- -->None * ##### optionalkeyword-onlyclean: bool = <!-- -->False * ##### optionalkeyword-onlydesc: bool = <!-- -->False * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyunwind: str | None = <!-- -->None * ##### optionalkeyword-onlyskip\_empty: bool = <!-- -->False * ##### optionalkeyword-onlyskip\_hidden: bool = <!-- -->False #### Returns AsyncIterator\[dict] ### [**](#list_items)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L48)list\_items * **async **list\_items**(\*, offset, limit, clean, desc, fields, omit, unwind, skip\_empty, skip\_hidden, flatten, view): DatasetItemsListPage - #### Parameters * ##### optionalkeyword-onlyoffset: int | None = <!-- -->0 * ##### optionalkeyword-onlylimit: int | None = <!-- -->BaseDatasetClient.\_LIST\_ITEMS\_LIMIT * ##### optionalkeyword-onlyclean: bool = <!-- -->False * ##### optionalkeyword-onlydesc: bool = <!-- -->False * ##### optionalkeyword-onlyfields: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyunwind: str | None = <!-- -->None * ##### optionalkeyword-onlyskip\_empty: bool = <!-- -->False * ##### optionalkeyword-onlyskip\_hidden: bool = <!-- -->False * ##### optionalkeyword-onlyflatten: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyview: str | None = <!-- -->None #### Returns DatasetItemsListPage ### [**](#push_items)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L187)push\_items * **async **push\_items**(items): None - #### Parameters * ##### items: JsonSerializable #### Returns None ### [**](#stream_items)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L149)stream\_items * **async **stream\_items**(\*, item\_format, offset, limit, desc, clean, bom, delimiter, fields, omit, unwind, skip\_empty, skip\_header\_row, skip\_hidden, xml\_root, xml\_row): AbstractAsyncContextManager\[Response | None] - #### Parameters * ##### optionalkeyword-onlyitem\_format: str = <!-- -->'json' * ##### optionalkeyword-onlyoffset: int | None = <!-- -->None * ##### optionalkeyword-onlylimit: int | None = <!-- -->None * ##### optionalkeyword-onlydesc: bool = <!-- -->False * ##### optionalkeyword-onlyclean: bool = <!-- -->False * ##### optionalkeyword-onlybom: bool = <!-- -->False * ##### optionalkeyword-onlydelimiter: str | None = <!-- -->None * ##### optionalkeyword-onlyfields: 
[list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyomit: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None * ##### optionalkeyword-onlyunwind: str | None = <!-- -->None * ##### optionalkeyword-onlyskip\_empty: bool = <!-- -->False * ##### optionalkeyword-onlyskip\_header\_row: bool = <!-- -->False * ##### optionalkeyword-onlyskip\_hidden: bool = <!-- -->False * ##### optionalkeyword-onlyxml\_root: str | None = <!-- -->None * ##### optionalkeyword-onlyxml\_row: str | None = <!-- -->None #### Returns AbstractAsyncContextManager\[Response | None] ### [**](#update)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_client.py#L32)update * **async **update**(\*, name): DatasetMetadata - #### Parameters * ##### optionalkeyword-onlyname: str | None = <!-- -->None #### Returns DatasetMetadata --- # DatasetCollectionClient<!-- --> Dataset collection resource client implementation based on the Apify platform storage. ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#__init__) * [**get\_or\_create](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#get_or_create) * [**list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_collection_client.py#L17)\_\_init\_\_ * ****\_\_init\_\_**(apify\_dataset\_collection\_client): None - #### Parameters * ##### apify\_dataset\_collection\_client: DatasetCollectionClientAsync #### Returns None ### [**](#get_or_create)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_collection_client.py#L21)get\_or\_create * **async **get\_or\_create**(\*, id, name, schema): DatasetMetadata - #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyschema: dict | None = <!-- -->None #### Returns DatasetMetadata ### [**](#list)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_dataset_collection_client.py#L36)list * **async **list**(\*, unnamed, limit, offset, desc): DatasetListPage - #### Parameters * ##### optionalkeyword-onlyunnamed: bool = <!-- -->False * ##### optionalkeyword-onlylimit: int | None = <!-- -->None * ##### optionalkeyword-onlyoffset: int | None = <!-- -->None * ##### optionalkeyword-onlydesc: bool = <!-- -->False #### Returns DatasetListPage --- # DatasetItemsListPage<!-- --> Model for a single page of dataset items returned from a collection list method. 
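To show how this model is typically consumed, the sketch below uses the page returned by `Dataset.get_data()` for manual pagination via the `items`, `count`, and `total` properties described below; in practice `Dataset.iterate_items()` handles this for you:

```python
import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    dataset = await Dataset.open(name='my_dataset')

    offset = 0
    while True:
        # Each call returns a DatasetItemsListPage for the requested window.
        page = await dataset.get_data(offset=offset, limit=100)
        for item in page.items:
            print(item)

        offset += page.count
        if offset >= page.total:
            break


if __name__ == '__main__':
    asyncio.run(main())
```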
## Index[**](#Index) ### Properties * [**count](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md#count) * [**desc](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md#desc) * [**items](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md#items) * [**limit](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md#limit) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md#model_config) * [**offset](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md#offset) * [**total](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetItemsListPage.md#total) ## Properties<!-- -->[**](#Properties) ### [**](#count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L115)count **count: int The number of objects returned on this page. ### [**](#desc)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L127)desc **desc: bool Indicates if the returned list is in descending order. ### [**](#items)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L130)items **items: list\[dict] The list of dataset items returned on this page. ### [**](#limit)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L121)limit **limit: int The maximum number of objects to return, as specified in the API call. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L113)model\_config **model\_config: Undefined ### [**](#offset)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L118)offset **offset: int The starting position of the first object returned, as specified in the API call. ### [**](#total)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L124)total **total: int The total number of objects that match the criteria of the API call. --- # DatasetMetadata<!-- --> Model for a dataset metadata. ### Hierarchy * [StorageMetadata](https://crawlee.dev/python/api/class/StorageMetadata) * *DatasetMetadata* ## Index[**](#Index) ### Properties * [**accessed\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md#accessed_at) * [**created\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md#created_at) * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md#id) * [**item\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md#item_count) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md#model_config) * [**modified\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md#modified_at) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetMetadata.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#accessed_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L31)accessed\_at **accessed\_at: datetime Inherited from [StorageMetadata.accessed\_at](https://crawlee.dev/python/api/class/StorageMetadata#accessed_at) The timestamp when the storage was last accessed. 
### [**](#created_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L34)created\_at **created\_at: datetime Inherited from [StorageMetadata.created\_at](https://crawlee.dev/python/api/class/StorageMetadata#created_at) The timestamp when the storage was created. ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L25)id **id: str Inherited from [StorageMetadata.id](https://crawlee.dev/python/api/class/StorageMetadata#id) The unique identifier of the storage. ### [**](#item_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L47)item\_count **item\_count: int The number of items in the dataset. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L45)model\_config **model\_config: Undefined Overrides [StorageMetadata.model\_config](https://crawlee.dev/python/api/class/StorageMetadata#model_config) ### [**](#modified_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L37)modified\_at **modified\_at: datetime Inherited from [StorageMetadata.modified\_at](https://crawlee.dev/python/api/class/StorageMetadata#modified_at) The timestamp when the storage was last modified. ### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L28)name **name: str | None Inherited from [StorageMetadata.name](https://crawlee.dev/python/api/class/StorageMetadata#name) The name of the storage. --- # DeprecatedEvent<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/DeprecatedEvent.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/DeprecatedEvent.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L106)data **data: dict\[str, Any] ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L105)name **name: Literal\[cpuInfo] --- # EventAbortingData<!-- --> Data for the aborting event. ## Index[**](#Index) ### Properties * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventAbortingData.md#model_config) ## Properties<!-- -->[**](#Properties) ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L76)model\_config **model\_config: Undefined --- # EventExitData<!-- --> Data for the exit event. ## Index[**](#Index) ### Properties * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventExitData.md#model_config) ## Properties<!-- -->[**](#Properties) ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L83)model\_config **model\_config: Undefined --- # EventManager<!-- --> Manage events and their listeners, enabling registration, emission, and execution control. It allows for registering event listeners, emitting events, and ensuring all listeners complete their execution. Built on top of `pyee.asyncio.AsyncIOEventEmitter`. It implements additional features such as waiting for all listeners to complete and emitting `PersistState` events at regular intervals. 
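Inside the SDK you normally do not interact with the event manager directly; `Actor.on()` and `Actor.off()` register listeners on it for platform events. A minimal sketch of reacting to `PersistState` events, assuming the `Event` enum is exported by the `apify` package:

```python
import asyncio

from apify import Actor, Event


async def main() -> None:
    async with Actor:
        state = {'processed': 0}

        async def persist_state(event_data) -> None:
            # Called on every PersistState event emitted by the event manager.
            await Actor.set_value('STATE', state)

        Actor.on(Event.PERSIST_STATE, persist_state)

        for _ in range(5):
            state['processed'] += 1
            await asyncio.sleep(1)

        Actor.off(Event.PERSIST_STATE, persist_state)


if __name__ == '__main__':
    asyncio.run(main())
```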
### Hierarchy * *EventManager* * [LocalEventManager](https://crawlee.dev/python/api/class/LocalEventManager) ## Index[**](#Index) ### Methods * [**\_\_aenter\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventManager.md#__aenter__) * [**\_\_aexit\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventManager.md#__aexit__) * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventManager.md#__init__) * [**emit](https://crawlee.dev/python/api/class/EventManager#emit) * [**off](https://crawlee.dev/python/api/class/EventManager#off) * [**on](https://crawlee.dev/python/api/class/EventManager#on) * [**wait\_for\_all\_listeners\_to\_complete](https://crawlee.dev/python/api/class/EventManager#wait_for_all_listeners_to_complete) ### Properties * [**active](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventManager.md#active) ## Methods<!-- -->[**](#Methods) ### [**](#__aenter__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L104)\_\_aenter\_\_ * **async **\_\_aenter\_\_**(): [EventManager](https://crawlee.dev/python/api/class/EventManager) - Initialize the event manager upon entering the async context. *** #### Returns [EventManager](https://crawlee.dev/python/api/class/EventManager) ### [**](#__aexit__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L117)\_\_aexit\_\_ * **async **\_\_aexit\_\_**(exc\_type, exc\_value, exc\_traceback): None - Close the local event manager upon exiting the async context. This will stop listening for the events, and it will wait for all the event listeners to finish. *** #### Parameters * ##### exc\_type: [type](https://crawlee.dev/python/api/class/SitemapSource#type)\[BaseException] | None * ##### exc\_value: BaseException | None * ##### exc\_traceback: TracebackType | None #### Returns None ### [**](#__init__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L63)\_\_init\_\_ * ****\_\_init\_\_**(\*, persist\_state\_interval, close\_timeout): None - Initialize a new instance. *** #### Parameters * ##### optionalkeyword-onlypersist\_state\_interval: timedelta = <!-- -->timedelta(minutes=1) Interval between emitted `PersistState` events to maintain state persistence. * ##### optionalkeyword-onlyclose\_timeout: timedelta | None = <!-- -->None Optional timeout for canceling pending event listeners if they exceed this duration. #### Returns None The `emit`, `on`, `off`, and `wait_for_all_listeners_to_complete` methods are documented in the Crawlee [EventManager](https://crawlee.dev/python/api/class/EventManager) API reference. ## Properties<!-- -->[**](#Properties) ### [**](#active)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L100)active **active: bool Indicate whether the context is active. --- # EventMigratingData<!-- --> Data for the migrating event.
## Index[**](#Index) ### Properties * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventMigratingData.md#model_config) * [**time\_remaining](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventMigratingData.md#time_remaining) ## Properties<!-- -->[**](#Properties) ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L65)model\_config **model\_config: Undefined ### [**](#time_remaining)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L69)time\_remaining **time\_remaining: [timedelta\_secs](https://crawlee.dev/python/api#timedelta_secs) | None --- # EventPersistStateData<!-- --> Data for the persist state event. ## Index[**](#Index) ### Properties * [**is\_migrating](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventPersistStateData.md#is_migrating) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventPersistStateData.md#model_config) ## Properties<!-- -->[**](#Properties) ### [**](#is_migrating)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L45)is\_migrating **is\_migrating: bool ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L43)model\_config **model\_config: Undefined --- # EventSystemInfoData<!-- --> Data for the system info event. ## Index[**](#Index) ### Properties * [**cpu\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventSystemInfoData.md#cpu_info) * [**memory\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventSystemInfoData.md#memory_info) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventSystemInfoData.md#model_config) ## Properties<!-- -->[**](#Properties) ### [**](#cpu_info)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L54)cpu\_info **cpu\_info: [CpuInfo](https://crawlee.dev/python/api/class/CpuInfo) ### [**](#memory_info)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L55)memory\_info **memory\_info: [MemoryUsageInfo](https://crawlee.dev/python/api/class/MemoryUsageInfo) ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L52)model\_config **model\_config: Undefined --- # EventWithoutData<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventWithoutData.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/EventWithoutData.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L100)data **data: Any ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L92)name **name: Literal\[Event.SESSION\_RETIRED, Event.BROWSER\_LAUNCHED, Event.BROWSER\_RETIRED, Event.BROWSER\_CLOSED, Event.PAGE\_CREATED, Event.PAGE\_CLOSED] --- # ExitEvent<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/ExitEvent.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/ExitEvent.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L87)data **data: 
EventExitData ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L86)name **name: Literal\[Event.EXIT] --- # FileSystemStorageClient<!-- --> File system implementation of the storage client. This storage client provides access to datasets, key-value stores, and request queues that persist data to the local file system. Each storage type is implemented with its own specific file system client that stores data in a structured directory hierarchy. Data is stored in JSON format in predictable file paths, making it easy to inspect and manipulate the stored data outside of the Crawlee application if needed. All data persists between program runs but is limited to access from the local machine where the files are stored. Warning: This storage client is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time. ### Hierarchy * [StorageClient](https://crawlee.dev/python/api/class/StorageClient) * *FileSystemStorageClient* ## Index[**](#Index) ### Methods * [**create\_dataset\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/FileSystemStorageClient.md#create_dataset_client) * [**create\_kvs\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/FileSystemStorageClient.md#create_kvs_client) * [**create\_rq\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/FileSystemStorageClient.md#create_rq_client) * [**get\_rate\_limit\_errors](https://docs.apify.com/sdk/python/sdk/python/reference/class/FileSystemStorageClient.md#get_rate_limit_errors) ## Methods<!-- -->[**](#Methods) ### [**](#create_dataset_client)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_file_system/_storage_client.py#L33)create\_dataset\_client * **async **create\_dataset\_client**(\*, id, name, configuration): [DatasetClient](https://crawlee.dev/python/api/class/DatasetClient) - Overrides [StorageClient.create\_dataset\_client](https://crawlee.dev/python/api/class/StorageClient#create_dataset_client) Create a dataset client. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None #### Returns [DatasetClient](https://crawlee.dev/python/api/class/DatasetClient) ### [**](#create_kvs_client)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_file_system/_storage_client.py#L46)create\_kvs\_client * **async **create\_kvs\_client**(\*, id, name, configuration): [KeyValueStoreClient](https://crawlee.dev/python/api/class/KeyValueStoreClient) - Overrides [StorageClient.create\_kvs\_client](https://crawlee.dev/python/api/class/StorageClient#create_kvs_client) Create a key-value store client. 
*** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None #### Returns [KeyValueStoreClient](https://crawlee.dev/python/api/class/KeyValueStoreClient) ### [**](#create_rq_client)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_file_system/_storage_client.py#L59)create\_rq\_client * **async **create\_rq\_client**(\*, id, name, configuration): [RequestQueueClient](https://crawlee.dev/python/api/class/RequestQueueClient) - Overrides [StorageClient.create\_rq\_client](https://crawlee.dev/python/api/class/StorageClient#create_rq_client) Create a request queue client. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None #### Returns [RequestQueueClient](https://crawlee.dev/python/api/class/RequestQueueClient) ### [**](#get_rate_limit_errors)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_base/_storage_client.py#L61)get\_rate\_limit\_errors * ****get\_rate\_limit\_errors**(): dict\[int, int] - Inherited from [StorageClient.get\_rate\_limit\_errors](https://crawlee.dev/python/api/class/StorageClient#get_rate_limit_errors) Return statistics about rate limit errors encountered by the HTTP client in storage client. *** #### Returns dict\[int, int] --- # FlatPricePerMonthActorPricingInfo<!-- --> ## Index[**](#Index) ### Properties * [**price\_per\_unit\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/FlatPricePerMonthActorPricingInfo.md#price_per_unit_usd) * [**pricing\_model](https://docs.apify.com/sdk/python/sdk/python/reference/class/FlatPricePerMonthActorPricingInfo.md#pricing_model) * [**trial\_minutes](https://docs.apify.com/sdk/python/sdk/python/reference/class/FlatPricePerMonthActorPricingInfo.md#trial_minutes) ## Properties<!-- -->[**](#Properties) ### [**](#price_per_unit_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L144)price\_per\_unit\_usd **price\_per\_unit\_usd: Decimal ### [**](#pricing_model)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L142)pricing\_model **pricing\_model: Literal\[FLAT\_PRICE\_PER\_MONTH] ### [**](#trial_minutes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L143)trial\_minutes **trial\_minutes: int | None --- # FreeActorPricingInfo<!-- --> ## Index[**](#Index) ### Properties * [**pricing\_model](https://docs.apify.com/sdk/python/sdk/python/reference/class/FreeActorPricingInfo.md#pricing_model) ## Properties<!-- -->[**](#Properties) ### [**](#pricing_model)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L138)pricing\_model **pricing\_model: Literal\[FREE] --- # KeyValueStore<!-- --> Key-value store is a storage for reading and writing data records with unique key identifiers. The key-value store class acts as a high-level interface for storing, retrieving, and managing data records identified by unique string keys. It abstracts away the underlying storage implementation details, allowing you to work with the same API regardless of whether data is stored in memory, on disk, or in the cloud. 
Each data record is associated with a specific MIME content type, allowing storage of various data formats such as JSON, text, images, HTML snapshots or any binary data. This class is commonly used to store inputs, outputs, and other artifacts of crawler operations. You can instantiate a key-value store using the `open` class method, which will create a store with the specified name or id. The underlying storage implementation is determined by the configured storage client.

### Usage

```
from crawlee.storages import KeyValueStore

# Open a named key-value store
kvs = await KeyValueStore.open(name='my-store')

# Store and retrieve data under the same key
await kvs.set_value('product-1234.json', [{'name': 'Smartphone', 'price': 799.99}])
product = await kvs.get_value('product-1234.json')
```

### Hierarchy * [Storage](https://crawlee.dev/python/api/class/Storage) * *KeyValueStore* ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#__init__) * [**delete\_value](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#delete_value) * [**drop](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#drop) * [**get\_auto\_saved\_value](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#get_auto_saved_value) * [**get\_metadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#get_metadata) * [**get\_public\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#get_public_url) * [**get\_value](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#get_value) * [**iterate\_keys](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#iterate_keys) * [**list\_keys](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#list_keys) * [**open](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#open) * [**persist\_autosaved\_values](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#persist_autosaved_values) * [**purge](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#purge) * [**record\_exists](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#record_exists) * [**set\_value](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#set_value) ### Properties * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#id) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStore.md#name) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L77)\_\_init\_\_ * ****\_\_init\_\_**(client, id, name): None - Initialize a new instance. Preferably use the `KeyValueStore.open` constructor to create a new instance. *** #### Parameters * ##### client: [KeyValueStoreClient](https://crawlee.dev/python/api/class/KeyValueStoreClient) An instance of a storage client. * ##### id: str The unique identifier of the storage. * ##### name: str | None The name of the storage, if available. #### Returns None ### [**](#delete_value)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L178)delete\_value * **async **delete\_value**(key): None - Delete a value from the KVS. *** #### Parameters * ##### key: str Key of the record to delete.
#### Returns None ### [**](#drop)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L130)drop * **async **drop**(): None - Overrides [Storage.drop](https://crawlee.dev/python/api/class/Storage#drop) Drop the storage, removing it from the underlying storage client and clearing the cache. *** #### Returns None ### [**](#get_auto_saved_value)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L252)get\_auto\_saved\_value * **async **get\_auto\_saved\_value**(key, default\_value): dict\[str, JsonSerializable] - Get a value from KVS that will be automatically saved on changes. *** #### Parameters * ##### key: str Key of the record, to store the value. * ##### optionaldefault\_value: dict\[str, JsonSerializable] | None = <!-- -->None Value to be used if the record does not exist yet. Should be a dictionary. #### Returns dict\[str, JsonSerializable] ### [**](#get_metadata)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L105)get\_metadata * **async **get\_metadata**(): ([DatasetMetadata](https://crawlee.dev/python/api/class/DatasetMetadata) | [KeyValueStoreMetadata](https://crawlee.dev/python/api/class/KeyValueStoreMetadata)) | [RequestQueueMetadata](https://crawlee.dev/python/api/class/RequestQueueMetadata) - Overrides [Storage.get\_metadata](https://crawlee.dev/python/api/class/Storage#get_metadata) Get the storage metadata. *** #### Returns ([DatasetMetadata](https://crawlee.dev/python/api/class/DatasetMetadata) | [KeyValueStoreMetadata](https://crawlee.dev/python/api/class/KeyValueStoreMetadata)) | [RequestQueueMetadata](https://crawlee.dev/python/api/class/RequestQueueMetadata) ### [**](#get_public_url)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L241)get\_public\_url * **async **get\_public\_url**(key): str - Get the public URL for the given key. *** #### Parameters * ##### key: str Key of the record for which URL is required. #### Returns str ### [**](#get_value)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L150)get\_value * **async **get\_value**(key: str, default\_value? <!-- --> : [T](https://crawlee.dev/python/api#T) | None): [T](https://crawlee.dev/python/api#T) | None * **async **get\_value**(key: str): Any * **async **get\_value**(key: str, default\_value: [T](https://crawlee.dev/python/api#T)): [T](https://crawlee.dev/python/api#T) * **async **get\_value**(key: str, default\_value? <!-- --> : [T](https://crawlee.dev/python/api#T) | None): [T](https://crawlee.dev/python/api#T) | None - Get a value from the KVS. *** #### Parameters * ##### key: str Key of the record to retrieve. * ##### optionaldefault\_value: [T](https://crawlee.dev/python/api#T) | None = <!-- -->None Default value returned in case the record does not exist. #### Returns [T](https://crawlee.dev/python/api#T) | None ### [**](#iterate_keys)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L186)iterate\_keys * **async **iterate\_keys**(exclusive\_start\_key, limit): AsyncIterator\[[KeyValueStoreRecordMetadata](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata)] - Iterate over the existing keys in the KVS. *** #### Parameters * ##### optionalexclusive\_start\_key: str | None = <!-- -->None Key to start the iteration from. * ##### optionallimit: int | None = <!-- -->None Maximum number of keys to return. 
None means no limit. #### Returns AsyncIterator\[[KeyValueStoreRecordMetadata](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata)] ### [**](#list_keys)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L206)list\_keys * **async **list\_keys**(exclusive\_start\_key, limit): list\[[KeyValueStoreRecordMetadata](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata)] - List all the existing keys in the KVS. It uses the client's `iterate_keys` method to get the keys. *** #### Parameters * ##### optionalexclusive\_start\_key: str | None = <!-- -->None Key to start the iteration from. * ##### optionallimit: int = <!-- -->1000 Maximum number of keys to return. #### Returns list\[[KeyValueStoreRecordMetadata](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata)] ### [**](#open)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L110)open * **async **open**(\*, id, name, configuration, storage\_client): [Storage](https://crawlee.dev/python/api/class/Storage) - Overrides [Storage.open](https://crawlee.dev/python/api/class/Storage#open) Open a storage, either restoring an existing one or creating a new one. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None The storage ID. * ##### optionalkeyword-onlyname: str | None = <!-- -->None The storage name. * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None Configuration object used during the storage creation or restoration process. * ##### optionalkeyword-onlystorage\_client: [StorageClient](https://crawlee.dev/python/api/class/StorageClient) | None = <!-- -->None Underlying storage client to use. If not provided, the default global storage client from the service locator will be used. #### Returns [Storage](https://crawlee.dev/python/api/class/Storage) ### [**](#persist_autosaved_values)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L286)persist\_autosaved\_values * **async **persist\_autosaved\_values**(): None - Force autosaved values to be saved without waiting for an event in Event Manager. *** #### Returns None ### [**](#purge)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L138)purge * **async **purge**(): None - Overrides [Storage.purge](https://crawlee.dev/python/api/class/Storage#purge) Purge the storage, removing all items from the underlying storage client. This method does not remove the storage itself or its metadata; it only clears all items within it. *** #### Returns None ### [**](#record_exists)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L230)record\_exists * **async **record\_exists**(key): bool - Check if a record with the given key exists in the key-value store. *** #### Parameters * ##### key: str Key of the record to check for existence. #### Returns bool ### [**](#set_value)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L163)set\_value * **async **set\_value**(key, value, content\_type): None - Set a value in the KVS. *** #### Parameters * ##### key: str Key of the record to set. * ##### value: Any Value to set. * ##### optionalcontent\_type: str | None = <!-- -->None The MIME content type string.
#### Returns None ## Properties<!-- -->[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L96)id **id: str Overrides [Storage.id](https://crawlee.dev/python/api/class/Storage#id) Get the storage ID. ### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_key_value_store.py#L101)name **name: str | None Overrides [Storage.name](https://crawlee.dev/python/api/class/Storage#name) Get the storage name. --- # KeyValueStoreClient<!-- --> Key-value store resource client implementation based on the Apify platform storage. ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#__init__) * [**delete](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#delete) * [**delete\_record](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#delete_record) * [**get](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#get) * [**get\_public\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#get_public_url) * [**get\_record](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#get_record) * [**get\_record\_as\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#get_record_as_bytes) * [**list\_keys](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#list_keys) * [**set\_record](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#set_record) * [**stream\_record](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#stream_record) * [**update](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreClient.md#update) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L26)\_\_init\_\_ * ****\_\_init\_\_**(apify\_key\_value\_store\_client, api\_public\_base\_url): None - #### Parameters * ##### apify\_key\_value\_store\_client: KeyValueStoreClientAsync * ##### api\_public\_base\_url: str #### Returns None ### [**](#delete)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L44)delete * **async **delete**(): None - #### Returns None ### [**](#delete_record)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L84)delete\_record * **async **delete\_record**(key): None - #### Parameters * ##### key: str #### Returns None ### [**](#get)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L31)get * **async **get**(): KeyValueStoreMetadata | None - #### Returns KeyValueStoreMetadata | None ### [**](#get_public_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L89)get\_public\_url * **async **get\_public\_url**(key): str - Get a URL for the given key that may be used to publicly access the value in the remote key-value store. *** #### Parameters * ##### key: str The key for which the URL should be generated. 
#### Returns str ### [**](#get_record)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L57)get\_record * **async **get\_record**(key): KeyValueStoreRecord | None - #### Parameters * ##### key: str #### Returns KeyValueStoreRecord | None ### [**](#get_record_as_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L62)get\_record\_as\_bytes * **async **get\_record\_as\_bytes**(key): KeyValueStoreRecord | None - #### Parameters * ##### key: str #### Returns KeyValueStoreRecord | None ### [**](#list_keys)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L48)list\_keys * **async **list\_keys**(\*, limit, exclusive\_start\_key): KeyValueStoreListKeysPage - #### Parameters * ##### optionalkeyword-onlylimit: int = <!-- -->1000 * ##### optionalkeyword-onlyexclusive\_start\_key: str | None = <!-- -->None #### Returns KeyValueStoreListKeysPage ### [**](#set_record)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L76)set\_record * **async **set\_record**(key, value, content\_type): None - #### Parameters * ##### key: str * ##### value: Any * ##### optionalcontent\_type: str | None = <!-- -->None #### Returns None ### [**](#stream_record)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L67)stream\_record * **async **stream\_record**(key): AbstractAsyncContextManager\[KeyValueStoreRecord\[Response] | None] - #### Parameters * ##### key: str #### Returns AbstractAsyncContextManager\[KeyValueStoreRecord\[Response] | None] ### [**](#update)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_client.py#L36)update * **async **update**(\*, name): KeyValueStoreMetadata - #### Parameters * ##### optionalkeyword-onlyname: str | None = <!-- -->None #### Returns KeyValueStoreMetadata --- # KeyValueStoreCollectionClient<!-- --> Key-value store collection resource client implementation based on the Apify platform storage. 
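In normal use the Apify SDK constructs this collection client for you as part of its platform storage client, so the following is only a minimal sketch of how the documented methods fit together. It assumes direct construction from the underlying `apify-client` collection client (`ApifyClientAsync.key_value_stores()`), a placeholder API token, and the internal module path shown in the source links below; none of this wiring is something you would normally do yourself.

```
import asyncio

from apify_client import ApifyClientAsync

# Internal module path taken from the source links in this section; it may change.
from apify.apify_storage_client._key_value_store_collection_client import (
    KeyValueStoreCollectionClient,
)


async def main() -> None:
    apify_client = ApifyClientAsync(token='<APIFY_TOKEN>')  # placeholder token

    # Wrap the apify-client collection client that this class delegates to.
    collection = KeyValueStoreCollectionClient(apify_client.key_value_stores())

    # Create the store if it does not exist yet, then page through existing stores.
    store_metadata = await collection.get_or_create(name='my-store')
    print(store_metadata.id, store_metadata.name)

    stores_page = await collection.list(limit=10)
    print(stores_page)


asyncio.run(main())
```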
## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreCollectionClient.md#__init__) * [**get\_or\_create](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreCollectionClient.md#get_or_create) * [**list](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreCollectionClient.md#list) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_collection_client.py#L17)\_\_init\_\_ * ****\_\_init\_\_**(apify\_dataset\_collection\_client): None - #### Parameters * ##### apify\_dataset\_collection\_client: KeyValueStoreCollectionClientAsync #### Returns None ### [**](#get_or_create)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_collection_client.py#L21)get\_or\_create * **async **get\_or\_create**(\*, id, name, schema): KeyValueStoreMetadata - #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyschema: dict | None = <!-- -->None #### Returns KeyValueStoreMetadata ### [**](#list)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_key_value_store_collection_client.py#L36)list * **async **list**(\*, unnamed, limit, offset, desc): KeyValueStoreListPage - #### Parameters * ##### optionalkeyword-onlyunnamed: bool = <!-- -->False * ##### optionalkeyword-onlylimit: int | None = <!-- -->None * ##### optionalkeyword-onlyoffset: int | None = <!-- -->None * ##### optionalkeyword-onlydesc: bool = <!-- -->False #### Returns KeyValueStoreListPage --- # KeyValueStoreMetadata<!-- --> Model for a key-value store metadata. ### Hierarchy * [StorageMetadata](https://crawlee.dev/python/api/class/StorageMetadata) * *KeyValueStoreMetadata* ## Index[**](#Index) ### Properties * [**accessed\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreMetadata.md#accessed_at) * [**created\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreMetadata.md#created_at) * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreMetadata.md#id) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreMetadata.md#model_config) * [**modified\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreMetadata.md#modified_at) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreMetadata.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#accessed_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L31)accessed\_at **accessed\_at: datetime Inherited from [StorageMetadata.accessed\_at](https://crawlee.dev/python/api/class/StorageMetadata#accessed_at) The timestamp when the storage was last accessed. ### [**](#created_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L34)created\_at **created\_at: datetime Inherited from [StorageMetadata.created\_at](https://crawlee.dev/python/api/class/StorageMetadata#created_at) The timestamp when the storage was created. 
### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L25)id **id: str Inherited from [StorageMetadata.id](https://crawlee.dev/python/api/class/StorageMetadata#id) The unique identifier of the storage. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L55)model\_config **model\_config: Undefined Overrides [StorageMetadata.model\_config](https://crawlee.dev/python/api/class/StorageMetadata#model_config) ### [**](#modified_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L37)modified\_at **modified\_at: datetime Inherited from [StorageMetadata.modified\_at](https://crawlee.dev/python/api/class/StorageMetadata#modified_at) The timestamp when the storage was last modified. ### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L28)name **name: str | None Inherited from [StorageMetadata.name](https://crawlee.dev/python/api/class/StorageMetadata#name) The name of the storage. --- # KeyValueStoreRecord<!-- --> Model for a key-value store record. ### Hierarchy * [KeyValueStoreRecordMetadata](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata) * *KeyValueStoreRecord* ## Index[**](#Index) ### Properties * [**content\_type](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecord.md#content_type) * [**key](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecord.md#key) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecord.md#model_config) * [**size](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecord.md#size) * [**value](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecord.md#value) ## Properties<!-- -->[**](#Properties) ### [**](#content_type)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L89)content\_type **content\_type: str Inherited from [KeyValueStoreRecordMetadata.content\_type](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata#content_type) The MIME type of the record. Describe the format and type of data stored in the record, following the MIME specification. ### [**](#key)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L83)key **key: str Inherited from [KeyValueStoreRecordMetadata.key](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata#key) The key of the record. A unique identifier for the record in the key-value store. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L103)model\_config **model\_config: Undefined Overrides [KeyValueStoreRecordMetadata.model\_config](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata#model_config) ### [**](#size)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L95)size **size: int | None Inherited from [KeyValueStoreRecordMetadata.size](https://crawlee.dev/python/api/class/KeyValueStoreRecordMetadata#size) The size of the record in bytes. ### [**](#value)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L105)value **value: [KvsValueType](https://crawlee.dev/python/api#KvsValueType) The value of the record. 
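To relate these record models to the `KeyValueStore` API documented above: `KeyValueStore.iterate_keys` yields `KeyValueStoreRecordMetadata` items, while the full record value is fetched separately by key. A minimal sketch follows, assuming a local run; the store name `'my-store'` and the key `'some-key'` are placeholders.

```
import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    kvs = await KeyValueStore.open(name='my-store')

    # Each yielded item is a KeyValueStoreRecordMetadata (key, content_type, size).
    async for record_metadata in kvs.iterate_keys():
        print(record_metadata.key, record_metadata.content_type, record_metadata.size)

    # Fetch the full value of a single record by its key.
    value = await kvs.get_value('some-key')
    print(value)


asyncio.run(main())
```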
--- # KeyValueStoreRecordMetadata<!-- --> Model for a key-value store record metadata. ### Hierarchy * *KeyValueStoreRecordMetadata* * [KeyValueStoreRecord](https://crawlee.dev/python/api/class/KeyValueStoreRecord) ## Index[**](#Index) ### Properties * [**content\_type](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecordMetadata.md#content_type) * [**key](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecordMetadata.md#key) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecordMetadata.md#model_config) * [**size](https://docs.apify.com/sdk/python/sdk/python/reference/class/KeyValueStoreRecordMetadata.md#size) ## Properties<!-- -->[**](#Properties) ### [**](#content_type)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L89)content\_type **content\_type: str The MIME type of the record. Describe the format and type of data stored in the record, following the MIME specification. ### [**](#key)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L83)key **key: str The key of the record. A unique identifier for the record in the key-value store. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L81)model\_config **model\_config: Undefined ### [**](#size)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L95)size **size: int | None The size of the record in bytes. --- # LocalEventManager<!-- --> Event manager for local environments. It extends the `EventManager` to emit `SystemInfo` events at regular intervals. The `LocalEventManager` is intended for local environments, where the system metrics needed to manage the `Snapshotter` and `AutoscaledPool` are collected directly from the local machine. ### Hierarchy * [EventManager](https://crawlee.dev/python/api/class/EventManager) * *LocalEventManager* ## Index[**](#Index) ### Methods * [**\_\_aenter\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#__aenter__) * [**\_\_aexit\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#__aexit__) * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#__init__) * [**emit](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#emit) * [**from\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#from_config) * [**off](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#off) * [**on](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#on) * [**wait\_for\_all\_listeners\_to\_complete](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#wait_for_all_listeners_to_complete) ### Properties * [**active](https://docs.apify.com/sdk/python/sdk/python/reference/class/LocalEventManager.md#active) ## Methods<!-- -->[**](#Methods) ### [**](#__aenter__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_local_event_manager.py#L72)\_\_aenter\_\_ * **async **\_\_aenter\_\_**(): [LocalEventManager](https://crawlee.dev/python/api/class/LocalEventManager) - Overrides [EventManager.\_\_aenter\_\_](https://crawlee.dev/python/api/class/EventManager#__aenter__) Initialize the local event manager upon entering the async context.
It starts emitting system info events at regular intervals. *** #### Returns [LocalEventManager](https://crawlee.dev/python/api/class/LocalEventManager) ### [**](#__aexit__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_local_event_manager.py#L81)\_\_aexit\_\_ * **async **\_\_aexit\_\_**(exc\_type, exc\_value, exc\_traceback): None - Overrides [EventManager.\_\_aexit\_\_](https://crawlee.dev/python/api/class/EventManager#__aexit__) Close the local event manager upon exiting the async context. It stops emitting system info events and closes the event manager. *** #### Parameters * ##### exc\_type: [type](https://crawlee.dev/python/api/class/SitemapSource#type)\[BaseException] | None * ##### exc\_value: BaseException | None * ##### exc\_traceback: TracebackType | None #### Returns None ### [**](#__init__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_local_event_manager.py#L34)\_\_init\_\_ * ****\_\_init\_\_**(system\_info\_interval, \*, persist\_state\_interval, close\_timeout): None - Overrides [EventManager.\_\_init\_\_](https://crawlee.dev/python/api/class/EventManager#__init__) Initialize a new instance. In most cases, you should use the `from_config` constructor to create a new instance based on the provided configuration. *** #### Parameters * ##### optionalsystem\_info\_interval: timedelta = <!-- -->timedelta(seconds=1) Interval at which `SystemInfo` events are emitted. * ##### keyword-onlyoptionalpersist\_state\_interval: timedelta Interval between emitted `PersistState` events to maintain state persistence. * ##### keyword-onlyoptionalclose\_timeout: timedelta | None Optional timeout for canceling pending event listeners if they exceed this duration. #### Returns None ### [**](#emit)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L233)emit * ****emit**(\*: , event: [Event](https://crawlee.dev/python/api/enum/Event), event\_data: [EventData](https://crawlee.dev/python/api#EventData)): None * ****emit**(\*: , event: Literal\[Event.PERSIST\_STATE], event\_data: [EventPersistStateData](https://crawlee.dev/python/api/class/EventPersistStateData)): None * ****emit**(\*: , event: Literal\[Event.SYSTEM\_INFO], event\_data: [EventSystemInfoData](https://crawlee.dev/python/api/class/EventSystemInfoData)): None * ****emit**(\*: , event: Literal\[Event.MIGRATING], event\_data: [EventMigratingData](https://crawlee.dev/python/api/class/EventMigratingData)): None * ****emit**(\*: , event: Literal\[Event.ABORTING], event\_data: [EventAbortingData](https://crawlee.dev/python/api/class/EventAbortingData)): None * ****emit**(\*: , event: Literal\[Event.EXIT], event\_data: [EventExitData](https://crawlee.dev/python/api/class/EventExitData)): None * ****emit**(\*: , event: Literal\[Event.CRAWLER\_STATUS], event\_data: [EventCrawlerStatusData](https://crawlee.dev/python/api/class/EventCrawlerStatusData)): None * ****emit**(\*: , event: [Event](https://crawlee.dev/python/api/enum/Event), event\_data: Any): None - Inherited from [EventManager.emit](https://crawlee.dev/python/api/class/EventManager#emit) Emit an event with the associated data to all registered listeners. *** #### Parameters * ##### keyword-onlyevent: [Event](https://crawlee.dev/python/api/enum/Event) The event which will be emitted. * ##### keyword-onlyevent\_data: [EventData](https://crawlee.dev/python/api#EventData) The data which will be passed to the event listeners. 
#### Returns None ### [**](#from_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_local_event_manager.py#L59)from\_config * ****from\_config**(config): [LocalEventManager](https://crawlee.dev/python/api/class/LocalEventManager) - Initialize a new instance based on the provided `Configuration`. *** #### Parameters * ##### optionalconfig: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None The `Configuration` instance. Uses the global (default) one if not provided. #### Returns [LocalEventManager](https://crawlee.dev/python/api/class/LocalEventManager) ### [**](#off)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L201)off * ****off**(\*, event, listener): None - Inherited from [EventManager.off](https://crawlee.dev/python/api/class/EventManager#off) Remove a specific listener or all listeners for an event. *** #### Parameters * ##### keyword-onlyevent: [Event](https://crawlee.dev/python/api/enum/Event) The Actor event for which to remove listeners. * ##### optionalkeyword-onlylistener: [EventListener](https://crawlee.dev/python/api#EventListener)\[Any] | None = <!-- -->None The listener which is supposed to be removed. If not passed, all listeners of this event are removed. #### Returns None ### [**](#on)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L155)on * ****on**(\*: , event: [Event](https://crawlee.dev/python/api/enum/Event), listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[Any]): None * ****on**(\*: , event: Literal\[Event.PERSIST\_STATE], listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[[EventPersistStateData](https://crawlee.dev/python/api/class/EventPersistStateData)]): None * ****on**(\*: , event: Literal\[Event.SYSTEM\_INFO], listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[[EventSystemInfoData](https://crawlee.dev/python/api/class/EventSystemInfoData)]): None * ****on**(\*: , event: Literal\[Event.MIGRATING], listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[[EventMigratingData](https://crawlee.dev/python/api/class/EventMigratingData)]): None * ****on**(\*: , event: Literal\[Event.ABORTING], listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[[EventAbortingData](https://crawlee.dev/python/api/class/EventAbortingData)]): None * ****on**(\*: , event: Literal\[Event.EXIT], listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[[EventExitData](https://crawlee.dev/python/api/class/EventExitData)]): None * ****on**(\*: , event: Literal\[Event.CRAWLER\_STATUS], listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[[EventCrawlerStatusData](https://crawlee.dev/python/api/class/EventCrawlerStatusData)]): None * ****on**(\*: , event: [Event](https://crawlee.dev/python/api/enum/Event), listener: [EventListener](https://crawlee.dev/python/api#EventListener)\[None]): None - Inherited from [EventManager.on](https://crawlee.dev/python/api/class/EventManager#on) Register an event listener for a specific event. *** #### Parameters * ##### keyword-onlyevent: [Event](https://crawlee.dev/python/api/enum/Event) The event for which to listen to. * ##### keyword-onlylistener: [EventListener](https://crawlee.dev/python/api#EventListener)\[Any] The function (sync or async) which is to be called when the event is emitted. 
#### Returns None ### [**](#wait_for_all_listeners_to_complete)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L243)wait\_for\_all\_listeners\_to\_complete * **async **wait\_for\_all\_listeners\_to\_complete**(\*, timeout): None - Inherited from [EventManager.wait\_for\_all\_listeners\_to\_complete](https://crawlee.dev/python/api/class/EventManager#wait_for_all_listeners_to_complete) Wait for all currently executing event listeners to complete. *** #### Parameters * ##### optionalkeyword-onlytimeout: timedelta | None = <!-- -->None The maximum time to wait for the event listeners to finish. If they do not complete within the specified timeout, they will be canceled. #### Returns None ## Properties<!-- -->[**](#Properties) ### [**](#active)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_event_manager.py#L100)active **active: bool Inherited from [EventManager.active](https://crawlee.dev/python/api/class/EventManager#active) Indicate whether the context is active. --- # MemoryStorageClient<!-- --> Memory implementation of the storage client. This storage client provides access to datasets, key-value stores, and request queues that store all data in memory using Python data structures (lists and dictionaries). No data is persisted between process runs, meaning all stored data is lost when the program terminates. The memory implementation provides fast access to data but is limited by available memory and does not support data sharing across different processes. All storage operations happen entirely in memory with no disk operations. The memory storage client is useful for testing and development environments, or short-lived crawler operations where persistence is not required. ### Hierarchy * [StorageClient](https://crawlee.dev/python/api/class/StorageClient) * *MemoryStorageClient* ## Index[**](#Index) ### Methods * [**create\_dataset\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/MemoryStorageClient.md#create_dataset_client) * [**create\_kvs\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/MemoryStorageClient.md#create_kvs_client) * [**create\_rq\_client](https://docs.apify.com/sdk/python/sdk/python/reference/class/MemoryStorageClient.md#create_rq_client) * [**get\_rate\_limit\_errors](https://docs.apify.com/sdk/python/sdk/python/reference/class/MemoryStorageClient.md#get_rate_limit_errors) ## Methods<!-- -->[**](#Methods) ### [**](#create_dataset_client)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_memory/_storage_client.py#L31)create\_dataset\_client * **async **create\_dataset\_client**(\*, id, name, configuration): [DatasetClient](https://crawlee.dev/python/api/class/DatasetClient) - Overrides [StorageClient.create\_dataset\_client](https://crawlee.dev/python/api/class/StorageClient#create_dataset_client) Create a dataset client. 
*** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None #### Returns [DatasetClient](https://crawlee.dev/python/api/class/DatasetClient) ### [**](#create_kvs_client)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_memory/_storage_client.py#L44)create\_kvs\_client * **async **create\_kvs\_client**(\*, id, name, configuration): [KeyValueStoreClient](https://crawlee.dev/python/api/class/KeyValueStoreClient) - Overrides [StorageClient.create\_kvs\_client](https://crawlee.dev/python/api/class/StorageClient#create_kvs_client) Create a key-value store client. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None #### Returns [KeyValueStoreClient](https://crawlee.dev/python/api/class/KeyValueStoreClient) ### [**](#create_rq_client)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_memory/_storage_client.py#L57)create\_rq\_client * **async **create\_rq\_client**(\*, id, name, configuration): [RequestQueueClient](https://crawlee.dev/python/api/class/RequestQueueClient) - Overrides [StorageClient.create\_rq\_client](https://crawlee.dev/python/api/class/StorageClient#create_rq_client) Create a request queue client. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None #### Returns [RequestQueueClient](https://crawlee.dev/python/api/class/RequestQueueClient) ### [**](#get_rate_limit_errors)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/_base/_storage_client.py#L61)get\_rate\_limit\_errors * ****get\_rate\_limit\_errors**(): dict\[int, int] - Inherited from [StorageClient.get\_rate\_limit\_errors](https://crawlee.dev/python/api/class/StorageClient#get_rate_limit_errors) Return statistics about rate limit errors encountered by the HTTP client in storage client. 
*** #### Returns dict\[int, int] --- # MigratingEvent<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/MigratingEvent.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/MigratingEvent.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L75)data **data: EventMigratingData ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L74)name **name: Literal\[Event.MIGRATING] --- # PayPerEventActorPricingInfo<!-- --> ## Index[**](#Index) ### Properties * [**minimal\_max\_total\_charge\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/PayPerEventActorPricingInfo.md#minimal_max_total_charge_usd) * [**pricing\_model](https://docs.apify.com/sdk/python/sdk/python/reference/class/PayPerEventActorPricingInfo.md#pricing_model) * [**pricing\_per\_event](https://docs.apify.com/sdk/python/sdk/python/reference/class/PayPerEventActorPricingInfo.md#pricing_per_event) ## Properties<!-- -->[**](#Properties) ### [**](#minimal_max_total_charge_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L166)minimal\_max\_total\_charge\_usd **minimal\_max\_total\_charge\_usd: Decimal | None ### [**](#pricing_model)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L164)pricing\_model **pricing\_model: Literal\[PAY\_PER\_EVENT] ### [**](#pricing_per_event)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L165)pricing\_per\_event **pricing\_per\_event: [PricingPerEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricingPerEvent.md) --- # PersistStateEvent<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/PersistStateEvent.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/PersistStateEvent.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L63)data **data: EventPersistStateData ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L62)name **name: Literal\[Event.PERSIST\_STATE] --- # PlatformEventManager<!-- --> A class for managing Actor events. You shouldn't use this class directly, but instead use it via the `Actor.on()` and `Actor.off()` methods. 
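A minimal sketch of that recommended pattern is shown below: it registers a listener through `Actor.on()` and removes it again with `Actor.off()`. It assumes the `Event` enum is importable from the `apify` package and uses a trivial synchronous listener; the sleep only stands in for the Actor's real work.

```
import asyncio

from apify import Actor, Event


async def main() -> None:
    async with Actor:
        # Called when the platform emits a MIGRATING event for this run.
        def on_migrating(event_data: object) -> None:
            Actor.log.info('Migration is imminent: %s', event_data)

        Actor.on(Event.MIGRATING, on_migrating)

        # ... the Actor's real work would go here ...
        await asyncio.sleep(1)

        # Remove the listener once it is no longer needed.
        Actor.off(Event.MIGRATING, on_migrating)


asyncio.run(main())
```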
## Index[**](#Index) ### Methods * [**\_\_aenter\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/PlatformEventManager.md#__aenter__) * [**\_\_aexit\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/PlatformEventManager.md#__aexit__) * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/PlatformEventManager.md#__init__) ## Methods<!-- -->[**](#Methods) ### [**](#__aenter__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L149)\_\_aenter\_\_ * **async **\_\_aenter\_\_**(): Self - #### Returns Self ### [**](#__aexit__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L167)\_\_aexit\_\_ * **async **\_\_aexit\_\_**(exc\_type, exc\_value, exc\_traceback): None - #### Parameters * ##### exc\_type: type\[BaseException] | None * ##### exc\_value: BaseException | None * ##### exc\_traceback: TracebackType | None #### Returns None ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L135)\_\_init\_\_ * ****\_\_init\_\_**(config): None - Create an instance of the EventManager. *** #### Parameters * ##### config: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) The Actor configuration to be used in this event manager. #### Returns None --- # PricePerDatasetItemActorPricingInfo<!-- --> ## Index[**](#Index) ### Properties * [**price\_per\_unit\_usd](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricePerDatasetItemActorPricingInfo.md#price_per_unit_usd) * [**pricing\_model](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricePerDatasetItemActorPricingInfo.md#pricing_model) * [**unit\_name](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricePerDatasetItemActorPricingInfo.md#unit_name) ## Properties<!-- -->[**](#Properties) ### [**](#price_per_unit_usd)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L150)price\_per\_unit\_usd **price\_per\_unit\_usd: Decimal ### [**](#pricing_model)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L148)pricing\_model **pricing\_model: Literal\[PRICE\_PER\_DATASET\_ITEM] ### [**](#unit_name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L149)unit\_name **unit\_name: str | None --- # PricingInfoItem<!-- --> ## Index[**](#Index) ### Properties * [**price](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricingInfoItem.md#price) * [**title](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricingInfoItem.md#title) ## Properties<!-- -->[**](#Properties) ### [**](#price)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L321)price **price: Decimal ### [**](#title)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_charging.py#L322)title **title: str --- # PricingPerEvent<!-- --> ## Index[**](#Index) ### Properties * [**actor\_charge\_events](https://docs.apify.com/sdk/python/sdk/python/reference/class/PricingPerEvent.md#actor_charge_events) ## Properties<!-- -->[**](#Properties) ### [**](#actor_charge_events)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L160)actor\_charge\_events **actor\_charge\_events: dict\[str, [ActorChargeEvent](https://docs.apify.com/sdk/python/sdk/python/reference/class/ActorChargeEvent.md)] --- # ProcessedRequest<!-- --> Represents a processed 
request. ## Index[**](#Index) ### Properties * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProcessedRequest.md#id) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProcessedRequest.md#model_config) * [**unique\_key](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProcessedRequest.md#unique_key) * [**was\_already\_handled](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProcessedRequest.md#was_already_handled) * [**was\_already\_present](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProcessedRequest.md#was_already_present) ## Properties<!-- -->[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L140)id **id: str ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L138)model\_config **model\_config: Undefined ### [**](#unique_key)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L141)unique\_key **unique\_key: str ### [**](#was_already_handled)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L143)was\_already\_handled **was\_already\_handled: bool ### [**](#was_already_present)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L142)was\_already\_present **was\_already\_present: bool --- # ProxyConfiguration<!-- --> Configures a connection to a proxy server with the provided options. Proxy servers are used to prevent target websites from blocking your crawlers based on IP address rate limits or blacklists. The default servers used by this class are managed by [Apify Proxy](https://docs.apify.com/proxy). To be able to use Apify Proxy, you need an Apify account and access to the selected proxies. If you provide no configuration option, the proxies will be managed automatically using a smart algorithm. If you want to use your own proxies, use the `proxy_urls` or `new_url_function` constructor options. Your list of proxy URLs will be rotated by the configuration, if this option is provided. ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md#__init__) * [**initialize](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md#initialize) * [**new\_proxy\_info](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyConfiguration.md#new_proxy_info) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_proxy_configuration.py#L105)\_\_init\_\_ * ****\_\_init\_\_**(\*, password, groups, country\_code, proxy\_urls, new\_url\_function, tiered\_proxy\_urls, \_actor\_config, \_apify\_client): None - Create a ProxyConfiguration instance. It is highly recommended to use `Actor.create_proxy_configuration()` instead of this. *** #### Parameters * ##### optionalkeyword-onlypassword: str | None = <!-- -->None Password for the Apify Proxy. If not provided, will use os.environ\['APIFY\_PROXY\_PASSWORD'], if available. * ##### optionalkeyword-onlygroups: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] | None = <!-- -->None Proxy groups which the Apify Proxy should use, if provided. * ##### optionalkeyword-onlycountry\_code: str | None = <!-- -->None Country which the Apify Proxy should use, if provided. 
* ##### optionalkeyword-onlyproxy\_urls: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str | None] | None = <!-- -->None Custom proxy server URLs which should be rotated through. * ##### optionalkeyword-onlynew\_url\_function: \_NewUrlFunction | None = <!-- -->None Function which returns a custom proxy URL to be used. * ##### optionalkeyword-onlytiered\_proxy\_urls: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[[list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str | None]] | None = <!-- -->None Proxy URLs arranged into tiers * ##### optionalkeyword-only\_actor\_config: [Configuration](https://docs.apify.com/sdk/python/sdk/python/reference/class/Configuration.md) | None = <!-- -->None * ##### optionalkeyword-only\_apify\_client: ApifyClientAsync | None = <!-- -->None #### Returns None ### [**](#initialize)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_proxy_configuration.py#L177)initialize * **async **initialize**(): None - Check if using proxy, if so, check the access. Load the Apify Proxy password from API (only if not passed to constructor or through env var). Only called if Apify Proxy configuration is used. Also checks if country has access to Apify Proxy groups if the country code is provided. You should use the Actor.create\_proxy\_configuration function to create a pre-initialized `ProxyConfiguration` instance instead of calling this manually. *** #### Returns None ### [**](#new_proxy_info)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_proxy_configuration.py#L202)new\_proxy\_info * **async **new\_proxy\_info**(session\_id, request, proxy\_tier): [ProxyInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyInfo.md) | None - Create a new ProxyInfo object. Use it if you want to work with a rich representation of a proxy URL. If you need the URL string only, use `ProxyConfiguration.new_url`. *** #### Parameters * ##### optionalsession\_id: str | None = <!-- -->None Represents the identifier of a proxy session (<https://docs.apify.com/proxy#sessions>). All the HTTP requests going through the proxy with the same session identifier will use the same target proxy server (i.e. the same IP address). The identifier must not be longer than 50 characters and include only the following: `0-9`, `a-z`, `A-Z`, `"."`, `"_"` and `"~"`. * ##### optionalrequest: Request | None = <!-- -->None request for which the proxy info is being issued, used in proxy tier handling. * ##### optionalproxy\_tier: int | None = <!-- -->None allows forcing the proxy tier to be used. #### Returns [ProxyInfo](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyInfo.md) | None --- # ProxyInfo<!-- --> Provides information about a proxy connection that is used for requests. ## Index[**](#Index) ### Properties * [**country\_code](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyInfo.md#country_code) * [**groups](https://docs.apify.com/sdk/python/sdk/python/reference/class/ProxyInfo.md#groups) ## Properties<!-- -->[**](#Properties) ### [**](#country_code)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_proxy_configuration.py#L79)country\_code **country\_code: str | None If set and relevant proxies are available in your Apify account, all proxied requests will use IP addresses that are geolocated to the specified country. 
For example `GB` for IPs from Great Britain. Note that online services often have their own rules for handling geolocation and thus the country selection is a best attempt at geolocation, rather than a guaranteed hit. The country code needs to be a two-letter ISO country code; see the [full list of available country codes](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements). This parameter is optional; by default, the proxy uses all available proxy servers from all countries. ### [**](#groups)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_proxy_configuration.py#L75)groups **groups: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[str] An array of proxy groups to be used by the [Apify Proxy](https://docs.apify.com/proxy). If not provided, the proxy will select the groups automatically. --- # Request<!-- --> Represents a request in the Crawlee framework, containing the necessary information for crawling operations. The `Request` class is one of the core components in Crawlee, utilized by request providers, HTTP clients, crawlers, and more. It encapsulates the essential data for executing web requests, including the URL, HTTP method, headers, payload, and user data. The user data allows custom information to be stored and persisted throughout the request lifecycle, including its retries. Key functionalities include managing the request's identifier (`id`), unique key (`unique_key`) that is used for request deduplication, controlling retries, handling state management, and enabling configuration for session rotation and proxy handling. The recommended way to create a new instance is by using the `Request.from_url` constructor, which automatically generates a unique key and identifier based on the URL and request parameters.
### Usage ``` from crawlee import Request request = Request.from_url('https://crawlee.dev') ``` ### Hierarchy * *Request* * [RequestWithLock](https://crawlee.dev/python/api/class/RequestWithLock) ## Index[**](#Index) ### Methods * [**crawl\_depth](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#crawl_depth) * [**enqueue\_strategy](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#enqueue_strategy) * [**forefront](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#forefront) * [](https://crawlee.dev/python/api/class/Request#from_url) * [](https://crawlee.dev/python/api/class/Request#get_query_param_from_url) * [**last\_proxy\_tier](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#last_proxy_tier) * [**max\_retries](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#max_retries) * [**session\_rotation\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#session_rotation_count) * [**state](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#state) ### Properties * [**crawl\_depth](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#crawl_depth) * [**crawlee\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#crawlee_data) * [**enqueue\_strategy](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#enqueue_strategy) * [**forefront](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#forefront) * [**handled\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#handled_at) * [**headers](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#headers) * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#id) * [**label](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#label) * [**last\_proxy\_tier](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#last_proxy_tier) * [**loaded\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#loaded_url) * [**max\_retries](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#max_retries) * [**method](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#method) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#model_config) * [**no\_retry](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#no_retry) * [**payload](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#payload) * [**retry\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#retry_count) * [**session\_id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#session_id) * [**session\_rotation\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#session_rotation_count) * [**state](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#state) * [**unique\_key](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#unique_key) * [**url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#url) * [**user\_data](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#user_data) * [**was\_already\_handled](https://docs.apify.com/sdk/python/sdk/python/reference/class/Request.md#was_already_handled) ## Methods<!-- -->[**](#Methods) ### 
[**](#crawl_depth)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L349)crawl\_depth * ****crawl\_depth**(new\_value): None - #### Parameters * ##### new\_value: int #### Returns None ### [**](#enqueue_strategy)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L385)enqueue\_strategy * ****enqueue\_strategy**(new\_enqueue\_strategy): None - #### Parameters * ##### new\_enqueue\_strategy: [EnqueueStrategy](https://crawlee.dev/python/api#EnqueueStrategy) #### Returns None ### [**](#forefront)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L403)forefront * ****forefront**(new\_value): None - #### Parameters * ##### new\_value: bool #### Returns None ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/_request.py#L232) : ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/_request.py#L319) : ### [**](#last_proxy_tier)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L394)last\_proxy\_tier * ****last\_proxy\_tier**(new\_value): None - #### Parameters * ##### new\_value: int #### Returns None ### [**](#max_retries)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L367)max\_retries * ****max\_retries**(new\_max\_retries): None - #### Parameters * ##### new\_max\_retries: int #### Returns None ### [**](#session_rotation_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L376)session\_rotation\_count * ****session\_rotation\_count**(new\_session\_rotation\_count): None - #### Parameters * ##### new\_session\_rotation\_count: int #### Returns None ### [**](#state)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L358)state * ****state**(new\_state): None - #### Parameters * ##### new\_state: [RequestState](https://crawlee.dev/python/api/class/RequestState) #### Returns None ## Properties<!-- -->[**](#Properties) ### [**](#crawl_depth)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L344)crawl\_depth **crawl\_depth: int The depth of the request in the crawl tree. ### [**](#crawlee_data)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L335)crawlee\_data **crawlee\_data: [CrawleeRequestData](https://crawlee.dev/python/api/class/CrawleeRequestData) Crawlee-specific configuration stored in the `user_data`. ### [**](#enqueue_strategy)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L380)enqueue\_strategy **enqueue\_strategy: [EnqueueStrategy](https://crawlee.dev/python/api#EnqueueStrategy) The strategy that was used for enqueuing the request. ### [**](#forefront)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L398)forefront **forefront: bool Indicate whether the request should be enqueued at the front of the queue. ### [**](#handled_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L228)handled\_at **handled\_at: datetime | None Timestamp when the request was handled. ### [**](#headers)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L191)headers **headers: [HttpHeaders](https://crawlee.dev/python/api/class/HttpHeaders) HTTP request headers. ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L168)id **id: str A unique identifier for the request. 
Note that this is not used for deduplication, and should not be confused with `unique_key`. ### [**](#label)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L325)label **label: str | None A string used to differentiate between arbitrary request types. ### [**](#last_proxy_tier)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L389)last\_proxy\_tier **last\_proxy\_tier: int | None The last proxy tier used to process the request. ### [**](#loaded_url)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L225)loaded\_url **loaded\_url: str | None URL of the web page that was loaded. This can differ from the original URL in case of redirects. ### [**](#max_retries)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L362)max\_retries **max\_retries: int | None Crawlee-specific limit on the number of retries of the request. ### [**](#method)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L188)method **method: [HttpMethod](https://crawlee.dev/python/api#HttpMethod) HTTP request method. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L166)model\_config **model\_config: Undefined ### [**](#no_retry)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L222)no\_retry **no\_retry: bool If set to `True`, the request will not be retried in case of failure. ### [**](#payload)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L194)payload **payload: [HttpPayload](https://crawlee.dev/python/api#HttpPayload) | None HTTP request payload. ### [**](#retry_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L219)retry\_count **retry\_count: int Number of times the request has been retried. ### [**](#session_id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L330)session\_id **session\_id: str | None The ID of the bound session, if there is any. ### [**](#session_rotation_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L371)session\_rotation\_count **session\_rotation\_count: int | None Crawlee-specific number of finished session rotations for the request. ### [**](#state)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L353)state **state: [RequestState](https://crawlee.dev/python/api/class/RequestState) | None Crawlee-specific request handling state. ### [**](#unique_key)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L172)unique\_key **unique\_key: str A unique key identifying the request. Two requests with the same `unique_key` are considered as pointing to the same URL. If `unique_key` is not provided, then it is automatically generated by normalizing the URL. For example, the URL of `HTTP://www.EXAMPLE.com/something/` will produce the `unique_key` of `http://www.example.com/something`. Pass an arbitrary non-empty text value to the `unique_key` property to override the default behavior and specify which URLs shall be considered equal. ### [**](#url)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L184)url **url: str The URL of the web page to crawl. Must be a valid HTTP or HTTPS URL, and may include query parameters and fragments. 
### [**](#user_data)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L201)user\_data **user\_data: dict\[str, JsonSerializable] Custom user data assigned to the request. Use this to save any request-related data to the request's scope, keeping it accessible on retries, failures, etc. ### [**](#was_already_handled)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/_request.py#L407)was\_already\_handled **was\_already\_handled: bool Indicates whether the request was handled. --- # RequestList<!-- --> Extends the Crawlee `RequestList`. The `open` method is used to create a `RequestList` from the Actor's `requestListSources` input. ## Index[**](#Index) ### Methods * [**open](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestList.md#open) ## Methods<!-- -->[**](#Methods) ### [**](#open)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/storages/_request_list.py#L49)open * **async **open**(name, request\_list\_sources\_input, http\_client): [RequestList](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestList.md) - Initialize a new instance from request list source input. *** #### Parameters * ##### optionalname: str | None = <!-- -->None Name of the returned `RequestList`. * ##### optionalrequest\_list\_sources\_input: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[dict\[str, Any]] | None = <!-- -->None List of dicts containing either a `url` key or a `requestsFromUrl` key. * ##### optionalhttp\_client: HttpClient | None = <!-- -->None Client that will be used to send GET requests to the URLs given by the `requestsFromUrl` values. #### Returns [RequestList](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestList.md) --- # RequestLoader<!-- --> An abstract class defining the interface for classes that provide access to a read-only stream of requests. Request loaders are used to manage and provide access to a storage of crawling requests. Key responsibilities: * Fetching the next request to be processed. * Marking requests as successfully handled after processing. * Managing state information such as the total and handled request counts.
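A minimal sketch of this consumption loop, assuming `loader` is any concrete implementation of the interface (for example the `RequestList` above) and that the class is importable from `crawlee.request_loaders`:

```
import asyncio

from crawlee.request_loaders import RequestLoader


async def drain(loader: RequestLoader) -> None:
    """Process every request exposed by a read-only request loader."""
    while not await loader.is_finished():
        request = await loader.fetch_next_request()
        if request is None:
            # Nothing is pending right now, but some requests may still be in flight.
            await asyncio.sleep(0.5)
            continue
        # ... fetch and process request.url here ...
        await loader.mark_request_as_handled(request)
```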
### Hierarchy * *RequestLoader* * [RequestManager](https://crawlee.dev/python/api/class/RequestManager) * [SitemapRequestLoader](https://crawlee.dev/python/api/class/SitemapRequestLoader) * [RequestList](https://crawlee.dev/python/api/class/RequestList) ## Index[**](#Index) ### Methods * [**fetch\_next\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestLoader.md#fetch_next_request) * [**get\_handled\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestLoader.md#get_handled_count) * [**get\_total\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestLoader.md#get_total_count) * [**is\_empty](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestLoader.md#is_empty) * [**is\_finished](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestLoader.md#is_finished) * [**mark\_request\_as\_handled](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestLoader.md#mark_request_as_handled) * [**to\_tandem](https://crawlee.dev/python/api/class/RequestLoader#to_tandem) ## Methods<!-- -->[**](#Methods) ### [**](#fetch_next_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L45)fetch\_next\_request * **async **fetch\_next\_request**(): [Request](https://crawlee.dev/python/api/class/Request) | None - Overrides [RequestManager.fetch\_next\_request](https://crawlee.dev/python/api/class/RequestManager#fetch_next_request) Return the next request to be processed, or `None` if there are no more pending requests. *** #### Returns [Request](https://crawlee.dev/python/api/class/Request) | None ### [**](#get_handled_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L29)get\_handled\_count * **async **get\_handled\_count**(): int - Overrides [RequestManager.get\_handled\_count](https://crawlee.dev/python/api/class/RequestManager#get_handled_count) Get the number of requests in the loader that have been handled. *** #### Returns int ### [**](#get_total_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L33)get\_total\_count * **async **get\_total\_count**(): int - Overrides [RequestManager.get\_total\_count](https://crawlee.dev/python/api/class/RequestManager#get_total_count) Get an offline approximation of the total number of requests in the loader (i.e. pending + handled). *** #### Returns int ### [**](#is_empty)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L37)is\_empty * **async **is\_empty**(): bool - Overrides [RequestManager.is\_empty](https://crawlee.dev/python/api/class/RequestManager#is_empty) Return True if there are no more requests in the loader (there might still be unfinished requests). *** #### Returns bool ### [**](#is_finished)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L41)is\_finished * **async **is\_finished**(): bool - Overrides [RequestManager.is\_finished](https://crawlee.dev/python/api/class/RequestManager#is_finished) Return True if all requests have been handled.
*** #### Returns bool ### [**](#mark_request_as_handled)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L49)mark\_request\_as\_handled * **async **mark\_request\_as\_handled**(request): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None - Overrides [RequestManager.mark\_request\_as\_handled](https://crawlee.dev/python/api/class/RequestManager#mark_request_as_handled) Mark a request as handled after successful processing (or after retrying has been given up). *** #### Parameters * ##### request: [Request](https://crawlee.dev/python/api/class/Request) #### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None --- # RequestManager<!-- --> Base class that extends `RequestLoader` with the capability to enqueue new requests and reclaim failed ones. ### Hierarchy * [RequestLoader](https://crawlee.dev/python/api/class/RequestLoader) * *RequestManager* * [RequestQueue](https://crawlee.dev/python/api/class/RequestQueue) * [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) ## Index[**](#Index) ### Methods * [**add\_request](https://crawlee.dev/python/api/class/RequestManager#add_request) * [**add\_requests](https://crawlee.dev/python/api/class/RequestManager#add_requests) * [**drop](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManager.md#drop) * [**fetch\_next\_request](https://crawlee.dev/python/api/class/RequestManager#fetch_next_request) * [**get\_handled\_count](https://crawlee.dev/python/api/class/RequestManager#get_handled_count) * [**get\_total\_count](https://crawlee.dev/python/api/class/RequestManager#get_total_count) * [**is\_empty](https://crawlee.dev/python/api/class/RequestManager#is_empty) * [**is\_finished](https://crawlee.dev/python/api/class/RequestManager#is_finished) * [**mark\_request\_as\_handled](https://crawlee.dev/python/api/class/RequestManager#mark_request_as_handled) * [**reclaim\_request](https://crawlee.dev/python/api/class/RequestManager#reclaim_request) * [**to\_tandem](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManager.md#to_tandem) ## Methods<!-- -->[**](#Methods) ### [**](#drop)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager.py#L22)drop * **async **drop**(): None - Overrides [Storage.drop](https://crawlee.dev/python/api/class/Storage#drop) Remove persistent state either from the Apify Cloud storage or from the local database.
*** #### Returns None ### [**](#to_tandem)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L52)to\_tandem * **async **to\_tandem**(request\_manager): [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) - Inherited from [RequestLoader.to\_tandem](https://crawlee.dev/python/api/class/RequestLoader#to_tandem) Combine the loader with a request manager to support adding and reclaiming requests. *** #### Parameters * ##### optionalrequest\_manager: [RequestManager](https://crawlee.dev/python/api/class/RequestManager) | None = <!-- -->None Request manager to combine the loader with. If `None` is given, the default request queue is used. #### Returns [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) --- # RequestManagerTandem<!-- --> Implements a tandem behaviour for a pair of `RequestLoader` and `RequestManager`. In this scenario, the contents of the "loader" get transferred into the "manager", allowing the requests from both sources to be processed and new requests to be enqueued (which is not possible with a plain `RequestLoader`).
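A hedged sketch of that arrangement, combining the read-only `RequestList` documented above with a `RequestQueue` via `to_tandem` (import paths and the start URL are assumptions based on the references on this page):

```
import asyncio

from apify.storages import RequestList
from crawlee.storages import RequestQueue


async def main() -> None:
    # Read-only loader built from static start URLs.
    request_list = await RequestList.open(
        request_list_sources_input=[{'url': 'https://crawlee.dev'}],
    )

    # Writable manager; if omitted in to_tandem(), the default request queue is used.
    request_queue = await RequestQueue.open()

    # The tandem transfers requests from the list into the queue and still
    # allows new requests to be enqueued while processing.
    tandem = await request_list.to_tandem(request_queue)

    while not await tandem.is_finished():
        request = await tandem.fetch_next_request()
        if request is None:
            continue
        # ... process the request ...
        await tandem.mark_request_as_handled(request)


asyncio.run(main())
```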
### Hierarchy * [RequestManager](https://crawlee.dev/python/api/class/RequestManager) * *RequestManagerTandem* ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#__init__) * [**add\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#add_request) * [**add\_requests](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#add_requests) * [**drop](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#drop) * [**fetch\_next\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#fetch_next_request) * [**get\_handled\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#get_handled_count) * [**get\_total\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#get_total_count) * [**is\_empty](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#is_empty) * [**is\_finished](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#is_finished) * [**mark\_request\_as\_handled](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#mark_request_as_handled) * [**reclaim\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#reclaim_request) * [**to\_tandem](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestManagerTandem.md#to_tandem) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L31)\_\_init\_\_ * ****\_\_init\_\_**(request\_loader, request\_manager): None - #### Parameters * ##### request\_loader: [RequestLoader](https://crawlee.dev/python/api/class/RequestLoader) * ##### request\_manager: [RequestManager](https://crawlee.dev/python/api/class/RequestManager) #### Returns None ### [**](#add_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L52)add\_request * **async **add\_request**(request, \*, forefront): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) - Overrides [RequestManager.add\_request](https://crawlee.dev/python/api/class/RequestManager#add_request) Add a single request to the manager and store it in underlying resource client. *** #### Parameters * ##### request: str | [Request](https://crawlee.dev/python/api/class/Request) The request object (or its string representation) to be added to the manager. * ##### optionalkeyword-onlyforefront: bool = <!-- -->False Determines whether the request should be added to the beginning (if True) or the end (if False) of the manager. #### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) ### [**](#add_requests)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L56)add\_requests * **async **add\_requests**(requests, \*, forefront, batch\_size, wait\_time\_between\_batches, wait\_for\_all\_requests\_to\_be\_added, wait\_for\_all\_requests\_to\_be\_added\_timeout): None - Overrides [RequestManager.add\_requests](https://crawlee.dev/python/api/class/RequestManager#add_requests) Add requests to the manager in batches. 
*** #### Parameters * ##### requests: Sequence\[str | [Request](https://crawlee.dev/python/api/class/Request)] Requests to enqueue. * ##### optionalkeyword-onlyforefront: bool = <!-- -->False If True, add requests to the beginning of the queue. * ##### optionalkeyword-onlybatch\_size: int = <!-- -->1000 The number of requests to add in one batch. * ##### optionalkeyword-onlywait\_time\_between\_batches: timedelta = <!-- -->timedelta(seconds=1) Time to wait between adding batches. * ##### optionalkeyword-onlywait\_for\_all\_requests\_to\_be\_added: bool = <!-- -->False If True, wait for all requests to be added before returning. * ##### optionalkeyword-onlywait\_for\_all\_requests\_to\_be\_added\_timeout: timedelta | None = <!-- -->None Timeout for waiting for all requests to be added. #### Returns None ### [**](#drop)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L107)drop * **async **drop**(): None - Overrides [Storage.drop](https://crawlee.dev/python/api/class/Storage#drop) Remove persistent state either from the Apify Cloud storage or from the local database. *** #### Returns None ### [**](#fetch_next_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L76)fetch\_next\_request * **async **fetch\_next\_request**(): [Request](https://crawlee.dev/python/api/class/Request) | None - Overrides [RequestManager.fetch\_next\_request](https://crawlee.dev/python/api/class/RequestManager#fetch_next_request) Return the next request to be processed, or `null` if there are no more pending requests. *** #### Returns [Request](https://crawlee.dev/python/api/class/Request) | None ### [**](#get_handled_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L36)get\_handled\_count * **async **get\_handled\_count**(): int - Overrides [RequestManager.get\_handled\_count](https://crawlee.dev/python/api/class/RequestManager#get_handled_count) Get the number of requests in the loader that have been handled. *** #### Returns int ### [**](#get_total_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L40)get\_total\_count * **async **get\_total\_count**(): int - Overrides [RequestManager.get\_total\_count](https://crawlee.dev/python/api/class/RequestManager#get_total_count) Get an offline approximation of the total number of requests in the loader (i.e. pending + handled). *** #### Returns int ### [**](#is_empty)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L44)is\_empty * **async **is\_empty**(): bool - Overrides [RequestManager.is\_empty](https://crawlee.dev/python/api/class/RequestManager#is_empty) Return True if there are no more requests in the loader (there might still be unfinished requests). *** #### Returns bool ### [**](#is_finished)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L48)is\_finished * **async **is\_finished**(): bool - Overrides [RequestManager.is\_finished](https://crawlee.dev/python/api/class/RequestManager#is_finished) Return True if all requests have been handled. 
*** #### Returns bool ### [**](#mark_request_as_handled)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L103)mark\_request\_as\_handled * **async **mark\_request\_as\_handled**(request): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None - Overrides [RequestManager.mark\_request\_as\_handled](https://crawlee.dev/python/api/class/RequestManager#mark_request_as_handled) Mark a request as handled after a successful processing (or after giving up retrying). *** #### Parameters * ##### request: [Request](https://crawlee.dev/python/api/class/Request) #### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None ### [**](#reclaim_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_manager_tandem.py#L99)reclaim\_request * **async **reclaim\_request**(request, \*, forefront): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None - Overrides [RequestManager.reclaim\_request](https://crawlee.dev/python/api/class/RequestManager#reclaim_request) Reclaims a failed request back to the source, so that it can be returned for processing later again. It is possible to modify the request data by supplying an updated request as a parameter. *** #### Parameters * ##### request: [Request](https://crawlee.dev/python/api/class/Request) * ##### optionalkeyword-onlyforefront: bool = <!-- -->False #### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None ### [**](#to_tandem)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L52)to\_tandem * **async **to\_tandem**(request\_manager): [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) - Inherited from [RequestLoader.to\_tandem](https://crawlee.dev/python/api/class/RequestLoader#to_tandem) Combine the loader with a request manager to support adding and reclaiming requests. *** #### Parameters * ##### optionalrequest\_manager: [RequestManager](https://crawlee.dev/python/api/class/RequestManager) | None = <!-- -->None Request manager to combine the loader with. If None is given, the default request queue is used. #### Returns [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) --- # RequestQueue<!-- --> Request queue is a storage for managing HTTP requests. The request queue class serves as a high-level interface for organizing and managing HTTP requests during web crawling. It provides methods for adding, retrieving, and manipulating requests throughout the crawling lifecycle, abstracting away the underlying storage implementation details. Request queue maintains the state of each URL to be crawled, tracking whether it has been processed, is currently being handled, or is waiting in the queue. Each URL in the queue is uniquely identified by a `unique_key` property, which prevents duplicate processing unless explicitly configured otherwise. The class supports both breadth-first and depth-first crawling strategies through its `forefront` parameter when adding requests. It also provides mechanisms for error handling and request reclamation when processing fails. You can open a request queue using the `open` class method, specifying either a name or ID to identify the queue. The underlying storage implementation is determined by the configured storage client. 
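As a small, hedged illustration of the `forefront` parameter mentioned above (the URLs are placeholders; the general processing loop is shown in the Usage section that follows):

```
from crawlee.storages import RequestQueue

rq = await RequestQueue.open(name='my_queue')

# Appended to the end of the queue (default, breadth-first ordering).
await rq.add_request('https://example.com/category')

# Pushed to the front of the queue, so it is fetched before older requests (depth-first ordering).
await rq.add_request('https://example.com/category/item-1', forefront=True)
```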
### Usage ``` from crawlee.storages import RequestQueue # Open a request queue rq = await RequestQueue.open(name='my_queue') # Add a request await rq.add_request('https://example.com') # Process requests request = await rq.fetch_next_request() if request: try: # Process the request # ... await rq.mark_request_as_handled(request) except Exception: await rq.reclaim_request(request) ``` ### Hierarchy * [RequestManager](https://crawlee.dev/python/api/class/RequestManager) * [Storage](https://crawlee.dev/python/api/class/Storage) * *RequestQueue* ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#__init__) * [**add\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#add_request) * [**add\_requests](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#add_requests) * [**drop](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#drop) * [**fetch\_next\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#fetch_next_request) * [**get\_handled\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#get_handled_count) * [**get\_metadata](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#get_metadata) * [**get\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#get_request) * [**get\_total\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#get_total_count) * [**is\_empty](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#is_empty) * [**is\_finished](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#is_finished) * [**mark\_request\_as\_handled](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#mark_request_as_handled) * [**open](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#open) * [**purge](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#purge) * [**reclaim\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#reclaim_request) * [**to\_tandem](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#to_tandem) ### Properties * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#id) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueue.md#name) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L73)\_\_init\_\_ * ****\_\_init\_\_**(client, id, name): None - Initialize a new instance. Preferably use the `RequestQueue.open` constructor to create a new instance. *** #### Parameters * ##### client: [RequestQueueClient](https://crawlee.dev/python/api/class/RequestQueueClient) An instance of a storage client. * ##### id: str The unique identifier of the storage. * ##### name: str | None The name of the storage, if available. 
#### Returns None ### [**](#add_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L148)add\_request * **async **add\_request**(request, \*, forefront): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) - Overrides [RequestManager.add\_request](https://crawlee.dev/python/api/class/RequestManager#add_request) Add a single request to the manager and store it in underlying resource client. *** #### Parameters * ##### request: str | [Request](https://crawlee.dev/python/api/class/Request) The request object (or its string representation) to be added to the manager. * ##### optionalkeyword-onlyforefront: bool = <!-- -->False Determines whether the request should be added to the beginning (if True) or the end (if False) of the manager. #### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) ### [**](#add_requests)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L159)add\_requests * **async **add\_requests**(requests, \*, forefront, batch\_size, wait\_time\_between\_batches, wait\_for\_all\_requests\_to\_be\_added, wait\_for\_all\_requests\_to\_be\_added\_timeout): None - Overrides [RequestManager.add\_requests](https://crawlee.dev/python/api/class/RequestManager#add_requests) Add requests to the manager in batches. *** #### Parameters * ##### requests: Sequence\[str | [Request](https://crawlee.dev/python/api/class/Request)] Requests to enqueue. * ##### optionalkeyword-onlyforefront: bool = <!-- -->False If True, add requests to the beginning of the queue. * ##### optionalkeyword-onlybatch\_size: int = <!-- -->1000 The number of requests to add in one batch. * ##### optionalkeyword-onlywait\_time\_between\_batches: timedelta = <!-- -->timedelta(seconds=1) Time to wait between adding batches. * ##### optionalkeyword-onlywait\_for\_all\_requests\_to\_be\_added: bool = <!-- -->False If True, wait for all requests to be added before returning. * ##### optionalkeyword-onlywait\_for\_all\_requests\_to\_be\_added\_timeout: timedelta | None = <!-- -->None Timeout for waiting for all requests to be added. #### Returns None ### [**](#drop)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L136)drop * **async **drop**(): None - Overrides [Storage.drop](https://crawlee.dev/python/api/class/Storage#drop) Remove persistent state either from the Apify Cloud storage or from the local database. *** #### Returns None ### [**](#fetch_next_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L209)fetch\_next\_request * **async **fetch\_next\_request**(): [Request](https://crawlee.dev/python/api/class/Request) | None - Overrides [RequestManager.fetch\_next\_request](https://crawlee.dev/python/api/class/RequestManager#fetch_next_request) Return the next request in the queue to be processed. Once you successfully finish processing of the request, you need to call `RequestQueue.mark_request_as_handled` to mark the request as handled in the queue. If there was some error in processing the request, call `RequestQueue.reclaim_request` instead, so that the queue will give the request to some other consumer in another call to the `fetch_next_request` method. Note that the `None` return value does not mean the queue processing finished, it means there are currently no pending requests. 
To check whether all requests in queue were finished, use `RequestQueue.is_finished` instead. *** #### Returns [Request](https://crawlee.dev/python/api/class/Request) | None ### [**](#get_handled_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L105)get\_handled\_count * **async **get\_handled\_count**(): int - Overrides [RequestManager.get\_handled\_count](https://crawlee.dev/python/api/class/RequestManager#get_handled_count) Get the number of requests in the loader that have been handled. *** #### Returns int ### [**](#get_metadata)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L101)get\_metadata * **async **get\_metadata**(): ([DatasetMetadata](https://crawlee.dev/python/api/class/DatasetMetadata) | [KeyValueStoreMetadata](https://crawlee.dev/python/api/class/KeyValueStoreMetadata)) | [RequestQueueMetadata](https://crawlee.dev/python/api/class/RequestQueueMetadata) - Overrides [Storage.get\_metadata](https://crawlee.dev/python/api/class/Storage#get_metadata) Get the storage metadata. *** #### Returns ([DatasetMetadata](https://crawlee.dev/python/api/class/DatasetMetadata) | [KeyValueStoreMetadata](https://crawlee.dev/python/api/class/KeyValueStoreMetadata)) | [RequestQueueMetadata](https://crawlee.dev/python/api/class/RequestQueueMetadata) ### [**](#get_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L226)get\_request * **async **get\_request**(request\_id): [Request](https://crawlee.dev/python/api/class/Request) | None - Retrieve a specific request from the queue by its ID. *** #### Parameters * ##### request\_id: str The ID of the request to retrieve. #### Returns [Request](https://crawlee.dev/python/api/class/Request) | None ### [**](#get_total_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L110)get\_total\_count * **async **get\_total\_count**(): int - Overrides [RequestManager.get\_total\_count](https://crawlee.dev/python/api/class/RequestManager#get_total_count) Get an offline approximation of the total number of requests in the loader (i.e. pending + handled). *** #### Returns int ### [**](#is_empty)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L274)is\_empty * **async **is\_empty**(): bool - Overrides [RequestManager.is\_empty](https://crawlee.dev/python/api/class/RequestManager#is_empty) Check if the request queue is empty. An empty queue means that there are no requests currently in the queue, either pending or being processed. However, this does not necessarily mean that the crawling operation is finished, as there still might be tasks that could add additional requests to the queue. *** #### Returns bool ### [**](#is_finished)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L286)is\_finished * **async **is\_finished**(): bool - Overrides [RequestManager.is\_finished](https://crawlee.dev/python/api/class/RequestManager#is_finished) Check if the request queue is finished. A finished queue means that all requests in the queue have been processed (the queue is empty) and there are no more tasks that could add additional requests to the queue. This is the definitive way to check if a crawling operation is complete. 
*** #### Returns bool ### [**](#mark_request_as_handled)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L237)mark\_request\_as\_handled * **async **mark\_request\_as\_handled**(request): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None - Overrides [RequestManager.mark\_request\_as\_handled](https://crawlee.dev/python/api/class/RequestManager#mark_request_as_handled) Mark a request as handled after successful processing. This method should be called after a request has been successfully processed. Once marked as handled, the request will be removed from the queue and will not be returned in subsequent calls to `fetch_next_request` method. *** #### Parameters * ##### request: [Request](https://crawlee.dev/python/api/class/Request) The request to mark as handled. #### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None ### [**](#open)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L116)open * **async **open**(\*, id, name, configuration, storage\_client): [Storage](https://crawlee.dev/python/api/class/Storage) - Overrides [Storage.open](https://crawlee.dev/python/api/class/Storage#open) Open a storage, either restore existing or create a new one. *** #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None The storage ID. * ##### optionalkeyword-onlyname: str | None = <!-- -->None The storage name. * ##### optionalkeyword-onlyconfiguration: [Configuration](https://crawlee.dev/python/api/class/Configuration) | None = <!-- -->None Configuration object used during the storage creation or restoration process. * ##### optionalkeyword-onlystorage\_client: [StorageClient](https://crawlee.dev/python/api/class/StorageClient) | None = <!-- -->None Underlying storage client to use. If not provided, the default global storage client from the service locator will be used. #### Returns [Storage](https://crawlee.dev/python/api/class/Storage) ### [**](#purge)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L144)purge * **async **purge**(): None - Overrides [Storage.purge](https://crawlee.dev/python/api/class/Storage#purge) Purge the storage, removing all items from the underlying storage client. This method does not remove the storage itself, e.g. don't remove the metadata, but clears all items within it. *** #### Returns None ### [**](#reclaim_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L252)reclaim\_request * **async **reclaim\_request**(request, \*, forefront): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None - Overrides [RequestManager.reclaim\_request](https://crawlee.dev/python/api/class/RequestManager#reclaim_request) Reclaim a failed request back to the queue for later processing. If a request fails during processing, this method can be used to return it to the queue. The request will be returned for processing again in a subsequent call to `RequestQueue.fetch_next_request`. *** #### Parameters * ##### request: [Request](https://crawlee.dev/python/api/class/Request) The request to return to the queue. * ##### optionalkeyword-onlyforefront: bool = <!-- -->False If true, the request will be added to the beginning of the queue. Otherwise, it will be added to the end. 
#### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None ### [**](#to_tandem)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L52)to\_tandem * **async **to\_tandem**(request\_manager): [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) - Inherited from [RequestLoader.to\_tandem](https://crawlee.dev/python/api/class/RequestLoader#to_tandem) Combine the loader with a request manager to support adding and reclaiming requests. *** #### Parameters * ##### optionalrequest\_manager: [RequestManager](https://crawlee.dev/python/api/class/RequestManager) | None = <!-- -->None Request manager to combine the loader with. If None is given, the default request queue is used. #### Returns [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) ## Properties<!-- -->[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L92)id **id: str Overrides [Storage.id](https://crawlee.dev/python/api/class/Storage#id) Get the storage ID. ### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_request_queue.py#L97)name **name: str | None Overrides [Storage.name](https://crawlee.dev/python/api/class/Storage#name) Get the storage name. --- # RequestQueueClient<!-- --> Request queue resource client implementation based on the Apify platform storage. ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#__init__) * [**add\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#add_request) * [**batch\_add\_requests](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#batch_add_requests) * [**batch\_delete\_requests](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#batch_delete_requests) * [**delete](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#delete) * [**delete\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#delete_request) * [**delete\_request\_lock](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#delete_request_lock) * [**get](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#get) * [**get\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#get_request) * [**list\_and\_lock\_head](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#list_and_lock_head) * [**list\_head](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#list_head) * [**prolong\_request\_lock](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#prolong_request_lock) * [**update](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#update) * [**update\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueClient.md#update_request) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L27)\_\_init\_\_ * ****\_\_init\_\_**(apify\_request\_queue\_client): None - #### Parameters * ##### apify\_request\_queue\_client: RequestQueueClientAsync #### Returns None ### 
[**](#add_request)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L70)add\_request * **async **add\_request**(request, \*, forefront): ProcessedRequest - #### Parameters * ##### request: Request * ##### optionalkeyword-onlyforefront: bool = <!-- -->False #### Returns ProcessedRequest ### [**](#batch_add_requests)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L144)batch\_add\_requests * **async **batch\_add\_requests**(requests, \*, forefront): BatchRequestsOperationResponse - #### Parameters * ##### requests: Sequence\[Request] * ##### optionalkeyword-onlyforefront: bool = <!-- -->False #### Returns BatchRequestsOperationResponse ### [**](#batch_delete_requests)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L166)batch\_delete\_requests * **async **batch\_delete\_requests**(requests): BatchRequestsOperationResponse - #### Parameters * ##### requests: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[Request] #### Returns BatchRequestsOperationResponse ### [**](#delete)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L49)delete * **async **delete**(): None - #### Returns None ### [**](#delete_request)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L112)delete\_request * **async **delete\_request**(request\_id): None - #### Parameters * ##### request\_id: str #### Returns None ### [**](#delete_request_lock)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L132)delete\_request\_lock * **async **delete\_request\_lock**(request\_id, \*, forefront): None - #### Parameters * ##### request\_id: str * ##### optionalkeyword-onlyforefront: bool = <!-- -->False #### Returns None ### [**](#get)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L31)get * **async **get**(): RequestQueueMetadata | None - #### Returns RequestQueueMetadata | None ### [**](#get_request)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L90)get\_request * **async **get\_request**(request\_id): Request | None - #### Parameters * ##### request\_id: str #### Returns Request | None ### [**](#list_and_lock_head)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L61)list\_and\_lock\_head * **async **list\_and\_lock\_head**(\*, lock\_secs, limit): RequestQueueHeadWithLocks - #### Parameters * ##### keyword-onlylock\_secs: int * ##### optionalkeyword-onlylimit: int | None = <!-- -->None #### Returns RequestQueueHeadWithLocks ### [**](#list_head)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L53)list\_head * **async **list\_head**(\*, limit): RequestQueueHead - #### Parameters * ##### optionalkeyword-onlylimit: int | None = <!-- -->None #### Returns RequestQueueHead ### [**](#prolong_request_lock)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L116)prolong\_request\_lock * **async **prolong\_request\_lock**(request\_id, \*, forefront, lock\_secs): 
ProlongRequestLockResponse - #### Parameters * ##### request\_id: str * ##### optionalkeyword-onlyforefront: bool = <!-- -->False * ##### keyword-onlylock\_secs: int #### Returns ProlongRequestLockResponse ### [**](#update)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L36)update * **async **update**(\*, name): RequestQueueMetadata - #### Parameters * ##### optionalkeyword-onlyname: str | None = <!-- -->None #### Returns RequestQueueMetadata ### [**](#update_request)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_client.py#L95)update\_request * **async **update\_request**(request, \*, forefront): ProcessedRequest - #### Parameters * ##### request: Request * ##### optionalkeyword-onlyforefront: bool = <!-- -->False #### Returns ProcessedRequest --- # RequestQueueCollectionClient<!-- --> Request queue collection resource client implementation based on the Apify platform storage. ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueCollectionClient.md#__init__) * [**get\_or\_create](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueCollectionClient.md#get_or_create) * [**list](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueCollectionClient.md#list) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_collection_client.py#L17)\_\_init\_\_ * ****\_\_init\_\_**(apify\_request\_queue\_collection\_client): None - #### Parameters * ##### apify\_request\_queue\_collection\_client: RequestQueueCollectionClientAsync #### Returns None ### [**](#get_or_create)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_collection_client.py#L21)get\_or\_create * **async **get\_or\_create**(\*, id, name, schema): RequestQueueMetadata - #### Parameters * ##### optionalkeyword-onlyid: str | None = <!-- -->None * ##### optionalkeyword-onlyname: str | None = <!-- -->None * ##### optionalkeyword-onlyschema: dict | None = <!-- -->None #### Returns RequestQueueMetadata ### [**](#list)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/apify_storage_client/_request_queue_collection_client.py#L36)list * **async **list**(\*, unnamed, limit, offset, desc): RequestQueueListPage - #### Parameters * ##### optionalkeyword-onlyunnamed: bool = <!-- -->False * ##### optionalkeyword-onlylimit: int | None = <!-- -->None * ##### optionalkeyword-onlyoffset: int | None = <!-- -->None * ##### optionalkeyword-onlydesc: bool = <!-- -->False #### Returns RequestQueueListPage --- # RequestQueueMetadata<!-- --> Model for a request queue metadata. 
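A brief, hedged sketch of obtaining this model through `RequestQueue.get_metadata` and reading the counters documented below (the queue name is illustrative):

```
from crawlee.storages import RequestQueue

rq = await RequestQueue.open(name='my_queue')

# For a request queue, get_metadata() returns a RequestQueueMetadata instance.
metadata = await rq.get_metadata()
print(f'Total:   {metadata.total_request_count}')
print(f'Handled: {metadata.handled_request_count}')
print(f'Pending: {metadata.pending_request_count}')
```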
### Hierarchy * [StorageMetadata](https://crawlee.dev/python/api/class/StorageMetadata) * *RequestQueueMetadata* ## Index[**](#Index) ### Properties * [**accessed\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#accessed_at) * [**created\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#created_at) * [**had\_multiple\_clients](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#had_multiple_clients) * [**handled\_request\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#handled_request_count) * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#id) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#model_config) * [**modified\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#modified_at) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#name) * [**pending\_request\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#pending_request_count) * [**total\_request\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/RequestQueueMetadata.md#total_request_count) ## Properties<!-- -->[**](#Properties) ### [**](#accessed_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L31)accessed\_at **accessed\_at: datetime Inherited from [StorageMetadata.accessed\_at](https://crawlee.dev/python/api/class/StorageMetadata#accessed_at) The timestamp when the storage was last accessed. ### [**](#created_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L34)created\_at **created\_at: datetime Inherited from [StorageMetadata.created\_at](https://crawlee.dev/python/api/class/StorageMetadata#created_at) The timestamp when the storage was created. ### [**](#had_multiple_clients)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L64)had\_multiple\_clients **had\_multiple\_clients: bool Indicates whether the queue has been accessed by multiple clients (consumers). ### [**](#handled_request_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L67)handled\_request\_count **handled\_request\_count: int The number of requests that have been handled from the queue. ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L25)id **id: str Inherited from [StorageMetadata.id](https://crawlee.dev/python/api/class/StorageMetadata#id) The unique identifier of the storage. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L62)model\_config **model\_config: Undefined Overrides [StorageMetadata.model\_config](https://crawlee.dev/python/api/class/StorageMetadata#model_config) ### [**](#modified_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L37)modified\_at **modified\_at: datetime Inherited from [StorageMetadata.modified\_at](https://crawlee.dev/python/api/class/StorageMetadata#modified_at) The timestamp when the storage was last modified. 
### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L28)name **name: str | None Inherited from [StorageMetadata.name](https://crawlee.dev/python/api/class/StorageMetadata#name) The name of the storage. ### [**](#pending_request_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L70)pending\_request\_count **pending\_request\_count: int The number of requests that are still pending in the queue. ### [**](#total_request_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L73)total\_request\_count **total\_request\_count: int The total number of requests that have been added to the queue. --- # SitemapRequestLoader<!-- --> A request loader that reads URLs from sitemap(s). The loader fetches and parses sitemaps in the background, allowing crawling to start before all URLs are loaded. It supports filtering URLs using glob and regex patterns. ### Hierarchy * [RequestLoader](https://crawlee.dev/python/api/class/RequestLoader) * *SitemapRequestLoader* ## Index[**](#Index) ### Methods * [**\_\_init\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#__init__) * [**abort\_loading](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#abort_loading) * [**fetch\_next\_request](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#fetch_next_request) * [**get\_handled\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#get_handled_count) * [**get\_total\_count](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#get_total_count) * [**is\_empty](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#is_empty) * [**is\_finished](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#is_finished) * [**mark\_request\_as\_handled](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#mark_request_as_handled) * [**to\_tandem](https://docs.apify.com/sdk/python/sdk/python/reference/class/SitemapRequestLoader.md#to_tandem) ## Methods<!-- -->[**](#Methods) ### [**](#__init__)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L34)\_\_init\_\_ * ****\_\_init\_\_**(sitemap\_urls, http\_client, \*, proxy\_info, include, exclude, max\_buffer\_size, parse\_sitemap\_options): None - Initialize the sitemap request loader. *** #### Parameters * ##### sitemap\_urls: list\[str] Configuration options for the loader. * ##### http\_client: [HttpClient](https://crawlee.dev/python/api/class/HttpClient) the instance of `HttpClient` to use for fetching sitemaps. * ##### optionalkeyword-onlyproxy\_info: [ProxyInfo](https://crawlee.dev/python/api/class/ProxyInfo) | None = <!-- -->None Optional proxy to use for fetching sitemaps. * ##### optionalkeyword-onlyinclude: list\[re.Pattern\[Any] | [Glob](https://crawlee.dev/python/api/class/Glob)] | None = <!-- -->None List of glob or regex patterns to include URLs. * ##### optionalkeyword-onlyexclude: list\[re.Pattern\[Any] | [Glob](https://crawlee.dev/python/api/class/Glob)] | None = <!-- -->None List of glob or regex patterns to exclude URLs. * ##### optionalkeyword-onlymax\_buffer\_size: int = <!-- -->200 Maximum number of URLs to buffer in memory. 
* ##### optionalkeyword-onlyparse\_sitemap\_options: [ParseSitemapOptions](https://crawlee.dev/python/api/class/ParseSitemapOptions) | None = <!-- -->None Options for parsing sitemaps, such as `SitemapSource` and `max_urls`. #### Returns None ### [**](#abort_loading)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L172)abort\_loading * **async **abort\_loading**(): None - Abort the sitemap loading process. *** #### Returns None ### [**](#fetch_next_request)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L146)fetch\_next\_request * **async **fetch\_next\_request**(): [Request](https://crawlee.dev/python/api/class/Request) | None - Overrides [RequestLoader.fetch\_next\_request](https://crawlee.dev/python/api/class/RequestLoader#fetch_next_request) Fetch the next request to process. *** #### Returns [Request](https://crawlee.dev/python/api/class/Request) | None ### [**](#get_handled_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L168)get\_handled\_count * **async **get\_handled\_count**(): int - Overrides [RequestLoader.get\_handled\_count](https://crawlee.dev/python/api/class/RequestLoader#get_handled_count) Return the number of handled requests. *** #### Returns int ### [**](#get_total_count)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L134)get\_total\_count * **async **get\_total\_count**(): int - Overrides [RequestLoader.get\_total\_count](https://crawlee.dev/python/api/class/RequestLoader#get_total_count) Return the total number of URLs found so far. *** #### Returns int ### [**](#is_empty)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L138)is\_empty * **async **is\_empty**(): bool - Overrides [RequestLoader.is\_empty](https://crawlee.dev/python/api/class/RequestLoader#is_empty) Check if there are no more URLs to process. *** #### Returns bool ### [**](#is_finished)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L142)is\_finished * **async **is\_finished**(): bool - Overrides [RequestLoader.is\_finished](https://crawlee.dev/python/api/class/RequestLoader#is_finished) Check if all URLs have been processed. *** #### Returns bool ### [**](#mark_request_as_handled)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_sitemap_request_loader.py#L161)mark\_request\_as\_handled * **async **mark\_request\_as\_handled**(request): [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None - Overrides [RequestLoader.mark\_request\_as\_handled](https://crawlee.dev/python/api/class/RequestLoader#mark_request_as_handled) Mark a request as successfully handled. 
*** #### Parameters * ##### request: [Request](https://crawlee.dev/python/api/class/Request) #### Returns [ProcessedRequest](https://crawlee.dev/python/api/class/ProcessedRequest) | None ### [**](#to_tandem)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/request_loaders/_request_loader.py#L52)to\_tandem * **async **to\_tandem**(request\_manager): [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) - Inherited from [RequestLoader.to\_tandem](https://crawlee.dev/python/api/class/RequestLoader#to_tandem) Combine the loader with a request manager to support adding and reclaiming requests. *** #### Parameters * ##### optionalrequest\_manager: [RequestManager](https://crawlee.dev/python/api/class/RequestManager) | None = <!-- -->None Request manager to combine the loader with. If None is given, the default request queue is used. #### Returns [RequestManagerTandem](https://crawlee.dev/python/api/class/RequestManagerTandem) --- # Storage<!-- --> Base class for storages. ### Hierarchy * *Storage* * [RequestQueue](https://crawlee.dev/python/api/class/RequestQueue) * [Dataset](https://crawlee.dev/python/api/class/Dataset) * [KeyValueStore](https://crawlee.dev/python/api/class/KeyValueStore) ## Index[**](#Index) ### Methods * [](https://crawlee.dev/python/api/class/Storage#drop) * [](https://crawlee.dev/python/api/class/Storage#get_metadata) * [](https://crawlee.dev/python/api/class/Storage#open) * [](https://crawlee.dev/python/api/class/Storage#purge) ### Properties * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/Storage.md#id) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/Storage.md#name) ## Methods<!-- -->[**](#Methods) ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storages/_base.py#L53) : ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storages/_base.py#L29) : ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storages/_base.py#L34) : ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storages/_base.py#L57) : ## Properties<!-- -->[**](#Properties) ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_base.py#L20)id **id: str Get the storage ID. ### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storages/_base.py#L25)name **name: str | None Get the storage name. --- # StorageClient<!-- --> Base class for storage clients. The `StorageClient` serves as an abstract base class that defines the interface for accessing Crawlee's storage types: datasets, key-value stores, and request queues. It provides methods to open clients for each of these storage types and handles common functionality. Storage clients implementations can be provided for various backends (file system, memory, databases, various cloud providers, etc.) to support different use cases from development to production environments. Each storage client implementation is responsible for ensuring proper initialization, data persistence (where applicable), and consistent access patterns across all storage types it supports. 
### Hierarchy * *StorageClient* * [MemoryStorageClient](https://crawlee.dev/python/api/class/MemoryStorageClient) * [FileSystemStorageClient](https://crawlee.dev/python/api/class/FileSystemStorageClient) ## Index[**](#Index) ### Methods * [](https://crawlee.dev/python/api/class/StorageClient#create_dataset_client) * [](https://crawlee.dev/python/api/class/StorageClient#create_kvs_client) * [](https://crawlee.dev/python/api/class/StorageClient#create_rq_client) * [](https://crawlee.dev/python/api/class/StorageClient#get_rate_limit_errors) ## Methods<!-- -->[**](#Methods) ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storage_clients/_base/_storage_client.py#L32) : ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storage_clients/_base/_storage_client.py#L42) : ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storage_clients/_base/_storage_client.py#L52) : ### [**](#undefined)[**](https://github.com/apify/apify-sdk-python/blob/master//src/crawlee/storage_clients/_base/_storage_client.py#L61) : --- # StorageMetadata<!-- --> Represents the base model for storage metadata. It contains common fields shared across all specific storage types. ### Hierarchy * *StorageMetadata* * [DatasetMetadata](https://crawlee.dev/python/api/class/DatasetMetadata) * [KeyValueStoreMetadata](https://crawlee.dev/python/api/class/KeyValueStoreMetadata) * [RequestQueueMetadata](https://crawlee.dev/python/api/class/RequestQueueMetadata) ## Index[**](#Index) ### Properties * [**accessed\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageMetadata.md#accessed_at) * [**created\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageMetadata.md#created_at) * [**id](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageMetadata.md#id) * [**model\_config](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageMetadata.md#model_config) * [**modified\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageMetadata.md#modified_at) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/StorageMetadata.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#accessed_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L31)accessed\_at **accessed\_at: datetime The timestamp when the storage was last accessed. ### [**](#created_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L34)created\_at **created\_at: datetime The timestamp when the storage was created. ### [**](#id)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L25)id **id: str The unique identifier of the storage. ### [**](#model_config)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L23)model\_config **model\_config: Undefined ### [**](#modified_at)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L37)modified\_at **modified\_at: datetime The timestamp when the storage was last modified. ### [**](#name)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/storage_clients/models.py#L28)name **name: str | None The name of the storage. 
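For illustration, a minimal sketch of reading these metadata fields from an opened storage (assuming an Actor context and that `get_metadata()` is awaited like the other storage coroutines; the log message is illustrative only):

```
from apify import Actor

async def main() -> None:
    async with Actor:
        dataset = await Actor.open_dataset()
        # Returns a DatasetMetadata instance, which extends StorageMetadata.
        metadata = await dataset.get_metadata()
        Actor.log.info(f'Dataset {metadata.name or metadata.id} last accessed at {metadata.accessed_at}')
```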
--- # SystemInfoEvent<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEvent.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEvent.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L69)data **data: [SystemInfoEventData](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md) ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L68)name **name: Literal\[Event.SYSTEM\_INFO] --- # SystemInfoEventData<!-- --> ## Index[**](#Index) ### Methods * [**to\_crawlee\_format](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#to_crawlee_format) ### Properties * [**cpu\_avg\_usage](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#cpu_avg_usage) * [**cpu\_current\_usage](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#cpu_current_usage) * [**cpu\_max\_usage](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#cpu_max_usage) * [**created\_at](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#created_at) * [**is\_cpu\_overloaded](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#is_cpu_overloaded) * [**mem\_avg\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#mem_avg_bytes) * [**mem\_current\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#mem_current_bytes) * [**mem\_max\_bytes](https://docs.apify.com/sdk/python/sdk/python/reference/class/SystemInfoEventData.md#mem_max_bytes) ## Methods<!-- -->[**](#Methods) ### [**](#to_crawlee_format)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L44)to\_crawlee\_format * ****to\_crawlee\_format**(dedicated\_cpus): EventSystemInfoData - #### Parameters * ##### dedicated\_cpus: float #### Returns EventSystemInfoData ## Properties<!-- -->[**](#Properties) ### [**](#cpu_avg_usage)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L38)cpu\_avg\_usage **cpu\_avg\_usage: float ### [**](#cpu_current_usage)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L40)cpu\_current\_usage **cpu\_current\_usage: float ### [**](#cpu_max_usage)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L39)cpu\_max\_usage **cpu\_max\_usage: float ### [**](#created_at)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L42)created\_at **created\_at: datetime ### [**](#is_cpu_overloaded)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L41)is\_cpu\_overloaded **is\_cpu\_overloaded: bool ### [**](#mem_avg_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L35)mem\_avg\_bytes **mem\_avg\_bytes: float ### [**](#mem_current_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L36)mem\_current\_bytes **mem\_current\_bytes: float ### 
[**](#mem_max_bytes)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L37)mem\_max\_bytes **mem\_max\_bytes: float --- # UnknownEvent<!-- --> ## Index[**](#Index) ### Properties * [**data](https://docs.apify.com/sdk/python/sdk/python/reference/class/UnknownEvent.md#data) * [**name](https://docs.apify.com/sdk/python/sdk/python/reference/class/UnknownEvent.md#name) ## Properties<!-- -->[**](#Properties) ### [**](#data)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L112)data **data: dict\[str, Any] ### [**](#name)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_platform_event_manager.py#L111)name **name: str --- # Webhook<!-- --> ## Index[**](#Index) ### Properties * [**\_\_model\_config\_\_](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md#__model_config__) * [**event\_types](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md#event_types) * [**payload\_template](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md#payload_template) * [**request\_url](https://docs.apify.com/sdk/python/sdk/python/reference/class/Webhook.md#request_url) ## Properties<!-- -->[**](#Properties) ### [**](#__model_config__)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L21)\_\_model\_config\_\_ **\_\_model\_config\_\_: Undefined ### [**](#event_types)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L23)event\_types **event\_types: [list](https://docs.apify.com/sdk/python/sdk/python/reference/class/DatasetCollectionClient.md#list)\[WebhookEventType] ### [**](#payload_template)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L32)payload\_template **payload\_template: str | None ### [**](#request_url)[**](https://github.com/apify/apify-sdk-python/blob/master//src/apify/_models.py#L27)request\_url **request\_url: str --- # Event<!-- --> Names of all possible events that can be emitted using an `EventManager`. 
## Index[**](#Index) ### Enumeration members * [**ABORTING](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#ABORTING) * [**BROWSER\_CLOSED](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#BROWSER_CLOSED) * [**BROWSER\_LAUNCHED](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#BROWSER_LAUNCHED) * [**BROWSER\_RETIRED](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#BROWSER_RETIRED) * [**CRAWLER\_STATUS](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#CRAWLER_STATUS) * [**EXIT](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#EXIT) * [**MIGRATING](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#MIGRATING) * [**PAGE\_CLOSED](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#PAGE_CLOSED) * [**PAGE\_CREATED](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#PAGE_CREATED) * [**PERSIST\_STATE](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#PERSIST_STATE) * [**SESSION\_RETIRED](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#SESSION_RETIRED) * [**SYSTEM\_INFO](https://docs.apify.com/sdk/python/sdk/python/reference/enum/Event.md#SYSTEM_INFO) ## Enumeration members<!-- -->[**](<#Enumeration members>) ### [**](#ABORTING)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L22)ABORTING **ABORTING: 'aborting' ### [**](#BROWSER_CLOSED)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L31)BROWSER\_CLOSED **BROWSER\_CLOSED: 'browserClosed' ### [**](#BROWSER_LAUNCHED)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L29)BROWSER\_LAUNCHED **BROWSER\_LAUNCHED: 'browserLaunched' ### [**](#BROWSER_RETIRED)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L30)BROWSER\_RETIRED **BROWSER\_RETIRED: 'browserRetired' ### [**](#CRAWLER_STATUS)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L36)CRAWLER\_STATUS **CRAWLER\_STATUS: 'crawlerStatus' ### [**](#EXIT)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L23)EXIT **EXIT: 'exit' ### [**](#MIGRATING)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L21)MIGRATING **MIGRATING: 'migrating' ### [**](#PAGE_CLOSED)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L33)PAGE\_CLOSED **PAGE\_CLOSED: 'pageClosed' ### [**](#PAGE_CREATED)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L32)PAGE\_CREATED **PAGE\_CREATED: 'pageCreated' ### [**](#PERSIST_STATE)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L19)PERSIST\_STATE **PERSIST\_STATE: 'persistState' ### [**](#SESSION_RETIRED)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L26)SESSION\_RETIRED **SESSION\_RETIRED: 'sessionRetired' ### [**](#SYSTEM_INFO)[**](https://github.com/apify/crawlee-python/blob/master//src/crawlee/events/_types.py#L20)SYSTEM\_INFO **SYSTEM\_INFO: 'systemInfo'
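The enumeration values above are what you pass to `Actor.on` when subscribing to platform events. A minimal sketch (the handler name and the persisted `state` dict are illustrative, not part of the reference above):

```
from apify import Actor, Event

async def main() -> None:
    async with Actor:
        state = {'processed': 0}

        async def save_state(event_data) -> None:
            # Persist progress so a migrated or resurrected run can pick up where it left off.
            await Actor.set_value('STATE', state)

        Actor.on(Event.PERSIST_STATE, save_state)
        # ... the Actor's main work would update state['processed'] here ...
```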
--- # Apify SDK for Python is a toolkit for building Actors The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor event handling. ``` apify create my-python-actor ``` For example, the Apify SDK makes it easy to read the Actor input with the `Actor.get_input()` method, and to save scraped data from your Actors to a dataset by simply using the `Actor.push_data()` method. ``` import httpx from bs4 import BeautifulSoup from apify import Actor async def main() -> None: async with Actor: actor_input = await Actor.get_input() async with httpx.AsyncClient() as client: response = await client.get(actor_input['url']) soup = BeautifulSoup(response.content, 'html.parser') data = {'url': actor_input['url'], 'title': soup.title.string if soup.title else None} await Actor.push_data(data) ```
--- # CLI | Apify Documentation ## cli - [Search the documentation](https://docs.apify.com/cli/search.md) - [Apify CLI](https://docs.apify.com/cli/docs.md): Apify command-line interface (Apify CLI) helps you create, develop, build and run - [Changelog](https://docs.apify.com/cli/docs/changelog.md): v0.20.11 - [Installation](https://docs.apify.com/cli/docs/installation.md): Learn how to install Apify CLI, and how to create, run, and manage Actors through it.
- [Integrating Scrapy projects](https://docs.apify.com/cli/docs/integrating-scrapy.md): Learn how to run Scrapy projects as Apify Actors and deploy them on the Apify platform. - [Apify CLI Reference Documentation](https://docs.apify.com/cli/docs/reference.md): The Apify CLI provides tools for managing your Apify projects and resources from the command line. Use these commands to develop Actors locally, deploy them to Apify platform, manage storage, orchestrate runs, and handle account configuration. - [Telemetry](https://docs.apify.com/cli/docs/telemetry.md): Apify collects telemetry data about the general usage of the CLI to help us improve the product. Participation in this program is optional and you may opt out if you prefer not to share any information. - [Troubleshooting](https://docs.apify.com/cli/docs/troubleshooting.md): For general support, reach out to us at apify.com/contact. - [Environment variables](https://docs.apify.com/cli/docs/vars.md): There are two options how you can set up environment variables for Actors. - [Apify command-line interface (CLI)](https://docs.apify.com/cli/index.md)
--- # Full Documentation Content
--- # Apify CLI Apify command-line interface (Apify CLI) helps you create, develop, build and run [Apify Actors](https://apify.com/actors), and
manage the Apify cloud platform from any computer. Apify Actors are cloud programs that can perform arbitrary web scraping, automation or data processing job. They accept input, perform their job and generate output. While you can develop Actors in an online IDE directly in the [Apify web application](https://console.apify.com/), for complex projects it is more convenient to develop Actors locally on your computer using [Apify SDK](https://github.com/apify/apify-sdk-js) and only push the Actors to the Apify cloud during deployment. This is where the Apify CLI comes in. Note that Actors running on the Apify platform are executed in Docker containers, so with an appropriate `Dockerfile` you can build your Actors in any programming language. However, we recommend using JavaScript / Node.js, for which we provide most libraries and support. --- # Changelog ### [0.21.10-beta.22](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.22)[​](#02110-beta22 "Direct link to heading") ### [0.21.10-beta.21](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.21)[​](#02110-beta21 "Direct link to heading") ### [0.21.10-beta.20](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.20)[​](#02110-beta20 "Direct link to heading") ### [0.21.10-beta.19](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.19)[​](#02110-beta19 "Direct link to heading") ### [0.21.10-beta.18](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.18)[​](#02110-beta18 "Direct link to heading") ### [0.21.10-beta.17](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.17)[​](#02110-beta17 "Direct link to heading") ### [0.21.10-beta.16](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.16)[​](#02110-beta16 "Direct link to heading") ### [0.21.10-beta.15](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.15)[​](#02110-beta15 "Direct link to heading") ### [0.21.10-beta.14](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.14)[​](#02110-beta14 "Direct link to heading") ### [0.21.10-beta.13](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.13)[​](#02110-beta13 "Direct link to heading") ### [0.21.10-beta.12](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.12)[​](#02110-beta12 "Direct link to heading") ### [0.21.10-beta.11](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.11)[​](#02110-beta11 "Direct link to heading") ### [0.21.10-beta.9](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.9)[​](#02110-beta9 "Direct link to heading") ### [0.21.10-beta.10](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.10)[​](#02110-beta10 "Direct link to heading") ### [0.21.10-beta.8](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.8)[​](#02110-beta8 "Direct link to heading") ### [0.21.10-beta.7](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.7)[​](#02110-beta7 "Direct link to heading") ### [0.21.10-beta.6](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.6)[​](#02110-beta6 "Direct link to heading") ### [0.21.10-beta.5](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.5)[​](#02110-beta5 "Direct link to heading") ### [0.21.10-beta.4](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.4)[​](#02110-beta4 "Direct link to heading") ### [0.21.10-beta.3](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.3)[​](#02110-beta3 "Direct link to heading") ### 
[0.21.10-beta.2](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.2)[​](#02110-beta2 "Direct link to heading") ### [0.21.10-beta.1](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.1)[​](#02110-beta1 "Direct link to heading") ### [0.21.10-beta.0](https://github.com/apify/apify-cli/releases/tag/v0.21.10-beta.0)[​](#02110-beta0 "Direct link to heading") ### [0.21.9](https://github.com/apify/apify-cli/releases/tag/v0.21.9)[​](#0219 "Direct link to heading") ##### [0.21.9](https://github.com/apify/apify-cli/releases/tag/v0.21.9) (2025-07-17)[​](#0219-2025-07-17 "Direct link to heading") ###### 🚀 Features[​](#-features "Direct link to heading") * Handle sub-schema validation (#853) ([51cc733](https://github.com/apify/apify-cli/commit/51cc7333ca665a6e0dfab8a6dc296a396b3e40ff)) ### [0.21.9-beta.8](https://github.com/apify/apify-cli/releases/tag/v0.21.9-beta.8)[​](#0219-beta8 "Direct link to heading") ### [0.21.9-beta.7](https://github.com/apify/apify-cli/releases/tag/v0.21.9-beta.7)[​](#0219-beta7 "Direct link to heading") ### [0.21.9-beta.6](https://github.com/apify/apify-cli/releases/tag/v0.21.9-beta.6)[​](#0219-beta6 "Direct link to heading") ### [0.21.9-beta.5](https://github.com/apify/apify-cli/releases/tag/v0.21.9-beta.5)[​](#0219-beta5 "Direct link to heading") ### [0.21.9-beta.4](https://github.com/apify/apify-cli/releases/tag/v0.21.9-beta.4)[​](#0219-beta4 "Direct link to heading") ### [0.21.9-beta.3](https://github.com/apify/apify-cli/releases/tag/v0.21.9-beta.3)[​](#0219-beta3 "Direct link to heading") --- # Installation ## Installation[​](#installation "Direct link to heading") You can install Apify CLI either using [Homebrew package manager](https://brew.sh) on macOS or Linux or using NPM on all platforms including Windows. ### Via Homebrew[​](#via-homebrew "Direct link to heading") Run the following command: ``` brew install apify-cli ``` ### Via NPM[​](#via-npm "Direct link to heading") First, make sure you have [Node.js](https://nodejs.org) version 18 or higher with NPM installed on your computer: ``` node --version npm --version ``` Install or upgrade Apify CLI by running: ``` npm -g install apify-cli ``` If you receive a permission error, read npm's [official guide](https://docs.npmjs.com/resolving-eacces-permissions-errors-when-installing-packages-globally) on installing packages globally. Alternatively, you can use [Node Version Manager (nvm)](https://github.com/nvm-sh/nvm) and install Apify CLI only into a selected user-level Node version without requiring root privileges: ``` nvm install 18 nvm use 18 npm -g install apify-cli ``` After using either of these methods , verify that Apify CLI was installed correctly by running: ``` apify --version ``` which should print something like: ``` apify-cli/0.19.1 linux-x64 node-v18.17.0 ``` ## Basic Usage[​](#basic-usage "Direct link to heading") The following examples demonstrate the basic usage of Apify CLI. ### Create a New Actor from Scratch[​](#create-a-new-actor-from-scratch "Direct link to heading") ``` apify create my-hello-world ``` First, you will be prompted to select a template with the boilerplate for the Actor, to help you get started quickly. The command will create a directory called `my-hello-world` that contains a Node.js project for the Actor and a few configuration files. 
### Create a New Actor from Existing Project[​](#create-a-new-actor-from-existing-project "Direct link to heading") Automatic Actor directory initialization When you create an Actor using the `apify create` command, the directory will already be initialized. ``` cd ./my/awesome/project apify init ``` This command will only set up a local Actor development environment in an existing directory, i.e. it will create the `.actor/actor.json` file and the `apify_storage` directory. Before you can run your project locally using `apify run`, you have to set up the right start command in `package.json` under `scripts.start`. For example: ``` { ... "scripts": { "start": "node your_main_file.js", }, ... } ``` You can find more information by running `apify help run`. ### Run the Actor Locally[​](#run-the-actor-locally "Direct link to heading") ``` cd my-hello-world apify run ``` This command runs the Actor on your local machine. Now's your chance to develop the logic - or magic 😏 ### Login with your Apify account[​](#login-with-your-apify-account "Direct link to heading") ``` apify login ``` Before you can interact with the Apify cloud, you need to [create an Apify account](https://console.apify.com/) and log in to it using the above command. You will be prompted for your [Apify API token](https://console.apify.com/settings/integrations). API token save directory The command will store the API token and other sensitive information to `~/.apify`. ### Push the Actor to the Apify Cloud[​](#push-the-actor-to-the-apify-cloud "Direct link to heading") ``` apify push ``` This command uploads your project to the Apify cloud and builds an Actor from it. On the platform, an Actor needs to be built before it can be run. ### Run an Actor on the Apify Cloud[​](#run-an-actor-on-the-apify-cloud "Direct link to heading") ``` apify call ``` Runs the Actor corresponding to the current directory on the Apify Platform. This command can also be used to run other Actors, for example: ``` apify call apify/hello-world ``` ### So what's in this `.actor/actor.json` File?[​](#so-whats-in-this-actoractorjson-file "Direct link to heading") This file associates your local development project with an Actor on the Apify Platform. It contains information such as the Actor name, version, build tag and environment variables. Make sure you commit this file to the Git repository. For example, the `.actor/actor.json` file can look as follows: ``` { "actorSpecification": 1, "name": "name-of-my-scraper", "version": "0.0", "buildTag": "latest", "environmentVariables": { "MYSQL_USER": "my_username", "MYSQL_PASSWORD": "@mySecretPassword" }, "dockerfile": "./Dockerfile", "readme": "./ACTOR.md", "input": "./input_schema.json", "storages": { "dataset": "./dataset_schema.json" } } ``` **`Dockerfile` field** If you specify the path to your Docker file under the `dockerfile` field, this file will be used for Actor builds on the platform. If not specified, the system will look for Docker files at `.actor/Dockerfile` and `Dockerfile`, in this order of preference. **`Readme` field** If you specify the path to your readme file under the `readme` field, the readme at this path will be used on the platform. If not specified, the readme at `.actor/README.md` and `README.md` will be used, in this order of preference. **`Input` field** You can embed your [input schema](https://docs.apify.com/actors/development/input-schema#specification-version-1) object directly in `actor.json` under the `input` field. Alternatively, you can provide a path to a custom input schema.
If not provided, the input schema at `.actor/INPUT_SCHEMA.json` and `INPUT_SCHEMA.json` is used in this order of preference. **`Storages.dataset` field** You can define the schema of the items in your dataset under the `storages.dataset` field. This can be either an embedded object or a path to a JSON schema file. You can read more about the schema of your Actor output [here](https://docs.apify.com/actors/development/output-schema#specification-version-1). Migration from deprecated config "apify.json" Note that previously, Actor config was stored in the `apify.json` file that has been deprecated. You can find the (very slight) differences and migration info in [migration guidelines](https://github.com/apify/apify-cli/blob/master/MIGRATIONS.md). --- # Integrating Scrapy projects [Scrapy](https://scrapy.org/) is a widely used open-source web scraping framework for Python. Scrapy projects can now be executed on the Apify platform using our dedicated wrapping tool. This tool allows users to transform their Scrapy projects into [Apify Actors](https://docs.apify.com/platform/actors) with just a few simple commands. ## Getting started[​](#getting-started "Direct link to heading") ### Install Apify CLI[​](#install-apify-cli "Direct link to heading") To run the migration tool, you need to have the Apify CLI installed. You can install it using Homebrew with the following command: ``` brew install apify-cli ``` Alternatively, you can install it using NPM with the following command: ``` npm i -g apify-cli ``` In case of any issues, please refer to the [installation guide](https://docs.apify.com/cli/cli/docs/installation.md). ## Actorization of your existing Scrapy spider[​](#actorization-of-your-existing-scrapy-spider "Direct link to heading") Assuming your Scrapy project is set up, navigate to the project root where the `scrapy.cfg` file is located. ``` cd your_scraper ``` Verify the directory contents to ensure the correct location. ``` $ ls -R .: your_scraper README.md requirements.txt scrapy.cfg ./your_scraper: __init__.py items.py __main__.py main.py pipelines.py settings.py spiders ./your_scraper/spiders: your_spider.py __init__.py ``` To convert your Scrapy project into an Apify Actor, initiate the wrapping process by executing the following command: ``` apify init ``` The script will prompt you with a series of questions. Upon completion, the output might resemble the following: ``` Info: The current directory looks like a Scrapy project. Using automatic project wrapping. ? Enter the Scrapy BOT_NAME (see settings.py): books_scraper ? What folder are the Scrapy spider modules stored in? (see SPIDER_MODULES in settings.py): books_scraper.spiders ? Pick the Scrapy spider you want to wrap: BookSpider (/home/path/to/actor-scrapy-books-example/books_scraper/spiders/book.py) Info: Downloading the latest Scrapy wrapper template... Info: Wrapping the Scrapy project... Success: The Scrapy project has been wrapped successfully. ``` For example, here is a [source code](https://github.com/apify/actor-scrapy-books-example) of an actorized Scrapy project, and [here](https://apify.com/vdusek/scrapy-books-example) the corresponding Actor in Apify Store. ### Run the Actor locally[​](#run-the-actor-locally "Direct link to heading") Create a Python virtual environment by running: ``` python -m virtualenv .venv ``` Activate the virtual environment: ``` source .venv/bin/activate ``` Install Python dependencies using the provided requirements file named `requirements_apify.txt`. 
Ensure these requirements are installed before executing your project as an Apify Actor locally. You can put your own dependencies there as well. ``` pip install -r requirements-apify.txt [-r requirements.txt] ``` Finally execute the Apify Actor. ``` apify run [--purge] ``` If [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py) is configured, the Actor's output will be stored in the `storage/datasets/default/` directory. ### Run the scraper as Scrapy project[​](#run-the-scraper-as-scrapy-project "Direct link to heading") The project remains executable as a Scrapy project. ``` scrapy crawl your_spider -o books.json ``` ## Deploy on Apify[​](#deploy-on-apify "Direct link to heading") ### Log in to Apify[​](#log-in-to-apify "Direct link to heading") You will need to provide your [Apify API Token](https://console.apify.com/settings/integrations) to complete this action. ``` apify login ``` ### Deploy your Actor[​](#deploy-your-actor "Direct link to heading") This command will deploy and build the Actor on the Apify platform. You can find your newly created Actor under [Actors -> My Actors](https://console.apify.com/actors?tab=my). ``` apify push ``` ## What the wrapping process does[​](#what-the-wrapping-process-does "Direct link to heading") The initialization command enhances your project by adding necessary files and updating some of them while preserving its functionality as a typical Scrapy project. The additional requirements file, named `requirements_apify.txt`, includes the Apify Python SDK and other essential requirements. The `.actor/` directory contains basic configuration of your Actor. We provide two new Python files [main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) and [\_\_main\_\_.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/__main__.py), where we encapsulate the Scrapy project within an Actor. We also import and use there a few Scrapy components from our [Python SDK](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy). These components facilitate the integration of the Scrapy projects with the Apify platform. Further details about these components are provided in the following subsections. ### Scheduler[​](#scheduler "Direct link to heading") The [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) is a core component of Scrapy responsible for receiving and providing requests to be processed. To leverage the [Apify request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests, a custom scheduler becomes necessary. Fortunately, Scrapy is a modular framework, allowing the creation of custom components. As a result, we have implemented the [ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/scheduler.py). When using the Apify CLI wrapping tool, the scheduler is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor. ### Dataset push pipeline[​](#dataset-push-pipeline "Direct link to heading") [Item pipelines](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) are used for the processing of the results produced by your spiders. 
To handle the transmission of result data to the [Apify dataset](https://docs.apify.com/platform/storage/dataset), we have implemented the [ActorDatasetPushPipeline](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/pipelines.py). When using the Apify CLI wrapping tool, the pipeline is configured in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor. It is assigned the highest integer value (1000), ensuring its execution as the final step in the pipeline sequence. ### Retry middleware[​](#retry-middleware "Direct link to heading") [Downloader middlewares](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) are a way to hook into Scrapy's request/response processing. Scrapy comes with various default middlewares, including the [RetryMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry), designed to handle retries for requests that may have failed due to temporary issues. When integrating with the [Apify request queue](https://docs.apify.com/platform/storage/request-queue), it becomes necessary to enhance this middleware to facilitate communication with the request queue, marking the requests either as handled or ready for a retry. When using the Apify CLI wrapping tool, the default `RetryMiddleware` is disabled, and [ApifyRetryMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_retry.py) takes its place. Configuration for the middlewares is established in the [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py) file of your Actor. ### HTTP proxy middleware[​](#http-proxy-middleware "Direct link to heading") Another default Scrapy [downloader middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that requires replacement is [HttpProxyMiddleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy). To utilize proxies managed through the Apify [ProxyConfiguration](https://github.com/apify/apify-sdk-python/blob/master/src/apify/proxy_configuration.py), we provide [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py). When using the Apify CLI wrapping tool, the default `HttpProxyMiddleware` is disabled, and [ApifyHttpProxyMiddleware](https://github.com/apify/apify-sdk-python/blob/master/src/apify/scrapy/middlewares/apify_proxy.py) takes its place. Additionally, inspect the [.actor/input\_schema.json](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/.actor/input_schema.json) file, where proxy configuration is specified as an input property for your Actor. The processing of this input is carried out together with the middleware configuration in [src/main.py](https://github.com/apify/actor-templates/blob/master/templates/python-scrapy/src/main.py). ## Known limitations[​](#known-limitations "Direct link to heading") There are some known limitations we are aware of when running Scrapy projects on the Apify platform.
### Asynchronous code in spiders and other components[​](#asynchronous-code-in-spiders-and-other-components "Direct link to heading") Scrapy's asynchronous execution is based on the [Twisted](https://twisted.org/) library, not on [AsyncIO](https://docs.python.org/3/library/asyncio.html), which brings some complications to the table. Due to the asynchronous nature of Actors, all of their code is executed as a coroutine inside `asyncio.run`. In order to execute Scrapy code inside an Actor, following the section [Run Scrapy from a script](https://docs.scrapy.org/en/latest/topics/practices.html?highlight=CrawlerProcess#run-scrapy-from-a-script) from the official Scrapy documentation, we need to invoke the [`CrawlerProcess.start`](https://github.com/scrapy/scrapy/blob/2.11.0/scrapy/crawler.py#L393:L427) method. This method triggers Twisted's event loop, also known as a reactor. Consequently, Twisted's event loop is executed within AsyncIO's event loop. On top of that, employing AsyncIO code in spiders or other components necessitates the creation of a new AsyncIO event loop, within which the coroutines from these components are executed. This means there is an AsyncIO event loop running inside the Twisted event loop, which itself runs inside the outer AsyncIO event loop. We have resolved this issue by leveraging the [nest-asyncio](https://pypi.org/project/nest-asyncio/) library, enabling the execution of nested AsyncIO event loops. For executing a coroutine within a spider or other component, it is recommended to use Apify's instance of the nested event loop. Refer to the code example below or derive inspiration from Apify's Scrapy components, such as the [ApifyScheduler](https://github.com/apify/apify-sdk-python/blob/v1.5.0/src/apify/scrapy/scheduler.py#L114). ``` from apify.scrapy.utils import nested_event_loop ... # Coroutine execution inside a spider nested_event_loop.run_until_complete(my_coroutine()) ``` ### More spiders per Actor[​](#more-spiders-per-actor "Direct link to heading") It is recommended to execute only one Scrapy spider per Apify Actor. Mapping more Scrapy spiders to a single Apify Actor does not make much sense. We would have to create a separate instance of the [request queue](https://docs.apify.com/platform/storage/request-queue) for every spider. Also, every spider can produce different output, resulting in a mess in the output [dataset](https://docs.apify.com/platform/storage/dataset). A solution for this could be to store the output of every spider in a different [key-value store](https://docs.apify.com/platform/storage/key-value-store). However, a much simpler solution to this problem is to have a single spider per Actor. If you want to share common Scrapy components (middlewares, item pipelines, ...) among more spiders (Actors), you can use a dedicated Python package containing your components and install it into your Actors' environments. The other solution to this problem could be to have more spiders per Actor, but run only one of them per Actor run. Which spider is going to be executed in a given run can be specified in the [input schema](https://docs.apify.com/academy/deploying-your-code/input-schema), as sketched below.
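A minimal sketch of that approach (the `spider_name` input field, the module paths and the `AuthorSpider` class are hypothetical, and the nest-asyncio and Apify component wiring performed by the generated `src/main.py` template is omitted):

```
# Illustrative only: select which spider to run based on Actor input.
from apify import Actor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from books_scraper.spiders.book import BookSpider      # hypothetical spider modules
from books_scraper.spiders.author import AuthorSpider

SPIDERS = {'book': BookSpider, 'author': AuthorSpider}

async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        spider_cls = SPIDERS.get(actor_input.get('spider_name'), BookSpider)

        process = CrawlerProcess(get_project_settings(), install_root_handler=False)
        process.crawl(spider_cls)
        process.start()  # Starts the Twisted reactor, as discussed above.
```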
## Additional links[​](#additional-links "Direct link to heading") * [Scrapy Books Example Actor](https://apify.com/vdusek/scrapy-books-example) * [Python Actor Scrapy template](https://apify.com/templates/python-scrapy) * [Apify SDK for Python](https://docs.apify.com/sdk/python) * [Apify platform](https://docs.apify.com/platform) * [Join our developer community on Discord](https://discord.com/invite/jyEM2PRvMU) > We welcome any feedback! Please feel free to contact us at <python@apify.com>. Thank you for your valuable input. --- # Apify CLI Reference Documentation The Apify CLI provides tools for managing your Apify projects and resources from the command line. Use these commands to develop Actors locally, deploy them to Apify platform, manage storage, orchestrate runs, and handle account configuration. This reference guide documents available commands, their options, and common usage patterns, to efficiently work with Apify platform. ### General[​](#general "Direct link to heading") The general commands provide basic functionality for getting help and information about the Apify CLI. ##### `apify help [COMMAND]`[​](#apify-help-command "Direct link to heading") Display help for apify. ``` USAGE $ apify help [COMMAND...] [-n] ARGUMENTS COMMAND... Command to show help for. FLAGS -n, --nested-commands Include all nested commands in the output. DESCRIPTION Display help for apify. ``` ### Authentication & Account Management[​](#authentication--account-management "Direct link to heading") Use these commands to manage your Apify account authentication, access tokens, and configuration settings. These commands control how you interact with Apify platform and manage sensitive information. ##### `apify login`[​](#apify-login "Direct link to heading") Authenticates your Apify account and saves credentials to '\~/.apify'. ``` USAGE $ apify login [-t <value>] [-m console|manual] FLAGS -m, --method=<option> [Optional] Method of logging in to Apify <options: console|manual> -t, --token=<value> [Optional] Apify API token DESCRIPTION Authenticates your Apify account and saves credentials to '~/.apify'. All other commands use these stored credentials. Run 'apify logout' to remove authentication. ``` ##### `apify logout`[​](#apify-logout "Direct link to heading") Removes authentication by deleting your API token and account information from '\~/.apify'. ``` USAGE $ apify logout DESCRIPTION Removes authentication by deleting your API token and account information from '~/.apify'. Run 'apify login' to authenticate again. ``` ##### `apify info`[​](#apify-info "Direct link to heading") Prints details about your currently authenticated Apify account. ``` USAGE $ apify info DESCRIPTION Prints details about your currently authenticated Apify account. ``` ##### `apify secrets`[​](#apify-secrets "Direct link to heading") Manages secure environment variables for Actors. ``` USAGE $ apify secrets DESCRIPTION Manages secure environment variables for Actors. Example: $ apify secrets add mySecret TopSecretValue123 The "mySecret" value can be used in an environment variable defined in '.actor/actor.json' file by adding the "@" prefix: { "actorSpecification": 1, "name": "my_actor", "environmentVariables": { "SECRET_ENV_VAR": "@mySecret" }, "version": "0.1" } When the Actor is pushed to Apify cloud, the "SECRET_ENV_VAR" and its value is stored as a secret environment variable of the Actor. 
``` ##### `apify secrets add NAME VALUE`[​](#apify-secrets-add-name-value "Direct link to heading") Adds a new secret to '\~/.apify' for use in Actor environment variables. ``` USAGE $ apify secrets add NAME VALUE ARGUMENTS NAME Name of the secret VALUE Value of the secret DESCRIPTION Adds a new secret to '~/.apify' for use in Actor environment variables. ``` ##### `apify secrets rm NAME`[​](#apify-secrets-rm-name "Direct link to heading") Permanently deletes a secret from your stored credentials. ``` USAGE $ apify secrets rm NAME ARGUMENTS NAME Name of the secret DESCRIPTION Permanently deletes a secret from your stored credentials. ``` ### Actor Development[​](#actor-development "Direct link to heading") These commands help you develop Actors locally. Use them to create new Actor projects, initialize configurations, run Actors in development mode, and validate input schemas. ##### `apify create [ACTORNAME]`[​](#apify-create-actorname "Direct link to heading") Creates an Actor project from a template in a new directory. ``` USAGE $ apify create [ACTORNAME] [-t <value>] [--skip-dependency-install] [--omit-optional-deps] ARGUMENTS ACTORNAME Name of the Actor and its directory FLAGS -t, --template=<value> Template for the Actor. If not provided, the command will prompt for it. Visit https://raw.githubusercontent.com/apify/actor-templates/master/templates/manifest.json to find available template names. --omit-optional-deps Skip installing optional dependencies. --skip-dependency-install Skip installing Actor dependencies. DESCRIPTION Creates an Actor project from a template in a new directory. ``` ##### `apify init [ACTORNAME]`[​](#apify-init-actorname "Direct link to heading") Sets up an Actor project in your current directory by creating actor.json and storage files. ``` USAGE $ apify init [ACTORNAME] [-y] ARGUMENTS ACTORNAME Name of the Actor. If not provided, you will be prompted for it. FLAGS -y, --yes Automatic yes to prompts; assume "yes" as answer to all prompts. Note that in some cases, the command may still ask for confirmation. DESCRIPTION Sets up an Actor project in your current directory by creating actor.json and storage files. If the directory contains a Scrapy project in Python, the command automatically creates wrappers so that you can run your scrapers without changes. Creates the '.actor/actor.json' file and the 'storage' directory in the current directory, but does not touch any other existing files or directories. WARNING: Overwrites existing 'storage' directory. ``` ##### `apify run`[​](#apify-run "Direct link to heading") Executes Actor locally with simulated Apify environment variables. ``` USAGE $ apify run [-p] [--purge-queue] [--purge-dataset] [--purge-key-value-store] [--entrypoint <value>] [-i <value> | --input-file <value>] FLAGS -i, --input=<value> Optional JSON input to be given to the Actor. -p, --purge Shortcut that combines the --purge-queue, --purge-dataset and --purge-key-value-store options. --entrypoint=<value> Optional entrypoint for running with injected environment variables. For Python, it is the module name, or a path to a file. For node.js, it is the npm script name, or a path to a JS/MJS file. You can also pass in a directory name, provided that directory contains an "index.js" file. --input-file=<value> Optional path to a file with JSON input to be given to the Actor. The file must be a valid JSON file. You can also specify `-` to read from standard input. 
--purge-dataset Deletes the local directory containing the default dataset before the run starts. --purge-key-value-store Deletes all records from the default key-value store in the local directory before the run starts, except for the "INPUT" key. --purge-queue Deletes the local directory containing the default request queue before the run starts. DESCRIPTION Executes Actor locally with simulated Apify environment variables. Stores data in local 'storage' directory. NOTE: For Node.js Actors, customize behavior by modifying the 'start' script in package.json file. ``` ##### `apify validate-schema [PATH]`[​](#apify-validate-schema-path "Direct link to heading") Validates Actor input schema from one of these locations (in priority order): ``` USAGE $ apify validate-schema [PATH] ARGUMENTS PATH Optional path to your INPUT_SCHEMA.json file. If not provided ./INPUT_SCHEMA.json is used. DESCRIPTION Validates Actor input schema from one of these locations (in priority order): 1. Object in '.actor/actor.json' under "input" key 2. JSON file path in '.actor/actor.json' "input" key 3. .actor/INPUT_SCHEMA.json 4. INPUT_SCHEMA.json Optionally specify custom schema path to validate. ``` ### Actor Management[​](#actor-management "Direct link to heading") These commands let you manage Actors on Apify platform. They provide functionality for deployment, execution, monitoring, and maintenance of your Actors in the cloud environment. #### Basic Actor Operations[​](#basic-actor-operations "Direct link to heading") Use these commands to handle core Actor operations like creation, listing, deletion, and basic runtime management. These are the essential commands for working with Actors on Apify platform. ##### `apify actors`[​](#apify-actors "Direct link to heading") Manages Actor creation, deployment, and execution on the Apify platform. ``` USAGE $ apify actors DESCRIPTION Manages Actor creation, deployment, and execution on the Apify platform. ``` ##### `apify actors ls`[​](#apify-actors-ls "Direct link to heading") Prints a list of recently executed Actors or Actors you own. ``` USAGE $ apify actors ls [--json] [--my] [--offset <value>] [--limit <value>] [--desc] FLAGS --desc Sort Actors in descending order. --limit=<value> [default: 20] Number of Actors that will be listed. --my Whether to list Actors made by the logged in user. --offset=<value> Number of Actors that will be skipped. GLOBAL FLAGS --json Format output as json. DESCRIPTION Prints a list of recently executed Actors or Actors you own. ``` ##### `apify actors rm ACTORID`[​](#apify-actors-rm-actorid "Direct link to heading") Permanently removes an Actor from your account. ``` USAGE $ apify actors rm ACTORID ARGUMENTS ACTORID The Actor ID to delete. DESCRIPTION Permanently removes an Actor from your account. ``` ##### `apify actor`[​](#apify-actor "Direct link to heading") Manages runtime data operations inside of a running Actor. ``` USAGE $ apify actor DESCRIPTION Manages runtime data operations inside of a running Actor. ``` ##### `apify actor charge EVENTNAME`[​](#apify-actor-charge-eventname "Direct link to heading") Charge for a specific event in the pay-per-event Actor run. 
``` USAGE $ apify actor charge EVENTNAME [--count <value>] [--idempotency-key <value>] [--test-pay-per-event] ARGUMENTS EVENTNAME Name of the event to charge for FLAGS --count=<value> [default: 1] Number of events to charge --idempotency-key=<value> Idempotency key for the charge request --test-pay-per-event Test pay-per-event charging without actually charging DESCRIPTION Charge for a specific event in the pay-per-event Actor run. ``` ##### `apify actor get-input`[​](#apify-actor-get-input "Direct link to heading") Gets the Actor input value from the default key-value store associated with the Actor run. ``` USAGE $ apify actor get-input DESCRIPTION Gets the Actor input value from the default key-value store associated with the Actor run. ``` ##### `apify actor get-public-url KEY`[​](#apify-actor-get-public-url-key "Direct link to heading") Get an HTTP URL that allows public access to a key-value store item. ``` USAGE $ apify actor get-public-url KEY ARGUMENTS KEY Key of the record in key-value store DESCRIPTION Get an HTTP URL that allows public access to a key-value store item. ``` ##### `apify actor get-value KEY`[​](#apify-actor-get-value-key "Direct link to heading") Gets a value from the default key-value store associated with the Actor run. ``` USAGE $ apify actor get-value KEY ARGUMENTS KEY Key of the record in key-value store DESCRIPTION Gets a value from the default key-value store associated with the Actor run. ``` ##### `apify actor push-data [ITEM]`[​](#apify-actor-push-data-item "Direct link to heading") Saves data to Actor's run default dataset. ``` USAGE $ apify actor push-data [ITEM] ARGUMENTS ITEM JSON string with one object or array of objects containing data to be stored in the default dataset. DESCRIPTION Saves data to Actor's run default dataset. Accept input as: - JSON argument: $ apify actor push-data {"key": "value"} - Piped stdin: $ cat ./test.json | apify actor push-data ``` ##### `apify actor set-value KEY [VALUE]`[​](#apify-actor-set-value-key-value "Direct link to heading") Sets or removes record into the default key-value store associated with the Actor run. ``` USAGE $ apify actor set-value KEY [VALUE] [-c <value>] ARGUMENTS KEY Key of the record in key-value store. VALUE Record data, which can be one of the following values: - If empty, the record in the key-value store is deleted. - If no `contentType` flag is specified, value is expected to be any JSON string value. - If options.contentType is set, value is taken as is. FLAGS -c, --contentType=<value> Specifies a custom MIME content type of the record. By default "application/json" is used. DESCRIPTION Sets or removes record into the default key-value store associated with the Actor run. It is possible to pass data using argument or stdin. Passing data using argument: $ apify actor set-value KEY my-value Passing data using stdin with pipe: $ cat ./my-text-file.txt | apify actor set-value KEY --contentType text/plain ``` #### Actor Deployment[​](#actor-deployment "Direct link to heading") These commands handle the deployment workflow of Actors to Apify platform. Use them to push local changes, pull remote Actors, and manage Actor versions and builds. ##### `apify push [ACTORID]`[​](#apify-push-actorid "Direct link to heading") ##### `apify actors push [ACTORID]`[​](#apify-actors-push-actorid "Direct link to heading") Deploys Actor to Apify platform using settings from '.actor/actor.json'. 
``` USAGE $ apify actors push [ACTORID] [-v <value>] [-b <value>] [-w <value>] [--no-prompt] [--force] [--dir <value>] ARGUMENTS ACTORID Name or ID of the Actor to push (e.g. "apify/hello-world" or "E2jjCZBezvAZnX8Rb"). If not provided, the command will create or modify the Actor with the name specified in '.actor/actor.json' file. FLAGS -b, --build-tag=<value> Build tag to be applied to the successful Actor build. By default, it is taken from the '.actor/actor.json' file -v, --version=<value> Actor version number to which the files should be pushed. By default, it is taken from the '.actor/actor.json' file. -w, --wait-for-finish=<value> Seconds for waiting to build to finish, if no value passed, it waits forever. --dir=<value> Directory where the Actor is located --force Push an Actor even when the local files are older than the Actor on the platform. --no-prompt Do not prompt for opening the Actor details in a browser. This will also not open the browser automatically. DESCRIPTION Deploys Actor to Apify platform using settings from '.actor/actor.json'. Files under '3' MB upload as "Multiple source files"; larger projects upload as ZIP file. Use --force to override newer remote versions. ``` ##### `apify pull [ACTORID]`[​](#apify-pull-actorid "Direct link to heading") ##### `apify actors pull [ACTORID]`[​](#apify-actors-pull-actorid "Direct link to heading") Download Actor code to current directory. Clones Git repositories or fetches Actor files based on the source type. ``` USAGE $ apify actors pull [ACTORID] [-v <value>] [--dir <value>] ARGUMENTS ACTORID Name or ID of the Actor to run (e.g. "apify/hello-world" or "E2jjCZBezvAZnX8Rb"). If not provided, the command will update the Actor in the current directory based on its name in ".actor/actor.json" file. FLAGS -v, --version=<value> Actor version number which will be pulled, e.g. 1.2. Default: the highest version --dir=<value> Directory where the Actor should be pulled to DESCRIPTION Download Actor code to current directory. Clones Git repositories or fetches Actor files based on the source type. ``` ##### `apify call [ACTORID]`[​](#apify-call-actorid "Direct link to heading") ##### `apify actors call [ACTORID]`[​](#apify-actors-call-actorid "Direct link to heading") Executes Actor remotely using your authenticated account. ``` USAGE $ apify actors call [ACTORID] [--json] [-b <value>] [-t <value>] [-m <value>] [-i <value> | --input-file <value>] [-s] [-o] ARGUMENTS ACTORID Name or ID of the Actor to run (e.g. "my-actor", "apify/hello-world" or "E2jjCZBezvAZnX8Rb"). If not provided, the command runs the remote Actor specified in the '.actor/actor.json' file. FLAGS -b, --build=<value> Tag or number of the build to run (e.g. "latest" or "1.2.34"). -i, --input=<value> Optional JSON input to be given to the Actor. -m, --memory=<value> Amount of memory allocated for the Actor run, in megabytes. -o, --output-dataset Prints out the entire default dataset on successful run of the Actor. -s, --silent Prevents printing the logs of the Actor run to the console. -t, --timeout=<value> Timeout for the Actor run in seconds. Zero value means there is no timeout. --input-file=<value> Optional path to a file with JSON input to be given to the Actor. The file must be a valid JSON file. You can also specify `-` to read from standard input. GLOBAL FLAGS --json Format output as json. DESCRIPTION Executes Actor remotely using your authenticated account. Reads input from local key-value store by default. 
``` ##### `apify actors start [ACTORID]`[​](#apify-actors-start-actorid "Direct link to heading") Starts Actor remotely and returns run details immediately. ``` USAGE $ apify actors start [ACTORID] [--json] [-b <value>] [-t <value>] [-m <value>] [-i <value> | --input-file <value>] ARGUMENTS ACTORID Name or ID of the Actor to run (e.g. "my-actor", "apify/hello-world" or "E2jjCZBezvAZnX8Rb"). If not provided, the command runs the remote Actor specified in the '.actor/actor.json' file. FLAGS -b, --build=<value> Tag or number of the build to run (e.g. "latest" or "1.2.34"). -i, --input=<value> Optional JSON input to be given to the Actor. -m, --memory=<value> Amount of memory allocated for the Actor run, in megabytes. -t, --timeout=<value> Timeout for the Actor run in seconds. Zero value means there is no timeout. --input-file=<value> Optional path to a file with JSON input to be given to the Actor. The file must be a valid JSON file. You can also specify `-` to read from standard input. GLOBAL FLAGS --json Format output as json. DESCRIPTION Starts Actor remotely and returns run details immediately. Uses authenticated account and local key-value store for input. ``` ##### `apify actors info ACTORID`[​](#apify-actors-info-actorid "Direct link to heading") Get information about an Actor. ``` USAGE $ apify actors info ACTORID [--json] [--readme | --input] ARGUMENTS ACTORID The ID of the Actor to return information about. FLAGS --input Return the Actor input schema. --readme Return the Actor README. GLOBAL FLAGS --json Format output as json. DESCRIPTION Get information about an Actor. ``` #### Actor Builds[​](#actor-builds "Direct link to heading") Use these commands to manage Actor build processes. They help you create, monitor, and maintain versioned snapshots of your Actors that can be executed on Apify platform. ##### `apify builds`[​](#apify-builds "Direct link to heading") Manages Actor build processes and versioning. ``` USAGE $ apify builds DESCRIPTION Manages Actor build processes and versioning. ``` ##### `apify builds create [ACTORID]`[​](#apify-builds-create-actorid "Direct link to heading") Creates a new build of the Actor. ``` USAGE $ apify builds create [ACTORID] [--json] [--tag <value>] [--version <value>] [--log] ARGUMENTS ACTORID Optional Actor ID or Name to trigger a build for. By default, it will use the Actor from the current directory. FLAGS --log Whether to print out the build log after the build is triggered. --tag=<value> Build tag to be applied to the successful Actor build. By default, this is "latest". --version=<value> Optional Actor Version to build. By default, this will be inferred from the tag, but this flag is required when multiple versions have the same tag. GLOBAL FLAGS --json Format output as json. DESCRIPTION Creates a new build of the Actor. ``` ##### `apify actors build [ACTORID]`[​](#apify-actors-build-actorid "Direct link to heading") Creates a new build of the Actor. ``` USAGE $ apify actors build [ACTORID] [--json] [--tag <value>] [--version <value>] [--log] ARGUMENTS ACTORID Optional Actor ID or Name to trigger a build for. By default, it will use the Actor from the current directory. FLAGS --log Whether to print out the build log after the build is triggered. --tag=<value> Build tag to be applied to the successful Actor build. By default, this is "latest". --version=<value> Optional Actor Version to build. By default, this will be inferred from the tag, but this flag is required when multiple versions have the same tag. 
GLOBAL FLAGS --json Format output as json. DESCRIPTION Creates a new build of the Actor. ``` ##### `apify builds info BUILDID`[​](#apify-builds-info-buildid "Direct link to heading") Prints information about a specific build. ``` USAGE $ apify builds info BUILDID [--json] ARGUMENTS BUILDID The build ID to get information about. GLOBAL FLAGS --json Format output as json. DESCRIPTION Prints information about a specific build. ``` ##### `apify builds log BUILDID`[​](#apify-builds-log-buildid "Direct link to heading") Prints the log of a specific build. ``` USAGE $ apify builds log BUILDID ARGUMENTS BUILDID The build ID to get the log from. DESCRIPTION Prints the log of a specific build. ``` ##### `apify builds ls [ACTORID]`[​](#apify-builds-ls-actorid "Direct link to heading") Lists all builds of the Actor. ``` USAGE $ apify builds ls [ACTORID] [--json] [--offset <value>] [--limit <value>] [--desc] [-c] ARGUMENTS ACTORID Optional Actor ID or Name to list runs for. By default, it will use the Actor from the current directory. FLAGS -c, --compact Display a compact table. --desc Sort builds in descending order. --limit=<value> [default: 10] Number of builds that will be listed. --offset=<value> Number of builds that will be skipped. GLOBAL FLAGS --json Format output as json. DESCRIPTION Lists all builds of the Actor. ``` ##### `apify builds rm BUILDID`[​](#apify-builds-rm-buildid "Direct link to heading") Permanently removes an Actor build from the Apify platform. ``` USAGE $ apify builds rm BUILDID ARGUMENTS BUILDID The build ID to delete. DESCRIPTION Permanently removes an Actor build from the Apify platform. ``` #### Actor Runs[​](#actor-runs "Direct link to heading") These commands control Actor execution on Apify platform. Use them to start, monitor, and manage Actor runs, including accessing logs and handling execution states. ##### `apify runs`[​](#apify-runs "Direct link to heading") Manages Actor run operations ``` USAGE $ apify runs DESCRIPTION Manages Actor run operations ``` ##### `apify runs abort RUNID`[​](#apify-runs-abort-runid "Direct link to heading") Aborts an Actor run. ``` USAGE $ apify runs abort RUNID [--json] [-f] ARGUMENTS RUNID The run ID to abort. FLAGS -f, --force Whether to force the run to abort immediately, instead of gracefully. GLOBAL FLAGS --json Format output as json. DESCRIPTION Aborts an Actor run. ``` ##### `apify runs info RUNID`[​](#apify-runs-info-runid "Direct link to heading") Prints information about an Actor run. ``` USAGE $ apify runs info RUNID [--json] [-v] ARGUMENTS RUNID The run ID to print information about. FLAGS -v, --verbose Prints more in-depth information about the Actor run. GLOBAL FLAGS --json Format output as json. DESCRIPTION Prints information about an Actor run. ``` ##### `apify runs log RUNID`[​](#apify-runs-log-runid "Direct link to heading") Prints the log of a specific run. ``` USAGE $ apify runs log RUNID ARGUMENTS RUNID The run ID to get the log from. DESCRIPTION Prints the log of a specific run. ``` ##### `apify runs ls [ACTORID]`[​](#apify-runs-ls-actorid "Direct link to heading") Lists all runs of the Actor. ``` USAGE $ apify runs ls [ACTORID] [--json] [--offset <value>] [--limit <value>] [--desc] [-c] ARGUMENTS ACTORID Optional Actor ID or Name to list runs for. By default, it will use the Actor from the current directory. FLAGS -c, --compact Display a compact table. --desc Sort runs in descending order. --limit=<value> [default: 10] Number of runs that will be listed. --offset=<value> Number of runs that will be skipped. 
GLOBAL FLAGS --json Format output as json. DESCRIPTION Lists all runs of the Actor. ``` ##### `apify runs resurrect RUNID`[​](#apify-runs-resurrect-runid "Direct link to heading") Resurrects an aborted or finished Actor Run. ``` USAGE $ apify runs resurrect RUNID [--json] ARGUMENTS RUNID The run ID to resurrect. GLOBAL FLAGS --json Format output as json. DESCRIPTION Resurrects an aborted or finished Actor Run. ``` ##### `apify runs rm RUNID`[​](#apify-runs-rm-runid "Direct link to heading") Deletes an Actor Run. ``` USAGE $ apify runs rm RUNID ARGUMENTS RUNID The run ID to delete. DESCRIPTION Deletes an Actor Run. ``` ### Storage[​](#storage "Direct link to heading") These commands manage data storage on Apify platform. Use them to work with datasets, key-value stores, and request queues for persistent data storage and retrieval. #### Datasets[​](#datasets "Direct link to heading") Use these commands to manage datasets, which provide structured storage for tabular data. They enable creation, modification, and data manipulation within datasets. ##### `apify datasets`[​](#apify-datasets "Direct link to heading") Manages structured data storage and retrieval. ``` USAGE $ apify datasets DESCRIPTION Manages structured data storage and retrieval. ``` ##### `apify datasets create [DATASETNAME]`[​](#apify-datasets-create-datasetname "Direct link to heading") Creates a new dataset for storing structured data on your account. ``` USAGE $ apify datasets create [DATASETNAME] [--json] ARGUMENTS DATASETNAME Optional name for the Dataset GLOBAL FLAGS --json Format output as json. DESCRIPTION Creates a new dataset for storing structured data on your account. ``` ##### `apify datasets get-items DATASETID`[​](#apify-datasets-get-items-datasetid "Direct link to heading") Retrieves dataset items in specified format (JSON, CSV, etc). ``` USAGE $ apify datasets get-items DATASETID [--limit <value>] [--offset <value>] [--format json|jsonl|csv|html|rss|xml|xlsx] ARGUMENTS DATASETID The ID of the Dataset to export the items for FLAGS --format=<option> [default: json] The format of the returned output. By default, it is set to 'json' <options: json|jsonl|csv|html|rss|xml|xlsx> --limit=<value> The amount of elements to get from the dataset. By default, it will return all available items. --offset=<value> The offset in the dataset where to start getting items. DESCRIPTION Retrieves dataset items in specified format (JSON, CSV, etc). ``` ##### `apify datasets info STOREID`[​](#apify-datasets-info-storeid "Direct link to heading") Prints information about a specific dataset. ``` USAGE $ apify datasets info STOREID [--json] ARGUMENTS STOREID The dataset store ID to print information about. GLOBAL FLAGS --json Format output as json. DESCRIPTION Prints information about a specific dataset. ``` ##### `apify datasets ls`[​](#apify-datasets-ls "Direct link to heading") Prints all datasets on your account. ``` USAGE $ apify datasets ls [--json] [--offset <value>] [--limit <value>] [--desc] [--unnamed] FLAGS --desc Sorts datasets in descending order. --limit=<value> [default: 20] Number of datasets that will be listed. --offset=<value> Number of datasets that will be skipped. --unnamed Lists datasets that don't have a name set. GLOBAL FLAGS --json Format output as json. DESCRIPTION Prints all datasets on your account. ``` ##### `apify datasets push-items NAMEORID [ITEM]`[​](#apify-datasets-push-items-nameorid-item "Direct link to heading") Adds data items to specified dataset. Accepts single object or array of objects. 
``` USAGE $ apify datasets push-items NAMEORID [ITEM] ARGUMENTS NAMEORID The dataset ID or name to push the objects to ITEM The object or array of objects to be pushed. DESCRIPTION Adds data items to specified dataset. Accepts single object or array of objects. ``` ##### `apify datasets rename NAMEORID [NEWNAME]`[​](#apify-datasets-rename-nameorid-newname "Direct link to heading") Change dataset name or removes name with --unname flag. ``` USAGE $ apify datasets rename NAMEORID [NEWNAME] [--unname] ARGUMENTS NAMEORID The dataset ID or name to delete. NEWNAME The new name for the dataset. FLAGS --unname Removes the unique name of the dataset. DESCRIPTION Change dataset name or removes name with --unname flag. ``` ##### `apify datasets rm DATASETNAMEORID`[​](#apify-datasets-rm-datasetnameorid "Direct link to heading") Permanently removes a dataset. ``` USAGE $ apify datasets rm DATASETNAMEORID ARGUMENTS DATASETNAMEORID The dataset ID or name to delete DESCRIPTION Permanently removes a dataset. ``` #### Key-Value Stores[​](#key-value-stores "Direct link to heading") These commands handle key-value store operations. Use them to create stores, manage key-value pairs, and handle persistent storage of arbitrary data types. ##### `apify key-value-stores`[​](#apify-key-value-stores "Direct link to heading") Manages persistent key-value storage. ``` USAGE $ apify key-value-stores DESCRIPTION Manages persistent key-value storage. Alias: kvs ``` ##### `apify key-value-stores create [KEYVALUESTORENAME]`[​](#apify-key-value-stores-create-keyvaluestorename "Direct link to heading") Creates a new key-value store on your account. ``` USAGE $ apify key-value-stores create [KEYVALUESTORENAME] [--json] ARGUMENTS KEYVALUESTORENAME Optional name for the key-value store GLOBAL FLAGS --json Format output as json. DESCRIPTION Creates a new key-value store on your account. ``` ##### `apify key-value-stores delete-value STOREID ITEMKEY`[​](#apify-key-value-stores-delete-value-storeid-itemkey "Direct link to heading") Delete a value from a key-value store. ``` USAGE $ apify key-value-stores delete-value STOREID ITEMKEY ARGUMENTS STOREID The key-value store ID to delete the value from. ITEMKEY The key of the item in the key-value store. DESCRIPTION Delete a value from a key-value store. ``` ##### `apify key-value-stores get-value KEYVALUESTOREID ITEMKEY`[​](#apify-key-value-stores-get-value-keyvaluestoreid-itemkey "Direct link to heading") Retrieves stored value for specified key. Use --only-content-type to check MIME type. ``` USAGE $ apify key-value-stores get-value KEYVALUESTOREID ITEMKEY [--only-content-type] ARGUMENTS KEYVALUESTOREID The key-value store ID to get the value from. ITEMKEY The key of the item in the key-value store. FLAGS --only-content-type Only return the content type of the specified key DESCRIPTION Retrieves stored value for specified key. Use --only-content-type to check MIME type. ``` ##### `apify key-value-stores info STOREID`[​](#apify-key-value-stores-info-storeid "Direct link to heading") Shows information about a key-value store. ``` USAGE $ apify key-value-stores info STOREID [--json] ARGUMENTS STOREID The key-value store ID to print information about. GLOBAL FLAGS --json Format output as json. DESCRIPTION Shows information about a key-value store. ``` ##### `apify key-value-stores keys STOREID`[​](#apify-key-value-stores-keys-storeid "Direct link to heading") Lists all keys in a key-value store. 
``` USAGE $ apify key-value-stores keys STOREID [--json] [--limit <value>] [--exclusive-start-key <value>] ARGUMENTS STOREID The key-value store ID to list keys for. FLAGS --exclusive-start-key=<value> The key to start the list from. --limit=<value> [default: 20] The maximum number of keys to return. GLOBAL FLAGS --json Format output as json. DESCRIPTION Lists all keys in a key-value store. ``` ##### `apify key-value-stores ls`[​](#apify-key-value-stores-ls "Direct link to heading") Lists all key-value stores on your account. ``` USAGE $ apify key-value-stores ls [--json] [--offset <value>] [--limit <value>] [--desc] [--unnamed] FLAGS --desc Sorts key-value stores in descending order. --limit=<value> [default: 20] Number of key-value stores that will be listed. --offset=<value> Number of key-value stores that will be skipped. --unnamed Lists key-value stores that don't have a name set. GLOBAL FLAGS --json Format output as json. DESCRIPTION Lists all key-value stores on your account. ``` ##### `apify key-value-stores rename KEYVALUESTORENAMEORID [NEWNAME]`[​](#apify-key-value-stores-rename-keyvaluestorenameorid-newname "Direct link to heading") Renames a key-value store, or removes its unique name. ``` USAGE $ apify key-value-stores rename KEYVALUESTORENAMEORID [NEWNAME] [--unname] ARGUMENTS KEYVALUESTORENAMEORID The key-value store ID or name to delete NEWNAME The new name for the key-value store FLAGS --unname Removes the unique name of the key-value store DESCRIPTION Renames a key-value store, or removes its unique name. ``` ##### `apify key-value-stores rm KEYVALUESTORENAMEORID`[​](#apify-key-value-stores-rm-keyvaluestorenameorid "Direct link to heading") Permanently removes a key-value store. ``` USAGE $ apify key-value-stores rm KEYVALUESTORENAMEORID ARGUMENTS KEYVALUESTORENAMEORID The key-value store ID or name to delete DESCRIPTION Permanently removes a key-value store. ``` ##### `apify key-value-stores set-value STOREID ITEMKEY [VALUE]`[​](#apify-key-value-stores-set-value-storeid-itemkey-value "Direct link to heading") Stores value with specified key. Set content-type with --content-type flag. ``` USAGE $ apify key-value-stores set-value STOREID ITEMKEY [VALUE] [--content-type <value>] ARGUMENTS STOREID The key-value store ID to set the value in. ITEMKEY The key of the item in the key-value store. VALUE The value to set. FLAGS --content-type=<value> [default: application/json] The MIME content type of the value. By default, "application/json" is assumed. DESCRIPTION Stores value with specified key. Set content-type with --content-type flag. ``` #### Request Queues[​](#request-queues "Direct link to heading") These commands manage request queues, which handle URL processing for web scraping and automation tasks. Use them to maintain lists of URLs with automatic retry mechanisms and state management. ##### `apify request-queues`[​](#apify-request-queues "Direct link to heading") Manages URL queues for web scraping and automation tasks. ``` USAGE $ apify request-queues DESCRIPTION Manages URL queues for web scraping and automation tasks. ``` ### Tasks[​](#tasks "Direct link to heading") These commands help you manage scheduled and configured Actor runs. Use them to create, modify, and execute predefined Actor configurations as tasks. ##### `apify task`[​](#apify-task "Direct link to heading") Manages scheduled and predefined Actor configurations. ``` USAGE $ apify task DESCRIPTION Manages scheduled and predefined Actor configurations. 
``` ##### `apify task run TASKID`[​](#apify-task-run-taskid "Direct link to heading") Executes predefined Actor task remotely using local key-value store for input. ``` USAGE $ apify task run TASKID [-b <value>] [-t <value>] [-m <value>] ARGUMENTS TASKID Name or ID of the Task to run (e.g. "my-task" or "E2jjCZBezvAZnX8Rb"). FLAGS -b, --build=<value> Tag or number of the build to run (e.g. "latest" or "1.2.34"). -m, --memory=<value> Amount of memory allocated for the Task run, in megabytes. -t, --timeout=<value> Timeout for the Task run in seconds. Zero value means there is no timeout. DESCRIPTION Executes predefined Actor task remotely using local key-value store for input. Customize with --memory and --timeout flags. ``` --- # Telemetry Apify collects telemetry data about the general usage of the CLI to help us improve the product. Participation in this program is optional and you may opt out if you prefer not to share any information. ## Data Collection[​](#data-collection "Direct link to heading") All telemetry data is collected and stored securely on [Mixpanel](https://mixpanel.com/). We do not collect any sensitive information such as your API token or personal information. ### Metrics Collected[​](#metrics-collected "Direct link to heading") Before a user connects to the Apify platform, we collect anonymous information about CLI usage including: * Usage of all commands * Internal attributes of the local environment (OS, shell, Node.js version, Python version, Apify CLI version) * For the `actor create` command, we identify which template was used to create the Actor (language, template name, template ID) After a user connects to the Apify platform (successful `apify login`), we collect the same information about CLI usage along with the ID of the connected user. You can read more about how we protect personal information in our [Privacy Policy](https://apify.com/privacy-policy). ## How to opt out[​](#how-to-opt-out "Direct link to heading") You can disable telemetry by setting the "APIFY\_CLI\_DISABLE\_TELEMETRY" environment variable to "1". After setting this variable, the CLI will not send any telemetry data, whether or not you are connected to Apify. --- # Troubleshooting For general support, reach out to us at [apify.com/contact](https://apify.com/contact). If you believe you are encountering a bug, file it on [GitHub](https://github.com/apify/apify-cli/issues/new). --- # Environment variables There are two ways to set up environment variables for Actors. ### Set up environment variables in `.actor/actor.json`[​](#set-up-environment-variables-in-actoractorjson "Direct link to heading") All keys from `environmentVariables` will be set as environment variables on the Apify platform after you push the Actor to Apify. Current values on the platform will be overridden. ``` { "actorSpecification": 1, "name": "dataset-to-mysql", "version": "0.1", "buildTag": "latest", "environmentVariables": { "MYSQL_USER": "my_username", "MYSQL_PASSWORD": "@mySecretPassword" } } ``` ### Set up environment variables in Apify Console[​](#set-up-environment-variables-in-apify-console "Direct link to heading") In [Apify Console](https://console.apify.com/actors), select your Actor and set up the variables in the Source tab. After setting up variables in the app, remove `environmentVariables` from `.actor/actor.json`. Otherwise, variables from `.actor/actor.json` will override the variables in the app.
``` { "actorSpecification": 1, "name": "dataset-to-mysql", "version": "0.1", "buildTag": "latest" } ``` #### How to set secret environment variables in `.actor/actor.json`[​](#how-to-set-secret-environment-variables-in-actoractorjson "Direct link to heading") CLI provides commands to manage secrets environment variables. Secrets are stored to the `~/.apify` directory. You can add a new secret using the command: ``` apify secrets add mySecretPassword pwd1234 ``` After adding a new secret you can use the secret in `.actor/actor.json`. ``` { "actorSpecification": 1, "name": "dataset-to-mysql", ... "environmentVariables": { "MYSQL_PASSWORD": "@mySecretPassword" }, ... } ``` ### Need help?[​](#need-help "Direct link to heading") To see all CLI commands simply run: ``` apify help ``` To get information about a specific command run: ``` apify help COMMAND ``` Still haven't found what you were looking for? Please go to [Apify Help center](https://www.apify.com/help) or [contact us](https://www.apify.com/contact). --- [Skip to main content](#__docusaurus_skipToContent_fallback) [![](/cli/img/apify_sdk.svg)![](/cli/img/apify_sdk_white.svg)](https://docs.apify.com) [Academy](https://docs.apify.com/academy)[Platform](https://docs.apify.com/platform) [API](https://docs.apify.com/api) * [Reference](https://docs.apify.com/api/v2) * [Client for JavaScript](https://docs.apify.com/api/client/js/) * [Client for Python](https://docs.apify.com/api/client/python/) [SDK](https://docs.apify.com/sdk) * [SDK for JavaScript](https://docs.apify.com/sdk/js/) * [SDK for Python](https://docs.apify.com/sdk/python/) [CLI](https://docs.apify.com/cli/) [Open source](https://docs.apify.com/open-source) * [Crawlee](https://crawlee.dev) * [Got Scraping](https://github.com/apify/got-scraping) * [Fingerprint Suite](https://github.com/apify/fingerprint-suite) * [Apify on GitHub](https://github.com/apify) * [Actor whitepaper](https://whitepaper.actor) [Discord](https://discord.com/invite/jyEM2PRvMU "Chat on Discord")[Get started](https://console.apify.com) [Apify CLI](https://docs.apify.com/cli/cli/.md) [Docs](https://docs.apify.com/cli/cli/docs.md)[Reference](https://docs.apify.com/cli/cli/docs/reference.md)[Changelog](https://docs.apify.com/cli/cli/docs/changelog.md)[GitHub](https://github.com/apify/apify-cli) [0.21](https://docs.apify.com/cli/cli/docs.md) * [Next](https://docs.apify.com/cli/cli/docs/next) * [0.21](https://docs.apify.com/cli/cli/docs.md) * [0.20](https://docs.apify.com/cli/cli/docs/0.20) # Apify command-line interface (CLI) ## ## Create, develop, build, and run [Apify Actors](https://docs.apify.com/actors) from your terminal. Manage the Apify platform from shell scripts. [Learn more](https://docs.apify.com/cli/cli/docs.md)[GitHub](https://ghbtns.com/github-btn.html?user=apify\&repo=apify-cli\&type=star\&count=true\&size=large) ![](/cli/assets/images/logo-blur-5206054b91a93d20690b49e1aeb1f62e.png) ``` npm i -g apify-cli ``` Apify Actors are cloud programs that can perform arbitrary web scraping, automation, or data processing job. They accept input, perform their job and generate output. While you can develop Actors in an online IDE in [Apify Console](https://console.apify.com/), for larger projects, it is more convenient to develop Actors locally on your computer using [Apify SDK](https://docs.apify.com/sdk/js/) and only push the Actors to the Apify platform during deployment. This is where the Apify CLI comes in to allow you to quickly develop locally and then deploy to the cloud with a single command. 
``` # Create your first Actor apify create my-actor # Go into the project directory cd my-actor # Run it locally apify run # Log into your Apify account and deploy it to Apify Platform apify login apify push ``` --- # The Web Actor Programming Model Whitepaper **This whitepaper describes a new concept for building serverless microapps called **_Actors_**, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud.** By [Jan Čurn](https://apify.com/jancurn), [Marek Trunkát](https://apify.com/mtrunkat), [Ondra Urban](https://apify.com/mnmkng), and the entire Apify team. **Version 0.999 (February 2025)** ## Contents <!-- toc --> - [Introduction](#introduction) * [Background](#background) * [Overview](#overview) * [Apify platform](#apify-platform) - [Basic concepts](#basic-concepts) * [Input](#input) * [Run environment](#run-environment) * [Output](#output) * [Storage](#storage) * [Integrations](#integrations) * [What Actors are not](#what-actors-are-not) - [Philosophy](#philosophy) * [UNIX programs vs. Actors](#unix-programs-vs-actors) * [Design principles](#design-principles) * [Relation to the Actor model](#relation-to-the-actor-model) * [Why the name "Actor"](#why-the-name-actor) - [Installation and setup](#installation-and-setup) * [Running on the Apify platform](#running-on-the-apify-platform) * [Node.js](#nodejs) * [Python](#python) * [Command-line interface (CLI)](#command-line-interface-cli) - [Actor programming interface](#actor-programming-interface) * [Initialization](#initialization) * [Get input](#get-input) * [Key-value store access](#key-value-store-access) * [Push results to dataset](#push-results-to-dataset) * [Exit Actor](#exit-actor) * [Environment variables](#environment-variables) * [Actor status](#actor-status) * [System events](#system-events) * [Get memory information](#get-memory-information) * [Start another Actor](#start-another-actor) * [Metamorph](#metamorph) * [Attach webhook to an Actor run](#attach-webhook-to-an-actor-run) * [Abort another Actor](#abort-another-actor) * [Reboot an Actor](#reboot-an-actor) * [Actor web server](#actor-web-server) * [Standby mode](#standby-mode) * [Migration to another server](#migration-to-another-server) * [Charging money](#charging-money) - [Actor definition files](#actor-definition-files) * [Actor file](#actor-file) * [Dockerfile](#dockerfile) * [README](#readme) * [Input schema file](#input-schema-file) * [Output schema file](#output-schema-file) * [Storage schema files](#storage-schema-files) * [Backward compatibility](#backward-compatibility) - [Development](#development) * [Local development](#local-development) * [Deployment to Apify platform](#deployment-to-apify-platform) * [Continuous integration and delivery](#continuous-integration-and-delivery) * [Actorizing existing
code](#actorizing-existing-code) - [Sharing and publishing](#sharing-and-publishing) * [Monetization](#monetization) - [Future work](#future-work) - [Links](#links) <!-- tocstop --> ## Introduction This whitepaper introduces **_Actors_**, a new language-agnostic model for building general-purpose web computing and automation programs (also known as agents, functions, or apps). The main goal for Actors is to make it easy for developers to build and ship reusable software tools, which are easy to run, integrate, and build upon. Actors are useful for building web scrapers, crawlers, automations, and AI agents. ### Background Actors were first introduced by [Apify](https://apify.com/) in late 2017, as a way to easily build, package, and ship web scraping and web automation jobs to customers. Over the years, Apify has continued to develop the concept and applied it successfully to thousands of real-world use cases in many business areas, well beyond the domain of web scraping. Building on this experience, we're releasing this whitepaper to introduce the philosophy of Actors to other developers and receive your feedback on it. We aim to establish the Actor programming model as an open standard, which will help the community to more effectively build and ship reusable software automation tools, as well as encourage new implementations of the model in other programming languages. The goal of this whitepaper is to be the North Star that shows what the Actor programming model is and what operations it should support. **But this document is not an official specification.** The specification will be an OpenAPI schema of the Actor system interface, to enable new independent implementations of both the client libraries and backend systems. This is currently a work in progress. Currently, the most complete implementation of the Actor model is provided by the Apify platform, with SDKs for [Node.js](https://sdk.apify.com/) and [Python](https://pypi.org/project/apify/), and a [command-line interface (CLI)](https://docs.apify.com/cli). Beware that the frameworks might not yet implement all the features of the Actor programming model described in this whitepaper. ### Overview Actors are serverless programs that run in the cloud. They can perform anything from simple actions such as filling out a web form or sending an email, to complex operations such as crawling an entire website, or removing duplicates from a large dataset. Actors can persist their state and be restarted, and thus they can run as short or as long as necessary, from seconds to hours, even infinitely. Basically, Actors are programs packaged as Docker images, which accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output. Actors have the following elements: - **Dockerfile** which specifies where the Actor's source code is, how to build it, and run it. - **Documentation** in a form of a README.md file. - **Input and output schemas** that describe what input the Actor requires, and what results it produces. - Access to an out-of-the-box **storage system** for Actor data, results, and files. - **Metadata** such as the Actor name, description, author, and version. The documentation and the input/output schemas make it possible for people to easily understand what the Actor does, enter the required inputs both in user interface or API, and integrate the results of the Actor into their other workflows. 
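To make the pieces above concrete, here is a minimal sketch of an Actor written with the Apify SDK for Node.js (introduced later in this whitepaper). The input fields used (`url`, `width`) are illustrative only, mirroring the screenshot example below, and the computation is a placeholder.

```js
// Minimal illustrative Actor using the Apify SDK for Node.js.
// The input fields ("url", "width") are example values, not a standard.
import { Actor } from 'apify';

await Actor.init();

// Read the JSON input object passed by the caller.
const { url = 'https://www.example.com', width = 800 } = (await Actor.getInput()) ?? {};

// Perform the Actor's job (a placeholder computation here).
const result = { url, width, processedAt: new Date().toISOString() };

// Append the result to the default dataset so it can be exported later.
await Actor.pushData(result);

await Actor.exit();
```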
Actors can easily call and interact with each other, enabling the building of more complex systems on top of simple ones. <!-- ASTRO: <Diagram horizontal={illuDiagramHoriz} vertical={illuDiagramVert} alt="Actor drawing" /> --> ![Apify Actor diagram](./img/apify-actor-drawing.png) <!-- Image sources: https://docs.google.com/presentation/d/1nDgrI0p2r8ouP_t_Wn02aTllP8_Std-kRuIbO8QLE7M/edit https://www.figma.com/design/6vbmKvB6oY3b3mTN0oAscE/Actor-Whitepaper-Diagrams-and-Presentations?node-id=0-1&p=f&t=JwAJfru2GjdQBpBV-11 --> ### Apify platform Actors can be published on the [Apify platform](https://apify.com/store), which automatically generates a rich website with documentation based on the README and a practical user interface, in order to encourage people to try the Actor right away. The Apify platform takes care of securely hosting the Actor's Docker containers and scaling the computing, storage and network resources as needed, so neither Actor developers nor the users need to deal with the infrastructure. It just works. The Apify platform provides an open API, cron-style scheduler, webhooks and [integrations](https://apify.com/integrations) to services such as Zapier or Make, which make it easy for users to integrate Actors into their existing workflows. Additionally, the Actor developers can set a price tag for the usage of their Actors, and thus earn income and have an incentive to keep developing and improving the Actor for the users. For details, see [Monetization](#monetization). ## Basic concepts This section describes core features of Actors, what they are good for, and how Actors differ from other serverless computing systems. ### Input <!-- ASTRO: <Illustration description="Each Actor accepts an input object, which tells it what it should do." position="content" image={illuBasicConceptsInput} noCaption /> --> Each Actor accepts an **input object**, which tells it what it should do. The object is passed in JSON format, and its properties have a similar role as command-line arguments when running a program in a UNIX-like operating system. For example, an input object for an Actor `bob/screenshotter` could look like this: ```json { "url": "https://www.example.com", "width": 800 } ``` The input object represents a standardized way for the caller to control the Actor's activity, whether starting it using API, user interface, CLI, or scheduler. The Actor can access the value of the input object using the [Get input](#get-input) function. In order to specify what kind of input object an Actor expects, the Actor developer can define an [Input schema file](#input-schema-file). The input schema is used by the system to generate user interface, API examples, and simplify integration with external systems. #### Example of auto-generated Actor input UI ![Screenshot Taker Input UI](./img/screenshot-taker-input.png) <!-- ASTRO: <Picture src={illuTakerInput} alt="Taker input" formats={['avif', 'webp']} /> --> ### Run environment <!-- ASTRO: <Illustration description="The Actors run within an isolated Docker container" position="right" image={illuBasicConceptsRunEnvironment} /> --> Actors run within an isolated Docker container with access to local file system and network, and they can perform arbitrary computing activity or call external APIs. The **standard output** of the Actor's program (stdout and stderr) is printed out and logged, which is useful for development and debugging. 
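As a small, hedged illustration of the run environment described above: anything the program writes to stdout or stderr ends up in the run log, and the runtime exposes the run context through environment variables such as `ACTOR_RUN_ID` (see the environment variables discussion later in this document).

```js
// Sketch: standard output and error streams are captured in the Actor run log.
import { Actor } from 'apify';

await Actor.init();

// The runtime passes run context via environment variables; locally this may be undefined.
console.log(`Run ID: ${process.env.ACTOR_RUN_ID ?? '(running locally)'}`);
console.error('Messages written to stderr are captured in the log as well.');

await Actor.exit();
```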
To inform the users about the progress, the Actors might set a [status message](#actor-status), which is then displayed in the user interface and also available via API. A running Actor can also launch a [web server](#actor-web-server), which is assigned a unique local or public URL to receive HTTP requests. For example, this is useful for messaging and interaction between Actors, for running request-response REST APIs, or providing a full-featured website. Actors can store their working data or results into specialized **storages** called [Key-value store](#key-value-store) and [Dataset](#dataset) storages, from which they can be easily exported using API or integrated in other Actors. ### Output <!-- ASTRO: <Illustration description="The Actors can generate an output object, which is a standardized way to display, consume, and integrate Actors' results." position="right" image={illuBasicConceptsOutput} /> --> While the input object provides a standardized way to invoke Actors, Actors can also generate an **output object**, which is a standardized way to display, consume, and integrate Actors' results. Actor results are typically fully available only after the Actor run finishes, but the consumers of the results might want to access partial results during the run. Therefore, Actors don't generate the output object in their code, but they define an [Output schema file](#output-schema-file), which contains instruction how to generate such output object automatically. You can define how the Actor output looks using the [Output schema file](#output-schema-file). The system uses this information to automatically generate an immutable JSON file, which tells users where to find the results produced by the Actor. The output object is stored by the system to the Actor run object under the `output` property, and returned via API immediately when the Actor is started, without the need to wait for it to finish or generate the actual results. This is useful to automatically generate UI previews of the results, API examples, and integrations. <div class="clear-both" /> The output object is similar to the input object, as it contains properties and values. For example, for the `bob/screenshotter` Actor, the output object could look like this: ```jsonc { "screenshotUrl": "https://api.apify.com/v2/key-value-stores/skgGkFLQpax59AsFD/records/screenshot.jpg", "productImages": "https://api.apify.com/v2/key-value-stores/skgGkFLQpax59AsFD/records/product*.jpg", "productDetails": "https://api.apify.com/datasets/9dFknjkxxGkspwWd/records?fields=url,name", "productExplorer": "https://bob--screenshotter.apify.actor/product-explorer", // or this with live view "productExplorer": "https://13413434.runs.apify.net/product-explorer" } ``` ### Storage <!-- ASTRO: <Illustration description="The Actor system provides two specialized storages that can be used by Actors for storing files and results: Key-value store and Dataset" position="content" image={illuBasicConceptsStorage} noCaption /> --> The Actor system provides two specialized storages that can be used by Actors for storing files and results: **key-value store** and **dataset**, respectively. For each Actor run, the system automatically creates so-called **default storages** of both these types in empty state and makes them readily available for the Actor. Alternatively, a caller can request reusing existing storage when starting a new Actor run. 
This is similar to redirecting standard input in UNIX, and it is useful if you want an Actor to operate on an existing key-value store or dataset instead of creating a new one. <!-- TODO: The above feature is not implemented yet --> Besides these so-called **default storages**, which are created or linked automatically, Actors are free to create new storages or access existing ones, either by ID or a name that can be set for them (e.g. `bob/screenshots`). The [input schema file](#input-schema-file) and [output schema file](#output-schema-file) provide special support for referencing these storages, in order to simplify linking an output of one Actor to an input of another. The storages are also accessible through an API and SDK externally, for example, to download results when the Actor finishes. Note that Actors are free to access any other external storage system through a third-party API, e.g. an SQL database or a vector database. #### Key-value store <!-- ASTRO: <Illustration description="The key-value store is a simple data storage that is used for saving and reading files or data records" position="right" image={illuBasicConceptsStorageKeyValueStore} /> --> The key-value store is a simple data storage that is used for saving and reading files or data records. The records are represented by a unique text key and the data associated with a MIME content type. Key-value stores are ideal for saving things like screenshots, web pages, PDFs, or to persist the state of Actors e.g. as a JSON file. Each Actor run is associated with a default empty key-value store, which is created exclusively for the run, or alternatively with an existing key-value store if requested by the user on Actor start. The [Actor input](#input) is stored as JSON file into the default key-value store under the key defined by the `ACTOR_INPUT_KEY` environment variable (usually `INPUT`). The Actor can read this input object using the [Get input](#get-input) function. An Actor can read and write records to key-value stores using the API. For details, see [Key-value store access](#key-value-store-access). An Actor can define a schema for the key-value store to ensure files stored in it conform to certain rules. For details, see [Storage schema files](#storage-schema-files). #### Dataset <!-- ASTRO: <Illustration description="The dataset is an append-only storage that allows you to store a series of data objects such as results from web scraping, crawling, or data processing jobs." position="right" image={illuBasicConceptsStorageDataset} /> --> The dataset is an append-only storage that allows you to store a series of data objects such as results from web scraping, crawling, or data processing jobs. You or your users can then export the dataset to formats such as JSON, CSV, XML, RSS, Excel, or HTML. The dataset represents a store for structured data where each object stored has the same attributes, such as online store products or real estate offers. You can imagine it as a table, where each object is a row and its attributes are columns. Dataset is an append-only storage — you can only add new records to it, but you cannot modify or remove existing records. Typically, it is used to store an array or collection of results, such as a list of products or web pages. An Actor can define a schema for the Dataset to ensure objects stored in it conform to certain rules. For details, see [Storage schema files](#storage-schema-files). ### Integrations <!-- ASTRO: <Illustration description="Actors are designed for interoperability. 
Thanks to the input and output schemas, it's easy to connect Actors with external systems, be it directly via REST API, Node.js or Python clients, CLI, or no-code automations." position="content" image={illuBasicConceptsIntegrations} noCaption /> --> **Actors are designed for interoperability.** Thanks to the input and output schemas, it's easy to connect Actors with external systems, be it directly via REST API, Node.js or Python clients, CLI, or no-code automations. From the schema files, the system can automatically generate API documentation, OpenAPI specification, and validate inputs and outputs, simplifying their integration with other systems. Furthermore, Actors can interact with each other, for example [start other Actors](#start-another-actor), attach [Webhooks](#attach-webhook-to-an-actor-run) to process the results, or [Metamorph](#metamorph) into another Actor to have it finish the work. ### What Actors are not Actors are best suited for compute operations that take an input, perform an isolated job for a user, and potentially produce some output. For long-running jobs, Actor execution might be migrated from one server to another, making Actors unsuitable for running dependable storage workloads such as SQL databases. As Actors are based on Docker, it takes a certain amount of time to spin up the container and launch its main process. Doing this for every small HTTP transaction (e.g. API call) is not efficient, even for highly-optimized Docker images. However, Actor [Standby mode](#standby-mode) enables an Actor to run as a web server, to process small API requests more effectively. ## Philosophy Actors are inspired by the **[UNIX philosophy](https://en.wikipedia.org/wiki/Unix_philosophy)** from the 1970s, adapted to the age of the cloud: 1. **Make each program do one thing well**. To do a new job, build afresh rather than complicate old programs by adding new “features”. 2. Expect the **output of every program to become the input to another, as yet unknown, program**. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input. 3. Design and build software, even operating systems, to be **tried early**, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them. 4. **Use tools in preference to unskilled help** to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them. The UNIX philosophy is arguably one of the most important software engineering paradigms which, together with other favorable design choices of UNIX operating systems, ushered in the computer and internet revolution. By combining smaller parts that can be developed and used independently (programs), it suddenly became possible to build, manage and gradually evolve ever more complex computing systems. Even today's modern mobile devices are effectively UNIX-based machines that run a lot of programs interacting with each other, and provide a terminal which looks very much like early UNIX terminals. In fact, the terminal is just another program. UNIX-style programs represent a great way to package software for usage on a local computer. The programs can easily be used stand-alone, but also in combination and in scripts in order to perform much more complex tasks than an individual program ever could, which in turn can be packaged as new programs.
The idea of Actors is to bring the benefits of UNIX-style programs from a local computer to a cloud environment where programs run on multiple computers communicating over a network that is subject to latency and partitioning, there is no global atomic filesystem, and where programs are invoked via API calls rather than system calls. Each Actor should do just one thing and do it well. Actors can be used stand-alone, as well as combined or scripted into more complex systems, which in turn can become new Actors. Actors provide a simple user interface and documentation to help users interact with them. ### UNIX programs vs. Actors The following table shows the equivalents of key concepts of UNIX programs and Actors. | UNIX programs | Actors | |----------------------------|------------------------------------------------------------------------------------------------------------------------------------| | Command-line options | [Input object](#get-input) | | Read stdin | No direct equivalent, you can [read from a dataset](#dataset) specified in the input. | | Write to stdout | [Push results to dataset](#push-results-to-dataset), set [Actor status](#actor-status) | | Write to stderr | No direct equivalent, you can write errors to log, set error status message, or push failed dataset items into an "error" dataset. | | File system | [Key-value store](#key-value-store-access) | | Process identifier (PID) | Actor run ID | | Process exit code | [Actor exit code](#exit-actor) | ### Design principles - Each Actor should do just one thing, and do it well. - Optimize for the users of the Actors, help them understand what the Actor does, easily run it, and integrate. - Also optimize for interoperability, to make it ever easier to connect Actors with other systems. Expect objects you work with to contain additional not-yet-known fields. - Keep the API as simple as possible and write great documentation, so that Actors can be built and used by >90% of software developers, even ones using no-code tools (yes, that's also software development!). ### Relation to the Actor model Note that Actors are only loosely related to the **Actor model** in computer science introduced by the 1973 paper by Carl Hewitt. According to [Wikipedia](https://en.wikipedia.org/wiki/Actor_model): > The Actor model in computer science is a mathematical model of concurrent computation > that treats Actor as the universal primitive of concurrent computation. > In response to a message it receives, an Actor can: make local decisions, > create more Actors, send more messages, and determine how to respond to the > next message received. Actors may modify their own private state, > but can only affect each other indirectly through messaging > (removing the need for lock-based synchronization). While the theoretical Actor model is conceptually very similar to "our" Actor programming model, this similarity is rather coincidental. Our primary focus was always on practical software engineering utility, not an implementation of a formal mathematical model. For example, our Actors do not provide any standard message passing mechanism, but they can communicate together directly via HTTP requests (see [Actor web server](#actor-web-server)), manipulate each other's operation via the Apify platform API (e.g. abort another Actor), or affect each other by sharing some internal state or storage. Actors do not have any formal restrictions, and they can access whichever external systems they want, thus going beyond the formal mathematical Actor model. 
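To illustrate this loose coupling, the following hedged Node.js sketch shows one Actor invoking another through the Apify SDK and platform API; `bob/screenshotter` is the illustrative Actor name used elsewhere in this whitepaper, and `Actor.call()` is the SDK helper for starting another Actor and waiting for its run to finish.

```js
// Sketch: Actors affect each other through the platform API rather than a formal
// message-passing mechanism. Here one Actor starts another and waits for its run.
import { Actor } from 'apify';

await Actor.init();

// Start the other Actor with an input object and wait for its run to complete.
const run = await Actor.call('bob/screenshotter', { url: 'https://www.example.com', width: 800 });
console.log(`Run ${run.id} finished with status ${run.status}`);

await Actor.exit();
```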
### Why the name "Actor"

In movies and theater, an _actor_ is someone who gets a script and plays a role according to that script. Our Actors also perform an act on someone's behalf, using a provided script. They work well with Puppeteers and Playwrights. To make it clear that Actors are not people, the letter "A" is capitalized.

<!-- ASTRO: <Illustration description="Actors" position="content" image={illuPhilosophyWhyTheName} noCaption /> -->

## Installation and setup

Below are the steps to start building Actors in various languages and environments.

### Running on the Apify platform

You can develop and run Actors in [Apify Console](https://console.apify.com/actors) without installing any software locally. Just create a free Apify account, and start building Actors in an online IDE.

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

### Node.js

The most complete implementation of the Actor system is provided by the Apify SDK for Node.js, via the [apify](https://www.npmjs.com/package/apify) NPM package. The package contains everything that you need to start building Actors locally. You can install it into your Node.js project by running:

```bash
$ npm install apify
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="Python"> -->

### Python

To build Actors in Python, simply install the Apify SDK for Python, via the [apify](https://pypi.org/project/apify/) PyPI package, into your project:

```bash
$ pip3 install apify
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

### Command-line interface (CLI)

For local development of Actors and management of the Apify platform, it is handy to install the Apify CLI. You can install it with:

```bash
$ brew install apify-cli
```

or via the [apify-cli](https://www.npmjs.com/package/apify-cli) Node.js package:

```bash
$ npm install -g apify-cli
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

You can confirm the installation succeeded and log in to the Apify platform by running:

```bash
$ apify login
```

The Apify CLI provides two commands: `apify` and `actor`. The `apify` command lets you interact with the Apify platform, for example run an Actor, push the deployment of an Actor to the cloud, or access storages. For details, see [Local development](#local-development).

The `actor` command is meant to be used from within a running Actor, to implement the Actor's functionality in a shell script. For details, see [Actorizing existing code](#actorizing-existing-code).

To get help for a specific command, run:

```bash
$ apify help <command>
$ actor help <command>
```

## Actor programming interface

The commands described in this section are expected to be called from within the context of a running Actor, both in the local environment and on the Apify platform. The Actor runtime system passes the context via [environment variables](#environment-variables), such as `APIFY_TOKEN` or `ACTOR_RUN_ID`, which are used by the SDK or CLI to interact with the runtime.

### Initialization

The SDKs provide convenience methods to initialize the Actor and handle its results. During initialization, the SDK loads environment variables, checks the configuration, prepares to receive system events, and optionally purges previous state from local storage.

<div class="clear-both" />

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

In Node.js, the Actor is initialized by calling the `init()` method. It should be paired with an `exit()` method, which terminates the Actor.
Use of `exit()` is not required, but it is recommended. For more information, see [Exit Actor](#exit-actor).

```js
import { Actor } from 'apify';

await Actor.init();

const input = await Actor.getInput();
console.log(input);

await Actor.exit();
```

An alternative way of initializing the Actor is with a `main()` function. This is useful in environments where the latest JavaScript syntax and top-level await are not supported. The `main()` function is only syntactic sugar for `init()` and `exit()`: it calls `init()` before executing its callback, and `exit()` after the callback resolves.

```js
import { Actor } from 'apify';

Actor.main(async () => {
    const input = await Actor.getInput();
    // ...
});
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="Python"> -->

#### Python

```python
import asyncio
from apify import Actor

async def main():
    async with Actor:
        input = await Actor.get_input()
        print(input)

asyncio.run(main())
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

No initialization is needed; the process exit terminates the Actor, with the process exit code determining whether it succeeded or failed.

```bash
$ actor set-status-message "My work is done, friend"
$ exit 0
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```c
int main (int argc, char *argv[]) {
  ...
}
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Get input

<!-- ASTRO: <Illustration description="The input object is passed by the user and stored in the Actor's default key-value store. The input is an object with properties. If the Actor defines the input schema, the input object is guaranteed to conform to it." position="right" image={illuAPIGetInput} noCaption /> -->

Get access to the Actor input object passed by the user. It is parsed from a JSON file, which is stored by the system in the Actor's default key-value store. Usually the file is called `INPUT`, but the exact key is defined in the `ACTOR_INPUT_KEY` [environment variable](#environment-variables).

The input is an object with properties. If the Actor defines the input schema, the input object is guaranteed to conform to it. For details, see [Input](#input).

<div class="clear-both" />

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

```js
const input = await Actor.getInput();
console.log(input);

// prints: { "option1": "aaa", "option2": 456 }
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="Python"> -->

#### Python

```python
input = await Actor.get_input()
print(input)
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

```bash
# Emits a JSON object, which can be parsed e.g. using the "jq" tool
$ actor get-input | jq

> { "option1": "aaa", "option2": 456 }
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```bash
$ command --option1=aaa --option2=bbb
```

```c
int main (int argc, char *argv[]) {}
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Key-value store access

<!-- ASTRO: <Illustration description="Write and read arbitrary files using a storage called Key-value store. When an Actor starts, by default it is associated with a newly-created key-value store, which only contains one file with the input of the Actor."
position="right" image={illuAPIKeyValueStoreAccess} noCaption /> --> Write and read arbitrary files using a storage called [Key-value store](https://sdk.apify.com/docs/api/key-value-store). When an Actor starts, by default it is associated with a newly-created key-value store, which only contains one file with the input of the Actor (see [Get input](#get-input)). The user can override this behavior and specify another key-value store or input key when running the Actor. <div class="clear-both" /> <!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js // Save objects to the default key-value store await Actor.setValue('my_state', { something: 123 }); // (stringified to JSON) await Actor.setValue('screenshot.png', buffer, { contentType: 'image/png' }); // Get record from the default key-value store, automatically parsed from JSON const value = await Actor.getValue('my_state'); // Access another key-value store by its name const store = await Actor.openKeyValueStore('screenshots-store'); const imageBuffer = await store.getValue('screenshot.png'); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="Python"> --> #### Python ```python # Save object to store (stringified to JSON) await Actor.set_value('my-state', { 'something': 123 }) # Save binary file to store with content type await Actor.set_value('screenshot', buffer, content_type='image/png') # Get object from store (automatically parsed from JSON) state = await Actor.get_value('my-state') ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="UNIX"> --> #### UNIX ```bash $ echo "hello world" > file.txt $ cat file.txt ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> ### Push results to dataset <!-- ASTRO: <Illustration description="Larger results can be saved to append-only object storage called Dataset. When an Actor starts, by default it is associated with a newly-created empty default dataset. The Actor can create additional datasets or access existing datasets created by other Actors, and use those as needed." position="right" image={illuAPIPush} noCaption /> --> Larger results can be saved to append-only object storage called [Dataset](https://sdk.apify.com/docs/api/dataset). When an Actor starts, by default it is associated with a newly-created empty default dataset. The Actor can create additional datasets or access existing datasets created by other Actors, and use those as needed. Note that datasets can optionally be equipped with schema that ensures only certain kinds of objects are stored in them. See [Dataset schema file](./pages/DATASET_SCHEMA.md) for more details. 
<div class="clear-both" /> <!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js // Append result object to the default dataset associated with the run await Actor.pushData({ someResult: 123, }); // Append result object to a specific named dataset const dataset = await Actor.openDataset('bob/poll-results-2019'); await dataset.pushData({ someResult: 123 }); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="Python"> --> #### Python ```python # Append result object to the default dataset associated with the run await Actor.push_data({ 'some_result': 123 }) # Append result object to a specific named dataset dataset = await Actor.open_dataset('bob/poll-results-2019') await dataset.push_data({ 'some_result': 123 }) ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="CLI"> --> #### CLI ```bash # Push data to default dataset, in JSON format $ echo '{ "someResult": 123 }' | actor push-data --json $ actor push-data --json='{ "someResult": 123 }' $ actor push-data --json=@result.json # Push data to default dataset, in text format $ echo "someResult=123" | actor push-data $ actor push-data someResult=123 # Push to a specific dataset in the cloud $ actor push-data --dataset=bob/election-data someResult=123 # Push to dataset on local system $ actor push-data --dataset=./my_dataset someResult=123 ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="UNIX equivalent"> --> #### UNIX equivalent ```c printf("Hello world\tColum 2\tColumn 3"); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> ### Exit Actor When the main Actor process exits and the Docker container stops running, the Actor run is considered finished and the process exit code is used to determine whether the Actor has succeeded (exit code `0` leads to status `SUCCEEDED`) or failed (exit code not equal to `0` leads to status `FAILED`). In the event of a non-zero exit code, the system automatically sets the Actor [status message](#actor-status) to something like `Actor exited with code 0`, and it might attempt to restart the Actor to recover from the error, depending on the system and Actor configuration. A preferred way to exit an Actor intentionally is using the `exit` or `fail` functions in SDK, as shown below. This has several advantages: - You can provide a custom status message for users to tell them what the Actor achieved, or why it failed and how they can fix it. This greatly improves user experience. - When using `fail` to fail the Actor, the system considers the error permanent and will not attempt to restart the Actor. - The SDK emits the `exit` event, which can be listened to and used by various components of the Actor to perform a cleanup, persist state, etc. Note that the caller of exit can specify how long should the system wait for all `exit` event handlers to complete before closing the process, using the `timeoutSecs` option. For details, see [System Events](#system-events). <!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js // Actor will finish with 'SUCCEEDED' status await Actor.exit('Succeeded, crawled 50 pages'); // Exit right away without calling `exit` handlers at all await Actor.exit('Done right now', { timeoutSecs: 0 }); // Actor will finish with 'FAILED' status await Actor.exit('Could not finish the crawl, try increasing memory', { exitCode: 1 }); // ... 
// ... or a nicer way, using this syntactic sugar:
await Actor.fail('Could not finish the crawl, try increasing memory');

// Register a handler to be called on exit.
// Note that the handler has `timeoutSecs` to finish its job
Actor.on('exit', ({ statusMessage, exitCode, timeoutSecs }) => {
    // Perform cleanup...
});
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="Python"> -->

#### Python

```python
# Actor will finish in 'SUCCEEDED' state
await Actor.exit('Generated 14 screenshots')

# Actor will finish in 'FAILED' state
await Actor.exit('Could not finish the crawl, try increasing memory', exit_code=1)

# ... or a nicer way, using this syntactic sugar:
await Actor.fail('Could not finish the crawl, try increasing memory')
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

```bash
# Actor will finish in 'SUCCEEDED' state
$ actor exit
$ actor exit --message "Email sent"

# Actor will finish in 'FAILED' state
$ actor exit --code=1 --message "Couldn't fetch the URL"
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```c
exit(1);
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Environment variables

Actors have access to standard process environment variables. The Apify platform uses environment variables prefixed with `ACTOR_` to pass information to Actors about the execution context.

| Environment variable | Description |
|---|---|
| `ACTOR_ID` | ID of the Actor. |
| `ACTOR_FULL_NAME` | Full technical name of the Actor, in the format `owner-username/actor-name`. |
| `ACTOR_RUN_ID` | ID of the Actor run. |
| `ACTOR_BUILD_ID` | ID of the Actor build. |
| `ACTOR_BUILD_NUMBER` | A string representing the version of the current Actor build. |
| `ACTOR_BUILD_TAGS` | A comma-separated list of tags of the Actor build used in the run. Note that this environment variable is assigned at the time of start of the Actor and doesn't change over time, even if the assigned build tags change. |
| `ACTOR_TASK_ID` | ID of the saved Actor task. |
| `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | ID of the key-value store where the Actor's input and output data are stored. |
| `ACTOR_DEFAULT_DATASET_ID` | ID of the dataset where you can push the data. |
| `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | ID of the request queue that stores and handles requests that you enqueue. |
| `ACTOR_INPUT_KEY` | The key of the record in the default key-value store that holds the Actor input. Typically it's `INPUT`, but it might be something else. |
| `ACTOR_MEMORY_MBYTES` | Indicates the size of memory allocated for the Actor run, in megabytes (1,000,000 bytes). It can be used by Actors to optimize their memory usage. |
| `ACTOR_STARTED_AT` | Date when the Actor was started, in ISO 8601 format. For example, `2022-01-02T03:04:05.678`. |
| `ACTOR_TIMEOUT_AT` | Date when the Actor will time out, in ISO 8601 format. |
| `ACTOR_EVENTS_WEBSOCKET_URL` | WebSocket URL where the Actor may listen for events from the Apify platform. See [System events](#system-events) for details. |
| `ACTOR_WEB_SERVER_PORT` | TCP port on which the Actor can start an HTTP server to receive messages from the outside world, either as the [Actor web server](#actor-web-server) or in the [Standby mode](#standby-mode). |
| `ACTOR_WEB_SERVER_URL` | A unique hard-to-guess URL under which the current Actor run's web server is accessible from the outside world. See the [Actor web server](#actor-web-server) section for details. |
| `ACTOR_STANDBY_URL` | A general public URL under which the Actor can be started and its web server accessed in the [Standby mode](#standby-mode). |
| `ACTOR_MAX_PAID_DATASET_ITEMS` | The maximum number of results that will be charged to the user of a pay-per-result Actor. |
| `ACTOR_MAX_TOTAL_CHARGE_USD` | The maximum amount of money in USD an Actor can charge its user. See [Charging money](#charging-money) for details. |

The Actor developer can also define custom environment variables that are then passed to the Actor process both in the local development environment and on the Apify platform. These variables are defined in the [Actor file](#actor-file) at `.actor/actor.json` using the `environmentVariables` directive, or manually in the user interface in Apify Console.

The environment variables can be set as secure in order to protect sensitive data such as API keys or passwords. The value of a secure environment variable is encrypted and can only be retrieved by the Actors during their runs, but not outside of them. Furthermore, values of secure environment variables are omitted from the log.

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

For convenience, rather than using environment variables directly, we provide a `Configuration` class that allows reading and updating the Actor configuration.

```javascript
const token = Actor.config.get('token');

// Use a different token
Actor.config.set('token', 's0m3n3wt0k3n');
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

```bash
$ echo "$ACTOR_RUN_ID started at $ACTOR_STARTED_AT"
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```bash
$ echo $ACTOR_RUN_ID
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Actor status

Each Actor run has a status (the `status` field), which indicates its stage in the Actor's lifecycle. The status can be one of the following values:

<div class="clear-both" />

| Status | Type | Description |
|---|---|---|
| `READY` | initial | Started but not allocated to any worker yet |
| `RUNNING` | transitional | Executing on a worker |
| `SUCCEEDED` | terminal | Finished successfully |
| `FAILED` | terminal | Run failed |
| `TIMING-OUT` | transitional | Timing out now |
| `TIMED-OUT` | terminal | Timed out |
| `ABORTING` | transitional | Being aborted by a user or system |
| `ABORTED` | terminal | Aborted by a user or system |

Additionally, the Actor run has a status message (the `statusMessage` field), which contains text informing users what the Actor is currently doing, thus greatly improving their user experience. When an Actor exits, the status message is either automatically set to some default text (e.g. "Actor finished with exit code 1"), or to a custom message (see [Exit Actor](#exit-actor) for details).

While the Actor is running, it should periodically update the status message as follows, to keep users informed. The function can be called as often as necessary; to simplify usage, the SDK only invokes the API when the status message changes.
<!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js await Actor.setStatusMessage('Crawled 45 of 100 pages'); // Setting status message to other Actor externally is also possible await Actor.setStatusMessage('Everyone is well', { actorRunId: 123 }); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="Python"> --> #### Python ```python await Actor.set_status_message('Crawled 45 of 100 pages') ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="CLI"> --> #### CLI ```bash $ actor set-status-message "Crawled 45 of 100 pages" $ actor set-status-message --run=[RUN_ID] --token=X "Crawled 45 of 100 pages" ``` Convention: The end user of an Actor should never need to look into the log to understand what happened, e.g. why the Actor failed. All necessary information must be set by the Actor in the status message. <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> ### System events Actors are notified by the system about various events such as a migration to another server, [abort operation](#abort-another-actor) triggered by another Actor, or the CPU being overloaded. Currently, the system sends the following events: | Event name | Payload | Description | | -------------- | ------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `cpuInfo` | `{ isCpuOverloaded: Boolean }` | The event is emitted approximately every second and it indicates whether the Actor is using the maximum of available CPU resources. If that’s the case, the Actor should not add more workload. For example, this event is used by the AutoscaledPool class. | | `migrating` | N/A | Emitted when the Actor running on the Apify platform is going to be migrated to another worker server soon. You can use it to persist the state of the Actor and abort the run, to speed up migration. See [Migration to another server](#migration-to-another-server). | | `aborting` | N/A | When a user aborts an Actor run on the Apify platform, they can choose to abort gracefully to allow the Actor some time before getting killed. This graceful abort emits the `aborting` event which the SDK uses to gracefully stop running crawls and you can use it to do your own cleanup as well. | | `persistState` | `{ isMigrating: Boolean }` | Emitted in regular intervals (by default 60 seconds) to notify all components of Apify SDK that it is time to persist their state, in order to avoid repeating all work when the Actor restarts. This event is automatically emitted together with the migrating event, in which case the `isMigrating` flag is set to `true`. Otherwise the flag is `false`. Note that the `persistState` event is provided merely for user convenience, you can achieve the same effect using `setInterval()` and listening for the `migrating` event. | In the future, the event mechanism might be extended to custom events and messages enabling communication between Actors. Under the hood, Actors receive system events by connecting to a web socket address specified by the `ACTOR_EVENTS_WEBSOCKET_URL` environment variable. 
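For illustration, an Actor implemented without the SDKs could consume these events directly from that web socket. A rough sketch in Node.js, assuming the third-party `ws` package is available (the event handling shown is illustrative):

```js
import WebSocket from 'ws';

// The events web socket is only available when the Actor runs on the platform.
const ws = new WebSocket(process.env.ACTOR_EVENTS_WEBSOCKET_URL);

ws.on('message', (message) => {
    // Each message is a JSON-encoded event with "name", "createdAt", and an optional "data" payload.
    const event = JSON.parse(message.toString());
    if (event.name === 'migrating') {
        // Persist state and prepare for the migration...
    }
});
```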
The system sends messages in JSON format with the following structure:

```js
{
    // Event name
    name: String,

    // Time when the event was created, in ISO format
    createdAt: String,

    // Optional object with payload
    data: Object,
}
```

Note that some events (e.g. `persistState`) are not sent by the system via the web socket, but are generated virtually at the Actor SDK level.

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

```js
// Add an event handler
const handler = (data) => {
    if (data.isCpuOverloaded) console.log('Oh no, we need to slow down!');
};
Actor.on('systemInfo', handler);

// Remove all handlers for a specific event
Actor.off('systemInfo');

// Remove a specific event handler
Actor.off('systemInfo', handler);
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="Python"> -->

#### Python

```python
from apify import Actor, Event

# Add an event handler
async def handler(data):
    if data.cpu_info.is_overloaded:
        print('Oh no, we need to slow down!')

Actor.on(Event.SYSTEM_INFO, handler)

# Remove all handlers for a specific event
Actor.off(Event.SYSTEM_INFO)

# Remove a specific event handler
Actor.off(Event.SYSTEM_INFO, handler)
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```c
signal(SIGINT, handle_sigint);
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Get memory information

Get information about the total and available memory of the Actor’s container or local system. This is useful, for example, to auto-scale a pool of workers used for crawling large websites.

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

```js
const memoryInfo = await Actor.getMemoryInfo();
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```bash
# Print memory usage of programs
$ ps aux
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Start another Actor

<!-- ASTRO: <Illustration description="Actor can start other Actors, if they have permission. It can override the default dataset or key-value store, and e.g. forwarding the data to another named dataset, that will be consumed by the other Actor." position="content" image={illuAPIStartAnother} noCaption /> -->

An Actor can start other Actors, if it has the permission to do so. The Actor can override the default dataset or key-value store and, for example, forward the data to another named dataset that will be consumed by the other Actor. The `call` operation waits for the other Actor to finish, while the `start` operation returns immediately.

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

```js
// Start Actor and return a Run object
const run = await Actor.start(
    'apify/google-search-scraper', // name of the Actor to start
    { queries: 'test' },           // input of the Actor
    { memory: 2048 },              // run configuration
);

// Start Actor and wait for it to finish
const run2 = await Actor.call(
    'apify/google-search-scraper',
    { queries: 'test' },
    { memory: 2048 },
);
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

```bash
# On stdout, the commands emit the Actor run object (in text or JSON format).
# The commands shouldn't wait for the run to finish; for that, there should be e.g. an "execute" command.
"execute" $ apify call apify/google-search-scraper queries='test\ntest2' \ countryCode='US' $ apify call --json apify/google-search-scraper '{ "queries": }' $ apify call --input=@data.json --json apify/google-search-scraper $ apify call --memory=1024 --build=beta apify/google-search-scraper $ apify call --output-record-key=SCREENSHOT apify/google-search-scraper # Pass input from stdin $ cat input.json | actor call apify/google-search-scraper --json # Call local actor during development $ apify call file:../some-dir someInput='xxx' ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="Slack"> --> #### Slack It will also be possible to run Actors from the Slack app. The following command starts the Actor, and then prints the messages to a Slack channel. ``` /apify start bob/google-search-scraper startUrl=afff ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="API"> --> #### API ``` [POST] https://api.apify.com/v2/actors/apify~google-search-scraper/run [POST|GET] https://api.apify.com/v2/actors/apify~google-search-scraper/run-sync? token=rWLaYmvZeK55uatRrZib4xbZs& outputRecordKey=OUTPUT returnDataset=true ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="UNIX equivalent"> --> #### UNIX equivalent ```bash # Run a program in the background $ command <arg1>, <arg2>, … & ``` ```c // Spawn another process posix_spawn(); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> ### Metamorph <!-- ASTRO: <Illustration description="Metamorph is the most magical Actor operation. It replaces running Actor’s Docker image with another Actor, similarly to UNIX `exec` command. It is used for building new Actors on top of existing ones. You simply define input schema and write README for a specific use case, and then delegate the work to another Actor." position="content" image={illuAPIMetamorph} noCaption /> --> This is the most magical Actor operation. It replaces a running Actor’s Docker image with another Actor, similarly to UNIX `exec` command. It is used for building new Actors on top of existing ones. You simply define the input schema and write README for a specific use case, and then delegate the work to another Actor. The target Actor inherits the default storages used by the calling Actor. The target Actor input is stored to the default key-value store, under a key such as `INPUT-2` (the actual key is passed via the `ACTOR_INPUT_KEY` [environment variable](#environment-variables)). Internally, the target Actor can recursively metamorph into another Actor. An Actor can metamorph only to Actors that have compatible output schema as the main Actor, in order to ensure logical and consistent outcomes for users. If the output schema of the target Actor is not compatible, the system should throw an error. 
<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

```js
await Actor.metamorph(
    'bob/web-scraper',
    { startUrls: [ "https://www.example.com" ] },
    { memoryMbytes: 4096 },
);
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

```bash
$ actor metamorph bob/web-scraper startUrls=http://example.com

$ actor metamorph --input=@input.json --json --memory=4096 \
    bob/web-scraper
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```bash
$ exec /bin/bash
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Attach webhook to an Actor run

Run another Actor or call an external HTTP API endpoint after an Actor run finishes or fails.

<div class="clear-both" />

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

```js
await Actor.addWebhook({
    eventTypes: ['ACTOR.RUN.SUCCEEDED', 'ACTOR.RUN.FAILED'],
    requestUrl: 'http://api.example.com?something=123',
    payloadTemplate: `{
        "userId": {{userId}},
        "createdAt": {{createdAt}},
        "eventType": {{eventType}},
        "eventData": {{eventData}},
        "resource": {{resource}}
    }`,
});
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

```bash
$ actor add-webhook \
    --event-types=ACTOR.RUN.SUCCEEDED,ACTOR.RUN.FAILED \
    --request-url=https://api.example.com \
    --payload-template='{ "test": 123 }'

$ actor add-webhook --event-types=ACTOR.RUN.SUCCEEDED \
    --request-actor=apify/send-mail \
    --memory=4096 --build=beta \
    --payload-template=@template.json

# Or maybe have a simpler API for self-actor?
$ actor add-webhook --event-types=ACTOR.RUN.SUCCEEDED --request-actor=apify/send-mail
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```bash
# Execute commands sequentially, based on their exit status
$ command1; command2   # (command separator)
$ command1 && command2 # (AND operator)
$ command1 || command2 # (OR operator)
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Abort another Actor

An Actor can abort itself or another Actor running on the Apify platform. Aborting an Actor changes its [status](#actor-status) to `ABORTED`.

<div class="clear-both" />

<!-- ASTRO: <CodeSwitcher> -->
<!-- ASTRO: <CodeExample title="Node.js"> -->

#### Node.js

```js
await Actor.abort({ statusMessage: 'Your job is done, friend.', actorRunId: 'RUN_ID' });
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="CLI"> -->

#### CLI

```bash
$ actor abort --run-id RUN_ID
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: <CodeExample title="UNIX equivalent"> -->

#### UNIX equivalent

```bash
# Terminate a program
$ kill <PID>
```

<!-- ASTRO: </CodeExample> -->
<!-- ASTRO: </CodeSwitcher> -->

### Reboot an Actor

<!-- ASTRO: <Illustration description="Sometimes, an Actor might get into some error state from which it's not safe or possible to recover, e.g. an assertion error or a web browser crash. Rather than crashing and potentially failing the user job, the Actor can reboot its own Docker container and continue work from its previously persisted state." position="right" image={illuAPIReboot} noCaption /> -->

Sometimes, an Actor might get into some error state from which it's not safe or possible to recover, e.g. an assertion error or a web browser crash. Rather than crashing and potentially failing the user's job, the Actor can reboot its own Docker container and continue work from its previously persisted state.
Normally, if an Actor crashes, the system also restarts its container, but if that happens too often in a short period of time, the system might completely [abort](#actor-status) the Actor run. The reboot operation can be used by the Actor developer to indicate that this is a controlled operation, and not to be considered by the system as a crash. <div class="clear-both" /> <!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js await Actor.reboot(); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="Python"> --> #### Python ```python await Actor.reboot() ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="CLI"> --> #### CLI ```bash $ actor reboot ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> ### Actor web server An Actor can launch an HTTP web server that is exposed to the outside world to handle requests. This enables Actors to provide a custom HTTP API to integrate with other systems, to provide a web application for human users, to show Actor run details, diagnostics, charts, or to run an arbitrary web app. The port on which the Actor can launch the web server is specified by the `ACTOR_WEB_SERVER_PORT` environment variable. Once the web server is started, it is exposed to the public internet on a **live view URL** identified by the `ACTOR_WEB_SERVER_URL`, for example: ``` https://hard-to-guess-identifier.runs.apify.net ``` The live view URL has a unique hostname, which is practically impossible to guess. This lets you keep the web server hidden from the public yet accessible from the external internet by any parties with whom you share the URL. <!-- ASTRO: <Illustration description="An Actor can launch an HTTP web server that is exposed to the outside world to handle requests. This enables Actors to provide a custom HTTP API to integrate with other systems, to provide a web application for human users, to show Actor run details, diagnostics, charts, or to run an arbitrary web app." position="left" image={illuAPIWebServer} noCaption /> --> <div class="clear-both" /> <!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js const express = require('express'); const app = express(); app.get('/', (req, res) => { res.send('Hello World!') }) app.listen(process.env.ACTOR_WEB_SERVER_PORT, () => { console.log(`Example live view web server running at ${process.env.ACTOR_WEB_SERVER_URL}`) }) ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> ### Standby mode Actor Standby mode lets Actors run in the background and respond to incoming HTTP requests, like a web or API server. Starting an Actor run requires launching a Docker container, and so it comes with a performance penalty, sometimes many seconds for large images. For batch jobs this penalty is negligible, but for quick request-response interactions it becomes inefficient. Standby mode lets developers run Actors as web servers to run jobs that require quick response times. To use Standby mode, start an HTTP web server at the `ACTOR_WEB_SERVER_PORT` TCP port, and process HTTP requests. The Actor system publishes a Standby Actor's web server at a URL reported in the `ACTOR_STANDBY_URL` environment variable, and will automatically start or abort an Actor run as needed by the volume of HTTP requests or system load. 
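For example, a minimal Standby-mode Actor in Node.js might look like the following sketch, which uses the built-in `http` module; the request handling is illustrative:

```js
import http from 'http';
import { Actor } from 'apify';

await Actor.init();

// The same port is used as for the regular Actor web server; in Standby mode,
// the system routes requests arriving at the public Standby URL to this server.
const server = http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ message: 'Hello from a Standby Actor', path: req.url }));
});

server.listen(process.env.ACTOR_WEB_SERVER_PORT);
```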
The external Standby public URL might look like this:

```
https://bob--screenshotter.apify.actor
```

Unlike the live view URL reported in the `ACTOR_WEB_SERVER_URL` environment variable, the Standby URL is the same for all runs of the Actor, and it's intended to be publicly known. The Actor system can perform authentication of the requests going to the Standby URL using API tokens.

Currently, the specific Standby mode settings, authentication options, or OpenAPI schema are not part of this Actor specification, but they might be introduced in the future as new settings in the `actor.json` file.

### Migration to another server

Actors, especially long-running ones, can be migrated from one host server to another from time to time. When a migration is imminent, the system sends the Actor the `migrating` [system event](#system-events), so that the Actor can persist its state to storages. All executed writes to the default Actor [storage](#storage) are guaranteed to be persisted before the migration. After the migration, the Actor is restarted on a new host, and it can restore its state from the storages again.

### Charging money

<!-- ASTRO: <Illustration description="To run an Actor on the Apify platform or another cloud platform, a user typically needs to pay to cover the computing costs. Additionally, the platforms are free to introduce other monetization mechanisms, such as charging the users a fixed monthly fee for 'renting' the Actor, or a variable fee for the number of results produced by the Actor. These charging mechanisms are beyond the scope of this whitepaper." position="right" image={illuSharingChargingMoney} noCaption /> -->

To run an Actor on the Apify platform or another cloud platform, a user typically needs to pay to cover the computing costs. Additionally, the platforms are free to introduce other [monetization](#monetization) mechanisms, such as charging the users a fixed monthly fee for "renting" the Actor, or a variable fee for the number of results produced by the Actor. These charging mechanisms are beyond the scope of this whitepaper.

On top of these external monetization systems, Actors provide a built-in monetization system that enables developers to charge users variable amounts per event, e.g. based on the number of returned results, the complexity of the input, or the cost of external APIs used internally by the Actor.

An Actor can dynamically charge the current user a specific amount of money by calling the `charge` function. Users of Actors can limit the maximum amount to be charged by the Actor using the `maxTotalChargeUsd` run option, which is then passed to the Actor using the `ACTOR_MAX_TOTAL_CHARGE_USD` environment variable. The Actor can call the `charge` function as many times as necessary, but once the total sum of charged amounts would exceed this maximum limit, the invocation of the function throws an error.

When a paid Actor subsequently starts another paid Actor, the charges performed by the subsequent Actors are taken from the calling Actor's allowance. This enables an Actor economy, where Actors hierarchically pay other Actors or external APIs to perform parts of the job.

An Actor developer can charge the current user of an Actor a specific amount of USD as follows.
<!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js const chargeInfo = await Actor.charge({ eventName: 'gpt-4o-token', count: 1000, chargePerEventUsd: 0.0001, }); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="Python"> --> #### Python ```python charge_info = await Actor.charge( event_name='gpt-4o-token', count=1000, charge_per_event_usd=0.0001 ) ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="CLI"> --> #### CLI ```bash $ actor charge gpt-4o-token \ --count=1000 --chargePerEventUsd=0.0001 ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> An Actor user can specify the maximum amount they are willing to pay when starting an Actor. <!-- ASTRO: <CodeSwitcher> --> <!-- ASTRO: <CodeExample title="Node.js"> --> #### Node.js ```js const run = await Actor.call( 'bob/analyse-images', { imageUrls: ['https://www.example.com/image.png'] }, { // By default this is 0, hence Actors cannot charge users unless they explicitly allow that. maxTotalChargeUsd: 5, }, ); ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="Python"> --> #### Python ```python run = await Actor.call( 'bob/analyse-images' , {'imageUrls': ['https://www.example.com/image.png']}, max_total_charge_usd=5 ) ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: <CodeExample title="CLI"> --> #### CLI ```bash $ actor call bob/analyse-images \ --input='{"imageUrls": ["https://www.example.com/image.png"]}' --max-total-charge-usd=5 ``` <!-- ASTRO: </CodeExample> --> <!-- ASTRO: </CodeSwitcher> --> #### Rules for building Actors with variable charging If your Actor is charging users, you need to make sure at the earliest time possible that the Actor is being run with sufficient credits with respect to its input. If the maximum credits specified by the `ACTOR_MAX_TOTAL_CHARGE_USD` environment variable is not sufficient for the Actor's operation with respect to the input (e.g. user is requesting too many results for too little money), the Actor must fail immediately with an explanatory error status message for the user, and not charge the user anything. You must also charge users only **after** you have incurred the costs, not before. If an Actor fails in the middle of a run or is aborted, the users only need to be charged for results they actually received. Nothing will make users of your Actors angrier than charging them for something they didn't receive. ## Actor definition files The Actor system uses several special files that define Actor metadata, documentation, instructions how to build and run it, input and output schema, etc. **These files MUST be stored in the `.actor` directory placed in Actor's top-level directory. The entire `.actor` directory should be added to the source control.** The only required files are [Actor file](#actor-file) and [Dockerfile](#dockerfile), all other files are optional. The Actor definition files are used by the CLI (e.g. by `apify push` and `apify run` commands), as well as when building Actors on the Apify platform. The motivation to place the files into a separate directory is to keep the source code repository tidy and to prevent interactions with other source files, in particular when creating an Actor from pre-existing software repositories. ### Actor file This is the main definition file of the Actor, and it must always be present at `.actor/actor.json`. 
This file is in JSON format and contains a single object, whose properties define the main features of the Actor and link to all other necessary files. For details, see the [Actor file specification](./pages/ACTOR_FILE.md) page. #### Example Actor file at `.actor/actor.json` ```json { "actorSpecification": 1, "name": "screenshotter", "title": "Screenshotter", "description": "Take a screenshot of any URL", "version": "0.0", "inputSchema": "./input_schema.json", "outputSchema": "./output_schema.json", "dockerfile": "./Dockerfile" } ``` ### Dockerfile This file contains instructions for the system on how to build the Actor's Docker image and how to run it. Actors are started by running their Docker image, both locally using the `apify run` command as well as on the Apify platform. The Dockerfile is referenced from the [Actor file](#actor-file) using the `dockerfile` directive, and is typically stored at `.actor/Dockerfile`. Note that paths in Dockerfile are always specified relative to the Dockerfile's location. Learn more about Dockerfiles in the official [Docker reference](https://docs.docker.com/engine/reference/builder/). #### Example Dockerfile of an Actor ```dockerfile # Specify the base Docker image. You can read more about # the available images at https://crawlee.dev/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node-playwright-chrome:22-1.46.0 AS builder # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install all dependencies. Don't audit to speed up the installation. RUN npm install --include=dev --audit=false # Next, copy the source files using the user set # in the base image. COPY --chown=myuser . ./ # Install all dependencies and build the project. # Don't audit to speed up the installation. RUN npm run build # Create final image FROM apify/actor-node-playwright-firefox:22-1.46.0 # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false \ && npm install --omit=dev --omit=optional \ && echo "Installed NPM packages:" \ && (npm list --omit=dev --all || true) \ && echo "Node.js version:" \ && node --version \ && echo "NPM version:" \ && npm --version \ && rm -r ~/.npm # Install all required Playwright dependencies for Firefox RUN npx playwright install firefox # Copy built JS files from builder image COPY --from=builder --chown=myuser /home/myuser/dist ./dist # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser . ./ # Run the image. If you know you won't need headful browsers, # you can remove the XVFB start script for a micro perf gain. CMD ./start_xvfb_and_run_cmd.sh && ./run_protected.sh npm run start:prod --silent ``` ### README The README file contains Actor documentation written in [Markdown](https://docs.github.com/en/github/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax). It should contain a great explanation of what the Actor does and how to use it. The README file is used to generate an Actor's public web page on Apify and for other purposes. 
The README file is referenced from the [Actor file](#actor-file) using the `readme` property, and is typically stored at `.actor/README.md`.

Good documentation makes good Actors. [Read the Apify Actor marketing playbook](https://apify.notion.site/3fdc9fd4c8164649a2024c9ca7a2d0da?v=6d262c0b026d49bfa45771cd71f8c9ab) for tips on how to write great READMEs and market Actors.

### Input schema file

<!-- ASTRO: <Illustration description="Actors accept an input JSON object on start, whose schema can be defined by the input schema file. This file is referenced in the Actor file as the `input` property. It is a standard JSON Schema file with our extensions, and it is typically stored at .actor/input_schema.json." position="right" image={illuDefinitionFilesInputSchemaFile} noCaption /> -->

Actors accept an [input](#input) JSON object on start, whose schema can be defined by the input schema file. This file is referenced from the Actor file (`.actor/actor.json`) as the `input` property. It is a standard JSON Schema file with our extensions, and it is typically stored at `.actor/input_schema.json`.

The input schema file defines the properties accepted by the Actor on input. It is used by the system to:

- Validate the passed input JSON object on Actor run, so that Actors don't need to perform input validation and error handling in their code.
- Render a user interface for the Actor to make it easy for users to run and test it manually.
- Generate Actor API documentation and integration code examples on the web or in the CLI, making Actors easy to integrate for users.
- Simplify integration of Actors into automation workflows such as Zapier or Make, by providing smart connectors that pre-populate and link Actor input properties.

For details, see the [Actor input schema file specification](./pages/INPUT_SCHEMA.md).

<div class="clear-both" />

This is an example of the input schema file for the `bob/screenshotter` Actor:

```json
{
    "actorInputSchemaVersion": 1,
    "title": "Input schema for Screenshotter Actor",
    "description": "Enter a web page URL and it will take its screenshot with a specific width",
    "type": "object",
    "properties": {
        "url": {
            "title": "URL",
            "type": "string",
            "editor": "textfield",
            "description": "URL of the webpage"
        },
        "width": {
            "title": "Viewport width",
            "type": "integer",
            "description": "Width of the browser window.",
            "default": 1200,
            "minimum": 1,
            "unit": "pixels"
        }
    },
    "required": [ "url" ]
}
```

### Output schema file

<!-- ASTRO: <Illustration description="Similarly to input, Actors can generate an output JSON object, which links to their results. The Actor output schema file defines what such an output object looks like, including the types of its properties and their descriptions. This file is referenced in the Actor file as the `output` property. It is a standard JSON Schema file with our extensions, and it is typically stored at .actor/output_schema.json." position="right" image={illuDefinitionFilesOutputSchemaFile} noCaption /> -->

Similarly to input, Actors can generate an [output](#output) JSON object, which links to their results. The Actor output schema file defines what such an output object looks like, including the types of its properties and their descriptions. This file is referenced from the Actor file (`.actor/actor.json`) as the `output` property. It is a standard JSON Schema file with our extensions, and it is typically stored at `.actor/output_schema.json`.
The output schema describes how the Actor stores its results, and it is used by other systems to:

- Generate API documentation for users of Actors, so they can figure out where to find the results.
- Publish an OpenAPI specification to make it easy for callers of Actors to figure out where to find the results.
- Enable integrating Actors with external systems and automated workflows.

For details, see the [Actor output schema file specification](./pages/OUTPUT_SCHEMA.md).

<div class="clear-both" />

This is an example of the output schema file for the `bob/screenshotter` Actor:

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema for Screenshotter Actor",
    "description": "The URL to the resulting screenshot",
    "properties": {
        "screenshotUrl": {
            "type": "string",
            "title": "Web page screenshot",
            "resourceType": "file",
            "template": "{{actorRun.defaultKeyValueStoreUrl}}/screenshot.png"
        }
    }
}
```

### Storage schema files

Both the main Actor file and the input and output schema files can additionally reference schema files for specific storages. These files have custom JSON-based formats, see:

- [Dataset schema file](./pages/DATASET_SCHEMA.md)
- [Key-value store schema file](./pages/KEY_VALUE_STORE_SCHEMA.md)
- [Request queue schema file](./pages/REQUEST_QUEUE_SCHEMA.md)

These storage schemas are used to ensure that stored objects or files fulfil specific criteria, that their fields have certain types, etc. On the Apify platform, the schemas can be applied to the storages directly, without Actors.

Note that all the storage schemas are weak, in the sense that if the schema doesn't define a property, such a property can still be added to the storage and have an arbitrary type. Only properties explicitly mentioned by the schema are validated. This is an important feature which allows extensibility. For example, a data deduplication Actor might require input datasets whose objects have a `uuid: String` field, but it does not care about any other fields.

### Backward compatibility

If the `.actor/actor.json` file is missing, the system falls back to the legacy mode, and looks for `apify.json`, `Dockerfile`, `README.md` and `INPUT_SCHEMA.json` files in the Actor's top-level directory instead. This behavior might be deprecated in the future.

## Development

Actors can be developed locally, using a git integration, or in a web IDE. SDKs are currently available for Node.js and Python, along with a CLI.

### Local development

<!-- ASTRO: <Illustration description="The Actor programming model is language agnostic, but the framework has native support for detection of the JavaScript and Python languages." position="right" image={illuDevelopmentLocal} noCaption /> -->

The Actor programming model is language agnostic, but the framework has native support for detection of the JavaScript and Python languages.

Tip: [Apify CLI](https://docs.apify.com/cli/docs/next/reference#apify-create-actorname) provides [convenient templates](https://apify.com/templates) to bootstrap an Actor in Python, JavaScript, and TypeScript.

This example describes how to create a simple "echo" Actor locally. The Actor will retrieve the [Input Object](#input) and [push](#push-results-to-dataset) it to the default [dataset](#dataset).

#### Bootstrap the Actor directory

The `actor bootstrap` CLI command will automatically generate the `.actor` directory and configuration files:

```bash
$ actor bootstrap
? Actor name: actor-test
Success: The Actor has been initialized in the current directory.

$ tree -a .
|-- .actor
|   `-- actor.json
|-- .gitignore
`-- storage
    |-- datasets
    |   `-- default
    |-- key_value_stores
    |   `-- default
    |       `-- INPUT.json
    `-- request_queues
        `-- default
```

The command works on a best-effort basis, creating the necessary configuration files for the specific programming language and libraries. Note: this command is not yet available and represents a future vision for the CLI.

#### Add the Actor code

```
$ cat << 'EOF' > Dockerfile
FROM node:alpine
RUN npm -g install apify-cli
CMD actor push-data $(actor get-input)
EOF
```

#### Run to test the Actor locally

```
$ echo '{"bar": "foo"}' | actor run -o -s
[{ "bar": "foo" }]
```

`apify run` starts the Actor using the Dockerfile referenced from `.actor/actor.json`, or the Dockerfile in the Actor's top-level directory (if the former is not present).

### Deployment to Apify platform

The `apify push` CLI command takes information from the `.actor` directory and builds an Actor on the Apify platform, so that you can run it remotely.

```bash
$ apify login
? Choose how you want to log in to Apify (Use arrow keys)
❯ Through Apify Console in your default browser

$ apify push
```

<!-- ASTRO: <Illustration description="The `apify push` CLI command takes information from the `.actor` directory and builds an Actor on the Apify platform, so that you can run it remotely." position="left" image={illuDevelopmentDeployment} noCaption /> -->

### Continuous integration and delivery

The source code of Actors can be hosted on external source control systems like GitHub or GitLab, and integrated into CI/CD pipelines. The implementation details, as well as details of the Actor build and version management process, are beyond the scope of this whitepaper.

### Actorizing existing code

You can repackage many existing software repositories as an Actor by creating the `.actor/` directory with the [Actor definition files](#actor-definition-files), and providing a Dockerfile with instructions on how to run the software. The `actor` CLI command can be used from the Dockerfile's `CMD` script to transform the Actor JSON input into the configuration of the software, usually passed via command-line arguments, and then to store the Actor's output results.

This example wraps the [`curl`](https://curl.se/docs/tutorial.html) UNIX command and pushes the result to the Actor's [key-value store](#key-value-store-access):

```dockerfile
FROM alpine/curl:latest

# Install Node.js to the Alpine Docker image
COPY --from=node:current-alpine /usr/lib /usr/lib
COPY --from=node:current-alpine /usr/local/lib /usr/local/lib
COPY --from=node:current-alpine /usr/local/include /usr/local/include
COPY --from=node:current-alpine /usr/local/bin /usr/local/bin

# Install the Actor CLI
RUN npm -g install apify-cli

CMD curl $(actor get-input) | actor set-value example-com --contentType text/html
```

Actorization of existing code gives developers an easy way to give their code a presence in the cloud in the form of an Actor, so that users can easily try it without having to install and manage it locally.

## Sharing and publishing

Once an Actor is developed, the Actor platform lets you share it with specific other users, and decide whether you want to make its source code open or closed. You can also publish the Actor for anyone to use on a marketplace like [Apify Store](https://apify.com/store). The Actor will get a public landing page like `https://apify.com/bob/screenshotter`, showing its README, description of inputs, outputs, API examples, etc.
Once published, your Actor is automatically exposed to organic traffic from users and potential customers.

![Apify Actor Store](./img/apify-store.png)

<!-- ASTRO: <Picture src={illuApifyStore} alt="Apify Store" formats={['avif', 'webp']} /> -->

### Monetization

<!-- ASTRO: <Illustration description="The monetization options for Actors include fixed rental fee, payment per result, or payment per event." position="right" image={illuSharingMonetization} noCaption /> -->

To build a SaaS product, you usually need to:

1. Develop the product
2. Write documentation
3. Find and buy a domain name
4. Set up a website
5. Set up cloud infrastructure where it runs and scales
6. Handle payments, billing, and taxes
7. Marketing (content, ads, SEO, and more)
8. Sales (demos, procurement)

Building software as an Actor and deploying it to the Apify platform changes this to:

1. Develop the Actor
2. Write the README
3. Publish the Actor on Apify Store

Packaging your software as Actors makes it faster to launch new small SaaS products and then earn income on them, using various monetization options, e.g. a fixed rental fee, payment per result, or payment per event (see [Charging money](#charging-money)). This monetization gives developers an incentive to further develop and maintain their Actors.

Actors provide a new way for software developers like you to monetize your skills, bringing the creator economy model to SaaS. For more details, read our essay [Make passive income developing web automation Actors](https://blog.apify.com/make-regular-passive-income-developing-web-automation-Actors-b0392278d085/).

## Future work

The goal of this whitepaper is to introduce the Actor philosophy and programming model to other developers, to receive feedback, and to open the way to making Actors an open standard.

To create an open standard, we need to:

- Define a standardized low-level HTTP REST API interface for the Actor system, to separate "frontend" and "backend" Actor programming model implementations. For example, if somebody wants to build support for the Actor programming model in Rust, they should just need to write a Rust "frontend" translating the commands to HTTP API calls, rather than having to implement the entire system. And equally, if one decides to develop a new Actor "backend", all existing client libraries for Rust or other languages should work with it.
- Finalize the specification of all the schema files, including [output](#output-schema-file) and [storage](#storage-schema-files) schema files.
- Clearly separate what is part of the standard and what is up to the discretion of the implementations.

## Links

- [Apify Store](https://apify.com/store)
- [Monetize your Actor on Apify](https://apify.com/partners/actor-developers)
- [Open-source Actors on GitHub](https://github.com/search?q=path%3A.actor%2Factor.json+&type=code)

# Actor file specification

This JSON file must be present at `.actor/actor.json` and defines core properties of a single web Actor. The file contains a single JSON object with the following properties:

```jsonc
{
  // Required field, indicates that this is an Actor definition file and the specific version of the Actor specification.
  "actorSpecification": 1,

  // Required "technical" name of the Actor, must be DNS hostname-friendly text.
  "name": "google-search-scraper",

  // Human-friendly name and description of the Actor.
  "title": "Google Search Scraper",
  "description": "A 200-char description",

  // Required, indicates the version of the Actor.
  // Since the actor.json file is committed to Git, you can have different Actor versions in different branches.
  "version": "0.0",

  // Optional tag that is applied to the builds of this Actor. If omitted, it defaults to "latest".
  "buildTag": "latest",

  // An optional object with environment variables expected by the Actor.
  // Secret values are prefixed by @ and their actual values need to be registered with the CLI, for example:
  // $ apify secrets add mySecretPassword pwd1234
  "environmentVariables": {
    "MYSQL_USER": "my_username",
    "MYSQL_PASSWORD": "@mySecretPassword"
  },

  // Optional field. If true, the Actor indicates it can be run in Standby mode,
  // i.e. started and kept alive by the system so that the Actor's web server can handle incoming HTTP REST requests.
  "usesStandbyMode": true,

  // An optional metadata object enabling implementations to pass arbitrary additional properties.
  // The properties and their values must be strings.
  "labels": {
    "something": "bla bla"
  },

  // Optional minimum and maximum memory for running the Actor.
  "minMemoryMbytes": 128,
  "maxMemoryMbytes": 4096,

  // When the user doesn't specify memory when starting an Actor run, the system will use this amount.
  // The goal of this feature is to optimize user experience vs. compute costs.
  // The value might reference properties of the Actor run object (e.g. `{{actorRun.options.maxTotalChargeUsd}}`)
  // or Actor input (e.g. `{{actorRun.input}}`), similar to the output schema. It can also use basic arithmetic expressions.
  // The value will be clamped between `minMemoryMbytes` and `maxMemoryMbytes` (if provided), and rounded up to the nearest higher power of 2.
  // If the variable is undefined or empty, the behavior is undefined and the system will select memory arbitrarily.
  // In the future, we might change this behavior.
  "defaultMemoryMbytes": "{{actorRun.input.maxParallelRequests}} * 256 + 128",

  // Optional link to the Actor Dockerfile.
  // If omitted, the system looks for "./Dockerfile" or "../Dockerfile"
  "dockerfile": "./Dockerfile",

  // Optional link to the Actor README file in Markdown format.
  // If omitted, the system looks for "./ACTOR.md" and "../README.md"
  "readme": "./README.md",

  // Optional link to the Actor changelog file in Markdown format.
  "changelog": "../../../shared/CHANGELOG.md",

  // Optional link to an Actor input or output schema file, or an inlined schema object,
  // which is a JSON schema with our extensions. For details, see ./INPUT_SCHEMA.md or ./OUTPUT_SCHEMA.md, respectively.
  // BACKWARDS COMPATIBILITY: "inputSchema" used to be called "input", all implementations should support this.
  "inputSchema": "./input_schema.json",
  "outputSchema": "./output_schema.json",

  // Optional path to a Dataset or Key-value store schema file, or an inlined schema object, for the Actor's default dataset or key-value store.
  // For details, see ./DATASET_SCHEMA.md or ./KEY_VALUE_STORE_SCHEMA.md, respectively.
  // BACKWARDS COMPATIBILITY: "datasetSchema" used to be "storages.keyValueStore" sub-object, all implementations should support this.
  "datasetSchema": "../shared_schemas/generic_dataset_schema.json",
  "keyValueStoreSchema": "./key_value_store_schema.json",

  // Optional path or inlined schema object of the Actor's web server in OpenAPI format.
  "webServerSchema": "./web_server_openapi.json",

  // Optional URL path and query parameters to the Model Context Protocol (MCP) server exposed by the Actor web server.
  // If present, the system knows the Actor provides an MCP server, which can be used by the platform
  // and integrations to integrate the Actor with various AI/LLM systems.
  "webServerMcpPath": "/mcp?version=2",

  // Scripts can be used by tools like the CLI to do certain actions based on the commands you run.
  // The presence of this object in your Actor config is optional, but we recommend always defining at least the `run` key.
  "scripts": {
    // The `run` script is special - it defines *the* way to run your Actor locally. While tools can decide
    // to implement mechanisms to detect what type of project your Actor is, and how to run it, you can choose to
    // define this as the source of truth.
    //
    // This should be the same command you run as if you were at the root of your Actor when you start it locally.
    // This can be anything from an npm script, as shown below, to a full chain of commands (ex.: `cargo test && cargo run --release`).
    //
    // CLIs may opt to also request this command when initializing a new Actor, or to automatically migrate and add it the first time
    // you start the Actor locally.
    "run": "npm start"
  }
}
```

## Notes

- The `name` doesn't contain the developer username, so that the Actor can be easily deployed to any user account. This is useful for tutorials and examples, as well as pull requests done externally to create Actors from existing source code files owned by external developers (the developer might not have an Apify account yet, and we might want to show them deployment to some testing account). Note that `apify push` has an option `--target=eva/my-actor:0.0` that allows deployment of the Actor under a different user account, using the permissions and personal API token of the current user. We should also add options to override only parts of this, like `--target-user` (ID or username), `--name`, `--build-tag` and `--version`; it would be useful e.g. in CI for beta versions etc.
- Note that `version` and `buildTag` are shared across Actor deployments to all user accounts, similarly to software libraries, and hence they are part of `actor.json`.
- The `dockerfile` property points to a Dockerfile that is to be used to build the Actor image. If not present, the system looks for a Dockerfile in the `.actor` directory and, if not found, then in the Actor's top-level directory. This setting is useful if the source code repository has some other Dockerfile in the top-level directory, to separate the Actor Docker image from the other one. Note that paths in the Dockerfile are ALWAYS relative to the Dockerfile's location. When calling `apify run`, the system runs the Actor using the Dockerfile.
- When calling `apify push` and the `title` or `description` are already set on the Actor (maybe SEO-optimized versions from a copywriter), by default we do not overwrite them unless `apify push` is called with the `--force-title` or `--force-description` options.

## Changes from the legacy `apify.json` file

The `.actor/actor.json` replaces the legacy `apify.json` file. Here are the main changes from the previous version:

- We removed the `template` property as it's not needed for anything; it only stored the original template.
- There's a new `title` field for a human-readable name of the Actor. We're moving towards having human-readable names shown for Actors everywhere, so it makes sense to define `title` directly in the source code.
- Similarly, we added `description` for the short description of what the Actor does.
- `env` was renamed to `environmentVariables` for more clarity.
  - `apify build` or `apify run` could have an option `--apply-env-vars-to-build`, like we have on the platform.
- The `dockerfile` and `readme` directives are optional; the system falls back to reasonable defaults, first in the `.actor` directory and then in the top-level directory.
- The `scripts` section was added.

# Dataset schema file specification 1.0

Dataset storage enables you to sequentially store and retrieve data records, in various formats. Each Actor run is assigned its own dataset, which is created when the first item is stored to it. Datasets usually contain results from web scraping, crawling or data processing jobs. The data can be visualized as a table where each object is a row and its attributes are the columns. The data can be exported in JSON, CSV, XML, RSS, Excel, or HTML formats.

The specification is also available at https://docs.apify.com/platform/actors/development/actor-definition/output-schema

A dataset can be assigned a schema which describes:

- Content of the dataset, i.e., the schema of objects that are allowed to be added
- Different views on how we can look at the data, aka transformations
- Visualization of the view using predefined components (grid, table, ...), which improves the run view interface in Apify Console and also provides a better interface for datasets shared by Apify users

<img src="https://user-images.githubusercontent.com/594801/147474979-a224008c-8cba-43a6-8d2e-c24f6b0d5b37.png" width="500">

<!-- ASTRO: <Picture src={illuDatasetSchema} alt="Dataset schema" formats={['avif', 'webp']} /> -->

## Basic properties

- Storage is immutable. I.e., if you want to change the structure, then you need to create a new dataset.
- Its schema is weak. I.e., you can always push additional properties, but the schema will ensure that all the listed ones are there with the correct type. This is to make Actors more compatible, i.e., an Actor may expect the dataset to contain certain fields but not care about the additional ones.

There are two ways to create a dataset with a schema:

1. The user can start an Actor that has the dataset schema linked from its [OUTPUT_SCHEMA.json](./OUTPUT_SCHEMA.md)
2. Or the user can do it programmatically via the API (for an empty dataset),
   - either by passing the schema as the payload to the [create dataset](https://docs.apify.com/api#/reference/datasets/dataset-collection/create-dataset) API endpoint,
   - or by using the SDK:

```js
const dataset = await Apify.openDataset('my-new-dataset', { schema });
```

By opening an **existing** dataset with the `schema` parameter, the system ensures that you are opening a dataset that is compatible with the Actor; otherwise, you get an error:

```
Uncaught Error: Dataset schema is not compatible with the provided schema
```

## Structure

```jsonc
{
  "actorDatasetSchemaVersion": 1,
  "title": "E-shop products",
  "description": "Dataset containing the whole product catalog including prices and stock availability.",

  // A JSON schema object describing the dataset fields, with our extensions: the "title", "description", and "example" properties.
  // "example" is used to generate code and API examples for the Actor output.
  // For details, see https://docs.apify.com/platform/actors/development/actor-definition/dataset-schema
  "fields": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "The name of the results",
      },
      "imageUrl": {
        "type": "string",
        "description": "Image of the item",
      },
      "priceUsd": {
        "type": "integer",
        "description": "Price of the item",
      },
      "manufacturer": {
        "type": "object",
        "properties": {
          "title": { ... },
          "url": { ... },
        }
      },
      ...
    },
    "required": ["title"],
  },

  // Define the ways to present the dataset to users
  "views": {
    "overview": {
      "title": "Products overview",
      "description": "Displays only basic fields such as title and price",
      "transformation": {
        "flatten": ["stockInfo"],
        "fields": [
          "title",
          "imageUrl",
          "variants"
        ]
      },
      "display": {
        "component": "table",
        "properties": {
          "title": {
            "label": "Title"
          },
          "imageUrl": {
            "label": "Image",
            // Optional. In this case, the format is overridden to show an "image" instead of an image link as "text".
            // The "image" format only works with .jpeg, .png or other image format URLs.
            "format": "image"
          },
          "stockInfo.availability": {
            "label": "Availability"
          }
        }
      }
    },
    "productVariants": {
      "title": "Product variants",
      "description": "Each product expanded into item per variant",
      "transformation": {
        "fields": [
          "title",
          "price",
          "productVariants"
        ],
        "unwind": "productVariants"
      },
      "display": {
        // Simply renders all the available fields.
        // This component is used by default when no display is specified.
        "component": "table"
      }
    }
  },
}
```

## DatasetSchema object definition

| Property           | Type          | Required | Description                                                                                                       |
| ------------------ | ------------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
| actorSpecification | integer       | true     | Specifies the version of the dataset schema <br/>structure document. <br/>Currently only version 1 is available.   |
| fields             | JSON schema   | true     | JSON schema object defining the dataset fields, <br/>with more formats possibly added in the future.               |
| views              | [DatasetView] | true     | An array of objects with a description of an API <br/>and UI views.                                                |

### JSON schema

Items of a dataset can be described by a JSON schema definition, passed into the `fields` property. The Actor system then ensures that each record added to the dataset complies with the provided schema.

```jsonc
{
  "type": "object",
  "required": [
    "name",
    "email"
  ],
  "properties": {
    "id": {
      "type": "string"
    },
    "name": {
      "type": "string"
    },
    "email": {
      "type": "string"
    },
    "arr": {
      "type": "array",
      "items": {
        "type": "object",
        "required": [],
        "properties": {
          "site": {
            "type": "string"
          },
          "url": {
            "type": "string"
          }
        }
      }
    }
  }
}
```

### DatasetView object definition

| Property       | Type                      | Required | Description                                                                                                      |
| -------------- | ------------------------- | -------- | ----------------------------------------------------------------------------------------------------------------- |
| title          | string                    | true     | The title is visible in the UI in the Output tab <br/>as well as in the API.                                       |
| description    | string                    | false    | The description is only available in the API response. <br/>The usage of this field is optional.                   |
| transformation | ViewTransformation object | true     | The definition of the data transformation <br/>applied when dataset data are loaded from <br/>the Dataset API.     |
| display        | ViewDisplay object        | true     | The definition of the Output tab UI visualization. |
### ViewTransformation object definition

| Property | Type     | Required | Description                                                                                                                                                                                                                  |
| -------- | -------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| fields   | string[] | true     | Selects fields that are going to be presented in the output. <br/>The order of fields matches the order of columns <br/>in the visualization UI. In case the fields value <br/>is missing, it will be presented as "undefined" in the UI. |
| unwind   | string   | false    | Deconstructs nested children into the parent object, <br/>e.g.: with `unwind: ["foo"]`, the object `{"foo": {"bar": "hello"}}` <br/>is turned into `{"bar": "hello"}`.                                                          |
| flatten  | string[] | false    | Transforms a nested object into a flat structure, <br/>e.g.: with `flatten: ["foo"]`, the object `{"foo": {"bar": "hello"}}` <br/>is turned into `{"foo.bar": "hello"}`.                                                        |
| omit     | string   | false    | Removes the specified fields from the output. <br/>Nested field names can be used there as well.                                                                                                                                |
| limit    | integer  | false    | The maximum number of results returned. <br/>Default is all results.                                                                                                                                                            |
| desc     | boolean  | false    | By default, results are sorted in ascending order based <br/>on the write event into the dataset. The `desc: true` parameter <br/>will return the newest writes to the dataset first.                                           |

### ViewDisplay object definition

| Property   | Type   | Required | Description                                                                                                                                                                                                                   |
| ---------- | ------ | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| component  | string | true     | Only the component "table" is available.                                                                                                                                                                                        |
| properties | Object | false    | Object with keys matching the `transformation.fields` <br/>and ViewDisplayProperty as values. In case properties are not set, <br/>the table will be rendered automatically with fields formatted as Strings, <br/>Arrays or Objects. |

### ViewDisplayProperty object definition

| Property | Type                                                               | Required | Description                                                                                                      |
| -------- | ------------------------------------------------------------------ | -------- | ------------------------------------------------------------------------------------------------------------------ |
| label    | string                                                              | false    | In case the data are visualized in the Table view, <br/>the label will be visible in the table column's header.     |
| format   | enum(text, number, date, link, <br/>boolean, image, array, object)  | false    | Describes how output data values are formatted <br/>in order to be rendered in the Output tab UI.                   |

# Sandbox for various ideas

Here you can find random ideas and notes, in no particular order, relevance, or promise they will be implemented.
## TODOs

- Add ideas for the permission system
- Note from Marek regarding permissions:
  - Just a note on this, I was thinking about how this could be done systematically, so dropping the notes here:
  - By default, the Actor should have the following permissions that the user would accept when running the Actor for the first time:
    - Write to all the default + named storages linked in the output schema
    - Proxy - simply because we want all the traffic to run thru the proxy, so we don't want Actors scraping directly
  - In `actor.json` the Actor could request additional permissions, basically anything from [permissions](https://docs.apify.com/access-rights/list-of-permissions#actor-task), for example, `DATASET.READ` to be able to read all the datasets or `SCHEDULER.WRITE` to manage schedules.

    There is one tricky part:
    - If an Actor needs to `.call()` other Actors, then basically the user must give it full permissions. Otherwise, the Actor would have to list all the other Actors it's going to call, and the user would have to accept all the permissions needed in recursive calls.

    Extra question:
    - What to do if a new version of the Actor requires more permissions? We should probably require the author to increase the major version and keep users on the old build + email them to accept the updated permissions.
- We should make env vars independent of Apify, i.e. start them with `ACTOR_`, rather than `APIFY_`
- To storages, add info about atomic rename, e.g. the `setName` function, and link to other operations...
- Maybe add an `Actor.getThisRun()` function to return the run object of the current Actor. Not sure about the use case...
- Figure out the push/build workflow, see https://github.com/apify/actor-specs/pull/7/files#r997020215 / https://github.com/apify/actor-specs/pull/7#pullrequestreview-1144097598 — how should that work?
- Would be nice to have an API that would send a message to a run, and the run would get it as `.on('message', (msg) => { ... })`. Would save people from implementing their own servers in Actors. It would make it easier to orchestrate Actors. Currently it's a bit painful to create a "master" Actor and then "workers" to process some workloads. But it could probably be achieved with a queue, if it were distributed and generic. Explain why this is better than the live-view HTTP API.
- NOTE: BTW we have a new API v3 doc with ideas for changes in the API https://www.notion.so/apify/API-v3-6fcd240d9621427f9650b741ec6fa06b ?
- For the DATASET schema, in future versions let's consider referencing the schema using a URL; for now, let's keep it simple.

### Pipe result of an Actor to another (aka chaining)

An Actor can start other Actors and pass them its own dataset or key-value store. For example, the main Actor can produce files and the spawned Actors can consume them, from the same storages. In the future, we could let datasets be cleaned up from the beginning, effectively creating a pipe, with a custom rolling window. Webhooks can be attached to storage operations, and thus launch other Actors to consume newly added items or files.
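As a rough illustration of this chaining idea, here is a minimal JavaScript sketch using the Apify SDK. The `bob/consumer` Actor name and its `datasetId` input field are hypothetical, invented only to show how one Actor can hand its default dataset to another:

```js
import { Actor } from 'apify';

await Actor.init();

// Produce some results into this Actor's default dataset.
await Actor.pushData([{ url: 'https://example.com', title: 'Example Domain' }]);

// Start another (hypothetical) Actor and pass it the ID of our default dataset,
// so it can consume the items we just produced from the same storage.
const { defaultDatasetId } = Actor.getEnv();
await Actor.call('bob/consumer', { datasetId: defaultDatasetId });

await Actor.exit();
```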
#### UNIX equivalent

```bash
$ ls -l | grep "something" | wc -l
```

**TODO (@jancurn):** **Move to IDEAS.md**

We could have special CLI support for creating Actor chains using the pipe operator, like this:

```
$ apify call apify/google-search-scraper | apify call apify/send-email queryTerms="aaa\nbbb"
```

Note from Marek: Here we will need some way to map outputs of one Actor to inputs of the following Actor; perhaps we could pipe it thru some utility like [jq](https://stedolan.github.io/jq/tutorial/) or use some mapping like:

```
--input-dataset-id="$output.defaultDatasetId" --dataset-name="xxx"
```

Note from Ondra: I tried to write a JS example for piping, but figured that piping is not really aligned with how Actors work, because piping assumes the output of one program is immediately processed by another program. Actors can produce output like this, but they can't process input like this. Input is provided only once, when the Actor starts. Unless we consider e.g. the request queue as input. We will have to think about this a bit differently.

Note from Jan: Indeed, the flow is to start one Actor and pass one of its storages as the default to the other, newly started Actor. If we had a generic Queue, it could be used nicely for these use cases. I'm adding these notes to the doc, so that we can get back to them later.

Jan: I'd get rid of the Request queue from the Actor specification and keep it as Apify's extension only.

# Actor input schema file specification 1.0

This JSON file defines the schema and description of the input object accepted by the Actor (see [Input](../README.md#input) for details). The file is referenced from the main [Actor file (.actor/actor.json)](ACTOR_FILE.md) using the `input` directive, and it is typically stored in `.actor/input_schema.json`.

The file is a JSON schema with our extensions describing a single Actor input object and its properties, including documentation, default values, and user interface definitions.

**For full reference, see the [Input schema specification](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1) in the Apify documentation.**

<!-- TODO: Move the full specs including JSON meta schema to this repo -->

<!-- TODO: Consider renaming "editor" values to camelCase, for consistency -->

## Example Actor input schema

```jsonc
{
  "actorInputSchemaVersion": 1,
  "title": "Input schema for an Actor",
  "description": "Enter the start URL(s) of the website(s) to crawl, configure other optional settings, and run the Actor to crawl the pages and extract their text content.",
  "type": "object",
  "properties": {
    "startUrls": {
      "title": "Start URLs",
      "type": "array",
      "description": "One or more URLs of the pages where the crawler will start. Note that the Actor will additionally only crawl sub-pages of these URLs. For example, for the start URL `https://www.example.com/blog`, it will crawl pages like `https://example.com/blog/article-1`, but will skip `https://example.com/docs/something-else`.",
      "editor": "requestListSources",
      "prefill": [{ "url": "https://docs.apify.com/" }]
    },

    // The input value is another dataset. The system can generate a UI to make it easy to select the dataset.
"processDatasetId": { "title": "Input dataset", "type": "string", "resourceType": "dataset", "description": "Dataset to be processed by the Actor", // Optional link to dataset schema, used by the system to validate the input dataset "schema": "./input_dataset_schema.json" }, "screenshotsKeyValueStoreId": { "title": "Screenshots to process", "type": "string", "resourceType": "keyValueStore", "description": "Screenshots to be compressed", "schema": "./input_key_value_store_schema.json" }, "singleFileUrl": { "title": "Some file", "type": "string", "editor": "fileupload", "description": "Screenshots to be compressed", "schema": "./input_key_value_store_schema.json" }, "crawlerType": { "sectionCaption": "Crawler settings", "title": "Crawler type", "type": "string", "enum": ["playwright:chrome", "cheerio", "jsdom"], "enumTitles": ["Headless web browser (Chrome+Playwright)", "Raw HTTP client (Cheerio)", "Raw HTTP client with JS execution (JSDOM) (experimental!)"], "description": "Select the crawling engine:\n- **Headless web browser** (default) - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.\n- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.", "default": "playwright:chrome" }, "maxCrawlDepth": { "title": "Max crawling depth", "type": "integer", "description": "The maximum number of links starting from the start URL that the crawler will recursively descend. The start URLs have a depth of 0, the pages linked directly from the start URLs have a depth of 1, and so on.\n\nThis setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl start URLs.", "minimum": 0, "default": 20 }, "maxCrawlPages": { "title": "Max pages", "type": "integer", "description": "The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.", "minimum": 0, "default": 9999999 } } } ``` ## Random notes We could also add an `actor` resource type. The use case could be for example a testing Actor with three inputs: - Actor to be tested - test function containing for example Jest unit test over the output - input for the Actor ...and the testing Actor would call the given Actor with a given output and in the end execute tests if the results are correct. # Key-value store schema file specification [work in progress] This JSON file should contain schema for files stored in the key-value store, defining their name, format, or content type. **BEWARE: This is currently not implemented yet and subject to change.** ## Basic properties Key-value store schema has two main use cases described in the following examples: 1. Some Actors such as [Instagram scraper](https://apify.com/jaroslavhejlek/instagram-scraper) store multiple types of files into the key-value store. Let's say the scraper stores images and user pictures. So for each of these, we would define a prefix group called collection and allow the user to list images from a single collection in both the UI and API. 
```jsonc
{
  "collections": {
    "screenshots": {
      "name": "Post images",
      "keyPrefix": "images-",
      "contentTypes": ["image/jpeg", "image/png"]
    }
  }
}
```

2. Some Actors store a specific record, and we want to ensure its content type is HTML and embed it into the run view. A good example is the [monitoring](https://apify.com/apify/monitoring#check-frequency) Actor that generates an HTML report which we would like to embed into the run view for the user once the monitoring is finished.

```jsonc
{
  "collections": {
    "monitoringReport": {
      "name": "Monitoring report",
      "description": "HTML page containing monitoring results",
      "key": "REPORT",
      "contentTypes": ["text/html"]
    }
  }
}
```

3. Some Actors store a record that has a specific structure. The structure can be specified using a [JSON schema](https://json-schema.org/draft-07). Contrary to the dataset schema, a record in the key-value store represents an output that is a single item, instead of a sequence of items. But both approaches use JSON schema to describe the structure.

```jsonc
{
  "collections": {
    "monitoringReportData": {
      "name": "Monitoring report data",
      "description": "JSON containing the report data",
      "key": "report-data.json",
      "contentTypes": ["application/json"],
      "jsonSchema": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
          "summary": {
            "type": "string"
          },
          "totalResults": {
            "type": "number"
          }
        }
      }
      // alternatively "jsonSchema": "./report-schema.json" can be used
    }
  }
}
```

## Structure

```jsonc
{
  "actorKeyValueStoreSchemaVersion": 1,
  "name": "My Instagram backup",
  "description": "Backup of my Instagram account",
  "collections": {
    "postImages": {
      "name": "Post images",
      "description": "Contains all Instagram post images",
      "keyPrefix": "post-image-",
      "contentTypes": ["image/jpeg", "image/png"]
    },
    "profilePicture": {
      "name": "Profile picture",
      "key": "profile-picture",
      "contentTypes": ["image/*"] // Be able to enable all image or text types etc.
    }
  }
}
```

## API implications

Enable users to list keys for a specific collection:

```
https://api.apify.com/v2/key-value-stores/storeId/keys?collection=postImages&exclusiveStartKey=xxx
```

In addition to this, users will be able to list keys by prefix directly:

```
https://api.apify.com/v2/key-value-stores/storeId/keys?prefix=post-images-
```

# Actor output schema file specification 1.0 [work in progress]

This JSON file defines the schema of the [output](../README.md#output) object produced by a web Actor. The file is referenced from the main [Actor file](./ACTOR_FILE.md) using the `output` property, and it is typically stored in `.actor/output_schema.json`.

The format is a JSON Schema with our extensions, describing a single object. The output schema is used by the system to generate the output JSON object, whose fields correspond to `properties` and whose values are URLs linking to the actual Actor results in a dataset, key-value store files, or a live view web server.

This output object is generated by the system right when the Actor starts, without executing any of the Actor's code, and it remains static over the entire lifecycle of the Actor; only the linked content changes over time as the Actor produces the results. This is necessary to enable integration of results into other systems, as you don't need to run an Actor to see the format of its results; it's predefined by the output schema. The output schema is also used by the system to generate the user interface, API examples, integrations, etc.
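To make this concrete, here is a purely illustrative sketch (in JavaScript, with made-up IDs and URLs) of what such a generated output object might look like for the example schema shown in the Structure section below; each property simply holds a URL pointing to where the corresponding results will appear:

```js
// Hypothetical output object generated by the system from the schema below.
// The store/dataset IDs and the container URL are made up for illustration.
const output = {
  currentProductsDatasetUrl:
    'https://api.apify.com/v2/datasets/XYZabc/items?format=json&view=product_details',
  productImagesUrl:
    'https://api.apify.com/v2/key-value-stores/XYZabc/keys?collection=screenshots',
  mainScreenshotFileUrl:
    'https://api.apify.com/v2/key-value-stores/XYZabc/records/screenshot.png',
  productExplorerWebUrl: 'https://abc123.runs.apify.net/product-explorer/',
};
```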
## Structure

```jsonc
{
  "actorOutputSchemaVersion": 1,
  "title": "Some title",
  "description": "This text is shown in the Output UI",
  "type": "object",
  "properties": {

    // This property in the output object will contain a URL to the dataset containing the Actor's results,
    // for example: https://api.apify.com/v2/datasets/XYZabc/items?format=json&view=product_details
    "currentProductsDatasetUrl": {
      // Type is string, because the value in the output object is a URL
      "type": "string",
      "title": "Current products",
      "description": "Yaddada",

      // Identifies what kind of object is referenced by this output property (same syntax as "resourceType" in the input schema).
      // If used, the system will interpret the "source" and render the dataset in the UI in a special way.
      "resourceType": "dataset",

      // Defines how the output value is created, using a text format where {{x}} denotes variables (same syntax as webhook templates)
      "template": "{{actorRun.defaultDatasetUrl}}?format=json&view=product_details",

      // Or reference a property from the input object; the linkage will be checked for type compatibility
      // "template": "{{actorRun.input.myProductsDatasetId}}"
    },

    // Selects a specific group of records with a certain prefix. In the UI, this can be shown
    // as a list of images. In the output object, this will be a link to an API with a "prefix" param.
    "productImagesUrl": {
      "type": "string",
      "title": "Product screenshots",
      "resourceType": "keyValueStore",

      // Defines how the URL is created; in this case it will link to the default Actor key-value store
      "template": "{{actorRun.defaultKeyValueStoreUrl}}?collection=screenshots"
    },

    // Example of a reference to a file stored in the Actor's default key-value store.
    // In the UI, it can be rendered as a file download.
    "mainScreenshotFileUrl": {
      "type": "string",
      "title": "Main screenshot",
      "description": "URL to an image with the main product screenshot.",
      "template": "{{actorRun.defaultKeyValueStoreUrl}}/screenshot.png",
    },

    // Live view web server of the Actor.
    // In the "output" view, this page is rendered in an IFRAME.
    "productExplorerWebUrl": {
      "type": "string",
      "resourceType": "webServer",
      "title": "Live product explorer app",
      "description": "API documentation is available in swagger.com/api/xxxx", // optional

      // TODO: ideally this should be named {{actorRun.webServerUrl}} for consistency, but we'd need to change ActorRun everywhere
      "template": "{{actorRun.containerUrl}}/product-explorer/",
    }
  }
}
```

## Random notes

The output schema can reference other datasets/kv-stores/queues, but only those that are referenced in the input, or the default ones. Hence there's no point in including the storage schema here again, as it's done elsewhere.

- **NOTE:** The output schema should enable developers to define a schema for the default dataset and key-value store. But how? It should be declarative so that the system can check that e.g. the overridden default dataset has the right schema. But then, when it comes to the kv-store, that's not purely an output object but INPUT, similarly for an overridden dataset or request queue. Perhaps the cleanest way would be to set these directly in `.actor/actor.json`.
- The Run Sync API could have an option to automatically return (or redirect to?) a specific property (i.e. URL) of the output object. This would supersede the `outputRecordKey=OUTPUT` API param as well as the run-sync-get-dataset-items API endpoint. Maybe we could have one of the output properties as the main one, which would be used by default for this kind of API endpoint, and just return the data to the user.
- Same as we show the Output in the UI, we need to auto-generate the OUTPUT in the API, e.g. in JSON format. There would be properties like in the output_schema.json file, with e.g. URLs to the dataset, log file, kv-store, live view, etc. So it would be an auto-generated field "output" that we can add to the JSON returned by the Run API endpoints (e.g. https://docs.apify.com/api/v2#/reference/actor-tasks/run-collection/run-task)
- Also see: https://github.com/apify/actor-specs/pull/5#discussion_r775641112
- `output` will be a property of the run object, generated from the output schema

## Examples of ideal Actor run UI

- For the majority of Actors, we want to see the dataset with new records being added in real time
- For [Google Spreadsheet Import](https://apify.com/lukaskrivka/google-sheets), we want to first display the Live View for the user to set up OAuth, and once this is set up, we want to display the log next time.
- For technical Actors, it might be the log
- For the [HTML to PDF convertor](https://apify.com/jancurn/url-to-pdf), it's a single record from the key-value store
- For [Monitoring](https://apify.com/apify/monitoring-runner), it's the log during the runtime and a single HTML record in an iframe at the end
- For an Actor that has failed, it might be the log

## How to define Actor run UI

### Simple version

There will be a new tab called "Output" on the Actor run detail for every Actor with an output schema. This tab will be at the first position and displayed by default. The tab will show the following:

- Items from the output schema with the property `visible: true` will be rendered in the same order as they are in the schema
- The live view will be displayed only when it has `visible: true` and when it's active. Otherwise, we should show just a short message "This show is over".
- If the dataset has more views, then we should have a select or tabs to choose the view

### Ideal most comprehensive state

- Default setup, i.e., what output components should be displayed on the default run tab
- Optionally, the setup for different states
- Be able to programmatically change this via the API from the Actor itself

# Request queue schema file specification [work in progress]

Currently, this is neither specified nor implemented. We think that a request queue schema might be useful for two things:

- ensuring what kind of URLs might be enqueued (certain domains or subdomains, ...)
- ensuring that, for example, each request has `userData.label`, i.e. a schema for `userData`, the same way as we enforce it for datasets

We should consider renaming `RequestQueue` to just `Queue` and making it more generic; then it makes sense to have a request schema.

**This is to be done**
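Purely as an illustration of the two use cases above (none of the property names below are part of any specification yet, and nothing here is implemented), such a schema might hypothetically look along these lines:

```js
// Hypothetical sketch only - the property names are invented to illustrate
// the two use cases above: restricting enqueued URLs and enforcing a userData schema.
const requestQueueSchema = {
  actorRequestQueueSchemaVersion: 1,
  title: 'Product page requests',
  // Only allow URLs from certain domains or subdomains to be enqueued.
  allowedUrlPatterns: ['https://www.example.com/**'],
  // JSON schema applied to the userData of each request, e.g. requiring a "label" field.
  userDataSchema: {
    type: 'object',
    required: ['label'],
    properties: {
      label: { type: 'string' },
    },
  },
};
```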