Integrating Scrapy projects
Scrapy is a widely used open-source web scraping framework for Python. Scrapy projects can now be executed on the Apify platform using our dedicated wrapping tool. This tool allows users to transform their Scrapy projects into Apify Actors with just a few simple commands.
Getting started
Install Apify CLI
To run the wrapping tool, you need to have the Apify CLI installed. You can install it using Homebrew with the following command:
brew install apify-cli
Alternatively, you can install it using NPM with the following command:
npm i -g apify-cli
In case of any issues, please refer to the installation guide.
Actorization of your existing Scrapy spider
Assuming your Scrapy project is set up, navigate to the project root where the scrapy.cfg file is located.
cd your_scraper
Verify the directory contents to ensure the correct location.
$ ls -R
.:
your_scraper README.md requirements.txt scrapy.cfg
./your_scraper:
__init__.py items.py __main__.py main.py pipelines.py settings.py spiders
./your_scraper/spiders:
your_spider.py __init__.py
To convert your Scrapy project into an Apify Actor, initiate the wrapping process by executing the following command:
apify init
The script will prompt you with a series of questions. Upon completion, the output might resemble the following:
Info: The current directory looks like a Scrapy project. Using automatic project wrapping.
? Enter the Scrapy BOT_NAME (see settings.py): books_scraper
? What folder are the Scrapy spider modules stored in? (see SPIDER_MODULES in settings.py): books_scraper.spiders
? Pick the Scrapy spider you want to wrap: BookSpider (/home/path/to/actor-scrapy-books-example/books_scraper/spiders/book.py)
Info: Downloading the latest Scrapy wrapper template...
Info: Wrapping the Scrapy project...
Success: The Scrapy project has been wrapped successfully.
For example, here is the source code of an actorized Scrapy project, and here is the corresponding Actor in Apify Store.
Run the Actor locally
Create a Python virtual environment by running:
python -m virtualenv .venv
Activate the virtual environment:
source .venv/bin/activate
Install Python dependencies using the provided requirements file named requirements_apify.txt. Ensure these requirements are installed before executing your project as an Apify Actor locally. You can put your own dependencies there as well.
pip install -r requirements_apify.txt [-r requirements.txt]
Finally, execute the Apify Actor:
apify run [--purge]
If ActorDatasetPushPipeline is configured, the Actor's output will be stored in the storage/datasets/default/ directory.
Run the scraper as a Scrapy project
The project remains executable as a Scrapy project.
scrapy crawl your_spider -o books.json
Deploy on Apify
Log in to Apify
You will need to provide your Apify API Token to complete this action.
apify login
Deploy your Actor
This command will deploy and build the Actor on the Apify platform. You can find your newly created Actor under Actors -> My Actors.
apify push
What the wrapping process does
The initialization command enhances your project by adding the necessary files and updating some of them, while preserving its functionality as a typical Scrapy project. The additional requirements file, named requirements_apify.txt, includes the Apify Python SDK and other essential requirements. The .actor/ directory contains the basic configuration of your Actor. We provide two new Python files, main.py and __main__.py, where we encapsulate the Scrapy project within an Actor. There we also import and use a few Scrapy components from our Python SDK. These components facilitate the integration of Scrapy projects with the Apify platform. Further details about these components are provided in the following subsections.
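To give a concrete picture, below is a heavily simplified sketch of what main.py does. It is not the exact template the wrapper generates: the real file also wires in the Apify-specific Scrapy components described in the following subsections and handles the interplay between Twisted's reactor and asyncio, which this sketch glosses over. YourSpider is a placeholder for your own spider class.
import asyncio

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from apify import Actor

# Placeholder import - replace with your actual spider class.
from your_scraper.spiders.your_spider import YourSpider


async def main() -> None:
    async with Actor:
        settings = get_project_settings()
        # The generated template additionally overrides the scheduler,
        # item pipelines and downloader middlewares here (see below).
        process = CrawlerProcess(settings, install_root_handler=False)
        process.crawl(YourSpider)
        process.start()  # blocks until the crawl finishes


if __name__ == '__main__':
    asyncio.run(main())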
Scheduler
The scheduler is a core component of Scrapy responsible for receiving and providing requests to be processed. To leverage the Apify request queue for storing requests, a custom scheduler becomes necessary. Fortunately, Scrapy is a modular framework, allowing the creation of custom components. As a result, we have implemented the ApifyScheduler. When using the Apify CLI wrapping tool, the scheduler is configured in the src/main.py file of your Actor.
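To illustrate the idea only (this is not the SDK's actual implementation), a scheduler backed by a request queue implements Scrapy's scheduler interface and persists requests in the queue instead of Scrapy's in-memory structures. The request_queue object below is a simplified, synchronous stand-in; the real ApifyScheduler additionally has to bridge Scrapy's synchronous scheduler API with the asynchronous Apify client.
from typing import Optional

from scrapy import Request
from scrapy.core.scheduler import BaseScheduler


class RequestQueueScheduler(BaseScheduler):
    """Conceptual sketch of a scheduler backed by an Apify request queue."""

    def __init__(self, request_queue):
        # Simplified stand-in for an opened Apify request queue client.
        self.rq = request_queue

    def enqueue_request(self, request: Request) -> bool:
        # Persist the request so the crawl state survives restarts.
        self.rq.add_request({'url': request.url, 'method': request.method})
        return True

    def next_request(self) -> Optional[Request]:
        # Hand the next pending request back to Scrapy's engine, if any.
        apify_request = self.rq.fetch_next_request()
        if apify_request is None:
            return None
        return Request(url=apify_request['url'])

    def has_pending_requests(self) -> bool:
        return not self.rq.is_empty()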
Dataset push pipeline
Item pipelines are used for processing the results produced by your spiders. To handle the transmission of result data to the Apify dataset, we have implemented the ActorDatasetPushPipeline. When using the Apify CLI wrapping tool, the pipeline is configured in the src/main.py file of your Actor. It is assigned the highest integer value (1000), ensuring its execution as the final step in the pipeline sequence.
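Conceptually, such a pipeline is just an item pipeline whose process_item method pushes every scraped item to the Actor's default dataset. A minimal sketch (not the SDK's actual implementation), relying on Scrapy's support for coroutine process_item methods, could look like this:
from itemadapter import ItemAdapter

from apify import Actor


class DatasetPushPipeline:
    """Conceptual sketch of a pipeline pushing items to the default dataset."""

    async def process_item(self, item, spider):
        # Convert the Scrapy item to a plain dict and store it in the dataset.
        await Actor.push_data(ItemAdapter(item).asdict())
        return item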
Retry middleware
Downloader middlewares are a way to hook into Scrapy's request/response processing. Scrapy comes with various default middlewares, including the RetryMiddleware, designed to handle retries for requests that may have failed due to temporary issues. When integrating with the Apify request queue, it becomes necessary to enhance this middleware to facilitate communication with the request queue, marking requests either as handled or ready for a retry. When using the Apify CLI wrapping tool, the default RetryMiddleware is disabled, and ApifyRetryMiddleware takes its place. Configuration for the middlewares is established in the src/main.py file of your Actor.
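Wired up by hand, the swap amounts to a downloader middleware setting along the lines shown below; the module path of the Apify middleware is an assumption, and the wrapper generates the equivalent configuration for you.
# Disable Scrapy's built-in RetryMiddleware (None) and register the Apify
# replacement at the same priority (550 is RetryMiddleware's default).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'apify.scrapy.middlewares.ApifyRetryMiddleware': 550,
}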
HTTP proxy middleware
Another default Scrapy downloader middleware that requires replacement is the HttpProxyMiddleware. To make use of proxies managed through the Apify ProxyConfiguration, we provide the ApifyHttpProxyMiddleware. When using the Apify CLI wrapping tool, the default HttpProxyMiddleware is disabled, and ApifyHttpProxyMiddleware takes its place. Additionally, inspect the .actor/input_schema.json file, where proxy configuration is specified as an input property for your Actor. The processing of this input is carried out together with the middleware configuration in src/main.py.