Using Scrapy
Scrapy is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify Actors, integrated with Apify storages, and executed on the Apify platform.
Integrating Scrapy with the Apify platform
The Apify SDK provides an Apify-Scrapy integration. The main challenge is combining two asynchronous frameworks that use different event loop implementations: Scrapy uses Twisted for asynchronous execution, while the Apify SDK is based on asyncio. The key is to install Twisted's asyncioreactor, which runs Twisted's event loop on top of asyncio. This lets both Twisted and asyncio code share a single event loop, so a Scrapy spider can run as an Apify Actor with minimal modifications. The following entry point script (the __main__.py file in the example below) demonstrates this setup:
from __future__ import annotations
from scrapy.utils.reactor import install_reactor
# Install Twisted's asyncio reactor before importing any other Twisted or
# Scrapy components.
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
import os
from apify.scrapy import initialize_logging, run_scrapy_actor
# Import your main Actor coroutine here.
from .main import main
# Ensure the location to the Scrapy settings module is defined.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'
if __name__ == '__main__':
initialize_logging()
run_scrapy_actor(main())
In this setup, apify.scrapy.initialize_logging configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The apify.scrapy.run_scrapy_actor function bridges asyncio coroutines with Twisted's reactor, enabling the Actor's main coroutine, which contains the Scrapy spider, to be executed.
Make sure the SCRAPY_SETTINGS_MODULE environment variable is set to the path of the Scrapy settings module. This variable is also used by the Actor class to detect that the project is a Scrapy project, triggering additional actions. The Actor's main coroutine, defined in main.py in this example, then executes the spider:
from __future__ import annotations
from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_to_future
from apify import Actor
from apify.scrapy import apply_apify_settings
# Import your Scrapy spider here.
from .spiders import TitleSpider as Spider
async def main() -> None:
"""Apify Actor main coroutine for executing the Scrapy spider."""
async with Actor:
# Retrieve and process Actor input.
actor_input = await Actor.get_input() or {}
start_urls = [url['url'] for url in actor_input.get('startUrls', [])]
allowed_domains = actor_input.get('allowedDomains')
proxy_config = actor_input.get('proxyConfiguration')
# Apply Apify settings, which will override the Scrapy project settings.
settings = apply_apify_settings(proxy_config=proxy_config)
# Create CrawlerRunner and execute the Scrapy spider.
crawler_runner = CrawlerRunner(settings)
crawl_deferred = crawler_runner.crawl(
Spider,
start_urls=start_urls,
allowed_domains=allowed_domains,
)
await deferred_to_future(crawl_deferred)
Within the Actor's main coroutine, the Actor's input is processed as usual. The function apify.scrapy.apply_apify_settings
is then used to configure Scrapy settings with Apify-specific components before the spider is executed. The key components and other helper functions are described in the next section.
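For reference, an Actor input matching the fields read above could look like the following, shown here as the dictionary that Actor.get_input() would return (the values are illustrative; useApifyProxy is the standard flag for routing requests through Apify Proxy):
actor_input = {
    'startUrls': [{'url': 'https://www.iana.org'}],
    'allowedDomains': ['iana.org'],
    'proxyConfiguration': {'useApifyProxy': True},
}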
Key integration components
The Apify SDK provides several custom components to support integration with the Apify platform:
- apify.scrapy.ApifyScheduler - Replaces Scrapy's default scheduler with one that uses Apify's request queue for storing requests. It manages enqueuing, dequeuing, and maintaining the state and priority of requests.
- apify.scrapy.ActorDatasetPushPipeline - A Scrapy item pipeline that pushes scraped items to Apify's dataset. When enabled, every item produced by the spider is sent to the dataset.
- apify.scrapy.ApifyHttpProxyMiddleware - A Scrapy middleware that manages proxy configurations. This middleware replaces Scrapy's default HttpProxyMiddleware to facilitate the use of Apify's proxy service.
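If you were to wire these components into a Scrapy project by hand instead of calling apply_apify_settings, the relevant part of settings.py would look roughly like the sketch below. The component paths come from the list above; the priority numbers are illustrative assumptions, not values prescribed by the SDK.
# A minimal sketch of manual wiring; priority numbers are illustrative assumptions.
SCHEDULER = 'apify.scrapy.ApifyScheduler'
ITEM_PIPELINES = {
    # Push every scraped item to the Actor's default dataset.
    'apify.scrapy.ActorDatasetPushPipeline': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in proxy middleware...
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    # ...and let the Apify middleware manage proxy configuration instead.
    'apify.scrapy.ApifyHttpProxyMiddleware': 950,
}
In practice, apply_apify_settings performs this wiring for you, so you rarely need to set these keys manually.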
Additional helper functions in the apify.scrapy
subpackage include:
- apply_apify_settings - Applies Apify-specific components to Scrapy settings.
- to_apify_request and to_scrapy_request - Convert between Apify and Scrapy request objects.
- initialize_logging - Configures logging for the Actor environment.
- run_scrapy_actor - Bridges the asyncio and Twisted event loops.
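The request conversion helpers are used internally by ApifyScheduler, but they can also be called directly. The round-trip below is a minimal sketch; it assumes both helpers take the originating spider instance as a parameter, so check the API reference for the exact signatures before relying on them.
from scrapy import Request, Spider

from apify.scrapy import to_apify_request, to_scrapy_request

spider = Spider(name='example')  # A stand-in spider instance, for illustration only.

# Convert a Scrapy request into its Apify representation (assumed signature);
# the conversion may fail and return None for unsupported requests.
scrapy_request = Request(url='https://example.com')
apify_request = to_apify_request(scrapy_request, spider=spider)

# ...and convert it back into a Scrapy request.
if apify_request is not None:
    restored_request = to_scrapy_request(apify_request, spider=spider)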
Create a new Apify-Scrapy project
The simplest way to start using Scrapy in Apify Actors is to use the Scrapy Actor template. The template provides a pre-configured project structure and setup that includes all necessary components to run Scrapy spiders as Actors and store their output in Apify datasets. If you prefer manual setup, refer to the example Actor section below for configuration details.
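For example, with the Apify CLI installed, bootstrapping a new project from the template looks roughly like this (the template identifier is an assumption; if it differs, pick the Scrapy template from the CLI's interactive list):
apify create my-scrapy-actor -t python-scrapy
cd my-scrapy-actor
apify run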
Wrapping an existing Scrapy project
The Apify CLI supports converting an existing Scrapy project into an Apify Actor with a single command. The CLI expects the project to follow the standard Scrapy layout (including a scrapy.cfg
file in the project root). During the wrapping process, the CLI:
- Creates the necessary files and directories for an Apify Actor.
- Installs the Apify SDK and required dependencies.
- Updates Scrapy settings to include Apify-specific components.
For further details, see the Scrapy migration guide.
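In practice, the conversion is run from the project root, along these lines (the push step is optional and assumes you are logged in via apify login):
cd your-scrapy-project   # the directory containing scrapy.cfg
apify init               # wraps the project as an Apify Actor
apify push               # optionally, deploy it to the Apify platform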
Example Actor
The following example demonstrates a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.
- __main__.py
- main.py
- settings.py
- items.py
- spiders/title.py
from __future__ import annotations
from scrapy.utils.reactor import install_reactor
# Install Twisted's asyncio reactor before importing any other Twisted or
# Scrapy components.
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
import os
from apify.scrapy import initialize_logging, run_scrapy_actor
# Import your main Actor coroutine here.
from .main import main
# Ensure the location to the Scrapy settings module is defined.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'
if __name__ == '__main__':
initialize_logging()
run_scrapy_actor(main())
from __future__ import annotations
from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_to_future
from apify import Actor
from apify.scrapy import apply_apify_settings
# Import your Scrapy spider here.
from .spiders import TitleSpider as Spider
async def main() -> None:
"""Apify Actor main coroutine for executing the Scrapy spider."""
async with Actor:
# Retrieve and process Actor input.
actor_input = await Actor.get_input() or {}
start_urls = [url['url'] for url in actor_input.get('startUrls', [])]
allowed_domains = actor_input.get('allowedDomains')
proxy_config = actor_input.get('proxyConfiguration')
# Apply Apify settings, which will override the Scrapy project settings.
settings = apply_apify_settings(proxy_config=proxy_config)
# Create CrawlerRunner and execute the Scrapy spider.
crawler_runner = CrawlerRunner(settings)
crawl_deferred = crawler_runner.crawl(
Spider,
start_urls=start_urls,
allowed_domains=allowed_domains,
)
await deferred_to_future(crawl_deferred)
BOT_NAME = 'titlebot'
DEPTH_LIMIT = 1
LOG_LEVEL = 'INFO'
NEWSPIDER_MODULE = 'src.spiders'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['src.spiders']
TELNETCONSOLE_ENABLED = False
# Do not change the Twisted reactor unless you really know what you are doing.
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
from __future__ import annotations
from scrapy import Field, Item
class TitleItem(Item):
"""Represents a title item scraped from a web page."""
url = Field()
title = Field()
from __future__ import annotations
from typing import TYPE_CHECKING, Any
from urllib.parse import urljoin
from scrapy import Request, Spider
from ..items import TitleItem
if TYPE_CHECKING:
from collections.abc import Generator
from scrapy.http.response import Response
class TitleSpider(Spider):
"""A spider that scrapes web pages to extract titles and discover new links.
This spider retrieves the content of the <title> element from each page and queues
any valid hyperlinks for further crawling.
"""
name = 'title_spider'
# Limit the number of pages to scrape.
custom_settings = {'CLOSESPIDER_PAGECOUNT': 10}
def __init__(
self,
start_urls: list[str],
allowed_domains: list[str],
*args: Any,
**kwargs: Any,
) -> None:
"""A default costructor.
Args:
start_urls: URLs to start the scraping from.
allowed_domains: Domains that the scraper is allowed to crawl.
*args: Additional positional arguments.
**kwargs: Additional keyword arguments.
"""
super().__init__(*args, **kwargs)
self.start_urls = start_urls
self.allowed_domains = allowed_domains
def parse(self, response: Response) -> Generator[TitleItem | Request, None, None]:
"""Parse the web page response.
Args:
response: The web page response.
Yields:
Yields scraped `TitleItem` and new `Request` objects for links.
"""
self.logger.info('TitleSpider is parsing %s...', response)
# Extract and yield the TitleItem
url = response.url
title = response.css('title::text').get()
yield TitleItem(url=url, title=title)
# Extract all links from the page, create `Request` objects out of them,
# and yield them.
for link_href in response.css('a::attr("href")'):
link_url = urljoin(response.url, link_href.get())
if link_url.startswith(('http://', 'https://')):
yield Request(link_url)
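With all of these files in place, the entry point is executed as a module (the relative imports and the src.settings module path above assume the code lives in a src package), so the Actor can be tried out locally with the Apify CLI:
apify run
Running python -m src directly should behave similarly, though apify run also prepares the local storage directories used for the request queue and dataset.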
Conclusion
In this guide, you learned how to use Scrapy in Apify Actors. You can now start building your own web scraping projects with Scrapy and the Apify SDK and host them on the Apify platform. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!