Using Crawlee
In this guide you'll learn how to use the Crawlee library in your Apify Actors.
Introduction
Crawlee is a Python library for web scraping and browser automation that provides a robust and flexible framework for building web scraping tasks. It seamlessly integrates with the Apify platform and supports a variety of scraping techniques, from static HTML parsing to dynamic JavaScript-rendered content handling. Crawlee offers a range of crawlers, including HTTP-based crawlers like HttpCrawler, BeautifulSoupCrawler and ParselCrawler, and browser-based crawlers like PlaywrightCrawler, to suit different scraping needs.
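Regardless of which crawler class you pick, the basic shape is the same: create a crawler instance, register a request handler on its router, and run it with a list of start URLs. As a rough standalone sketch (outside of an Actor, using BeautifulSoupCrawler and an example URL purely for illustration), that pattern might look like this:

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # The default handler runs for every request the crawler processes.
    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # `context.soup` is the parsed BeautifulSoup document of the page.
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://apify.com'])


if __name__ == '__main__':
    asyncio.run(main())

The Actor examples below follow the same pattern, only wrapped in the async with Actor context so they can read the Actor input and store results in the default dataset on the Apify platform.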
In this guide, you'll learn how to use Crawlee with BeautifulSoupCrawler and PlaywrightCrawler to build Apify Actors for web scraping.
Actor with BeautifulSoupCrawler
The BeautifulSoupCrawler is ideal for extracting data from static HTML pages. It uses BeautifulSoup for parsing and HttpxHttpClient for HTTP communication, ensuring efficient and lightweight scraping. If you do not need to execute JavaScript on the page, BeautifulSoupCrawler is a great choice for your scraping tasks. Below is an example of how to use BeautifulSoupCrawler in an Apify Actor.
from __future__ import annotations

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

from apify import Actor


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get(
                'start_urls',
                [{'url': 'https://apify.com'}],
            )
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Create a crawler.
        crawler = BeautifulSoupCrawler(
            # Limit the crawl to max requests.
            # Remove or increase it for crawling all links.
            max_requests_per_crawl=50,
        )

        # Define a request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            url = context.request.url
            Actor.log.info(f'Scraping {url}...')

            # Extract the desired data.
            data = {
                'url': context.request.url,
                'title': context.soup.title.string if context.soup.title else None,
                'h1s': [h1.text for h1 in context.soup.find_all('h1')],
                'h2s': [h2.text for h2 in context.soup.find_all('h2')],
                'h3s': [h3.text for h3 in context.soup.find_all('h3')],
            }

            # Store the extracted data to the default dataset.
            await context.push_data(data)

            # Enqueue additional links found on the current page.
            await context.enqueue_links()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)
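For larger crawls you will often want to route different kinds of pages to different handlers instead of processing everything in the default one. Below is a minimal sketch of that idea, assuming your Crawlee version supports labeled router handlers and the selector and label arguments of enqueue_links; the CSS selector and the 'DETAIL' label are hypothetical and should be adjusted to the site you are scraping.

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)


# Listing pages go through the default handler and only enqueue detail links.
@crawler.router.default_handler
async def listing_handler(context: BeautifulSoupCrawlingContext) -> None:
    # 'a.product-link' and the 'DETAIL' label are hypothetical examples.
    await context.enqueue_links(selector='a.product-link', label='DETAIL')


# Requests enqueued with the 'DETAIL' label are processed by this handler.
@crawler.router.handler('DETAIL')
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.push_data({
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
    })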
Actor with PlaywrightCrawler
The PlaywrightCrawler is built for handling dynamic web pages that rely on JavaScript for content generation. Using the Playwright library, it provides a browser-based automation environment to interact with complex websites. Below is an example of how to use PlaywrightCrawler in an Apify Actor.
from __future__ import annotations

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

from apify import Actor


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get(
                'start_urls',
                [{'url': 'https://apify.com'}],
            )
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Create a crawler.
        crawler = PlaywrightCrawler(
            # Limit the crawl to max requests.
            # Remove or increase it for crawling all links.
            max_requests_per_crawl=50,
            headless=True,
            browser_launch_options={
                'args': ['--disable-gpu'],
            },
        )

        # Define a request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            url = context.request.url
            Actor.log.info(f'Scraping {url}...')

            # Extract the desired data.
            data = {
                'url': context.request.url,
                'title': await context.page.title(),
                'h1s': [
                    await h1.text_content()
                    for h1 in await context.page.locator('h1').all()
                ],
                'h2s': [
                    await h2.text_content()
                    for h2 in await context.page.locator('h2').all()
                ],
                'h3s': [
                    await h3.text_content()
                    for h3 in await context.page.locator('h3').all()
                ],
            }

            # Store the extracted data to the default dataset.
            await context.push_data(data)

            # Enqueue additional links found on the current page.
            await context.enqueue_links()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)
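Because each page is rendered in a real browser, you can also wait for JavaScript-generated content before extracting it. Below is one possible variation of the request handler from the example above; the 'h1' selector, the 'networkidle' wait, and the timeout are illustrative assumptions rather than required settings.

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url}...')

    # Wait until network activity settles and the JavaScript-rendered heading exists.
    await context.page.wait_for_load_state('networkidle')
    await context.page.wait_for_selector('h1', timeout=10_000)

    # Extract the data only after the dynamic content has appeared.
    await context.push_data({
        'url': context.request.url,
        'title': await context.page.title(),
        'first_h1': await context.page.locator('h1').first.text_content(),
    })

    # Continue crawling the links found on the page.
    await context.enqueue_links()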
Conclusion
In this guide, you learned how to use the Crawlee library in your Apify Actors. By using the BeautifulSoupCrawler and PlaywrightCrawler crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!