Using Crawlee

In this guide you'll learn how to use the Crawlee library in your Apify Actors.

Introduction

Crawlee is a Python library for web scraping and browser automation that provides a robust and flexible framework for building reliable scrapers. It integrates seamlessly with the Apify platform and supports a variety of scraping techniques, from parsing static HTML to handling dynamic, JavaScript-rendered content. Crawlee offers a range of crawlers, including HTTP-based crawlers like HttpCrawler, BeautifulSoupCrawler and ParselCrawler, and browser-based crawlers like PlaywrightCrawler, to suit different scraping needs.

In this guide, you'll learn how to use Crawlee with BeautifulSoupCrawler and PlaywrightCrawler to build Apify Actors for web scraping.
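
The examples below assume an existing Python Actor project with the Apify SDK already set up. If you are starting from scratch, you also need Crawlee with the extras for the crawlers you plan to use. A minimal dependency sketch, assuming your project declares dependencies in a requirements.txt file (extras names may differ between Crawlee versions, so check the Crawlee installation docs):

# requirements.txt (sketch)
apify
crawlee[beautifulsoup,playwright]

When using PlaywrightCrawler, the browser binaries must also be available: the Apify Playwright base Docker images ship with them preinstalled, while locally you typically run playwright install after installing the dependencies.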

Actor with BeautifulSoupCrawler

The BeautifulSoupCrawler is ideal for extracting data from static HTML pages. It uses BeautifulSoup for parsing and HttpxHttpClient for HTTP communication, which keeps crawls efficient and lightweight. If you do not need to execute JavaScript on the page, BeautifulSoupCrawler is a great choice. Below is an example of how to use BeautifulSoupCrawler in an Apify Actor.

from __future__ import annotations

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

from apify import Actor


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get(
                'start_urls',
                [{'url': 'https://apify.com'}],
            )
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Create a crawler.
        crawler = BeautifulSoupCrawler(
            # Limit the crawl to max requests.
            # Remove or increase it for crawling all links.
            max_requests_per_crawl=50,
        )

        # Define a request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            url = context.request.url
            Actor.log.info(f'Scraping {url}...')

            # Extract the desired data.
            data = {
                'url': context.request.url,
                'title': context.soup.title.string if context.soup.title else None,
                'h1s': [h1.text for h1 in context.soup.find_all('h1')],
                'h2s': [h2.text for h2 in context.soup.find_all('h2')],
                'h3s': [h3.text for h3 in context.soup.find_all('h3')],
            }

            # Store the extracted data to the default dataset.
            await context.push_data(data)

            # Enqueue additional links found on the current page.
            await context.enqueue_links()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)
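
The main() coroutine above is not started by itself. A minimal entry-point sketch, assuming the standard Python Actor template layout where the code above lives in src/main.py and the file below is src/__main__.py (adjust the import if your project is structured differently):

import asyncio

from .main import main

# Run the Actor's main coroutine when the package is executed,
# e.g. with `python -m src`.
asyncio.run(main())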

Actor with PlaywrightCrawler

The PlaywrightCrawler is built for dynamic web pages that rely on JavaScript to render their content. Using the Playwright library, it provides a browser-based automation environment for interacting with complex websites. Below is an example of how to use PlaywrightCrawler in an Apify Actor.

from __future__ import annotations

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

from apify import Actor


async def main() -> None:
    # Enter the context of the Actor.
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get(
                'start_urls',
                [{'url': 'https://apify.com'}],
            )
        ]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Create a crawler.
        crawler = PlaywrightCrawler(
            # Limit the crawl to max requests.
            # Remove or increase it for crawling all links.
            max_requests_per_crawl=50,
            headless=True,
            browser_launch_options={
                'args': ['--disable-gpu'],
            },
        )

        # Define a request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            url = context.request.url
            Actor.log.info(f'Scraping {url}...')

            # Extract the desired data.
            data = {
                'url': context.request.url,
                'title': await context.page.title(),
                'h1s': [
                    await h1.text_content()
                    for h1 in await context.page.locator('h1').all()
                ],
                'h2s': [
                    await h2.text_content()
                    for h2 in await context.page.locator('h2').all()
                ],
                'h3s': [
                    await h3.text_content()
                    for h3 in await context.page.locator('h3').all()
                ],
            }

            # Store the extracted data to the default dataset.
            await context.push_data(data)

            # Enqueue additional links found on the current page.
            await context.enqueue_links()

        # Run the crawler with the starting requests.
        await crawler.run(start_urls)
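
In both examples, enqueue_links() adds every link found on the page to the request queue. If you only want to follow certain links and process them with a dedicated handler, you can pass a CSS selector and a label to enqueue_links() and register a handler for that label on the router. A minimal sketch, assuming the selector and label keyword arguments of enqueue_links and the router.handler() decorator available in recent Crawlee versions (the 'a.product-link' selector is purely hypothetical):

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Enqueue only links matching the (hypothetical) CSS selector and
        # route them to the handler registered for the 'DETAIL' label.
        await context.enqueue_links(selector='a.product-link', label='DETAIL')

    @crawler.router.handler('DETAIL')
    async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Runs only for requests that were enqueued with the 'DETAIL' label.
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://apify.com'])


if __name__ == '__main__':
    asyncio.run(main())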

Conclusion

In this guide, you learned how to use the Crawlee library in your Apify Actors. With the BeautifulSoupCrawler and PlaywrightCrawler crawlers, you can efficiently scrape both static and dynamic web pages, making it easy to build web scrapers in Python. See the Actor templates to get started with your own scraping projects. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!