Version: 3.4

LLM-ready scraping with Crawl4AI

In this guide, you'll learn how to use the Crawl4AI library for LLM-ready web scraping in your Apify Actors.

Introduction

Crawl4AI is an open-source, asynchronous web crawler built for LLM and AI workflows. It renders a page in a real browser and turns the result into clean, structured Markdown that you can feed into a language model or a retrieval-augmented generation (RAG) pipeline. It also gives you the raw HTML, extracted links, and media.

Crawl4AI is a great fit for Apify Actors:

  • Crawl4AI converts each page into clean Markdown, stripping boilerplate and optionally filtering content, so the output can be fed straight into a language model.
  • Pages are loaded in a Playwright-driven browser, so JavaScript-heavy and dynamically rendered websites work out of the box.
  • Every crawl returns the page's links already split into internal and external groups, together with the media it found, which makes recursive crawling straightforward.
  • Beyond Markdown, Crawl4AI can extract structured data with CSS/XPath schemas or with an LLM, all configured per request.
  • The AsyncWebCrawler is built on asyncio, which integrates naturally with the asyncio-based Apify SDK.
  • Each request can be routed through its own proxy, which pairs well with Apify Proxy and its rotating IP addresses.

Crawl4AI drives a real browser through Playwright. After installing the library, download the browser binaries once with the crawl4ai-setup command:

pip install crawl4ai
crawl4ai-setup

Example Actor

The following Actor recursively crawls pages, starting from the URLs in the Actor input and following links up to a user-defined maximum depth. It uses Crawl4AI's AsyncWebCrawler to render each page through Apify Proxy, stores the page's Markdown in the dataset, and follows the internal links that Crawl4AI discovers.
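The depth limit and the request cap can be illustrated without a browser. The following is a minimal sketch, not the Actor's code: the SITE dictionary is a hypothetical in-memory link graph standing in for real pages, a deque stands in for the request queue, and the seen set assumes the queue skips URLs it has already been given.

```python
from collections import deque

# Hypothetical link graph standing in for real pages (an assumption for illustration).
SITE = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/c'],
    'https://example.com/b': [],
    'https://example.com/c': ['https://example.com/d'],
}


def crawl(start_urls, max_depth, max_requests=50):
    """Return (url, depth) pairs in the order they would be handled."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    handled = []

    while queue and len(handled) < max_requests:
        url, depth = queue.popleft()
        handled.append((url, depth))

        # Links go one level deeper, unless max_depth was already reached.
        if depth >= max_depth:
            continue
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return handled


for url, depth in crawl(['https://example.com/'], max_depth=1):
    print(depth, url)
```

With max_depth=1, only the start page and the pages it links to are handled. Raising max_depth to 2 also reaches https://example.com/c.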

The whole Actor fits in a single file. A scrape_page helper holds the Crawl4AI-specific crawling and parsing, and an enqueue_links helper adds the discovered links to the request queue. The main coroutine handles the Actor lifecycle: it reads the input, sets up Apify Proxy and the request queue, opens a single browser-backed crawler, and drives the crawl:

import asyncio
from typing import Any

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    ProxyConfig,
)

from apify import Actor, Request
from apify.storages import RequestQueue


async def scrape_page(
    crawler: AsyncWebCrawler,
    url: str,
    *,
    proxy_url: str | None = None,
) -> tuple[dict[str, Any], list[str]]:
    """Crawl a page with Crawl4AI and return its markdown and same-site links."""
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_config=ProxyConfig.from_string(proxy_url) if proxy_url else None,
    )

    result = await crawler.arun(url, config=run_config)
    if not result.success:
        raise RuntimeError(result.error_message or f'Failed to crawl {url}')

    data = {
        'url': result.url,
        'title': (result.metadata or {}).get('title'),
        'markdown': str(result.markdown),
    }

    # Crawl4AI already classifies links; follow only the internal ones.
    internal_links = result.links.get('internal', [])
    links = [link['href'] for link in internal_links if link.get('href')]

    return data, links


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Set up Apify Proxy and the request queue.
        proxy_configuration = await Actor.create_proxy_configuration()
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs (crawl depth defaults to 0).
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl; raise or remove to follow more pages.
        max_requests = 50
        handled_requests = 0

        # Reuse one headless browser-backed crawler for every request.
        browser_config = BrowserConfig(headless=True)

        async with AsyncWebCrawler(config=browser_config) as crawler:
            while handled_requests < max_requests and (
                request := await request_queue.fetch_next_request()
            ):
                handled_requests += 1
                url = request.url
                depth = request.crawl_depth
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                try:
                    # Fresh proxy URL per request (None if no proxy).
                    proxy_url = None
                    if proxy_configuration:
                        proxy_url = await proxy_configuration.new_url()

                    data, links = await scrape_page(crawler, url, proxy_url=proxy_url)
                    await Actor.push_data(data)
                    Actor.log.info(
                        f'Stored data from {url} '
                        f'(title={data["title"]!r}, {len(links)} links found).'
                    )
                    await enqueue_links(
                        request_queue, links, depth=depth, max_depth=max_depth
                    )

                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')

                finally:
                    await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())

Note that:

  • A single AsyncWebCrawler is opened once and reused for every request. The crawler manages one browser instance, so reusing it across the whole crawl is cheaper than launching a new browser per page.
  • Keeping the crawling and parsing in scrape_page separates the Crawl4AI-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so main decides what to store and what to enqueue.
  • result.markdown is the rendered page as clean Markdown, and result.metadata carries page-level fields such as the title. This is the kind of output you need when preparing data for an LLM.
  • result.links already separates internal (same-site) links from external ones. The example follows only the internal links to keep the crawl on the same website.
  • CacheMode.BYPASS tells Crawl4AI to always fetch a fresh copy of the page instead of serving it from its local cache.
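If you want to tidy the links before enqueuing them, a small filter over the same data shape is enough. The sketch below is an optional addition, not part of the example Actor. The select_internal_links helper and the sample dictionary are hypothetical, and the sample only imitates the internal/external shape that the example reads from result.links.

```python
from urllib.parse import urldefrag, urlsplit


def select_internal_links(links: dict) -> list[str]:
    """Return unique http(s) hrefs from the 'internal' group, without fragments."""
    selected = []
    seen = set()
    for link in links.get('internal', []):
        href = link.get('href')
        if not href:
            continue
        # Drop '#fragment' so anchors on the same page count as one URL.
        url, _fragment = urldefrag(href)
        # Skip mailto:, javascript: and other non-web schemes.
        if urlsplit(url).scheme not in ('http', 'https'):
            continue
        if url not in seen:
            seen.add(url)
            selected.append(url)
    return selected


# Hand-written sample imitating the shape used in the example (an assumption).
sample = {
    'internal': [
        {'href': 'https://example.com/docs'},
        {'href': 'https://example.com/docs#install'},
        {'href': 'mailto:hello@example.com'},
        {'text': 'no href here'},
    ],
    'external': [{'href': 'https://other.example.org/'}],
}

print(select_internal_links(sample))
```

Stripping fragments and non-web schemes keeps the request cap from being spent on anchors of a page that was already crawled.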

Using Apify Proxy

Running on the Apify platform gives your scraper access to Apify Proxy, which rotates IP addresses to avoid rate limiting and blocking. In the example above, main creates a proxy configuration with Actor.create_proxy_configuration and passes a fresh proxy URL to scrape_page for every request, which forwards it to Crawl4AI's per-request CrawlerRunConfig.

ProxyConfig.from_string parses the proxy URL returned by ProxyConfiguration.new_url (for example http://groups-RESIDENTIAL:<password>@proxy.apify.com:8000) into the server, username, and password that the browser needs, because the browser can't take credentials embedded directly in the URL. To select specific proxy groups or a country, pass the relevant arguments to Actor.create_proxy_configuration. For details, see Proxy management.
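To make the server, username, and password split concrete, here is a rough standard-library sketch of the same idea. It is not Crawl4AI's implementation. The split_proxy_url helper is hypothetical, and the password secret is a placeholder.

```python
from urllib.parse import unquote, urlsplit


def split_proxy_url(proxy_url: str) -> dict:
    """Split a proxy URL into a credential-free server address and its credentials."""
    parts = urlsplit(proxy_url)
    server = f'{parts.scheme}://{parts.hostname}'
    if parts.port is not None:
        server += f':{parts.port}'
    return {
        'server': server,
        # Credentials may be percent-encoded inside a URL, so decode them.
        'username': unquote(parts.username) if parts.username else None,
        'password': unquote(parts.password) if parts.password else None,
    }


# 'secret' is a placeholder password, not a real credential.
proxy = split_proxy_url('http://groups-RESIDENTIAL:secret@proxy.apify.com:8000')
print(proxy['server'], proxy['username'])
```

In the Actor itself, keep using ProxyConfig.from_string. The sketch only shows which parts of the URL end up where.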

Running on the Apify platform

Because Crawl4AI renders pages in a real browser, the Actor image needs a browser and its system-level dependencies. Build on top of the Apify Playwright base image, which already ships a browser. Crawl4AI reuses those binaries, so no separate browser-install step is required in the Dockerfile.

Pin the Python 3.13 variant of that image (for example apify/actor-python-playwright:3.13-1.60.0). Some of Crawl4AI's dependencies do not yet publish wheels for the newest Python versions, so a newer Python image would force a slow source build during the image build.

Add apify and crawl4ai to your requirements.txt:

apify
crawl4ai

Conclusion

In this guide, you learned how to use Crawl4AI in your Apify Actors. You can now render pages in a real browser, turn them into LLM-ready Markdown, follow the links Crawl4AI discovers, route requests through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own scraping tasks, see the Actor templates. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!

Additional resources