How to build Actors
At Apify, we try to make building web scraping and automation straightforward. You can customize our universal scrapers with a bit of JavaScript for quick tweaks, use our code templates for rapid project setup in JavaScript, TypeScript, or Python, or build from scratch with our JavaScript and Python SDKs or with the Crawlee libraries for Node.js and Python for ultimate flexibility and control. This guide offers a quick overview of our tools to help you find the right fit for your needs.
Three ways to build Actors
- Our universal scrapers — customize our boilerplate tools to your needs with a bit of JavaScript and setup.
- Our code templates for web scraping projects — for a quick project setup to save you development time (includes JavaScript, TypeScript, and Python templates).
- Open-source libraries and SDKs:
  - JavaScript SDK & Python SDK — for creating your own solution from scratch on the Apify platform using our free development kits. Involves more coding but offers infinite flexibility.
  - Crawlee and Crawlee for Python — for creating your own solutions from scratch using our free web automation libraries. Involves even more coding but offers infinite flexibility. There's also no need to host these on the platform.
Universal scrapers & what they're for
Universal scrapers were built to provide an intuitive UI and configuration options that help you start extracting data as quickly as possible. Usually, you just provide a simple JavaScript function, set one or two parameters, and you're good to go.
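For instance, a pageFunction for Web Scraper can be as small as the sketch below. This is an illustrative sketch, not a copy-paste recipe: the exact context properties available depend on your input settings, and jQuery is only present when injection is enabled in the input.

```javascript
// A minimal pageFunction sketch for Web Scraper. It runs inside the browser
// on every page the scraper opens. Assumes "Inject jQuery" is enabled in the
// Actor input, which exposes jQuery on the context.
async function pageFunction(context) {
    const { request, log, jQuery: $ } = context;

    const title = $('title').text().trim();
    log.info(`Scraping ${request.url}`);

    // The returned object is stored as one record in the run's dataset.
    return {
        url: request.url,
        title,
    };
}
```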
Since scraping and automation come in various forms, we decided to build not just one, but six scrapers. This way, you can always pick the right tool for the job. Let's take a look at each particular tool and its advantages and disadvantages.
Scraper | Technology | Advantages | Disadvantages | Best for |
---|---|---|---|---|
🌐 Web Scraper | Headless Chrome Browser | Simple, handles fully JavaScript-rendered pages | Executes only client-side JavaScript | Websites with heavy client-side JavaScript |
👐 Puppeteer Scraper | Headless Chrome Browser | Powerful Puppeteer functions, executes both server-side and client-side JavaScript | More complex | Advanced scraping with client/server-side JS |
🎭 Playwright Scraper | Cross-browser support with Playwright library | Cross-browser support, executes both server-side and client-side JavaScript | More complex | Cross-browser scraping with advanced features |
🍩 Cheerio Scraper | HTTP requests + Cheerio parser (jQuery-like for servers) | Simple, fast, cost-effective | Pages may not be fully rendered (lacks JavaScript rendering), executes only server-side JavaScript | High-speed, cost-effective scraping |
⚠️ JSDOM Scraper | JSDOM library (browser-like DOM API) | Handles client-side JavaScript, faster than full-browser solutions, ideal for light scripting | Not for heavy dynamic JavaScript, executes server-side code only, depends on pre-installed NPM modules | Speedy scraping with light client-side JS |
🍲 BeautifulSoup Scraper | Python-based, HTTP requests + BeautifulSoup parser | Python-based, supports recursive crawling and URL lists | No full-featured web browser, not suitable for dynamic JavaScript-rendered pages | Python users needing simple, recursive crawling |
How do I choose the right universal web scraper to start with?
🎯 Decision points:
- Use 🌐 Web Scraper if you need simplicity with full browser capabilities and client-side JavaScript rendering.
- Use 🍩 Cheerio Scraper for fast, cost-effective scraping of static pages with simple server-side JavaScript execution.
- Use 🎭 Playwright Scraper when cross-browser compatibility is crucial.
- Use 👐 Puppeteer Scraper for advanced, powerful scraping where you need both client-side and server-side JavaScript handling.
- Use ⚠️ JSDOM Scraper for lightweight, speedy scraping with minimal client-side JavaScript requirements.
- Use 🍲 BeautifulSoup Scraper for Python-based scraping, especially with recursive crawling and processing URL lists.
To make it easier, here's a short questionnaire that guides you on selecting the best scraper based on your specific use case:
Questionnaire
- Is the website content rendered with a lot of client-side JavaScript?
  - Yes:
    - Do you need full browser capabilities?
      - Yes: use Web Scraper or Playwright Scraper
      - No, but I still want advanced features: use Puppeteer Scraper
  - No:
    - Do you prioritize speed and cost-effectiveness?
      - Yes: use Cheerio Scraper
      - No: use JSDOM Scraper
- Do you need cross-browser support for scraping?
  - Yes: use Playwright Scraper
  - No: continue to the next step.
- Is your preferred scripting language Python?
  - Yes: use BeautifulSoup Scraper
  - No: continue to the next step.
- Are you dealing with static pages or lightweight client-side JavaScript?
  - Static pages: use Cheerio Scraper or BeautifulSoup Scraper
  - Light client-side JavaScript:
    - Do you want a balance between speed and client-side JavaScript handling?
      - Yes: use JSDOM Scraper
      - No: use Web Scraper or Puppeteer Scraper
- Do you need to support recursive crawling or process lists of URLs?
  - Yes, and I prefer Python: use BeautifulSoup Scraper
  - Yes, and I prefer JavaScript: use Web Scraper or Cheerio Scraper
  - No: choose based on other criteria above.
This should help you navigate through the options and choose the right scraper based on the website’s complexity, your scripting language preference, and your need for speed or advanced features.
📚 Resources:
- How to use Web Scraper to scrape any website
- How to use Beautiful Soup to scrape the web
- Learn about our $1/month Creator plan that encourages devs to build Actors based on universal scrapers
Web scraping code templates
Similar to our universal scrapers, our code templates provide a quick start for developing web scrapers, automation scripts, and testing tools. Built on popular libraries like BeautifulSoup for Python or Playwright for JavaScript, they save time on setup, allowing you to focus on customization. Though they require more coding than universal scrapers, they're ideal for developers who want a flexible foundation with room to tailor the solution to their needs.
Code template | Supported libraries | Purpose | Pros | Cons |
---|---|---|---|---|
🐍 Python | Requests, BeautifulSoup, Scrapy, Selenium, Playwright | Creating scrapers, automation, testing tools | Simplifies setup, supports major Python libraries | Requires more manual coding than universal scrapers, may be restrictive for complex tasks |
☕️ JavaScript | Playwright, Selenium, Cheerio, Cypress, LangChain | Creating scrapers, automation, testing tools | Eases development with pre-set configurations, flexibility with JavaScript and TypeScript | Requires more manual coding than universal scrapers, may be restrictive for tasks needing full control |
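To give you an idea of what a template hands you, here's a rough, simplified sketch of the shape of a JavaScript (Playwright + Crawlee) starter. The actual generated project includes input handling, configuration files, and more comments.

```javascript
// Simplified sketch of a JavaScript Playwright template's entry point.
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, pushData }) {
        // Extract the page title using a real headless browser.
        const title = await page.title();
        await pushData({ url: request.url, title });
    },
});

await crawler.run(['https://apify.com']);
await Actor.exit();
```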
📚 Resources:
- How to build a scraper using a web scraper template.
Toolkits and libraries
Apify JavaScript and Python SDKs
The Apify SDKs are designed for developers who want to interact directly with the Apify platform. They let you perform tasks like saving data in Apify datasets, running Apify Actors, and accessing the key-value store. Ideal for those familiar with Node.js or Python, the SDKs provide the tools needed to develop software specifically for the Apify platform, offering complete freedom and flexibility within the JavaScript and Python ecosystems.
- Best for: interacting with the Apify platform (e.g., saving data or running Actors)
- Pros: full control over platform-specific operations, integrates seamlessly with Apify services
- Cons: requires writing boilerplate code, higher complexity with more room for errors
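To give you a feel for the SDK side, here's a minimal sketch of platform operations using the JavaScript SDK; the Python SDK mirrors these calls.

```javascript
// Minimal sketch of platform operations with the Apify SDK for JavaScript.
import { Actor } from 'apify';

await Actor.init();

// Read the JSON input the Actor was started with.
const input = await Actor.getInput();

// Append a record to the run's default dataset.
// (The `query` field is just an example input shape.)
await Actor.pushData({ query: input?.query, processedAt: new Date().toISOString() });

// Persist an arbitrary value in the key-value store.
await Actor.setValue('STATE', { done: true });

await Actor.exit();
```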
Crawlee
Crawlee (for both Node.js and Python) is a powerful web scraping library that focuses on tasks like extracting data from web pages, automating browser interactions, and managing complex scraping workflows. Unlike the Apify SDK, Crawlee does not require the Apify platform and can be used independently for web scraping tasks. It handles complex operations like concurrency management, auto-scaling, and request queuing, allowing you to concentrate on the actual scraping tasks.
- Best for: web scraping and automation (e.g., scraping paragraphs, automating clicks)
- Pros: full flexibility in web scraping tasks, does not require the Apify platform, leverages the Node.js or Python ecosystem
- Cons: requires more setup and coding, higher chance of mistakes with complex operations
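For comparison, a minimal Crawlee crawler in Node.js looks like the sketch below; Crawlee for Python offers an equivalent API.

```javascript
// Minimal Crawlee sketch: crawl a site and store page titles.
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 50, // safety cap while experimenting
    async requestHandler({ request, $, enqueueLinks }) {
        await Dataset.pushData({
            url: request.url,
            title: $('title').text().trim(),
        });
        // Crawlee manages the request queue, concurrency, and retries;
        // you just tell it which links to follow.
        await enqueueLinks({ strategy: 'same-domain' });
    },
});

await crawler.run(['https://crawlee.dev']);
```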
Combining Apify SDK and Crawlee
While these tools are distinct, they can be combined. For example, you can use Crawlee to scrape data from a page and then use the Apify SDK to save that data in an Apify dataset. This integration allows developers to make use of the strengths of both tools while working within the Apify ecosystem.
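In practice, the combination is only a few lines. A minimal sketch:

```javascript
// Sketch: Crawlee does the crawling, the Apify SDK persists results
// in an Apify dataset and manages the Actor lifecycle.
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Saved via the SDK, so the data lands in an Apify dataset.
        await Actor.pushData({
            url: request.url,
            firstParagraph: $('p').first().text(),
        });
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```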
📚 Resources:
- Introduction to Crawlee
- Crawlee blog
- Webinar on scraping with Crawlee 101: how to create scrapers in JavaScript and TypeScript
- Step-by-step video guide: building an Amazon Scraper in Node.js with Crawlee
- Webinar on how to use Crawlee Python
- Introduction to Apify's Python SDK
Code templates vs. universal scrapers vs. libraries
Basically, the choice here depends on how much flexibility you need and how much coding you're willing to do. More flexibility → more coding.
Universal scrapers are simple to set up but less flexible and configurable. Our libraries, on the other hand, enable you to develop a standard Node.js or Python application, so be prepared to write a little more code. The reward is almost infinite flexibility.
Code templates are a middle ground between scrapers and libraries. But since they're built on libraries, they still sit closer to the more-coding end of the spectrum: they only give you starter code to begin with. Take this into account when choosing how to build your scraper, and if in doubt, just ask us and we'll help you out.
Switching sides: how to transfer an existing solution from another platform
You can also take advantage of the Apify platform's features without having to modify your existing scraping or automation solutions.
Integrating Scrapy spiders
The Apify platform fully supports Scrapy spiders. By deploying your existing Scrapy code to Apify, you can take advantage of features like scheduling, monitoring, scaling, and API access, all without needing to modify your original spider. This process is made easy with the Apify CLI, which allows you to convert your Scrapy spider into an Apify Actor with just a few commands. Once deployed, your spider can run in the cloud, offering a reliable and scalable solution for your web scraping needs.
Additionally, you can monetize your spiders by publishing them as Actors on Apify Store, potentially earning passive income from your work while benefiting from the platform’s extensive features.
ScrapingBee, ScrapingAnt, ScraperAPI
To make the transition from these platforms easier, we've created SuperScraper API, an open-source REST API for scraping websites: you pass it a URL and receive the rendered HTML content in return. It functions as a cost-effective alternative to scraping services like ScrapingBee, ScrapingAnt, and ScraperAPI. It supports dynamic content rendering with a headless browser, can use various proxies to avoid blocking, and offers features such as capturing screenshots of web pages. Its scalable nature makes it ideal for large-scale scraping tasks.
To use SuperScraper API, you can deploy it with an Apify API token and access it via HTTP requests. The API supports multiple parameters for fine-tuning your scraping tasks, including options for rendering JavaScript, waiting for specific elements, and handling cookies and proxies. It also allows for custom data extraction rules and JavaScript execution on the scraped pages. Pricing is based on actual usage, which can be cheaper or more expensive than competitors, depending on the configuration.
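As an illustration, calling a deployed SuperScraper instance over HTTP might look like the sketch below. The base URL, the way the token is passed, and the query parameters are assumptions for the sake of the example; check the SuperScraper README for the exact interface of your deployment.

```javascript
// Hypothetical sketch of a SuperScraper API call. SUPERSCRAPER_URL and the
// exact query parameters depend on your deployment; treat them as assumptions.
const baseUrl = process.env.SUPERSCRAPER_URL; // e.g. the URL of your deployed instance
const token = process.env.APIFY_TOKEN;        // your Apify API token

const params = new URLSearchParams({
    url: 'https://example.com', // the page you want scraped
});

// Token passed as a bearer header here; your deployment may expect
// it as a query parameter instead.
const response = await fetch(`${baseUrl}/?${params}`, {
    headers: { Authorization: `Bearer ${token}` },
});

// The service responds with the rendered HTML of the target page.
const html = await response.text();
console.log(html.slice(0, 200));
```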
📚 Resources:
- How to integrate Scrapy projects
- Scrapy monitoring: how to manage your Scrapy spider on Apify
- Run ScrapingBee, ScraperAPI, and ScrapingAnt on Apify — SuperScraper API Tutorial
General resources
- Creating your Actor: Actor sources
- Use it, build it or buy it? Choosing the right solution on Apify
- How to programmatically retrieve data with the Apify API
- Improved way to build your scrapers from a Git repo
- Webinar on how to build and monetize Actors on Apify Store
- 6 things you should know before buying or building a web scraper
- For a comprehensive guide on creating your first Actor, visit the Apify Academy.