Web scraping basics for Python devs
Learn how to use Python to extract information from websites in this practical course, starting from the absolute basics.
In this course, we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, and more.
What we'll do
- Inspect pages using browser DevTools.
- Download web pages using the HTTPX library.
- Extract data from web pages using the Beautiful Soup library. (A short sketch of these first two steps follows this list.)
- Save extracted data in various formats, e.g. CSV, which MS Excel or Google Sheets can open.
- Follow links programmatically (crawling).
- Save time and effort with frameworks, such as Crawlee, and scraping platforms, such as Apify.
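To give you a taste of what's ahead, here's a minimal sketch of the first two steps: downloading a page with HTTPX and reading its title with Beautiful Soup. The URL is the e-commerce site used throughout the course.

```python
import httpx
from bs4 import BeautifulSoup

# Download the listing page used throughout the course
response = httpx.get("https://warehouse-theme-metal.myshopify.com/collections/sales")
response.raise_for_status()

# Parse the HTML and read the page title
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.text)
```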
Who this course is for
Anyone with basic knowledge of developing programs in Python who wants to start with web scraping can take this course. The course does not expect you to have any prior knowledge of web technologies or scraping.
Requirements
- A macOS, Linux, or Windows machine with a web browser and Python installed.
- Familiarity with Python basics: variables, conditions, loops, functions, strings, lists, dictionaries, files, classes, and exceptions.
- Comfort with importing from the Python standard library, using virtual environments, and installing dependencies with `pip`.
- Familiarity with running commands in Terminal (macOS/Linux) or Command Prompt (Windows).
You may want to know
Let's explore the key reasons to take this course. What is web scraping good for, and what career opportunities does it enable for you?
Why learn scraping
The internet is full of useful data, but most of it isn't offered in a structured way that's easy to process programmatically. That's why you need scraping, a set of approaches to download websites and extract data from them.
Scraper development is also a fun and challenging way to learn web development and web technologies, and to understand how the internet works. You'll reverse-engineer websites, understand how they work internally, discover what technologies they use, and learn how they communicate with servers. You'll also master your chosen programming language and core programming concepts. Understanding web scraping gives you a head start in learning web technologies such as HTML, CSS, JavaScript, frontend frameworks (like React or Next.js), HTTP, REST APIs, GraphQL APIs, and more.
Why build your own scrapers
Scrapers are programs specifically designed to mine data from the internet. Point-and-click or no-code scraping solutions do exist, but they only take you so far. While simple to use, they lack the flexibility and optimization needed to handle advanced cases. Only custom-built scrapers can tackle more difficult challenges. And unlike ready-made solutions, they can be fine-tuned to perform tasks more efficiently, at a lower cost, or with greater precision.
Why become a scraper dev
As a scraper developer, you are not limited by whether certain data is available programmatically through an official API—the entire web becomes your API! Here are some things you can do if you understand scraping:
- Improve your productivity by building personal tools, such as your own real estate or rare sneakers watchdog.
- Get hired by companies to build custom scrapers mining data important for their business.
- Become an invaluable asset to data journalism, data science, or nonprofit teams working to make the world a better place.
- Publish your scrapers on platforms like the Apify Store and earn money by renting them out to others.
Why learn with Apify
We are Apify, a web scraping and automation platform. We do our best to build this course on top of open source technologies. That means what you learn applies to any scraping project, and you'll be able to run your scrapers on any computer. We will show you how a scraping platform can simplify your life, but that lesson is optional and designed to fit within our free tier.
Course content
📄️ DevTools: Inspecting
**In this lesson we'll use the browser tools for developers to inspect and manipulate the structure of a website.** A browser is the most complete tool for navigating websites.
📄️ DevTools: Locating HTML elements
**In this lesson we'll use the browser tools for developers to manually find products on an e-commerce website.** Inspecting Wikipedia and tweaking its subtitle is fun, but let's shift gears and focus on building an app to track prices on an e-commerce site.
📄️ DevTools: Extracting data
**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.** In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data.
📄️ Downloading HTML
**In this lesson we'll start building a Python application for watching prices.**
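As a preview, downloading the Sales page with HTTPX and keeping a local copy for inspection might look like this:

```python
import httpx

url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
response = httpx.get(url)
response.raise_for_status()  # fail loudly on 4xx/5xx responses

# Keep a local copy of the HTML so we can inspect it offline
with open("products.html", "w", encoding="utf-8") as file:
    file.write(response.text)
```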
📄️ Parsing HTML
**In this lesson we'll look for products in the downloaded HTML.**
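For instance, Beautiful Soup can turn the HTML saved in the previous step into a tree of Python objects we can query:

```python
from bs4 import BeautifulSoup

with open("products.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

# The soup object lets us query the document as a tree
print(soup.h1.text)             # the first <h1> heading
print(len(soup.find_all("a")))  # how many links the page contains
```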
📄️ Locating HTML elements
**In this lesson we'll locate product data in the downloaded HTML.**
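A sketch of locating product cards with CSS selectors; the class names come from inspecting the page in DevTools and are specific to this site:

```python
from bs4 import BeautifulSoup

with open("products.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

# Class names below are specific to this site; DevTools reveals them
for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title")
    print(title.text.strip())
```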
📄️ Extracting data from HTML
**In this lesson we'll finish extracting product data from the downloaded HTML.**
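The result could be a list of dictionaries, one per product. The price parsing below assumes the price text contains a dollar amount; the lesson walks through the cleanup real pages need:

```python
from decimal import Decimal

from bs4 import BeautifulSoup

with open("products.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

data = []
for product in soup.select(".product-item"):
    title = product.select_one(".product-item__title").text.strip()
    # Price text such as "Sale price$1,398.00" -> Decimal("1398.00")
    price_text = product.select_one(".price").text
    price = Decimal(price_text.split("$")[-1].replace(",", "").strip())
    data.append({"title": title, "price": price})

print(data)
```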
📄️ Saving data
**In this lesson, we'll save the data we scraped in popular formats, such as CSV or JSON.**
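Python's standard library covers both formats. A sketch with illustrative placeholder data:

```python
import csv
import json

data = [
    {"title": "Example speaker", "price": "74.95"},
    {"title": "Example TV", "price": "1398.00"},
]

# CSV: one row per product, with a header row
with open("products.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(data)

# JSON: the whole list as a single document
with open("products.json", "w", encoding="utf-8") as file:
    json.dump(data, file, indent=2)
```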
📄️ Getting links from HTML
**In this lesson, we'll locate and extract links to individual product pages.**
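Links in listings are often relative, so a sketch of this step also converts them to absolute URLs:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

with open("products.html", encoding="utf-8") as file:
    soup = BeautifulSoup(file.read(), "html.parser")

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
for product in soup.select(".product-item"):
    link = product.select_one("a")
    # href values such as "/products/..." become full URLs
    print(urljoin(listing_url, link["href"]))
```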
📄️ Crawling websites
**In this lesson, we'll follow links to individual product pages.**
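Crawling boils down to a loop: scrape the listing for links, then download and parse each linked page. A sketch:

```python
from urllib.parse import urljoin

import httpx
from bs4 import BeautifulSoup

listing_url = "https://warehouse-theme-metal.myshopify.com/collections/sales"
listing_soup = BeautifulSoup(httpx.get(listing_url).text, "html.parser")

for product in listing_soup.select(".product-item"):
    product_url = urljoin(listing_url, product.select_one("a")["href"])

    # Visit each product's detail page and scrape it as well
    product_soup = BeautifulSoup(httpx.get(product_url).text, "html.parser")
    print(product_soup.select_one("h1").text.strip())
```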
📄️ Scraping product variants
**In this lesson, we'll scrape the product detail pages to represent each product variant as a separate item in our dataset.** We'll need to figure out how to extract variants from the product detail page, and then change how we add items to the data list so we can add multiple items after scraping one product URL.
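The key change is that parsing one product page now returns a list of items rather than a single item. A sketch with a hypothetical variant selector (the lesson derives the real one in DevTools):

```python
def parse_product(product_soup, url):
    """Return one item per variant. The "select option" selector is hypothetical."""
    title = product_soup.select_one("h1").text.strip()
    variants = [option.text.strip() for option in product_soup.select("select option")]
    if not variants:
        return [{"url": url, "title": title, "variant": None}]
    return [{"url": url, "title": title, "variant": variant} for variant in variants]
```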
📄️ Using a framework
**In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework.**
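For illustration, here's roughly what a Crawlee for Python crawler looks like; exact import paths and APIs depend on the Crawlee version you install:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # The framework downloads and parses each page; we only extract the data
    @crawler.router.default_handler
    async def handle(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({"url": context.request.url, "title": context.soup.title.text})

    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])

asyncio.run(main())
```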
📄️ Using a platform
**In this lesson, we'll deploy our application to a scraping platform that automatically runs it daily.**
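Deployment details belong to the lesson, but for a sense of scale, the Apify SDK for Python wraps a scraper in just a few extra lines (a sketch; the data pushed here is placeholder):

```python
import asyncio

from apify import Actor

async def main() -> None:
    # Everything inside this block runs as an Apify Actor, locally or on the platform
    async with Actor:
        await Actor.push_data({"title": "Example speaker", "price": "74.95"})

asyncio.run(main())
```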