Basics of crawling

Learn how to crawl the web with your scraper: how to extract links (URLs) from web pages and how to manage the collected links.

Welcome to the second part of our Web scraping for beginners course. In the Basics of data collection part, we learned how to extract data from a web page, specifically the Alexa Top Sites index.

In this part, we will take a look at moving between web pages, which we call crawling. We will collect browsing data for all the countries in the Alexa Top Sites by Country index. To do that, we need to crawl the individual country pages.

How to crawl?

Crawling websites is a fairly straightforward process. We'll start by opening the first web page and collecting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the Basics of data collection part. We'll add some extra filtering to make sure we only keep the URLs we actually want. Then, we'll save those URLs, so that if something happens to our scraper, we won't have to collect them again. Finally, we will visit those URLs one by one.
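To make the collect, filter, and save steps more concrete, here is a minimal sketch in Python using only the standard library. The sample HTML, the example.com URLs, the "/countries/" path prefix, and the helper names (LinkCollector, collect_links, save_links) are all assumptions made up for illustration; they are not the code or the pages used in this course.

```python
import json
import os
import tempfile
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_links(html, base_url, path_prefix):
    """Return absolute, de-duplicated URLs on base_url's host under path_prefix."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen = set()
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # turn relative links into absolute URLs
        parts = urlparse(url)
        # Filtering: keep only links on the same host and under the wanted path.
        if parts.netloc != base_host or not parts.path.startswith(path_prefix):
            continue
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def save_links(links, path):
    """Save the collected URLs so they don't have to be collected again."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(links, f, indent=2)


# Hypothetical page content, for illustration only.
SAMPLE_HTML = """
<a href="/countries/CZ">Czech Republic</a>
<a href="/countries/US">United States</a>
<a href="/countries/CZ">Czech Republic again</a>
<a href="/about">About</a>
<a href="https://other.example/countries/DE">External link</a>
"""

links = collect_links(SAMPLE_HTML, "https://example.com/countries", "/countries/")
out_path = os.path.join(tempfile.mkdtemp(), "urls.json")
save_links(links, out_path)
print(links)
```

In this sketch, the duplicate link, the "/about" link, and the link to another host are all dropped, so only the two country URLs are saved.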

At any point, we can collect URLs, data, or both. Crawling can be done separately from data collection, but that is not a requirement. In most projects, it's actually easier and faster to do both at the same time. To summarize, the process goes like this:

  1. Visit the start URL.
  2. Collect next URLs (and data) and save them.
  3. Visit one of the collected URLs and save data and/or more URLs.
  4. Repeat steps 2 and 3 until you have everything you need.
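
The steps above can be sketched as a simple loop over a queue of URLs. The following Python example is only an illustration under stated assumptions: FAKE_SITE is a made-up in-memory stand-in for a real website, and fetch_links is a hypothetical placeholder for downloading a page and extracting its links. A real crawler would make HTTP requests and would typically save data along the way.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def fetch_links(url):
    """Placeholder for downloading a page and extracting its links."""
    return FAKE_SITE.get(url, [])


def crawl(start_url, max_pages=100):
    """Visit pages breadth-first, starting at start_url, without repeats."""
    queue = deque([start_url])  # URLs collected but not yet visited
    seen = {start_url}          # every URL ever queued, so none is queued twice
    visited = []                # pages in the order they were visited

    while queue and len(visited) < max_pages:
        url = queue.popleft()          # steps 1 and 3: visit a URL
        visited.append(url)
        for link in fetch_links(url):  # step 2: collect the next URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)     # save them for later visits
    return visited                     # step 4: the loop ends when the queue is empty


visited = crawl("https://example.com/")
print(visited)
```

Keeping a set of already seen URLs is what stops the loop from revisiting pages that link back to each other, and max_pages is a safety limit so that the sketch always terminates.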

Next up

First, let's make sure we all understand the foundations. In the next chapter, we will review the scraper code we already have from the Basics of data collection section of the Academy.