Learn how to crawl the web with your scraper: how to extract links and URLs from web pages, and how to manage the collected links.
In this part, we will take a look at moving between web pages, which we call crawling. We will collect browsing data for all the countries in the Alexa Top Sites by Country index. To do that, we need to crawl the individual country websites.
Crawling websites is a fairly straightforward process. We'll start by opening the first web page and collecting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the Basics of data collection course. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case something happens to our scraper, we won't have to collect them again. And, finally, we will visit those URLs one by one.
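As a rough illustration of the link collection and filtering step, here is a minimal sketch that uses only Python's standard library. The sample HTML, the `https://example.com/topsites` URL, the `/countries/` path prefix, and the `LinkCollector` and `extract_links` names are all assumptions made up for this example. They are not taken from the real website or from the course code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, path_prefix):
    """Return absolute, de-duplicated URLs on the same host under path_prefix."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parsed.netloc != base_host:
            continue  # skip links that lead to other websites
        if not parsed.path.startswith(path_prefix):
            continue  # skip pages we don't want to visit
        if absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page content and URL, used only for illustration.
SAMPLE_HTML = """
<a href="/countries/US">United States</a>
<a href="/countries/DE">Germany</a>
<a href="/about">About</a>
<a href="https://other.example.org/countries/FR">External</a>
<a href="mailto:hello@example.com">Contact</a>
<a href="/countries/US">United States (duplicate)</a>
"""

links = extract_links(SAMPLE_HTML, "https://example.com/topsites", "/countries/")
print(links)
```

Only the two `/countries/` links on the same host survive the filtering. A real scraper would most likely use a dedicated HTML parsing library and a filter tailored to the target website.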
At any point, we can collect URLs, data, or both. Crawling can be kept separate from data collection, but it doesn't have to be. In most projects, it's actually easier and faster to do both at the same time. To summarize, the process goes like this:
1. Visit the start URL.
2. Collect the next URLs (and data) and save them.
3. Visit one of the collected URLs and save data and/or more URLs.
4. Repeat steps 2 and 3 until you have everything you need.
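
The loop described above can be sketched in a few lines of Python. To keep the example runnable without network access, the `fetch` function is a stand-in that reads from a hypothetical in-memory `FAKE_SITE` dictionary instead of downloading and parsing real pages. The URLs, the `crawl` function, and its `max_pages` parameter are likewise assumptions for illustration only.

```python
import json
from collections import deque

# Hypothetical in-memory "website": URL -> (data on the page, links on the page).
FAKE_SITE = {
    "https://example.com/": (None, ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ({"name": "A"}, ["https://example.com/b", "https://example.com/c"]),
    "https://example.com/b": ({"name": "B"}, ["https://example.com/"]),
    "https://example.com/c": ({"name": "C"}, []),
}


def fetch(url):
    """Stand-in for a real HTTP request plus parsing; returns (data, links)."""
    return FAKE_SITE.get(url, (None, []))


def crawl(start_url, max_pages=100):
    queue = deque([start_url])  # URLs waiting to be visited
    seen = {start_url}          # every URL ever enqueued, so none is visited twice
    results = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()   # visit one of the collected URLs
        visited += 1
        data, links = fetch(url)
        if data is not None:
            results.append({"url": url, **data})  # save the data
        for link in links:
            if link not in seen:                  # save only new URLs
                seen.add(link)
                queue.append(link)
    # Return the unvisited URLs too, so they could be stored and resumed later.
    return results, list(queue)


results, pending = crawl("https://example.com/")
print(json.dumps(results, indent=2))
```

In this sketch, the queue and the set of seen URLs live only in memory. To survive a crash, as described earlier, they would need to be written to a file or a database after each step.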