Finish Node.js scraper

Continue learning how to create a web scraper with Node.js and cheerio. Learn how to parse HTML and print results.

In the first part of the Node.js tutorial we downloaded the HTML of the Alexa Top Sites index and parsed it with Cheerio. Now, we will replicate the collection logic from the Collecting Data with DevTools chapters and finish our scraper.

Querying data with Cheerio

As a reminder, the data we need from the Top Sites index is available in the 50 <div> elements with class site-listing. The CSS selector to find those is div.site-listing.

Selecting an element from the Elements tab

To get all the elements with that CSS selector using Cheerio, we call the $ function with the selector.

$('div.site-listing');
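
The call returns a Cheerio selection of every matching element. A quick way to verify that the selector works is to print the selection's .length, which should equal the number of listings on the page:

const sites = $('div.site-listing');
console.log(sites.length); // prints 50 when all the listings are matched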

We will use the same approach as in the previous DevTools chapters: a for..of loop that iterates over the sites we saved in the sites variable. The code is a little different from the DevTools version, because we're running it in Node.js with Cheerio, not in a browser.

// main.js
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Download the page's HTML.
const response = await gotScraping('https://www.alexa.com/topsites');
const html = response.body;

// Parse the HTML and select all the site listings.
const $ = cheerio.load(html);
const sites = $('div.site-listing');
for (const site of sites) {
    // Wrap the raw element so we can call Cheerio functions on it.
    const element = $(site);
    console.log(element.text());
}

After you run this script, you should see the data of all 50 sites printed in your terminal. Don't forget about the const element = $(site); line. Without wrapping each site in $(), we wouldn't be able to call the .text() function on it.
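
To see why the wrapping matters: iterating over a Cheerio selection yields the raw parsed elements, which don't carry Cheerio's functions. Only the wrapped object does. A small check illustrating this (not needed in the scraper itself):

for (const site of sites) {
    console.log(typeof site.text);    // undefined - raw element has no .text()
    console.log(typeof $(site).text); // function - the Cheerio wrapper does
}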

Collecting final data

Now we only need to repeat the process from the DevTools chapters and add the collection of the individual data points to the loop. From those chapters we know that the data points are in <div> elements with the class td.

Finding child elements in Elements tab
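
To picture what we're selecting, here is a simplified sketch of one listing's markup. It shows only the structure the scraper relies on; the real page contains more elements and attributes:

<div class="site-listing">
    <div class="td">...</div> <!-- rank -->
    <div class="td">...</div> <!-- site -->
    <div class="td">...</div> <!-- daily time on site -->
    <div class="td">...</div> <!-- daily page views -->
    <div class="td">...</div> <!-- percent of traffic from search -->
    <div class="td">...</div> <!-- total linking sites -->
</div>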

We will loop over all the sites and collect the data points from each of them using the for..of loop. For reference, this is the code from the DevTools chapter, where we collected the data using a browser.

// This is code from the browser Console. It won't work in Node.js
const results = [];

for (const site of sites) {
    const fields = site.querySelectorAll('div.td');
    results.push({
        rank: fields[0].textContent.trim(),
        site: fields[1].textContent.trim(),
        dailyTimeOnSite: fields[2].textContent.trim(),
        dailyPageViews: fields[3].textContent.trim(),
        percentFromSearch: fields[4].textContent.trim(),
        totalLinkingSites: fields[5].textContent.trim(),
    });
}

console.log(results);

And this is what the code looks like with Node.js and Cheerio.

const results = [];

for (const site of sites) {
    const fields = $(site).find('div.td');
    results.push({
        rank: fields.eq(0).text().trim(),
        site: fields.eq(1).text().trim(),
        dailyTimeOnSite: fields.eq(2).text().trim(),
        dailyPageViews: fields.eq(3).text().trim(),
        percentFromSearch: fields.eq(4).text().trim(),
        totalLinkingSites: fields.eq(5).text().trim(),
    });
}
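
Before comparing the two versions, it may help to see what .find() and .eq() do in isolation. A minimal sketch, reusing the variables from the snippet above:

// $('div.td') would select every field on the whole page, while
// $(site).find('div.td') selects only the six fields inside one listing.
const fields = $(site).find('div.td');

// .eq(0) returns a Cheerio-wrapped element, so .text() can be chained.
const rank = fields.eq(0).text().trim();
// fields[0] would return the raw element, which has no .text() function.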

The main differences are that we use the .find() function to select all the div.td elements within a single listing, and that we access the individual fields with .eq() instead of bracket indexing. If you find the differences confusing, don't worry about it. It will become very natural once you've done it a few times. The final scraper code looks like this:

// main.js
import { gotScraping } from 'got-scraping';
import * as cheerio from 'cheerio';

// Download the Top Sites index page.
const response = await gotScraping('https://www.alexa.com/topsites');
const html = response.body;

// Parse the HTML and select all the site listings.
const $ = cheerio.load(html);
const sites = $('div.site-listing');
const results = [];

for (const site of sites) {
    // Limit the field selection to the current listing.
    const fields = $(site).find('div.td');
    results.push({
        rank: fields.eq(0).text().trim(),
        site: fields.eq(1).text().trim(),
        dailyTimeOnSite: fields.eq(2).text().trim(),
        dailyPageViews: fields.eq(3).text().trim(),
        percentFromSearch: fields.eq(4).text().trim(),
        totalLinkingSites: fields.eq(5).text().trim(),
    });
}

console.log(results);

Printing all websites' data to terminal
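
The printed results form an array of 50 objects, one per site. Its shape will look roughly like this (values elided here):

[
  { rank: '1', site: '...', dailyTimeOnSite: '...', dailyPageViews: '...', ... },
  { rank: '2', site: '...', dailyTimeOnSite: '...', ... },
  ...
]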

If you were able to get here, run the code, get the results, and also understand everything, you can pat yourself on the back and congratulate yourself on completing the Basics of data collection part of the Web Scraping Academy. Great job! 👏🎉

Next up

While we were able to collect the data, it's not very useful to have it just printed to the console. In the next, bonus chapter, we will learn how to convert the data to CSV and save it to a file.