Filtering links

When you collect links from a web page, you often end up with a lot of irrelevant URLs. Learn how to filter the links to only keep the ones you need.

Web pages are full of links, and frankly, most of them are useless to us. There are two approaches to filtering them. The first targets the links we're interested in using unique CSS selectors. The second collects all links and then uses pattern matching to find the correct URLs. In real scraping scenarios, the two approaches are often combined for the most powerful filtering.

Filtering with unique CSS selectors

In the previous chapter, we simply grabbed all the links from the HTML document.

document.querySelectorAll('a');

Attribute selector

But that's not the only way. Since we're interested in the href attributes, a reasonable first filter is to target only the <a> tags that have an href attribute (yes, anchors without one can exist). You can do that with the CSS attribute selector.

document.querySelectorAll('a[href]');

Always adding the [href] selector will save you from nasty bug hunts on some pages. Next, we can limit the number of results by only targeting the country links. In DevTools, we see that all the country links are in an HTML list denoted by <ul> and <li> tags. In addition, the <ul> element has the class countries. We can leverage that.

Learn more about HTML lists and the <ul> and <li> tags.

Descendant selector

document.querySelectorAll('ul.countries a[href]');

We already know both the ul.countries and a[href] selectors, but their combination is new. It's called a descendant selector, and it selects all <a href="..."> elements that are descendants of a <ul class="countries"> element. A descendant is any element that's nested somewhere inside another element.

nested HTML tags

When we print all the URLs in the DevTools console, we'll see that we've correctly filtered only the country links.

for (const a of document.querySelectorAll('ul.countries a[href]')) {
    console.log(a.href);
}

country URLs printed to console

Filtering with pattern matching

Another common way to filter links (or any text really) is to match patterns with regular expressions.

Learn more about regular expressions.

We can inspect the country URLs, and we'll soon find that they all look like the following: they're exactly the same except for the last 2 letters.

https://www.alexa.com/topsites/countries/AF
https://www.alexa.com/topsites/countries/AX
https://www.alexa.com/topsites/countries/AL
https://www.alexa.com/topsites/countries/DZ
https://www.alexa.com/topsites/countries/AR
...
https://www.alexa.com/topsites/countries/{2_LETTER_COUNTRY_CODE}

Now, we create a regular expression that matches those links. There are many ways to do this. For simplicity, let's go with this one:

alexa\.com\/topsites\/countries\/[A-Z]{2}

This regular expression matches all URLs that include the alexa.com/topsites/countries/ substring followed by 2 capital letters.
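As a quick sanity check, we can run the expression against a few sample URLs before touching the page at all. The list below reuses two country URLs from above plus one unrelated URL that should not match:

```javascript
// A couple of country URLs from above, plus one URL
// without a country code that should be filtered out.
const urls = [
    'https://www.alexa.com/topsites/countries/AF',
    'https://www.alexa.com/topsites/countries/DZ',
    'https://www.alexa.com/topsites',
];

const regExp = /alexa\.com\/topsites\/countries\/[A-Z]{2}/;

// Keep only the URLs that match the pattern.
const countryUrls = urls.filter((url) => regExp.test(url));
console.log(countryUrls);
// The last URL is dropped because it has no 2-letter country code.
```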

A great way to learn more about regular expression syntax and to test your expressions is to use tools like regexr.com or regex101.com.

To test our regular expression in the DevTools console, we'll first create a RegExp object and then test the URLs with the regExp.test(string) function.

// To demonstrate pattern matching, we use only the 'a'
// selector to select all links on the page.
// We create the RegExp once, outside the loop.
const regExp = /alexa\.com\/topsites\/countries\/[A-Z]{2}/;
for (const a of document.querySelectorAll('a')) {
    const url = a.href;
    if (regExp.test(url)) {
        console.log(url);
    }
}

If you run the code in DevTools, you'll see that it produces exactly the same URLs as the CSS filter.

Yes, filtering with CSS selectors is often the better option. But sometimes it's not enough. Learning regular expressions is a very useful skill in many scenarios.
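Combining the two looks like this in practice: the CSS selector narrows the candidates, and the regular expression makes the final cut. The sketch below stands in for DevTools; in the browser, the hrefs array would come from something like [...document.querySelectorAll('ul.countries a[href]')].map((a) => a.href), but here we hard-code a sample (including one made-up noise URL) so the example is self-contained:

```javascript
// In DevTools, these hrefs would come from the CSS-selected links.
// The 'go-premium' entry is a made-up example of noise that a CSS
// selector alone might let through.
const hrefs = [
    'https://www.alexa.com/topsites/countries/AF',
    'https://www.alexa.com/topsites/countries/AX',
    'https://www.alexa.com/topsites/countries/go-premium',
];

const countryRegExp = /alexa\.com\/topsites\/countries\/[A-Z]{2}/;

// CSS already narrowed the links; the regex weeds out the rest.
const countryUrls = hrefs.filter((href) => countryRegExp.test(href));
console.log(countryUrls);
```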

Next up

In the next chapter, we'll see that rewriting this code for Node.js is not so simple, and we'll learn about absolute and relative URLs in the process.