Version: 3.0

Crawl a website with relative links

When crawling a website, you may encounter several types of links that you want to follow. To make this easy, the crawler context provides the enqueueLinks() method, which automatically finds links on the page and adds them to the crawler's RequestQueue.
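
To illustrate what handling a relative link involves, here is a minimal sketch of how an href can be resolved against the URL of the page it was found on, using the standard URL class. The resolveLink() helper and the example.com URLs are assumptions made for illustration and are not part of Crawlee; enqueueLinks() takes care of relative links for you.

```javascript
// Hypothetical helper (not a Crawlee API): resolve an href found on a page
// against the URL of that page, using the standard WHATWG URL class.
function resolveLink(href, pageUrl) {
    try {
        // Absolute hrefs are returned unchanged; relative ones are resolved against pageUrl.
        return new URL(href, pageUrl).href;
    } catch {
        // The href could not be parsed as a URL.
        return null;
    }
}

const pageUrl = 'https://example.com/some/path';

console.log(resolveLink('/absolute/example', pageUrl)); // https://example.com/absolute/example
console.log(resolveLink('./relative/example', pageUrl)); // https://example.com/some/relative/example
console.log(resolveLink('https://subdomain.example.com/some/path', pageUrl)); // https://subdomain.example.com/some/path
```

Once every link is in absolute form, deciding whether to enqueue it comes down to comparing its hostname with the hostname of the current page.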

We provide 3 different strategies for crawling relative links:

All, which enqueues every link found, regardless of the domain it points to.
SameHostname, which enqueues every link found that points to the same hostname, including any of its subdomains.
SameSubdomain, which enqueues only the links found that share both the subdomain and the hostname. This is the default strategy.

note

These examples use CheerioCrawler, but the same method is available in both PuppeteerCrawler and PlaywrightCrawler, and you use it in exactly the same way.

The examples below cover each strategy in turn: All Links, Same Hostname, and Same Subdomain.

All Links

Example domains

This strategy matches every URL found, even those that lead away from the site you are currently crawling.

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10, // Limit the crawl to 10 requests (remove this if you want to crawl all links)
    async requestHandler({ request, enqueueLinks }) {
        console.log(request.url);
        await enqueueLinks({
            // Setting the strategy to 'all' will enqueue all links found
            strategy: 'all',
        });
    },
});

// Run the crawler
await crawler.run(['https://apify.com/']);

await Actor.exit();

Same Hostname

Example domains

For a URL of https://example.com, enqueueLinks() will match relative URLs, URLs that point to the same full domain, and URLs that point to any subdomain of that domain.

For instance, hyperlinks like https://subdomain.example.com/some/path, https://example.com/some/path, /absolute/example or ./relative/example will all be matched by this strategy.
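
The matching rule described above can be approximated with a short sketch. The matchesSameHostname() helper is hypothetical and only mirrors the behavior described on this page; it is not how Crawlee implements the strategy internally, and real-world edge cases may be handled differently.

```javascript
// Hypothetical helper (not a Crawlee API): approximates the 'same-hostname'
// strategy as described above, where links to the base hostname
// or to any of its subdomains are accepted.
function matchesSameHostname(href, baseUrl) {
    const base = new URL(baseUrl);
    let target;
    try {
        // Relative hrefs are resolved against the base URL first.
        target = new URL(href, base);
    } catch {
        return false; // Unparseable hrefs never match.
    }
    return target.hostname === base.hostname
        || target.hostname.endsWith(`.${base.hostname}`);
}

const base = 'https://example.com';

console.log(matchesSameHostname('https://subdomain.example.com/some/path', base)); // true
console.log(matchesSameHostname('https://example.com/some/path', base)); // true
console.log(matchesSameHostname('/absolute/example', base)); // true
console.log(matchesSameHostname('./relative/example', base)); // true
console.log(matchesSameHostname('https://otherexample.com', base)); // false
```

Note the leading dot in the suffix check: without it, a hostname such as otherexample.com would wrongly count as a subdomain of example.com.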

import { Actor } from 'apify';
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10, // Limit the crawl to 10 requests (remove this if you want to crawl all links)
    async requestHandler({ request, enqueueLinks }) {
        console.log(request.url);
        await enqueueLinks({
            // Setting the strategy to 'same-hostname' will enqueue all links found that are on the same hostname
            // (including its subdomains) as request.loadedUrl or request.url
            strategy: EnqueueStrategy.SameHostname,
            // Alternatively, you can pass in the string 'same-hostname'
            // strategy: 'same-hostname',
        });
    },
});

// Run the crawler
await crawler.run(['https://apify.com/']);

await Actor.exit();

Same Subdomain

tip

This is the default strategy when calling enqueueLinks(), so you don't have to specify it.

Example domains

For a URL of https://subdomain.example.com, enqueueLinks() will only match relative URLs or URLs that point to the same full domain.

For instance, hyperlinks like https://subdomain.example.com/some/path, /absolute/example or ./relative/example will all be matched by this strategy, while https://other-subdomain.example.com or https://otherexample.com will not.
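
As a rough illustration of this stricter rule, the sketch below accepts a link only when its hostname is identical to that of the base URL. The matchesSameSubdomain() helper is hypothetical and based solely on the description above; Crawlee's internal matching may differ in its details.

```javascript
// Hypothetical helper (not a Crawlee API): approximates the 'same-subdomain'
// strategy as described above, where only links whose hostname is
// exactly the same as the base URL's hostname are accepted.
function matchesSameSubdomain(href, baseUrl) {
    const base = new URL(baseUrl);
    let target;
    try {
        // Relative hrefs are resolved against the base URL first.
        target = new URL(href, base);
    } catch {
        return false; // Unparseable hrefs never match.
    }
    return target.hostname === base.hostname;
}

const base = 'https://subdomain.example.com';

console.log(matchesSameSubdomain('https://subdomain.example.com/some/path', base)); // true
console.log(matchesSameSubdomain('/absolute/example', base)); // true
console.log(matchesSameSubdomain('./relative/example', base)); // true
console.log(matchesSameSubdomain('https://other-subdomain.example.com', base)); // false
console.log(matchesSameSubdomain('https://otherexample.com', base)); // false
```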

import { Actor } from 'apify';
import { CheerioCrawler, EnqueueStrategy } from 'crawlee';

await Actor.init();

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10, // Limit the crawl to 10 requests (remove this if you want to crawl all links)
    async requestHandler({ request, enqueueLinks }) {
        console.log(request.url);
        await enqueueLinks({
            // Setting the strategy to 'same-subdomain' will enqueue all links found that are on the same subdomain and hostname
            // as request.loadedUrl or request.url
            strategy: EnqueueStrategy.SameSubdomain,
            // Alternatively, you can pass in the string 'same-subdomain'
            // strategy: 'same-subdomain',
        });
    },
});

// Run the crawler
await crawler.run(['https://apify.com/']);

await Actor.exit();