Dealing with headers, cookies, and tokens

Learn how some APIs require certain cookies, headers, and/or tokens to be present in a request before they will return data.

Unfortunately, many APIs require a valid cookie to be included in the cookie header of a request in order for it to be authorized. Other APIs may require special tokens or other data that validates the request.

Luckily, there are ways to retrieve and set cookies for requests prior to sending them, which will be covered in more depth in future Scraping Academy modules. The most important things to know at the moment are:

Cookies

For sites that rely heavily on cookies for user verification and request authorization, certain generic requests (such as to the website's main page, or to the target page) will return one or more set-cookie headers in the response.
The values of these set-cookie response headers can be parsed and sent back as the cookie header of subsequent requests. A great package for parsing these values from a response's headers is set-cookie-parser. With this package, cookies can be parsed from headers like so:

import axios from 'axios';

// import the set-cookie-parser module
import setCookieParser from 'set-cookie-parser';

const getCookie = async () => {
    // make a request to the target site
    const response = await axios.get('https://www.example.com/');

    // parse the cookies from the response
    const cookies = setCookieParser.parse(response);

    // format the parsed data into a usable string ('name=value' pairs separated by '; ')
    const cookieString = cookies.map(({ name, value }) => `${name}=${value}`).join('; ');

    // log the final cookie string to be used in a 'cookie' header
    console.log(cookieString);
};

getCookie();
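
Once the cookies have been parsed, they can be sent back on later requests. The snippet below is a minimal sketch of that step, assuming the cookies are already available as { name, value } objects (the shape set-cookie-parser returns). The buildCookieHeaders helper and the sample cookie values are hypothetical and used only for illustration.

```javascript
// Hypothetical helper: merges a 'cookie' header built from parsed cookies
// ({ name, value } objects) into an existing headers object.
const buildCookieHeaders = (cookies, headers = {}) => ({
    ...headers,
    // the cookie header expects 'name=value' pairs separated by '; '
    cookie: cookies.map(({ name, value }) => `${name}=${value}`).join('; '),
});

// made-up cookies, standing in for the output of setCookieParser.parse()
const parsedCookies = [
    { name: 'session_id', value: 'abc123' },
    { name: 'locale', value: 'en-US' },
];

const requestHeaders = buildCookieHeaders(parsedCookies, { accept: 'application/json' });

console.log(requestHeaders);
```

The resulting object could then be passed as the headers option of a follow-up request, for example axios.get(url, { headers: requestHeaders }).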

Headers

Other APIs may not require a valid cookie header, but will instead require certain headers that a browser typically attaches when a user makes a "real" request. The most commonly required headers are:

User-Agent
Referer
Origin
Host

Headers required by the target API can be configured manually like this and attached to every request the scraper sends:

const HEADERS = {
    // note the trailing space, so that the two parts join as '... Gecko) Chrome/...'
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
        + 'Chrome/96.0.4664.110 YaBrowser/22.1.0.2500 Yowser/2.5 Safari/537.36',
    Referer: 'https://soundcloud.com',
    // ...
};

However, a much better option is either to write a custom implementation that generates random headers for each request, or to use a package such as got-scraping to do this automatically.
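
As a rough idea of what a custom implementation might look like, here is a minimal sketch that picks a random User-Agent for each request. The generateHeaders helper and the USER_AGENTS pool are assumptions made for this example, and the user agent strings are illustrative values rather than a recommended list.

```javascript
// Example user agents to rotate through. These values are illustrative only;
// a real pool would need to be larger and kept up to date.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
];

// Hypothetical helper: returns a fresh headers object with a randomly chosen
// User-Agent. The random source is injectable so the choice can be controlled.
const generateHeaders = (referer, random = Math.random) => {
    const index = Math.floor(random() * USER_AGENTS.length);
    return {
        'User-Agent': USER_AGENTS[index],
        Referer: referer,
    };
};

// call this once per request so each one gets its own set of headers
console.log(generateHeaders('https://soundcloud.com'));
```

A simple rotation like this only varies a single header, which is part of why a dedicated package is usually the more convenient choice.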

With got-scraping, generating request-specific headers can be done right within a request with headerGeneratorOptions. Specific headers can also be set with the headers option:

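// import the gotScraping function from the got-scraping package
import { gotScraping } from 'got-scraping';

const response = await gotScraping({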
    url: 'https://example.com',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89,
            },
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    },
    headers: {
        'some-header': 'Hello, Academy!',
    },
});

Tokens

For our SoundCloud example, testing the endpoint from the previous section in a tool like Postman works perfectly and returns the data we want; however, when the client_id parameter is removed, we receive a 401 Unauthorized error. Luckily, the Client ID is the same for every user, which means that it is not tied to a session or an IP address (this is based on our own observations and tests). The big drawback is that the token used by SoundCloud changes every few weeks, so it shouldn't be hardcoded. This case is actually quite common and is not unique to SoundCloud.

Ideally, this client_id should be scraped dynamically, especially since it changes frequently, but unfortunately, the token cannot be found anywhere on SoundCloud's pages. We already know that it's available within the query parameters of certain requests though, and luckily, Puppeteer offers a way to analyze each response while on a page. It's a bit like using browser DevTools, which you are already familiar with by now, but done programmatically instead.

Here is a way you could dynamically scrape the client_id using Puppeteer:

// import the puppeteer module
import puppeteer from 'puppeteer';

const scrapeClientId = async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    // initialize a variable that will eventually hold the client_id
    let clientId = null;

    // handle each response
    page.on('response', async (res) => {
        // try to grab the 'client_id' parameter from each URL
        // (searchParams.get() returns null when the parameter is absent)
        const id = new URL(res.url()).searchParams.get('client_id');

        // if the parameter exists, set our clientId variable to the newly parsed value
        if (id) clientId = id;
    });

    // visit the page
    await page.goto('https://soundcloud.com/tiesto/tracks');

    // wait for a selector that ensures the page has time to load and make requests to its API
    await page.waitForSelector('.profileHeader__link');

    await browser.close();
    console.log(clientId); // log the retrieved client_id
};

scrapeClientId();
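
Once a client_id has been retrieved, it needs to be attached to each API request as a query parameter instead of being hardcoded. The snippet below is a minimal sketch of that step. The withClientId helper is hypothetical, and the endpoint, token, and limit parameter are placeholders rather than real SoundCloud values.

```javascript
// Hypothetical helper: returns a copy of the given URL with the client_id
// (and any other query parameters) attached.
const withClientId = (url, clientId, params = {}) => {
    const result = new URL(url);
    for (const [key, value] of Object.entries(params)) {
        result.searchParams.set(key, String(value));
    }
    result.searchParams.set('client_id', clientId);
    return result.toString();
};

// placeholder endpoint and token, not real SoundCloud values
const apiUrl = withClientId('https://api.example.com/tracks', 'abc123', { limit: 20 });

console.log(apiUrl);
```

Because the token is passed in as an argument, a freshly scraped value can be swapped in whenever the old one stops working.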

Next up

Keep the code above in mind, because we'll be using it in the next lesson when paginating through results from SoundCloud's API.
