How to analyze pages and fix errors

Learn to deal with random crashes in your web scraping and automation jobs. Find out the essentials of debugging and fixing problems in your actors.

Debugging is essential in programming. Even if you would not call yourself a programmer, having basic debugging skills will make building and maintaining scrapers and integration actors on Apify easier. It will help you avoid hiring an expensive developer and solve your issues faster.

This article covers the absolute basics. It discusses the most common problems and the simplest tools for analyzing the issue.

Possible problems

It is often tricky to see the full scope of what can go wrong. We assume once the code is set up correctly, it will keep working. Unfortunately, that is rarely true in the realm of web scraping and automation.

Websites change and introduce new anti-scraping technologies, programming tools change and, in addition, people make mistakes.

Here are the most common reasons your working solution may break.

  • The website changes its layout or data feed.

  • A site's layout changes depending on location or uses A/B testing.

  • A page starts to block you (recognizes you as a bot).

  • The website loads its data later dynamically, so the code works only sometimes, if you are slow or lucky enough.

  • You made a mistake when updating your code.

  • The code worked locally but not on the Apify platform.

  • You have lost access to Apify proxy (your proxy trial is over).

  • You have upgraded your dependencies (other software that you rely upon) and the new versions no longer work (this is harder to debug).

This is a long list, and it is by no means complete. However, if you use the right tools and remember the most common causes, you can find the problem quickly.

Analysis

Web scraping and automation are very specific types of programming. It is not possible to rely on specialized debugging tools, since the code does not output the same results every time.

Many issues are edge cases, which occur in just one of a thousand pages or are time-dependent. Because of this, you cannot rely only on determinism.

Logging

Logging is an essential tool for any programmer. When used correctly, logs help you capture a surprising amount of information.

Note that Apify logs are not infinite. If you see messages with skipped lines, consider toning down your logging.

General rules for logging:

  • Usually, many logs are better than no logs.

  • Putting more information into one line, rather than logging multiple short lines, helps reduce the overall log size.

  • Focus on numbers. Log how many items you extract from a page, etc.

  • Structure your logs and use the same structure in all your logs.

  • Append the current page's URL to each log. This lets you immediately open that page and review it.

Example of a structured log

[CATEGORY]: Products: 20, Unique products: 4, Next page: true --- https://apify.com/store

The log begins with the page type. Usually, we use labels such as [CATEGORY] and [DETAIL]. Then, we log important numbers and other information. Finally, we add the page's URL so we can check if the log is correct.
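
To keep the same structure in every log, it can help to build the line in one place. The following is a minimal sketch, not part of the Apify SDK: the formatLog() helper and its parameters are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: builds one structured log line in the format shown above.
// label - page type, e.g. 'CATEGORY' or 'DETAIL'
// stats - object with the numbers and flags you want to log
// url   - URL of the current page, appended at the end
function formatLog(label, stats, url) {
    const details = Object.entries(stats)
        .map(([name, value]) => `${name}: ${value}`)
        .join(', ');
    return `[${label}]: ${details} --- ${url}`;
}

console.log(formatLog(
    'CATEGORY',
    { Products: 20, 'Unique products': 4, 'Next page': true },
    'https://apify.com/store',
));
```

Because everything goes into a single line, this also keeps the overall log size down.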

Errors

Errors require a different approach because, if your code crashes, your usual logs will not be called. Instead, exception handlers will print your error, but these are usually ugly messages with a stack trace that only Apify experts will understand.

You can overcome this by adding try/catch blocks into your code. In the catch block, explain what happened and re-throw the error (so the request is automatically retried).

try {
    // Sensitive code block
    // ...
} catch (error) {
    // You know where the code crashed so you can explain here
    console.error('Request failed during login with an error:');
    throw error;
}

Read more information about logging and error handling in our public wiki about developer best practices.
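
If you wrap many steps this way, the try/catch pattern can be moved into a small reusable function. This is only a sketch under assumptions: runStep() and its parameters are hypothetical names, not an Apify API.

```javascript
// Hypothetical helper: runs one step of your scraper, logs a readable
// explanation if it fails, and re-throws so the request can be retried.
// stepName - short description of the step, e.g. 'login'
// url      - URL of the page being processed
// fn       - async function containing the sensitive code
async function runStep(stepName, url, fn) {
    try {
        return await fn();
    } catch (error) {
        console.error(`Request failed during ${stepName}: ${error.message} --- ${url}`);
        throw error;
    }
}
```

You would then call it as await runStep('login', request.url, async () => { ... }), where request.url stands for whatever holds the current page's URL in your code.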

Saving snapshots

By snapshots, we mean screenshots if you use a browser/Puppeteer and HTML saved into a key-value store that you can easily display in your browser. Snapshots are useful throughout your code but especially important in error handling.

Note that an error can happen only in a few pages out of a thousand and look completely random. There is not much you can do other than save and analyze a snapshot.

Snapshots can tell you if:

  • A website has changed its layout. This can also mean A/B testing or different content for different locations.

  • You have been blocked – you open a CAPTCHA or Access Denied page.

  • Data loads later dynamically – the page is empty.

  • The page was redirected – the content is different.
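
With many snapshots, a rough first pass over the saved HTML can point you to the likely cause before you open each one. The sketch below is a heuristic built on assumptions: guessSnapshotProblem() is a hypothetical helper, and the marker strings ('captcha', 'access denied') are examples you would need to adjust for the site you scrape.

```javascript
// Hypothetical helper: makes a rough guess about what a saved HTML snapshot shows.
// Returns 'blocked', 'empty' or 'unknown'. It is a heuristic, not a reliable detector.
function guessSnapshotProblem(html) {
    const text = html.toLowerCase();

    // Assumed markers of a blocking page; real sites use different wording.
    if (text.includes('captcha') || text.includes('access denied')) {
        return 'blocked';
    }

    // Crudely strip scripts and tags to see if any visible text is left.
    const visibleText = text
        .replace(/<script[\s\S]*?<\/script>/g, '')
        .replace(/<[^>]+>/g, '')
        .trim();
    if (visibleText.length === 0) {
        return 'empty';
    }

    return 'unknown';
}

console.log(guessSnapshotProblem('<html><body><h1>Access Denied</h1></body></html>'));
```

A layout change or a redirect still needs a visual check of the screenshot, so treat 'unknown' as "open the snapshot and look".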

How to save a snapshot

In Apify scrapers (Web Scraper (apify/web-scraper), Cheerio Scraper (apify/cheerio-scraper) and Puppeteer Scraper (apify/puppeteer-scraper)), you can use their built-in context.saveSnapshot() function. Once called, it saves a screenshot and HTML into the run's key-value store.

When building your own actors with Puppeteer or the Apify SDK package, you can use the powerful utils.puppeteer.saveSnapshot() function. It allows you to name the screenshot, so you can identify it later.

Cheerio-based actors do not have a helper function because they allow taking snapshots with a single line of code. Just save the HTML with the correct content type.

await Apify.setValue('SNAPSHOT', html, { contentType: 'text/html' });

When to save snapshots

The most common approach is to save on error. We can enhance our previous try/catch block like this:

// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Apify.getEnv().defaultKeyValueStoreId;
try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await Apify.utils.puppeteer.saveSnapshot(page, { key });
    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg?disableRedirect=true`;

    // You know where the code crashed so you can explain here
    console.error(`Request failed during login with an error. Screenshot: ${screenshotLink}`);
    throw error;
}

To make the error snapshot descriptive, we name it ERROR-LOGIN. We add a random number so that later ERROR-LOGIN snapshots do not overwrite this one and we can see all of them. If you can use an ID of some sort instead, that is even better.

Beware:

  • The snapshot's name (key) can only contain letters, numbers, dots and dashes. Other characters will cause an error, which makes the random number a safe pick.

  • Do not overdo the snapshots. Once you get out of the testing phase, limit them to critical places. Saving snapshots uses resources.
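
If you build the key from an ID instead of a random number, the ID may contain characters that are not allowed. A minimal sketch of one way to handle this follows; buildSnapshotKey() is a hypothetical helper, and replacing disallowed characters with a dash is an assumption, not an Apify convention.

```javascript
// Hypothetical helper: builds a snapshot key from a label and an ID.
// Anything other than letters, numbers, dots and dashes is replaced with a dash.
function buildSnapshotKey(label, id) {
    const safeId = String(id).replace(/[^a-zA-Z0-9.-]/g, '-');
    return `${label}-${safeId}`;
}

console.log(buildSnapshotKey('ERROR-LOGIN', 'user_42@example.com'));
// prints "ERROR-LOGIN-user-42-example.com"
```

Note that two different IDs can map to the same key after the replacement, so this only works well when the IDs differ in their allowed characters.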

Error reporting

Logging and snapshotting are great tools but once you reach a certain run size, it may be hard to read through them all. For a large project, it is handy to create a more sophisticated reporting system. For example, let's just look at simple dataset reporting.

This example extends our previous snapshot solution by creating a named dataset (named datasets have infinite retention), where we will accumulate error reports. Those reports will explain what happened and will link to a saved snapshot, so we can do a quick visual check.

// Let's create a reporting dataset
// If you already have one, this will continue adding to it
const reportingDataset = await Apify.openDataset('REPORTING');

// storeId is the ID of the current key-value store, where we save snapshots
const storeId = Apify.getEnv().defaultKeyValueStoreId;

// We can also capture actor and run IDs
// to have easy access in the reporting dataset
const { actorId, actorRunId } = Apify.getEnv();
const linkToRun = `https://my.apify.com/actors/${actorId}#/runs/${actorRunId}`;

try {
    // Sensitive code block
    // ...
} catch (error) {
    // Change the way you save it depending on what tool you use
    const randomNumber = Math.random();
    const key = `ERROR-LOGIN-${randomNumber}`;
    await Apify.utils.puppeteer.saveSnapshot(page, { key });

    const screenshotLink = `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.jpg?disableRedirect=true`;

    // We create a report object
    const report = {
        errorType: 'login',
        errorMessage: error.toString(),

        // You will have to adjust the keys if you save them in a non-standard way
        htmlSnapshot: `https://api.apify.com/v2/key-value-stores/${storeId}/records/${key}.html?disableRedirect=true`,
        screenshot: screenshotLink,
        run: linkToRun,
    };

    // And we push the report
    await reportingDataset.pushData(report);

    // You know where the code crashed so you can explain here
    console.error(
        `Request failed during login with an error. Screenshot: ${screenshotLink}`
    );
    throw error;
}
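
Once reports accumulate, a quick summary shows which kind of error dominates. The sketch below works on a plain array of report objects shaped like the one above; countByErrorType() is a hypothetical helper, and how you load the items from the REPORTING dataset (for example by exporting them or reading them through the SDK) is left as an assumption.

```javascript
// Hypothetical helper: counts error reports by their errorType field.
// reports - array of report objects like the one pushed to the dataset above
function countByErrorType(reports) {
    const counts = {};
    for (const { errorType } of reports) {
        counts[errorType] = (counts[errorType] || 0) + 1;
    }
    return counts;
}

// Example input; in practice these items would come from the reporting dataset.
const exampleReports = [
    { errorType: 'login', errorMessage: 'Error: timeout' },
    { errorType: 'login', errorMessage: 'Error: selector not found' },
    { errorType: 'detail', errorMessage: 'Error: price missing' },
];

console.log(countByErrorType(exampleReports));
// prints { login: 2, detail: 1 }
```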