Exporting data
Learn how to export the data you scraped using Crawlee to CSV or JSON.
In the previous lessons, you learned that:

- You can use `Dataset.pushData()` to save data to the default dataset.
- The default dataset files are saved in the `./storage/datasets/default` folder.
But when we look inside the folder, we find a separate file for each scraped item, and we don't want to work with those manually. Instead, we can use the dataset itself to export the data.
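If you ever want to inspect the stored items programmatically instead, here's a minimal sketch, assuming the default dataset:

```js
import { Dataset } from 'crawlee';

// Open the default dataset and read all stored items into memory.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`The dataset contains ${items.length} items.`);
```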
Exporting data to CSV
Crawlee's `Dataset` provides a way to export all your scraped data into one big CSV file, which you can then open in Excel or any other data processor. To do that, call `Dataset.exportToCSV()` after collecting all the data, that is, after your crawler run finishes.
```js
// ...
await crawler.run();
// Add this line to export to CSV.
await Dataset.exportToCSV('results');
```
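The line above exports the default dataset. If you keep your results in a named dataset, the same method is also available on dataset instances; a minimal sketch, assuming a hypothetical dataset named 'products':

```js
import { Dataset } from 'crawlee';

// Open a named dataset and export its contents to CSV,
// saved under the key 'results' in the default key-value store.
const dataset = await Dataset.open('products');
await dataset.exportToCSV('results');
```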
After you add this one line and run the code, you'll find your CSV with all the scraped products at `./storage/key-value-stores/default/results.csv`.
The key-value store is another of Crawlee's storages. It's best for saving files like CSVs, PDFs, or images, but also large JSONs or crawler statistics.
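You can also write to the key-value store directly with `KeyValueStore.setValue()`; a minimal sketch, with a made-up statistics object:

```js
import { KeyValueStore } from 'crawlee';

// Save an arbitrary object to the default key-value store.
// It ends up as ./storage/key-value-stores/default/crawl-stats.json
await KeyValueStore.setValue('crawl-stats', { itemsScraped: 24, errors: 0 });
```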
Exporting data to JSON
Exporting to JSON is very similar to exporting to CSV, but we'll use a different function: `Dataset.exportToJSON()`. Exporting to JSON is useful when you don't want to work with each item separately, but would rather have one big JSON file with all the results.
```js
// ...
await crawler.run();
// Add this line to export to JSON.
await Dataset.exportToJSON('results');
```
You will find the resulting JSON at `./storage/key-value-stores/default/results.json`.
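Once the export finishes, the file is plain JSON on disk, so you can load it back with Node.js alone; a minimal sketch, assuming the default path above:

```js
import { readFile } from 'node:fs/promises';

// Load the exported JSON and parse it into an array of scraped items.
const json = await readFile('./storage/key-value-stores/default/results.json', 'utf8');
const results = JSON.parse(json);
console.log(`Loaded ${results.length} results.`);
```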
Final scraper code
```js
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // We removed the headless: false option to hide the browser windows.
    requestHandler: async ({ parseWithCheerio, request, enqueueLinks }) => {
        console.log(`Fetching URL: ${request.url}`);

        // On the start URL, only enqueue the product detail pages.
        if (request.label === 'start-url') {
            await enqueueLinks({
                selector: 'a.product-item__title',
            });
            return;
        }

        // Parse the browser's page with Cheerio.
        const $ = await parseWithCheerio();

        // Extract the product details from the page.
        const title = $('h1').text().trim();
        const vendor = $('a.product-meta__vendor').text().trim();
        const price = $('span.price').contents()[2].nodeValue;
        const reviewCount = parseInt($('span.rating__caption').text(), 10);
        const description = $('div[class*="description"] div.rte').text().trim();
        const recommendedProducts = $('.product-recommendations a.product-item__title')
            .map((i, el) => $(el).text().trim())
            .toArray();

        // Save the extracted data to the default dataset.
        await Dataset.pushData({
            title,
            vendor,
            price,
            reviewCount,
            description,
            recommendedProducts,
        });
    },
});

await crawler.addRequests([{
    url: 'https://warehouse-theme-metal.myshopify.com/collections/sales',
    label: 'start-url',
}]);

await crawler.run();

// Export the whole dataset to a single CSV file.
await Dataset.exportToCSV('results');
```
Next up
And that's it for the Basics of crawling section of the Web scraping for beginners course. If you want to learn more, test your knowledge of the methods and concepts you learned in this course by moving on to the challenge.