Scraping with Puppeteer and exporting the data in JSON format
Export scraped data from Puppeteer
App development is no walk in the park. Choosing the right themes, frameworks, or libraries can be a tedious process, and sourcing the right APIs to build on can be challenging, especially when the one you need is not available.
In that case, scraping may be all you need to get the right data for your app.
In this blog post, we are going to learn how to use Puppeteer to scrape data from the web and then export it in JSON format.
By the end, you should be able to scrape data with Puppeteer and apply the technique to any project of your choice.
Prerequisites
Basic knowledge of Node.js and JavaScript is required to follow along.
Setup
To get started, create a folder named puppeteer, then open it as a workspace in VS Code, as shown below.
Press Ctrl + ` (backtick) to open the terminal in VS Code.
You should already have Node.js installed on your system. If you are unsure whether it is installed, run the command below.
node -v
This will return the version currently installed on your system:
// v21.1.0
I am currently running v21.1.0, but depending on when you are reading this and what you have installed, your version may differ from mine.
Next, generate a package.json file. Type the command below and hit Enter.
npm init -y
This will create and set up a package.json file where all dependencies and devDependencies will be saved.
Now that we have that figured out, let's install Puppeteer.
npm i puppeteer
The above command installs Puppeteer if your package manager is npm. If you are using yarn or pnpm, use one of the commands below instead.
# using yarn
yarn add puppeteer
# using pnpm
pnpm i puppeteer
In your terminal, the output of the installation should look like the following.
By default, npm creates the node_modules folder and package-lock.json, and updates the package.json file with the latest version of Puppeteer. At the time of writing, I have 21.5.0 installed; yours should be similar or higher. Use npm i puppeteer@21.5.0 to install the exact version used in this tutorial.
Entry Point Setup
Create an index.js file. This will serve as the entry point for our application.
const puppeteer = require('puppeteer');
Open the index.js file and type the code above. This is CommonJS syntax, which is what we'll be using in this tutorial.
(async () => {
  // your code goes right here
})();
The code above is an immediately invoked function expression (IIFE): the function runs as soon as it is defined. Declaring it async lets us use await inside it to handle asynchronous operations.
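As a minimal sketch of this pattern in isolation, the snippet below uses a hypothetical wait() helper (not part of Puppeteer) in place of real browser calls. It illustrates that code after an await only runs once the awaited Promise has resolved, and that a try/catch inside the IIFE is one way to handle a rejected Promise.

```javascript
// Hypothetical helper for illustration: resolves with `value` after `ms` milliseconds.
const wait = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

// Records the order in which the steps run.
const steps = [];

// An async function always returns a Promise, so the result of the IIFE can be kept.
const finished = (async () => {
  try {
    steps.push("start");
    const result = await wait(10, "page loaded"); // pauses here until resolved
    steps.push(result);
  } catch (error) {
    // A rejected Promise from any await above would land here.
    console.error("Something went wrong:", error);
  }
  return steps;
})();

finished.then((order) => console.log(order));
```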
Now that we have set the function boilerplate, let's create a headless browser instance with Puppeteer.
(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await browser.close();
})();
Here, puppeteer.launch() starts the browser, and the resulting instance is assigned to the browser variable.
The { headless: "new" } option tells the browser to run in headless mode, that is, in the background without displaying its UI.
Next, browser.newPage() opens a new tab and assigns it to the page variable.
Finally, await browser.close() closes the browser once the preceding steps have completed.
Scraping web page data
To begin scraping data, you have to decide which platform you want to scrape. While scraping is often easy, some platforms block scrapers from accessing their data, and you will have to use a third-party tool such as ZenRows or Oxylabs, to mention just a few.
In this example, I will be making use of Techmeme.com for scraping.
(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto("https://www.techmeme.com/");
  await browser.close();
})();
Here, the goto() method is used to specify the URL of the page we want to scrape.
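goto() also accepts an options object as its second argument. The sketch below assumes you want to wait only until the HTML has been parsed and to give up after 30 seconds; the openPage name is made up for this example, while waitUntil and timeout are existing goto() options.

```javascript
// Hypothetical helper: opens `url` in a new tab and returns the page.
async function openPage(browser, url) {
  const page = await browser.newPage();
  await page.goto(url, {
    waitUntil: "domcontentloaded", // resolve once the HTML has been parsed
    timeout: 30000, // reject with an error after 30 seconds
  });
  return page;
}

// Usage inside the async function from earlier (assumes `browser` was launched):
// const page = await openPage(browser, "https://www.techmeme.com/");
```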
Once the URL has been specified, visit the website you want to scrape and inspect it to get an overview of the HTML structure. In this case, we will visit techmeme.com and inspect the page to decide which elements to grab and return.
Right-click on the page to open the context menu, then click Inspect, or simply press the F12 key. See the example below.
Your DevTools should open up as shown below.
From the above screenshot, we have identified the element we want to scrape: the div with the class clus. Its parent has the id topcol1, so we will grab these elements with Puppeteer using the #topcol1 .clus selector.
Try the code sample below.
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
await page.goto("https://www.techmeme.com/");
const techNewsApis = await page.evaluate(() =>
  Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
    // Some code example here.
  }))
);
Here, document.querySelectorAll() selects every element with the class clus that is a descendant of the element with the id topcol1. Array.from() then creates a new, shallow-copied Array from the returned elements, passing each one through the mapping function given as its second argument.
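To see the mapping behaviour of Array.from() in isolation, here is a small sketch that runs in plain Node.js. The fakeNodeList object and its contents are made up; it only stands in for the NodeList that querySelectorAll() returns in the browser.

```javascript
// An array-like object (indexed keys plus length), similar in shape to a NodeList.
const fakeNodeList = {
  length: 2,
  0: { innerText: "First headline", href: "https://example.com/1" },
  1: { innerText: "Second headline", href: "https://example.com/2" },
};

// The second argument is a map function applied to each item while the array is built.
const headlines = Array.from(fakeNodeList, (e) => ({
  title: e.innerText,
  url: e.href,
}));

console.log(Array.isArray(headlines)); // true
console.log(headlines.length); // 2
```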
Also, notice that we are using the evaluate() method, which takes a callback function. Puppeteer runs that function inside the page and returns a Promise that resolves to the function's return value. In this case, the resolved value is assigned to the techNewsApis variable.
That is the basic example of how to grab elements, but it is not the whole story: so far we have selected elements without telling Puppeteer what to extract from them.
What we want is to return each headline on the web page and the link associated with it. We also want to attach a unique id to each item.
From the above screenshot, I inspect the first <a> tag and return its name value as the id.
Similarly, I inspect the <a> tag with the class ourh, which is a descendant of the clus element, and return both the headline and the link associated with it.
See the code below.
const techNewsApis = await page.evaluate(() =>
  Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
    id: e.querySelector(".clus > a").name,
    title: e.querySelector("a.ourh").innerText,
    url: e.querySelector("a.ourh").href,
  }))
);
Here, the querySelector() method is used to access the elements, and three properties are read to return the respective values:
name returns the numeric string value specified on the <a> tag
innerText returns the text content between the opening and closing <a> tags
href returns the link specified on the <a> tag
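One caveat: querySelector() returns null when nothing matches, so reading .name or .innerText on a missing element throws a TypeError. A defensive variant using optional chaining might look like the sketch below. The extractItem name is hypothetical, and because the callback passed to evaluate() runs inside the page, a helper like this would need to be defined inside that callback rather than at the top of index.js.

```javascript
// Hypothetical helper: builds one record from a .clus element, tolerating missing children.
function extractItem(e) {
  const anchor = e.querySelector(".clus > a");
  const headline = e.querySelector("a.ourh");
  return {
    // `?.` stops at a null element; `??` then supplies null as the fallback value.
    id: anchor?.name ?? null,
    title: headline?.innerText.trim() ?? null,
    url: headline?.href ?? null,
  };
}
```

Items with a null title or url can then be filtered out before the data is used.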
Now that we have successfully scraped the data, let's return it and see what the data looks like.
const techNewsApis = await page.evaluate(() =>
  Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
    id: e.querySelector(".clus > a").name,
    title: e.querySelector("a.ourh").innerText,
    url: e.querySelector("a.ourh").href,
  }))
);
console.log(techNewsApis);
await browser.close();
Pass the resolved value to the console.log() method.
node index.js
Go to the terminal and run the above command.
The scraped data is now logged to the console as shown above, with the id, title, and url specified for each item.
Exporting scraped data in a JSON format
Logging data to the terminal is of limited use, since we cannot access or reuse it from there.
The next step is to save it in the local directory of the project we are working on so that we can make use of it.
const puppeteer = require('puppeteer');
const fs = require('fs');
With the fs (file system) module imported, we can write the data to a file that an application can use.
const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
  const browser = await puppeteer.launch({ headless: "new" });
  const page = await browser.newPage();
  await page.goto("https://www.techmeme.com/");
  const techNewsApis = await page.evaluate(() =>
    Array.from(document.querySelectorAll("#topcol1 .clus"), (e) => ({
      id: e.querySelector(".clus > a").name,
      title: e.querySelector("a.ourh").innerText,
      url: e.querySelector("a.ourh").href,
    }))
  );
  // Save data to JSON file
  fs.writeFile("techNewsApis.json", JSON.stringify(techNewsApis), (error) => {
    if (error) throw error;
    console.log(`techNewsApis is now saved in your project folder`);
  });
  console.log(techNewsApis);
  await browser.close();
})();
Here, the fs module is used as follows:
fs.writeFile("techNewsApis.json", ...) sets the name of the file to be saved
JSON.stringify(techNewsApis) converts the scraped value to a valid JSON string
if (error) throw error throws the error if one occurs
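As an alternative sketch, the Promise-based fs API fits the async/await style used in the rest of the script, and passing 2 as the third argument to JSON.stringify() indents the output so the file is easier to read. The saveJson and loadJson names, the demo file name, and the sample record are all assumptions made for this example.

```javascript
const fs = require("node:fs/promises");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: writes `data` to `filePath` as indented JSON.
async function saveJson(filePath, data) {
  await fs.writeFile(filePath, JSON.stringify(data, null, 2), "utf8");
}

// Hypothetical helper: reads a JSON file back into a JavaScript value.
async function loadJson(filePath) {
  return JSON.parse(await fs.readFile(filePath, "utf8"));
}

// Demo with made-up data, written to the OS temp directory.
const demoFile = path.join(os.tmpdir(), "techNewsApis-demo.json");
const sample = [
  { id: "a1", title: "Example headline", url: "https://example.com/" },
];

const roundTrip = saveJson(demoFile, sample).then(() => loadJson(demoFile));
roundTrip.then((data) => console.log(data));
```

In the scraper itself, the equivalent call would be await saveJson("techNewsApis.json", techNewsApis) placed before browser.close().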
Next, go to the terminal and run the command as shown below.
node index.js
This will save the exported data as a valid JSON file for external use.
From the above screenshot, the exported data is saved directly in the project folder.
Conclusion
Scraping can be very useful for getting specific data when the data we need is not readily available through an API but can be found on another website. In that case, we can scrape the data and use it in our project.
However, this comes with some caveats, as some website owners do not permit scraping of their sites.