Web scraping with Puppeteer – Quick Start
Puppeteer is a Node.js library from the Google Chrome team that lets you control a headless Chrome instance. With Puppeteer you can take screenshots, track page-loading performance, generate PDFs from web pages, scrape content, and much more.
Prerequisites
Node.js is installed on your computer.
Installing Puppeteer
Create a project folder (e.g. MyTestProject) and run the following command inside it:
npm install puppeteer
Catch the Screen
In the project folder, create a file named Screenshot.js:
const puppeteer = require('puppeteer');
console.log("Hello j‑labs");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.j-labs.pl');
  await page.screenshot({ path: 'jlabs.png' });
  await browser.close();
})();
Screenshot.js file
First, we import the puppeteer library:
const puppeteer = require('puppeteer');
This line is the traditional "Hello World"; it will be displayed in the console window:
console.log("Hello j‑labs");
The browser instance is created with the launch() method. To get a page object, call newPage() on the browser object:
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
})();
Navigate to the page you want to capture:
await page.goto('https://www.j-labs.pl');
Take a screenshot and save it in your project folder:
await page.screenshot({path: 'jlabs.png'});
Finally, close the browser. Awaiting close() ensures Chrome shuts down before the script exits:
await browser.close();
Let's grab the author and post date from the j‑labs blog
The script below scrapes the post date and author from the first page of the j‑labs blog. First we need to find the right selector on the page, which is as easy as opening Chrome Developer Tools and inspecting the element. In our case it is div.NewsSummaryPostdate.
const puppeteer = require("puppeteer");
var fs = require("fs");
(async () => {
  try {
    // open the headless browser
    var browser = await puppeteer.launch({ headless: true });
    // open a new page
    var page = await browser.newPage();
    // enter url in page
    await page.goto(`https://blog.j-labs.pl/`);
    // wait until the posts are rendered
    await page.waitForSelector("div.NewsSummaryPostdate");
    var news = await page.evaluate(() => {
      var contentList = document.querySelectorAll(`div.NewsSummaryPostdate`);
      var dataArray = [];
      for (var i = 0; i < contentList.length; i++) {
        dataArray[i] = {
          title: contentList[i].innerText.trim()
        };
      }
      return dataArray;
    });
    await browser.close();
    // Writing the news inside a json file
    fs.writeFile("output.json", JSON.stringify(news), function(err) {
      if (err) throw err;
      console.log("Output Saved");
    });
    console.log("Browser Closed");
  } catch (err) {
    // Catch and display errors
    console.log(err);
    // browser may be undefined if launch() itself failed
    if (browser) await browser.close();
    console.log("Browser closed with error");
  }
})();
The fs module is needed to save the output to a file:
var fs = require("fs");
This line gets the list of nodes matching the div.NewsSummaryPostdate selector:
var contentList = document.querySelectorAll(`div.NewsSummaryPostdate`);
In the for loop, we get innerText for each node:
for (var i = 0; i < contentList.length; i++) {
dataArray[i] = {
title: contentList[i].innerText.trim()
};
}
Time to save the output to a file:
fs.writeFile("output.json", JSON.stringify(news), function(err) {
if (err) throw err;
console.log("Output Saved");
});
Summary
Puppeteer is a tool with many possibilities. You can use it for page-performance tracking, automated testing, crawling single-page apps, and more. Automated tests in Puppeteer typically run faster than in Selenium. Its two main disadvantages are that it works only with Chrome/Chromium and that, as a Node.js library, it supports only JavaScript.