Web scraping using Puppeteer

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. It can be useful when a resource does not provide a public API for its data, or when your application needs to incorporate content from another resource.

Scraping can be done with a number of different tools. It is possible to do web scraping with pure Node.js if you wish, but there are libraries that make the process much easier.

“Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default but can be configured to run full (non-headless) Chrome or Chromium.” — from the Puppeteer GitHub repo README.

Puppeteer runs a headless browser that you can use (amongst other things) to scrape a webpage for its content using HTML DOM selectors. In this article, we will showcase some basics of Puppeteer that you can use for developing your scrapers or automating UI testing for your application.

Let's make a simple single-page scraper using Puppeteer. Scraping public data is usually allowed, but scraping can sometimes be prohibited by privacy policies. For that reason, we will scrape the quotes list at quotes.toscrape.com, a website provided by toscrape that is made for the very purpose of being scraped.

Start by installing the puppeteer library in your project. The following command will install both the library and the headless browser for which the API is designed:

npm i puppeteer 
# or "yarn add puppeteer"

We need to initialize the headless browser. It is better to do this in a separate file, which we will call ‘browser.js’, in order to keep the file structure well organized, as your scraper can grow in complexity really fast.

const puppeteer = require('puppeteer'); // Import puppeteer

async function startBrowser(){
    let browser;
    try {
        console.log("Opening the browser......");
        // Start a browser instance in headless mode
        browser = await puppeteer.launch({
            headless: true,
        });
    } catch (err) {
        console.log("Could not create a browser instance => : ", err);
    }
    return browser;
}

module.exports = {startBrowser}

Note that startBrowser() is an async function and that the await operator is used. That is because the launch() method on puppeteer returns a promise that needs to be resolved, which is true of most of the Puppeteer API.
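To make this concrete, here is the same launch written with explicit promise chaining instead of await (a standalone sketch, not part of browser.js):

const puppeteer = require('puppeteer');

// launch() returns a promise, so it can also be consumed with
// .then()/.catch() instead of await:
puppeteer.launch({ headless: true })
    .then(browser => {
        console.log("Browser launched");
        return browser.close(); // Close the instance once we are done with it
    })
    .catch(err => console.log("Could not create a browser instance => : ", err));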

We can then create a new file called ‘pageScraper.js’ where the actual scraper functionality will be defined.

const scraperObject = {
    url: 'https://quotes.toscrape.com/',
    async scraper(browserInstance){
        let browser = await browserInstance; // Resolve the browser instance
        let page = await browser.newPage(); // Open a new page

        console.log(`Navigating to ${this.url}...`);
        // Navigate to the selected page
        await page.goto(this.url);

        // Wait for the required DOM to be rendered
        await page.waitForSelector('body > div > div:nth-child(2) > div.col-md-8');

        // Use DOM selectors to find the quote text and author of each quote listed on the page
        let quoteList = await page.$$eval('body > div > div:nth-child(2) > div.col-md-8 > div', quotes => {
            let data = []; // Array containing the quote objects
            let text = quotes.map(el => el.querySelector('span.text').textContent);
            let author = quotes.map(el => el.querySelector('span > small.author').textContent);

            // Create the quote objects
            for (let i = 0; i < quotes.length; i++) {
                data[i] = {
                    text: text[i],
                    author: author[i]
                };
            }

            return data;
        });

        return quoteList;
    }
}

module.exports = scraperObject;

The scraper is done. All that is left is to create an index file, in which we will pass the instance of the browser to our scraper. Let's create ‘index.js’.
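A minimal version of it (assuming the browser.js and pageScraper.js files from above) could look like this:

const browserObject = require('./browser');     // { startBrowser }
const scraperObject = require('./pageScraper'); // the scraper defined above

// startBrowser() returns a promise of a browser instance;
// scraper() awaits it internally, so we can pass the promise directly
let browserInstance = browserObject.startBrowser();

scraperObject.scraper(browserInstance)
    .then(quotes => {
        console.log(quotes); // Print the scraped quotes
        return browserInstance.then(browser => browser.close()); // Clean up
    })
    .catch(err => console.log("Scraping failed => : ", err));

Run it with: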

node index.js

This is an extremely simple example: it only scrapes a single page. However, it can serve as a good base structure for creating more complex scrapers, as the sketch below suggests.
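For instance, a hypothetical scraperAllPages() method, added next to scraper() in pageScraper.js, could follow the site's “Next” link until it disappears. This is only a sketch, and it assumes the markup of quotes.toscrape.com (div.quote elements and a li.next pager link on each page):

async scraperAllPages(browserInstance){
    let browser = await browserInstance;
    let page = await browser.newPage();
    await page.goto(this.url);

    let allQuotes = [];
    while (true) {
        // Collect the quotes on the current page, as in scraper() above
        let quotes = await page.$$eval('div.quote', els =>
            els.map(el => ({
                text: el.querySelector('span.text').textContent,
                author: el.querySelector('span > small.author').textContent
            }))
        );
        allQuotes = allQuotes.concat(quotes);

        // Follow the "Next" link if the pager offers one; otherwise stop
        let nextLink = await page.$('li.next > a');
        if (!nextLink) break;
        await Promise.all([
            page.waitForNavigation(), // Wait for the next page to load
            nextLink.click()
        ]);
    }
    return allQuotes;
}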

The Puppeteer API comes with really detailed documentation, which can be found in the GitHub repo or, in interactive form, on pptr.dev.

If you have any questions, feel free to contact us.

Nikita Sazhinov

Full-Stack Developer
