Great Learning Guide of Puppeteer Class (Tutorial 7)

Puppeteer which is an open-source node js library, can be used as a web scraping tool. Understanding of command line, Javascript, and HTML DOM structure should be good to start with this puppeteer tutorial. The Series of Puppeteer tutorial is distributed among below Sub section to get a good hold on Puppeteer. 

Puppeteer Tutorial

Puppeteer Tutorial #1: Puppeteer Overview

Puppeteer Tutorial #2: Puppeteer Environment Variables

Puppeteer Tutorial #3: Puppeteer Web Scraping and Puppeteer Test Automation Overview

Puppeteer Tutorial #4: Install Puppeteer

Puppeteer Tutorial #5: Sample Puppeteer Project

Puppeteer Tutorial #6: Puppeteer Automation Testing

Puppeteer Tutorial #7: Puppeteer Class

Puppeteer Tutorial #8: Puppeteer Browser Class

Puppeteer Tutorial #9: Puppeteer Page Class

In this “Puppeteer Class” tutorial, we will explain the below classes which include the important namespaces(if any), events(if any), and methods which are frequently used in Puppeteer web scraping techniques. 

We will explain important components with examples throughout this article.  

Puppeteer Class

Conceptually, the class is a blueprint of an object which defines a set of instructions( variables and methods). Here, the Puppeteer class is defined using javascript to perform different actions to perform web scraping. Let’s check the below example, the Puppeteer class module has been used to launch a Chromium web instance.

const puppeteer = require('puppeteer');
(async () => {
  const browserChrome = await puppeteer.launch();
  const pageChrome = await browserChrome.newPage();
  await pageChrome.goto('https://www.google.com');
  // We can write steps here
  await browserChrome.close();
})();

Puppeteer class also provides multiple Namespaces and Methods, which supports the web scraping process. The frequently used Namespaces and Methods are explained next sections.

Puppeteer Class – Namespaces:

It’s a container that defines multiple identifiers, methods, variables, etc., in javascript. It’s a way to group the code in a logical and organized way. Below namespaces are provided by the Puppeteer class.

puppeteer.devices: It returns a list of devices that can be used by the method page.emulate(options) to perform scraping in mobile devices. 

Example – Open and close google web page on a mobile device –

const puppeteer = require('puppeteer');
const samsung = puppeteer.devices['Samsung J5'];
(async () => {
  const browserChrome = await puppeteer.launch();
  const pageChrome = await browserChrome.newPage();
  await pageChrome.emulate(samsung);
  await pageChrome.goto('https://www.google.com'); 
  await browserChrome.close();
})();

puppeteer.errors: While working with different puppeteer methods, there is a chance of exceptions. Mostly, if the methods are unable to fulfill the requests, it throws errors. There are different classes defined to handle errors through the ‘puppeteer.errors’ namespace.

Example – for method page.waitForSelector, if the web element does not appear within the specified time, then the timeout error will appear. Please go through the below example, which shows an approach to handle timeout,

try {
  await page.waitForSelector('<web-element>');
} catch (err) {
  if (err instanceof puppeteer.errors.TimeoutError) {
    // Write code to handle the timeout error.
  }
} 

puppeteer.networkConditions: It returns a list of network conditions that can be used on the method page.emulateNetworkConditions(networkConditions). The complete list of network conditions is defined here.

Example – Through this code sample, we will open the google web page using a pre-defined network condition.

const puppeteer = require('puppeteer');
const net = puppeteer.networkConditions['Fast 3G'];
(async () => {
  const browserChrome = await puppeteer.launch();
  const pageChrome = await browserChrome.newPage();
  await pageChrome.emulateNetworkConditions(net);
  await pageChrome.goto('https://www.google.com');
  await browserChrome.close();
})();

puppeteer.product: It returns the name of the browser, which will be used for automation(Chrome or Firefox). The product for the browser is set by either environment variable PUPPETEER_PRODUCT or the product option available in puppeteer class method puppeteer.launch([options]). The default value is Chrome.

Reference: Click here to learn more on Puppeteer Class namespaces.

Puppeteer Class – Methods:

Methods contain statements to perform the specific action. The puppeteer class have below methods,

puppeteer.clearCustomQueryHandlers() – It clears all registered handlers.

puppeteer.connect(options) – This method is used to connect puppeteer with any existing browsers. It returns an object of type promise which indicates the status of this asynchronous process. Example – In the below example, puppeteer disconnect from the current browser and reconnect,

const puppeteer = require('puppeteer');
(async () => {
  const browserChrome = await puppeteer.launch();
  // Copy the endpoint reference which will be reconnected later
  const endpoint = browserChrome.wsEndpoint();
  // Disconnect puppeteer
  browserChrome.disconnect();
  // Use the endpoint to re-connect
  const browserChrome2 = await puppeteer.connect({endpoint});
  // Close second instance of Chromium
  await browserChrome2.close();
})();

puppeteer.createBrowserFetcher([options]) – It creates a browser fetcher object to download and manage the different versions of browsers (Chrome and Firefox).

const browserFetcher = puppeteer.createBrowserFetcher();

puppeteer.customQueryHandlerNames() – It returns an array of all the registered custom query handlers.

puppeteer.defaultArgs([options]) – It returns the default configuration options of chrome browser as an array while launching. Also, we can set the configurable options of a browser using the optional argument option.

const args = puppeteer.defaultArgs();

puppeteer.executablePath() – It returns the path is expected by the puppeteer for the bundled browser instance. The path that would not be available in the download was skipped by the environment variable PUPPETEER_SKIP_DOWNLOAD. Also, we can use the environment variables PUPPETEER_EXECUTABLE_PATH and PUPPETEER_CHROMIUM_REVISION to change the path.

const args = puppeteer.executablePath();

puppeteer.launch([options]) – This puppeteer class method is used to launch the web browser. Through the optional argument, we can pass the different configurations of the browser, such as product(browser name), headless, devtools, etc. This method returns the promise object, which holds the reference of the launched browser.

const browser = await puppeteer.launch();

puppeteer.registerCustomQueryHandler(name, queryHandler) – It’s used to register a custom query handler. Here “name” provides the name of the query handler, and “queryHandler” defines the actual custom query handler.

puppeteer.unregisterCustomQueryHandler(name) – It’s used to unregister any custom query handler.

Reference: Click here to read more on Puppeteer Class methods.

Target Class

The target class provides methods to work with targets. The most frequently used methods which are available with target class are explained in the next section.

Target Class – Methods:

Below methods are available in targets class –

  • Target.browser() – It returns the browser object which is linked to the target.
  • Target.browserContext() – It returns an object of type browserContext which is linked to the target.
  • Target.createCDPSession() – It creates and returns the devtool protocol session of the chrome, which is attached to the target.
  • Target.opener() – It returns the target which opens this target. Basically, this method is used to get the parent target. It returns null for the top-level target.
  • Target.page() – It returns the page object of the target. If the type of the target is not a page, it returns a null value.
  • Target.type() – It’s used to get the type of the target. The return value can be either of the options – ’background_page’ , ‘page’ ,’shared_worker’,’service_worker’,’browser’ or ‘other’.
  • Target.url() – It returns the url of the target.
  • Target.worker() – It returns the webworker object. The return value is null if the target is neither ‘service_worker’ nor ‘shared_worker’.

Reference: Click here to read more on Target class methods.

ConsoleMessage Class

The ConsoleMessage class objects are dispatched by page through the console event. The frequently used methods of the consoleMessage class are explained in the next section.

ConsoleMessage Class – Methods:

Below methods are available in ConsoleMessage class –

  • consoleMessage.args() – It returns a array of JSHandler object. The JSHandler prevents the linked JS object from being garbage collected until the handle is disposed of. It’s automatically destroyed when the parent browser context is destroyed.
  • consoleMessage.location() – It returns an object of the resource, which includes the below parameters.
  • url – It denotes the URL of the known resource. If not known, it will keep an undefined value.
  • LineNumber – It’s the 0-based line number that is available in the resource. If not available, it will keep an undefined value.
  • columNumber – It’s the 0-based column number that is available in the resource. If not available, it will keep an undefined value.
  • consoleMessage.stackTrace() – It returns a list of objects(each object refers a resource) which includes below parameters.
  • url – It denotes the URL of the known resource. If not known, it will keep an undefined value.
  • LineNumber – It’s the 0-based line number that is available in the resource. If not available, it will keep an undefined value.
  • columNumber – It’s the 0-based column number that is available in the resource. If not available, it will keep an undefined value.
  • consoleMessage.text() – It returns the text of the console.
  •  consoleMessage.type() – It returns the string as the type of console message. The type can be either of the values – log, debug, info, error, warning, dir, dirxml, table, trace, clear, startGroup, startGroupCollapsed, endGroup, assert, profile, profileEnd, count, timeEnd.

Reference: Click here to learn more on consoleMessage class methods.

TimeoutError Class

While working with different puppeteer, there is a chance of exceptions. Mostly, if the methods are unable to fulfill the requests, it throws errors. The TimeoutError class is used to handle this kind of exception.

Example of TimeoutError Class – for method page.waitForSelector, if the web element does not appear within the specified time, then the timeout error will appear. Please go through the below example, which shows an approach to handle timeout,

try {
  await page.waitForSelector('<element>');
} catch (e) {
  if (e instanceof puppeteer.errors.TimeoutError) {
    // Write code to handle the error.
  }
} 

FileChooser Class

The file chooser class object is created using the method page.waitForFileChooser. The FileChooser class is used to interact with files. The frequently used methods of the FileChooser class are explained in the next section.

FileChooser Class – Methods:

Below methods are available for FileChooser class –

  • fileChooser.accept(file_with_path) – This method is used to upload any file (for which path is provided as an argument).
  • fileChooser.cancel() – This method is used to cancel the file upload process.
  • fileChooser.isMultiple() – This method checks if the fileChooser allows to select multiple values. It returns a boolean expression(true or false).

An example of FileChooser class –

const [fileChooser] = await Promise.all([
  page.waitForFileChooser(),
  page.click('#attach-button'), 
]);
await fileChooser.accept(['/puppeteer_proj/data/sample_file.pdf']);

Conclusion:

In this “Puppeteer Class” tutorial, we have explained the Puppeteer class, Target class, MessageConsole class and TimeoutError class which includes the important namespaces(if any), events(if any), and methods that are frequently used in Puppeteer web scraping techniques with examples. In the next article, we will explain BrowserContext, Browser, and BrowserContext Class.

Leave a Comment