The Full Guide to Web Scraping & Automation with JavaScript and NodeJS

George Gkasdrogkas
September 30, 2022
Bonus: if you like our content and this “NodeJS and JavaScript web scraping” guide, you can join our web browser automation Slack community.

For nearly 30 years, JavaScript has been one of the most popular programming languages for web development. It is so widely used and liked by developers that it has also been adapted to run outside the browser, most notably in Node, the most popular server-side JavaScript environment. Node has created an enormous ecosystem, with use cases ranging from data science and game development to AI and, of course, web automation.
You’re here to learn more about the last one, web automation, so we have created a full guide for anyone who wants to start web automation in 2022 using this incredible JavaScript platform.

 Web Scraping & Automation with JavaScript and NodeJS

Web fundamentals

There are several different ways to do web automation in Node, and fundamentally each one uses a library that provides a high-level API for communicating with the underlying engine of the browser. We have dedicated many articles to using these different APIs. To become a true expert, it is beneficial to get a more in-depth understanding of the core concepts behind how websites work and how these libraries can extract data and simulate user actions in a web browser.

Most web automation tasks include some kind of data retrieval from websites. We might want to extract information about our favorite YouTube video or conduct E2E tests on our web application. To handle those cases, we rely on the web browser to do all the heavy lifting (request the resources, execute the various scripts, and render the web page). Sometimes we use the term browser automation interchangeably to denote that we take advantage of the underlying API each browser exposes and communicate with it directly. But why can we query a web page in the first place? The answer lies in how browsers process data behind the scenes.

How websites are built

Websites all start with a text file written in a format called HTML (HyperText Markup Language), the most fundamental building block of the Web, which defines the meaning and structure of web content. We can add styling through CSS (Cascading Style Sheets) and functionality/behavior through JS (JavaScript). A website usually contains multiple HTML, CSS, and JS files.
The website is then stored on a remote computer that runs a web server: software that makes documents available to everyone on the World Wide Web through a public static IP address and the HTTP protocol (more on that later). We often associate a readable name, called a domain name, with that IP address, so we do not have to type strange numbers. It’s easier to remember google.com than 2a00:1450:4001:809::200e.

How do browsers access websites?

Now that we have a website up and running, the next step is to use a web browser to access it. This is easy; we type the domain, e.g. google.com, in the address bar and the front page of Google is shown after a few seconds. But what happens behind the scenes?

[Screenshot: the Google front page, with https://google.com shown in the browser’s address bar]

If we closely inspect the address bar of the above screenshot, we can see that the domain name is prepended by the text https://. This signifies the protocol of the address, which in this case is HTTP over SSL. Wow, two new terms here.

The HTTP (Hypertext Transfer Protocol) protocol is the foundation of data communication for the World Wide Web. It is how we communicate with remote entities and exchange information. Accessing websites always requires two parties: a client that requests a web resource (i.e., HTML, media, scripts, etc.), which can be a web browser, and a web server that responds to those requests. Depending on the availability of the resource, the HTTP protocol defines various response status codes to inform the client accordingly.

Let’s take the browser role: a user requests the resource example.com. This is a website, so we will create an HTTP request to that address (called the host). Here is a simplified example of such a request:


GET / HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0

HTTP requests consist of the following elements:

  • The first line consists of the HTTP Method, usually a verb like GET, POST, or a noun like HEAD that defines the operation that needs to be performed. We are trying to fetch a resource, so we are using GET.
  • The HTTP method is followed by the Path of the resource to fetch. In our example, we want the root path denoted by a slash character. If we tried to access the page example.com/pages/1, we would use /pages/1 as the Path.
  • The version of the HTTP protocol used completes the first line. There have been several versions over the years; currently, HTTP/2 is used by the majority of web servers.
  • The next lines define headers that convey additional information for the servers, like the host requested and the browser version used to make the request.

After the web server receives the request, it processes the various headers, ensures that the resource exists, and returns the appropriate HTTP response to the browser. Suppose we try to access an HTML page and that page exists; a simplified response (the exact headers and body depend on the server) will look like this:
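
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8

<!doctype html>
<html>
  <head>
    <title>Example Domain</title>
  </head>
  <body>
    ...
  </body>
</html>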


HTTP responses consist of the following elements:

  • The first line contains the HTTP protocol version.
  • The protocol version is followed by a response status code and the corresponding status message, a short description of the status code. There are various predefined status codes, such as 200 OK, 404 Not Found, and 500 Internal Server Error.
  • The following lines contain response headers, like those for requests.
  • Optionally, a body containing the fetched resource completes the response. We requested an HTML page, so the response body corresponds to that data.

This is a high-level representation of the communication behind the scenes. After receiving the HTTP response, the browser must render the plain text of the response body and make an interactive page. The following section explains the basics of this process, but before that, we have a pending issue to resolve.
Recall that when we accessed google.com, the browser used the HTTPS protocol, which we mentioned is HTTP over SSL. To secure HTTP communications, an extension of the HTTP protocol was created: HTTPS (Hypertext Transfer Protocol Secure), which encrypts the communication using TLS (Transport Layer Security) or, formerly, SSL (Secure Sockets Layer). This ensures that malicious third parties cannot tamper with the exchanged data.

How do browsers render websites?

This is a guide for beginners. If you know all this, feel free to skip to the next section.

Though every rendering engine works differently, here is a summary of how most of them work.

After retrieving the assets that correspond to a webpage through multiple HTTP requests, the next thing in line is for the browser to build the “content tree”; the rendering engine will start parsing the HTML document and convert elements to DOM (Document Object Model) nodes. The DOM is an object representation of the HTML document and the interface for interacting with HTML elements. It is specified by the W3C as a generic specification for manipulating documents.

The browser engine will also parse style data from inline style elements and external CSS files to create the “render tree”. This is another object constructed in parallel with the “content tree”. The render tree is the visual representation of the document: it contains rectangles with visual attributes, like color and dimensions, placed in the order in which they will be painted on the screen.

The next thing is to create the “layout”. This means giving each node the exact coordinates where it should appear on the screen.

The final process is “painting”. The browser will call the renderer’s “paint()” method to display content on the screen.

Web automation fundamentals

Depending on the software we use, we access the DOM tree when we try to “scrape” a web page. The DOM has an almost one-to-one relation to the markup.

<html>
 <body>
   <p>
     Hello World
   </p>
   <div> <img src="hello_image.png"/></div>
 </body>
</html>

For example, this markup would be translated to a DOM tree along these lines (sketched here in text form, with text content shown in quotes):

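html
└── body
    ├── p
    │   └── "Hello World"
    └── div
        └── img

Every approach we cover later ultimately queries this structure; in a browser console, for instance, document.querySelector('p').textContent returns the "Hello World" text (plus its surrounding whitespace).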

That concludes our introduction to web fundamentals. Now that we have a deeper understanding of what happens under the hood, we can implement a simple web scraping script.

Scraping a web page using plain HTTP requests and Regex

> Web scraping is extracting valuable data from a web page. There are several ways to do scraping: directly accessing the World Wide Web using HTTP or using a web automation framework. In the following three sections, we will present various web scraping methods, starting with a lower-level approach through plain HTTP requests and ending with a full-featured automation library and a remote automation platform.

A browser is not the only way to retrieve a web page’s HTML and markup data. If we know that a website is static, meaning it doesn’t rely on complex client-side rendering logic at runtime, we can use plain HTTP requests to retrieve that data. Node has a built-in HTTP client we can use for this.

In this section, we will try to retrieve the Table of Contents from a Wikipedia article using plain HTTP requests to access the web page contents and regex to retrieve the required data. Here is a screenshot from the article we will use.

[Screenshot: the Table of Contents of the Wikipedia article on the World Wide Web]

Introducing HTTP module in Node

Node.js comes with both HTTP and HTTPS modules in the standard library. Those modules allow us to conduct HTTP and HTTPS requests. The first task is to import the module into a JS script, make an HTTPS request using the desired URL, and save the returned HTML page into a string to later parse it.


import https from 'node:https'
 
const request = https.request({
  hostname: 'en.wikipedia.org',
  port: 443,
  path: '/wiki/World_Wide_Web',
  method: 'GET'
}, result => {
  let html = ''
 
  result.on('data', dataBuffer => {
    const partialHTML = dataBuffer.toString()
    html += partialHTML
  })
 
  result.on('end', () => {
    // DO SOMETHING WITH THE HTML STRING
    console.log(html)
  })
})
 
request.on('error', error => {
  console.error(error)
})
 
request.end()

Let’s walk through the code. First, we import the https module. We use ESM and prefix the package name with node:; this ensures that Node’s built-in https module is imported even if one of our other dependencies has a conflicting name. Because we are using import, the project either has to use a transpiler like Babel or set the type property in package.json to “module”.


import https from 'node:https'
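
For reference, a minimal package.json that opts the project into ES modules might contain little more than the following (the remaining fields, such as name and dependencies, are omitted here):

{
  "type": "module"
}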

Next, we call the request method on the https module. This returns a ClientRequest object representing an in-progress request whose headers have already been queued.


const request = https.request({

The request method accepts two parameters: a configuration object and a callback that receives the response object. In the configuration, we define all the information needed to conduct the HTTP request, such as the hostname, the resource path, and the HTTP method. We also define a port property set to 443: most HTTPS servers listen on port 443.


{
  hostname: 'en.wikipedia.org',
  port: 443,
  path: '/wiki/World_Wide_Web',
  method: 'GET'
}

The second parameter to request is a callback. The callback receives a result object as a parameter. We can register a data event listener on the result; its callback will be called multiple times with partial data. To get the complete HTML string, we define an html variable that is gradually built up each time the data callback executes. We then register another event listener for the end event. Its callback executes once all the data has been retrieved successfully. Inside it, we can access the html variable, which will contain the entire HTML markup of the page. Later we will parse this string to extract the appropriate information, but for now, we just print it to the console.


result => {
  let html = ''
 
  result.on('data', dataBuffer => {
    const partialHTML = dataBuffer.toString()
    html += partialHTML
  })
 
  result.on('end', () => {
    // DO SOMETHING WITH THE HTML STRING
    console.log(html)
  })
}

After initializing the request, we register an error event handler. This ensures we can handle any errors that might occur while conducting the request. For this simple example, we just print them to standard error.


request.on('error', error => {
  console.error(error)
})

We finish sending the request by calling the end method on the request object.


request.end()

After executing the complete script, we should get the entire HTML page of the Wikipedia article we requested.

Extract data from HTML using Regular Expressions

Now that we have retrieved the HTML of the article as a string value, we can proceed to the next part, which is to retrieve the list items of the contents table. We achieve this by utilizing Regular Expressions (regex), but first, we’ll have to know the structure of the content. By accessing the URL of the article and making a quick inspection of the DOM tree, we can identify the DOM elements that represent each corresponding list item.

[Screenshot: DevTools inspection of a Table of Contents entry, showing a span element with the toctext class]

As we can see from the above screenshot, each list item is enclosed by a span tag that contains a class attribute with a value of toctext.

<span class="toctext">History</span>

To retrieve the text inside the enclosing tags, we can use a regex that matches the literal enclosing markup and provides a capturing group for the text in between.

/<span class="toctext">(.*?)<\/span>/gm

Transferring this regex into our JS script, inside the callback of the end event listener, we get the following result:


result.on('end', () => {
  const contents = [...html.matchAll(/<span class="toctext">(.*?)<\/span>/gm)].map(match => match[1])
  console.log(contents)
})

A call to the matchAll method on the html string returns an iterator, which can be turned into a JS array using the spread syntax.

Running the whole script, we should be presented with an array of strings: the list of items from the table of contents.


[
  'History',              'Function',
  'HTML',                 'Linking',
  'WWW prefix',           'Scheme specifiers',
  'Pages',                'Static page',
  'Dynamic pages',        'Website',
  'Browser',              'Server',
  'Cookie',               'Search engine',
  'Deep web',             'Caching',
  'Security',             'Privacy',
  'Standards',            'Accessibility',
  'Internationalisation', 'See also',
  'References',           'Further reading',
  'External links'
]

What does it mean for this method to work only on static websites?

At the beginning of this section, it was mentioned that the web page we want to scrape needs to be static. Many modern front-ends are built with frameworks and libraries like Angular and React. They render the appropriate content dynamically after all the corresponding JavaScript assets are loaded and evaluated on the page. This is called CSR (Client-Side Rendering): the server delivers an almost empty HTML file (or one with a loading screen), and the content appears only after the browser fetches, parses, and executes the JavaScript.
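
To illustrate, the initial HTML served by a typical CSR application might contain little more than an empty mount point and a script tag (the element id and bundle path below are made-up examples):

<!doctype html>
<html>
  <head>
    <title>My App</title>
  </head>
  <body>
    <!-- The visible content is rendered into this element only after the JavaScript bundle loads and executes -->
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>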

If we try to make an HTTP request to a webpage that relies on CSR, we won’t be able to get any valuable data, because the scripts have not yet been evaluated to construct the appropriate DOM tree.

This issue does not occur with older server-side rendering solutions. When a web page is built with PHP, for instance, the server assembles everything, includes the data, and delivers a fully populated HTML page to the client. There is more to this story, but keep that in mind for now. This is why solutions that rely on headless browsers were developed to successfully retrieve and interact with modern websites.

Scraping a web page using Axios and Cheerio

Regular Expressions are pretty powerful and allow the creation of simple matching rules. However, they get complicated quickly and are not suited for complex tasks (unless you don’t mind ending up with one that validates email addresses). Using the standard https module is not very convenient either; its functionality relies heavily on event listeners and callbacks, which are unsuitable for large programs as they can get messy quickly. This section will introduce two powerful libraries: Axios for conducting HTTP requests and Cheerio for parsing HTML markup into a DOM-like structure. To use them, we need to install the appropriate NPM packages. In our project, run:


npm i -S axios cheerio

Let’s rewrite the previous example using those two libraries to understand their simplicity better.


import axios from 'axios' // or: const axios = require('axios')
import { load } from 'cheerio' // or: const { load } = require('cheerio')
 
try {
  const { data: html } = await axios.get('https://en.wikipedia.org/wiki/World_Wide_Web')
  const $ = load(html)
  const contents = $('[class="toctext"]').map((_, elem) => $(elem).text()).get()
  console.log(contents)
} catch (e) {
  console.error(e)
}

Look how compact the script has become! The first two lines contain our import statements. For the Cheerio module, we only have to import the load function.


import axios from 'axios'
import { load } from 'cheerio'

Enclosing the main functionality in a try/catch block ensures that any errors thrown by our statements will be handled.


try {
 // ...
} catch (e) {
  console.error(e)
}


Now we can implement the main body of the script. First, we have to retrieve the HTML page of the Wikipedia article. This can be done by calling the get method provided by Axios. Axios is a promise-based library, so we can leverage async/await. Beginning with Node v14.8, top-level await is supported in ES modules. The resolved value is an object whose data property holds the data retrieved by the HTTP request; in this example, data holds the string representation of the HTML page. We use object destructuring to rename it to something more descriptive.


const { data: html } = await axios.get('https://en.wikipedia.org/wiki/World_Wide_Web')

We can now create a Cheerio instance by passing the HTML markup into the load function. Cheerio parses the markup and provides an API for traversing and manipulating the resulting data structure, similar to jQuery.


const $ = load(html)

Remember from the previous section that each list item in the contents table is enclosed by a span tag with a class attribute of toctext. To retrieve the desired data, we use a selector that matches all elements with the toctext class and call the text method on each matched element to get its inner text. Calling get on the returned value converts everything matched by the Cheerio object into a plain JS array.


const contents = $('[class="toctext"]').map((_, elem) => $(elem).text()).get()

Lastly, we can print the results to the console.


console.log(contents)

Headless browser with Browserless and Selenium WebDriver

As we have already explained, it is not always possible to retrieve the entire content of a page by making HTTP requests, especially on websites that utilize front-end frameworks and libraries like Angular and React. But fear not: we have an ace up our sleeve, namely headless browsers driven through browser automation libraries.

Browser automation is the process of simulating user-specific tasks on a web browser. In recent years the importance of browser automation as a core tool for everything from automating internal processes, to web scraping, to E2E tests, has led to the birth of several different automation libraries. One of them is Selenium WebDriver, which leverages the ability of most modern browsers to provide an API to directly control them.

Apart from the automation library of our choice, we also need a local or remote browser instance. While using a local browser instance is the most convenient option, it is not the most efficient solution. For Selenium, we must take additional steps to configure our development environment. This can be troublesome if we want multiple browser instances to run in parallel or when our machines have limited computational resources. To tackle this issue, web automation platforms were developed to provide ready-to-use remote browser instances.

Browserless: A free tool for JavaScript & NodeJS web scraping

Browserless is an online headless automation platform that provides fast, scalable, reliable web browser automation ideal for data analysis and web scraping. It’s open-source with nearly 7K stars on GitHub. Some of the largest companies worldwide use it daily for QA testing and data collection tasks. It supports the WebDriver protocol, allowing anyone to connect easily to their remote browser instances.

The platform offers both free and paid plans if we need more processing power. The free tier offers up to 6 hours of usage, which is more than enough for evaluating the platform’s capabilities or for simple use cases.

After completing the registration process, the platform supplies us with an API key. We will use this key to access the Browserless services later on.

Initializing a remote connection to Browserless instance

Now that we have access to Browserless, let’s connect to a remote browser instance. The first step is to install the appropriate dependencies. Type the following command inside your Node project to install Selenium WebDriver.


npm i -S selenium-webdriver

Create a new Node script and append the following lines.


import webdriver from 'selenium-webdriver'
import chrome from 'selenium-webdriver/chrome.js'
 
// NOTE: Use the API key retrieved from the Browserless dashboard.
const BROWSERLESS_API_KEY = '***'
 
const chromeOptions = new chrome.Options().addArguments([
  '--headless',
  '--no-sandbox',
  '--window-size=1920,1080'
])
 
const driver = new webdriver.Builder()
  .usingServer('https://chrome.browserless.io/webdriver')
  .withCapabilities({
    'browserless:token': BROWSERLESS_API_KEY
  })
  .forBrowser(webdriver.Browser.CHROME)
  .setChromeOptions(chromeOptions)
  .build()

Selenium WebDriver provides a very handy API to configure a new connection instance. Let’s analyze it.
As always, we have to import our dependencies first. We must import both the main WebDriver module and its appropriate browser binding. Browserless operates on Chrome machines, so we import the chrome module.


import webdriver from 'selenium-webdriver'
import chrome from 'selenium-webdriver/chrome.js'

Next, we define a constant variable that contains the API key for establishing a remote connection to the Browserless platform. Do not forget to replace its value with your API key!


// NOTE: Use the API key retrieved from the Browserless dashboard.
const BROWSERLESS_API_KEY = '***'

After that, we have to set some browser-specific arguments. Most of the time, we should pass the --no-sandbox flag. This disables the sandbox environment, which is responsible for setting privileges on the Chrome browser for security purposes. Running a browser with the sandbox enabled is a good security practice, but for web scraping it restricts us from some actions. We also pass --headless, since we do not need a visible browser window. Setting a preferred window size is also important to prevent some pages from rendering a mobile version.


const chromeOptions = new chrome.Options().addArguments([
  '--headless',
  '--no-sandbox',
  '--window-size=1920,1080'
])

To conclude the instantiation process, a new connection object must be created. The Node bindings of WebDriver use the builder design pattern, which organizes object construction into a set of steps. To begin, we have to instantiate a Builder() instance.


new webdriver.Builder()

The next step is to set the connection URL.


.usingServer('https://chrome.browserless.io/webdriver')

We also need to supply the API key to connect successfully. To do this, we create a new capability object using the withCapabilities() method. Selenium Capabilities is a configuration object defining basic requirements for driving the browser instance.


.withCapabilities({
  'browserless:token': BROWSERLESS_API_KEY
})

Next, we have to define the target browser.


.forBrowser(webdriver.Browser.CHROME)

Lastly, we supply the options object we constructed earlier.


.setChromeOptions(chromeOptions)

Calling the build method finalizes our configuration and returns a new WebDriver client based on the builder’s current configuration.


  .build()


NodeJS web scraping of Wikipedia articles

Now that we have established a remote connection to the Browserless platform, it’s time to implement the main functionality of the script. We will follow the same example as in the previous sections: retrieving the Table of Contents from the Wikipedia article on the World Wide Web.


try {
  await driver.get('https://en.wikipedia.org/wiki/World_Wide_Web')
  const allListItemElements = await driver.findElements(webdriver.By.className('toctext'))
  const allListItems = await Promise.all(allListItemElements.map(elem => elem.getText()))
  console.log(allListItems)
} catch (e) {
  console.error(e)
} finally {
  await driver.quit()
}

As you can see, the process is relatively straightforward. Enclosing all the statements relating to the scraping process in a try/catch block will handle any unexpected errors. Calling driver.quit() in the finally block will quit the current session. This ensures that all the allocated resources are released even if the script encounters an error. After calling quit, the connection instance is invalidated and can no longer issue commands against the browser. For the try block, the execution steps can be summarized as follows:

First, we access the desired URL.


await driver.get('https://en.wikipedia.org/wiki/World_Wide_Web')

Next, we have to retrieve all the elements that contain the respective locator:


const allListItemElements = await driver.findElements(webdriver.By.className('toctext'))

Finally, we retrieve the inner text by calling the getText method on each matched element. Calling getText returns a promise, so we use Promise.all to wait until all the calls are resolved.


const allListItems = await Promise.all(allListItemElements.map(elem => elem.getText()))

We can then print the results on the screen.


console.log(allListItems)


Comparing the different NodeJS & JavaScript web scraping techniques

Web scraping is an everyday use case in web automation. In this article, we presented three approaches, from using native Node modules with no external dependencies to utilizing a complete automation library like Selenium WebDriver together with an online platform like Browserless. How do those techniques compare to each other, and which one should you choose for your next project? Here is a list of pros and cons for each one:

Method 1: Plain HTTP requests and Regular Expressions

Pros:

  • Relies on the native Node HTTP/S module, a robust and well-tested implementation.
  • The low-level nature of the HTTP/S module and regexes allows for better performance and lower memory consumption, which is ideal for devices with limited computational resources.
  • It serves you well for simple cases.
  • It can be parallelized easily using worker threads.

Cons:

  • Regexes do not scale well for complex rule matching.
  • It is not suited for large scraping procedures.
  • The event-based nature of the HTTP/S module that heavily relies on callbacks can result in callback chaos.
  • The implementation can end up being a rather verbose codebase.
  • This method is not suited for websites built with front-end frameworks or libraries.
  • This method is only suited for web-scraping tasks.

Method 2: Axios and Cheerio

Pros:

  • Axios is a promise-based library. The promise chain grows linearly from the top down, making the code base easier to understand. We can even use async/await to remove some of its drawbacks.
  • Axios is a robust, well-maintained, and actively developed HTTP client library with more than 33M monthly downloads and nearly 100K stars on GitHub.
  • Cheerio implements a subset of core jQuery, which allows us to make complex DOM queries.
  • Cheerio does not rely on a browser, which makes it considerably faster.
  • Like Axios, Cheerio is a robust, well-maintained, and actively developed library. It receives more than 7M monthly downloads and has over 25K stars on GitHub.
  • It can be parallelized easily using worker threads.

Cons:

  • This method is not suited for websites built with front-end frameworks or libraries.
  • This method is only suited for web scraping and cannot handle any other web automation task.

Method 3: Selenium WebDriver and Browserless

Pros:

  • Selenium WebDriver is a full-featured browser automation library. You can use it for every web automation task, from simple web scraping scripts to complete E2E test suites.
  • The WebDriver bindings for Node receive nearly 3M monthly downloads, and the corresponding GitHub page has over 24K stars.
  • Robust software with many years in commercial production environments.
  • Selenium provides an IDE that supports codeless test creation and execution.

Cons:

  • Extensive configuration is required to use WebDriver correctly, depending on your browser.
  • Selenium WebDriver is a browser automation platform: it drives the specified browser instance. For that reason, you must have the corresponding browsers installed on your machine.
  • It cannot easily be parallelized locally. This can be resolved by using a remote browser instance through a web automation platform like Browserless.

Summary

In this guide, we shared an in-depth analysis of the working mechanisms of the web. We introduced three different methods for scraping a website using Node.js, with a complete analysis of the pros and cons of each technique. There are still other aspects of scraping and web automation that we could not cover in this post. To learn more about web automation, check out our other articles and follow us on social media.
