Headless Handbook Chapter 1: Introduction to browser automation

Topics in this chapter

About the handbook
Introduction
What is browser automation?
What does headless mean?
Why are there so many libraries?
Why all the new libraries if there's already WebDriver?
What's the difference between puppeteer and playwright?
Closing Summary

About the headless handbook

If you've ever wondered what all the fuss is about with libraries like Puppeteer, Playwright, and Selenium, then this book is for you. If you're a longtime veteran of browser automation but are curious about the "hows" and "whys," then this book is for you. If you've got some web automation experience under your belt but are looking to sharpen your skills, then this book is for you as well. Essentially, this book is for everyone, though it's better enjoyed if you've done some programming, especially in JavaScript or NodeJS. Enough about that; let me give you some of my background.

After running browserless.io for over 3 years (at the time of this writing), it became clear to me that there was a lack of best practices and fundamentals in web automation. Because of this lack of information, my goal is to distill all my learnings into an approachable format: this book. It's also my hope to share everything I know with you, so that you can succeed at automating the much-complicated web browser.

I've decided to publish this as a series of blog posts. With that out of the way, let's move on.

Introduction

You've probably heard of them at some point: libraries like puppeteer and playwright. Or maybe you've been writing code to automate browsers for some time, so terms like Selenium and WebDriver are also familiar. Regardless of how long or little you've been an engineer, there's a really good chance that at some point you'll have to interact with these libraries. But when do you use them, and why? Are some better than others? If older technologies exist, why the sudden "re-inventing" of the wheel by companies like Microsoft and Google? Does Jim Henson have something to do with it?!

These are all important questions to ask, and it's better to have some familiarity with these tools rather than trying to figure them out in the middle of a project. So, in this first chapter of our book, we're going to talk about what these technologies do, why you'd use them, and (more importantly) when to use them.

What is browser automation?

As the term says, browser automation is pretty straightforward: it's automating some task you'd normally do manually with a web-browser. The first thing that might come to mind is getting data (pricing, availability, status, or anything else) from a web-page or application. This process is more generally known as "scraping," as in "scraping" data off of a web-page. Now, some of you might be saying: "Wait! APIs are the way to do that, and that's what you should be using!" This is a great observation! More often than not, the data that you need is available via some kind of API (for those unfamiliar, API is an acronym for Application Programming Interface, and here we mostly mean a REST, or Representational State Transfer, based API). However, what happens if the data you need isn't readily available via some kind of API, REST or otherwise? What happens if the data you need can be thought of as a competitive advantage (like pricing)? What if the data you need isn't supposed to be gotten by automated means?

This is where we have to come out and say "it depends." The legalities of scraping are quite convoluted, and not every web-scraping lawsuit has set a clear precedent on what to expect. Here's the part where we have to say:

Ok, with that out of the way, there's a pretty famous case of hiQ versus LinkedIn that you might want to check out. In short: if you're trying to do scraping, and there's a chance you might be infringing on terms and conditions, then you'd better be ready to "lawyer up" and prepare for the worst. Otherwise, scraping is mostly considered legal with some exceptions.

So what if you don't need to scrape data? Is there any use in a headless browser? Of course! Even though scraping data is a rather popular use for browser automation, there's a multitude of other interesting things you can do with it. A short list of popular uses:

PDF generation: if you have a nice admin dashboard or visualizations that you want to print and share, automatically.
Screenshots: if you want to have a nice picture of your site when someone posts a link in Slack (unfurling), a headless browser can do this.
Performance analysis: if you've ever done any kind of profiling in a browser, you can automate almost all of it.
Testing: doing some kind of regression-testing, or others, is a great use-case for a headless browser.
Much much more: anything from customized coffee cups to cookies. If a browser can do it, you can automate it.
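
As a rough sketch of the first two items, assuming Puppeteer is the library in use: the helper below takes an already-launched browser and saves both a screenshot and a PDF of a page. The name `capturePage` and the output paths are made up for illustration; `newPage`, `goto`, `screenshot`, `pdf`, and `close` are the Puppeteer methods being relied on.

```javascript
// Hypothetical helper: saves a screenshot and a PDF of `url`.
// `browser` is assumed to be a Puppeteer Browser, e.g. from puppeteer.launch().
async function capturePage(browser, url, basePath) {
  const page = await browser.newPage();
  try {
    // Wait until the network is quiet so charts and images have loaded.
    await page.goto(url, { waitUntil: "networkidle0" });
    await page.screenshot({ path: `${basePath}.png`, fullPage: true });
    await page.pdf({ path: `${basePath}.pdf`, format: "A4" });
    return [`${basePath}.png`, `${basePath}.pdf`];
  } finally {
    // Always release the tab, even if navigation or rendering failed.
    await page.close();
  }
}

// Usage, assuming `npm install puppeteer` (not run here):
//   const puppeteer = require("puppeteer");
//   const browser = await puppeteer.launch();
//   await capturePage(browser, "https://example.com", "./dashboard");
//   await browser.close();
```

Passing the browser in, rather than launching it inside the helper, keeps the expensive browser startup out of every call and makes the helper easy to exercise with a stand-in object.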

What does headless mean?

No, this isn't some fairy-tale term that found its way into computer science. Headless is just an easy way of saying that there's no visual content being rendered, unlike what you'd see on a consumer computer of some kind. Why is that so important? For one, it uses significantly fewer resources than a normal browser that provides a visual image of a webpage. If you've ever run into the frequent complaints about how Chrome, or other browsers, eat up all your RAM, then you can begin to appreciate why having more resources is important.

The other reason why headless is important is the fact that almost all computers out there that "run" the web aren't attached to a display. This means that any program that tries to run and display data will likely be greeted with some kind of bizarre error like Error: no display specified. Of course there's no display specified: there isn't any! Now, in full disclosure, there have been ways around that in the past through packages like Xvfb (X virtual framebuffer: a way to trick the computer into thinking that there's a display); however, those workarounds do cost something at some point. The big takeaway here is that having an established way to run a web-browser in an automated fashion is a big deal.

The final realization here is that, by offering a browser headlessly, browser vendors are "blessing" automated use of their technologies. What this means, more specifically, is that we as engineers should no longer have to worry about finding the next hot-patch, library version, or some other operating-system package anytime an update to these browsers occurs. This isn't to say that there won't be problems (there always are), but for the most part the biggest pains of the past are purely that: of the past. With this quasi-contract established, you can expect that newer libraries out there should be a lot more stable now that the vendors themselves are on board. Which nicely leads us into the next section.

Why are there so many libraries?

Selenium, playwright, puppeteer, phantomJS, and more. What? Why?

Now that we've gotten a handle on what a headless browser is, and why we'd want to use one, the next question is what tool we want to use to control it. For better or worse, most browsers out there don't have a dedicated programming interface to control them, and if they do, they're generally cumbersome to set up and operate. For instance, if we take Chrome as an example, it has an embedded protocol called the DevTools Protocol, and to do something like a page navigation, you'd effectively need the following:

const CDP = require("chrome-remote-interface");

(async () => {
  let client;
  try {
    // Connect to a Chrome browser
    client = await CDP();

    // Capture the relevant "Domains" from the protocol
    const { Network, Page } = client;

    // Setup any handlers
    Network.requestWillBeSent((params) => {
      console.log(params.request.url);
    });

    // Enable our events...
    await Network.enable();
    await Page.enable();

    // Now navigate!
    await Page.navigate({ url: "https://github.com" });
    await Page.loadEventFired();
  } catch (err) {
    console.error(err);
  } finally {
    if (client) {
      await client.close();
    }
  }
})();

While not a terrible amount of code (~30 lines), the problem is that there's a bit of setup work to figure out for what can be thought of as a "single" action (going to a page). Our problems quickly grow, as this workflow won't work for every browser, meaning that we'd need a library for each browser we wish to automate. You can imagine why this is a problem!

The truth is that this problem, while new to us, is actually fairly old. It's because of this that the Selenium project was started. Selenium, over the years, has grown into a whole suite of application tools and technologies to ease the pain that comes with automating a web-browser. If we were to peel away all of the developer SDKs, drivers, bindings, libraries and other layers, what we have is really 3 simple parts that drive Selenium: a web-browser to automate, a separate binary application that both understands how to automate the browser and also exposes an HTTP-backed interface (WebDriver), and finally a language-specific library or SDK to "talk" with this browser-specific WebDriver binary. To think of it another way, let's walk through how a call from your Selenium script flows through to Chrome:

Your script calls a method: driver.get('https://example.com').
The library, which your script uses, takes this call and transforms it into an HTTP request:

POST http://some-service.com:4444/wd/session/123/url { "url": "https://example.com" }

An HTTP server at some-service.com is listening on port 4444 and passes this request to a WebDriver binary already running.
The WebDriver binary, having accepted this request, begins to process it. Note that since this binary was already running, it has likely set up the browser for us.
Once the WebDriver binary has processed the request, it interprets the HTTP request into a browser-specific command.
This command is passed into the running browser as well, and calls a protocol-specific command to issue the navigation.
The browser runs your request, and responds back or emits some kind of event that the WebDriver binary listens to.
WebDriver parses the results from the browser, resolves the original HTTP request, and your script proceeds to the next command.
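
The first two steps above can be sketched as a small function that turns a navigation call into a description of the HTTP request. This is only an illustration: `buildNavigateRequest` is a hypothetical name, the host, port, path prefix, and session id simply mirror the example request above, and a real client library would also send the request and handle errors.

```javascript
// Hypothetical sketch of what a WebDriver client library does internally
// when a script asks to navigate: describe the call as an HTTP request.
function buildNavigateRequest(serverUrl, sessionId, targetUrl) {
  return {
    method: "POST",
    // Every command is addressed to a session by the id in the path.
    url: `${serverUrl}/session/${encodeURIComponent(sessionId)}/url`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: targetUrl }),
  };
}

const request = buildNavigateRequest(
  "http://some-service.com:4444/wd",
  "123",
  "https://example.com"
);
console.log(`${request.method} ${request.url} ${request.body}`);
```

Because the session id travels in the URL of every request, whatever sits in front of the WebDriver binaries has to read it to know where each command belongs.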

This is a pretty high-level view of how WebDriver and Selenium work, and you can read more in the full W3C WebDriver specification. In short, the process of doing all of this interaction is fairly heavy, which is why entire businesses are built around this one set of tools. Whilst the technology is now fairly mature, and certain languages like Java enjoy great tooling, the fact is that there are a lot of parts that can be removed without losing the original motivation.

Why all the new libraries if there's already WebDriver?

This is where new libraries like playwright and puppeteer come in. The timing around these libraries is pretty crucial to understand, as it plays greatly into their recent successes, and is something to consider when weighing when to use them. Puppeteer, for instance, came out roughly 6 months after Chrome announced that it could run headlessly. Since Google is behind both Chrome and puppeteer, this means that for the first time we have a browser vendor with a dedicated set of tools for interacting with their browser in an automated fashion. The other key to understand here is that, instead of an older language like Java, puppeteer is a library made for NodeJS. By doing this they've captured an entirely new segment of developers that are likely younger and more open to a new style of development. Effectively, Google combined the "blessing" behind running Chrome automatically, a brand-spanking new high-level library, and a choice of runtime to capture one of the largest developer segments in existence.

But it's not just the fact that there's a new library and language preference involved. No: the fact is that these new libraries ditch the whole HTTP "chattiness" altogether, and along with it the sibling binary that interprets requests and forwards them into the browser. This accomplishes two key things:

It's much easier to load-balance a stateful socket connection than frequent HTTP requests.
There's less packaging involved than deploying a Selenium-based solution.

Let's break that down a bit, as I'm sure there's some head scratching going on. Hypothetically, let's say that you need to run thousands of browsers to scrape and index the web. Some of us will likely know this means that we'll need to run hundreds, possibly thousands, of machines to handle this type of load as (again) we've determined that web browsers take a lot of resources. In order to accomplish this, we'll need some way to balance this load and determine how to distribute these thousands of scraping jobs easily. In the WebDriver world every request can be a new session starting, or a command that needs to go to a specific browser being run somewhere. Hopefully you can start to see where I'm going with this! You'll need some kind of "sticky" load-balancing technology in order to ensure that currently-running sessions are routed to the machine where the session originally started. This can be especially tough for users of technologies like nginx, and others, as they don't necessarily have a prescribed way of doing this. Fortunately for puppeteer and playwright, this whole problem simply disappears as they utilize a protocol called WebSockets. In WebSockets, once a connection is initiated all further network traffic gets forwarded to that same exact machine. This means that there's nothing you or I have to do as developers, and that session management will simply "just work."
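
To make the "sticky" requirement concrete, here is a minimal sketch of the bookkeeping something in front of a WebDriver fleet would need. The class name `StickyRouter` and the machine names are invented for illustration, and real load balancers implement this differently (cookies, hashing, and so on).

```javascript
// Hypothetical sketch: route WebDriver-style HTTP requests so that every
// command for a session reaches the machine that started that session.
class StickyRouter {
  constructor(machines) {
    this.machines = machines;
    this.next = 0; // round-robin cursor for brand-new sessions
    this.sessions = new Map(); // sessionId -> machine
  }

  // A new session can go anywhere, so spread them evenly.
  startSession(sessionId) {
    const machine = this.machines[this.next % this.machines.length];
    this.next += 1;
    this.sessions.set(sessionId, machine);
    return machine;
  }

  // Later commands must follow the session, not the round-robin cursor.
  route(sessionId) {
    const machine = this.sessions.get(sessionId);
    if (!machine) throw new Error(`Unknown session: ${sessionId}`);
    return machine;
  }

  endSession(sessionId) {
    this.sessions.delete(sessionId);
  }
}

const router = new StickyRouter(["machine-a", "machine-b"]);
router.startSession("s1"); // lands on machine-a
router.startSession("s2"); // lands on machine-b
console.log(router.route("s1")); // machine-a
```

With a WebSocket connection this table is unnecessary, because the open connection itself pins the session to one machine for as long as it lasts.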

The other advantage that newer libraries hold is that they no longer need this translation layer between the web browser and your language's library. Selenium, at its core, requires that some flavor of WebDriver be running to handle translating between your script and browser-specific protocols. Again, playwright and puppeteer get around this issue by dictating that every browser speaks the same basic automation language. For the case of playwright, which works on the big three engines (WebKit, Gecko and Chromium), this means the Chrome DevTools Protocol for Chromium, and closely related protocols for its builds of Firefox and WebKit.

What's the difference between puppeteer and playwright?

I'd be remiss not to go over some of the differences between these two libraries, as they're both very similar in how they function. In order to do that properly, we need some more historical context on how they came to be.

In late 2017, Google published the first version of puppeteer, a NodeJS library to automate Chrome. As we've discussed prior, Chrome comes with an embedded protocol that any language can use to take control of Chrome. If you're familiar with any sort of development tools (DevTools in Chrome parlance), these also use Chrome's DevTools Protocol. In most cases, having a program or application simply just use the DevTools protocol can be fine, however it's a fairly complex protocol and one that would benefit from a higher-level API. Said another way, there's a lot of managing, orchestrating, and event-handling in the DevTools protocol that's best left to some kind of library. This is where puppeteer comes into play: it's a higher-level API that helps manage and hide all the pain and complexity that comes with running Chrome.
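
As a small illustration of that bookkeeping: commands in the DevTools protocol are JSON messages carrying an `id`, a `method`, and `params`; the browser answers with a message carrying the same `id`, while events arrive with no `id` at all. A library has to match these up. The sketch below does so over an abstract transport; `createSession`, `sendRaw`, and `handleMessage` are hypothetical names, not Puppeteer's internals.

```javascript
// Hypothetical sketch of the request/response matching a protocol client needs.
// `sendRaw` is any function that writes a string to the browser's socket.
function createSession(sendRaw) {
  let nextId = 1;
  const pending = new Map(); // command id -> { resolve, reject }
  const listeners = new Map(); // event name -> array of handlers

  return {
    // Send a command and get a promise for the reply with the same id.
    send(method, params = {}) {
      const id = nextId++;
      return new Promise((resolve, reject) => {
        pending.set(id, { resolve, reject });
        sendRaw(JSON.stringify({ id, method, params }));
      });
    },
    // Register a handler for a browser event.
    on(event, handler) {
      if (!listeners.has(event)) listeners.set(event, []);
      listeners.get(event).push(handler);
    },
    // Feed every incoming socket message through here.
    handleMessage(raw) {
      const message = JSON.parse(raw);
      if (message.id === undefined) {
        // No id: an event such as "Page.loadEventFired".
        (listeners.get(message.method) || []).forEach((h) => h(message.params));
        return;
      }
      const waiter = pending.get(message.id);
      if (!waiter) return; // reply to a command we are not tracking
      pending.delete(message.id);
      if (message.error) waiter.reject(new Error(message.error.message));
      else waiter.resolve(message.result);
    },
  };
}
```

With something like this in place, a call such as `await session.send("Page.navigate", { url })` reads like an ordinary function call, which is the kind of convenience a higher-level library builds on.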

When it was released, it quickly got attention and became one of the highest starred projects on GitHub. Its popularity wasn't misplaced, as it effectively did solve a lot of the painful parts of working with the Chrome DevTools Protocol (from here on referred to as CDP). Even with its power and fame, developers wanted the same kind of experience with the other browsers as well, which eventually became the springboard for playwright.

Playwright, being a Microsoft project, also poached a few of the original puppeteer authors into helping them automate the big three browsers: Chromium, Firefox and WebKit. As well as working with "the big three," playwright also has support for more edge-case types of workflows. Think of things like downloads and video capture, which start to bleed over into operating system territory. Puppeteer, while it does have some support for downloads and videos, doesn't yet come close to what playwright can do.

The obvious question that comes with all of this is when do you use which library? You'll have to wait for the next chapter, when we break down use-cases for all of them!

Closing Summary

After all of that, I hope you feel more comfortable about the ecosystem and state of web browser automation. What once started as a very prescriptive workflow (WebDriver) has evolved into a more curated and browser-centric method. Alongside the evolving libraries out there, the method of communication has evolved as well, which in itself is a fairly large step. What once was purely HTTP based has now moved onto a more stateful socket-based connection: WebSockets, more specifically.

With the libraries in mind we also touched on when you'd want to bring something like puppeteer into your applications. If you need to reuse things already in your app, like an administrative dashboard, then headless automation can use those assets to produce clean PDFs and screenshots for other purposes. Finally, while most sites out there have REST APIs associated with their data, you can utilize web automation to help collect more data across the web. Similar to that, using libraries like playwright can also ensure your sites and applications work seamlessly with automated tests and performance capturing.

In the next chapter we'll be taking a deeper look into these libraries, and set up a rough guide on which ones to pick.