Improving Puppeteer Performance

May 3, 2019

One of the things that stands out when using a headless browser (versus cURL or other simpler tools) is that it can be painfully slow. Some of the cost is unavoidable: you have to start the browser, wait for it to initialize, and then proceed from there. This is even harder on functions-as-a-service platforms, since there's a warming phase and you cannot cache any results!

Today, I'd like to go over how browserless can help you mitigate some of these slower operations, and in certain cases even cut your loading time in half.

Launch with a user-data-dir

One of the best things about Chrome is that it lets you specify a --user-data-dir when running from the command line. browserless also exposes this flag as a query-string parameter, meaning you can start a session and cache the cookies, local storage and more for the next run:


const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.browserless.io?--user-data-dir=/tmp/session-123',
});

We've found that when using this option you can significantly speed up your existing sessions by having a full cache of all asset requests. Likewise, any login actions or other startup tasks can also be cached for a faster run on the next session!

Please note that, when using this parameter, cookies are saved for the next session. This isn't ideal for every workflow, so be sure to change or update where you store your user-data-dirs so that one script's saved state doesn't leak into another!
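One way to keep sessions from sharing cookies is to derive the directory from a session identifier. The snippet below is a minimal sketch: `sessionEndpoint` and `connectWithSession` are hypothetical helpers (not part of browserless or puppeteer), and the `/tmp/session-<id>` layout simply mirrors the example above.

```javascript
// Hypothetical helper: builds a browserless endpoint with a per-session
// --user-data-dir so that cached cookies are not shared between workflows.
function sessionEndpoint(baseUrl, sessionId) {
  // Restrict the id to safe characters, since it ends up in a filesystem path.
  if (!/^[\w-]+$/.test(sessionId)) {
    throw new Error(`Invalid session id: ${sessionId}`);
  }
  return `${baseUrl}?--user-data-dir=/tmp/session-${sessionId}`;
}

// Hypothetical usage: connect with a directory dedicated to this session.
async function connectWithSession(sessionId) {
  const puppeteer = require('puppeteer');
  return puppeteer.connect({
    browserWSEndpoint: sessionEndpoint('wss://chrome.browserless.io', sessionId),
  });
}
```

Reusing the same id on a later run picks up the cached assets and cookies; switching to a new id starts from a clean profile.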

Keep Chrome running

When under heavy load, we've noticed that Chrome can take several seconds to start in a production environment. That startup time can be a non-starter for many scenarios. To mitigate this cost, you can start Chrome in the background and keep it running, but how do you facilitate connections to it?

In browserless this is painfully easy with our docker image. You simply start the container with the PREBOOT_CHROME environment variable set to true:


docker run -d -p 3000:3000 -e DEBUG=browserless* -e PREBOOT_CHROME=true -e MAX_CONCURRENT_SESSIONS=10 --name browserless browserless/chrome:latest

This will launch instances of Chrome to match your MAX_CONCURRENT_SESSIONS parameter (10 here). Any request that comes in will automatically use one of these pre-booted browsers, saving you seconds of startup costs. If a session starts with flags that aren't present in one of the Chrome instances, then we'll generate a fresh instance for you with those flags. This allows incoming requests to be customized while still letting plain requests use a pre-started instance.

After your session is done (the REST call closes or puppeteer disconnects), we'll automatically kill that browser and spawn a new one in its place. This behavior is also configurable with KEEP_ALIVE=true, where we instead keep the browser up and running (but close the pages inside). If you do use this parameter, be sure to call browser.disconnect as opposed to browser.close as the latter will terminate the browser!
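The disconnect-versus-close distinction is easy to get wrong in a shared `finally` block, so here is a minimal sketch of a teardown helper. `finishSession` and its `keepAlive` argument are hypothetical: the client has no way to detect the server's KEEP_ALIVE setting, so the caller is assumed to know how the container was started.

```javascript
// Hypothetical teardown helper. `keepAlive` should mirror how the
// browserless container was started (KEEP_ALIVE=true or not).
async function finishSession(browser, keepAlive) {
  if (keepAlive) {
    // Detach from the browser but leave the Chrome process running for reuse.
    await browser.disconnect();
  } else {
    // Terminate the browser; browserless spawns a replacement.
    await browser.close();
  }
}
```

Calling `finishSession(browser, true)` from a `finally` block releases the connection without killing the kept-alive browser.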

Blocking ad-network calls

Of all the strategies for increasing performance, one of the best we've seen is blocking external ad-network calls. In some cases we've seen performance double when this happens. In puppeteer this is fairly easy to implement with network-request interception. However, you'll have to source a list of domains and use that as the basis for rejecting ad traffic. Whenever a request is initiated, you check its domain against that list and reject the request if it matches.
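For reference, doing this by hand might look roughly like the sketch below. It uses puppeteer's `page.setRequestInterception` and the request's `abort`/`continue` methods; the `BLOCKED_DOMAINS` list is a tiny illustrative sample, not the list browserless uses, and `isBlocked` and `blockAds` are hypothetical helper names.

```javascript
// Illustrative sample only; a real list would be far longer.
const BLOCKED_DOMAINS = ['doubleclick.net', 'googlesyndication.com'];

// True when the request's hostname is a blocked domain or one of its subdomains.
function isBlocked(requestUrl, blockedDomains = BLOCKED_DOMAINS) {
  let hostname;
  try {
    hostname = new URL(requestUrl).hostname;
  } catch (err) {
    return false; // Not a parseable URL, so let it through.
  }
  return blockedDomains.some(
    (domain) => hostname === domain || hostname.endsWith(`.${domain}`)
  );
}

// Enable interception on a puppeteer page and abort matching requests.
async function blockAds(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (isBlocked(request.url())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}
```

Call `await blockAds(page)` before `page.goto(...)` so the handler is in place for the very first request.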

Starting in version 1.7.0 of our docker image, as well as on our cloud instances (both dedicated and usage-based), you can easily toggle this behavior by adding a query-string parameter to your URLs:


https://chrome.browserless.io?token=YOUR-API-TOKEN&blockAds

When present, we instrument all page objects to automatically block requests that match our internal list of domains! You won't have to source or do any request interception yourself, and can continue going on with your scripts!

Putting it all together

As a final demonstration of the above, let's apply all we've learned to see if we can't improve the performance of CNN's website. Here's the sample code we'll use:


const puppeteer = require('puppeteer');

async function run() {
  let browser = null;
  try {
    browser = await puppeteer.connect({
      browserWSEndpoint: `ws://localhost:3000`,
    });

    const page = await browser.newPage();
    const start = Date.now();
    await page.goto('https://cnn.com/', { waitUntil: 'networkidle2' });
    console.log('Took', Date.now() - start, 'ms');
  } catch (e) {
    console.error(e);
  } finally {
    if (browser) await browser.close();
  }
}
run();

In our docker image, running locally, we've left every variable at its default, so there are no performance enhancements whatsoever. Our session times for 5 runs are as follows:


Took 5831 ms
Took 5390 ms
Took 7470 ms
Took 5449 ms
Took 5455 ms

Now, let's apply all of our performance enhancements listed above and see what our timing looks like!


Took 2221 ms
Took 2153 ms
Took 2098 ms
Took 2077 ms
Took 2109 ms
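The exact configuration behind these numbers isn't shown here, so treat the following as one plausible setup rather than the actual one. It assumes the container was started with PREBOOT_CHROME=true as in the docker command earlier, and that the `blockAds` and `--user-data-dir` query-string parameters can be combined on a single endpoint. `buildEndpoint` and the `/tmp/cnn-cache` path are hypothetical.

```javascript
// Hypothetical helper: assembles a browserless endpoint from optional settings.
function buildEndpoint(baseUrl, { userDataDir, blockAds } = {}) {
  const params = [];
  if (blockAds) params.push('blockAds');
  if (userDataDir) params.push(`--user-data-dir=${userDataDir}`);
  return params.length > 0 ? `${baseUrl}?${params.join('&')}` : baseUrl;
}

const endpoint = buildEndpoint('ws://localhost:3000', {
  userDataDir: '/tmp/cnn-cache', // assumed path, reused across runs
  blockAds: true,
});

// Pass `endpoint` as browserWSEndpoint in the sample script above.
console.log(endpoint);
```

Keep in mind that, as described earlier, a session requesting flags that no pre-booted instance has gets a fresh Chrome, so adding `--user-data-dir` may trade away the pre-boot benefit for that request.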

As you can see, we were able to cut request time by at least half, and measured against the slowest baseline run by roughly 72%! Not only does this improve overall performance, but if you're using a proxy service it can also reduce the amount of traffic passing through it. Often this means a lower bill, since your usage has dropped!
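If you'd like to check the arithmetic on the timings above, a short script along these lines works. The two arrays are copied from the runs listed earlier; `mean` and `percentReduction` are just illustrative helpers.

```javascript
// Timings (ms) copied from the two sets of runs above.
const baseline = [5831, 5390, 7470, 5449, 5455];
const optimized = [2221, 2153, 2098, 2077, 2109];

const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Percentage by which `after` is smaller than `before`.
const percentReduction = (before, after) => (1 - after / before) * 100;

// Average run versus average run.
const meanReduction = percentReduction(mean(baseline), mean(optimized));
// Most conservative pairing: fastest baseline run versus slowest optimized run.
const smallestReduction = percentReduction(Math.min(...baseline), Math.max(...optimized));
// Most favorable pairing: slowest baseline run versus fastest optimized run.
const largestReduction = percentReduction(Math.max(...baseline), Math.min(...optimized));

console.log(`mean reduction: ${meanReduction.toFixed(1)}%`);
console.log(`smallest reduction: ${smallestReduction.toFixed(1)}%`);
console.log(`largest reduction: ${largestReduction.toFixed(1)}%`);
```

Comparing the most conservative and most favorable pairings gives a range rather than a single headline number, which is a fairer summary for only five runs per configuration.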

Wrapping it up

We're extremely excited to share these performance improvements with you, and all of the above is available in both our docker images and our dedicated accounts. Usage-based accounts have all this functionality as well, save for the --user-data-dir feature, which we'll look to add in the coming weeks.

We hope this helps you improve the performance of your headless browser sessions!