How to manage sessions for web automation with Puppeteer + Browserless


In this article we'll look at managing sessions and their data when using Puppeteer. That includes storing cookies and keeping browsers open.

These techniques are essential if you're performing complex workflows behind login pages. The clean-slate model of automation libraries means it's up to you to connect cookies to instances, secure credentials, synchronize localStorage and much more.

So let's dive into how to manage sessions:

How to manage sessions in Puppeteer

Freshly baked cookies

Currently, one of the most common approaches is to log in once and then save the session data to disk. Take, for instance, this typical session-saving example:


const puppeteer = require("puppeteer");
const fs = require("fs/promises");

const init = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  /* Your login code */

  // Serialize the cookies and both storage areas so they can be restored later
  const cookies = JSON.stringify(await page.cookies());
  const sessionStorage = await page.evaluate(() => JSON.stringify(sessionStorage));
  const localStorage = await page.evaluate(() => JSON.stringify(localStorage));

  await fs.writeFile("./cookies.json", cookies);
  await fs.writeFile("./sessionStorage.json", sessionStorage);
  await fs.writeFile("./localStorage.json", localStorage);
  await browser.close();
};

You then also have to implement the matching loading step for cookies, sessionStorage and localStorage:


const start = async () => {
  await init();

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const cookiesString = await fs.readFile("./cookies.json");
  const cookies = JSON.parse(cookiesString);

  const sessionStorageString = await fs.readFile("./sessionStorage.json");
  const sessionStorage = JSON.parse(sessionStorageString);

  const localStorageString = await fs.readFile("./localStorage.json");
  const localStorage = JSON.parse(localStorageString);

  await page.setCookie(...cookies);

  /* Navigate to your site here first: storage writes only apply to the current origin */

  await page.evaluate((data) => {
    for (const [key, value] of Object.entries(data)) {
      window.sessionStorage.setItem(key, value);
    }
  }, sessionStorage);

  await page.evaluate((data) => {
    for (const [key, value] of Object.entries(data)) {
      window.localStorage.setItem(key, value);
    }
  }, localStorage);
};

On top of that, if you have multiple instances, you also need to implement logic to synchronize the data among them. Not to mention that there's no way to save HTTP-only cookies this way. It can get very messy very quickly.

Luckily, that's a pain point we recognized at Browserless, and we built a much cleaner way to tackle it.

Staying alive

With Browserless, you can keep the browser (and thus the session) alive and kicking in the background, while your instances connect remotely and share the session data (that is, cookies, localStorage and sessionStorage).

Keeping your session alive is as easy as adding keepalive=<MILLISECONDS> to the query string of your connection URL. Assuming you're self-hosting:


const puppeteer = require("puppeteer");

const init = async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: "ws://localhost:3000?keepalive=300000",
  });
  const page = await browser.newPage();

  /* Your login code */

  // You MUST use browser.disconnect() rather than browser.close(),
  // which kills the remote browser and the session along with it. Careful!
  browser.disconnect();
};

This will keep the session alive in the background for 300000 ms (5 minutes), even if you disconnect all your instances. The browser might be loading something, or just idling and waiting for new connections; it won't judge you. Let's look at an actual use case.


const puppeteer = require("puppeteer");

const init = async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: "ws://localhost:3000?keepalive=300000",
  });
  const page = await browser.newPage();

  // Click through the sign-in portal to reach the login form
  await page.goto("https://www.goodreads.com/user/sign_in");
  await page.click(".authPortalConnectButton");
  await page.waitForNetworkIdle();

  await page.type("#ap_email", "****");
  await page.type("#ap_password", "****");
  await page.click("#signInSubmit");
  await page.waitForNetworkIdle();
  browser.disconnect();
};

In this example, we log in to Goodreads and leave the session open for 5 minutes. But how can you re-use an existing session? Easy: make a GET request to /sessions and you'll see all your active sessions, each with a browserWSEndpoint field that we can use to reconnect. In our self-hosting example, we'll go to http://localhost:3000/sessions and get a JSON response like this one:


[
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=127.0.0.1:3000/devtools/page/CB172CDECBF091A32F62DEF13C12A298",
    "id": "CB172CDECBF091A32F62DEF13C12A298",
    "title": "Recent updates | Goodreads",
    "type": "page",
    "url": "https://www.goodreads.com/",
    "webSocketDebuggerUrl": "ws://127.0.0.1:3000/devtools/page/CB172CDECBF091A32F62DEF13C12A298",
    "port": "65320",
    "browserId": "6428cadf-47db-4bca-bb84-995e7a2350a5",
    "trackingId": null,
    "browserWSEndpoint": "ws://127.0.0.1:3000/devtools/browser/6428cadf-47db-4bca-bb84-995e7a2350a5"
  },
  ...
]
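
You can also do that lookup programmatically. Here's a minimal sketch, assuming Node 18+ (for the built-in fetch) and the self-hosted instance from above; getSessionEndpoint is just an illustrative helper name:


// Hypothetical helper: grab the browserWSEndpoint of an already-running session
const getSessionEndpoint = async () => {
  const res = await fetch("http://localhost:3000/sessions");
  const sessions = await res.json();

  // Each entry carries the browserWSEndpoint we need to reconnect
  const goodreads = sessions.find((s) => s.url.includes("goodreads.com"));
  return goodreads ? goodreads.browserWSEndpoint : null;
};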

Then we're going to use that browser endpoint with our instances.


const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint:
      "ws://127.0.0.1:3000/devtools/browser/6428cadf-47db-4bca-bb84-995e7a2350a5",
  });
  const page = await browser.newPage();

  await page.setViewport({ width: 1290, height: 720 });
  await page.goto("https://www.goodreads.com/"); // No login!
  await page.screenshot({ path: "goodreads.png", fullPage: true });
  browser.disconnect();
})();

et voilà, we're in, and we didn't have to log in again!


Easy, isn't it?

But there are some things to consider! These connections are basically "new tabs" in a single browser, not isolated instances starting from a clean slate. That brings all the corresponding advantages and disadvantages, such as websites' ad-tracking cookies carrying over and all the instances/tabs being able to "talk" to each other. We highly recommend not putting all your work on a single remote browser with a lot of session data, but breaking it down into several remote browsers, each with a small amount of session data.

Manage sessions by tracking things down

If you use a modular approach to manage different parts of your program, it can get really messy to tell browsers apart when you use the /sessions API to get the remote connection URL. That's why we let you set a tracking ID on all connections and API calls!


const puppeteer = require("puppeteer");

const init = async () => {
  const browser = await puppeteer.connect({
    // keepalive keeps the browser running; trackingId labels it for later lookup
    browserWSEndpoint: "ws://localhost:3000?keepalive=300000&trackingId=b4de40cee5",
  });
  const page = await browser.newPage();

  browser.disconnect();
};

That way you can easily tell your browsers apart by filtering on trackingId. It also helps that tracking IDs are isolated in their own workspace, so they won't mingle with other browsers' downloads and user data saved to disk.
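
For example, here's a rough sketch of reconnecting by tracking ID using the /sessions response shown earlier; findByTrackingId is just an illustrative helper name, and it assumes the same self-hosted instance and Node 18+:


// Hypothetical helper: find the remote browser tagged with a given trackingId
const findByTrackingId = async (trackingId) => {
  const res = await fetch("http://localhost:3000/sessions");
  const sessions = await res.json();
  const match = sessions.find((s) => s.trackingId === trackingId);
  return match ? match.browserWSEndpoint : null;
};

// Usage: reconnect to the browser we labelled earlier
// const endpoint = await findByTrackingId("b4de40cee5");
// const browser = await puppeteer.connect({ browserWSEndpoint: endpoint });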

How to get started with Browserless

There are different ways to use our product.

If you’re using one of our hosted services, just connect to our WebSocket securely with your token to start web scraping!
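
As a minimal sketch, assuming the chrome.browserless.io hosted endpoint from our docs and YOUR_API_TOKEN as a placeholder for your own token, the only change from the self-hosted examples is the connection string:


const puppeteer = require("puppeteer");

(async () => {
  // Swap localhost for the hosted endpoint and authenticate with your token
  const browser = await puppeteer.connect({
    browserWSEndpoint: "wss://chrome.browserless.io?token=YOUR_API_TOKEN",
  });
  const page = await browser.newPage();

  await page.goto("https://www.example.com");
  browser.disconnect();
})();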

