Session Management Tactics, or How to Manage Sessions for Web Automation with Puppeteer + Browserless
Introduction
In the last couple of years, we’ve seen the rise of many brilliant technologies like Puppeteer and Playwright, which make web scraping an actually fun task that can be done with ease. Building on the shoulders of giants, they make it oh so simple to do what alternatives like Selenium couldn’t even do without third-party libraries (HTTP header injection, style and script injection, or network inspection come to mind), and it’s no wonder they took off so quickly. Naturally, we start requiring more complex workflows behind login pages, and thus the need to manage sessions becomes a top priority.
As powerful and flexible as these new headless browsers are (with many wonderful plugins that allow for stealth, ad-blocking, and even captcha bypassing), we get the feeling that something’s missing: sessions and their data. It is still a hassle to manage sessions in a headless browser context, as they base their workflow on a clean-slate model. Juggling between sessions is no easy task: you have to watch your cookies and make sure they’re the same in all your instances, secure your credentials, and synchronize the localStorage as well…
So let’s dive into how to manage sessions:
How to manage sessions in Puppeteer
Freshly baked cookies
Currently, one of the most common approaches is to log in first and then save the session data to disk. Take, for instance, this typical session-data-saving example:
const fs = require("fs/promises");
const puppeteer = require("puppeteer");

const init = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  /* Your login code */
  const cookies = JSON.stringify(await page.cookies());
  const sessionStorage = await page.evaluate(() => JSON.stringify(sessionStorage));
  const localStorage = await page.evaluate(() => JSON.stringify(localStorage));
  await fs.writeFile("./cookies.json", cookies);
  await fs.writeFile("./sessionStorage.json", sessionStorage);
  await fs.writeFile("./localStorage.json", localStorage);
  await browser.close();
};
for which you also have to implement cookie, sessionStorage, and localStorage loading:
const fs = require("fs/promises");
const puppeteer = require("puppeteer");

const start = async () => {
  await init();
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cookiesString = await fs.readFile("./cookies.json");
  const cookies = JSON.parse(cookiesString);
  const sessionStorageString = await fs.readFile("./sessionStorage.json");
  const sessionStorage = JSON.parse(sessionStorageString);
  const localStorageString = await fs.readFile("./localStorage.json");
  const localStorage = JSON.parse(localStorageString);
  // Storage is origin-scoped, so visit your target site before restoring it.
  await page.goto("https://yoursite.example");
  await page.setCookie(...cookies);
  await page.evaluate((data) => {
    for (const [key, value] of Object.entries(data)) {
      sessionStorage[key] = value;
    }
  }, sessionStorage);
  await page.evaluate((data) => {
    for (const [key, value] of Object.entries(data)) {
      localStorage[key] = value;
    }
  }, localStorage);
};
And on top of that, if you have multiple instances, you also need to implement logic for synchronizing the data among them. Not to mention that page.cookies() only covers the current page’s URL, and HTTP-only cookies are invisible to page scripts. It can get very messy very quickly.
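That said, if you do need every cookie in the browser, HTTP-only and third-party ones included, you can drop down to the DevTools Protocol, which Puppeteer exposes through CDP sessions. A minimal sketch (the output file name is just an example):

const client = await page.target().createCDPSession();
// Network.getAllCookies returns every cookie in the browser,
// including HTTP-only and third-party ones.
const { cookies } = await client.send("Network.getAllCookies");
await fs.writeFile("./allCookies.json", JSON.stringify(cookies));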
Luckily, that’s one of the things we realized at Browserless, and we created a painkiller that tackles this issue cleanly and with ease.
Staying alive
With Browserless, you can keep the browser (and thus the session) alive and kicking in the background, while your instances connect remotely and share the session data (cookies, localStorage, and sessionStorage).
Keeping your connection alive is as easy as adding keepalive=<MILLISECONDS> to the query string of your connection. Assuming you’re self-hosting:
const puppeteer = require("puppeteer");

const init = async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: "ws://localhost:3000?keepalive=300000",
  });
  const page = await browser.newPage();
  /* Your login code */
  // You MUST use browser.disconnect() rather than browser.close(),
  // which kills everything. Careful!
  browser.disconnect();
};
This will keep the session alive for 300000 ms (or 5 minutes) even if you disconnect all your instances; it keeps running in the background, maybe loading something, maybe just idling while waiting for new connections. It won’t judge you. Let’s look at an actual use case.
const puppeteer = require("puppeteer");

const init = async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: "ws://localhost:3000?keepalive=300000",
  });
  const page = await browser.newPage();
  await page.goto("https://www.goodreads.com/user/sign_in");
  await page.click(".authPortalConnectButton");
  await page.waitForNetworkIdle();
  await page.type("#ap_email", "****");
  await page.type("#ap_password", "****");
  await page.click("#signInSubmit");
  await page.waitForNetworkIdle();
  browser.disconnect();
};
In this example, we’re logging in to Goodreads and leaving the connection open for 5 minutes. But how can you re-use an existing session? Easy: just make a GET request to /sessions and you’ll see all your active sessions, each with a browserWSEndpoint field that we can use to re-connect. In our self-hosting example, we’ll go to http://localhost:3000/sessions and get a JSON response like this one:
[
  {
    "description": "",
    "devtoolsFrontendUrl": "/devtools/inspector.html?ws=127.0.0.1:3000/devtools/page/CB172CDECBF091A32F62DEF13C12A298",
    "id": "CB172CDECBF091A32F62DEF13C12A298",
    "title": "Recent updates | Goodreads",
    "type": "page",
    "url": "https://www.goodreads.com/",
    "webSocketDebuggerUrl": "ws://127.0.0.1:3000/devtools/page/CB172CDECBF091A32F62DEF13C12A298",
    "port": "65320",
    "browserId": "6428cadf-47db-4bca-bb84-995e7a2350a5",
    "trackingId": null,
    "browserWSEndpoint": "ws://127.0.0.1:3000/devtools/browser/6428cadf-47db-4bca-bb84-995e7a2350a5"
  },
  ...
]
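You don’t have to copy that endpoint by hand, either; the same /sessions route can be queried programmatically. A minimal sketch, assuming Node 18+ for the built-in fetch and that the first active session is the one you want:

const getEndpoint = async () => {
  const response = await fetch("http://localhost:3000/sessions");
  const sessions = await response.json();
  // Grab the first active session's reconnection endpoint.
  return sessions[0].browserWSEndpoint;
};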
Then we’re going to use that browser endpoint with our instances.
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint:
      "ws://127.0.0.1:3000/devtools/browser/6428cadf-47db-4bca-bb84-995e7a2350a5",
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1290, height: 720 });
  await page.goto("https://www.goodreads.com/"); // No login!
  await page.screenshot({ path: "goodreads.png", fullPage: true });
  browser.disconnect();
})();
et voilà, we’re in, and we didn’t have to log in again!
Easy, isn’t it?
But there are some things to consider! These connections are basically “new tabs” in a single browser, not isolated instances running from a clean slate. This brings all the corresponding advantages and disadvantages: ad-tracking cookies are shared, and all the instances/tabs can “speak” with each other. It is highly recommended not to have a single remote browser for all your work (a single browser with a lot of session data), but to break it down into several remote browsers, each with a small amount of session data.
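In practice, that can be as simple as opening one small keepalive session per account or task, since each fresh connection to the base endpoint spawns its own browser (as in the keepalive example above). A minimal sketch, where the accounts list and the doLogin helper are hypothetical:

const puppeteer = require("puppeteer");

const accounts = ["alice@example.com", "bob@example.com"]; // hypothetical

const initAll = async () => {
  for (const account of accounts) {
    // Each connection to the base endpoint gets its own browser,
    // so every account's session data stays isolated.
    const browser = await puppeteer.connect({
      browserWSEndpoint: "ws://localhost:3000?keepalive=300000",
    });
    const page = await browser.newPage();
    await doLogin(page, account); // hypothetical login helper
    browser.disconnect();
  }
};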
Manage sessions by tracking things down
If you use a modular approach to manage different parts of your program, it can get really messy trying to tell them apart when you use the /sessions API to get the remote connection URL. That’s why we allow you to set Tracking IDs on all connections and API calls!
const puppeteer = require("puppeteer");

const init = async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: "ws://localhost:3000?keepalive=300000&trackingId=b4de40cee5",
  });
  const page = await browser.newPage();
  /* Your login code */
  browser.disconnect();
};
That way you can easily tell your browsers apart by filtering them by trackingId. This is also advantageous because Tracking IDs are isolated in their own workspace, so they won’t mingle with other browsers’ downloads and user data saved to disk.
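Putting it together, recovering the right endpoint becomes a simple filter over the /sessions response. Again a minimal sketch, assuming Node 18+ for the built-in fetch:

const getEndpointById = async (trackingId) => {
  const response = await fetch("http://localhost:3000/sessions");
  const sessions = await response.json();
  // Keep only the session tagged with our trackingId.
  const session = sessions.find((s) => s.trackingId === trackingId);
  return session ? session.browserWSEndpoint : null;
};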
How to get started with Browserless
There are different ways to use our product.
- Use our online debugger to try it out!
- Sign up for a free account and get an API key. You have 6 hours of usage for free! After that, you can pay as you go, and only pay per second that you use!
- You can self-host for development purposes by using our open-source Browserless Docker image
- If you’ve already tested our service and want a dedicated machine for your requests, reach out to us for a quote.
If you’re using one of our hosted services, be that usage-based or capacity-based, just connect to our WebSocket securely with your token to start web scraping!
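For example, something along these lines; the hostname shown here is an assumption that may differ by plan and region, and YOUR_API_KEY stands in for your own token:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.connect({
    // Replace YOUR_API_KEY with the token from your account dashboard.
    browserWSEndpoint: "wss://chrome.browserless.io?token=YOUR_API_KEY",
  });
  const page = await browser.newPage();
  await page.goto("https://example.com");
  await page.screenshot({ path: "example.png" });
  browser.disconnect();
})();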