In this article we'll look at managing sessions and their data when using Puppeteer. That includes storing cookies and keeping browsers open.
These techniques are essential if you're performing complex workflows behind login pages. The clean-slate model of automation libraries means it's up to you to connect cookies to instances, secure credentials, synchronize localStorage and much more.
So let's dive into how to manage sessions:
How to manage sessions in Puppeteer
Freshly baked cookies
Currently, one of the most common approaches is to log in once and then save the session data to disk. Take, for instance, this typical session-data-saving example,
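Here's a minimal sketch of that pattern (the login URL and steps are placeholders for illustration): cookies come from page.cookies(), while localStorage and sessionStorage have to be serialized out of the page context by hand:

```js
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/login');
  // ...perform your login steps here...

  // Cookies are available through the DevTools protocol.
  const cookies = await page.cookies();

  // localStorage and sessionStorage only exist inside the page,
  // so they have to be serialized from the browser context.
  const storage = await page.evaluate(() => ({
    localStorage: JSON.stringify(Object.assign({}, window.localStorage)),
    sessionStorage: JSON.stringify(Object.assign({}, window.sessionStorage)),
  }));

  fs.writeFileSync('session.json', JSON.stringify({ cookies, ...storage }));
  await browser.close();
})();
```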
for which saving is only half the job: you also have to implement the reverse, loading the cookies, sessionStorage and localStorage back into every fresh browser.
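Restoring could look roughly like this, reusing the session.json file from the sketch above; cookies can be set before navigating, but the storage objects need a loaded origin to attach to:

```js
const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const session = JSON.parse(fs.readFileSync('session.json', 'utf8'));

  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Cookies can be restored before navigation.
  await page.setCookie(...session.cookies);

  // Storage has to be written back after the page has an origin.
  await page.goto('https://example.com');
  await page.evaluate((localData, sessionData) => {
    Object.entries(JSON.parse(localData)).forEach(([k, v]) => localStorage.setItem(k, v));
    Object.entries(JSON.parse(sessionData)).forEach(([k, v]) => sessionStorage.setItem(k, v));
  }, session.localStorage, session.sessionStorage);
})();
```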
And to top it off, if you run multiple instances, you also need logic to synchronize that data between them. Not to mention that in-page JavaScript can't read HTTP-only cookies at all, so any document.cookie-based approach silently drops them. It can get very messy very quickly.
Luckily, that's one of the pain points we recognized at Browserless, and we built a painkiller that tackles this issue cleanly.
Staying alive
With Browserless, you can keep the browser (and thus the session) alive and kicking in the background, while your instances connect remotely and share the session data (cookies, localStorage and sessionStorage included).
Keeping your connection alive is as easy as adding keepalive=<MILLISECONDS> to the query string of your connection URL. Assuming you're self-hosting:
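For instance, connecting with the puppeteer client to a local Browserless container on its default port:

```js
const puppeteer = require('puppeteer');

(async () => {
  // keepalive=300000 tells Browserless to keep the browser running
  // for 5 minutes after the last client disconnects.
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://localhost:3000?keepalive=300000',
  });

  // ...do your work...

  // disconnect() detaches the client without killing the remote browser.
  browser.disconnect();
})();
```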
This will keep the session alive for 300000 ms (5 minutes), even if you disconnect all your instances. The browser keeps running in the background, maybe loading something, maybe just idling while it waits for new connections; it won't judge you. Let's look at an actual use case.
In this example, we're logging in to Goodreads and leaving the connection open for 5 minutes.
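Here's a sketch of that flow; the sign-in selectors and the environment variables holding the credentials are assumptions for illustration, so adapt them to the real form:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://localhost:3000?keepalive=300000',
  });
  const page = await browser.newPage();

  await page.goto('https://www.goodreads.com/user/sign_in');
  // Hypothetical selectors and credentials; adjust to the real form.
  await page.type('#user_email', process.env.GOODREADS_EMAIL);
  await page.type('#user_password', process.env.GOODREADS_PASSWORD);
  await Promise.all([
    page.waitForNavigation(),
    page.click('input[type="submit"]'),
  ]);

  // Detach without closing: keepalive keeps the logged-in session warm.
  browser.disconnect();
})();
```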
But how can you re-use an existing session? Easy: just make a GET request to /sessions and you'll see all your active sessions, each with a browserWSEndpoint field that we can use to re-connect. In our self-hosting example, we'll go to http://localhost:3000/sessions and get back a JSON response like this one:
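The exact fields vary between versions, so treat this shape as approximate (identifiers abbreviated):

```json
[
  {
    "description": "",
    "id": "...",
    "title": "Goodreads",
    "type": "page",
    "url": "https://www.goodreads.com/",
    "webSocketDebuggerUrl": "ws://localhost:3000/devtools/page/...",
    "browserId": "...",
    "browserWSEndpoint": "ws://localhost:3000/devtools/browser/..."
  }
]
```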
Then we're going to use that browser endpoint with our instances:
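Re-connecting is just another puppeteer.connect call pointed at that endpoint (the browser id below is a placeholder):

```js
const puppeteer = require('puppeteer');

(async () => {
  // browserWSEndpoint copied from the /sessions response above.
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://localhost:3000/devtools/browser/...',
  });
  const page = await browser.newPage();
  await page.goto('https://www.goodreads.com');
  // Already authenticated: the session cookies live in the remote browser.
})();
```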
Et voilà, we're in, and we didn't have to log in again!
Easy, isn't it?
But there are some things to consider! These connections are essentially "new tabs" in a single browser, not isolated instances running from a clean slate. That brings all the usual advantages and disadvantages: websites' ad-tracking cookies carry over, and all the instances/tabs can "talk" to each other. We highly recommend that you don't run all your work through a single remote browser holding a lot of session data, but instead break it down into several remote browsers, each with a small amount of session data.
Manage sessions by tracking things down
If you use a modular approach to manage the different parts of your program, it can get really messy trying to tell browsers apart when you use the /sessions API to fetch the remote connection URL. That's why we let you set tracking IDs on all connections and API calls!
That way you can easily tell your browsers apart by filtering them by trackingId. Tracking IDs also get their own isolated workspace, so their downloads and user data saved to disk won't mingle with other browsers'. Here's what that looks like in practice:
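A sketch, assuming the trackingId query parameter is accepted on both the WebSocket connection and the /sessions endpoint (check your version's docs) and a runtime with a global fetch, such as Node 18+; the id itself is just an arbitrary label:

```js
const puppeteer = require('puppeteer');

(async () => {
  // Tag the browser with a trackingId when connecting...
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'ws://localhost:3000?keepalive=300000&trackingId=goodreads-sync',
  });
  browser.disconnect();

  // ...then filter /sessions by the same id to find it again later.
  const res = await fetch('http://localhost:3000/sessions?trackingId=goodreads-sync');
  const [session] = await res.json();
  const reconnected = await puppeteer.connect({
    browserWSEndpoint: session.browserWSEndpoint,
  });
})();
```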
How to get started with Browserless
There are different ways to use our product.
- Use our online debugger to try it out!
- Sign up for a free 7-day trial and get an API key.
- You can self-host for development purposes by using our open-source Browserless Docker image.
- If you’ve already tested our service and want a dedicated machine for your requests, reach out to us for a quote.
If you’re using one of our hosted services, just connect to our WebSocket securely with your token to start web scraping!
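That connection looks like the snippet below; substitute your own API key for the token placeholder:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.browserless.io?token=YOUR_API_KEY',
  });
  // From here on, everything works just like a local browser.
  const page = await browser.newPage();
  await page.goto('https://example.com');
})();
```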