How to run Puppeteer on Heroku for Web Automation

Puppeteer is a great library for automating tasks such as web scraping or PDF generation.

However, when you deploy it on a platform such as Heroku, you can run into problems like high resource consumption and browser compatibility issues.

In this guide, we’ll show you how to deploy it, using a web scraper app as an example, and offer an alternative way to overcome these obstacles.

First, let’s figure out the prerequisites

Before you set up the web scraping application, there are three things you need in hand:

  • Node.js and npm: Node.js is the runtime environment required to run JavaScript on the server, while npm is its package manager.
  • Heroku CLI: The command-line tool for creating and managing Heroku apps.
  • Working knowledge of JavaScript and web scraping: You should be comfortable reading and writing JavaScript and have a basic understanding of how web scraping works.

Now, use this three-step process to build your web scraping application:

Step 1: Set up your project

Set up your Node.js environment first. Follow these steps:

  • Create a new directory for your project.
  • Start a Node.js application by running npm init -y in your terminal to generate a default package.json file.
  • Install puppeteer-core and its necessary dependencies.

Run this command to install it:


npm install puppeteer-core

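Next, configure puppeteer-core for Heroku. Unlike the full puppeteer package, puppeteer-core doesn’t download a browser, so you have to tell it where the Chrome binary lives with executablePath and make sure that browser version is compatible with your puppeteer-core version.

Use an existing buildpack to install Chrome on Heroku. Here’s a basic setup for puppeteer-core in your project: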


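const puppeteer = require('puppeteer-core');

async function setup() {
    const browser = await puppeteer.launch({
        // Path to the Chrome binary installed by your Chrome buildpack; adjust it if yours differs.
        executablePath: '/app/.apt/usr/bin/google-chrome',
        // Heroku dynos typically can't run Chrome's sandbox, so it is disabled here.
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
    const page = await browser.newPage();
    return { browser, page };
}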

After this, decide how you’ll host Chrome. You can run it on your own machine during development. For production, host it on a cloud platform or use a pool of managed browsers.
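One way to handle the split between development and production is to resolve the browser path at startup instead of hard-coding it. The following is a minimal sketch, not a definitive setup: resolveChromePath and buildLaunchOptions are hypothetical helpers, and the environment variable names CHROME_PATH and GOOGLE_CHROME_BIN are assumptions, so check what your buildpack or local machine actually exports.

```javascript
// Hypothetical helpers: decide which Chrome binary puppeteer-core should launch.
// The environment variable names below are assumptions, not guaranteed to exist.
const DEFAULT_HEROKU_CHROME = '/app/.apt/usr/bin/google-chrome';

function resolveChromePath(env = process.env) {
    // An explicit override wins (handy for local development).
    if (env.CHROME_PATH) return env.CHROME_PATH;
    // Some Chrome buildpacks export the binary location themselves.
    if (env.GOOGLE_CHROME_BIN) return env.GOOGLE_CHROME_BIN;
    // Otherwise fall back to the path used in the setup() example.
    return DEFAULT_HEROKU_CHROME;
}

// Build the options object that would be passed to puppeteer.launch().
function buildLaunchOptions(env = process.env) {
    return {
        executablePath: resolveChromePath(env),
        args: ['--no-sandbox', '--disable-setuid-sandbox'],
    };
}
```

With these helpers, setup() could call puppeteer.launch(buildLaunchOptions()) and run unchanged on a laptop and on a dyno, as long as the right variable is set in each place.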

Step 2: Write the puppeteer-core script

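Before you write the script, inspect the target website’s DOM (Document Object Model) structure so you know which elements to select. Puppeteer’s browser-control API is asynchronous, so write your script with async functions, and manage resources carefully by closing every browser you open.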

Here’s a sample script for scraping data:


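// Assumes setup() from Step 1 is defined in the same file.
async function scrape() {
    const { browser, page } = await setup();
    try {
        await page.goto('https://example.com');
        // Runs inside the page and returns the text of the first <h1>.
        const data = await page.evaluate(() => document.querySelector('h1').innerText);
        console.log(data);
    } finally {
        // Always close the browser, even if navigation or extraction fails.
        await browser.close();
    }
}

scrape().catch(console.error);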

This script navigates to a specific webpage and extracts data by selecting the first <h1> element on the page. It logs the text to the console for inspection, then closes the browser instance to release resources.
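If the page has no matching element, calling innerText on the result of document.querySelector throws. A more defensive extraction step can be sketched as follows. getText is a hypothetical helper, not part of Puppeteer; it relies only on page.$(), which resolves to null when nothing matches, and on the element handle’s evaluate() and dispose() methods. It reads textContent rather than innerText, which is an illustrative choice.

```javascript
// Hypothetical helper: return the trimmed text of the first element matching
// `selector`, or null when the page has no such element.
// `page` is expected to be a Puppeteer Page (or any object with a compatible $()).
async function getText(page, selector) {
    const handle = await page.$(selector); // resolves to null if nothing matches
    if (!handle) return null;
    try {
        // Runs in the browser context with the matched element as `el`.
        const text = await handle.evaluate((el) => el.textContent || '');
        return text.trim();
    } finally {
        // Release the element handle so handles don't pile up in long-running scripts.
        await handle.dispose();
    }
}
```

In the script above, the extraction line could then become const data = await getText(page, 'h1'), with a null check before using the value.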

Note: On Heroku, Chrome needs system libraries that the default Node.js environment doesn’t include, so you have to add the Puppeteer Heroku buildpack.

Add it with the Heroku CLI:


heroku buildpacks:add https://github.com/jontewks/puppeteer-heroku-buildpack

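This command installs the dependencies you need to run puppeteer-core on Heroku. It needs an existing app, so run it after heroku create in Step 3. If the app has no other buildpack configured, you may also need to add heroku/nodejs so the Node.js app itself still gets built.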

Step 3: Deploy to Heroku

Make sure your application structure is correct: package.json and your entry file (index.js in this example) should sit in the project root.

Next, create a Procfile, a text file in the root of your project that tells Heroku how to run your application:


web: node index.js
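Heroku expects a web process to listen on the port it passes in the PORT environment variable, so a script that scrapes once and exits is not enough for this process type. One possible shape for index.js is a small HTTP wrapper around the scraper. This is a sketch under assumptions: createScrapeHandler and startServer are hypothetical names, the /scrape route is an arbitrary choice, and scrape stands for any async function that returns the scraped value.

```javascript
const http = require('node:http');

// Hypothetical wrapper: turn an async scrape function into an HTTP request handler.
function createScrapeHandler(scrape) {
    return async (req, res) => {
        if (req.url !== '/scrape') {
            res.writeHead(404, { 'Content-Type': 'application/json' });
            res.end(JSON.stringify({ error: 'Not found' }));
            return;
        }
        try {
            const data = await scrape();
            res.writeHead(200, { 'Content-Type': 'application/json' });
            res.end(JSON.stringify({ data }));
        } catch (err) {
            // Report scraper failures to the caller instead of crashing the dyno.
            res.writeHead(500, { 'Content-Type': 'application/json' });
            res.end(JSON.stringify({ error: err.message }));
        }
    };
}

// Start the server on the port Heroku provides (3000 is a local fallback).
function startServer(scrape, port = process.env.PORT || 3000) {
    return http.createServer(createScrapeHandler(scrape)).listen(port);
}
```

Calling startServer(scrape) at the bottom of index.js keeps the process alive, and a request to /scrape triggers one scraping run. If you only need scheduled or one-off runs, a non-web process type may be a better fit.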

To deploy, commit your code and push it to Heroku with Git and the Heroku CLI (if your local branch is named main, push main instead of master):


git init
git add .
git commit -m "Initial commit"
heroku create
git push heroku master

Things to keep in mind to optimize performance

To optimize puppeteer-core performance on Heroku, keep these tips in mind:

  • Reduce the number of DOM manipulations and interactions while running Puppeteer scripts.
  • Use specific CSS selectors to target page elements and improve page evaluation.
  • Minimize network requests by caching resources and skipping assets your script doesn’t need, such as images, fonts, and media.
  • Add error-handling mechanisms to prevent script failures and improve reliability.
  • Monitor memory usage and identify memory leaks (like open pages/browsers) to prevent resource exhaustion.
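Two of these tips, cutting network requests and avoiding leaked pages, can be sketched with Puppeteer’s request interception and a try/finally wrapper. Treat this as an illustration under assumptions: blockHeavyResources and withPage are hypothetical helpers, and the list of blocked resource types is a guess at what a text scraper can live without.

```javascript
// Resource types this sketch treats as unnecessary for text scraping.
// The list is an assumption; keep stylesheets if your target page needs them.
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Abort requests for heavy resources and let everything else through.
async function blockHeavyResources(page) {
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        if (BLOCKED_TYPES.has(request.resourceType())) {
            request.abort();
        } else {
            request.continue();
        }
    });
}

// Run `task` with a fresh page and always close it afterwards,
// so failed runs don't leave pages open and holding memory.
async function withPage(browser, task) {
    const page = await browser.newPage();
    try {
        return await task(page);
    } finally {
        await page.close();
    }
}
```

A scraping run could then look like withPage(browser, async (page) => { await blockHeavyResources(page); await page.goto(url); ... }), where url is whatever page you are targeting.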

Connecting your script to managed browsers

Browsers such as Chrome are notoriously messy to host. You have to deal with a range of issues yourself, such as chasing down memory leaks that would otherwise chew up your resources.

Instead, we recommend using our pool of managed browsers.

To connect your scripts to Browserless, modify your puppeteer-core setup:


const puppeteer = require('puppeteer-core');

async function setup() {
    const browser = await puppeteer.connect({
        browserWSEndpoint: 'wss://chrome.browserless.io'
    });
    const page = await browser.newPage();
    return { browser, page };
}
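Managed browser endpoints usually require authentication. The sketch below assumes the API token is passed as a token query parameter and stored in an environment variable named BROWSERLESS_TOKEN; both are assumptions to verify against your account’s connection details. buildBrowserlessEndpoint and connectToBrowserless are hypothetical helpers, and puppeteer is passed in as an argument so the URL logic can be tested without a browser.

```javascript
// Hypothetical helper: build the WebSocket endpoint for puppeteer.connect().
// Assumption: the service authenticates through a `token` query parameter.
function buildBrowserlessEndpoint(token, base = 'wss://chrome.browserless.io') {
    if (!token) {
        throw new Error('Missing Browserless API token');
    }
    const url = new URL(base);
    url.searchParams.set('token', token);
    return url.toString();
}

// Connect using a token read from the environment.
// BROWSERLESS_TOKEN is an assumed variable name; pick any name you like.
async function connectToBrowserless(puppeteer, token = process.env.BROWSERLESS_TOKEN) {
    return puppeteer.connect({
        browserWSEndpoint: buildBrowserlessEndpoint(token),
    });
}
```

On Heroku, the token would typically be stored as a config var rather than committed to the repository.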

Wrapping up

While puppeteer-core and Heroku are a workable combination for web automation, running them together brings issues such as browser incompatibility and high resource consumption.

Try out Browserless's managed pool of headless browsers to avoid these issues. You can tighten up your resource consumption while accessing a secure and stable browsing environment.

Take it for a test drive using the 7-day free trial.
