How to build your own Twitter scraper tool using Browserless and Node.js

June 13, 2022

Let’s talk about Twitter! It is one of the largest social media platforms, where people can post short pieces of text and reach millions of readers across the world. It is so influential that someone can alter the price of a crypto coin just by tweeting an image. Apart from individuals looking to share and gain information, companies, political institutions, and governments also maintain accounts. Twitter's ubiquity has fascinated data analysts, who study account activity to gain insights into trends and social phenomena, and even to research how to maximize their clients' profits.

In this article, we will learn how to retrieve helpful information about a Twitter account’s activity by employing web scraping techniques using a free automation platform like Browserless and the flexibility of a scripting programming language like JavaScript.

So let's dive deeper into a Twitter scraper guide:

Twitter scraper step #1 - Get a Browserless account

Browserless is a headless automation platform that provides fast, scalable, and reliable web browser automation, ideal for data analysis assignments. It’s an open-source platform with more than 7.2K stars on GitHub. It also has a hosted SaaS platform. Some of the largest companies worldwide use the platform daily to conduct QA testing and data collection tasks.

To get started, we first have to create an account.

The hosted SaaS platform offers a free plan, plus paid plans if we need more processing power. The free tier offers up to 6 hours of usage, which is more than enough for our case.

After completing the registration process, the platform supplies us with an API key. We will use this key to access the Browserless services later on.

Twitter scraper step #2 - Set up a Node script with Puppeteer

The next step is to set up our project. While Browserless supports many programming languages and platforms, we will use JavaScript on Node.js due to its simplicity and robust environment.

First, let's initialize a new Node project and install the puppeteer-core package.

$ npm init -y && npm i puppeteer-core

In case you didn’t know, puppeteer is a popular JavaScript library used for browser automation and web scraping. It has more than 78K stars on GitHub and is actively maintained. The puppeteer-core package provides all the functionality of the main puppeteer package without downloading the browser, resulting in fewer dependency artifacts. By the way, if you like puppeteer-core, check out our "How to do web automation with Puppeteer-core & Browserless [3 code examples]" article.

Once we have installed our dependency, we can create the script's structure.


import puppeteer from "puppeteer-core";
 
const BROWSERLESS_API_KEY = "YOUR_API_KEY_HERE";
 
const getTwitterData = async (url) => {
  const _browser = await puppeteer.connect({
    browserWSEndpoint: `wss://chrome.browserless.io?token=${BROWSERLESS_API_KEY}`,
  });
  const _page = await _browser.newPage();
  await _page.goto(url);
  await _page.waitForSelector(`article`);
 
  // TODO: Use query selectors to access profile info and tweets
 
  _browser.disconnect();
 
  return {
    // TODO: results
  };
};
 
const data = await getTwitterData("https://twitter.com/NASA");
console.log(data);

There are a couple of things to notice here, so let's quickly walk through the code:

First, we import the puppeteer-core module.
We declare a variable BROWSERLESS_API_KEY, whose value is the Browserless API key we retrieved from the dashboard earlier.
Then, we declare an asynchronous function getTwitterData which accepts the profile URL as a parameter, e.g., “https://twitter.com/NASA”.
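We call getTwitterData and print the results to the terminal. Note that we use top-level await syntax, which ES modules support from Node version 14.8 onward, so the script must run as an ES module (for example, saved with an .mjs extension or with "type": "module" in package.json).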

Inside the getTwitterData function, we connect to the Browserless service by calling the connect method of the puppeteer module and use the browserWSEndpoint property to indicate the connection URI, which consists of two parts:

The base URI wss://chrome.browserless.io
The token query-string parameter, whose value is the API key we retrieved from the dashboard.
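As a minimal sketch of how these two parts combine, the helper below builds the endpoint from a token. The buildEndpoint name and the BROWSERLESS_API_KEY environment variable are assumptions made for illustration, not part of the Browserless API; reading the key from the environment simply keeps it out of source control.

```javascript
// Hypothetical helper: joins the base URI and the token query-string parameter.
const BASE_URI = "wss://chrome.browserless.io";

const buildEndpoint = (token) => {
  if (typeof token !== "string" || token.trim() === "") {
    throw new Error("A Browserless API key is required");
  }
  // encodeURIComponent guards against characters that are unsafe in a query string.
  return `${BASE_URI}?token=${encodeURIComponent(token)}`;
};

// Example usage (assumes the key was exported as BROWSERLESS_API_KEY):
// const browserWSEndpoint = buildEndpoint(process.env.BROWSERLESS_API_KEY);
```

Failing early on a missing key tends to give a clearer error than a rejected WebSocket connection later on.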

Then we instantiate a new browser page and navigate to the desired Twitter account by using the value of the url parameter. The following statement is critical: we call waitForSelector on the page instance to instruct the underlying puppeteer engine to wait until an element matching the selector appears, which signals that tweets have started to render. Twitter’s browser-based UI is built with React as a SPA (Single Page Application), so the tweets are fetched only after the initial page load. If we did not use the waitForSelector method, we would not be able to retrieve the available tweets. An <article /> element represents each tweet, so we use that as the query selector. Finally, we disconnect from the remote browser instance before returning the results.
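One possible refinement, sketched here under assumptions: if goto or waitForSelector throws (for example, when no tweet renders before the timeout), the script never reaches the disconnect call. The withRemotePage helper below is hypothetical; it accepts a connect function, such as () => puppeteer.connect({ browserWSEndpoint }), so the cleanup logic stays independent of how the browser is obtained. The 30-second timeout is an illustrative value.

```javascript
// Hypothetical wrapper: always disconnects, even when navigation or waiting fails.
const withRemotePage = async (connect, url, work) => {
  const browser = await connect();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Resolves once the first <article> element appears; rejects on timeout.
    await page.waitForSelector("article", { timeout: 30000 });
    // Hand the ready page to the caller's callback and return its result.
    return await work(page);
  } finally {
    await browser.disconnect();
  }
};

// Example usage (assumes puppeteer and browserWSEndpoint are in scope):
// const title = await withRemotePage(
//   () => puppeteer.connect({ browserWSEndpoint }),
//   "https://twitter.com/NASA",
//   (page) => page.title()
// );
```

The try/finally block is the design choice worth noting: the remote session is released whether the callback succeeds or throws.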

Twitter scraper step #3 - Retrieve profile info

The first information we will retrieve concerns the profile itself: the profile name, the username, and the numbers of followers and accounts followed all provide helpful insight into the account's performance.

At the time of this writing, this is what the Twitter profile page looks like on desktop computers:

How to scrape Twitter profile

We can use the highlighted div element, which carries the data-testid="UserName" attribute, to get the text content of its children.

The resulting values will contain the profile name and username. We will use the same tactic to retrieve the number of followers and following, this time selecting the links whose href attributes end in /followers and /following. We encapsulate this logic into a function getProfileInfo that we can later call from inside getTwitterData.


const getProfileInfo = async (page) =>
  await page.evaluate(() => {
    const $ = (selector) => document.querySelector(selector);
 
    return {
      profileName: $('[data-testid="UserName"] div span').innerText,
      username: $('[data-testid="UserName"] div:nth-of-type(2) span').innerText,
      followers: $('a[href$="/followers"]').innerText,
      following: $('a[href$="/following"]').innerText,
    };
  });

The call to the evaluate method executes the provided callback inside the remote browser's page context, which is why the callback can use document directly.
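Because the callback runs in the page, a selector that matches nothing makes document.querySelector return null, and reading innerText then throws. A hedged sketch of a more tolerant lookup follows. The textOrNull helper is hypothetical, and since functions passed to evaluate are serialized and run in the browser, it would need to be defined inside the callback; it is shown standalone here, taking the root to search as a parameter.

```javascript
// Hypothetical helper: returns the trimmed text of the first match,
// or null when the selector matches nothing.
const textOrNull = (root, selector) => {
  const el = root.querySelector(selector);
  return el ? el.innerText.trim() : null;
};

// Inside page.evaluate it might be used like this:
// followers: textOrNull(document, 'a[href$="/followers"]'),
```

Returning null instead of throwing lets the rest of the profile data come back even if one selector stops matching after a Twitter UI change.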

Twitter scraper step #4 - Retrieve tweets statistics

Now that we have gathered basic profile details, we can retrieve some metrics for the latest tweets. The most common statistics we want to know about a tweet are the post time and the number of likes, retweets, and replies. Recall that each tweet is an <article/> DOM element. It turns out that tweets are the only component that uses the <article/> tag. This makes our job easier because we can use querySelectorAll to gather all the articles and use the appropriate selectors to retrieve the desired info for each metric. We’ll also encapsulate this functionality into its own function, getTweetMetrics.


const getTweetMetrics = async (page) =>
  await page.evaluate(() => {
    return [...document.querySelectorAll("article")].map((el) => {
      return {
        submitted: el.querySelector("time").dateTime,
        replies: el.querySelector('[data-testid="reply"]').innerText,
        retweets: el.querySelector('[data-testid="retweet"]').innerText,
        likes: el.querySelector('[data-testid="like"]').innerText,
      };
    });
  });

As we did for the profile info, we call the appropriate selectors on each element and read the inner text. For each tweet, we can access the post time through the dateTime property of the <time/> element, which reflects its datetime attribute. We can target the corresponding DOM elements for likes, retweets, and replies using the data-testid attribute.

Executing the Twitter scraping script

Here is the complete script:


import puppeteer from "puppeteer-core";
 
const BROWSERLESS_API_KEY = "YOUR_API_KEY_HERE";
 
const getProfileInfo = async (page) =>
  await page.evaluate(() => {
    const $ = (selector) => document.querySelector(selector);
 
    return {
      profileName: $('[data-testid="UserName"] div span').innerText,
      username: $('[data-testid="UserName"] div:nth-of-type(2) span').innerText,
      followers: $('a[href$="/followers"]').innerText,
      following: $('a[href$="/following"]').innerText,
    };
  });
 
const getTweetMetrics = async (page) =>
  await page.evaluate(() => {
    return [...document.querySelectorAll("article")].map((el) => {
      return {
        submitted: el.querySelector("time").dateTime,
        replies: el.querySelector('[data-testid="reply"]').innerText,
        retweets: el.querySelector('[data-testid="retweet"]').innerText,
        likes: el.querySelector('[data-testid="like"]').innerText,
      };
    });
  });
 
const getTwitterData = async (url) => {
  const _browser = await puppeteer.connect({
    browserWSEndpoint: `wss://chrome.browserless.io?token=${BROWSERLESS_API_KEY}`,
  });
  const _page = await _browser.newPage();
  await _page.goto(url);
  await _page.waitForSelector(`article`);
 
  const profileData = await getProfileInfo(_page);
  const tweetsMetrics = await getTweetMetrics(_page);
 
  _browser.disconnect();
 
  return {
    ...profileData,
    tweets: tweetsMetrics,
  };
};
 
const data = await getTwitterData("https://twitter.com/NASA");
console.log(data);

Running the script, we get output similar to the following:


{
  profileName: 'NASA',
  username: '@NASA',
  followers: '58M Followers',
  following: '182 Following',
  tweets: [
    {
      submitted: '2022-06-08T16:48:55.000Z',
      replies: '81',
      retweets: '543',
      likes: '3,338'
    },
    {
      submitted: '2022-06-08T21:27:40.000Z',
      replies: '72',
      retweets: '394',
      likes: '3,172'
    },
    {
      submitted: '2022-06-08T19:04:21.000Z',
      replies: '196',
      retweets: '1,337',
      likes: '6,854'
    }
  ]
}
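The counts come back as display strings such as '3,338' or '58M Followers'. For numeric analysis they would need converting; the parseCount function below is a minimal sketch and not part of the original script. It assumes the K, M, and B suffixes stand for thousands, millions, and billions, and it returns null for text with no leading number.

```javascript
// Hypothetical converter for abbreviated counts such as "3,338" or "58M Followers".
const MULTIPLIERS = { K: 1e3, M: 1e6, B: 1e9 };

const parseCount = (text) => {
  // Capture the leading number and an optional K/M/B suffix directly after it.
  const match = /^\s*([\d,]*\.?\d+)([KMB])?/i.exec(text ?? "");
  if (!match) return null;
  // Drop thousands separators before converting to a number.
  const value = Number(match[1].replace(/,/g, ""));
  const multiplier = match[2] ? MULTIPLIERS[match[2].toUpperCase()] : 1;
  return Math.round(value * multiplier);
};

console.log(parseCount("3,338")); // 3338
console.log(parseCount("58M Followers")); // 58000000
```

Abbreviated values are already rounded by Twitter, so a result like 58000000 is an approximation of the real follower count.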

Epilogue

In this article, we learned how to leverage an automation platform like Browserless together with JavaScript, through Node.js, to gather statistics about a Twitter profile's activity. We hope we taught you something interesting today so you can improve your workflow. As always, stay tuned for more educational articles.

George Gkasdrogkas,
