Turning sites into structured data | How to get JSON file from Wikipedia

December 7, 2022


Everyone loves data in JSON format. Today we'll turn Wikipedia articles into JSON payloads. Before we do, please consider supporting Wikipedia with a donation if you have found it helpful over the years; as a non-profit, it has brought tremendous value to the world by sharing knowledge. That said, let's get straight to it.

First, we'll need to connect to browserless. Sign up to get an API key, then pass it as the token query parameter in the connection URL to get access to Chrome browsers in the cloud.
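As a minimal sketch of this step, the connection URL can be built from a token kept in an environment variable instead of being hardcoded. The helper name `buildEndpoint` and the `TOKEN` variable name are assumptions for illustration; the host is the one used in the sample code of this article.

```javascript
// Hypothetical helper: builds the browserless WebSocket endpoint from an API token.
function buildEndpoint(token) {
  if (!token) {
    throw new Error('Missing browserless API token');
  }
  // encodeURIComponent guards against characters that are not URL-safe
  return `wss://chrome.browserless.io?token=${encodeURIComponent(token)}`;
}

// Usage (assumes the token is stored in an environment variable named TOKEN):
// const browser = await puppeteer.connect({
//   browserWSEndpoint: buildEndpoint(process.env.TOKEN),
// });
```

Failing early on a missing token tends to give a clearer error than a rejected WebSocket connection.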

We'll set a fixed viewport so that selectors don't change due to responsive styling between development and production.

When navigating to the site, we'll set the waitUntil option to load, so that we can start extracting information as soon as the page's load event has fired.

To get into the browser context and scrape information, we'll use the page.evaluate() method, where most of our code will live. This is the stage where we collect all the data from the site that we want to return in the payload.
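One detail worth illustrating: the function passed to page.evaluate() runs inside the browser, so it cannot see variables from the Node.js side, and only serializable values make it back. The sketch below assumes `page` is a Puppeteer page; the helper name `getText` is hypothetical.

```javascript
// Sketch: `page` is assumed to be a Puppeteer Page, e.g. from browser.newPage().
async function getText(page, selector) {
  // Values the browser-side callback needs are passed as extra arguments
  // after the function, because closures do not cross into the browser.
  return page.evaluate((sel) => {
    const el = document.querySelector(sel);
    // Return a plain string (or null): DOM nodes would not survive the trip back
    return el ? el.textContent.trim() : null;
  }, selector);
}

// Usage (inside an async function, with a connected page):
// const title = await getText(page, '.mw-page-title-main');
```

Returning null for a missing element is a design choice of this sketch; it avoids an exception when a selector does not match.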

As an example, we'll use Wikipedia and extract three things: the title of the article, all paragraphs longer than 15 characters (this filters out small, irrelevant paragraphs; you can remove this step if you want), and the infobox, the table with key information on the right side of the article. You could scrape much more, such as the bibliographical references or the table of contents, but this should be a good starting point for most sites.
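To make the shape of the result easier to follow, here is a rough sketch of the same filtering and grouping logic as plain functions that run without a browser. The function names and the sample rows are made up for illustration, and the sketch leaves out the '1' placeholder that the full sample uses for empty cells.

```javascript
// Keep only paragraphs longer than minLength characters.
function filterParagraphs(texts, minLength = 15) {
  return texts.filter((text) => text.length > minLength);
}

// Turn infobox rows (arrays of cell values) into { data, imgs }.
// Rows with two cells become key/value pairs; single-cell rows are kept
// only when the cell holds an image link.
function rowsToInfobox(rows) {
  const infobox = { data: {}, imgs: [] };
  for (const [key, value] of rows) {
    if (value !== undefined) {
      infobox.data[key] = value;
    } else if (key && key.includes('https://')) {
      infobox.imgs.push({ key });
    }
  }
  return infobox;
}

// Made-up sample rows, standing in for the cells read from the page
const sampleInfobox = rowsToInfobox([
  ['Label', 'Some value'],
  ['https://example.org/picture.jpg'],
  ['Section heading without a value'],
]);
console.log(JSON.stringify(sampleInfobox, null, 2));
```

Keeping this logic in small pure functions makes it possible to check it against sample data before pointing it at a live page.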

You can find this sample on our replit page, named Structure Wikipedia into JSON. To make it work, all you have to do is sign up to browserless and add your API key as a replit secret named "TOKEN".

Here's the code if you want to run it from Node.js using the puppeteer-core library:


const puppeteer = require('puppeteer-core');

(async() => {
    const browser = await puppeteer.connect({ browserWSEndpoint: 'wss://chrome.browserless.io?token=YOUR_API_KEY' }); //connect to browserless

    const page = await browser.newPage();
    // Fixed viewport so responsive styling doesn't change the selectors
    await page.setViewport({ width: 1400, height: 1020 });
    const url = "https://en.wikipedia.org/wiki/Carl_Linnaeus";
    await page.goto(url, { waitUntil: 'load' });
    const myData = await page.evaluate(() => {
        // Article title
        const title = document.querySelector('.mw-page-title-main').innerHTML;
        // Body paragraphs, skipping very short ones
        const paragraphs = [...document.querySelectorAll('#mw-content-text > div.mw-parser-output > p')]
            .map(el => el.textContent)
            .filter(text => text.length > 15);
        // Infobox rows (the table on the right side of the article)
        const tbody = document.querySelector('.infobox tbody');
        const rows = [...tbody.querySelectorAll('tr')];
        const infobox = { data: {}, imgs: [] };
        rows.forEach((row, i) => {
            if (i === 0) return; // skip the first row
            // Use the image link for cells that contain one, else the cell text
            const cells = [...row.querySelectorAll('td,th')].map(cell => {
                const img = cell.querySelector('a.image');
                return img !== null ? img.href : cell.textContent;
            });
            const key = cells[0] || '1';
            if (cells[1] !== undefined) {
                // Two cells: store as a key/value pair
                infobox.data[key] = cells[1] || '1';
            } else if (key.includes('https://')) {
                // Single cell holding an image link
                infobox.imgs.push({ key });
            }
        });
        return { title, infobox, paragraphs };
    });
    await browser.close();
    // Print the structured result as JSON
    console.log(JSON.stringify(myData, null, 2));

})();

Don't want to run this on Node.js? Let's wrap it all up in a call to the /function API!
Here's how the same code looks using curl:


curl --request POST \
  --url 'https://chrome.browserless.io/function?token=YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
  "code": "module.exports=async({page,context})=>{const{url}=context;await page.setViewport({width:1400,height:1020}),await page.goto(url,{waitUntil:'\''networkidle2'\''});const myData=await page.evaluate(()=>{let obj={};const title=document.querySelector('\''.mw-page-title-main'\'').innerHTML,ps=[...document.querySelectorAll('\''#mw-content-text > div.mw-parser-output > p'\'')],p=ps.map(el=>el.textContent).filter(value=>value.length>15),tbody=document.querySelector('\''.infobox tbody'\''),trs=Array.from(tbody.querySelectorAll('\''tr'\'')),infocontent=[],ic={data:{}};return trs.forEach((value,i)=>{if(0!==i){const td=[...value.querySelectorAll('\''td,th'\'')],tds=td.map(el=>null!==el.querySelector('\''a.image'\'')?el.querySelector('\''a.image'\'').href:el.textContent),data=tds.map(td=>td),key=data[0]||'\''1'\'';let islink=!1;key.includes('\''https://'\'')&&(islink=!0);const val=data[1]||'\''1'\'';void 0===data[1]?islink&&infocontent.push({key:key}):ic.data[key]=val}}),ic.imgs=infocontent,obj={...obj,title:title,infobox:ic,paragraphs:p},obj});return{data:myData,type:'\''text/plain'\''}};",
  "context": {
    "url": "https://en.wikipedia.org/wiki/Carl_Linnaeus"
  }
}'
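The '\'' sequences in the curl command are only there for shell quoting. As a rough alternative sketch, the same kind of request body can be assembled in Node.js with JSON.stringify, which takes care of the escaping. The helper name `buildFunctionRequest` is hypothetical, the function code is shortened to return only the title, the host and path come from the curl command, and the lowercase token parameter follows the WebSocket URL in the Node.js sample. Sending the request with the global fetch assumes Node.js 18 or newer.

```javascript
// Shortened function code for illustration: returns only the article title.
const code = `module.exports = async ({ page, context }) => {
  const { url } = context;
  await page.goto(url, { waitUntil: 'load' });
  const title = await page.evaluate(
    () => document.querySelector('.mw-page-title-main').innerHTML
  );
  return { data: { title }, type: 'text/plain' };
};`;

// Hypothetical helper: returns the endpoint and fetch options for the /function API.
function buildFunctionRequest(token, url) {
  return {
    endpoint: `https://chrome.browserless.io/function?token=${encodeURIComponent(token)}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      // JSON.stringify escapes the quotes and newlines in the code string
      body: JSON.stringify({ code, context: { url } }),
    },
  };
}

// Usage (not executed here; needs a real token and network access):
// const { endpoint, options } = buildFunctionRequest(
//   process.env.TOKEN,
//   'https://en.wikipedia.org/wiki/Carl_Linnaeus'
// );
// const response = await fetch(endpoint, options);
// console.log(await response.text());
```

Because the target URL travels in context, the same code string can be reused for other articles by changing only that argument.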

I hope this helped you understand how to turn any website into structured data. Let us know if you found it useful and would like more articles like this one, and we'll be happy to share!

