Using browserless to train your LLM

Large language models, or LLMs, are a great way to make data from various sources accessible to end users in a variety of ways. How you train your LLM is particularly important. Projects like Hugging Face do a great job of providing starting datasets that make it easy to get these models off the ground. But what if you want to compete with the likes of OpenAI and ingest even more data? What if the data you’re accessing is dynamically generated with JavaScript, or uses other sophisticated technologies that require a web browser to render? With browserless, you can craft a simple API call to do just that.
This guide assumes you have some familiarity with LLMs and how to use them, and focuses on the data side of training these models. Feel free to read more about how to train a large language model with the framework you’re using.
About browserless
browserless is a service that manages Chrome and makes it programmatically accessible to developers. In most cases, you need a library like Puppeteer or Playwright to get Chrome to do whatever you need it to do. That’s great for certain projects, but since most LLMs are only interested in raw data, driving a full programmatic API just to get that data can be a bit heavy-handed. This is where browserless shines: we offer REST-based APIs for common use cases across tech.
In particular, there are two APIs we want to highlight that make it extremely easy to fetch website data: the Scrape and Content APIs.
Using Scrape to train LLMs
Our Scrape API is well suited for fetching website data after JavaScript has been parsed and run, returning the rendered content back to you. Like most REST-based APIs, the Scrape API takes a JSON body describing the nature of your request and what it should look for. A simple example looks like this:
curl --request POST \
  --url 'https://chrome.browserless.io/scrape?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com",
    "elements": [
      {
        "selector": "body"
      }
    ]
  }'
The request above navigates to CNN.com, waits for JavaScript to be parsed and run, collects data for the body of the document, and returns the following (truncated for brevity’s sake):
{
  "data": [
    {
      "selector": "body",
      "results": [
        {
          "text": "Audio\nLive TV\nLog In\nHAPPENING NOW\nAnalysis of Donald Trump's historic arraignment on 37 federal criminal charges. Watch CNN\nLive Updates: Ukraine \nTrump arraignment \nTrending: Tori Bowie autopsy \nInstant Brands bankruptcy \nHatch Act \nFather’s Day gifts \nPodcast: 5 Things\nTrump pleads not guilty to mishandling classified intelligence documents\nAnna Moneymaker/Getty Images\nFACT CHECK\nTrump has responded to his federal indictment with a blizzard of dishonesty. Here are 10 of his claims fact-checked\nOpinion: Trump backers are going bonkers with dangerous threats\nDoug Mills/The New York Times/Redux\nGALLERY\nIn pictures: The federal indictment of Donald Trump\n‘Dejected’: Grisham describes Trump’s demeanor as he headed to court\nTrump didn’t speak during the historic hearing, sitting with his arms crossed and a scowl on his face. Here’s what else happened\nLive Updates: Judge says Trump must not communicate with co-defendant about the case\nTakeaways from Trump’s historic court appearance\nWatch: Hear how Trump acted inside the ‘packed courtroom’ during arraignment\nJudge allows E. Jean Carroll to amend her defamation lawsuit to seek more damages against Trump\nInteractive: Former President Donald Trump’s second indictment, annotated\nDoug Mills/The New York Times/Redux\nVIDEO\nTrump stops at famous Cuban restaurant after his arrest...© 2016 Cable News Network.",
          "width": 800,
          "height": 25409,
          "top": 0,
          "left": 0,
          "attributes": [
            {
              "name": "class",
              "value": "layout layout-homepage cnn"
            },
            {
              "name": "data-page-type",
              "value": "section"
            }
          ]
        }
      ]
    }
  ]
}
LLMs are mostly interested in the “text” of a website, which this API returns inside that JSON structure. You also get additional metadata about the content: things like its size (in pixels) and position on the page. These fields can further enhance your model’s knowledge of the data and add another dimension for weighing potential importance.
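To turn that response into training records, a short post-processing step can pull each element’s text together with its layout metadata. Here is a minimal sketch in Node.js (the response shape mirrors the example above; the record fields are just one reasonable choice):

```javascript
// Flatten a Scrape API response into records suitable for a training
// corpus: the element's text plus its size and position metadata.
const toTrainingRecords = (response) =>
  response.data.flatMap((element) =>
    element.results.map((result) => ({
      selector: element.selector,
      // Collapse runs of whitespace/newlines into single spaces.
      text: result.text.replace(/\s+/g, ' ').trim(),
      width: result.width,
      height: result.height,
      top: result.top,
      left: result.left,
    }))
  );
```

Each record keeps the positional data alongside the text, so your pipeline can, for example, down-weight content rendered far below the fold.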
You can learn more about our Scrape API, including all the additional options, here.
Content API to train your LLM
The Content API is similar to the Scrape API in that it returns content after JavaScript has been parsed and executed. It differs in that it returns only the HTML of the site itself, with no additional parsing. Using it is much like using Scrape: you POST a JSON body containing details about the URL you care about.
Below is an example of what this looks like:
curl --request POST \
  --url 'https://chrome.browserless.io/content?token=YOUR-API-KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://cnn.com"
  }'
Doing so returns purely the HTML of the page. Other libraries can help you extract content further if you wish, but some LLM pipelines can parse this HTML happily as-is:
<!DOCTYPE html><html lang="en" data-uri="cms.cnn.com/_pages/clg34ol9u000047nodabud1o2@published" data-layout-uri="cms.cnn.com/_layouts/layout-homepage/instances/homepage-domestic@published" class="userconsent-cntry-us userconsent-reg-ccpa"><head><script type="text/javascript" src="https://cdn.krxd.net/userdata/get?pub=e9eaedd3-c1da-4334-82f0-d7e3ff883c87&technographics=1&callback=Krux.ns._default.kxjsonp_userdata"></script><script type="text/javascript" src="https://beacon.krxd.net/optout_check?callback=Krux.ns._default.kxjsonp_optOutCheck"></script><script type="text/javascript" src="https://beacon.krxd.net/cookie2json?callback=Krux.ns._default.kxjsonp_3pevents"></script><script type="text/javascript" src="https://consumer.krxd.net/consent/get/e9eaedd3-c1da-4334-82f0-d7e3ff883c87?idt=device&dt=kxcookie&callback=Krux.ns._default.kxjsonp_consent_get_0"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/i/web/release/3.3.1/psm.legacy.min.umd.js"></script><script type="text/javascript" async="" src="//www.i.cdn.cnn.com/zion/zion-mb.min.js"></script><script async="" src="https://cdn.boomtrain.com/p13n/cnn/p13n.min.js"></script><script type="text/javascript" async="" src="https://lightning.cnn.com/cdp/psm/brands/cnn/web/release/psm.min.js"></script><script async="" src="//cdn.krxd.net/ctjs/controltag.js.d58f47095e6041e576ee04944cca45da"></script><script type="text/javascript" defer="" async="" src="https://z.cdp-dev.cnn.com/zfm/zfh-3.js"></script><script id="GPTScript" type="text/javascript" src="https://securepubads.g.doubleclick.net/tag/js/gpt.js"></script><script type="text/javascript" src="https://steadfastseat.com/v2svxFVJ-Mg82zHMJUHkQBWwVF721AsFf1Y3MomzEUqIMQlG6f2VaL6ctdsQc2VgA"></script><script type="text/javascript" async="" src="//www.ugdturner.com/xd.sjs"></script><script async="" src="//static.adsafeprotected.com/iasPET.1.js"></script><script async="" src="//c.amazon-adsystem.com/aax2/apstag.js"></script><script async="" 
src="https://vi.ml314.com/get?eid=64240&tk=GBYTTE9dUG2OqHj1Rk9DPOaLspvMWfLqV236sdkHgf03d&fp="></script><script async="" src="https://cdn.ml314.com/taglw.js"></script><script type="text/javascript" async="" src="https://sb.scorecardresearch.com/beacon.js"></script><script type="text/javascript" async="" src="//s.cdn.turner.com/analytics/comscore/streamsense.5.2.0.160629.min.js"></script><script type="text/javascript" async="" src="//cdn3.optimizely.com/js/geo4.js"></script><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}h1,h2,h3,h4,h5{font-weight:700}:root{--theme-primary:#cc0000;--theme-background:#0c0c0c;--theme-divider:#404040;--theme-copy:#404040;--theme-copy-accent:#e6e6e6;--theme-copy-accent-hover:#ffffff;--theme-icon-color:#e6e6e6;--t......
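If you want plain text out of that HTML without further service calls, a rough strip-tags pass covers simple cases. This is only an illustrative sketch; a production pipeline would use a real HTML parser (cheerio, jsdom, or similar) rather than regexes:

```javascript
// Rough HTML-to-text conversion: drop script/style blocks, strip the
// remaining tags, and collapse whitespace. A proper HTML parser handles
// edge cases (comments, CDATA, malformed markup) that this will miss.
const htmlToText = (html) =>
  html
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
```

Run against the CNN response above, this strips the long run of script tags and leaves only the readable text for your training set.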
Why use browserless to train an LLM?
You might ask why you’d use a service like browserless to do this at all. Many sites, including e-commerce, news, gaming, and more, use JavaScript to fetch additional resources and run as what’s called a single-page application. Because of this, you can’t simply “curl” the page or issue a system call to retrieve content, since the site assumes some level of JavaScript will need to run in order to fully render. Even Google runs a headless browser to generate data for its search index.
browserless unlocks these sites and services so you can ensure your LLM has the most up-to-date and complete content. A variety of options also lets you configure how long to wait, whether or not to block ads, or even whether JavaScript should run at all. Furthermore, if the site you’re interested in does some level of bot detection, browserless can get around most of it with stealth options and more. After all, it’s the same Chrome as the web browser you’re likely reading this blog on!
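As a sketch of what those options look like in practice, here is a Node.js helper that builds a Content API request with a few of them set. The query flags (`blockAds`, `stealth`) and the `gotoOptions` body field are assumptions based on browserless’s documented options; check the current API reference before relying on the exact names:

```javascript
// Build a Content API request with a few optional knobs set.
// Assumed parameter names: blockAds and stealth as query flags,
// gotoOptions (Puppeteer-style navigation options) in the JSON body.
const buildContentRequest = (token, targetUrl) => {
  const params = new URLSearchParams({
    token,
    blockAds: 'true', // skip ad requests for faster, cleaner pages
    stealth: 'true',  // enable bot-detection evasion
  });
  return {
    url: `https://chrome.browserless.io/content?${params}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        url: targetUrl,
        gotoOptions: { waitUntil: 'networkidle2', timeout: 60000 },
      }),
    },
  };
};

// Usage:
// const { url, options } = buildContentRequest('YOUR-API-KEY', 'https://cnn.com');
// const html = await (await fetch(url, options)).text();
```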
Bonus: Better content parsing with Functions!
If you don’t mind writing a little JavaScript, you can extract exactly the data you care about with the Function API. This handy API lets you submit code that runs on our service and returns only the data you want. Any valid JavaScript can be run here, so you’re free to do whatever you’d like (including using all of Puppeteer’s API).
As an example, let’s get just the first 1,000 characters from CNN while stripping the newline and carriage-return characters. This is easily done like so:
curl --request POST \
  --url 'https://chrome.browserless.io/function?token=YOUR-API-KEY' \
  --header 'Content-Type: application/javascript' \
  --data 'module.exports = async ({ page }) => {
    await page.goto('\''https://cnn.com'\'', { timeout: 120000 });
    const data = await page.evaluate(() => document.body.innerText);
    const cleaned = data.replace(/(\r\n|\n|\r)/gm, '\''. '\'');
    const trimmed = cleaned.substring(0, 1000);
    return {
      data: trimmed,
      type: '\''text'\'',
    };
  };'
Feel free to alter this code to do even more: remove navigation text, remove any image tags, and more!
Sign up and train your LLM today!
You can sign up for a free account at browserless.io and get your LLM project off the ground. Not sure where to go or how to use the service effectively? Let us know!