Browserless now runs in LangChain
We’ve begun integrating Browserless with the popular LangChain AI library, starting with Browserless’ REST APIs. As of today, our content API is supported as a LangChain document loader. Using Browserless to get the contents of a webpage for ingestion into LangChain’s AI modules is as easy as importing the loader:
from langchain.document_loaders import BrowserlessLoader
The previous canonical way to get the contents of webpages in LangChain was the WebBaseLoader module. This module uses the requests library to make HTTP requests to the target URL. This is a perfectly valid way to get the contents of a webpage, but it has some drawbacks:
- It’s prone to encoding issues if Python assumes a different encoding than the webpage uses. LangChain users have reported seeing garbled non-ASCII characters in their text when using the WebBaseLoader, which is a symptom of this issue.
- It’s extremely vulnerable to anti-bot measures, since the most basic anti-bot tests can determine that the request is coming from an automated script and not a real browser.
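The encoding problem in the first bullet can be reproduced without any network access. The sketch below is illustrative only: the sample string is made up, and it assumes the mismatch is UTF-8 bytes decoded as Latin-1 (ISO-8859-1). That is one common way this kind of garbling arises, not necessarily what happens inside WebBaseLoader.

```python
# A page served as UTF-8 (the sample text is hypothetical).
original = "Café"
raw_bytes = original.encode("utf-8")

# Decoding with the wrong codec garbles every non-ASCII character.
garbled = raw_bytes.decode("latin-1")

# Decoding with the codec the page actually uses recovers the text.
recovered = raw_bytes.decode("utf-8")

print(garbled)
print(recovered)
```

A real browser reads the page’s declared encoding before rendering, which is why text extracted through one tends to avoid this class of error.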
Using the new BrowserlessLoader
Getting the contents of a webpage using the BrowserlessLoader takes just a few lines of code:
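from langchain.document_loaders import BrowserlessLoader

# Replace the placeholder with the API token from your Browserless account.
loader = BrowserlessLoader(
    api_token="YOUR_BROWSERLESS_API_TOKEN",
    urls=[
        "https://example.com/url0",
        "https://example.com/url1",
        "https://example.com/url2",
    ],
)

# load() returns a list of Document objects; print the text of the first one.
documents = loader.load()
print(documents[0].page_content)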
Simply sign up for a Browserless account, get your API token, and pass it to the BrowserlessLoader constructor along with a list of URLs. Call the load() method, and you’ll get back a list of Document objects, each of which has a page_content attribute that contains the text of the webpage.
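To keep the token out of source code, one option is to read it from the environment. This is a minimal sketch, not part of the Browserless or LangChain API: the variable name BROWSERLESS_API_TOKEN and the helpers loader_kwargs and load_pages are hypothetical names chosen for illustration.

```python
import os


def loader_kwargs(urls, env_var="BROWSERLESS_API_TOKEN"):
    """Build constructor arguments for BrowserlessLoader.

    The environment variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} to your Browserless API token")
    return {"api_token": token, "urls": list(urls)}


def load_pages(urls):
    """Fetch the pages through Browserless and return LangChain Documents."""
    # Imported here so loader_kwargs can be used without LangChain installed.
    from langchain.document_loaders import BrowserlessLoader

    return BrowserlessLoader(**loader_kwargs(urls)).load()
```

With the variable set, load_pages(["https://example.com/url0"]) returns the same kind of Document list as calling the loader directly.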
Extracting the contents of webpages can be a useful step in many different AI workflows. For example, you could use the BrowserlessLoader to get the contents of a webpage, and then use a long-context LLM like GPT-4 or Claude to extract particular fields from the text, even if they appear in different places across multiple webpages. You could get the contents of a blog post and then summarize it using LangChain’s LLM wrappers. You could keep tabs on an online forum by getting the contents of the forum’s pages and then using a classifier to identify posts that are relevant to you. LangChain has a thriving open-source community, so check out the LangChain GitHub for more ideas.
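As a rough illustration of the forum-monitoring idea, the sketch below filters loaded pages by keyword. It is a stand-in for a real classifier, not a LangChain feature: the Page class mimics only the page_content attribute of a Document, and is_relevant, relevant_pages, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Stand-in for a LangChain Document; only page_content is modeled."""
    page_content: str


def is_relevant(text, keywords):
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def relevant_pages(pages, keywords):
    """Keep only the pages whose text mentions at least one keyword."""
    return [page for page in pages if is_relevant(page.page_content, keywords)]


# Made-up sample data in place of loader.load() output.
pages = [
    Page("Headless Chrome crashes under load"),
    Page("Weekly off-topic thread"),
]
matches = relevant_pages(pages, ["chrome", "puppeteer"])
print(len(matches))
```

In practice you would swap is_relevant for a call to an LLM or a trained classifier and pass in the documents returned by the loader.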
In the short term, the LangChain team is working on modifying their RecursiveWebLoader wrapper class to support the BrowserlessLoader as a document loader. This will allow you to get the contents of a webpage and all of its child pages, recursively, using the BrowserlessLoader, which allows for stronger quality guarantees on the contents of the pages and a more robust way to handle anti-bot measures. This is closer to how larger companies crawl the web, so you can think of it as a more robust web crawler.
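Since that integration is still in progress, the following is only a conceptual sketch of recursive loading, not the RecursiveWebLoader API. The fetch_links callable is a hypothetical placeholder for whatever fetches a page and returns the URLs it links to, and the sample site is made up.

```python
from collections import deque


def crawl(root_url, fetch_links, max_depth=2):
    """Breadth-first walk over a page and its child pages.

    fetch_links(url) is a hypothetical callable returning the URLs linked
    from url. Only URLs that start with root_url count as child pages.
    """
    seen = {root_url}
    order = []
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # Do not follow links beyond the depth limit.
        for link in fetch_links(url):
            if link.startswith(root_url) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Made-up link graph standing in for real pages.
site = {
    "https://example.com/docs/": ["https://example.com/docs/a", "https://example.com/blog"],
    "https://example.com/docs/a": ["https://example.com/docs/", "https://example.com/docs/b"],
    "https://example.com/docs/b": [],
}
visited = crawl("https://example.com/docs/", lambda url: site.get(url, []))
print(visited)
```

The seen set prevents pages that link back to each other from being fetched twice, and the depth limit keeps the crawl bounded.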
In the long term, we’re looking into more seamless integrations between Browserless and LangChain, including controlling a stateful browser session from within LangChain. This opens the possibility of using LangChain to automate web tasks that require a browser, like filling out forms or interacting with a website’s UI. Stay tuned for more updates!