Note: These docs cover v1 of Browserless, which is used on our shared cloud. For our v2 docs, please click here

/scrape API

The scrape API lets you get the contents of a page by specifying the selectors you're interested in, and returns a structured JSON response. You can also set a timeout option for asynchronously added elements.

The default behavior is to navigate to the specified URL, wait for the page to load (including parsing and executing JavaScript), then wait up to 30 seconds for the elements. All of these behaviors are configurable, and documented in detail below.

At a minimum, you'll need to specify a url and an elements array.

Check out this API schema defined in Swagger.

Examples

  1. Basic Usage
  2. Specifying page-load behavior
  3. Custom behavior with waitFor
  4. Cookies, headers and other options
  5. Element timeouts
  6. Debugging

Basic Usage

Below is the most basic usage, where we'll navigate to the example.com website (waiting for page-load) and parse out all a elements.

Internally we use document.querySelectorAll to retrieve all matches on a page. Using a more specific selector can narrow down the returned results.

Get the a elements on example.com

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "a"
    }
  ]
}

cURL request

curl -X POST \
  https://chrome.browserless.io/scrape?token=MY_API_TOKEN \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
  "url": "https://example.com/",
  "elements": [{
      "selector": "a"
  }]
}'

Once run, this will return a JSON payload like the following. We return the innerHTML and innerText of all matched selectors, as well as all of their attributes. The call above returns:

{
  "data": [
    {
      "selector": "a",
      "results": [
        {
          "html": "More information...",
          "text": "More information...",
          "attributes": [
            {
              "name": "href",
              "value": "https://www.iana.org/domains/example"
            }
          ]
        }
      ]
    }
  ]
}
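Since each matched element carries its attributes as name/value pairs, pulling specific values out of the response is a simple traversal. Here's a minimal Python sketch (not part of the API itself) that extracts every href from the sample response above:

```python
import json

# The sample /scrape response from the call above
response = json.loads("""
{
  "data": [
    {
      "selector": "a",
      "results": [
        {
          "html": "More information...",
          "text": "More information...",
          "attributes": [
            {"name": "href", "value": "https://www.iana.org/domains/example"}
          ]
        }
      ]
    }
  ]
}
""")

# Collect the href attribute of every matched element
hrefs = [
    attr["value"]
    for entry in response["data"]
    for result in entry["results"]
    for attr in result["attributes"]
    if attr["name"] == "href"
]
print(hrefs)  # ['https://www.iana.org/domains/example']
```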

Specifying page-load behavior

The scrape API allows for setting specific page-load behaviors via a gotoOptions object in the JSON body. This object is passed directly into puppeteer's goto() method.

In the example below, we'll set a waitUntil property and a timeout.

Get the H1 elements on example.com with custom goto options

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1"
    }
  ],
  "gotoOptions": {
    "timeout": 10000,
    "waitUntil": "networkidle2"
  }
}

With networkidle2, selectors are collected once there have been no more than two network connections for at least 500 milliseconds, and the navigation itself will time out after 10 seconds.

Custom behavior with waitFor

Sometimes it's helpful to perform further actions, or wait for custom events on the page, before collecting data. The waitFor property allows this behavior, and closely follows puppeteer's waitFor() method.

This property can accept one of three options:

  • A function to be run within the page's context, inside the browser.
  • A number indicating the time in milliseconds to wait.
  • A valid selector to wait for.

Waiting for a selector

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1"
    }
  ],
  "waitFor": "h1"
}

Waiting for 10 seconds

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1"
    }
  ],
  "waitFor": 10000
}

Waiting for a function

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1"
    }
  ],
  "waitFor": "() => document.querySelector('h1')"
}
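Because the request body is JSON, a waitFor function has to be sent as a string containing the JavaScript source, as in the example above. A quick Python sketch (illustrative only) of assembling such a payload:

```python
import json

# Build the /scrape payload; the waitFor function is plain JavaScript
# source shipped as a JSON string, not a Python callable.
payload = {
    "url": "https://example.com/",
    "elements": [{"selector": "h1"}],
    "waitFor": "() => document.querySelector('h1')",
}

body = json.dumps(payload)
print(body)
```

Sending body as the POST data (with Content-Type: application/json) produces the same request as the example above.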

Error handling

We catch and return errors for invalid functions inside of waitFor. For instance, passing () => document.querySelector('h1')) (note the unbalanced parenthesis) as waitFor will return an HTTP 400 code and the following text:

Evaluation failed: SyntaxError: Unexpected token ')'
at new Function (<anonymous>)
  at waitForPredicatePageFunction (__puppeteer_evaluation_script__:2:21)

Cookies, headers and other options

As with our other APIs, you can inject things like cookies and headers, and set other options, including intercepting requests and responses. These follow the patterns set by puppeteer's setCookie() and setExtraHTTPHeaders() methods.

Setting a special cookie

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1"
    }
  ],
  "cookies": [
    {
      "name": "my-special-cookie",
      "value": "foo-bar",
      "url": "https://www.example.com",
      "path": "/",
      "sameSite": "Strict"
    }
  ]
}

Adding an authentication header

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1"
    }
  ],
  "setExtraHTTPHeaders": {
    "Authorization": "Basic foo-bar"
  }
}

Intercepting requests

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1"
    }
  ],
  "rejectRequestPattern": ["png", "jpg"]
}

Element timeouts

browserless will wait up to 30 seconds for the specified elements to be inserted into the page. This is useful for single-page applications that load data on the fly. You can change this timer per element by specifying a timeout property.

Using a custom timeout of 10 seconds

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1",
      "timeout": 10000
    },
    {
      "selector": "a",
      "timeout": 5000
    }
  ]
}

Debugging

Without a doubt, one of the most frustrating aspects of scraping is debugging broken scripts. This is nothing new for browserless, and though we offer excellent debugging tools, we thought it'd be appropriate to add some extra information in the JSON response to help make debugging even easier.

As of today, we offer five payloads to help debug the page: html, screenshot, console, cookies and network. Each is listed below.

HTML: This is the raw HTML of the webpage, after all page-load and waitFor functions have run.

Screenshot: A full-page JPEG of the page encoded in base64.

Console: An array of all the various console messages the page has written out.

Cookies: An array of objects specifying all the cookies currently set on the page.

Network: An object with two properties: inbound and outbound, representing the outgoing requests as well as their inbound responses.

To get this debugging information, you'll need to request it in your JSON POST payload. These fields are optional because generating them takes additional resources away from other concurrent work. Add a debug object declaring which fields you want returned. Here's an example requesting all of them:

Requesting all debug fields

{
  "url": "https://example.com/",
  "elements": [
    {
      "selector": "h1",
      "timeout": 10000
    },
    {
      "selector": "a",
      "timeout": 5000
    }
  ],
  "debug": {
    "screenshot": true,
    "console": true,
    "network": true,
    "cookies": true,
    "html": true
  }
}
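The screenshot field comes back base64-encoded, so it needs decoding before it can be viewed. A Python sketch of writing it to disk, assuming the debug fields arrive under a top-level debug key (the field layout shown here is illustrative, not a guaranteed response shape):

```python
import base64

# Illustrative response fragment; stands in for a real /scrape reply.
# A real response's layout may differ.
response = {
    "debug": {
        "screenshot": base64.b64encode(b"\xff\xd8\xff...jpeg bytes...").decode()
    }
}

# Decode the base64 string back into raw JPEG bytes and save it
jpeg_bytes = base64.b64decode(response["debug"]["screenshot"])
with open("page.jpg", "wb") as f:
    f.write(jpeg_bytes)
```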
