How to run Scrapy through our headless browsers

contents

Scrapy is a hugely popular framework for web scraping and crawling.  Python developers love it due to its scalability and extensibility, as the community writes and maintains a lot of useful plugins. 

With Browserless, you can now use our fleet of headless browsers to make it even more powerful. Read on to find out more…

The limitations of default Scrapy and HTTP requests

The fact that it uses good ol’ HTTP requests keeps things simple, but has limitations such as:

  • It can only return an initial render of a page
  • It won’t execute any JavaScript code
  • Waiting for events or selectors isn’t possible
  • You can’t interact with the page such as clicking and scrolling
  • Bot detectors will easily spot that your request isn’t human

If you want to scrape a single page application built with something like React.js, Angular.js or Vue.js, you’re gonna need a headless browser.

Hosting headless browsers yourself is a nightmare

You might have tried scrapy-playwright or scrapy-splash as a solution. Scrapy-playwright lets you render a page with Playwright through a simple flag in the meta dictionary, while scrapy-splash lets you control a WebKit browser using Lua scripts.

They’re great, but also require you to host your own headless browsers and build your own infrastructure. 

It will be fine if you’re only running one or two browsers locally, but scaling them up to tens or hundreds of browsers in a cloud deployment is a constant headache. Chrome is notorious for memory leaks, breakages from updates and consuming huge amounts of resources.

If you really want to go this route, we have an article about managing Chrome on AWS. For people who’d rather not even touch a browser, you can use our /content API.

Get rendered content without running your own browsers, with our /content API

Browserless offers a /content API that returns the HTML content of a page, with the special difference that it’s content that’s been rendered and evaluated inside a browser.

The API receives a JSON payload with the website URL and, optionally, some configuration.


curl --request POST \
  --url 'https://chrome.browserless.io/content?token=GOES-HERE' \
  --header 'Content-Type: application/json' \
  --data ' {
  "url": "https://puppeteer.github.io/pptr.dev/",
  "gotoOptions": { "waitUntil": "networkidle0" }
}'

Which will return the full rendered and evaluated HTML content of the page. 

Let’s take a look at the legacy Puppeteer documentation website. It’s entirely dependent on JavaScript execution, so a cURL request will return an empty document. If you use our /content API instead, it can wait for the JavaSCript to be evaluated and return the document.

Preview of the returned content

Using the /content API in Scrapy

Luckily, Scrapy is highly extensible and allows you to run your custom HTTP queries to scrape the data. All you need to do is use the  start_requests() method to make a query to the /content API, while  keeping your scraping code the same.

Let’s continue with the Puppeteer Docs page:


import json
import scrapy


class PptrDocsSpider(scrapy.Spider):
    name = "pptr-docs"

    def start_requests (self):
        options = {
            "url": "https://puppeteer.github.io/pptr.dev/",
            "gotoOptions": {
                "waitUntil": "networkidle0"
            }
        }

        yield scrapy.Request(
            url="https://chrome.browserless.io/content?token=GOES-HERE",
            method='POST',
            dont_filter=True,
            headers={"Content-Type": "application/json"},
            body=json.dumps(options)
        )

    def parse(self, response):
        entries = response.css('sidebar-component a.pptr-sidebar-item')

        for entry in entries:
            yield{
                'title' : entry.css('::text').get(),
                'url' : entry.css('::attr(href)').get(),
            }

And you’ll get the data needed after the page is rendered!

JSON generated by Scrapy

Stealth flags, residential proxies and headful mode

Of course, a big advantage of running requests through a browser is that you can slip past more detectors.

Browserless offers a trio of tools to help with this, which you can enable through our /content API

Inbuilt proxies - our ethically sourced residential proxies can be set with a specified country

Stealth mode - this flag launches the browser in a way that hides many of the signs it’s automated

Headless=false - running Chrome as a “headful” browser makes it less obviously automated, while also automatically handling aspects such as setting realistic user agents

You can run the /content API with these three settings by adding them onto the end of the scrapy.Request URL like so:


    def start_requests (self):
        options = {
            "url": "https://puppeteer.github.io/pptr.dev/",
            "gotoOptions": {
                "waitUntil": "networkidle0"
            }
        }

        yield scrapy.Request(
            url="https://chrome.browserless.io/content?token=GOES-HERE&stealth&headless=false&proxy=residential&proxyCountry=us",
            method='POST',
            dont_filter=True,
            headers={"Content-Type": "application/json"},
            body=json.dumps(options)
        )

Page interactions, waiting for selectors and more

As with all our other APIs, there's a large array of details you can configure to make the most out of it. You can set cookies, execute custom JS code, wait for selectors to appear on the page and more.

If you need to wait for a specific selector to appear in the DOM, to wait for a timeout to happen, or to execute a custom function before scraping, you can set the waitFor property in your JSON payload. This property mirrors Puppeteer's waitFor() method, and can accept one of these parameters:

  • A valid CSS selector to be present in the DOM.
  • A numerical value (in milliseconds) to wait.
  • A  custom JavaScript function to execute prior to returning.

For example, with this options, Browserless will wait for any a.pptr-sidebar-item elements before returning the page from the browser:


    def start_requests (self):
        options = {
            "url": "https://puppeteer.github.io/pptr.dev/",
            "waitFor": "a.pptr-sidebar-item"
        }

        yield scrapy.Request(
            url="https://chrome.browserless.io/content?token=GOES-HERE",
            method='POST',
            dont_filter=True,
            headers={"Content-Type": "application/json"},
            body=json.dumps(options)
        )

You can use waitFor to execute a custom function and interact with the page. You’re also able to click elements, set element’s values and anything else that you can do inside the JS virtual machine.

Keep in mind that this function is meant to be executed without changing the page or causing navigation, as navigating during the execution of the code can make things break.

If you need to inject some cookies or HTTP headers, it’s as easy as to include them in the payload, matching puppeteer's setCookie() and setExtraHTTPHeaders() methods.


    def start_requests (self):

        options = {
            "url": "https://puppeteer.github.io/pptr.dev/",
            "setExtraHTTPHeaders": {
                "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
            },
            "cookies": [
                {
                    "name": "very-tasty-cookie",
                    "value": "choco-chip",
                    "url": "https://puppeteer.github.io",
                    "path": "/",
                    "sameSite": "Strict"
                }
            ],
            "gotoOptions": { "waitUntil": "networkidle0" }
        }

        yield scrapy.Request(
            url="https://chrome.browserless.io/content?token=GOES-HERE",
            method='POST',
            dont_filter=True,
            headers={"Content-Type": "application/json"},
            body=json.dumps(options)
        )

For full reference of the values accepted by the /content API, please refer to our GitHub repo.

Conclusion

Scrapy is great, but it has limitations such as not being able to handle JS-heavy websites. 

Browserless can overcome these obstacles with our production-ready API. With a simple modification to your Scrapy projects, you can render and evaluate dynamic content, handle HTTP authentication, run custom code and more, all without sacrificing Scrapy’s strengths.

Want to run your Scrapy scripts with Browserless? Sign-up for a free trial!

Share this article

Ready to try the benefits of Browserless?

Sign Up