What is Data Scraping? The Tutorial for Beginners.
Suppose you want to get large amounts of information from a website as quickly as possible. How can this be done? In this article, we’ll cover what data scraping is, why you would want to do it, how data scrapers work, and the steps involved in scraping a website. I’ll also include a quick example to reference.
What is data scraping?
If you’ve ever copied and pasted content from a website into a different location, you have done a very manual version of data scraping. In this article, we will use software applications to do the data scraping for us.
Data scraping is the process of using an application to extract valuable information from a website. This will allow us to obtain large amounts of data from websites in a short amount of time. Many of the larger websites like Google, Facebook, and GitHub have APIs that allow you to access their data. This is super convenient because the data will be given to you in a structured format that is easy to consume.
Unfortunately, that isn’t always the case. There will be other times when you will have to collect content and data for your specific use case. This is where web and data scraping applications come in handy. You can program these scraping applications to visit websites and extract the content/data that you want. The obvious benefit of this is being able to get the precise data that you want easily and efficiently.
Data scraping comprises two parts: the crawler and the scraper. The crawler is the algorithm we create to browse the web and find the exact data we want. An example would be navigating to a specific website and clicking through to the page where the content you want lives. Once the crawler has found that data, the scraper takes over. The scraper “scrapes” the data from the page: with it, you can pick out the data points you want and export them in whatever format works best for you.
Once the data has been exported, the fun can begin. You are free to use that data however you need.
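To make the crawler half a bit more concrete, here is a minimal sketch of one crawler decision: given the links found on a page, which ones should be visited next? The function name `linksToVisit` and the example URLs are hypothetical, and a real crawler would also need rate limiting and a check of the site's robots.txt.

```javascript
// Crawler step: from the raw href values found on a page, keep only the
// same-site links, resolved to absolute URLs and de-duplicated.
function linksToVisit(pageUrl, hrefs) {
  const origin = new URL(pageUrl).origin;
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // resolves relative links against the page
    } catch {
      continue; // skip hrefs that cannot be parsed at all
    }
    url.hash = ''; // "#section" variants point at the same page
    if (url.origin === origin) seen.add(url.href);
  }
  return [...seen];
}

// Hypothetical example: two spellings of the same page, one external link,
// and one mailto link collapse to a single URL to visit.
console.log(
  linksToVisit('https://example.com/shop', [
    '/chairs',
    'https://example.com/chairs#reviews',
    'https://other.example.org/x',
    'mailto:a@b.c',
  ])
);
```

The scraper half would then run on each URL this function returns.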
Data scraping examples & use cases
While reading this article you’ve probably wondered, “what are some good use cases for web/data scraping?” Let’s go over a couple of these use cases.
The first one is my favorite: price monitoring. You can use price monitoring to keep track of prices and make sure you are finding the best deal. Who doesn’t want to save money? I have written a previous blog post about scraping Amazon.com to monitor the prices of specific products. Companies can also use price scraping to see how competitors are pricing similar products. This lets them price their own products competitively and maximize revenue.
Contact scraping is another way data scraping is used. Companies and individuals can scrape the web for contact information to use for e-mail marketing. Typical sources are an online employee directory or a bulk mailing list. We shared this method in the Google Maps scraping guide. This use case is very controversial, and collecting this type of data often requires permission. If you have ever visited a website and given them access to your contact information in exchange for using their software, you have permitted them to collect personal data like your e-mail address and phone number.
The last use case we’ll go over is news monitoring. Many individuals and companies can scrape news sites to stay current on stories and issues relevant to them. This could be especially useful if you are trying to create a feed of some type, or if you just need to keep up with day-to-day reports.
How does web scraping work?
Next, let’s go over how data scrapers work. Scrapers can take all the content on a web page or just the specific data that you want. In many situations, it is best to pinpoint the specific data you want so that the scraper can extract it quickly. For example, in the Amazon web scraping blog post that I mentioned earlier, we looked at pricing for office chairs. In that instance, we only needed the price of each chair and the title of the item. That let the scraper skip the unneeded clutter, so the script ran relatively fast.
When scraping a site, the first thing the script or application needs is a URL. The scraper navigates to that web page and loads its HTML. Once the HTML has loaded, the scraper can collect the data you want. Lastly, the collected data is output in a format chosen by the user. Usually this is a JSON file, but it can also be saved in other formats, such as an Excel spreadsheet or a CSV file.
Now that we know how a data scraper functions, let’s identify some preliminary steps that are needed before you try to scrape a website yourself. There are many cool tools and software applications out there that help with scraping websites. Because of this, we will stay at a high level and focus on the basics.
First, find the URLs you want to scrape. This might seem obvious, but it is a key factor in how well your data scrape will work. If the URL you give the scraper is even slightly incorrect, the data you get back will not be what you want or, even worse, your scraper won’t work at all. For instance, if you’re doing a price monitoring scrape, make sure your URL points to a relevant site like Amazon or Google Shopping.
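The final step, writing the collected data out in a chosen format, can be sketched in a few lines. This is a minimal illustration, not a full exporter: the `records` array is made-up sample data, and `csvField` and `toCsv` are hypothetical helper names. JSON comes for free with `JSON.stringify`, while CSV needs a little care with quoting.

```javascript
// Hypothetical records, shaped like the title/price pairs a scraper might collect.
const records = [
  { title: 'Ergonomic Office Chair, Black', price: '129.99' },
  { title: 'Mesh Chair "Pro"', price: '89.50' },
];

// Quote a single CSV field: wrap it in double quotes when it contains a comma,
// a double quote, or a line break, and double any embedded quotes.
function csvField(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn an array of flat objects into CSV text, using the keys of the first
// record as the header row.
function toCsv(rows) {
  if (rows.length === 0) return '';
  const columns = Object.keys(rows[0]);
  const lines = rows.map((row) => columns.map((c) => csvField(row[c])).join(','));
  return [columns.join(','), ...lines].join('\n');
}

const asJson = JSON.stringify(records, null, 2); // pretty-printed JSON text
const asCsv = toCsv(records);
console.log(asCsv);
```

Either string can then be written to disk with `fs.writeFileSync`. For Excel output you would typically reach for a dedicated library, or simply open the CSV file in Excel.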
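A cheap safeguard is to check the URL before handing it to the scraper. The sketch below uses the `URL` class built into Node.js and browsers; the function name `normalizeTargetUrl` is hypothetical. Note that it only catches malformed URLs, such as a missing or mistyped scheme. It cannot tell you whether a well-formed URL points at the right page.

```javascript
// Return a normalized URL string if `input` is a well-formed http(s) URL,
// otherwise null.
function normalizeTargetUrl(input) {
  let url;
  try {
    url = new URL(input.trim());
  } catch {
    return null; // not parseable as an absolute URL (e.g. missing "https://")
  }
  // A typo such as "htps://" still parses, so check the scheme explicitly.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
  return url.href;
}

console.log(normalizeTargetUrl('https://news.ycombinator.com'));
console.log(normalizeTargetUrl('news.ycombinator.com'));
console.log(normalizeTargetUrl('htps://news.ycombinator.com'));
```

The first call returns a usable URL; the other two return `null`, which your script can treat as a signal to stop before any scraping starts.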
Secondly, inspect the webpage by opening your browser’s developer tools (F12 in most browsers) to identify what your scraper needs to scrape. If we use the same Amazon price monitoring example, you could go to a search result page on Amazon, inspect that page, and locate where the price is in the HTML code.
Once you have that, identify the unique tags around the price so you can use them in your data scraper. Good candidates are div tags with IDs or very specific class names.
After finding the code you want to collect and use, you’ll want to incorporate this into your data scrape. This could be writing a script with the IDs or class names that you found in the previous step or simply inputting the tags into scraping software. You’ll also probably want to add supporting information to help when displaying your data. Sticking with our Amazon example, if you are collecting the price of office chairs on Amazon it would be nice to also have the title of the item to accompany the price.
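As a rough sketch of what that script might look like, the function below collects a title and a price for every item on a page. The class names in `SELECTORS` are hypothetical placeholders, not Amazon’s real markup; swap in whatever IDs or class names you found while inspecting the page.

```javascript
// Hypothetical selectors: replace them with the IDs or class names you found
// while inspecting the page.
const SELECTORS = {
  item: '.product-card',
  title: '.product-title',
  price: '.product-price',
};

// Read the trimmed text of the first element matching `selector` inside
// `parent`, or null when nothing matches.
function textOf(parent, selector) {
  const el = parent.querySelector(selector);
  return el ? el.textContent.trim() : null;
}

// Collect one { title, price } record per matching item. `root` is anything
// exposing querySelectorAll, such as `document` in the browser console.
function extractProducts(root, selectors = SELECTORS) {
  return Array.from(root.querySelectorAll(selectors.item), (item) => ({
    title: textOf(item, selectors.title),
    price: textOf(item, selectors.price),
  }));
}
```

You can try it by pasting the code into the developer tools console and calling `extractProducts(document)`. Returning `null` for a missing field, rather than throwing, keeps one odd item from breaking the whole scrape.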
Once you’ve specified the tags in your script or scraping application, you’ll want to execute the code. This is where all the magic happens. Everything that we talked about in the above section about how data scrapers work comes into play here.
Hopefully, you now have the data you need to start building your application. Whether that is a dashboard of charts, a cool table, or a sweet content feed, the data is yours to do with as you like. You will often get data back that you don’t expect. That is completely normal. Just like anything else in the engineering world, one tiny thing being off can throw the results off. Don’t get discouraged! Practice makes perfect and you will catch on.
All the basics have been covered for scraping the web. Before we end, I want to mention a cool tool that allows you to do data scraping. Browserless is a headless Chrome browser as a service. You can use Browserless with libraries like Puppeteer or Selenium to automate web-based tasks like data collection. To learn more, make sure to visit the Browserless website where you can find blog posts, documentation, debuggers, and other resources.
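One way to cope with unexpected data is to validate each value before using it. Here is a minimal sketch for prices, assuming US-style formatting (comma for thousands, dot for decimals); `parsePrice` and the sample strings are hypothetical.

```javascript
// Convert a scraped price string such as "$1,299.99" into a number.
// Returns null when the text does not contain a number, so bad rows can be
// filtered out or logged instead of silently breaking a chart or table.
function parsePrice(text) {
  if (typeof text !== 'string') return null;
  const match = text.replace(/,/g, '').match(/\d+(\.\d+)?/);
  return match ? Number(match[0]) : null;
}

// Made-up sample of what a scrape might return, including two bad values.
const scraped = ['$1,299.99', ' $89 ', 'Currently unavailable', null];
const prices = scraped.map(parsePrice);
console.log(prices); // logs [ 1299.99, 89, null, null ]
```

Counting how many values come back as `null` is also a quick health check: a sudden jump usually means the site’s markup changed and your selectors need updating.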
In this example, we are going to do a simple data scrape of the Y Combinator Hacker News feed. You can also run this example in the Browserless debugger tool. For this, we will use two main tools, Puppeteer and Browserless. In the above paragraph, I mentioned these tools with corresponding links. I highly recommend you check them out before diving into the example.
Alrighty, let’s get to it! We are going to start with the initial setup. Luckily for us, there aren’t many dependencies we need to install. There’s only one… Puppeteer.
npm i puppeteer
Once you’ve run that command you are good to go! 👍
Let’s look at how to set up Browserless real quick.
For starters, you’ll need to set up an account. Once you get your account set up, you’ll be directed to your Browserless dashboard.
Here you can find your session usage chart, prepaid balance, API key, and all of the other account goodies.
Keep your API key handy as we’ll use that once we start writing our script. Other than that we are ready to start coding!
Ok now for the exciting part! Here is the code example:
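What follows is a minimal sketch of such a script, written under a few assumptions rather than a definitive implementation. The Browserless WebSocket endpoint and the `YOUR_API_KEY` placeholder should be checked against your dashboard, the `.titleline > a` selector is a guess at the current Hacker News markup, and the `hacker-news.json` file name and the `cleanStories` helper are hypothetical. It is laid out so that line 4 launches a local browser and line 5 (commented out) connects to Browserless.

```javascript
const fs = require('fs'); // used to write the JSON output file
async function scrapeHackerNews() {
  const puppeteer = require('puppeteer'); // installed earlier with `npm i puppeteer`
  const browser = await puppeteer.launch({ headless: false }); // local browser, visible window
  // const browser = await puppeteer.connect({ browserWSEndpoint: 'wss://chrome.browserless.io?token=YOUR_API_KEY' });
  try {
    const page = await browser.newPage();
    await page.goto('https://news.ycombinator.com', { waitUntil: 'domcontentloaded' });
    // Assumption: each story title is a link directly inside a ".titleline" span.
    const raw = await page.$$eval('.titleline > a', (links) =>
      links.map((a) => ({ title: a.textContent, link: a.href }))
    );
    // The output file name is arbitrary.
    fs.writeFileSync('hacker-news.json', JSON.stringify(cleanStories(raw), null, 2));
  } finally {
    await browser.close(); // always release the browser, even if scraping fails
  }
}

// Trim titles and drop entries that are missing a title or a link.
function cleanStories(stories) {
  return stories
    .map(({ title, link }) => ({ title: (title || '').trim(), link: link || '' }))
    .filter((story) => story.title !== '' && story.link !== '');
}

// Run the scrape and report any failure instead of crashing with a stack trace.
scrapeHackerNews().catch((err) => console.error('Scrape failed:', err.message));
```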
Lines 4 and 5 are very important in this example.
Line 4 uses Puppeteer when running your script. This is good for testing so you can view how the browser is interacting with your script.
Line 5 is the Browserless connection. This is where you can add your API key which will link up to your Browserless account and allow you to run your script with Browserless.
Make sure one of these two lines is commented out. You only need one.
Alrighty, that is all you need for this example. Once things are installed and the code is implemented, you can open up your preferred command-line interface in your project and run
node [insert name of js file here].
The output should be a JSON file with real-time titles and links from the Hacker News feed!
I hope this article on data scraping was intriguing and exciting. There are endless possibilities as to what you can accomplish with web and data scraping. I hope this sparks some cool projects for some of you.
Happy coding! ❤️
Important things/notes to keep in mind
- You can also try out the Docker setup of browserless.io locally to speed up prototyping and testing your scripts. Reach out to firstname.lastname@example.org about licensing for self-hosting. You can learn more about it here
- Browserless.io has a beautiful setup for development. While developing, connect Puppeteer to the locally installed browser rather than to your actual browserless.io setup.
- You can always keep an eye on the analytics via your browserless.io dashboard.
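One way to switch between a local setup and the hosted service is to pick the WebSocket endpoint from the environment. This is a sketch under stated assumptions: a local Browserless Docker container listening on port 3000, the hosted endpoint shown below, and an API key stored in a `BROWSERLESS_API_KEY` environment variable. The function and variable names are hypothetical.

```javascript
// Pick a WebSocket endpoint for puppeteer.connect().
// Without an API key, fall back to a local container for prototyping.
function browserlessEndpoint(env = process.env) {
  const token = env.BROWSERLESS_API_KEY;
  if (!token) return 'ws://localhost:3000'; // assumed local Docker port
  return `wss://chrome.browserless.io?token=${encodeURIComponent(token)}`;
}

console.log(browserlessEndpoint({})); // no key set: local endpoint
```

In a script you would pass the result to Puppeteer, for example `puppeteer.connect({ browserWSEndpoint: browserlessEndpoint() })`, which also keeps the API key out of your source code.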