
What is web scraping? Here's what you need to know about the process of collecting automated data from websites, and its uses

Dave Johnson   

  • Web scraping is the process of using automated software, like bots, to extract structured data from websites.
  • There are many applications for web scraping, including monitoring product retail prices, lead generation, and analyzing sentiment about products and companies on social media.
  • Here's a brief overview of web scraping, its applications, and how it works.

Web scraping is the name given to the process of extracting structured data from third-party websites. In other words, it's a way to capture specific information from one or more websites without also copying unwanted or unrelated information. It's a common practice that has a lot of potential applications and a murky legal profile.

What to know about web scraping

Web scraping is usually an automated process, but it doesn't have to be; data can be scraped from websites manually, by humans, though that's slow and inefficient. More commonly, scraping is performed by software designed for the task, which generally has two main components: a crawler and a scraper. The crawler is a program that browses the internet and indexes the content of interest, then passes this information on to the scraper.
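To make the crawler's role concrete, here is a minimal sketch in Python. It is an illustration rather than a production design: the `PAGES` dictionary, the `shop.example` URLs, and the `crawl` function are all hypothetical, and the "website" lives in memory so the example runs without a network connection. A real crawler would fetch each page over HTTP and respect the site's terms and robots.txt.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "https://shop.example/": '<a href="/products/1">One</a> <a href="/about">About</a>',
    "https://shop.example/products/1": '<a href="/products/2">Next</a>',
    "https://shop.example/products/2": '<a href="/">Home</a>',
    "https://shop.example/about": "<p>About us</p>",
}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pages, is_of_interest):
    """Breadth-first crawl; returns the URLs of interest in discovery order."""
    seen = {start_url}
    queue = deque([start_url])
    found = []
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # page not available, nothing to index
        if is_of_interest(url):
            found.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return found


# Index only product pages; this list is what would be handed to the scraper.
product_urls = crawl("https://shop.example/", PAGES, lambda u: "/products/" in u)
print(product_urls)
```

The `seen` set keeps the crawler from revisiting pages that link back to each other, which is why the link from the second product page to the home page does not cause a loop.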

The scraper is designed to locate the relevant structured information using markers called data locators. These locators indicate the presence of the data, which the scraper then extracts and stores offline in a spreadsheet or database for processing or analysis.
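As a rough sketch of how a data locator might work, the Python example below treats a CSS class name as the locator and collects the text of every element that carries it. The `LocatorScraper` class, the `scrape` helper, and the sample page are assumptions made for illustration; real scrapers often use CSS selectors or XPath expressions through a dedicated parsing library instead.

```python
from html.parser import HTMLParser

# Hypothetical product page used only for this illustration.
PAGE = """
<div class="product">
  <h1 class="title">Example Kettle</h1>
  <p class="description">Boils water.</p>
  <span class="price">$24.99</span>
</div>
"""


class LocatorScraper(HTMLParser):
    """Collects the text of every element whose class list contains `locator`.

    Simplification: assumes no unclosed void tags (such as <br>) appear
    inside a matching element.
    """

    def __init__(self, locator):
        super().__init__()
        self.locator = locator
        self.depth = 0  # greater than 0 while inside a matching element
        self.values = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif self.locator in classes:
            self.depth = 1
            self.values.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.values[-1] += data


def scrape(html, locator):
    """Returns the stripped text of each element matched by the locator."""
    scraper = LocatorScraper(locator)
    scraper.feed(html)
    return [value.strip() for value in scraper.values]


prices = scrape(PAGE, "price")
titles = scrape(PAGE, "title")
print(prices, titles)
```

Everything outside the located elements, including the description, is ignored, so only the requested fields would reach the spreadsheet or database.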

One simple example of web scraping: Consider a website that aggregates pricing information for retail products so shoppers can see which retailers have the best prices. A crawler can be programmed to index the product pages at every major retailer, with the scraper then visiting each page and using data locators to zero in on just the price field and ignore all the other data on the page, such as the product description and reviews. The scraper can be run daily to update the aggregator's site with the latest pricing information from around the web.
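Once the price fields have been scraped, the remaining work is ordinary data processing. The sketch below assumes the scraper has already returned raw price text for three hypothetical retailers, and that prices use US-style formatting such as "$1,299.99". It converts the text to numbers and writes a CSV table, cheapest first, of the kind a spreadsheet could open.

```python
import csv
import io
from decimal import Decimal

# Hypothetical scraped results: retailer name -> raw text of the price field.
SCRAPED = {
    "Retailer A": "$1,299.99",
    "Retailer B": "$1,249.00",
    "Retailer C": "$1,310.50",
}


def parse_price(text):
    """Converts text such as '$1,299.99' to a Decimal (US-style format assumed)."""
    return Decimal(text.strip().lstrip("$").replace(",", ""))


def to_csv(scraped):
    """Returns CSV text with one row per retailer, cheapest first."""
    rows = sorted((parse_price(raw), name) for name, raw in scraped.items())
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["retailer", "price"])
    for price, name in rows:
        writer.writerow([name, price])
    return buffer.getvalue()


report = to_csv(SCRAPED)
print(report)
```

`Decimal` is used instead of `float` so that currency values are not distorted by binary rounding. Running the job daily, as described above, would simply mean regenerating this table from fresh scraped values.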

How web scraping is used

Because there is an enormous variety of data online, there is a wide variety of applications for web scraping. Here are some of the most common uses:

  • Price intelligence: Like the example above, many web scrapers are designed to monitor prices from retail sites. Retailers might use this to monitor prices at competitor sites, or the data might be used for competitive analysis, monitoring trends, or as a service to other users.
  • Real estate: Similarly, web scrapers commonly target real estate sites to monitor rental and sale prices, appraise property values in a given region, and conduct market analysis.
  • Lead generation: Marketers commonly use web scraping to generate leads by scraping structured data from websites like LinkedIn.
  • Sentiment analysis: Brands even use web scraping to understand how their products and services are being talked about online. Companies can collect data that mentions their name from social media sites like Facebook and Twitter.

The legality of web scraping

There's no easy answer to the question of web scraping's legality. The practice has faced a number of legal challenges dating back to 2000, when online auction site eBay sought an injunction, which the court granted, against a site called Bidder's Edge for scraping its auction data.

In the years since, there have been a number of additional challenges to web scraping, but in 2017 a court ruled against LinkedIn in its dispute with a business that was scraping its content. With precedent in the courts both for and against web scraping, the law remains unsettled, yet scraping is currently a common practice across the internet.
