Distil Networks

Video Blog: Why it’s easy to scrape your content


Here at Distil, we often get asked how difficult it is for someone to build a web scraper and begin capturing content. I thought I’d show you exactly how easy it is for someone else to scrape your content and information.

Whether it’s the theft of a publisher’s content, competitive price scraping, harvesting of personal or financial user information, or even database mining, web scraping is a pain point and a threat that most businesses are not prepared to handle.

In fact, most of our clients are surprised when we show them how often web scrapers have been bypassing their protective measures, completely undetected.

So we hired a contractor online to scrape multiple data sets and join them together. To avoid causing any harm, we used only data that was free and available to the public. Here’s what we asked for:

We are conducting a research project on web scraping. This is for non commercial use.

We need someone to:

1. Extract all records from the 2012 N.C. Government Salary Database (This data is free and public record) (Link: http://www.charlotteobserver.com/2012/03/31/418239/nc-government-salary-database.html)

2. Refine the data to only include people who are 85 years or older.

3. Go to the following directory and harvest the email and phone number for each of these people, where available. (Link: http://www.ncgov.com/directory.aspx)

Please append the email and phone information to the original data and provide us with an Excel or CSV file.

Thank you.
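
The filter-and-join part of that request (steps 2 and 3) is simple to express in code. The following is a minimal Python sketch, not the contractor's actual method: the sample rows, the column names (`name`, `agency`, `age`, `email`, `phone`), and the helper functions are all hypothetical, and it assumes the two sources can be matched on a normalized name.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions, not the real schemas.
salary_rows = [
    {"name": "Jane Doe", "agency": "Dept A", "age": "87"},
    {"name": "John Roe", "agency": "Dept B", "age": "52"},
    {"name": "Ann Poe", "agency": "Dept C", "age": "85"},
]
directory_rows = [
    {"name": "jane  doe", "email": "jane.doe@example.com", "phone": "555-0100"},
]


def normalize(name):
    """Lowercase and collapse whitespace so names from both sources compare equal."""
    return " ".join(name.lower().split())


def filter_and_join(salaries, directory, min_age=85):
    """Keep records at or above min_age and append contact details where available."""
    contacts = {normalize(row["name"]): row for row in directory}
    joined = []
    for row in salaries:
        if int(row["age"]) < min_age:
            continue
        # Missing directory entries yield empty contact fields ("where available").
        contact = contacts.get(normalize(row["name"]), {})
        joined.append({
            **row,
            "email": contact.get("email", ""),
            "phone": contact.get("phone", ""),
        })
    return joined


def to_csv(rows):
    """Serialize the joined rows to CSV text."""
    buffer = io.StringIO()
    fields = ["name", "agency", "age", "email", "phone"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


result = filter_and_join(salary_rows, directory_rows)
print(to_csv(result))
```

Matching on a name alone is fragile in practice, since duplicates and spelling variants produce wrong or missing joins, but the sketch shows how little logic is needed once both data sets are in hand.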

The Results:
  – 26 Bids on the Project
  – Data scraped and delivered within 24 hours
  – 93K Records Captured
  – Total cost: $48.00

With less than 10 minutes of work and zero technical background, we were able to extract and join multiple data sets that some people might not want to see combined. Now consider a web scraper that is far more sophisticated and is targeting your business. It’s not that hard to imagine. We identify and block them every day.
