Video Blog: Why it’s easy to scrape your content
Here at Distil, we are often asked how difficult it is for someone to build a web scraper and begin capturing content. I thought I’d show you exactly how easy it is for someone else to scrape your content and information.
Whether it’s the theft of publishers’ content, competitive price scraping, harvesting of personal or financial user information, or even database mining, web scraping is a pain point and a threat that most businesses are not prepared to handle.
In fact, most of our clients are surprised when we show them how often web scrapers have been bypassing their protective measures, completely undetected.
So we hired a contractor from the web to scrape multiple data sets and join them together. To avoid causing any harm, we used only data that was free and publicly available. Here’s what we asked for:
We are conducting a research project on web scraping. This is for non-commercial use.
We need someone to:
1. Extract all records from the 2012 N.C. Government Salary Database (This data is free and public record) (Link: http://www.charlotteobserver.com/2012/03/31/418239/nc-government-salary-database.html)
2. Refine the data to only include people who are 85 years or older.
3. Go to the following directory and harvest the email and phone number for each of these people, where available. (Link: http://www.ncgov.com/directory.aspx)
Please append the email and phone information to the original data and provide us with an Excel or CSV file.
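To give a sense of how little code the filtering and joining steps in that request involve, here is a minimal Python sketch. It is not the contractor's actual script, which we don't reproduce here. The column names (name, age, salary, email, phone), the matching on lowercased names, and the sample rows are all assumptions made up for illustration, and the sketch makes no network requests at all.

```python
import csv
import io


def filter_and_join(salary_rows, directory_rows, min_age=85):
    """Keep records at or above min_age and append contact details by name.

    Both inputs are lists of dicts. The keys used here ("name", "age",
    "email", "phone") are hypothetical column names, not the real schema.
    """
    # Index the directory by a normalized name so each lookup is one dict access.
    contacts = {row["name"].strip().lower(): row for row in directory_rows}

    joined = []
    for row in salary_rows:
        if int(row["age"]) < min_age:
            continue  # step 2 of the request: drop anyone under the age cutoff
        match = contacts.get(row["name"].strip().lower(), {})
        # Step 3: append email and phone "where available", else leave blank.
        joined.append({
            **row,
            "email": match.get("email", ""),
            "phone": match.get("phone", ""),
        })
    return joined


def to_csv(rows):
    """Serialize the joined rows to CSV text (empty string if there are no rows)."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample data standing in for the two scraped sources.
    salary = [
        {"name": "Pat Example", "age": "87", "salary": "41000"},
        {"name": "Sam Sample", "age": "52", "salary": "58000"},
    ]
    directory = [
        {"name": "Pat Example", "email": "pat@example.com", "phone": "555-0100"},
    ]
    print(to_csv(filter_and_join(salary, directory)))
```

A real job would also have to deal with duplicate names and inconsistent formatting between the two sources, but the core of the join is one dictionary lookup per record.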
– 26 Bids on the Project
– Data scraped and delivered within 24 hours
– 93K Records Captured
– Total cost: $48.00
With less than 10 minutes of work and zero technical background, we were able to extract and join multiple data sets that some people might not want put together. Now consider a web scraper that is far more sophisticated and targeting your business. It’s not that hard to imagine. We identify and block them every day.
About the Author
Sean Harmer is a Co-Founder of Distil Inc, the first cloud-based data and content protection network. Sean brings over 12 years of experience in the technology, marketing, and communications industries, serving a wide range of commercial Fortune 500 businesses, national nonprofit organizations, and local startup ventures.