Lots of people, when they hear about ScraperWiki, ask, “Is scraping legal? How can you build a business off that?” Usually, they follow up by saying, “We do it in our company, but we would never tell anyone.”
This is strange to us, as we have come from a world of good scraping: taking government data and making it easier for people to use for things that benefit all of society. We’re in favor of that kind of scraping.
It’s obviously a spectrum. At the other extreme, the most evil scraping would be to steal content that somebody else sells, and then republish it in a way that harms their business. We’re against that kind of scraping.
It’s not scraping itself which is good or bad, or legal or illegal, but the circumstances in which you’re doing it.
We’ve written up our policy on legality in full; it’s in our FAQ under “What’s your policy on what’s legal to scrape?” It goes into detail about robots.txt and take-down notices, and about what our legal responsibility is and what yours is.
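As a small illustration of the robots.txt part, here is a minimal Python sketch of how a scraper might check a site's robots.txt rules before fetching a page. It uses the standard library's `urllib.robotparser`; the rules, the `example-bot` user agent, the `may_fetch` helper and the example.org URLs are all made up for this sketch, and it is not a description of what ScraperWiki itself does. Passing such a check says nothing on its own about whether a particular use of the data is legal.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "example-bot" and the example.org URLs are placeholders.
    for url in (
        "https://example.org/data/report.html",
        "https://example.org/private/report.html",
    ):
        print(url, may_fetch(EXAMPLE_ROBOTS_TXT, "example-bot", url))
```

Against a live site, you would instead point the parser at the site's own robots.txt with `set_url()` and `read()`.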
Getting data into a data hub
Finally, ScraperWiki isn’t just about scraping.
We’re a data hub, and you need to get data into a data hub. As well as scraping, lots of people make API calls to do that on ScraperWiki, or download their own files from their own servers.
This is much more profound than it sounds — when you are using data for a new purpose, even if it is already structured, you still need to get it and convert it to your new needs. How you do that is a detail that depends on the circumstances.
The difference between parsing HTML web pages and using a JSON REST API is surprisingly small. As an example, ScraperWiki’s Thomas Levine scraped EventBrite even though it has an API, because that was easier for him at the time.
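To make that concrete, here is a rough Python sketch that pulls the same two records out of an HTML fragment and out of a JSON response. Both inputs are invented for illustration (they are not real EventBrite pages or API output), as are the field names and the `EventParser`, `events_from_html` and `events_from_json` names. It uses only the standard library.

```python
import json
from html.parser import HTMLParser

# Invented sample inputs: the same two events as HTML and as JSON.
HTML_PAGE = """
<ul>
  <li class="event" data-date="2012-05-01">Open Data Meetup</li>
  <li class="event" data-date="2012-06-12">Scraping Workshop</li>
</ul>
"""

JSON_RESPONSE = """
{"events": [
  {"title": "Open Data Meetup", "start": "2012-05-01"},
  {"title": "Scraping Workshop", "start": "2012-06-12"}
]}
"""


class EventParser(HTMLParser):
    """Collect {"name", "date"} records from <li class="event"> elements."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._current_date = None  # set only while inside an event <li>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "event":
            self._current_date = attrs.get("data-date")

    def handle_data(self, data):
        if self._current_date is not None and data.strip():
            self.events.append({"name": data.strip(), "date": self._current_date})

    def handle_endtag(self, tag):
        if tag == "li":
            self._current_date = None


def events_from_html(page):
    parser = EventParser()
    parser.feed(page)
    return parser.events


def events_from_json(body):
    # Rename the API's fields to the names the rest of the code expects.
    return [{"name": e["title"], "date": e["start"]} for e in json.loads(body)["events"]]


if __name__ == "__main__":
    print(events_from_html(HTML_PAGE))
    print(events_from_json(JSON_RESPONSE))
```

Either way, the work is the same: fetch something, pick out the fields you care about, and reshape them into the records you need. Only the parsing step differs.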
Whether you’re using Nokogiri or Nestful, what matters is getting the data, and converting it into a form where it can do something useful for the world — and doing that legally.
A version of this post first appeared on the ScraperWiki blog.
Image courtesy of Flickr user bloomsberries.