ScraperWiki Digs Up Dirty Data So You Don’t Have To

    by Nicola Hughes
    September 19, 2011

    The best journalism comes from digging. Not phoning up press officers but speaking with those in the know. Not seeking comments from experts but going out onto the streets. The real stories, the scoops and the breakthroughs don’t come prepackaged. So why limit data-driven journalism to the relatively few sources of clean, pre-packaged and nicely delivered data?

    This is where ScraperWiki comes in. ScraperWiki is a developer platform that aims to liberate data from the web, build upon this information to make useful applications, and get journalists and developers working together in true Hacks/Hackers fashion!


    We let you think big when it comes to dirty data. Think global corporate data. OpenCorporates did. So we put out a call and gathered data on 10 percent of the world's companies in just two weeks.

    We let you think quick and dirty for timely news stories. The Texas Tribune scraped the Texas Department of Criminal Justice's website to build an interactive tracking executions carried out on Gov. Rick Perry's watch.

    We let you dig up dirt with dirty data. Even a simple little scraper can uncover facts for a front-page scoop as James Ball from The Guardian found out.


    Or even fish for a story like @OJCstatements, which tweets out reports on judicial complaints, attaching a hashtag to each.


    We’re an open wiki for coding scripts called “scrapers.” These extract data from the web — be it HTML, CSV or PDF files, or even data stored behind online forms.
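    To make the idea concrete, here is a minimal sketch of the kind of "scraper" script the platform hosts. It parses an HTML table into rows using only Python's standard library; real ScraperWiki scrapers typically used helper libraries and saved rows to a shared data store, but the extraction idea is the same. The sample HTML and values below are invented for illustration.

    ```python
    from html.parser import HTMLParser

    class TableScraper(HTMLParser):
        """Collects the text of each <td> cell, grouped by <tr> row."""
        def __init__(self):
            super().__init__()
            self.rows = []
            self._row = None
            self._in_cell = False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._row = []
            elif tag == "td":
                self._in_cell = True

        def handle_endtag(self, tag):
            if tag == "tr" and self._row:
                self.rows.append(self._row)
                self._row = None
            elif tag == "td":
                self._in_cell = False

        def handle_data(self, data):
            if self._in_cell and self._row is not None:
                self._row.append(data.strip())

    # In a real scraper the HTML would come from urllib.request.urlopen(url);
    # this snippet is hard-coded so the example is self-contained.
    html = ("<table><tr><td>Acme Ltd</td><td>1999</td></tr>"
            "<tr><td>Globex</td><td>2004</td></tr></table>")
    scraper = TableScraper()
    scraper.feed(html)
    print(scraper.rows)  # [['Acme Ltd', '1999'], ['Globex', '2004']]
    ```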

    Our platform is structured so that lots of developers can work on many scrapers feeding a data store, making large-scale data projects feasible for a deadline-driven newsroom.

    Our structure helps prevent these projects from decaying, meaning you can make interactives that stay up to date with the data rather than being used to wrap up tomorrow's virtual fish and chips.

    Because our platform is built on code, you can integrate other services, be it Mechanical Turk, Refine or any service with an interface to the data. You can also integrate the data output for other services such as RSS, Twitter or email.
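    As a hedged sketch of the "data output" side: once scraped rows are in hand, turning them into a machine-readable feed takes only a few lines. The example below builds a minimal RSS document with the standard library; the field names and feed contents are hypothetical.

    ```python
    import xml.etree.ElementTree as ET

    def rows_to_rss(rows, feed_title):
        """Wrap scraped rows in a bare-bones RSS 2.0 feed."""
        rss = ET.Element("rss", version="2.0")
        channel = ET.SubElement(rss, "channel")
        ET.SubElement(channel, "title").text = feed_title
        for row in rows:
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = row["title"]
            ET.SubElement(item, "description").text = row["description"]
        return ET.tostring(rss, encoding="unicode")

    # Hypothetical scraped row, for illustration only.
    feed = rows_to_rss(
        [{"title": "New judicial complaint filed", "description": "Report JC-101"}],
        "Scraper output",
    )
    print(feed)
    ```

    The same rows could just as easily be piped to a Twitter bot or an email digest; the scraper and the output format stay decoupled.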

    We have a data-digging community using ScraperWiki every day, so if you don't have programmers in your newsroom, or you need many more for a specific project, we can connect you with the right people.


    Thanks to funding from the Knight Foundation, here’s what we’re plugging into our system for you:


    • Data embargo, so journalists can keep their stories secret until going to print, but publish the data in a structured, reusable, public form with the story.
    • Data on demand service. Journalists often need the right data sourced quickly; we're going to create a smooth process for this.
    • News application hosting. We'll make hosting news apps scalable and easier.
    • Data alerts. Automatically get leads from changing data. For example, watch bridge repair schedules and get an email when one isn't being maintained.
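    The data-alerts idea above boils down to diffing successive snapshots of a dataset. A toy version, with invented bridge-inspection records (the field names are hypothetical):

    ```python
    def changed_records(old, new):
        """Return records in new whose values differ from, or are absent in, old."""
        alerts = []
        for key, record in new.items():
            if old.get(key) != record:
                alerts.append((key, record))
        return alerts

    # Two hypothetical scraper snapshots, a day apart.
    yesterday = {"Bridge A": {"last_inspected": "2011-06-01"},
                 "Bridge B": {"last_inspected": "2011-08-15"}}
    today = {"Bridge A": {"last_inspected": "2011-06-01"},
             "Bridge B": {"last_inspected": "2011-09-18"}}

    for name, record in changed_records(yesterday, today):
        # A real alert system would send an email here instead of printing.
        print(f"ALERT: {name} changed: {record}")
    ```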

    If you want to explore our platform, I suggest you try out this tutorial for non-programmers and check your answers here.

    You can also keep track of our progress by finding us on Twitter and Facebook and checking out the blog.

    We’ll be at ONA (we’re nominated for an award!) and Strata. Catch us if you can!

    Tagged: coding, data, data journalism, programmer-journalists, project intros, scrapers, scraperwiki

