Data is a buzzword nowadays. Whether it’s sifting Big Data to influence business, or the promise of Open Data to transform government, or Data Analytics winning elections, data is constantly in the news. But one thing that gets glossed over in all the buzz is that data is hard. Really, really hard. One of the hardest parts is cleaning, standardizing, and formatting data in a way that journalists and others can start to work with. These are real challenges faced by newsrooms, and we’re hoping to make some of that a little easier with two new Code Sprints we’re happy to announce.
First up: Dedupe
One of the biggest problems with datasets is figuring out if information in one set of data is the same as information in another. When you have a small set of data, the work is pretty straightforward. But as your rows increase, the work becomes daunting. Derek Eder and Forest Gregg at Chicago’s DataMade have been working on an automated process for deduplication of data, and we’re happy to help get it to a state where running it on huge datasets is as simple as a few calls from the command line.
A clear early use for the tool is in deduplicating campaign finance records, which can often be a slog. We’ve recruited Derek Willis and others from the New York Times to help guide the development process as it progresses. As Derek explains, “One of our biggest problems in working with political data is the lack of canonical identifiers for people, and anything that helps us to address this issue is worth a good look, especially if it can relieve newsrooms of time-consuming and error-prone manual standardization efforts.”
The DataMade team has done a great deal of heavy lifting already (“we’ve solved most of the major engineering challenges of scaling up on large datasets,” DataMade’s Eder says), but lowering the barrier to entry on the tool is time and money well spent. If you can program in Python, you can fork and start running Dedupe today. If you want to wait for the simplified version, we’re expecting development to wrap up early this summer.
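To make the problem concrete, here is a minimal sketch of the naive approach: compare every pair of names and flag the ones that look alike. This is not how Dedupe works; it uses only Python's standard `difflib`, and the donor names, the `normalize` helper, and the 0.85 threshold are all made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_duplicates(names, threshold=0.85):
    """Compare every pair of names; keep pairs at or above the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical donor names, not taken from any real filing.
donors = ["John A. Smith", "john a smith", "Jon A. Smith", "Maria Garcia"]
for a, b, score in likely_duplicates(donors):
    print(a, "|", b, "|", score)
```

Because every record is compared with every other record, the number of comparisons grows with the square of the row count: 1,000 rows means roughly 500,000 pairs, and a million rows means roughly 500 billion. That is why the work becomes daunting as datasets grow, and why a purpose-built tool is worth the effort.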
Next up: FMS Parser
The U.S. Treasury releases a statement of, essentially, the federal government’s checkbook every day at 4 p.m. ET. Unhelpfully, they release it as a straight-up text file or a PDF. Newsroom developers and info-hackers Cezary Podkul, Burton DeWilde, Thomas Levine, Jake Bialer, Brian Abelson, and Michael Keller started work on scraping and parsing that daily statement at the Bicoastal Datafest earlier this year.
The team got far enough along at the Datafest that they approached us about helping to turn it into an open API that any newsroom developer can access. With our Code Sprint grant, the team will take this once nearly inaccessible dataset and transform it into an easily accessible API that returns machine-readable JSON. In this time of cutbacks and budget wrangling, the FMS Parser should offer developers and journalists a new way to dive deeply into governmental spending.
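As a rough idea of what going from a text file to machine-readable JSON involves, here is a small sketch. It is not the FMS Parser's code, and the sample text is invented: the layout, the row labels, and the `today` and `month_to_date` field names are assumptions, since the real statement's format isn't described here.

```python
import json
import re

# Invented text in the general style of a plain-text financial table.
# Labels, amounts, and column meanings are hypothetical.
SAMPLE = """\
SAMPLE TABLE (made-up values)
Example deposits category      $  1,234   $  2,345
Example withdrawals category   $    567   $    890
"""

# A label, then two dollar amounts separated by runs of spaces.
ROW = re.compile(
    r"^(?P<label>.+?)\s{2,}\$\s*(?P<today>[\d,]+)\s+\$\s*(?P<mtd>[\d,]+)\s*$"
)


def parse_rows(text):
    """Turn matching text lines into dictionaries; skip everything else."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # titles, blank lines, and footnotes fall through here
        rows.append({
            "item": match.group("label").strip(),
            "today": int(match.group("today").replace(",", "")),
            "month_to_date": int(match.group("mtd").replace(",", "")),
        })
    return rows


print(json.dumps(parse_rows(SAMPLE), indent=2))
```

A real parser has to cope with much more than this (multiple tables, footnotes, layouts that shift over time), which is the kind of work the sprint is meant to fund.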
The tool should see some immediate use too, as the team working on it includes newsroom developers at Reuters, the Daily Beast, and the Huffington Post (along with our Knight-Mozilla Fellow at the New York Times). While it’s still being developed, you can fork and follow at the FMS Parser GitHub repo.
A month ago, I announced a reimagined Code Sprint application process, and we’re excited to help tools like these get the funding and attention they need through it. We’re always looking for developers and newsrooms with great ideas they want to build (along with newsrooms that want to beta test them), so please drop a line. Let’s do this!
Dan Sinker heads up the Knight-Mozilla News Technology Partnership for Mozilla. From 2008 to 2011 he taught in the journalism department at Columbia College Chicago, where he focused on entrepreneurial journalism and the mobile web. He is the author of the popular @MayorEmanuel Twitter account and the creator of the election tracker the Chicago Mayoral Scorecard and the mobile storytelling project CellStories, and he was the founding editor of the influential underground culture magazine Punk Planet until its closure in 2007. He is the editor of We Owe You Nothing: Punk Planet: The Collected Interviews, and was a 2007-08 Knight Fellow at Stanford University.
A version of this post originally appeared on Dan Sinker’s Tumblr here.