
How News Organizations Should Prepare for Data Dumps

Soon every news organization will have its own “bunker” — a darkened room where a hand-picked group of reporters hole up with a disk/memory stick/laptop of freshly opened data, some stale pizza and lots of coffee.

Last year the U.K.’s Daily Telegraph secreted half a dozen reporters in a room for nine days with about 4 million records of politicians’ expenses. They were hidden away even from the paper’s own employees. Now we learn that reporters from the Guardian, the New York Times and Der Spiegel did the same with Julian Assange of WikiLeaks somewhere in the Guardian’s offices in King’s Cross, London.

There is a wonderful irony that open data can generate such secrecy. Of course the purpose of this secrecy is to find — and protect — scoops buried in the data. From the perspective of many news organizations, these scoops are the main benefit of data dumps. Certainly the Daily Telegraph benefited hugely from the scoops it dug out of the MPs’ expenses data: weeks of front pages in the print paper, national uproar, multiple resignations, court cases and much soul-searching about the state of parliamentary politics.

The Guardian, the New York Times and Der Spiegel have not been able to stretch the WikiLeaks Afghan logs over multiple weeks, but they did dominate the news for a while, and stories will almost certainly continue to emerge.

These massive data releases are not going to go away. In fact, they’re likely to accelerate. The U.S. and U.K. governments are currently competing to see who can release more data sets. WikiLeaks will no doubt distribute more raw information, and WikiLeaks will spawn similar stateless news organizations. Therefore news organizations need to work out how best to deal with them, both to maximize the benefits to them and their readers, and to ensure they don’t do evil, as Google might say.

5 Questions

Here are just five (of many) questions news orgs should ask themselves when they get their next data dump:

1. How do we harness public intelligence to generate a long tail of stories? Though the Telegraph succeeded in unearthing dozens of stories from the parliamentary expenses data, the handful of reporters in the bunker could never trawl through each of the millions of receipts contained on the computer disks. It was the Guardian that first worked out how to deal with this; it not only made the receipts available online but provided tools to search through them and tag them (see Investigate your MP’s expenses). This way it could harness the shared intelligence — and curiosity — of hundreds, if not thousands, more volunteer watchdogs, each of whom might be looking for a different story in the expenses data. As a result, the Guardian generated many more stories and helped nurture a community of citizen scrutineers (a sketch of such a tagging model follows after this list).

2. How do we make it personal? Massive quantities of data can be structured so that they are directly relevant to whoever is looking at them. With crime data you can, for example, let people type in their postcode and see what crimes have happened in their neighborhood (e.g. San Francisco Crimespotting). For MPs’ expenses, people could look up their own MP and scour his or her receipts. The Afghan logs were different in this respect, but OWNI, Slate.fr and Le Monde Diplomatique put together an app that allows you to navigate the logs by country, by military activity and by casualties (see here). The key is to develop a front end that lets people make the data immediately relevant to themselves (see the postcode-lookup sketch after this list).

3. How can we use the data to increase trust? The expenses files, the Afghan logs and the COINS database (a massive database of U.K. government spending released last month) are all original documents that can be tagged, referenced and linked to. They enable journalists not only to refer back to the original source material, but to show an unbroken narrative flow from original source to final article. This cements the credibility of the journalism and gives the reader the opportunity to explore the context within the original source material. Plus, if published as linked data, the article can point directly back to the original data reference (a minimal example follows after this list).

4. How do we best — and quickly — filter the data (and work out what, and what not, to publish)? Those best able to filter this data, using both human and machine methods, are the most likely to benefit from it. Right now only a very small number of news organizations appear to be developing these skills, notably the Guardian, the New York Times and the BBC. The skills, and algorithms, they develop will give them a competitive advantage when dealing with future data releases (read, for example, Simon Rogers on how the Guardian handled the 92,201 rows of data and how Alastair Dant dealt with visualizing IED events at FlowingData). These skills will also help them work out what not to publish, such as data that could put people in danger (see the triage sketch after this list).

5. How can we ensure future whistleblowers bring their data to us? It’s impossible to predict where a whistleblower will take their information. John Wick, who brokered the MPs’ expenses disk to the Telegraph, went first to the Express, one of the U.K.’s least well-resourced and least prepared national papers. But the organizations that become known for handling big data sets are likely to have more whistleblowers coming to them. Julian Assange went to the Guardian partly because the journalist Nick Davies sought him out in Brussels (from Clint Hendler in CJR), but Assange must also have been convinced the Guardian would be able to deal with the data.
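
To make question one concrete, here is a minimal sketch, in Python, of the data model behind a crowdsourced tagging tool of the kind the Guardian built. The field names and the “investigate” tag are hypothetical, not the Guardian’s actual schema: volunteers tag documents, and documents that enough volunteers flag independently are surfaced to reporters.

    from collections import Counter
    from dataclasses import dataclass, field

    @dataclass
    class Receipt:
        """One scanned expenses page awaiting volunteer review."""
        receipt_id: str   # hypothetical identifier
        mp_name: str
        tags: Counter = field(default_factory=Counter)  # tag -> number of volunteers

        def add_tag(self, tag):
            self.tags[tag] += 1

    def review_queue(receipts, threshold=3):
        """Surface receipts that several volunteers independently flagged."""
        return [r.receipt_id for r in receipts if r.tags["investigate"] >= threshold]

    # Three volunteers flag the same receipt; it crosses the threshold.
    r = Receipt("rcpt-0001", "A. Member MP")
    for _ in range(3):
        r.add_tag("investigate")
    print(review_queue([r]))  # ['rcpt-0001']

Requiring that several volunteers flag a document independently is one simple way to separate genuine leads from noise before a reporter spends time on them.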
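
For question two, the heart of “make it personal” is a filter keyed on something the reader knows about themselves. A sketch, assuming each record carries a U.K.-style postcode field (the sample data is invented):

    # Filter a public data set down to the reader's own area.
    INCIDENTS = [
        {"postcode": "N1 9GU", "type": "burglary", "date": "2010-07-02"},
        {"postcode": "SW1A 0AA", "type": "theft", "date": "2010-07-11"},
    ]

    def incidents_near(postcode, incidents=INCIDENTS):
        """Return records whose outward code (e.g. 'N1') matches the reader's."""
        outward = postcode.strip().upper().split()[0]
        return [i for i in incidents if i["postcode"].split()[0] == outward]

    print(incidents_near("n1 2ab"))  # -> the N1 burglary record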
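
For question three, “linked data” can be as modest as embedding a machine-readable pointer from the published article back to the primary record. A sketch using schema.org-style vocabulary; the property names and URLs are illustrative, not any outlet’s actual markup:

    import json

    # The article declares which primary-source record it is based on,
    # so readers and machines can walk back from story to evidence.
    article = {
        "@context": "http://schema.org",
        "@type": "NewsArticle",
        "headline": "MP claimed expenses for second home",
        "isBasedOn": "http://example.org/expenses/receipt/12345",
    }
    print(json.dumps(article, indent=2))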
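
And for question four, the machine half of “human and machine” filtering can start as crudely as flagging rows that look like they mention people, holding them back for human review rather than publishing or auto-redacting. A deliberately simplistic sketch; real triage needs named-entity recognition and editorial judgment:

    import re

    # Rows matching the pattern are held for human review, not published.
    SENSITIVE = re.compile(r"\b(source|informant|named)\b", re.IGNORECASE)

    def triage(rows):
        """Partition log rows into publishable and held-for-review."""
        publish, hold = [], []
        for row in rows:
            (hold if SENSITIVE.search(row) else publish).append(row)
        return publish, hold

    publish, hold = triage([
        "2009-05-01 patrol report, no contact",
        "2009-05-02 meeting with named informant in Kandahar",
    ])
    print(hold)  # the second row is held back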

The influence of the war logs continues to spin across the globe, particularly following the Afghan president’s comments. But it is not the first — and certainly won’t be the last — big data dump. Better that news organizations prepare themselves now.

Martin Moore is the director of the Media Standards Trust, a nonprofit organization that aims to foster high-quality journalism. He has been working in news and media for more than a decade, including for the BBC, Channel 4, NTL, IPC Media, Trinity Mirror and others. Moore studied history at Cambridge and holds a doctorate from the London School of Economics, where he was teaching and researching until summer 2006.

Comments

  • A characteristically elegant analysis from the director of the Media Standards Trust, with useful thoughts about the importance of media organisations developing digital management of big dossiers. Only then will these ‘data dumps’ be useful for citizens in holding government to account.

    Intriguingly, the news story with the biggest impact from the Afghan leaks was the story that many names of alleged Afghan informants were left exposed, while American names were redacted. It was this story that put the WikiLeaks publisher Julian Assange on the defensive, with claims that he had ‘blood on his hands.’ This story was first reported by The Times (of London), not one of the organisations that had privileged advance access to the dossier. The rule seems to be that the news organisations that have the advantage of early access are less willing to bite the hand that feeds them.

  • Potential Data Dump, not large, but spectacular

    Julian & Geoff, I encourage you both to unfool the people ...
    Friday, August 6, 2010 1:44 PM
    From:
    To: "Pentagon Press Sec. Geoff Morrell c/o Maj. Bryan Woods", wl-editor@sunshinepress.org, julian@wikileaks.org

    Julian & Geoff,
    I encourage you to unfool the people by showing them two simple, almost hidden histories.
    Real asset price histories are VERY instructive, but they are kept little-apparent because fooling the people is more profitable.
    Nearly never apparent:
    3/30/1999 WSJ
    http://homepage.mac.com/ttsmyf/begun.pdf
    8/27/2006 NYT
    http://www.nytimes.com/imagepages/2006/08/26/weekinreview/27leon_graph2.html

    Up-to-date are here:
    “Real Homes, Real Dow” at
    http://homepage.mac.com/ttsmyf/RHandRD.html
    These price histories are serial herd behaviors!
    Individuals’ experiences were overwhelmingly timing-dependent. People uninformed of these serial herd behaviors are people fooled. This fooling of the people is USA history to date.

  • It's clear we need more "public intelligence in the public interest" that addresses the top global threats and interconnects them with the major policies/priorities and stimulates the 8 humanities to communicate better with each other to solve problems. Below are a couple of links that depict this outline from the non-profit Earth Intelligence Network, which has been ignored for too long:

    http://www.phibetaiota.net/?p=16729

    http://www.phibetaiota.net/?p=25831
