Soon every news organization will have its own “bunker” — a darkened room where a hand-picked group of reporters hole up with a disk/memory stick/laptop of freshly opened data, some stale pizza and lots of coffee.
Last year the U.K.’s Daily Telegraph secreted half a dozen reporters in a room for nine days with about 4 million records of politicians’ expenses. They were hidden away even from the paper’s own employees. Now we learn that reporters from the Guardian, the New York Times and Der Spiegel did the same with Julian Assange of WikiLeaks somewhere in the Guardian’s offices in King’s Cross, London.
There is a wonderful irony that open data can generate such secrecy. Of course the purpose of this secrecy is to find, and protect, scoops buried in the data. From the perspective of many news organizations, these scoops are the main benefit of data dumps. Certainly the Daily Telegraph benefitted hugely from the scoops it dug out of the MPs’ expenses data: weeks of front pages in the print edition, national uproar, multiple resignations, court cases and much soul searching about the state of parliamentary politics.
The Guardian, the New York Times and Der Spiegel have not been able to stretch the WikiLeaks Afghan logs over multiple weeks, but they did dominate the news for a while, and stories will almost certainly continue to emerge.
These massive data releases are not going to go away. In fact, they’re likely to accelerate. The U.S. and U.K. governments are currently competing to see who can release more data sets. WikiLeaks will no doubt distribute more raw information, and WikiLeaks will spawn similar stateless news organizations. Therefore news organizations need to work out how best to deal with them, both to maximize the benefits to them and their readers, and to ensure they don’t do evil, as Google might say.
Here are just five (of many) questions news orgs should ask themselves when they get their next data dump:
1. How do we harness public intelligence to generate a long tail of stories? Though the Telegraph succeeded in unearthing dozens of stories from the parliamentary expenses data, the handful of reporters in the bunker could never trawl through each of the millions of receipts contained on the computer disks. It was the Guardian that first worked out how to deal with this; it not only made the receipts available online but provided tools to search through them and tag them (see Investigate your MP’s expenses). This way it could harness the shared intelligence, and curiosity, of hundreds, if not thousands, more volunteer watchdogs, each of whom might be looking for a different story in the expenses data. As a result, the Guardian generated many more stories and helped nurture a community of citizen scrutineers.
2. How do we make it personal? Massive quantities of data can be structured to be made directly relevant to whoever is looking at it. With crime data you can, for example, enable people to type in their postcode and see what crimes have happened in their neighborhood (e.g. San Francisco crimespotting). For MPs’ expenses, people could look up their own MP and scour his/her receipts. The Afghan logs were different in this respect, but OWNI, Slate.fr and Le Monde Diplomatique put together an app that allows you to navigate the logs by country, by military activity, and by casualties (see here). The key is to develop a front end that allows people to make the data immediately relevant to them.
3. How can we use the data to increase trust? The expenses files, the Afghan logs and the COINS database (a massive database of U.K. government spending released last month) are all original documents that can be tagged, referenced and linked to. They enable journalists not only to refer back to the original source material, but to show an unbroken narrative flow from original source to final article. This cements the credibility of the journalism and gives the reader the opportunity to explore the context within the original source material. Plus, if published as linked data, the article can be linked directly to the original data reference.
4. How do we best, and most quickly, filter the data (and work out what, and what not, to publish)? Those who are best able to filter this data using human and machine methods are those most likely to benefit from it. Right now only a very small number of news organizations appear to be developing these skills, notably the Guardian, the New York Times and the BBC. The skills, and algorithms, they develop will give them a competitive advantage when dealing with future data releases (read, for example, Simon Rogers on how the Guardian handled the 92,201 rows of data and how Alastair Dant dealt with visualizing IED events at FlowingData). These skills will also help them work out what not to publish, such as data that could put people in danger.
5. How can we ensure future whistleblowers bring their data to us? It’s impossible to predict where a whistleblower will take their information. John Wick, who brokered the MPs’ expenses disk to the Telegraph, went first to the Express, one of the U.K.’s least well-resourced and least prepared national papers. But it is likely that the organizations that become known for handling big data sets will have more whistleblowers coming to them. Julian Assange went to the Guardian partly because the journalist Nick Davies sought him out in Brussels (from Clint Hendler in CJR), but Assange must also have been convinced the Guardian would be able to deal with the data.
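To make question 2 concrete, here is a rough sketch in Python of the kind of lookup that sits behind a "type in your postcode" front end. It is only an illustration: the records, the field names and the helper functions (normalise, build_index, lookup) are all invented for this example and do not describe how any of the projects mentioned above were built.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration. A real release would be
# loaded from the published files rather than typed in by hand.
RECORDS = [
    {"postcode": "N1 9GU", "category": "burglary", "date": "2010-07-02"},
    {"postcode": "N1 9GU", "category": "vehicle theft", "date": "2010-07-15"},
    {"postcode": "SW1A 0AA", "category": "fraud", "date": "2010-07-09"},
]


def normalise(postcode):
    """Ignore case and spacing so 'n1 9gu' and 'N19GU' are treated alike."""
    return "".join(postcode.split()).upper()


def build_index(records):
    """Group the records by normalised postcode, once, up front."""
    index = defaultdict(list)
    for record in records:
        index[normalise(record["postcode"])].append(record)
    return index


def lookup(index, postcode):
    """Return the records for one reader's postcode (empty list if none)."""
    return index.get(normalise(postcode), [])


INDEX = build_index(RECORDS)

for record in lookup(INDEX, "n1 9gu"):
    print(record["date"], record["category"])
```

The point of building the index first is that each reader's query then costs a single dictionary lookup, however large the dump is. The same shape would work for looking up an MP by constituency instead of a crime by postcode.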
The influence of the war logs continues to spin across the globe, particularly following the Afghan president’s comments. But it is not the first — and certainly won’t be the last — big data dump. Better that news organizations prepare themselves now.