When President Obama released his birth certificate and dozens of news organizations turned to DocumentCloud to present it to their readers, I snarked a bit. Though the birth certificate did prompt a few questions — which we’re still navigating — about the best way to handle duplicate uploads, the secret truth was we were both proud and flattered that so many newsrooms, faced with a document they wanted their readers to see, came straight to us and knew they could count on us.
When half a dozen newsrooms turned to us to help them get Sarah Palin’s emails out to the public, our first thought was that we’d waited too long to address the duplicate documents challenge. Our second was that this kind of case — multiple newsrooms eager to scan the same tower of documents — is exactly what makes managing duplicate documents challenging. DocumentCloud is a catalog of primary source documents and a tool for annotating, organizing and publishing them on the web.
Most newsrooms downloaded a PDF of the president’s birth certificate and then uploaded that file to DocumentCloud. A simple checksum could have identified those uploads as identical files. With separately scanned paper, it takes human intervention (or at least more complex algorithms) to declare that one scanner’s photo of a particular page matches another’s.
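To make the checksum idea concrete, here is a minimal sketch in Python. It is not DocumentCloud's actual code: the function names, the file names and the choice of SHA-256 are all assumptions made for illustration. It groups uploads whose bytes are exactly the same, which is the birth certificate case.

```python
import hashlib
from collections import defaultdict


def file_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def group_duplicates(uploads: dict[str, bytes]) -> list[list[str]]:
    """Group upload names whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, data in uploads.items():
        by_digest[file_digest(data)].append(name)
    # Only digests shared by more than one upload indicate duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical uploads: two newsrooms upload the same downloaded PDF,
# while a third uploads a file produced by its own scanner.
uploads = {
    "newsroom_a/birth-certificate.pdf": b"%PDF-1.4 identical bytes",
    "newsroom_b/birth-certificate.pdf": b"%PDF-1.4 identical bytes",
    "newsroom_c/own-scan.pdf": b"%PDF-1.4 different bytes from another scanner",
}
duplicate_groups = group_duplicates(uploads)
```

The two identical uploads land in one group, and the third file is left out. A checksum only catches exact copies: two scans of the same sheet of paper produce different bytes and different digests, which is why the scanned case needs a person or a more sophisticated comparison.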
STOPGAP IN A BROKEN SYSTEM
The Palin email release was also an unfortunate reminder that DocumentCloud is often a stopgap in a broken system: There’s no good technical reason those emails had to be released on paper stacked into file boxes in order to be reliably redacted. There’s no good reason that a half-dozen newsrooms were sending staff to Juneau, Alaska, to digitize documents that started out digital.
Our own system was feeling a bit broken as well — we knew that reporters from multiple national papers were assembling in Juneau and preparing to be the first to get the documents online, but we couldn’t convince any one paper to scan the 25,000 pages once and make them available to everyone to embed. That left us spending a few hundred dollars to queue up additional servers so that our users could duplicate each other’s work. Not too much money, in the grand scheme of things, but money nonetheless. Those 25,000 pages turned out not to contain much noteworthy information, but that’s a different story.
We were also queuing up to make a big announcement of our own, the merger of DocumentCloud with IRE. I was too busy keeping our own shifting sands level to compose any particularly coherent thoughts, though I was proud of the way our servers weathered the storm — DocumentCloud was slower than usual for a few days, but not by much.
We designed DocumentCloud to scale as necessary, and it scaled well. And there’s no question that once again we took it as the sincerest of compliments that the country’s top newsrooms turned to DocumentCloud as soon as they had documents to share with their readers. I can’t think of better evidence that DocumentCloud has been a success.
FACILITATING CIVIC CONVERSATIONS
Earlier this week I found myself discussing Internet activist Aaron Swartz’s indictment. I kept seeing tweets like “downloading too many articles is a felony?!” Some of the charges in the indictment, like breaking into a locked room, go a few steps beyond coming up with a workaround to avoid paying for your 21st New York Times article online. Someone pointed out that particular charge was bizarre because in the case of the locked server closet, MIT ought to be the aggrieved party — these charges aren’t coming from MIT.
All over Twitter people were having the same conversation and a decent number of other ones. The same was happening over email and in social networks. Over and over, all kinds of people were pointing back to the indictment. Hundreds of thousands of people read the indictment. And that is why we built DocumentCloud — to facilitate civic conversations, to give people a better way to talk about substantive issues in the news.
Aaron’s legal troubles are the kind that my own community of programmers, journalists, web watchers and advocates is especially interested in, so I’ve seen how having the document available has influenced these conversations. But I also know that DocumentCloud is full of documents like Aaron’s indictment: documents that journalists thought were worth sharing and that readers are skimming to measure against the resulting reporting.
Photo of Aaron Swartz courtesy of Fred Benenson and used here under Creative Commons license.