DocumentCloud Going Open Source Every Step of the Way

    by Jeremy Ashkenas
    October 29, 2009

    What does it mean to work on a project where open-source principles are written into the founding contract? A little over a month after receiving a 2009 Knight News Challenge grant, DocumentCloud released its first open-source component.

    The system, called CloudCrowd, performs the distributed computing that helps process the vast quantities of documents that will eventually be stored in DocumentCloud. It might seem premature to be releasing code so early — in the past some Knight grantees have chosen to wait until the end of their grant — but the larger part of open-source is community, not code. We’re planning on releasing portions of DocumentCloud as we build them in order to take advantage of the contributions that the open-source community can provide.

    When finished, DocumentCloud will be a software system, a website, and a set of open standards that will make it possible to read, search, and organize primary source documents across the web. As a journalist or researcher, you will be able to run filtered searches across the library of documents, and embed your source documents right alongside an article or blog post. All the aspects of the system — the search engine, the document viewer, the journalist workspace — will be open-sourced during the course of our grant.


    The Value of Open Source

    Going open-source is often viewed by skeptics as a sort of feckless altruism: a free handout of valuable intellectual property. I couldn’t disagree more, especially with respect to nonprofit organizations. A small team with limited resources benefits greatly from the ideas, bug reports and patches that a community can provide.

    Since releasing CloudCrowd a month ago, we’ve fixed a handful of bugs and added dozens of features, directly as a result of input from the community of contributors. As of this writing, 375 developers follow the project on GitHub, which means that they are notified any time changes are made. Twelve of them have ‘forked’ the project, pushing CloudCrowd in the directions that matter most to them.

    Needless to say, this community of contributors dwarfs DocumentCloud as an organization, and is invaluable in helping to increase the quality of the software. CloudCrowd is already being used to process biomedical data and align gene sequences across strains of influenza virus, an application that’s far afield from our original use. We’re looking forward to hearing more about how it holds up in other arenas.


    Because anything worth doing is worth doing twice, yesterday morning we released the second open-source component of DocumentCloud: Underscore.js. These days, creating a modern web application means taking JavaScript seriously. Underscore is a library that provides a base level of functional programming support in a manner that’s easy to integrate with projects and other libraries. Since yesterday morning, we’ve received a whole slew of patches, and have made a number of updates driven by suggestions from the community.
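
    To give a rough sense of what "functional programming support" means in practice, here is a minimal sketch in TypeScript. The helpers pluck and groupBy, the SampleDoc type, and the sample records are all hypothetical and written only for this illustration; they are not Underscore's source code or its exact API, just the style of list-processing utility that such a library makes convenient.

```typescript
// Hypothetical helpers for illustration only; NOT Underscore's implementation.

// Assumed shape of a document record (invented for this example).
interface SampleDoc {
  title: string;
  source: string;
  pages: number;
}

// Pull one named property out of every object in a list.
function pluck<T, K extends keyof T>(items: T[], key: K): T[K][] {
  return items.map((item) => item[key]);
}

// Group items into buckets keyed by the string that `keyOf` returns.
function groupBy<T>(items: T[], keyOf: (item: T) => string): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const item of items) {
    const key = keyOf(item);
    if (!groups[key]) {
      groups[key] = [];
    }
    groups[key].push(item);
  }
  return groups;
}

// Invented sample data.
const docs: SampleDoc[] = [
  { title: "Budget memo", source: "city", pages: 12 },
  { title: "Audit report", source: "state", pages: 40 },
  { title: "Council minutes", source: "city", pages: 8 },
];

const titles = pluck(docs, "title");
const bySource = groupBy(docs, (d) => d.source);
const totalPages = docs.reduce((sum, d) => sum + d.pages, 0);
```

    Each helper takes a list and a small function or key and returns a new value without modifying the input, which is what makes utilities of this kind easy to combine with other code.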

    Most of our work over the past few months has been on the internal DocumentCloud prototype, which is a complete first draft of what the system will become. We’re hoping to extract additional portions of the prototype for release in the near future. So stay tuned as more and more bits of DocumentCloud come online.

