    OpenBlock: Can You Explain Data to a Computer AND a Human?

    by Ryan Thornburg
    December 22, 2011

    Since the OpenRural project started in November, one of my primary efforts has been to lift the hood on the OpenBlock application itself and find the “unknown unknowns,” as a former defense secretary once said. We saw data go in, and maps and lists come out. But what happens inside the belly of the beast?

    Over the course of the next several posts, I’m going to give you an X-ray view into the guts of the OpenBlock application. Together, we’re going to watch how data gets ingested and processed into information and insights that residents of rural communities can use to make decisions about their daily lives.

    We knew basically two things when we started this project. First, we knew that public data went into OpenBlock, and that digital public data is for the most part in too poor a condition to be easily digested. Second, we knew that the meaningful geographies of rural areas were going to be different from the geographies of urban areas. But beyond that, the anatomy of OpenBlock remained almost completely obscured. No instances of the application lived in the wild, and the code itself was still missing significant documentation.

    Step 1: Getting data into OpenBlock

    Our first step was to figure out how to get data into OpenBlock. And while many of us probably think about data as being some sort of news event — a transaction, creation, deletion, inspection, election, rejection, incarceration or some other function of government that takes place at a specific time — the initial data you need for OpenBlock is about a location. You have to tell it where it is. And you do that by ingesting data from the U.S. Census Bureau.

    The Census Bureau is an amazing resource of geographic data, which it calls TIGER/Line files. The bureau provides TIGER/Line files that show information about various “layers” of geography, most of which aren’t relevant to OpenBlock. Each layer actually consists of several files that you download from the Census website as a single zip file. Inside that zip file is a file with a .shp extension. That’s the shapefile itself, and it is the one you point OpenBlock at, although its companion files (such as the .prj projection file) need to stay alongside it for the import to work.
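
    To make the "several files in one zip" idea concrete, here is a minimal Python sketch that lists which shapefile pieces an archive contains. It is not part of OpenBlock; the function name is hypothetical, and it assumes only that the download is an ordinary zip archive.

```python
import io
import zipfile
from pathlib import PurePosixPath


def find_shapefile_members(zip_file):
    """Group the members of a zip archive by shapefile base name.

    Returns a dict mapping each base name that has a .shp member to the
    sorted list of file extensions found under that same base name.
    """
    by_stem = {}
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            # "layer.shp" -> stem "layer", suffix ".shp"
            by_stem.setdefault(path.stem, set()).add(path.suffix.lower())
    # Keep only the groups that actually include a .shp file.
    return {
        stem: sorted(exts)
        for stem, exts in by_stem.items()
        if ".shp" in exts
    }
```

    The function accepts either a path to a downloaded zip file or an open file-like object, so you can inspect a layer before trying to load it.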

    Several layers are important — one file contains data about the boundaries of all ZIP codes in a state. Even though most ZIP codes aren’t going to be relevant, you need them anyway. And while loading ZIP code shapefiles into OpenBlock may not be simple for people who don’t have at least some familiarity with Django, it doesn’t require a lot of human judgment. Plug in some code and you’re done.

    But nobody lives their lives by ZIP codes. We care about geographies like counties and cities and streets. And knowing which geographic data to load next does require some editorial thinking. Most of us are familiar with counties as political entities that have some meaning. But counties are different in each state. For example, in Virginia there is a City of Fairfax and County of Fairfax. Fairfax City isn’t in Fairfax County or any county. They touch each other, but one does not have jurisdiction over the other. Now, here in North Carolina we have Durham County and Durham City. Durham City is inside Durham County — mostly — except for the part of Durham City that is in Orange County. Orange County is the home of Chapel Hill, except for the part of Chapel Hill that’s in Durham County. And both Durham and Orange counties contain large parts of land that aren’t in any city at all.

    teaching journalism to programmers and vice versa

    So when we talk about teaching journalism to programmers and programming to journalists — this is really the kind of thing we’re talking about. Somewhere here we have to have someone who knows the political geography of North Carolina and who can also describe the rules of that geography to a computer program so that it doesn’t leave out anything it shouldn’t and also includes everything it should.

    Let’s say we want to show the locations of all new businesses that have been incorporated in Orange County. The N.C. Secretary of State, which records new businesses, may have the address of the business, but not the county. We’re going to have to tell the OpenBlock application which addresses are inside Orange County, regardless of whether they are in Chapel Hill, Durham City or no city. If we tell it just to grab the Chapel Hill addresses, we will erroneously include the businesses that are in the part of Chapel Hill that’s in Durham County. Or, since most people think of Chapel Hill as being an Orange County city and they might be confused if they know a business has opened but isn’t listed on our website, we may want to tell OpenBlock to include all Orange County addresses, but not the ones that are in Durham City, and also include the Durham County addresses if they are in Chapel Hill.
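
    Here is roughly what "describing the rules to a computer" could look like, as a Python sketch rather than actual OpenBlock code. It assumes each address has already been matched to a county and a city (the hard part, which geocoding has to do first); the Address class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Address:
    street: str
    county: str           # e.g. "Orange" or "Durham"
    city: Optional[str]   # None for land that is in no city at all


def strictly_in_orange_county(addr):
    """The literal rule: anything inside the county line, whatever the city."""
    return addr.county == "Orange"


def reader_friendly_orange(addr):
    """The editorial rule: Orange County addresses except those in Durham
    City, plus Chapel Hill addresses that fall in Durham County."""
    orange_not_durham_city = addr.county == "Orange" and addr.city != "Durham"
    chapel_hill_in_durham = addr.county == "Durham" and addr.city == "Chapel Hill"
    return orange_not_durham_city or chapel_hill_in_durham
```

    The two functions disagree only on the border cases: the slice of Durham City inside Orange County, and the slice of Chapel Hill inside Durham County. Those are exactly the cases we would need to explain to readers.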

    And then we have to write for the reader an explanation of whatever assumptions we’re making in our data, and it has to be brief and clear.

    Once you’ve made your editorial decisions, county boundaries can also be downloaded from the Census Bureau. Just to be proper, Census calls this layer “County and Equivalent.”

    Cities, however, are a more delicate matter that again requires some knowledge of Census terminology and local political geography. There is no Census layer called “cities.”

    Until recently, the practice among the OpenBlock community had been to look for geographic information about city boundaries from local governments. Most county governments are pretty good about publishing their geographic data on the web. Many large public universities have a GIS (geographic information system) section in their library, such as this one at N.C. State University that indexes links to the geographic data source for many of the state’s county governments.

    lesson learned: local data varies

    The problem with local data is that poor and rural counties are less likely to have GIS data online. And you will also find yourself dealing with a wide variety of standards. We spent a few days flummoxed by our inability to load Columbus County data. For whatever reason, the county did not include the .prj projection file you need to make a shapefile work in OpenBlock. It’s a lesson that’s going to be important for us to remember throughout this project: local data varies widely in quality. And sometimes it’s not obvious to the layperson’s eye what is missing.
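
    A quick check like the following might have saved us those few days. This is a sketch under assumptions, not an OpenBlock feature: the function name is hypothetical, and the list of required extensions combines the usual shapefile companions (.shx, .dbf) with the .prj file that tripped us up.

```python
from pathlib import Path

# Assumed checklist: the shapefile plus the companions we expect beside it.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf", ".prj")


def missing_shapefile_parts(shp_path):
    """Return the extensions from REQUIRED_EXTENSIONS that have no file
    sitting next to the given .shp path (same folder, same base name)."""
    shp_path = Path(shp_path)
    return [
        ext
        for ext in REQUIRED_EXTENSIONS
        if not shp_path.with_suffix(ext).exists()
    ]
```

    An empty list means every expected piece is present. Note that the check is case-sensitive on most systems, so a county that ships files with upper-case extensions would need a slightly more forgiving version.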

    Our thinking right now is that we’re going to be able to turn back to the Census Bureau for city information. But we’re not using the Census layer called “Consolidated Cities.” Nor the one called “County subdivisions.” We’re using the “Places” layer.

    You can read more about our “city” solution on the OpenBlock discussion group. But there are a few bits worth mentioning here:

    • Places do not cover 100 percent of a county. So we’re going to need to be on guard that events happening outside the boundaries of one of our places don’t somehow get left out of the search results.
    • Some of the places in Columbus County probably have little or no meaning to the audience there. In the “Government” navigation of our partner site, Whiteville.com, not all of the Census places are listed. That said, several of them show up occasionally as the location of obituaries that run on the site.
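
    The first concern in the list above, events that fall outside every place, can be sketched in a few lines of Python. This is an illustration only: it uses a plain ray-casting test on simple polygons where a real deployment would lean on a GIS library, and the function names and the "Unincorporated" label are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test. polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_place(x, y, places, fallback="Unincorporated"):
    """Return the name of the first place containing the point, or the
    fallback label so the event is never silently dropped."""
    for name, polygon in places.items():
        if point_in_polygon(x, y, polygon):
            return name
    return fallback
```

    The design choice that matters is the fallback: every event gets some label, so a search by place can still surface the ones that happened on land outside any Census place.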

    Geography is hardly the dynamic data we think of as news. But we’ve already seen several road bumps that are big enough to deter almost all small news organizations from using the OpenBlock application. So post this as one of our goals: to automate the process of loading basic ZIP, county and city geographic data about your community. Again, you can read more in the OpenBlock discussion group about how we might use FIPS codes — the unique number given by the Census Bureau to each state, county and “place” — to do that.
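
    As a small illustration of why FIPS codes are attractive for automation, here is a Python sketch of how the pieces fit together. It assumes the standard widths (two digits for a state, three for a county, five for a place); the function names are hypothetical, and the place code in the usage note is a made-up placeholder, not a real town.

```python
def county_fips(state_fips, county_code):
    """Combine a state code and a county code into a five-digit county FIPS."""
    return f"{int(state_fips):02d}{int(county_code):03d}"


def split_place_fips(geoid):
    """Split a seven-digit place identifier into (state, place) codes."""
    if len(geoid) != 7 or not geoid.isdigit():
        raise ValueError(f"expected 7 digits, got {geoid!r}")
    return geoid[:2], geoid[2:]
```

    North Carolina is state 37 and Columbus County is county 047, so county_fips(37, 47) gives "37047". A placeholder place identifier such as "3712345" splits into state "37" and place "12345". With identifiers like these, a script could pick out one community’s county and places from a statewide file without anyone hunting through it by hand.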

    And there’s still a big elephant in the room — the geography of “blocks” that is one of the core concepts of OpenBlock. That topic is big enough for its own post a few weeks from now. Before then, I’m going to walk you through some of our experiences loading “news” data into OpenBlock and how we’re hoping we might be able to work with fellow Knight News Challenge winner ScraperWiki to build our own API for North Carolina state government.

    Understanding each step of the set-up, production and editing process involved with OpenBlock is critical to our ability to describe the expense side of the equation, which we hope and expect will lead to the financial viability of the application as a tool to fill the information needs of rural communities.

    Tagged: census data geography GIS information openblock openrural rural communities
