WRAL.com, North Carolina’s most widely read online news site, recently published a tool that allows you to search concealed weapons permits down to the street level. It didn’t use OpenBlock to do so. Why?
Or, if you’re like many journalistically and technically savvy people I’ve spoken over the last few months, you could ask why would they? There’s plenty of evidence out there to suggest the OpenBlock application is essentially a great experiment and proof of concept, but a dud as a useful tool for journalists. Many of the public records portions of Everyblock.com — OpenBlock‘s commercial iteration — are months if not years out of date. It can’t be found anywhere on the public sites of the two news organizations in which the Knight Foundation invested $223,625. There are only three sites running the open-source code — two of those are at universities and only one of which was created without funding from the Knight Foundation.
And, you, Thornburg. You don’t have a site up and running yet, either.
All excellent points, dear friends. OpenBlock has its problems — it doesn’t work well in multi-city installations, some search functions don’t work as you’d expect, there’s no easy way to correct incorrect geocoding or even identify possible failures, among other obstacles that I’ll describe in greater detail in a later blog post. But the alternatives also have shortcomings. And deciding whether to use OpenBlock depends on which shortcomings will be more tolerable to your journalists, advertisers and readers.
SHOULD I USE OPENBLOCK?
If you want to publish news from multiple cities or from unincorporated areas, or if you serve a rural community I’d hold off for now. If you visit our public repositories on GitHub you can see the good work the developers at Caktus have been doing to remove these limitations, and I’m proud to say that we have a private staging site that’s up and running for our initial partner site. But until we make the set-up process easier, you’re going to have to hire a Django developer (at anywhere from $48,000 a year to $150 an hour) to customize the site with your logo, your geographic data, and your news items.
The other limitation to OpenBlock right now is that it isn’t going to be cheap to maintain once you do get it up and running. The next priority for me is to make the application scale better to multiple installations and therefore lower the maintenance costs. Within the small OpenBlock community, there’s debate about how large of a server it requires. The very good developers at OpenPlans who did a lot of heavy lifting on the code between the time it was open sourced and the time that it should run nicely on a “micro” instance of Amazon’s EC2 cloud hosting service — about $180 a year.
But we and Tim Shedor, the University of Kansas student who built LarryvilleKU, find OpenBlock a little too memory intensive for the “micro” instance. We’re on an Amazon Web Services “small” instance, and LarryvilleKU is on a similar sized virtual server at MediaTemple. That costs more like $720 a year. And if you add a staging server to make sure your code changes break in private instead of public, you’re looking at hosting costs of nearly $1,500 a year.
And that’s before your scrapers start breaking. Depending on how conservative you are, you’ll want to set aside a budget for fixing each scraper somewhere between one and three times a year. Each fix might be an hour or maybe up to 12 hours of work for a Python programmer (or the good folks at ScraperWiki). If you have three data sources — arrests, restaurant inspections and home sales, let’s say — then you may get away with a $300 annual scraper maintenance cost, or it may set you back as much as $15,000 a year.
I’ve got some ideas on how to reduce those scraper costs, too, but more on that later as well.
Of course, if you have someone on staff who does Python programming and whose done some work with public records and published a few Django sites and they’ve got time to spare, then your costs will go down significantly.
But just in case you don’t have such a person on staff or aren’t ready to make this kind of investment, what are your alternatives?
GOOGLE MAPS AND FUSION TABLES
Using a Google Map on your news website is a little like playing the saxophone. It’s probably the easiest instrument to learn how to play poorly, but pretty difficult to make it really sing. Anyone can create a Google Map of homicides or parking garages or whatever, but it’s going to be a static map of only one schema, and it won’t be searchable or sortable.
On the other hand, you can also use Google Maps and Fusion Tables to build some really amazing applications, like the ones you might see in The Guardian or on The Texas Tribune or WNYC or The Bay Citizen. You can do all this, but it also takes some coding effort and probably a bit more regular hand care and feeding to keep the site up-to-date.
I’ve taken a look at how you might use Google’s data tools to replicate something like OpenBlock, although I’ve not actually done it. If you want to give it a whirl and report back, here’s my recipe.
A RECIPE FOR REPLICATING OPENBLOCK
Step 1. Create one Google Docs spreadsheet for each schema, up to a maximum of four spreadsheets. And create one Google Fusion Table for each scheme, up to a maximum of four tables.
Step 2. If the data you want is in a CSV file that’s been published to the web, you can populate it with a Google Docs function called ImportData. This function — as well as its sister functions ImportHTML and ImportXML — will only update 50 records a time. And I believe this function will pull in new data from the CSV about once an hour. I don’t know whether it will append the new rows or overwrite them, or what it would do if only a few of the fields in a record change. If you’re really lucky, the data would be in an RSS feed and you could use the ImportFeed function to get past this 50-record limit.
Of course, in the real world almost none of your data will be in these formats. None of mine are. And in that case, you’d have to either re-enter the data into Google Docs by hand or use something like ScraperWiki to scrape a datasource and present it as a CSV or a feed.
Step 3. Use a modification of this script to automatically pull the data — including updates — from the Google Docs spreadsheet into the corresponding Fusion table you created for that schema.
Step 4. Find the U.S. Census or local county shapefiles for any geographies you want — such as ZIP codes or cities or school districts — and convert them to KML.
Step 5. Upload that geographic information into another Fusion Table.
Step 6. Merge the the Fusion table from Step 3 with the Fusion table from Step 5.
- Use filters to select different points to appear on the map. (Examples: https://developers.google.com/fusiontables/docs/samples/change_query and http://googlegeodevelopers.blogspot.com/2011/05/turn-up-power-of-your-fusion-tables.html)
- Display points from different tables — known in OpenBlock parlance as “NewsItem types” — on a single Google Map (up to five). (Tutorial: http://michelleminkoff.com/2011/08/21/how-to-combine-multiple-fusion-tables-into-one-map/)
- Have maps and tables interact with each other, so that filtering by one shows information in another. (Example: http://www.texastribune.org/library/data/redistricting-viewer/)
And there’s even at least one prototype of using server-side scripting and Google’s APIs to build a relatively full-functioning GIS-type web application: https://github.com/odi86/GFTPrototype
After all that, you will have some of the features of OpenBlock, but not others.
Some key OpenBlock features you can replicate with Google Maps and Fusion Tables:
- Filter by date, street, city, ZIP code or any other field you choose. Fusion Tables is actually a much better interface for searching and filtering — or doing any kind of reporting work — than OpenBlock.
- Show up to four different kinds of news items on one map (five if you don’t include a geography layer).
- Conduct proximity searches. “Show me crimes reported within 1 mile of a specific address.”
WHAT YOU CAN’T REPLICATE
The OpenBlock features you can’t replicate with Google:
- Use a data source that is anything other than an RSS feed, HTML table, CSV or TSV. That’s right, no XLS files unless you manually import them.
- Use a data source for which you need to combine two CSV files before import. This is the case with our property transactions and restaurant inspections.
- Update more than 50 records at a time. Definitely a problem for police reports in all but the smallest towns.
- Use a data source that doesn’t store the entire address in a single field. That’s a problem for all the records with which we’re working.
- Map more than 100,000 rows in any one Fusion table. In rural counties, this probably wouldn’t be a concern. In Columbus County, N.C., there are only 45,000 parcels of land and 9,000 incidents and arrests a year.
- Use data sources that are larger than 20MB or 400,000 cells. I don’t anticipate this would be a problem for any dataset in any county we’re working.
- Plot more than 2,500 records a day on a map. Don’t anticipate hitting this limit either, especially after the initial upload of data.
- Parse text for an address — so you can’t map news articles, for example.
- Filter to the block level. If Main Street runs for miles through several miles, you’re not going to be able to narrow your search to anything relevant.
- Create a custom RSS feed, or email alert.
THE SEO ADVANTAGE
And there’s one final feature of OpenBlock that you can’t replicate using Google tools without investing a good deal of manual, rote set-up work — taking advantage of SEO or social media sharing by having a unique URL for a particular geography or news item type. Ideally, if someone searches for “home sales in 27514” I want them to come to my site. And if someone wants to post to Facebook a link to a particular restaurant that was scolded for having an employee with a finger-licking tendency (true story), I’d want them to be able to link directly to that specific inspection incident without forcing their friends to hunt through a bunch of irrelevant 100 scores.
To replicate OpenBlock’s URL structure using Google Maps and Fusion Tables, you’d have to create a unique web page and a unique Google map for each city and ZIP code. The geography pages would display a polygon of the selected geography, whether it’s a ZIP code or city or anything else, and all of the news items for that geography (up to four schemas, such as arrests, incidents, property sales, and restaurant inspections). That’s 55 map pages.
Then you’d have to create a map and a page for each news item type. That’s four pages, four Fusion tables, and four Google Docs spreadsheets.
Whew. I’m going to stick with our work in improving the flexibility and scalability of OpenBlock. But it’s still worth looking at Google Maps and Fusion Tables for some small and static data use cases. Other tools such as Socrata’s Open Data, Caspio and Tableau Public are also worth your time as you begin to think about publishing public data. Each of those have some maintenance costs and their own strengths and weaknesses, but the real trick for using all of these tools is public data that isn’t in any usable format. We’re looking hard at solving that problem with a combination of scraping and crowdsourcing, and I’ll report what we’ve found in an upcoming post.
Ryan Thornburg researches and teaches online news writing, editing,
producing and reporting as an assistant professor in the School of
Journalism and Mass Communication at the University of North Carolina at
Chapel Hill. He has helped news organizations on four continents
develop digital editorial products and use new media to hold powerful
people accountable, shine light in dark places and explain a complex
world. Previously, Thornburg was managing editor of USNews.com,
managing editor for Congressional Quarterly’s website and
national/international editor for washingtonpost.com. He has a master’s
degree from George Washington University’s Graduate School of Political
Management and a bachelor’s from the University of North Carolina at