Data journalism is in danger of becoming the broccoli of J-school education. We all know we should like data journalism, we all know we’re supposed to be doing data journalism, but, like broccoli, data journalism can seem so deeply unattractive and inedible.
This, then, is a guide to getting started with teaching data journalism: a simple, delicious way to steam the broccoli. This guide is not aimed at faculty who already know their way around Python or Tableau (you can go back to your broccoli soufflés) but at the journo-academic who might be teaching a basic data course for the first time and wants to know where and how to get started. In particular, this article is aimed at smaller programs where one person is usually the go-to for anything that looks remotely like tech journalism.
Before getting started, all would-be data instructors should join the National Institute for Computer-Assisted Reporting (NICAR). I cannot stress this enough. NICAR has been doing this for a long time, and it shows. It maintains a deep, rich collection of tutorials and assignments, including an awesome Excel course, which frees you up to think about the journalism.
Also, reach out to your local news organization to see if they would be interested in collaborating on a data project. Our class is working with Newsday as one piece of a much larger Newsday data project. The students’ task is to request payroll records under the New York State Freedom of Information Law from selected local government agencies and follow through on the responses. The project has provided an excellent way to introduce students to the mechanics of FOI, as well as the law itself. The students are developing a lot of confidence from dealing with the local clerks, although (like most Millennials) they do tend to get stressed once the conversations move from email to phone.
What you’ll need
- NICAR membership
- The Data Journalism Handbook
- Familiarity with the following
- Excel (try the excellent course pack from NICAR)
- Google Spreadsheets (See a tutorial here)
- Google Forms (See a tutorial here)
- Google Fusion Tables (See a tutorial here)
- Google/Open Refine (See a tutorial here)
- Web-based scraping tools (ScraperWiki, etc.)
- Freedom of Information Laws (federal and your state)
What you’ll be doing
- Getting students comfortable with numbers
- Getting students to think about data stories
- Finding the data
- Converting the data
- Cleaning and analyzing the data
- Interviewing the data
1. Getting students comfortable with numbers
Mention “data” to average students and they’ll start getting anxious about math. Some may even attempt to hide under their desks. Yes, it is math, I tell them, but Excel will do the heavy lifting on the numbers. Your job is to figure out whether the numbers are telling you anything.
I require students to purchase the $20 Excel module from NICAR. The course pack takes about two to three weeks of class time and covers rates, ratios, percentages, sorting, summing, filtering and pivot tables. The pack includes story ideas, teaching tips and a cheat sheet of Excel formulas. It’s a great introduction to Excel for first-time faculty and students alike. Just make sure you review the materials ahead of class so you can troubleshoot any issues.
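For instructors who do eventually wander toward Python, the same operations the NICAR pack teaches in Excel have direct equivalents in the pandas library. This is a minimal sketch with invented arrest numbers, not a dataset from the course pack:

```python
import pandas as pd

# Invented example data; the NICAR course pack supplies its own datasets.
df = pd.DataFrame({
    "county": ["Nassau", "Nassau", "Suffolk", "Suffolk"],
    "year": [2013, 2014, 2013, 2014],
    "arrests": [120, 150, 90, 110],
})

# Percentage change, the workhorse of many data stories:
# (new - old) / old * 100
nassau = df[df["county"] == "Nassau"].sort_values("year")
old, new = nassau["arrests"].iloc[0], nassau["arrests"].iloc[-1]
pct_change = (new - old) / old * 100  # 25.0

# A pivot table: counties as rows, years as columns
pivot = df.pivot_table(index="county", columns="year", values="arrests")
print(pivot)
```

The point is not to replace Excel in the classroom, but to reassure yourself that nothing here is conceptually harder than a pivot table.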
2. Getting students to think about data stories
It’s best to start with something simple that is immediately relatable to your student audience. Jake Batsell has had great success with his coffee assignment, which gets students involved in surveying, crowdsourcing and mapping.
I’ve done something similar at Hofstra with gas prices. Students create a Google Form that they distribute through social media to survey commuters for gas prices at particular gas stations. The information is collected in a spreadsheet which is then converted to a map using Google Maps. This approach means they learn the value of structuring the questions correctly, so the answers actually make sense. Once students are comfortable with the process, we move on to more complex stories.
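The spreadsheet step of the gas-price exercise boils down to grouping responses by station and averaging the reported prices. Here is a hedged sketch of that aggregation in Python; the rows and column names ("station", "price") are invented stand-ins for whatever the real Google Form collects:

```python
import pandas as pd

# Invented responses standing in for the Google Form results; the real
# form's questions and column names will differ.
responses = pd.DataFrame({
    "station": ["Shell Hempstead", "Shell Hempstead", "Mobil Uniondale"],
    "price": [3.59, 3.61, 3.49],
})

# One row per station with the average reported price, ready for mapping
avg_prices = responses.groupby("station", as_index=False)["price"].mean()
print(avg_prices)
```

Seeing the survey reduced to a one-line group-and-average also drives home why the form's questions must be structured carefully: free-text station names would not group cleanly.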
Ideas for data stories can come from anywhere. We’ve explored stories about crime rates, the job market, student debt, road traffic accidents, bridges, weather patterns, drug arrests and income inequality.
3. Finding the data
Students could do what we used to do when starting out in journalism: head to the library (read: Internet) and look for inspiration. Go and see what other, bigger outfits are doing. Data stories, like news stories, often cross state lines; if it’s news in Detroit, it could be news in Hempstead. All the students need to do is find the data.
I’ve found that it’s best to get students to come up with a general (rather than a specific) idea before searching for a usable data set. Students can often spend too much time developing their perfect story idea before they discover that their desired database does not exist or is stored in an unusable format.
Students and faculty are often surprised at how little usable information is actually available on the web. Data is often stored as web tables or PDFs (otherwise known in my class as Pretty Damned Frustrating), which are practically useless for analysis. And requests to government agencies regularly result in PDFs.
For beginners, there are basically three ways to find data.
- A DIY survey like Google Forms
- Open databases (See list here)
- Freedom of information laws (See list here)
DIY surveys are great for a first assignment. Students can use Google Forms to poll their classmates on topics from coffee prices to gas prices to off-campus rental costs. However, the resulting data sets are usually quite small and are best used for in-class exercises.
Keep an eye out for any data stories appearing in regional, local or national media. Once you see some interesting stories, pounce on them and store links to the original data. If you’re not finding much, use the database search at Enigma to search for open records. I maintain a list of open databases for my course here, but this shorter list is usually enough to get students started:
- Open data from U.S. Govt http://www.data.gov/open-gov/
- The Census http://www.census.gov/
- Drill down http://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml
- Drill down http://quickfacts.census.gov/qfd/index.html
- Drill down http://censtats.census.gov/
- Immigration Stats http://www.dhs.gov/immigration-statistics
- FBI crime reports http://www.fbi.gov/about-us/cjis/ucr/ucr
- College Navigator http://nces.ed.gov/collegenavigator/
- Campaign finances http://www.opensecrets.org/
- Labor statistics http://www.bls.gov/
- EPA data http://www.epa.gov/enviro/
Freedom of Information
The freedom of information process can often appear intimidating, which is why it is critical to introduce it in the classroom and help students gain experience with filing federal or state requests. As I mentioned earlier, our class is working on a project with Newsday, but instructors (and students) can come up with multiple ideas for class projects. This list of documents compiled by Tom Meagher is an excellent starter kit to prompt discussion in the classroom. Also, this account of some seriously weird FOIA experiences will help students get over their fear of getting anything wrong.
4. Converting the data
Unfortunately, you will sometimes have to clean up data from a web table or PDF before you can start analyzing it. The good news is that PDF conversion has improved considerably and there are several excellent third-party solutions, some of which are listed in this document. Web tables are also very easy to manage using the IMPORTHTML function in Google Spreadsheets.
PDFs: Import.io, Tabula and ScraperWiki all offer free PDF conversion. I experimented with this data and found ScraperWiki the most useful: Tabula requires users to highlight the parts of the PDF they want to convert after uploading, while ScraperWiki converts the file during the upload.
Web Tables and Web Scraping: Cindy Royal has an excellent tutorial on advanced web scraping here, and I also provide a step-by-step tutorial for converting PDFs with ScraperWiki and importing web table data with IMPORTHTML.
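For the curious, the idea behind IMPORTHTML can also be sketched in Python with pandas, which has a read_html function that pulls every table out of a page (it needs the lxml or beautifulsoup4 package installed). The table below is an invented stand-in; in practice you would pass a page URL:

```python
from io import StringIO
import pandas as pd

# A stand-in HTML table; with a live page you would pass the URL to
# pandas.read_html instead of this string.
html = """
<table>
  <tr><th>Agency</th><th>Payroll</th></tr>
  <tr><td>Town of Hempstead</td><td>1000000</td></tr>
  <tr><td>Nassau County</td><td>2500000</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # returns a list of DataFrames
payroll = tables[0]  # the first (and here, only) table on the page
print(payroll)
```

Either way, the lesson for students is the same: a web table is structured data in disguise, and one command turns it into a spreadsheet.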
5. Cleaning and analyzing the data
Once you have the data in an Excel spreadsheet, make sure students copy it and put the original away, preferably in a folder marked “original data.” It’s important that everyone gets in the habit of working from a copy because it can be catastrophic to work directly on an original data set. Advise students to use logical names for spreadsheets and to save them in one format, either .xlsx or .xls, not a mix of both.
Now students can begin applying what they’ve learned from the NICAR tutorials and work on sorting, filtering, ratios, rates and percentages in Excel or, alternatively, upload the data to Open Refine for more sophisticated wrangling. Open Refine, which can de-duplicate data, merge and edit variables and even combine datasets, helps students nail down and clean discrepancies in the data. It can also augment existing data by pulling in extra information through links to external sites such as Freebase. See a detailed tutorial here.
Pro-tip #1 Decide on the .xlsx or .xls format before uploading your files, as Open Refine cannot accept different Excel file types at the same time. If you don’t know this, you could wind up staring at the Parsing Data screen for some time (as I did).
Pro-tip #2 Unlike Google Fusion Tables or Spreadsheets, the projects you create in Open Refine are stored on your local computer, so you won’t be able to access them when you change computers. This is great for keeping sensitive information safe but has stopped more than one person in their tracks on realizing their carefully sorted data is sitting back home. Always download each database when you have finished working on it.
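The clean-up jobs Open Refine handles, such as trimming stray whitespace, normalizing capitalization and collapsing duplicate rows, also have one-line equivalents in pandas. A minimal sketch with invented agency data, in case you ever want to show the class what the tool is doing under the hood:

```python
import pandas as pd

# Invented rows with the kinds of problems Refine fixes: stray
# whitespace, inconsistent capitalization and duplicate records.
raw = pd.DataFrame({
    "agency": ["Nassau County ", "nassau county", "Suffolk County"],
    "arrests": [120, 120, 90],
})

clean = raw.copy()
clean["agency"] = clean["agency"].str.strip().str.title()
clean = clean.drop_duplicates()  # the two Nassau rows collapse into one
print(clean)
```

Whichever tool you use, the goal is identical: two rows that mean the same thing should look the same before anyone starts counting.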
6. Interviewing the data
Once the data’s all tidied up, it’s time to start the interview. Teach your students to think of data as a source. In the old days we might have taken the source out for a drink and grilled them in the bar. In fact, multiple investigations got started because of a conversation in the bar. But with data we can study the source for hours looking for that “aha” moment when we know we’ve got a story. Keep asking the questions.
For example, in our data journalism class at Hofstra, which is based in Hempstead on Long Island, we wanted to look at crime statistics from Nassau County. However, the only public database we could find was this PDF on heroin arrests. We used ScraperWiki to convert the PDF into an Excel file and then uploaded the individual sheets to Open Refine, where students were able to start interviewing the data.
Other journalism projects underway in the class include:
- The differences between on- and off-campus living costs at New York universities
- Income inequity between neighborhoods on Long Island
- Liquor licenses by neighborhoods on Long Island
- The state of New York’s bridges
- Drug treatment programs on Long Island
Each student works on a single project for the entire semester, with multiple revisions and, in some cases, multiple project ideas each week. The projects are gradually taking shape, and the aim is for each student to publish a detailed project with interviews, interactive graphics and analysis by the end of the semester. So far, the class is on track with the assigned readings and weekly quizzes, and the projects are beginning to look impressive. And I am quietly confident that most of the students will no longer look for somewhere to hide when someone mentions math.
Assistant professor Kelly Fincham teaches journalism at Hofstra University, where she developed the award-winning Long Island Report. She has worked in journalism in New York, Dublin and Sydney and is the founding editor of Irish Central. Her website is at www.kellyfincham.com and she’s on Twitter @KellyFincham.