Part 1: Introduction

Last month, I had the privilege of participating in the Mozilla-Knight Learning Lab. This four-week online lecture series pulled together 60 individuals interested in journalism and technology and got them to sit together watching an array of guest lecturers. The end product from each participant was a project proposal.
Since it looks like I’m going to be one of the lucky ducks who gets to hack away on my proposed idea in Berlin this September, I wanted to share it here. I would love feedback, of course, but if you know anybody who might be interested in contributing content or using the platform, please put us in touch.
Part 2: The Technology
Scraping, Slicing, and Scrubbing C-SPAN
How do you get from a TV channel to a rich video archive, and how do you get there automatically? The goal of my project, called ATTN–SPAN, is to convert C-SPAN into a series of overlapping video segments that are identified in terms of state, politician, topic, party, action, and legislative item. Some of this is straightforward, and some of it might be impossible, but here’s an overview of the planned nuts and bolts:
- DirecTV offers TV content in a format that’s easy to record digitally, and VLC is a free tool that can do that recording. Combine the two, and we can download C-SPAN streams into individual files that are primed and ready for analysis.
- Once a video file is in our clutches, we can use VLC once again to separate out the video from the closed-captioning transcript. (A rough sketch of this capture-and-extraction step appears just after this list.)
- Now we have a transcript and a raw video file. Next, we register all of this information in a database so that we can look it up later, and then convert the video file into streaming-friendly formats and store them alongside the original recording.
- C-SPAN consistently shows a graphic on the bottom of the screen that says who is talking, their state, their party, and what is being debated. By using a technique called optical character recognition (OCR), we can pull this text out of the video image and add it to our database, so that we can access all of this information for any moment in the video. (See the second sketch after this list.)
- At this point, we have most of the information we need, but there’s still room for fine-tuning. We can use audio levels and the closed-captioning transcript to try to identify moments of inactivity, normal dialogue, and heated dialogue. (See the third sketch after this list.)
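To make the first steps concrete, here is a minimal capture-and-extraction sketch. It assumes VLC’s command-line interface (cvlc) is available and leans on CCExtractor rather than VLC for the caption track, since a dedicated tool is easier to script; the stream URL, flags, and file naming are illustrative guesses, not a tested configuration.

```python
# A minimal capture sketch, assuming cvlc (VLC's command-line interface) and
# CCExtractor are installed. The stream URL, flags, and file naming are
# illustrative guesses, not a tested configuration.
import subprocess
from datetime import datetime

def record_stream(stream_url, minutes):
    """Dump the raw MPEG-TS stream to disk for a fixed amount of time."""
    out_path = datetime.now().strftime("cspan-%Y%m%d-%H%M.ts")
    subprocess.run([
        "cvlc", stream_url,
        "--run-time", str(minutes * 60),  # stop after N minutes
        "--sout", "#standard{access=file,mux=ts,dst=%s}" % out_path,
        "vlc://quit",  # make VLC exit once the recording finishes
    ], check=True)
    return out_path

def extract_captions(ts_path):
    """Pull the closed-captioning track out as an .srt transcript."""
    srt_path = ts_path.replace(".ts", ".srt")
    subprocess.run(["ccextractor", ts_path, "-o", srt_path], check=True)
    return srt_path

if __name__ == "__main__":
    recording = record_stream("http://example.invalid/cspan", minutes=60)  # placeholder URL
    print("transcript saved to", extract_captions(recording))
```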
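For the on-screen graphic, here is roughly what the OCR pass could look like. OpenCV and pytesseract are my tool choices for this sketch, not part of the proposal itself, and the assumption that the graphic lives in the bottom quarter of the frame would need checking against real footage.

```python
# A minimal OCR sketch, assuming OpenCV (cv2) and pytesseract are installed.
import cv2
import pytesseract

def read_chyron(video_path, timestamp_sec):
    """Grab one frame and OCR the lower-third graphic naming the speaker."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, timestamp_sec * 1000)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return None
    height = frame.shape[0]
    lower_third = frame[int(height * 0.75):, :]  # assumed location of the graphic
    gray = cv2.cvtColor(lower_third, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary).strip()
```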
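And for the fine-tuning step, a first stab at labeling activity by loudness. This assumes the audio has already been exported as a mono 16-bit WAV, and the thresholds are invented; they would need calibration against real recordings.

```python
# A rough energy-based labeling pass over a mono 16-bit WAV file.
import wave
import numpy as np

def label_audio(wav_path, window_sec=5.0, quiet_rms=200, heated_rms=4000):
    """Tag each window of audio as inactive, dialogue, or heated by loudness."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    window = int(rate * window_sec)
    labels = []
    for start in range(0, len(samples) - window, window):
        chunk = samples[start:start + window].astype(np.float64)
        rms = np.sqrt(np.mean(chunk ** 2))  # root-mean-square loudness
        if rms < quiet_rms:
            label = "inactive"
        elif rms > heated_rms:
            label = "heated"
        else:
            label = "dialogue"
        labels.append((start / rate, label))
    return labels
```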
These steps are enough to split up and categorize C-SPAN footage into an organized video database, but there are still more ways to flag special moments in the footage. For example, we may want to identify changes in speaker emotion in order to give our algorithms the ability to craft more engaging episodes. This is possible thanks to the Affective Computing group at the MIT Media Lab, which has developed several tools that perform emotional analysis using facial recognition.
We may also want to identify specific legislative action (e.g., “calling a vote”). This could be accomplished by looking for keywords in the transcript (e.g., “call a vote”) and possibly through common patterns in the audio signal. (Maybe there are identifiable sounds, such as a gavel hitting the table.) Both of these concepts require additional research.
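As a starting point, keyword spotting over the caption cues could look something like this. The phrase list is a placeholder that would grow with research, and the cue format (timestamped lines parsed from the .srt transcript) is my own invention for the sketch.

```python
# A first stab at keyword spotting. Cues are assumed to be a list of
# (start_seconds, text) pairs parsed from the .srt transcript.
import re

ACTION_PHRASES = {
    "vote_called": re.compile(r"\bcall(?:ing)? (?:a|the) vote\b", re.IGNORECASE),
    "time_yielded": re.compile(r"\byields? (?:back )?(?:my|his|her|the) time\b", re.IGNORECASE),
}

def find_actions(cues):
    """Return (timestamp, action) pairs wherever an action phrase appears."""
    hits = []
    for start, text in cues:
        for action, pattern in ACTION_PHRASES.items():
            if pattern.search(text):
                hits.append((start, action))
    return hits

# find_actions([(3720.0, "The clerk will call a vote on the amendment.")])
# -> [(3720.0, 'vote_called')]
```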
Creating a Profile and Constructing an Episode
If video events are the building blocks, then viewer interests are the glue. The creation of a personalized episode requires two things: a user account and a context. The user account provides general information like where you live, what issues you have identified as important, and (if you are willing to connect with Twitter or Facebook) what issues your circles have been discussing lately.
The context comes from time and cyberspace. Every night, after Congress closes its gates, your profile is used to create a short, rich video experience designed to contain as much relevant content from that day as possible. At this point, you might get an email begging you to watch, or maybe you log in on your own because you’re addicted to badges and points and you want as much ATTN–SPAN karma as you can get.
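Under the hood, episode construction could be as simple as scoring segments against the profile and greedily filling a time budget. Every field name below is hypothetical, and a real scoring function would weigh the social signals too.

```python
# A naive episode builder: score each segment against the viewer's profile,
# then greedily fill a time budget. All field names here are hypothetical.
def build_episode(segments, profile, budget_sec=600):
    """segments: dicts with 'duration', 'state', and 'topics' keys.
    profile: dict with the viewer's 'state' and a set of 'issues'."""
    def score(seg):
        points = 2 if seg["state"] == profile["state"] else 0  # your delegation first
        points += len(set(seg["topics"]) & profile["issues"])  # shared issues
        return points

    episode, used = [], 0.0
    for seg in sorted(segments, key=score, reverse=True):
        if score(seg) == 0:
            break  # nothing relevant left
        if used + seg["duration"] <= budget_sec:
            episode.append(seg)
            used += seg["duration"]
    return episode
```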
There’s another way to access this content though, and that is through the websites you visit anyway. Imagine if you could read an article about the national debt on The New York Times (or in a chain email) and actually see quotes from your own senators in the report. What if you could supplement the national report with a video widget that lets you browse what your House members had to say when they controlled the floor during the debt debates?
From a technical perspective, this isn’t that far-fetched. Truth Goggles, one of my other projects, is a bookmarklet that will analyze the web page you are viewing, fact-check it, and rewrite the content to highlight truths and lies. That seemingly impossible feat is fairly similar to what I’m proposing here.
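Here is a hypothetical server-side sketch of that idea: given the topics an article covers and the reader’s legislators, pull matching clips out of the segment database. The (speaker, topic) index schema is invented purely for illustration.

```python
# A hypothetical lookup for the embedded widget. The (speaker, topic) index
# schema is invented for illustration and not part of any existing system.
def clips_for_article(article_topics, reader_legislators, segment_index):
    """segment_index: maps (speaker_name, topic) to a list of clip URLs."""
    clips = []
    for name in reader_legislators:
        for topic in article_topics:
            clips.extend(segment_index.get((name, topic), []))
    return clips

# clips_for_article(["national debt"], ["Jane Doe", "John Roe"], index)
# would return every archived clip of either legislator speaking on the debt.
```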
Adding Rich Information
Once an episode is pieced together, we can look up the information surrounding the video to know who is talking and what they’re talking about. What else can be added, and how do we get it? Existing APIs (application programming interfaces) offer some good options:
- Contact Information – Thanks to the Sunlight Labs Congress API, it’s possible to get the contact information for any member of Congress on the fly (a hedged example follows this list). Thanks to VoIP services, it’s possible to create web-based hooks to call those people with the click of a button.
- Campaign Contributions – The New York Times offers a Campaign Finance API which can help you understand where the person onscreen gets his or her money.
- Voting Records – The New York Times also offers a Congress API that makes it possible to look up vote outcomes on related bills, as well as the active speaker’s voting record.
- Truth and Lie Identification – My Truth Goggles project can be easily adapted to work with snippets from video transcripts. This will allow ATTN–SPAN to take advantage of fact-checking services like PolitiFact and NewsTrust.
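As an example of what “on the fly” looks like, here is a sketch of pulling contact information from the Sunlight Labs Congress API. The endpoint and response fields follow Sunlight’s documentation as best I know it, so verify them before relying on this; you will also need a (free) API key.

```python
# A sketch of an on-the-fly lookup against the Sunlight Labs Congress API.
# Endpoint and response fields are my best understanding of Sunlight's docs;
# double-check them before building on this.
import json
import urllib.request

def legislators_for_state(state, api_key):
    """Fetch name and phone number for every legislator from a state."""
    url = ("http://services.sunlightlabs.com/api/legislators.getList.json"
           "?state=%s&apikey=%s" % (state, api_key))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [
        {
            "name": "%s %s" % (leg["legislator"]["firstname"],
                               leg["legislator"]["lastname"]),
            "phone": leg["legislator"]["phone"],
        }
        for leg in data["response"]["legislators"]
    ]
```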
This is a good start, but I would also like to show links to related news coverage and create socially driven events based on community sentiment (for instance, to track moments that caused people to get upset or happy). This won’t come for free, but it should be accessible given the right interface design.
Part 3: A Note to the Newsies
So that’s the idea and the plan. What’s the value?
It seems plausible that ATTN–SPAN, a system that analyzes primary source footage and pulls out any content that’s related to a particular beat, could be useful as a reporter’s tool, but what about your subscribers? ATTN–SPAN can augment an individual article so that it hits everybody close to home. Suddenly, one article becomes as effective as two dozen. Moving past text, for larger organizations with a significant amount of video footage, ATTN–SPAN can be tweaked to use your programming instead of (or in addition to) C-SPAN.
At this point, I have to warn you that this isn’t the first, nor will it be the last, project to work with C-SPAN. A 2003 demo out of the Media Lab used C-SPAN as one of several information sources in a platform aimed at providing citizens with Total Government Awareness. Metavid, the platform I used in my initial prototype, already makes C-SPAN more accessible by enabling searches and filters. The list surely goes on.
So why is this a more powerful project? Well, the real goal of ATTN–SPAN isn’t to get more people watching C-SPAN. In fact, I tricked you: This project isn’t about government awareness at all. It’s actually part of an effort to make indisputable fact (“blunt reality” and “primary source footage”) a more prominent part of the media experience without requiring additional effort from the audience.
Newsrooms do an amazing job of reporting events and providing insight, but for deeper stories, there simply isn’t enough time or money to cover everybody’s niche without going beyond the average person’s attention span.
Thus ends my pitch.
The code for both prototypes mentioned in this post can be found on GitHub: ATTN–SPAN and Truth Goggles. Please forgive any dirty hacks. I would be thrilled if anybody wants to offer suggestions or even collaborate. On that note, please get in touch on Twitter @slifty. This proposal was originally posted on slifty.com.