
3 Difficult Document-Mining Problems that Overview Wants to Solve

The Overview project is an attempt to create a general-purpose document set exploration system for journalists. But that’s a pretty vague description. To focus the project, it’s important to have a set of test cases — real-world problems that we can use to evaluate our developing system.

In many ways, the test cases define the problem. They give us concrete goals, and a way to understand how well or poorly we are achieving those goals. These tests should be diverse enough to be representative of the problems that journalists face when reporting on document sets, and challenging enough to push us to innovate. There’s also value in using material that is already well-studied, so we can compare the results using Overview to what we’ve already learned using other techniques.

With that in mind, we’ve been scouring the document set lore and the AP’s own archives to find good test data. Here are three types of problems we’d like Overview to address, and some document sets that provide good examples of each.

A large set of structured documents — the Wikileaks files

Wikileaks published the Afghanistan and Iraq war logs data sets last year, and recently the full archive of U.S. diplomatic cables has also become available. All three archives are the same basic type: hundreds of thousands of documents in identical format.

Each document has the same set of pre-defined fields, such as date, location, incident type, originating embassy, etc. But this isn’t just a series of fill-in-the-blank forms, because each document also includes a main text field that is written in plain English (well, English with a lot of jargon). We call these types of documents “semi-structured,” and part of the analysis work here is understanding the relationship between the free-form text and the structured fields.
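To make the idea concrete, here is a minimal sketch (in Python, with invented field names rather than the actual war-log or cable schema) of what one semi-structured record might look like: a handful of machine-readable fields plus a free-text narrative that needs natural-language processing before it can be analyzed.

```python
# One semi-structured record: structured fields plus a free-text narrative.
# Field names and values here are hypothetical, for illustration only.
record = {
    "date": "2009-07-14",
    "region": "RC EAST",
    "incident_type": "Explosive Hazard",
    "summary": "Patrol discovered an IED during route clearance; EOD team dispatched.",
}

# The structured fields can be filtered, counted, or charted directly...
structured = {k: v for k, v in record.items() if k != "summary"}

# ...while the free text must be tokenized before any topic analysis.
tokens = [w.lower().strip(".,;") for w in record["summary"].split()]

print(structured)
print(tokens[:8])
```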

For example, our previous visualizations of the war logs use the topics discussed in the text to cluster the dots that represent each document, but the color is from the “incident type” field: red for “explosive hazard,” light blue for “enemy action,” dark blue for “criminal event,” and so on. The human eye can interpret color and shapes at the same time, so this allows us to literally see the relationship between topics and incident types.
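As a rough illustration of that idea, the sketch below colors each document point by its incident type. It assumes each document already has an (x, y) position computed from its text (for example by clustering or dimensionality reduction); the layout algorithm and palette in our actual visualizations differ, and the data here is made up.

```python
import matplotlib.pyplot as plt

# Assume each document already has a 2-D position derived from its text,
# plus a structured "incident type" field. Values are invented.
docs = [
    {"x": 0.1, "y": 0.4, "incident_type": "Explosive Hazard"},
    {"x": 0.8, "y": 0.2, "incident_type": "Enemy Action"},
    {"x": 0.5, "y": 0.9, "incident_type": "Criminal Event"},
]

# Position encodes topic similarity; color encodes the structured field.
colors = {"Explosive Hazard": "red",
          "Enemy Action": "lightblue",
          "Criminal Event": "darkblue"}

plt.scatter([d["x"] for d in docs],
            [d["y"] for d in docs],
            c=[colors[d["incident_type"]] for d in docs])
plt.title("Topic layout colored by incident type")
plt.show()
```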

There are lots of other large, homogeneous, semi-structured document sets of interest to journalists. Corporate filings are a prime example, but we might also want to analyze legislative records (as the AP did to learn how “9/11” was invoked in the U.S. Congress over the last 10 years), or the police reports of a particular city.

The key feature of this type of document set is that all the documents are the same type, in the same format, and there are a lot of them. The Wikileaks war logs and cables are a good specific test because they are widely available and already well-studied, so we can see whether Overview helps us see stories that we already know are there.

Communications records — the Enron emails

Federal investigators released a large set of internal emails after the spectacular collapse of the Enron corporation in 2001. The Enron corpus contains more than 600,000 emails written by 158 different people within the company. It has been widely used both to study this specific case of corporate wrongdoing and to explore broader principles and techniques in social network analysis.

The simplest way to visualize a huge pile of emails is to plot each email address as a node and draw edges when one person emailed another. That produces a plot of the social network of communicators, such as the one from Stanford University assistant professor Jeffrey Heer’s Exploring Enron project.
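A minimal version of that kind of network can be built with an off-the-shelf graph library. The sketch below uses Python’s networkx and made-up addresses; it simply adds (and weights) an edge for every sender–recipient pair found in the message headers.

```python
import networkx as nx

# Hypothetical (sender, recipients) pairs extracted from email headers.
emails = [
    ("alice@enron.com", ["bob@enron.com", "carol@enron.com"]),
    ("bob@enron.com",   ["alice@enron.com"]),
    ("carol@enron.com", ["dave@enron.com"]),
]

G = nx.DiGraph()
for sender, recipients in emails:
    for recipient in recipients:
        # Weight each edge by how many messages flowed between the two addresses.
        if G.has_edge(sender, recipient):
            G[sender][recipient]["weight"] += 1
        else:
            G.add_edge(sender, recipient, weight=1)

print(G.number_of_nodes(), "people,", G.number_of_edges(), "relationships")
```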

But there are other ways to understand this data set. For example, this plot excludes the element of time. Perhaps a group of conspirators gradually stopped talking to outsiders, or maybe power shifted from one branch of the company to another over time. These sorts of questions are addressed by dynamic network analysis. You could also ignore the social network completely and try to plot the threads of conversation, where one message refers back to an earlier one by someone else, as IBM’s thread arc project did.
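One simple way to bring time back into the picture is to slice the same graph by period. The sketch below (again networkx, again with invented data) builds one graph per month, so that comparing successive graphs shows how the communication network shifts over time — one crude form of dynamic network analysis.

```python
from collections import defaultdict
import networkx as nx

# Hypothetical timestamped messages: (month, sender, recipient).
messages = [
    ("2001-08", "alice@enron.com", "bob@enron.com"),
    ("2001-09", "alice@enron.com", "carol@enron.com"),
    ("2001-09", "bob@enron.com",   "carol@enron.com"),
]

# Build one graph per month; differences between months reveal how the
# pattern of who talks to whom changes over time.
graphs = defaultdict(nx.DiGraph)
for month, sender, recipient in messages:
    graphs[month].add_edge(sender, recipient)

for month in sorted(graphs):
    print(month, graphs[month].number_of_edges(), "edges")
```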

Email dumps are increasingly common, especially with the recent uptick of hacking by collectives such as Anonymous and LulzSec. But the concepts and tools used to analyze email can be applied to a broader category: any record of communications between a set of people. These could be emails, IM transcripts, Facebook messages, or a large set of Twitter traffic. To be useful for this type of analysis, each record must contain at least the date, the sender, the recipient(s), and the message itself. There might also be things like subject lines or references to previous messages, which can be very useful in tracing the evolution of a conversation.
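As a sketch of that minimal record format (the field names are mine, not any standard), any of these communication sources could be normalized into something like the following before analysis:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# A minimal normalized communication record; field names are illustrative.
@dataclass
class Message:
    date: datetime
    sender: str
    recipients: List[str]
    body: str
    subject: Optional[str] = None
    in_reply_to: Optional[str] = None   # reference to an earlier message, if any

msg = Message(
    date=datetime(2001, 10, 15, 9, 30),
    sender="alice@enron.com",
    recipients=["bob@enron.com"],
    body="Can we talk about the Q3 numbers before the board call?",
    subject="Q3 numbers",
)
print(msg.sender, "->", msg.recipients)
```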

Messy document dumps — the BP oil spill records

Freedom of Information laws don’t require governments to organize the documents they give back. In August of last year, the AP asked several U.S. federal agencies for all documents relating to the production of the report “BP Deepwater Horizon Oil Budget: What Happened to the Oil?” And we got them, in a 7,000-page PDF file. There are early drafts of the report, meeting minutes, email threads, internal reports, spreadsheets … The first step in mass analysis of this material is simply sorting it into categories.

Document classification algorithms can be used to automate this process, by scanning the text of each page and determining if it’s an email, a spreadsheet, or some other type of document. Then we can proceed with specialized visualization of each of these types of documents. For example, we could visualize the social network of the extracted emails.
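A rough sketch of that kind of classifier follows, using scikit-learn’s standard text-classification tools on a handful of hand-labeled pages. The labels and training text here are invented for illustration; in practice you would hand-label a sample of pages from the dump and train on those.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A few hand-labeled example pages (invented here for illustration).
pages = [
    "From: jane@noaa.gov To: bob@epa.gov Subject: oil budget draft ...",
    "Q1 Q2 Q3 Q4 barrels recovered dispersed evaporated total ...",
    "Meeting minutes, attendees, action items, next meeting date ...",
]
labels = ["email", "spreadsheet", "minutes"]

# TF-IDF features feeding a simple Naive Bayes classifier.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(pages, labels)

# Predict the category of a new, unlabeled page.
print(classifier.predict(["From: director@doi.gov Subject: final report sign-off"]))
```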

This sorting process isn’t itself a visualization, because the output is several different piles of sorted documents, not a picture. But it’s an extremely important task, because a huge part of the work in any data journalism project is just getting everything in the right format and ready for the real analysis. Although Overview is designed for visualization, it needs to include powerful tools for data preparation and cleanup.

The Wikileaks and Enron test cases each involve a large collection of identically formatted documents. The BP oil spill documents are different, because they’re anything but homogeneous. This is an important test case because it represents a problem that comes up often in journalism, especially when we want to understand what we got back from a big Freedom of Information request.

Anything else?

If Overview could help with just these three problems, it would be an extremely valuable tool for journalists. But we need to make sure they’re the right problems. Are you trying to report on a large set of documents that isn’t anything like these cases? Please let us know!
