
How the AP’s Overview Turns Documents Into Pictures

Overview produces intricate visualizations of large document sets — beautiful, but what do they mean? These visualizations are saying something about the documents, which you can interpret if you know a little about how they’re plotted.

Same documents, different visualizations

There are two visualizations in the current prototype version of Overview, and both are based on document clustering.

The first is the items plot, which grew out of the proof-of-concept system we presented a year ago. Every document is a dot. Similar documents get pulled together to form visible groups, that is, clusters. All the dots start grey, but become colored as you apply tags while exploring the document set. You can click on individual documents to view them, or select a whole region of documents to see what they have in common.

Overview also has a “tree” view. Documents are still organized into clusters, but each “node” in the tree is an entire cluster, not just a single document. Also, the clusters are hierarchical, meaning that the larger clusters (higher up the tree) contain all the documents within their child clusters (lower down the tree). The bottom of the window displays the top words and two-word phrases from the selected nodes. In this case, the selected node contains press releases discussing oil industry subsidies.

The tree view and the items plot show the same thing, just in different ways. When you select documents in one view, the same documents are selected in the other. They’re two different ways of looking at the same set of clusters: hierarchically categorized, or laid out visually.

Extracting key words

All of Overview’s clustering depends on grouping similar documents together, but what does that mean? Conceivably, two documents might be “similar” because they were written by the same person, talk about the same event, or came from the same place. There are as many potential categorization schemes as there are stories.

But Overview doesn’t know any of this. Instead, it breaks down documents by words and short phrases. It starts by counting how many times each word appears in each document. Frequent words are “key” words. But the language processing also discounts words which appear in too many documents. This gets rid of common English words like “the” and “is,” but also suppresses words which are very common in your specific set of documents. If most documents from a set of police reports contain the word “crime,” Overview will mostly ignore that word.
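The weighting described above resembles what is commonly called TF-IDF. Here is a minimal Python sketch of the idea, assuming plain lowercase, whitespace-separated words and the textbook log(N / document frequency) discount. Overview’s actual tokenizer and formula may differ, and the function name `keyword_weights` and the sample reports are made up for illustration.

```python
import math
from collections import Counter

def keyword_weights(docs):
    """Return one {word: weight} dict per document (a TF-IDF-style sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            # Frequent in this document -> higher weight;
            # present in many documents -> lower weight.
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical police reports: "crime" and "report" appear in every one.
reports = [
    "crime report burglary downtown burglary",
    "crime report assault harbor",
    "crime report fraud bank fraud",
]
weights = keyword_weights(reports)
```

In this toy set, “crime” and “report” get a weight of zero because they appear in every document, while “burglary” is the highest-weighted word in the first report.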

Two documents are similar if they have overlapping sets of key words. A cluster is a set of documents that are all pretty similar to one another, and less similar to all other documents. This sounds insanely naive; after all, this word counting process throws away pretty much all of the syntactic information in the text, including word order. It can’t differentiate between “police hit protesters” and “protesters hit police.” But it can group together all the documents that talk about police and protesters, and that by itself is useful enough. In fact, variations of this basic approach, called the vector space model, underlie most text search engines.
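To see why word order disappears, here is a short Python sketch that compares two texts as word-count vectors using cosine similarity, a common measure in the vector space model. Whether Overview uses exactly this measure is an assumption, the sketch uses raw counts rather than discounted key-word weights, and the helper name `cosine_similarity` is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two word-count vectors (0 = no shared words)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same_words = cosine_similarity("police hit protesters", "protesters hit police")
unrelated = cosine_similarity("police hit protesters", "airlines raise fees")
```

Both orderings of the police sentence produce the same word counts, so `same_words` comes out as 1 (up to floating-point rounding), while `unrelated` is 0 because the two texts share no words.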

Where do those documents go?

This simple, word-based technique determines where each document is placed in the visualizations. In the items plot, Overview tries to place similar documents close together. Collections of documents with similar words (and two-word phrases) naturally form groups, or clusters. Clusters with similar topics tend to be nearby. But that’s the extent of the process; the exact angle or position of each document and cluster doesn’t mean anything at all. In fact, it depends somewhat on the initial, random position of the documents, and every time you run Overview you will get a slightly different visualization: the same clusters will show up, but possibly in different places.

The tree view finds not only clusters but sub-clusters. In the example above, the yellow branch of the tree contains the key words “fees, credit card, consumers, airlines.” The left sub-branch has key words “fees, credit card, consumers” and the right sub-branch has key words “fees, airlines, surcharges.” One sub-cluster is about credit card fees, while the other is about airline fees. They’ve been grouped into one larger cluster because they both contain many occurrences of the word “fees.”
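The relationship between a cluster and its sub-clusters can be sketched as a small tree structure in Python. This is only an illustration of the hierarchy, not Overview’s clustering algorithm: the `ClusterNode` class, its methods, and the toy documents are all hypothetical, and the top words here come from raw counts rather than discounted key-word weights.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ClusterNode:
    """One node in a cluster tree: its own documents plus any sub-clusters."""
    documents: list = field(default_factory=list)  # documents held directly
    children: list = field(default_factory=list)   # child ClusterNode objects

    def all_documents(self):
        """A parent cluster contains every document in its child clusters."""
        docs = list(self.documents)
        for child in self.children:
            docs.extend(child.all_documents())
        return docs

    def top_words(self, n=3):
        """The n most frequent words across all documents under this node."""
        counts = Counter()
        for doc in self.all_documents():
            counts.update(doc.lower().split())
        return [word for word, _ in counts.most_common(n)]

# Toy documents loosely modeled on the "fees" example.
credit_cluster = ClusterNode(documents=[
    "fees credit consumers fees",
    "credit fees consumers",
])
airline_cluster = ClusterNode(documents=[
    "fees airlines surcharges fees",
    "airlines fees surcharges",
])
fees_cluster = ClusterNode(children=[credit_cluster, airline_cluster])
```

Here `fees_cluster.all_documents()` returns all four documents, and `fees_cluster.top_words(1)` is `["fees"]`, the one word both sub-clusters share heavily, while each sub-cluster’s own top words also include its distinguishing terms.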

For more information, see the discussion of our WikiLeaks visualization. Or if you’re really into all the gory details — including how Overview creates these visualizations efficiently, even for large document sets — we’ve recently released a technical report in collaboration with the University of British Columbia.

With a little practice and experimentation, you can learn how to read Overview’s visualizations effectively. If you want to use Overview for your own work, it’s important to get a feel for what these visualizations can tell you, and what they can’t.

Jonathan Stray
