Overview is a project to create an open-source document-mining system for investigative journalists and other curious people. We’ve written before about the goals of the project, and we’re developing some new technology, but mostly we’re stealing it from other fields.
The following are some of the best ideas we saw in 2011, the data-mining work that we found most inspirational. Many of these links are educational resources for learning about specific technologies. Some of this work illuminates how algorithms and humans treat information differently. Others are just amazing, mind-bending work.
1. What do your connections say about you? A lot. It is possible to accurately predict your political orientation solely on the basis of your network on Twitter. You can also infer gender and other attributes from public information.
2. Free textbooks from Stanford University. “Introduction to Information Retrieval” teaches you how a search engine works, in great detail. “Mining Massive Data Sets” covers a variety of big-data principles that apply to different types of information.
3. We’re not above having a list of lists. Here’s the Data Mining Blog’s top 5 articles. Most of these are foundational, covering basic philosophy and technique such as choosing variables, finding clusters, and deciding what you’re looking for.
4. The MINE technique looks for patterns between hundreds or thousands of variables — say, patterns of gene expression inside a single cell. It’s very general, and finds not only individual relationships but networks of cause and effect. Here’s a nifty video, here’s the original paper, and here’s one statistician’s review.
5. This is one of those papers that really changed the way I look at things. How do we know when a data visualization shows us something that is “actually there,” as opposed to an artifact of the numbers? “Graphical Inference for Infovis” provides one excellent answer, based on a clever analogy with numerical statistics.
6. Lots of text-mining work uses “clustering” or “classification” techniques to sort documents into topics. But doesn’t a categorization algorithm impose its own preconceptions? This is a deep issue, which you might think of as “framing” in code. To explore this question Justin Grimmer and Gary King went meta with a system that visualizes all possible categorizations of a document set, and how they relate.
7. A few years ago Google showed that the number of searches for “flu” was a great predictor of the actual number of outbreaks in a given location, faster and more specific than the Centers for Disease Control and Prevention’s own surveillance data. The team has now expanded the technique into Google Correlate, which instantly scans through petabytes of data to find search terms that follow any user-supplied time series. Here’s New Scientist taking it for a test drive.
8. Not content with free professional textbooks, Stanford has created two free online courses for machine learning and natural language processing. Both are live-streamed lecture series taught by experts, with homework. Learning these intricate technologies has never been easier.
9. Lots of people have speculated about the role of social media in protest movements. A team of researchers looked at the data, analyzing a huge set of tweets from the “May 20” protests in Spain last year. How do protests spread from social media? Now we have at least one solid answer.
10. And the craziest data-mining link we ran across in 2011: IBM’s DeepQA project, which beat human Jeopardy champions. This project looks into an unstructured database to correctly answer about 80% of all general questions posed to it, in just a few seconds. Here’s a TED talk, and here’s the technical paper that explains how it works. I can’t tell you how badly I want one of these in the newsroom. If enough journalist hackers build on each other’s work, maybe one day …
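For the curious, here is a toy version of the idea behind item 7: given a time series, rank candidate search terms by how closely their search volume tracks it. This is only a sketch. It uses plain Pearson correlation as a stand-in, which is an assumption on our part and not a description of how Google Correlate works internally. The function names `pearson` and `rank_terms` are hypothetical, and the weekly counts are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no trend to match
    return cov / sqrt(var_x * var_y)


def rank_terms(target, term_series):
    """Sort search terms by correlation with the target series, best first."""
    scored = [(term, pearson(target, series))
              for term, series in term_series.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up weekly counts, purely for illustration.
flu_cases = [10, 12, 30, 55, 40, 15]
searches = {
    "cough medicine": [100, 120, 300, 550, 400, 150],
    "beach holidays": [500, 480, 300, 100, 200, 450],
    "tax forms": [50, 50, 50, 50, 50, 50],
}

ranking = rank_terms(flu_cases, searches)
for term, score in ranking:
    print(f"{term}: {score:+.2f}")
```

With this invented data, the term whose volume rises and falls with the flu counts comes out on top, and the one that moves in the opposite direction lands at the bottom. A real system would need to handle far more terms than a simple loop can, and a high correlation alone says nothing about cause.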
Happy data mining! We’ll be releasing our own prototype document-mining system, and the source, at the NICAR conference next month. If these are the sorts of algorithms you like to play with, we’re also hiring programmers who want to bring these sorts of advanced techniques within everyone’s reach.