Across industries, data scientists are a hot and scarce commodity, helping organizations collect, clean, and analyze the growing sea of “big data.”
Journalism has an acute need for data analysis, especially to monetize cutting-edge features and keep large, powerful institutions accountable.
But data scientists don’t come cheap, and the media industry is still hobbling toward an unstable financial future. WSJ.com recently reported that many traditional scientists are forgoing academia for higher-paying data science positions at Yelp, Airbnb, Etsy and other Internet behemoths. The New York Times was the lone media organization noted.
Adding to the financial constraints of attracting data talent, media companies are also asking a lot of the people they’re looking for. Job advertisements for media data scientists often ask for well beyond the already multi-faceted role: a full-stack developer, statistician, and visual designer who has knowledge of government data sources and the mind of an investigative journalist. It’s reminiscent of when online video exploded and media outlets recruited one person to be the producer, videographer, and editor, or “preditor.” That’s a lovely title indeed. Yes, J-schools, such as Columbia’s Lede Program, are creating new degrees to churn out graduates who combine the data scientist and journalist roles. Over time, we’ll find out whether one person can handle cleaning data, coding, and being an effective reporter and storyteller.
Last week at the Strata Hadoop World 2014 conference in New York, it quickly became apparent that the problem of filling this multi-faceted role is not limited to media companies. Publications therefore have much to learn from what’s happening with data talent across other industries.
To illustrate the conundrum, see the Venn diagram below by Steve Geringer, a machine learning consultant. He adapted it from the original by data scientist Drew Conway, whose skill set is certainly unicorn-like.
Debate: Do Data Scientists Need to Know How to Code?
To avoid the futile search for a “unicorn,” what are the key skills a data scientist should have?
Whether they need to code was the subject of an Oxford-style debate at Strata Hadoop World 2014. Arguing that they do not, Joseph Adler of Interana and Scott Nicholson of Poynt defined a data scientist as one who discovers insights from data. New tools, such as Paxata, Tamr, and Trifacta, allow for cleaning and analyzing data without coding. They defined writing code as creating original commands using a programming language, not typing pre-existing commands into developed software programs, such as Excel. They conceded, however, that to be an efficient data scientist, especially one producing new products, coding is required. Adler noted that top data science teams are hiring Ph.D.s for their analytical skills; any necessary coding is taught on the job.
‘Team yes code’ was Hilary Mason of Accel Partners and Lucian Lita of Intuit. Mason emphasized that data scientists need to be able to interact with the systems where data live, which are computers, and that requires coding, however limited. Lita argued that solving ambiguous problems with heterogeneous data, by building models and going into production, requires an iterative process of agile development. Coding is fundamental to agile; otherwise, “we would have to be dependent on others to get the inputs,” he said.
In the media context, which viewpoint applies depends on the outlet’s goals. Does the media outlet require collecting original data and the labor-intensive cleaning that comes with it? Does it need to build custom products to distribute the data? Then yes, a data scientist who can code and conduct statistical analysis is key. Outsourcing is a more affordable option, but to report effectively on the critical issues of our time, the team needs to be in-house.
Outlets that don’t want to collect original, messy data can have journalists with statistical skills use clean datasets from government, academic, and non-partisan sources, analyze them with statistical software, and publish the results with out-of-the-box tools such as Tableau and Mapbox. The conference trade show had a plethora of new ‘business intelligence’ vendors that offer both analysis and visualization tools. Conferees noted that open source and proprietary tools have exploded in the last two years, so who knows what will be possible in a year or two.
Training Data Scientists Boot Camp Style
For a media organization looking to recruit a data scientist, Laurie Skelly provided an excellent overview of both the soft and hard skills needed. Skelly is a data scientist at DataScope and an instructor at the newly formed Metis Data Science Bootcamp.
The soft skills include curiosity, creativity, grit, and the humility to admit when things don’t work. The full-time, 12-week boot camp format provides group work and community support to fend off the isolation common to “impostor syndrome,” and the short deadlines seek to prevent perfectionism. “Rather than already having the expertise, it’s more important that you can ask questions and not be blocked by something you’re unfamiliar with,” Skelly said. Good advice indeed.
The hard skills are framed in terms of the following project-based model, created by Hilary Mason and Chris Wiggins, which spans development, statistics, web, and communications.
Project Phases to Conduct Data Science
Create a Goal
Skills & Tools
Machine Learning / Statistics
Domain Awareness (in a new field for the student)
Data Science Team Roles
Even where the rare unicorns do exist, it can be risky to depend on one person for data science capabilities, according to Amy Gaskins of MetLife’s Global Technology & Operations. She is not a developer and describes her skills in her bio as “integrating disparate data sets and identifying the correlations and patterns within them.” Gaskins advised recruiting team members with unique skills and cross-training them to build resiliency. The team should consist of the following roles, which could be mapped onto media developers, producers/reporters, and managing editors.
- Problem solvers: Engineers who understand the problem as well as the technologies, code, and infrastructure needed to create a solution.
- Translators: Communicators who speak the language of the business.
- Medic: A resource gatherer who can break barriers. She/he has the network and charm to get access and resources to overcome the king of “no” the team may confront.
- Leader: The team manager who understands the problem and how to fix it. They are essential and “their background is irrelevant.”
Gaskins advised against hiring applicants who hesitate to give an opinion, especially on a controversial question, because it risks indecisiveness when the leader is absent. Versatility is key because new problems arise and positions change. Lastly, interview and choose new hires together, requiring a unanimous yes, to maintain team cohesion.
The Prospects for Data Resources in Journalism
Overall, where journalism can offer the most value to data science is in communications, particularly in genres based on narrative arcs and visual media. Developer-statisticians aren’t trained to be Emmy- or Oscar-winning storytellers, regardless of whether they are subject matter experts or can code visualizations. Data scientists, in turn, can provide insights and evidence that may lead to new interview subjects, stories to break, and patterns of decision-making in our most powerful institutions. The prospect of bringing media storytellers and data scientists together is extremely exciting and can be a real game changer for the future of journalism.
Jess Duda is a digital content strategist and producer developing a big data tool to track the policy process. Previously, she was at PBS Digital working on digital strategy with national and local web producers as well as product development.