Over the course of four seminars, the Big Data for the Public Good series presented a rare opportunity for leading data science thinkers, innovators, and practitioners to explore how the field can serve the public interest. Presented by Code for America and sponsored by Greenplum, the series hosted Michal Migurski and Eric Rodenbeck of Stamen Design, Jake Porway of DataKind and formerly The New York Times, wiki inventor and Nike Code for a Better World Fellow Ward Cunningham, and Jeremy Howard, President and Chief Scientist at Kaggle. Though the diverse selection of speakers explored the topic from a variety of perspectives, a set of recurring themes arose during the talks.
Data Science is Storytelling
As big data proliferates, new approaches are required to communicate the insights it reveals. Interactive maps, data visualizations, and infographics are tools for clarifying complexity, placing data scientists in the role of storyteller. Referencing Hans Rosling, Stamen’s Rodenbeck emphasized that “narrative is critical” in order to provide context and effectively communicate what a particular dataset demonstrates to governments, social organizations, and citizens. Stamen learned this lesson when a city growth visualization tool the studio built for Trulia drew backlash from residents, who believed the application visualized their community as if it were a missile target in a video game.
Cunningham emphasized the “storytelling aspect” of data science while discussing the Smallest Federated Wiki platform he developed at Nike, which allows companies and the public to share and collaborate on the analysis of datasets. The platform’s federated approach, Cunningham explained, allows other users to assess the quality and accuracy of analyses. “There are a lot of different stories to tell about any particular piece of data,” he said, touting the advantages of a federated approach. Through analysis, narratives emerge. “As we find our way through the data,” he explained, “we can say, ‘here’s a visualization that can tell this story and there’s a visualization that can tell that story.’”
Data Empowers Citizens
Stamen’s Migurski emphasized the value of establishing a dialogue between citizens and government institutions based upon data. Data empowers citizens to advocate for the needs of their communities, and reveals which needs are not being addressed. Sharing what he learned while building the crime-mapping application Oakland Crimespotting, Migurski identified four best practices for working with government data. He stated that tools “must demonstrate the impact by linking to truths shared within the communities served,” and that they must be stable and reliable, refer to an official version that can be verified and supported, and remain contextually relevant.
Porway spoke of the wealth of public data that goes untapped, lamenting undirected data dumps by governments and organizations that are “like giving crude oil to people.” “Open data is not useable data,” he warned, advocating for an ongoing dialogue between government agencies, social organizations, and data scientists. “By bridging these communities, you’re starting to make that data useable,” he said, which increases the likelihood that it can serve citizens.
Bridging the Data Science Gap
There is an abundance of public data, but a lack of skilled practitioners to make sense of it. This presents an opportunity for data scientists to use their skills to serve the public interest. Porway noted that in many social organizations, “data and skills are often silo’d from one another.” This creates a risk that the wealth of information these organizations produce will become irretrievable data exhaust.
“On the one hand, we have a group of people who are really good at looking at data, really good at analyzing things, but don’t have a lot of social outputs for it,” Porway said. “On the other hand, we have social organizations that are surrounded by data and are trying to do really good things for the world but don’t have anybody to look at it.”
Porway sees a network of “transformative communities” emerging to address this issue, within which government officials, representatives of social organizations, data scientists, researchers, and journalists “are coming together for a common goal and sharing across those boundaries to do more.”
One way to connect data scientists with institutions and organizations that lack skilled practitioners is the competition model established by Kaggle. Howard explained that Kaggle harnesses practitioners’ competitive impulse and their desire “to hack at interesting problems and interesting code.”
He noted that in “cause organizations where they don’t have people working on this stuff, they often don’t see the forest for the trees,” unaware of the value of the data available. Howard cited the EMC Data Science Global Hackathon for Air Quality Prediction, a weekend-long competition that offered participants access to EPA Air Quality Index data for Chicago, as an example of how the Kaggle model can serve the public interest.
Revealing the transformative potential of data science in service of the social good, Howard noted that the competitive hackathon worked with “a data set which is local in scope,” which “you can use at a local level, yet you can also take the results and apply them really powerfully throughout the world.”
As grand as that may sound, we are only on the cusp of what can be done with big data. As demonstrated by the thinkers and practitioners who spoke at the Big Data for the Public Good seminar series, there exists a community of data scientists who are as passionate about serving the public interest as they are about datasets. The potential of such collaborations is nothing less than transformative.
About the Author