At the Data Science Summit 2012, practitioners and thought leaders shared broad visions and deep discipline
Big Data may be getting its day in the sun, but data science is the key to unlocking Big Data’s abundance of predictive value and insight. At the Data Science Summit 2012, on May 22 and 23 in Las Vegas, practitioners mingled with executives, statisticians, researchers, data journalists, and many more to discuss methodologies, opportunities, the future of data science, and how the discipline will impact how we work and live.
The breadth of attendees demonstrated the transformational impact of Big Data and the need for organizations across the spectrum — business, government, academia, education, research, and media — to contend with the challenges it poses.
Piyanka Jain, President & CEO of analytics leader Aryng.com, posed a central question on the mind of many attendees when she asked, “How do you navigate from B.I. to B.I. — business intelligence to business impact?”
Raw Data, Real Value
This transition was discussed during the roundtable panel “From Raw Data to Value Data”. Tony Jebara of Sense Networks, Kaggle’s Jeremy Howard, Applicology President and Former CTO for the U.S. Intelligence Community Bob Flores, and comScore’s CTO Michael Brown discussed the practices, challenges, and ethical questions posed by applying predictive analysis to reap actionable information from raw data. Addressing privacy issues, Brown stated that Intuit has distributed its best practices consensus process across the organization to preserve users’ privacy. Jebara warned against convincing users to place inordinate trust in anonymized data, since such information can be correlated with data from social media and the web to identify individuals.
The discussion of the torrent of data produced on the web turned to data quality, or “data conditioning,” as moderator Roger Magoulas of O’Reilly Media put it. The panelists discussed how their respective organizations ensure accuracy in their datasets without missing key signals in the noise. “You don’t want to suppress the outliers,” said Brown, “because that might be some of the most interesting data…Insight comes from getting your hands dirty and seeing what’s in there.”
Data exhaust, information not collected or later discarded, is a hot topic among data scientists, and it inspired a lively discussion. “The problem with that data,” said Howard, “is that it can be very hard to separate correlation and causality.” Instead, Howard argued in favor of designing experiments specifically to determine causality, citing an example from an insurance pricing company where he developed elasticity models to predict the probability that potential customers would accept price quotes. “You can’t find that out by looking at past data,” he said, “because people have received different prices for different reasons. So we had insurers randomly change their price for the next 100,000 people who came in, and as a result they could use not data exhaust but actual transactional data to build an algorithm worth hundreds of millions of dollars.”
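To make the idea of a randomized pricing experiment concrete, here is a minimal Python sketch. It is not Howard's model: the data is simulated, and the logistic form, the parameter values (`true_intercept`, `true_slope`), the price range of plus or minus 20 percent, and every function name are assumptions chosen for illustration. It shows only that when price changes are assigned at random, a simple fit on the resulting transactions recovers how acceptance responds to price.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_quotes(n, true_intercept=0.0, true_slope=-4.0, seed=0):
    """Simulate n customers who each get a randomly perturbed quote (hypothetical data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Randomization: the price change does not depend on who the customer is.
        change = rng.uniform(-0.2, 0.2)  # relative price change, -20% to +20%
        p_accept = sigmoid(true_intercept + true_slope * change)
        accepted = 1 if rng.random() < p_accept else 0
        rows.append((change, accepted))
    return rows


def fit_logistic(rows, iterations=8):
    """Fit P(accept) = sigmoid(a + b * change) with Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p          # gradient of the log-likelihood w.r.t. a
            g1 += (y - p) * x    # gradient of the log-likelihood w.r.t. b
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def accept_probability(a, b, change):
    """Predicted probability that a customer accepts a quote at this price change."""
    return sigmoid(a + b * change)


rows = simulate_quotes(100_000)
intercept, slope = fit_logistic(rows)
print(f"estimated slope: {slope:.2f} (simulated true value: -4.0)")
print(f"P(accept) at +10% price: {accept_probability(intercept, slope, 0.10):.2f}")
```

Because the price change is random, the fitted slope reflects the effect of price itself rather than the "different reasons" that drove historical prices, which is the distinction Howard draws between experimental data and data exhaust.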
Prediction and Cautionary Tales
Predictive modeling was the focus of the opening keynote by Nate Silver, statistician, writer and founder of The New York Times political blog FiveThirtyEight.com. Silver became a sensation during the 2008 Presidential Election, when his statistical models proved more accurate at predicting voting results than news pundits and polls. In his talk, Silver warned of the risks of focusing strictly on the data available. “Despite all this information in our world, Big Data, there are all kinds of places where we’re not making much progress in making predictions or forecasts.” He said that such examples serve as “cautionary tales for some of the challenges we face as we encounter more and more information, and what skills we have to develop to analyze that information in a way that will make society…and businesses better off, and not run into various quagmires.”
Silver cited a number of examples of such situations, pointing to the bubble and burst cycle of stock market analysis in recent years. “If you look at stock market data,” he said, “there seems to be a lot of information there, but it’s very hard to tell if it’s meaningful in a predictive sense, especially since once you detect a financial signal, other people will detect it as well, and you have different dynamics working to feed back on itself.” He spoke of the challenges of discerning the signals that offer “real information, something which has predictive power, something that can help you explain how the world works.”
Amid all the noise and the many signals of Big Data, Silver stated, “we’re looking for what you could think of as causality — we are looking for more structure that will guide us, that will help us outside of the model, that works in the real world.” Silver pointed to examples where existing predictive models failed to ensure disaster preparedness, such as the Fukushima earthquake in 2011 and the 1997 flood in Grand Forks, ND. “You don’t know when the earthquakes are going to occur,” Silver said, “but you can make a reasonable long-term prediction of the hazard, for example, by looking at the number of smaller earthquakes that have occurred.” Another key point was that researchers need to “embrace uncertainty,” or risk, to develop useful predictive models. The Grand Forks flood is illustrative, Silver said, because forecasters predicted a 49-foot crest against the town’s 51-foot levees without accounting for a nine-foot margin of error.
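The arithmetic behind the flood example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Silver's calculation: it assumes the forecast error is normally distributed and that the nine-foot margin is the half-width of a 90 percent interval (the `z_for_margin=1.645` default). Neither assumption comes from the talk, and the function name is hypothetical.

```python
import math


def prob_crest_exceeds(levee_ft, forecast_ft, margin_ft, z_for_margin=1.645):
    """P(crest > levee) if crest ~ Normal(forecast, sigma), with margin = z * sigma.

    The normal model and the meaning of the margin are assumptions.
    """
    sigma = margin_ft / z_for_margin
    z = (levee_ft - forecast_ft) / sigma
    # Upper-tail probability of a standard normal, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_overtop = prob_crest_exceeds(levee_ft=51.0, forecast_ft=49.0, margin_ft=9.0)
print(f"Chance the crest tops the levees: {p_overtop:.0%}")
```

Under these assumptions, a 49-foot point forecast against 51-foot levees still leaves roughly a one-in-three chance of overtopping, which is why reporting the point estimate alone can mislead.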
“Next Practices Rather Than Best Practices”
As data increases exponentially and predictive insight becomes increasingly crucial to businesses and society, the need for a qualified work force of data scientists grows. With the United States ranked 27th out of the 29 wealthiest countries in science and engineering graduates according to a 2010 National Academies study, math and science education must be overhauled to address the needs of the time. Michael Chui, Senior Fellow at the McKinsey Global Institute, argued that to address this challenge and for businesses and the next generation of workers to remain competitive, schools must focus increasingly on statistical analysis.
“We’ve got to stop teaching so much calculus,” he said. “I think we should teach more stats. Who does an integral anymore in business? There are a few people in engineering. But who needs to understand conditional probability, and who needs to understand selection bias, and all those things that a data scientist just wakes up with and understands? One of the things we need to do in addition to solving all those technical problems is somehow affect the education — I don’t just mean formal education, but the way that people are thinking broadly throughout our organizations.”
Citing the McKinsey Global Institute report “Big data: The next frontier for innovation, competition, and productivity”, which estimated that by 2009, “nearly all sectors in the US economy had at least an average of 200 terabytes of stored data…per company with more than 1,000 employees,” Chui stated that “the effective use of Big Data is going to be the basis of competition going forward.” He explained that what is needed is “a process of next practices rather than best practices,” and that business leaders, educators, and the general public must be better informed about the value and challenges inherent in Big Data.
The Science-Design Nexus
For the past decade, Seed has worked at what founder and CEO Adam Bly described as the “science-design nexus” to address such scientific literacy issues. Explaining the organization’s mission, Bly asked, “How do you get seven billion people on the planet to be scientifically literate? How do we ensure that the world is capable of thinking scientifically, empirically, rationally, about the host of complex issues we face in the world and continue to face? How can we advance ideas and tools through media, technologies, and services to get people, companies and governments to think scientifically?”
Seed answers those questions by applying “the craft, the cognitive processes, and the tools of design,” Bly said, “to serve as an interface to the complexity that science conjures, reveals and exposes when you start doing scientific work.” The organization is guided by the question, “What’s the role of design in advancing scientific thinking, advancing scientific literacy, and compelling new insights through a scientific method?” Seed’s work at what Bly described as the “science-design nexus” positions it as a leader in the mission-driven data science space. “We came at data through science first,” he said, using “design as an interface to the complexity. Visualization is a prime example of that.”
Bly explored where data visualization is headed in both enterprise and government contexts, and society at large. Pointing to visualizations of tweets from the Arab Spring uprisings, and maps pulling geotag information from Flickr photos, he noted that even when the form of visualization used is not particularly new, “new data [yields] new insights.” Real social value will come through data mash-ups and correlation of separate datasets, Bly said, when “this new kind of approach to visualization” is shared with “behavioral economists, cognitive scientists, with the experts in negotiation who are studying how humans arrive at consensus. Could the combination of new data, new forms being revealed through visualization, and new science actually compel new insights?”
Bridging the Relevancy Gap
As data grows increasingly ubiquitous, predictive analysis and visualization will reveal new insights and business opportunities. But telling those stories to the rest of humanity will remain a challenge. “Data visualization should be delivering secrets to people,” said Programmer, Artist and Storyteller Jonathan Harris during his closing keynote on the intersections between data, art, and humanity. Speaking to the relevancy gap that separates the general public from the many benefits of data science, Harris stated that “the world is littered with infographics that are not interesting, because the data within them is not.”
Noting that data scientists, developers, artists, and storytellers share many tools and methodologies, Harris advocated for collaboration among the groups and a respect for individuals’ personal experience and perception amid all the data, as we work on what he called “the staging ground for the future.” The multidisciplinary synthesis advocated by Harris and Bly, in conjunction with deep data analysis by trained experts, is how individuals, businesses, and organizations will reap predictive insight and remain competitive in the era of Big Data.
To see all 9 presentations from Greenplum’s Data Science Summit 2012, visit http://www.greenplum.com/datasciencesummit/videos/.
About the Author