This Month In Data Science

March 31, 2014 Paul M. Davis


With GigaOm’s Structure Data Conference, the Pivotal HD 2.0 announcement, and big cloud platform announcements from Google and Amazon (below), March was an eventful month for data science and the platforms on which it is exercised. Here are our top picks for the data science news items of the month, both from Pivotal and the entire field.

Top Data Science News in March 2014

Gearing Up for Cloudapalooza: Google and Microsoft Face-off Against Amazon

The competition among cloud platforms intensified this month, with Google going head-to-head with Amazon Web Services (AWS) by announcing significant price cuts to Google Cloud Platform services. Amazon responded with price cuts and announcements of its own, while Microsoft waited in the wings with updates to its Windows Azure cloud services.

Better NCAA Brackets Through Data Science

While there are many approaches to choosing your NCAA basketball brackets, ranging from the sentimental to the superstitious, Kaggle’s “March Machine Learning Mania” competition applies some scientific rigor to the process. Over 250 teams have applied so far, aiming to develop an algorithmic model that predicts the results of the past five tournaments, and then testing that model in real time to predict the results of the 2014 tournament.

Open Data Could Add $3 Trillion A Year In Total Value Worldwide

A recent McKinsey Global Institute report estimates that open data could add over $3 trillion in total value annually to the education, transportation, consumer products, electricity, oil and gas, health care and consumer finance sectors worldwide.

White House Launches Website to Visualize Climate Change

The White House rolled out an ambitious new web app that aims to communicate climate change through visualizations of the available data, and projections of how climate change will affect users’ own lives. The project is part of the White House’s Climate Data Initiative, which brings together open government data, and private and philanthropic organizations, to analyze and communicate the latest climate change research.

How Statisticians Could Help Find That Missing Plane

The mystery of what happened to Malaysia Airlines Flight 370 continues to confound. In a post at Nate Silver’s recently relaunched FiveThirtyEight, Carl Bialik explains how statisticians can add insight to the search for answers by utilizing Bayesian techniques to calculate the probability of causes for the missing flight.

UC Berkeley Dean: Data Science Classes Aren’t Just for Engineers

While data science is a highly specialized field, requiring knowledge in statistical analysis, engineering, and machine learning techniques, data literacy is becoming increasingly important for everyone. During a talk at Gigaom Structure Data, UC Berkeley Dean AnnaLee Saxenian emphasized the importance of increased data literacy among professionals in a wide range of disciplines and fields, and declared that as a result, data science classes are increasingly important to a well-rounded curriculum.

Google Flu Trends: The Limits of Big Data

Google Flu Trends, one of the most visible and popular applications of data science in recent years, became a case study this month in the intensifying debate over the potential limitations of Big Data. An article published in Science magazine detailed some of Google Flu Trends’ most notorious failures, such as an overestimation of flu cases in the United States in 2012-13, and concluded that Google is guilty of “big data hubris.” A number of data scientists have countered that Google Flu Trends is far from a representative case. Among them is its co-inventor Matt Mohebbi, who explained to the New York Times that the tool was designed and intended “as a ‘complementary signal’ rather than a stand-alone forecasting tool.”

This Month in Pivotal Data Science

Pivotal HD 2.0 to Help Enterprises To Get More Out of Hadoop With a Business Data Lake

Pivotal HD 2.0 will help companies get more than ever out of their Hadoop investments by building in complementary in-memory data processing with GemFire XD, and by providing additional analytical firepower through improved tools and added libraries of pre-populated analytics. This is a distribution of Hadoop that really accelerates the time-to-insight for enterprises of all sizes.

Paul Maritz at Structure: Hadoop is Just One Ingredient of a ‘Profound Shift’ in Software

Pivotal’s CEO Paul Maritz sat down with GigaOM’s Om Malik at the Structure Data Conference in New York, opening the session by picking up on a new trend that he is seeing in the market today. In the interview, he calls out the Big Data technology Hadoop as a catalyst for the market, and says that the bigger trend in software is building on some of the tenets of Hadoop: taking lots of cheap machines and cheap storage, and reinventing how businesses build applications.

My First Three Months at Pivotal, and the Road Ahead to ApacheCON 2014

Pivotal’s Apache Hadoop leader, Roman Shaposhnik, shares what he has been up to for the first three months at Pivotal. In this post, he writes about how he is aligning Pivotal’s distribution, Pivotal HD, with the Apache Hadoop ecosystem projects and Apache Bigtop.

Time Series Analysis #2: Recognizing Patterns within a Time Series

The SQL Window Function construct can be used as a basis for many sorts of ordered calculations within SQL. This post elaborates on how this query capability can be used for a specific type of problem that frequently shows up in time series analysis, which is the recognition of simple patterns of movement within a series.
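
To make the idea concrete, here is a minimal sketch of window-function pattern recognition. It is not taken from the post itself: the table name `series`, the columns `t` and `v`, the sample values, and the helper `find_double_rises` are all hypothetical, and it uses SQLite (through Python's built-in `sqlite3` module, which needs SQLite 3.25 or later for window functions) rather than a Pivotal database. The query labels each step as up, down, or flat with `LAG`, then looks for two rises in a row.

```python
import sqlite3

# Hypothetical series of (time step, value); the numbers are made up for illustration.
SERIES = [(1, 10.0), (2, 11.0), (3, 12.5), (4, 12.0), (5, 11.0), (6, 11.5), (7, 13.0)]

# Step 1 (moves): compare each value with the previous one using LAG.
# Step 2 (pairs): pull the previous move alongside the current move.
# Step 3: keep the rows where an up-move follows an up-move.
QUERY = """
WITH moves AS (
    SELECT t,
           CASE
               WHEN v > LAG(v) OVER (ORDER BY t) THEN 'U'
               WHEN v < LAG(v) OVER (ORDER BY t) THEN 'D'
               ELSE 'F'  -- flat, or the first row, which has no predecessor
           END AS move
    FROM series
),
pairs AS (
    SELECT t, move, LAG(move) OVER (ORDER BY t) AS prev_move
    FROM moves
)
SELECT t FROM pairs
WHERE move = 'U' AND prev_move = 'U'
ORDER BY t
"""


def find_double_rises(series):
    """Return the time steps at which a second consecutive rise completes."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE series (t INTEGER PRIMARY KEY, v REAL)")
        conn.executemany("INSERT INTO series (t, v) VALUES (?, ?)", series)
        return [row[0] for row in conn.execute(QUERY)]
    finally:
        conn.close()


if __name__ == "__main__":
    print(find_double_rises(SERIES))  # prints [3, 7]
```

The same shape extends to longer patterns by chaining further `LAG` offsets, such as `LAG(move, 2)`, in the `pairs` step.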

Upcoming Data Science Events

Data Science for the 99%

Tue, April 1; Webinar

In this webinar, Pivotal Data Labs members Woo J. Jung, Sarah Aerni, and Srivatsan Ramanujam will discuss some of the open source tools in their arsenal. They will introduce and provide details on the variety of tools – such as MADlib, PL/R, PL/Python, PivotalR, PyMADlib and a host of others – they have utilized and extended for customer engagements.

ApacheCon North America

April 7–9; Westin Denver Downtown, Denver, CO

ApacheCon brings together the open source community to learn about and collaborate on the technologies and projects driving the future of open source, big data and cloud computing.

SF: Accessing External Hadoop Data Sources Using Pivotal Xtension Framework (PXF)

Tuesday, April 8, 5:30 to 8:30 pm; Pivotal Labs, San Francisco, CA

Pivotal’s Sameer Tiwari provides insight into Pivotal Xtension Framework, an external table interface that gives SQL access on top of data stored within the Hadoop ecosystem. It enables loading and querying of data stored in HDFS, HBase and Hive. It supports a wide range of data formats such as Text, AVRO, Hive, Sequence, RCFile formats and HBase.

Cloud Foundry Summit

June 9–11; Hilton Union Square, San Francisco, CA

The premier event for developers and cloud operators using the industry’s leading Open Source Platform-as-a-Service: Cloud Foundry. Join core contributors to the project and real world users for three days to discuss deep technical topics, engineering roadmap, community ecosystem and operational best practices.

GigaOm Structure 2014

June 18–19; Mission Bay Conference Center, San Francisco, CA

Meet the innovators and thinkers who are building infrastructure to run the applications of the next decade.

