This Month in Data Science

June 30, 2014 Paul M. Davis

June was a big month for data science, with major news coming out of the Apache Hadoop® Summit 2014, much discussion about the job market for data scientists, and new research demonstrating the impact Apache Hadoop® is having on enterprises. Pivotal announced benchmarks demonstrating the industry-leading speed of its SQL query engine HAWQ running on Pivotal HD, while our data science team shared numerous technical insights about their client engagements on the blog. Here’s our roundup of the month’s top data science news, both from Pivotal and the entire industry.

SiliconANGLE: Skill Sets Today’s Data Scientists Need to Succeed | #HadoopSummit2014

At the Apache Hadoop® Summit 2014, held earlier this month, there was much discussion about the high demand for data scientists, and what that means for businesses and practitioners. While the definition of a data scientist remains a matter of dispute, there is increasing agreement on the skill sets businesses are looking for in the hiring pool.

PCWorld: USENIX Researchers Get a Grip on Apache Hadoop® Performance

PCWorld covers the need for accurate models capable of predicting big data workloads. The article discusses current researchers’ efforts, highlighting various issues like cost and lack of accessibility to Apache Hadoop® systems as factors contributing to the lack of proper models.

Wired: Tell Your Kids to Be Data Scientists, Not Doctors

The future isn’t in plastics, or even in venerable high-wage, high-status careers like medicine or law, according to Burtch Works’ Linda Burtch. This isn’t only because of the salary prospects: Burtch notes that much future medical research will be performed by data scientists, and emphatically declares data science to be the “Career of the Future.”

InformationWeek: UC Berkeley Breeds Data Scientists Online: $60K, 18 Months

The University of California at Berkeley announced its new Master of Information and Data Science (MIDS) program, which costs a steep $60K but aims to turn out new data science professionals in only 18 months.

Wall Street Journal: BNY Mellon Finds Promise and Integration Challenges with Apache Hadoop®

The Wall Street Journal discusses BNY Mellon’s Hadoop® implementation, highlighting the business changes enabled by the integration. The article notes that the short-term integration challenges are worth enduring for the prospect of significant long-term gains.

TechTarget: Seven Data Science Lessons From McGraw-Hill Education Analytics Guru

What programming language should every data scientist know? How should data scientists be trained? Why do you need more women on your team? McGraw-Hill Education’s Alfred Essa answers those questions and more in this TechTarget feature.

Bloomberg: Retailers Use Big Data to Turn You Into a Big Spender

Bloomberg explores Klarna, a startup that combines big data and shopping services. The article examines the effect of big data on shopping/e-commerce websites.

CMSwire: One Woman’s Path to Data Science

There’s been much discussion in tech circles in recent months about the industry’s gender gap among engineers. In this article, CMSwire profiles Dstillery’s chief data scientist Claudia Perlich, discussing her path to success in a highly coveted position, and her take on the important traits shared by successful data scientists.

This Month in Pivotal Data Science

Pivotal HAWQ Benchmark Demonstrates Up To 21x Faster Performance on Hadoop® Queries Than SQL-like Solutions

This month at the ACM SIGMOD Conference, the premier international forum for database researchers, practitioners and users, Pivotal announced the architectural benefits and results for its brand-new cost-based query optimizer. The results bore out Pivotal’s statement that HAWQ is the world’s fastest SQL query engine on Hadoop®, with benchmarks demonstrating it is capable of up to 21 times faster performance and three times the queries supported for Hadoop®.

Graph Analytics for Identity Resolution—Transforming Billions of Customer Records in One Minute

Two Pivotal data scientists share details on how they took billions of customer records from multiple systems and line-of-business (LOB) data silos, then computed matches for identity resolution in only one minute. Ultimately, the approach had a big impact on segmentation, target marketing, and marketing ROI at a major insurance provider, and the engagement took only days to perform.

A Data Science Approach to Detecting Insider Security Threats

Employees must access internal information freely to be productive, yet ill-intentioned information access must be guarded against. Most security tools today focus on identifying malware-initiated attacks. Pivotal sees many opportunities for Big Data Analytics to address the problem of identifying anomalous user-to-resource access activities.

Using Data Science Techniques for the Automatic Clustering of IT Alerts

The ultimate goal of a data science-driven IT infrastructure is one capable of performing automated root cause analysis and failure prediction. To achieve this goal, some foundational blocks must be built. One of these foundational blocks is the automatic clustering of IT alerts. In this post, Derek Lin demonstrates this using a patented approach the Pivotal Data Science team performed for a client.

Exploring Big Data Solutions: When To Use Apache Hadoop® vs In-Memory vs MPP

In today’s world of big data, there are several different technology approaches available for data management. For many companies, a combination of approaches is necessary. This high-level overview explores the benefits, trade-offs, and Pivotal’s recommendations for three primary technologies: Apache Hadoop® distributions, in-memory data grids (IMDG), and massively parallel processing (MPP).

New Benchmark Results: Pivotal Query Optimizer Speeds Up Big Data Queries Up To 1000x

The new super-efficient Pivotal Query Optimizer, developed by the Greenplum engineering team and previously codenamed “Orca”, has been released as part of the HAWQ query engine in Pivotal HD, Pivotal’s commercially supported distribution of Apache Hadoop®.

Upcoming Events

Building a Distributed Data Ingestion System with RabbitMQ

Thursday, July 17, 2014
5:45 PM to 8:30 PM
Pivotal Labs
875 Howard Street 5th Floor, San Francisco, CA

In this talk we are going to show how to build a system that can ingest data produced in separate geolocated areas (think AWS and its many regions), using different technology stacks, and replicate it to a central cluster where it can be further processed and analysed.

OSCON 2014 – O’Reilly Conferences

July 20–24
Portland, OR

From the early days of open source, OSCON has been the only event that covers the open source stack in its entirety. Not just one language, tool, or philosophy, but all of the moving parts, integrated and working together. It’s everything you need to know about open source to keep you ahead of the curve.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
