This Month in Data Science

June 30, 2014 Paul M. Davis

June was a big month for data science, with major news coming out of the Apache Hadoop® Summit 2014, much discussion about the job market for data scientists, and new research demonstrating the impact Apache Hadoop® is having on enterprises. Pivotal announced benchmarks demonstrating the industry-leading speed of its SQL query engine HAWQ running on Pivotal HD, while our data science team shared numerous technical insights about their client engagements on the blog. Here’s our roundup of the month’s top data science news, both from Pivotal and from across the industry.

SiliconANGLE: Skill Sets Today’s Data Scientists Need to Succeed | #HadoopSummit2014

At the Apache Hadoop® Summit 2014, held earlier this month, there was much discussion about the high demand for data scientists, and what that means for businesses and practitioners. While the definition of a data scientist remains a matter of dispute, there is increasing agreement on the skill sets businesses are looking for in the hiring pool.

PCWorld: USENIX Researchers Get a Grip on Apache Hadoop® Performance

PCWorld covers the need for accurate models capable of predicting big data workloads. The article discusses researchers’ current efforts, highlighting issues such as cost and limited access to Apache Hadoop® systems as factors contributing to the lack of proper models.

Wired: Tell Your Kids to Be Data Scientists, Not Doctors

The future isn’t in plastics, or even in venerable high-wage, high-status careers like medicine or law, according to Burtch Works’ Linda Burtch. This isn’t only because of the salary prospects: Burtch notes that much future medical research will be performed by data scientists, and emphatically declares data science to be the “Career of the Future.”

InformationWeek: UC Berkeley Breeds Data Scientists Online: $60K, 18 Months

The University of California at Berkeley announced its new Master of Information and Data Science (MIDS) program, which costs a steep $60K but aims to turn out new data science professionals in only 18 months.

Wall Street Journal: BNY Mellon Finds Promise and Integration Challenges with Apache Hadoop®

The Wall Street Journal discusses BNY Mellon’s Hadoop® implementation, highlighting the business changes enabled by the integration. The article notes that the short-term integration challenges are worth enduring for the prospect of significant long-term gains.

TechTarget: Seven Data Science Lessons From McGraw-Hill Education Analytics Guru

What programming language should every data scientist know? How should data scientists be trained? Why do you need more women on your team? McGraw-Hill Education’s Alfred Essa answers those questions and more in this TechTarget feature.

Bloomberg: Retailers Use Big Data to Turn You Into a Big Spender

Bloomberg explores Klarna, a startup that combines big data and shopping services. The article examines the effect of big data on shopping/e-commerce websites.

CMSwire: One Woman’s Path to Data Science

There’s been much discussion in tech circles in recent months about the industry’s gender gap among engineers. In this article, CMSwire profiles Dstillery’s chief data scientist Claudia Perlich, discussing her path to success in a highly coveted position, and her take on the important traits shared by successful data scientists.

This Month in Pivotal Data Science

Pivotal HAWQ Benchmark Demonstrates Up To 21x Faster Performance on Hadoop® Queries Than SQL-like Solutions

This month at the ACM SIGMOD Conference, the premier international forum for database researchers, practitioners and users, Pivotal announced the architectural benefits and results for its brand-new cost-based query optimizer. The results bore out Pivotal’s statement that HAWQ is the world’s fastest SQL query engine on Hadoop®, with benchmarks demonstrating it is capable of up to 21 times faster performance and three times the queries supported for Hadoop®.

Graph Analytics for Identity Resolution—Transforming Billions of Customer Records in One Minute

Two Pivotal data scientists share details on how they took billions of customer records from multiple systems and LOB data silos, then computed matches for identity resolution in only one minute. Ultimately, the approach had a big impact on segmentation, target marketing, and marketing ROI, and the engagement, at a major insurance provider, took only days to complete.

A Data Science Approach to Detecting Insider Security Threats

Employees must access internal information freely to be productive, yet ill-intentioned information access must be guarded against. Most security tools today focus on identifying malware-initiated attacks. Pivotal sees many opportunities for big data analytics to address the problem of identifying anomalous user-to-resource access activities.

Using Data Science Techniques for the Automatic Clustering of IT Alerts

The ultimate goal of a data science-driven IT infrastructure is one capable of performing automated root cause analysis and failure prediction. To achieve this goal, some foundational blocks must be built. One of these foundational blocks is the automatic clustering of IT alerts. In this post, Derek Lin demonstrates this using a patented approach the Pivotal Data Science team performed for a client.

Exploring Big Data Solutions: When To Use Apache Hadoop® vs In-Memory vs MPP

In today’s world of big data, there are several different technology approaches available for data management. For many companies, a combination of approaches is necessary. This high-level overview explores the benefits, trade-offs, and Pivotal’s recommendations for three primary technologies: Apache Hadoop® distributions, in-memory data grids (IMDG), and massively parallel processing (MPP).

New Benchmark Results: Pivotal Query Optimizer Speeds Up Big Data Queries Up To 1000x

The new super-efficient Pivotal Query Optimizer, developed by the Greenplum engineering team and previously codenamed “Orca”, has been released as part of the HAWQ query engine in Pivotal HD, Pivotal’s commercially supported distribution of Apache Hadoop®.

Upcoming Events

Building a Distributed Data Ingestion System with RabbitMQ

Thursday, July 17, 2014
5:45 PM to 8:30 PM
Pivotal Labs
875 Howard Street 5th Floor, San Francisco, CA

In this talk we are going to show how to build a system that can ingest data produced in separate geolocated areas (think AWS and its many regions), using different technology stacks, and replicate it to a central cluster where it can be further processed and analysed.

OSCON 2014 – O’Reilly Conferences

July 20–24
Portland, OR

From the early days of open source, OSCON has been the only event that covers the open source stack in its entirety. Not just one language, tool, or philosophy, but all of the moving parts, integrated and working together. It’s everything you need to know about open source to keep you ahead of the curve.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

