Introducing R for Big Data with PivotalR

May 27, 2014 Guest Blogger

Written by Hai Qian & Woo J. Jung of Pivotal Data Labs

When discussing data science tools, people often debate passionately about algorithm breadth, scalability, and performance among the many available options. Yet usability, one of the most important aspects to consider when choosing a data science tool, is often ignored in these discussions.

We believe that usability is perhaps the most important aspect to consider when selecting data science tools. In day-to-day settings, a data scientist should be focusing on what she wants to do with the data, rather than having to determine the technical aspects of how she is going to get there. Unfortunately, we’re at a stage where the how takes up a large chunk of a typical data scientist’s workflow.

There are a number of user-friendly data science tools available: R, Python, SAS, Stata, and more. In particular, the S language (R being an open-source implementation of S) was designed specifically for data analysis. While these tools offer excellent, interactive interfaces for performing data science, they face scalability and performance challenges when end users transition from small to big data.

At Pivotal, we asked ourselves this question: Wouldn’t it be great if there were a way to harness the familiarity and usability of a tool like R, and at the same time take advantage of the performance and scalability benefits of in-database/in-Hadoop computation?

Within our team, the clear answer was yes. We realized that we could achieve this by building a tool, specifically an R package, that translates R code into SQL, which is then sent to the database for execution.

We’re excited to announce that this tool—PivotalR—is available on GitHub to download and use today. PivotalR is an R library with a familiar user interface that enables data scientists to perform in-database/in-Hadoop computations. While data scientists interact with a familiar R environment, those complex computations all occur “under the hood.”

PivotalR builds on R’s tradition of providing an interface with backend routines that run when needed in other languages or environments, operating at various levels of abstraction (e.g., Fortran, C, Stan, etc.). This framework allows the data scientist to express ideas and interact with R’s unified, user-friendly interface, while allowing her to piggy-back on faster subroutines or pre-existing tools when it makes sense to do so.

We believe that PivotalR takes this paradigm a step further. Traditionally, these backend subroutines run on the same hardware as the R client itself (e.g., your laptop or a dedicated R server). This is fine when working with reasonably sized datasets, but can quickly become problematic when working with big data. For practitioners who have powerful, dedicated hardware for their database or Hadoop cluster, it’s unfortunate to leave all that computing power unused and spend time moving the data from the data store to a laptop or server for modeling. Furthermore, there is no guarantee that this laptop or server would have enough memory to run these models.

PivotalR’s backend SQL queries can either run on a local database running on one’s laptop, or directly in the dedicated database or Hadoop cluster. This framework allows for the best of both worlds: A familiar, user-friendly R interface provided by the client machine, with highly-scalable, parallelized computing capabilities available through the database or Hadoop cluster.

As a short side note, if you haven’t been exposed to the availability of canned SQL functions that execute sophisticated machine learning routines like Elastic Net and Latent Dirichlet Allocation directly in database/Hadoop, we invite you to check out the open-source library MADlib. It’s worth mentioning that PivotalR piggy-backs substantially on MADlib.

The diagram below is our attempt to illustrate the mechanics of PivotalR’s design. At its core, an R function in PivotalR:

Translates R code into corresponding SQL statements in the R client
Executes these statements in the database or Hadoop cluster
Returns summarized output to R

[Diagram: PivotalR design]

Call MADlib’s in-DB machine learning functions directly from R
Syntax is analogous to native R functions
Data doesn’t need to leave the database
All heavy lifting, including model estimation and computation, is done in the database

This framework allows practitioners to benefit from the scalability and performance of in-database/in-Hadoop analytics without leaving the R command line. We use RPostgreSQL as the communication bridge between the database or Hadoop cluster and the client machine. All of the heavy lifting, including model estimation and computation, is done in-database/in-Hadoop.
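
To make the translate step listed above concrete, here is a minimal sketch in base R of how a formula could be turned into a SQL string on the client. It is an illustration only, not PivotalR’s internal code: the helper name formula_to_madlib_sql and the table names are hypothetical, and the generated statement simply follows the call shape of MADlib’s linregr_train function. The execute and return steps are omitted because they need a live database connection.

```r
# Toy translator: turn an R formula into a MADlib linear-regression call.
# Hypothetical helper for illustration; NOT what PivotalR generates internally.
formula_to_madlib_sql <- function(formula, table, out_table) {
  # Left-hand side of the formula is the dependent variable.
  response <- all.vars(formula)[1]
  # Right-hand side terms become the independent variables.
  predictors <- attr(terms(formula), "term.labels")
  # The leading 1 in the array stands for the intercept term.
  sprintf(
    "SELECT madlib.linregr_train('%s', '%s', '%s', 'ARRAY[1, %s]');",
    table, out_table, response, paste(predictors, collapse = ", ")
  )
}

# Table and column names below are made up for the example.
sql <- formula_to_madlib_sql(price ~ tax + bath + size,
                             table = "houses",
                             out_table = "houses_linregr")
cat(sql, "\n")
```

In a real session the resulting string would be sent to the database over the client connection, and only the fitted coefficients and summary statistics would come back to R.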

The principal design philosophy behind PivotalR is to not compromise the “R-ness” of the user experience. This is a common approach among R contributors who leverage subroutines in the backend, and their efforts are well-appreciated by end users in the R community. For example, the PivotalR function for linear regression, madlib.lm(), is pretty much identical in look-and-feel to R’s native lm() function. For those of you who have had the unfortunate experience of manually creating indicators for a categorical variable with many distinct values in SQL, we are happy to say that PivotalR supports automated dummy variable coding à la as.factor().
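
For readers who have not written indicator columns by hand, the following base R sketch shows the kind of SQL that automated dummy coding saves you from typing. The helper dummy_sql and the column name sex are hypothetical, and the SQL PivotalR actually emits may look different; the point is only that one indicator is needed per non-reference level, which mirrors what R’s model.matrix() does for a factor.

```r
# Hypothetical helper: build SQL indicator expressions for a categorical column.
dummy_sql <- function(column, levels) {
  # Drop the first level as the reference category, matching R's
  # default treatment contrasts.
  keep <- levels[-1]
  sprintf("(CASE WHEN %s = '%s' THEN 1 ELSE 0 END) AS %s_%s",
          column, keep, column, keep)
}

# A small made-up categorical variable with three distinct values.
sex <- as.factor(c("M", "F", "I", "F"))
exprs <- dummy_sql("sex", levels(sex))
cat(exprs, sep = ",\n")

# For comparison: the indicator columns R builds locally for the same factor.
mm <- model.matrix(~ sex, data = data.frame(sex = sex))
print(colnames(mm))
```

With three levels this yields two expressions; a column with hundreds of distinct values would need hundreds, which is why doing it by hand in SQL is tedious.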

If you’re anything like us and have become accustomed to R’s convenience operators, you may find geek-comfort in knowing that code like the following is also fully supported:

madlib.lm(y~., data=d[,-c(16,27,41:56)])
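
If the shorthand is unfamiliar, the same two idioms can be tried locally with R’s native lm(): the dot means "all remaining columns", and negative indices drop columns before fitting. This sketch uses the built-in mtcars data frame as a stand-in, so it illustrates only the syntax, not in-database execution.

```r
# mtcars ships with R and has 11 columns:
# mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb.
d <- mtcars

# Drop columns 2 and 8 through 11, then regress mpg on everything left.
fit <- lm(mpg ~ ., data = d[, -c(2, 8:11)])

# The predictors that survived the column selection.
print(names(coef(fit)))
```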

We greatly prioritized the look-and-feel of R while developing PivotalR. We look forward to demonstrating the fruits of this labor in a series of future blog posts, each of which will walk through one or two use cases with working R code examples. In the meantime, we invite you to get started with PivotalR by visiting the PivotalR GitHub page and watching this video demo.

PivotalR is currently supported on PostgreSQL, Pivotal Greenplum, and Pivotal HD with HAWQ. It is available for download on CRAN and GitHub.

Hai Qian is a Senior Software Engineer on the Pivotal Predictive Analytics team.

Woo J. Jung is a Senior Data Scientist at Pivotal Data Labs.
