Data Science Case Study: A Healthcare Company's Journey To Big Data

November 24, 2014 Hulya Emir-Farinas

Joint work performed by Hulya Emir-Farinas and Sarah Aerni with help from Noah Zimmerman, Emily Kawaler and Ailey Crow.

Adopting a new technology is never a trivial task. Introducing a brand new tool into a data scientist’s toolset is no different. The resistance to change is especially high in companies that employ tens or hundreds of statisticians. Understandably, analysts have learned to love their tools and live with any shortcomings. The effort required to learn a more efficient tool often seems too great, even if such a transition would lead to long-term time savings. This is where Pivotal Data Labs (PDL) comes into the picture, using a team of highly skilled data scientists and engineers to prove results to our customers, such as:

  • Shorter time to insight and to market
  • Better utilization of all captured data (both structured and unstructured)
  • Improved model quality and better decision-making
  • Minimized data movement and need to create multiple copies

In this blog post, we describe an example journey to technology adoption, carried out through a series of data science engagements that solved real problems for our customer, a major healthcare provider. This customer has a large research division and, as a trailblazer in preventive healthcare, employs many accomplished clinicians and biostatisticians who are limited by the analytics tools that they use. The journey they took, a series of five projects (Figure 1), shows how analytics can be done faster and better. Each project answered different questions, proving the need for and utility of new tools in advancing their data science practices, improving their business, and ultimately leading to the decision to adopt new technology.

Figure 1: The customer’s technology adoption journey as a series of five projects.

PHASE 1: Prove Better Technology Speeds Up Discovery

Their journey started with a hackathon. They invited four vendors to a 24-hour event and provided medication order history and environmental sensor data. In 24 hours, we accomplished the following:

  • Showed that there is a correlation between a measured environmental factor and the prevalence of a chronic respiratory disease.
  • Predicted who is most likely to have an admission related to a chronic respiratory disease in the next three months.
  • Demonstrated that patients who do not pick up their medication from the pharmacy are more likely to have expensive hospitalizations.
  • Built a population management tool for physicians and a mobile app for patients, both powered by predictive models built during the hackathon.

This hackathon served as a proof point that the platform could ingest, analyze, and visualize large-scale, previously unanalyzed data in a very short period of time. The customer was convinced that Pivotal’s big data technology platform enabled more rapid discovery and insights. However, they also wanted to see whether that power could be used to improve the quality of existing models.

PHASE 2: Prove Better Technology Can Improve Model Quality

The customer presented PDL with a model that researchers and statisticians had worked on extensively. Their model for predicting the length of stay of patients admitted to the hospital for Acute Myocardial Infarction (AMI) was state-of-the-art and was the most accurate model published in the academic literature. Our goal was to demonstrate that new technology would let them leverage more of their data and, by using data-driven (rather than hypothesis-driven) approaches, improve the model quality.

In 3 weeks, we were able to engineer over 300 rich features, experiment with many different model forms, and build an ensemble model that doubled the accuracy of their baseline model. Some of the insights from this effort were very interesting to our customer:

  • We proved that length of stay (LOS) couldn’t be explained by biology alone: operations, nurse schedules, and the hospital’s experience in cardiology also played a big role in explaining the variations in LOS.
  • Model fit for LOS for AMI is influenced largely by the most recent information available on a patient, from the current hospitalization and recent laboratory test values, as shown in Figure 2. By removing each group of features in turn and measuring the effect on model fit in the test set, we were able to assess the value of each group of features.

Figure 2: (Left) General categories of the more than 300 features used in the modeling exercise, color-coded by group as given in the chart. (Right) Decrease in model accuracy on test data when each feature group is excluded from the model, compared against the baseline (yellow) with all features included.
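The feature-group comparison described above can be sketched roughly as follows. This is a minimal illustration and not the ensemble model from the engagement: it assumes a plain least-squares fit with NumPy, synthetic data, and hypothetical group names (`recent_labs`, `operations`, `history`). It refits the model with one group of columns left out at a time and reports how much the test-set R² drops.

```python
import numpy as np


def fit_r2(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares on the training split; return R^2 on the test split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    resid = y_test - A_test @ coef
    total = y_test - y_test.mean()
    return 1.0 - (resid @ resid) / (total @ total)


def group_ablation(X_train, y_train, X_test, y_test, groups):
    """Return (baseline R^2, {group: drop in test R^2 when that group is excluded})."""
    baseline = fit_r2(X_train, y_train, X_test, y_test)
    all_cols = np.arange(X_train.shape[1])
    drops = {}
    for name, cols in groups.items():
        keep = np.setdiff1d(all_cols, cols)  # every column except this group
        reduced = fit_r2(X_train[:, keep], y_train, X_test[:, keep], y_test)
        drops[name] = baseline - reduced
    return baseline, drops


# Synthetic stand-in data (an assumption for illustration only):
# the outcome depends strongly on columns 0-1, weakly on column 2,
# and not at all on columns 3-5.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=2000)

# Hypothetical feature groups, each a list of column indices.
groups = {"recent_labs": [0, 1], "operations": [2, 3], "history": [4, 5]}

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

baseline, drops = group_ablation(X_train, y_train, X_test, y_test, groups)
print(f"baseline test R^2: {baseline:.3f}")
for name, drop in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(f"without {name}: R^2 drops by {drop:.3f}")
```

The groups with the largest drop are the ones the model relies on most, which is the reading of the right-hand chart in Figure 2.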

PHASE 3: Prove Better Technology Can Be Accessible To Non-technical Experts

With new technology, being able to take advantage of existing talent is critical for two reasons: overall adoption, and realizing the full potential of an organization’s data. In this proof point, the PDL team collaborated with Pivotal Labs and created an application that allowed clinicians and data scientists to generate rich features quickly, without writing any code, on hundreds of millions of patient records stored in HDFS.

The valuable knowledge physicians possess about patients can contribute greatly to modeling exercises, for example, in identifying valuable features of patient readmissions or adherence to smoking cessation programs. However, without the ability to write code, it is often difficult for them to translate their clinical knowledge and explore different hypotheses by processing and visualizing the data. This also applies to patient diagnoses. The coding systems in use (ICD-9 in this case) are highly specific. By describing a patient’s comorbidities at a less granular level and grouping codes, for example using CCS codes or the Charlson index, we find the analysis to be far more informative. Furthermore, depending on the particular application, a physician may only be interested in capturing newly diagnosed conditions (incidence) prior to a particular procedure and only within fixed windows of time.
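To make the grouping and incidence idea concrete, here is a small sketch under stated assumptions. The `CODE_GROUPS` table is a hand-written, hypothetical mapping from a few ICD-9 codes to coarse labels; it is not the real CCS table or the Charlson index. The function name `incident_groups` is likewise only illustrative. It returns the diagnosis groups whose first-ever record falls inside a fixed window before a procedure.

```python
from datetime import date, timedelta

# Hypothetical, hand-written grouping of a few ICD-9 codes into coarse labels.
# A real application would load a full grouping such as CCS instead.
CODE_GROUPS = {
    "250.00": "diabetes",
    "250.02": "diabetes",
    "428.0": "heart_failure",
}


def incident_groups(diagnoses, procedure_date, window_days):
    """Return groups first diagnosed within `window_days` before the procedure.

    diagnoses: iterable of (diagnosis_date, icd9_code) for one patient.
    A group counts as incident only if its earliest record before the
    procedure falls inside the window; older conditions are excluded.
    """
    first_seen = {}
    for dx_date, code in diagnoses:
        group = CODE_GROUPS.get(code)
        if group is None or dx_date >= procedure_date:
            continue  # unmapped code, or recorded on/after the procedure
        if group not in first_seen or dx_date < first_seen[group]:
            first_seen[group] = dx_date
    window_start = procedure_date - timedelta(days=window_days)
    return sorted(g for g, d in first_seen.items() if d >= window_start)


# Toy patient history (made up for illustration).
history = [
    (date(2012, 1, 5), "250.00"),   # diabetes, long-standing
    (date(2014, 3, 1), "250.02"),   # diabetes again, not new
    (date(2014, 3, 10), "428.0"),   # heart failure, newly diagnosed
]
new_conditions = incident_groups(history, date(2014, 4, 1), window_days=90)
print(new_conditions)  # ['heart_failure']
```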

To enable physicians to profile their patient population, we created a web-based application that was capable of, in seconds, processing hundreds of millions of patient records to generate profiles of the patient population (Figure 3). The application was flexible enough to allow the physicians to

  • choose various levels of granularity (CCS code levels)
  • filter by treatment location (e.g., hospital, skilled nursing facility)
  • use different time windows of interest

It generated a visualization of the breakdown of diagnoses of the requested patient population, even allowing interactive drill-downs.

Processing these large volumes of data in real time and on demand is no trivial task. We used a compressed representation of the data that allowed in-database bitwise operations, making the process extremely fast and efficient. This application was so successful that, together with the LOS project, it won the 2014 innovation fund for technology award for our sponsors within the company.
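The post does not spell out the compressed representation, so the following is only a sketch of one common approach: a bitmap per attribute value, where bit i is set when patient i has that attribute. Filtering a population then becomes a bitwise AND, and counting becomes a population count. Here plain Python integers stand in for the in-database bitmaps, and all names and data are hypothetical.

```python
from collections import defaultdict


def build_bitmaps(records):
    """Map each attribute value to an int whose bit i is set if patient i has it.

    records: iterable of (patient_index, attribute_value) pairs.
    """
    bitmaps = defaultdict(int)
    for patient_idx, value in records:
        bitmaps[value] |= 1 << patient_idx
    return dict(bitmaps)


def popcount(bits):
    """Number of set bits, i.e. number of patients in the bitmap."""
    return bin(bits).count("1")


def profile(population_bits, diagnosis_bitmaps):
    """Count patients per diagnosis group within the selected population."""
    return {
        dx: popcount(bits & population_bits)  # AND keeps only selected patients
        for dx, bits in diagnosis_bitmaps.items()
    }


# Toy data (made up): four patients, indexed 0..3.
diagnoses = build_bitmaps([
    (0, "diabetes"), (1, "diabetes"), (3, "diabetes"),
    (2, "heart_failure"), (3, "heart_failure"),
])
locations = build_bitmaps([
    (0, "hospital"), (1, "skilled_nursing"), (2, "hospital"), (3, "hospital"),
])

hospital_profile = profile(locations["hospital"], diagnoses)
print(hospital_profile)  # {'diabetes': 2, 'heart_failure': 2}
```

Combining several filters (for example a location and a time window) is just a chain of ANDs over the corresponding bitmaps before counting.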


Figure 3: Screenshot of the web-based application where physicians can select various ways to group and aggregate patient histories for selected populations.

PHASE 4: Prove Data Science Can Improve Business Outside The Clinical Setting

In this project and proof point, our customer asked if we could help their accounts payable department with fraud, waste, and abuse (FWA) detection. This department was already doing a great job detecting FWA using deterministic rules established by their domain experts. However, they were interested in how data science might improve their approaches.

In just a few weeks, we managed to detect a substantial number of FWA cases that the existing rules had missed, as covered in our webinar Machine Learning for Forensic Accounting. Furthermore, the approaches reduced the number of false positives sent for review, reducing the workload of the domain experts. These tasks were accomplished by leveraging several different approaches:

  • Fuzzy string matching (by calculating the Damerau-Levenshtein distance) to identify duplicate entries that may result from data-entry errors
  • Benford’s Law for identifying falsified invoices, whose amounts follow a non-natural distribution
  • Anomaly detection on purchasing profiles, e.g. to identify opportunities to reduce spending by comparing hospital generic drug purchasing behavior (see Figure 4)
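The first two techniques in the list above are compact enough to sketch. The code below is an illustration, not the code used in the engagement. For fuzzy matching it implements the optimal string alignment variant of Damerau-Levenshtein (each substring is edited at most once), which is the simpler and more commonly used form. For Benford’s Law it compares observed first-digit counts against the expected frequencies log10(1 + 1/d) with a chi-square statistic; the choice of statistic and the two-decimal currency formatting are assumptions.

```python
import math
from collections import Counter


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Allowed edits: insertion, deletion, substitution, and transposition
    of two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def benford_expected(digit):
    """Expected share of leading digit `digit` (1-9) under Benford's Law."""
    return math.log10(1 + 1 / digit)


def first_digit(amount):
    """Leading nonzero digit of a currency amount, or None if it rounds to 0.00."""
    for ch in f"{abs(amount):.2f}":
        if ch in "123456789":
            return int(ch)
    return None


def benford_chi_square(amounts):
    """Chi-square statistic of observed first digits against Benford's Law."""
    digits = [d for d in map(first_digit, amounts) if d is not None]
    counts = Counter(digits)
    n = len(digits)
    return sum(
        (counts.get(d, 0) - n * benford_expected(d)) ** 2 / (n * benford_expected(d))
        for d in range(1, 10)
    )


# A swapped pair of letters is a single edit, so near-duplicates score low.
print(osa_distance("ACME Corp", "ACEM Corp"))  # 1

# Made-up amounts: a geometric series is close to Benford, a run of
# invoices just under 1000 is not.
natural = [round(100 * 1.05 ** k, 2) for k in range(400)]
suspicious = [900.0 + i for i in range(99)]
print(benford_chi_square(natural) < benford_chi_square(suspicious))  # True
```

A larger chi-square value means the first digits depart further from Benford’s Law; any cutoff for flagging a set of invoices would need to be chosen for the data at hand.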

As a result, the Accounts Payable department hired their very first data scientist, which is one of many ways we measure success at PDL.


Figure 4: The heatmap shows the spend profile for the pharmacies within the healthcare provider for a single drug (defined as an active ingredient). Each column shows how a given pharmacy’s dollar spend for that particular drug is distributed across the various products available. A sample irregularity is shown where the pharmacy spends a larger fraction on a brand name drug than other pharmacies.
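The post does not say which anomaly detection method was applied to these spend profiles, so the following is only a simple stand-in: it converts each pharmacy’s dollar spend into fractions per product (the columns of the heatmap) and flags any pharmacy whose fraction for a product exceeds the median across pharmacies by more than a threshold. The threshold value, the function names, and the data are all hypothetical.

```python
from statistics import median


def spend_fractions(spend_by_product):
    """Convert {product: dollars} into {product: share of total spend}."""
    total = sum(spend_by_product.values())
    if total == 0:
        return {}
    return {product: amount / total for product, amount in spend_by_product.items()}


def flag_irregular(spend, threshold=0.25):
    """Flag (pharmacy, product) pairs whose spend share is unusually high.

    spend: {pharmacy: {product: dollars}} for a single drug.
    Returns a list of (pharmacy, product, share, median_share) tuples where
    share exceeds the median share across pharmacies by more than `threshold`.
    """
    profiles = {pharmacy: spend_fractions(p) for pharmacy, p in spend.items()}
    products = sorted({prod for p in profiles.values() for prod in p})
    flags = []
    for product in products:
        typical = median(p.get(product, 0.0) for p in profiles.values())
        for pharmacy, p in profiles.items():
            share = p.get(product, 0.0)
            if share - typical > threshold:
                flags.append((pharmacy, product, share, typical))
    return flags


# Made-up spend for one drug across four pharmacies.
spend = {
    "pharmacy_a": {"generic": 900.0, "brand": 100.0},
    "pharmacy_b": {"generic": 850.0, "brand": 150.0},
    "pharmacy_c": {"generic": 400.0, "brand": 600.0},
    "pharmacy_d": {"generic": 950.0, "brand": 50.0},
}

flags = flag_irregular(spend)
for pharmacy, product, share, typical in flags:
    print(f"{pharmacy}: {share:.0%} of spend on {product} (median {typical:.1%})")
```

In this toy data only `pharmacy_c` is flagged, for spending a much larger share on the brand-name product than its peers.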

PHASE 5: Prove That The Technology Doesn’t Require Pivotal Data Scientists

After being convinced that PDL can use Pivotal technologies to build better models faster, the customer wanted us to train their data scientists to build better models. We designed a custom training session and asked their data scientists to bring the models they were working on to see if we could improve any of them. In 5 short days, their data scientists built a brand new sepsis mortality model (which outperformed the general mortality model) and improved their EDIP (Early Detection of Impending Physical Deterioration) model significantly. This was made possible by our platform (Apache Hadoop® and HAWQ), which enabled the use of new modeling tools and extremely large-scale datasets, including bedside monitor feeds and orders.

Using Pivotal’s technology they were able to:

  • Perform rapid exploration, munging, and modeling of data stored in HDFS with HAWQ’s SQL capabilities.
  • Access a variety of visualization and processing tools, including our big data machine learning library, MADlib.

It was a great experience to see their data scientists explore the whole dataset in its rawest form and build many interesting features in minutes. These would have taken them days using their old analytics tool.

Succeed At Your Own Big Data Journey

Acquiring a new technology never guarantees adoption, especially for running analytics. You may already have a shiny distributed computing platform, but if your data scientists are still extracting a sample and taking it to an in-memory solution to analyze it, you are missing the boat. Sometimes you need to teach your data scientists how to leverage this new technology, and PDL is happy to help you with that challenge. The sample technology adoption journey here is only one of many examples of how PDL has helped our customers along this path. Look for future posts on how customers engage and get educated with new technologies to discover how they can revolutionize their business.

Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

