Conquering the $216 Billion Cancer Problem in 8 Hours with Spring, Redis, Neo4J, and More

April 19, 2014 Adam Bloom

Each year, cancer costs $216 billion in the U.S. and takes 1,600 lives per day, or over 585,000 lives per year. It is the second most common cause of death in the U.S., accounting for 1 out of 4 deaths.

One start-up, Redbasin Networks, is using big data about cancer and open source technologies from Pivotal and others to help beat the disease. In fact, they have turned months of R&D work into an 8-hour cloud workload using Spring, Redis, Neo4J, and MongoDB.

Speaking at last fall’s SpringOne 2GX, Redbasin shared their company background and the core components of their technology stack. They were bullish on a key premise—helping pharmaceutical R&D departments use big data to follow the popular software principle of “fast failure.” By applying the agile concept, they believe faster failure during the R&D process can save billions of dollars.

Top pharmaceutical companies like Roche spend $9.3 billion per year on R&D, and others spend billions as well. These investments carry a lot of waste: a recent Forbes analysis pointed out that developing a single drug can cost $5.5 billion, while 95% of experimental medicines studied in humans fail to be both effective and safe. In Redbasin’s vision, this multi-billion dollar expense is unnecessary. One of their customers saw the results of Redbasin’s algorithms and exclaimed that it took their company’s R&D team months to do what Redbasin’s system did in 8 hours.

Redbasin is a pioneer, but they are not alone in their thinking about health, science, and big data. Many pharma and bioinformatics-related companies are looking to big data to solve similar problems, along with McKinsey, IBM, EMC, and Pivotal. Healthcare poses both tremendous challenges and opportunities worldwide. In the U.S., healthcare spending grew 3.7% in 2012 to $2.8 trillion, or $8,915 per person, which is 17.2% of U.S. gross domestic product (GDP). Projections show it might grow to $4.8 trillion, almost doubling, by 2021. Cancer costs were $27 billion in 1990, $90 billion in 2008 and reached $104 billion in 2006 with a projection of $173 billion in 2020, a 6.4-fold increase over 1990.

Cancer’s Cost Drivers—A Multi-Billion Dollar Big Data Problem

A variety of factors impact these cancer costs, but drug prices are a key culprit. According to one source covering the state of oncology in the U.S., human longevity, competition between doctors and hospitals, a shortage of oncologists, and increasing drug prices are the underlying factors behind the rising cost of cancer. In the article, one doctor said he remembers when average cancer drug costs were $1,000 per month; now they average $10,000 per month, a figure that a separate reference backs up. The American Society of Clinical Oncology also pointed to costs like unnecessary or ineffective tests and treatments.

Redbasin drives down drug-related costs by re-using expensive sets of big data. Speaking at last year’s SpringOne 2GX, Redbasin’s CEO, Smitha Gudur, and CTO, Manoj Joshi, explained the cancer cost problem through the lens of big data: “There are so many entities and organizations talking to each other in haphazard ways.” They pointed out the disarray of data ownership across dozens of stakeholders in the field. The EPA, FDA, biotech labs, CDC, instrument vendors, drug labs, patients, pharma companies, contract research organizations, universities, and hospitals all have their own analyses, data sets, and silos of expensive data. Recently, a Genentech VP and M.D. described the same problem, “Many systems where information is stored were built in a way that doesn’t allow them to ‘talk’ to each other. This integration continues to be a big challenge with big data.” Similarly, Keith Perry, associate vice president and deputy CIO of MD Anderson, said, “Many of our databases currently don’t interface with each other because they’re generated by and housed separately.”

As Gudur gave the SpringOne 2GX audience a background on their business, she explained how they are bringing this data together into one place and stated their core belief, “Failures are expensive. But, if you can use the concept of ‘fast failures’ from the start-up and software world, it could save a ton of money.” According to Gudur, the drug lifecycle starts with a 10-year investment at a cost of $100 million per year. By the time a drug company starts clinical trials, they have probably spent $700 million. This is where Redbasin aims to drive down costs: by mining data from past investments in drug research to help with current R&D efforts.

Redbasin’s Big Data Challenges and Analytics Platform

Redbasin’s user is a life science R&D staff member who recommends a drug molecule or publishes a research paper. Redbasin’s big data analytics system helps these researchers arrive at conclusions faster.

As noted, Redbasin’s core problem to date is big data management; you might say Redbasin is trying to build a data lake architecture. To perform advanced analytics on this data, they are bringing together hundreds of heterogeneous data sources with no common meaning or schema. Redbasin’s CTO describes the data as a nebula and explained the 225 dimensions and sub-dimensions in the data model to date. These dimensions include genes, proteins, diseases, drugs, antibodies, ligands, pathways, enzymes, amino acids, instruments, trials, and much more. In fact, the data is bigger and more complicated than genome information alone, the subject of the popular project defining how our double-stranded DNA, organized into 23 chromosome pairs, comprises 3.2 billion base pairs containing 20,000 to 25,000 distinct protein-coding genes. Redbasin isn’t just modeling and storing gene information but also the sequencing of genes and the mutation factors that cause the data to evolve and expand with new research.

With so many dimensions and an evolving data set, it takes a fair amount of effort just to define a query for analysis. In addition, the queries for Redbasin’s dataset involve hierarchies, nested information, fractal-like behavior of data, pervasive joins, and multi-dimensional meta-data, making relational data or SQL-oriented schemas unusable. For example, temporality can be difficult to represent or analyze in relational models, but time is an important query factor on PD/PK sub-dimensions. Pharmacodynamics (PD) dimensions measure what the drug does to the body over time, and pharmacokinetics (PK) dimensions measure what the body does to the drug over time, through phases like drug release, absorption into blood circulation, distribution throughout the fluids and tissues, metabolization, and excretion.

How Redbasin Uses Spring, Redis, Neo4J, and MongoDB

Often using math to reduce complexity, Redbasin’s big data analytics architecture relies on Spring, Redis, Neo4J, and MongoDB as well as Lucene and Hadoop’s HBase. The video and slides from Redbasin’s presentation at SpringOne 2GX provide additional depth.

Spring is the default framework. According to Joshi, “With Spring, we can focus on our application instead of the wiring.” Specifically, Spring is used to separate analytics from queries, and all data binding functions go through Spring Data. Spring Data is used to connect to hundreds of data sources and helps to create dynamic models for contextual data mining—setting the context for specific mining workloads. Spring Data also provides high-productivity access through Spring Data Neo4J, Spring Data Redis, Spring Data MongoDB, and Spring Data Hadoop.
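The separation of analytics from queries described above can be sketched in a few lines. This is only an illustration of the repository idea in plain Python, not Spring Data and not Redbasin’s code; the names `TargetRepository`, `InMemoryTargetRepository`, and `shared_targets`, and the placeholder data, are all hypothetical.

```python
from typing import Dict, Iterable, List, Protocol, Set


class TargetRepository(Protocol):
    """Query side: any data store that can list targets for a disease."""

    def targets_for(self, disease: str) -> Iterable[str]: ...


class InMemoryTargetRepository:
    """Hypothetical stand-in for a real store (graph, document, key-value)."""

    def __init__(self, rows: Dict[str, List[str]]) -> None:
        self.rows = rows

    def targets_for(self, disease: str) -> Iterable[str]:
        return self.rows.get(disease, [])


def shared_targets(repo: TargetRepository, a: str, b: str) -> Set[str]:
    """Analytics side: depends only on the repository interface."""
    return set(repo.targets_for(a)) & set(repo.targets_for(b))


# Placeholder data, not real biology.
repo = InMemoryTargetRepository({
    "disease-1": ["gene-A", "protein-P"],
    "disease-2": ["gene-A", "gene-B"],
})
shared = shared_targets(repo, "disease-1", "disease-2")
print(shared)
```

Because `shared_targets` only sees the interface, the backing store could be swapped without touching the analysis, which is the kind of decoupling the paragraph attributes to Spring Data.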

MongoDB is used because Joshi has to deal with a lot of XML and JSON. The data is big and complex; some of their source XML files can be 30-40 pages in length. Since the data also changes over time, MongoDB provides flexibility. Joshi shared how they find the NoSQL data store easy to use, scalable and performant. It also provides support for Java and REST, two important standards within their architecture.
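To make the flexibility point concrete, here is a minimal sketch of turning nested XML into a JSON-like document of the kind a document store accepts without a fixed schema. It uses only the Python standard library and does not talk to MongoDB; the XML sample, the field names, and the `element_to_doc` helper are hypothetical, and the conversion rules (attributes become keys, repeated tags become lists) are an assumption rather than Redbasin’s mapping.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<trial id="T-001">
  <drug>drug-X</drug>
  <target>gene-A</target>
  <target>protein-P</target>
  <phase number="2"><status>ongoing</status></phase>
</trial>
"""


def element_to_doc(elem):
    """Recursively convert an XML element into nested dicts, lists, and strings."""
    doc = dict(elem.attrib)  # attributes become top-level keys
    for child in elem:
        value = element_to_doc(child)
        if child.tag in doc:
            # A repeated tag is collected into a list.
            if not isinstance(doc[child.tag], list):
                doc[child.tag] = [doc[child.tag]]
            doc[child.tag].append(value)
        else:
            doc[child.tag] = value
    text = (elem.text or "").strip()
    if text:
        if not doc:
            return text  # a leaf element collapses to its text
        doc["text"] = text
    return doc


doc = element_to_doc(ET.fromstring(SAMPLE_XML.strip()))
print(json.dumps(doc, indent=2))
```

A second trial with extra or missing fields would convert just as easily, which is the property that makes a schemaless store attractive when source formats keep changing.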

Neo4J is used as a graph database and helps Redbasin provide rich analysis services. A graph database is used because the Redbasin data model doesn’t always fit into traditional BI data models or one-to-many and many-to-one types of relationships. Their data is far more densely connected, and many-to-many relationships are pervasive. For Redbasin, the graph model performs much better than “the exponential slowdown of many-JOIN SQL queries in a relational database.” For example, Neo4J is used to help answer starting questions like, “What are all the genes and proteins affected by breast cancer?” and then further analysis can be done to answer “Where do other drugs interact with these same genes and proteins?” This is where Neo4J is used alongside Redis, for sub-graph caching.
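The two questions quoted above map naturally onto graph traversals. The following sketch uses a tiny in-memory graph in plain Python rather than Neo4J, purely to show the shape of the queries; the `Graph` class, the helper names, and all node identifiers are hypothetical placeholders, not real biological data.

```python
from collections import defaultdict


class Graph:
    """Tiny undirected, labeled graph standing in for a graph database."""

    def __init__(self):
        self.labels = {}
        self.adj = defaultdict(set)

    def add_edge(self, a, a_label, b, b_label):
        self.labels[a] = a_label
        self.labels[b] = b_label
        self.adj[a].add(b)
        self.adj[b].add(a)

    def neighbors(self, node, label):
        """Neighbors of `node` that carry the given label."""
        return {n for n in self.adj[node] if self.labels[n] == label}


def targets_of(graph, disease):
    """First question: genes and proteins linked to a disease."""
    return graph.neighbors(disease, "gene") | graph.neighbors(disease, "protein")


def drugs_touching(graph, targets):
    """Second question: drugs that interact with any of those targets."""
    hits = defaultdict(set)
    for target in targets:
        for drug in graph.neighbors(target, "drug"):
            hits[drug].add(target)
    return dict(hits)


g = Graph()
g.add_edge("disease-1", "disease", "gene-A", "gene")
g.add_edge("disease-1", "disease", "protein-P", "protein")
g.add_edge("disease-2", "disease", "gene-A", "gene")
g.add_edge("drug-X", "drug", "gene-A", "gene")
g.add_edge("drug-Y", "drug", "protein-P", "protein")
g.add_edge("drug-Y", "drug", "gene-A", "gene")
g.add_edge("drug-Z", "drug", "gene-B", "gene")

targets = targets_of(g, "disease-1")
overlap = drugs_touching(g, targets)
print(sorted(targets), {d: sorted(t) for d, t in overlap.items()})
```

Each hop follows stored adjacency instead of joining tables, which is the intuition behind the many-to-many argument; in a real graph database the same two steps would be expressed in its query language.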

When Redbasin began using Redis, it was for auto-completion. This evolved into ontology and taxonomy lookups, alias management, and most of the data structure management within their analytics workloads. Redis can store data like organism classifications, disease codes, human gene identifiers, and drugs. However, the most interesting and perhaps most important use case is sub-graph caching.
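As a rough illustration of the auto-completion and alias use cases, the sketch below keeps terms in sorted order and answers prefix queries with a binary search, similar in spirit to a lexicographic range query over a sorted set. It runs in plain Python with no Redis connection; the `Autocomplete` and `AliasIndex` classes, the term list, and the identifier `GENE:0001` are hypothetical, and the talk does not say how Redbasin implemented these features.

```python
import bisect


class Autocomplete:
    """Prefix lookup over a sorted list of lowercase terms."""

    def __init__(self, terms):
        self.terms = sorted({t.lower() for t in terms})

    def suggest(self, prefix, limit=5):
        prefix = prefix.lower()
        # Jump to the first term that is >= the prefix.
        start = bisect.bisect_left(self.terms, prefix)
        out = []
        for term in self.terms[start:]:
            if not term.startswith(prefix) or len(out) == limit:
                break
            out.append(term)
        return out


class AliasIndex:
    """Maps every known alias to one canonical identifier."""

    def __init__(self):
        self.canonical = {}

    def add(self, canonical_id, *aliases):
        for name in (canonical_id, *aliases):
            self.canonical[name.lower()] = canonical_id

    def resolve(self, name):
        return self.canonical.get(name.lower())


ac = Autocomplete(
    ["breast cancer", "brain tumor", "bronchitis", "bladder cancer", "melanoma"]
)
suggestions = ac.suggest("br")

aliases = AliasIndex()
aliases.add("GENE:0001", "geneA", "GA-1")  # hypothetical identifier and aliases
resolved = aliases.resolve("ga-1")
print(suggestions, resolved)
```

Both structures are small, read-heavy lookups, which is the kind of workload an in-memory key-value store handles well.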

As Joshi began to describe the use case for sub-graph caching, he said, “One in fourteen drugs fail in clinical trials. At a cost of $2 billion per drug, that means it costs $28 billion for a success. We want to reduce the amount of cost, effort, and research needed.” To do this, he explained one of their key approaches: determining past drugs that show similar behavior to what is needed to solve a current problem. In effect, the analysis applies existing drugs to new use cases. One company did this recently, using an older arthritis drug for cancer today. Technically, Joshi explained, this means comparing multiple sub-graphs, and they use Redis to help manage the correlations across these graphs. First, they perform some contextual mining to figure out what to look for, because Redbasin’s Neo4J data can have hundreds of millions or billions of nodes. Comparing entire sets of analytical information is too big to run in a single JVM, and graph traversal can crash the JVM. Redis helps to cache the information and run the correlations and comparisons in a more effective manner.
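The idea of caching sub-graphs and then comparing them can be sketched as follows. This is an assumption-laden illustration, not Redbasin’s algorithm: a plain dict plays the role the talk assigns to Redis, the traversal is a depth-bounded breadth-first walk, and Jaccard similarity is swapped in as one simple way to score overlap between two neighborhoods. All node names are placeholders.

```python
from collections import deque


def neighborhood(adj, start, max_depth):
    """Nodes reachable from `start` within `max_depth` hops (start excluded)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # bound the walk so it never covers the whole graph
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return frozenset(seen - {start})


class SubgraphCache:
    """Dict-backed cache; a real deployment might use an external store."""

    def __init__(self, adj):
        self.adj = adj
        self.store = {}
        self.misses = 0

    def get(self, start, max_depth=2):
        key = (start, max_depth)
        if key not in self.store:
            self.misses += 1
            self.store[key] = neighborhood(self.adj, start, max_depth)
        return self.store[key]


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder graph: drugs -> proteins -> pathways.
adj = {
    "drug-old": ["protein-1", "protein-2"],
    "drug-new": ["protein-2", "protein-3"],
    "protein-1": ["pathway-A"],
    "protein-2": ["pathway-A"],
    "protein-3": ["pathway-B"],
}

cache = SubgraphCache(adj)
similarity = jaccard(cache.get("drug-old"), cache.get("drug-new"))
cache.get("drug-old")  # served from the cache, no second traversal
print(round(similarity, 2), cache.misses)
```

Bounding the traversal depth and reusing cached neighborhoods keeps each comparison small, which reflects the stated motivation of avoiding whole-graph traversals inside a single process.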

