Data As The New Oil: Producing Value for the Oil & Gas Industry

April 28, 2015 Rashmi Raghu


The popular perception of data as the new oil is very apt, particularly within the oil and gas industry. During a recent webinar for Data Science Central, I discussed the topic of producing value for the oil & gas industry with big data analytics and data science (joint work done with Niels Kasch and Kaushik Das).

While the presence of large amounts of data from sources such as seismic surveys is not new to the oil & gas industry, increasingly cost-effective data collection methods, storage, and computational resources are reducing the barriers to storing large amounts of data from sensors and other sources. Companies can leverage these data sources to improve logistics, business operations and more. By using a wide breadth of data sources and big data technologies and techniques, oil and gas companies can improve efficiency, realize new business opportunities, and enhance decision making.

There are certainly challenges and roadblocks for companies that want to yield new opportunities and insights from the increase in data production and analysis. As in many industries, oil and gas companies may have data sources and skilled practitioners within silos, often due to legacy software and policies. These technologies and domain experts may be unable to work and communicate with one another. As a result, an organization may not know the numerous data sources it has access to and what it can leverage.

The promise of data as “the new oil” is realized when we can tap into its value in a meaningful, cross-functional way to enhance decision-making. It is precisely through this collaboration and access to all available data sources that business value is realized. In contrast to the siloed approaches common in many industries, a Data Lake model enables data to be stored centrally and curated in a meaningful way, and provides businesses with a comprehensive view of the truth. The integration of data assets leads to more informed, powerful models, and never-before-realized opportunities based on model insights. Moreover, businesses which operationalize the real-time application of predictive models where appropriate can enhance their ability to rapidly respond to new events.

Data-driven use cases within oil and gas companies include predictive maintenance of equipment through the modeling of function and failure and the optimization of maintenance schedules, seismic imaging and inversion analysis, reservoir simulation and management, production optimization, supply chain optimization, as well as energy trading.

Most significant among these, perhaps, is the impact predictive analytics will have upon drilling operations. Drilling wells is an expensive process, and equipment failure compounds the expense. According to one operator, as reported by The American Oil & Gas Reporter in April 2014, drilling motor damage can account for 35% of rig non-productive time and $150,000 per incident. Given that there were over 800,000 oil & gas wells in the US as of 2009, the total cost of such incidents could amount to billions of dollars. The goal of predictive maintenance efforts is to increase efficiency, reduce costs, and take steps towards zero unplanned downtime by predicting equipment function and failure, establishing an early warning system for equipment failure, optimizing drilling operations parameters, and improving health and safety while reducing environmental risks. The data to perform these tasks comes from a number of sources, including sensor-enabled machinery as well as data reported by operators. By realizing an effective business data lake and introducing data science techniques, a business can track and predict drilling equipment function and failure, an important step on the path towards establishing early warning systems that ensure zero unplanned downtime.

Complex use cases such as these can be tackled using a well-defined data science process that the Pivotal Data Science team has successfully employed in many engagements. The process includes a number of steps:

  • a problem formulation phase where the goal is to ensure the problem being solved is relevant to the business and stakeholders
  • a data step to explore the data and build the right feature set
  • a modeling step wherein we move from answering ‘what, where, and when’ to ‘why and what if’
  • an application phase to build a framework to integrate the model into decision-making processes.

The blog post “The Eightfold Path of Data Science” provides a comprehensive discussion of this data science methodology.

Pivotal’s Big Data Suite can be utilized throughout all phases of the data science and analytics cycle. Pivotal’s products provide companies with the platform to integrate data from multiple sources, across data warehouses and rig operators, and the ability to analyze both structured and unstructured data in a unified manner. They also support the development of complex and extensible predictive models at scale.

Predictive analytics use cases for drilling operations include prediction of drill rate-of-penetration (ROP) and prediction of equipment failure. There are a number of relevant data sources to consider when performing predictive analytics for drilling operations, such as drill rig sensor data and data logged by drill operators. In the case of drill rig sensor data, features being captured include depth, ROP, RPM, torque, weight on bit, and much more, potentially accounting for billions of records depending on the number of wells and drilling duration. It is also important to consider operator data, such as event details, drill bit details, failures, and component details, potentially accounting for tens to hundreds of thousands of records or more.

Building analytical models using such data sets requires a comprehensive framework for performing data integration at scale, which includes data cleansing and the standardization of columns. As in many big data analytics use cases, this presents a number of challenges: data sources do not necessarily use consistent entries in the features and columns that link them, there is the potential for errors in manually entered data, and sensor measurements can return invalid values due to malfunction. Consider the problem of not having consistent entries for rig/well names across data sources. One way to solve this problem is to standardize these columns in all data sources by deriving a canonical representation of the columns using regular expression transformations. Another option is to join well names from different data sources using string distance computations and fuzzy matching.
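Both approaches to inconsistent rig/well names can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the work described here: the function names, the particular regular expressions, the sample well names, and the 0.8 similarity threshold are all assumptions made for the example, and the standard library's difflib stands in for whichever string distance a production pipeline would use.

```python
import re
from difflib import SequenceMatcher


def canonical_well_name(raw: str) -> str:
    """Reduce a free-form rig/well name to an upper-case alphanumeric key.

    The rules below are hypothetical; real naming conventions would
    dictate which tokens and separators are safe to drop.
    """
    name = raw.strip().upper()
    # Drop the standalone word "WELL", which some sources append.
    name = re.sub(r"\bWELL\b", "", name)
    # Remove punctuation and whitespace so "7-A" and "7 A" agree.
    name = re.sub(r"[^A-Z0-9]+", "", name)
    # Strip leading zeros of numeric runs so "007" and "7" agree.
    name = re.sub(r"(?<![0-9])0+(?=[0-9])", "", name)
    return name


def best_match(name, candidates, threshold=0.8):
    """Return the candidate whose canonical form is most similar to `name`,
    or None if no candidate reaches the (assumed) similarity threshold."""
    if not candidates:
        return None
    key = canonical_well_name(name)
    scored = [
        (SequenceMatcher(None, key, canonical_well_name(c)).ratio(), c)
        for c in candidates
    ]
    score, match = max(scored, key=lambda pair: pair[0])
    return match if score >= threshold else None


# Hypothetical names as they might appear in two different sources.
operator_log_names = ["SMITH #007-A WELL", "JONES 12"]
print(canonical_well_name("Smith #007-A well"))
print(best_match("Smyth 7A", operator_log_names))
```

Canonicalization handles systematic differences such as case, separators, and zero padding, while the fuzzy match is a fallback for typos in manually entered names. Matches near the threshold are worth reviewing by hand, since two distinct wells can have very similar names.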


Once such data integration and data cleansing issues have been resolved, the next step is understanding correlations in the available data. For useful results, summary statistics and correlations between variables need to be computed at scale for thousands of variable combinations. MADlib, an open-source library of machine learning algorithms implemented for the distributed computing SQL products that are part of the Pivotal Big Data Suite, can be particularly useful in such problems. MADlib includes many relevant algorithms, such as the parallel implementation of the ‘summary’ function (a generic function that produces summary statistics from any data table much like R’s summary function) and Pearson’s correlation (a function that associates two variables to determine the degree of linear dependence between them).
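To make the two computations concrete, the following single-machine Python sketch shows what a per-column summary and a Pearson correlation calculate. It does not use or reproduce MADlib's API; the function names and the torque/ROP sample values are assumptions made for illustration, and a distributed implementation would compute the same quantities in parallel across data segments.

```python
import math
from statistics import mean, stdev


def summarize(values):
    """Summary statistics for one column; None marks a missing reading.

    Assumes at least two non-missing values (needed for the sample
    standard deviation).
    """
    clean = [v for v in values if v is not None]
    return {
        "count": len(clean),
        "missing": len(values) - len(clean),
        "min": min(clean),
        "max": max(clean),
        "mean": mean(clean),
        "std": stdev(clean),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (sx * sy)


# Hypothetical sensor readings, invented for this example.
torque = [10.0, 12.5, 11.0, 14.0, 13.5]
rop = [30.0, 26.0, 29.0, 22.0, 24.0]
print(summarize(torque))
print(pearson(torque, rop))
```

With thousands of variable combinations, the correlation would be evaluated once per pair of columns, which is why a parallel, in-database implementation matters at this scale.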

When working with use cases that have a complex feature set consisting of multiple time series, it is often useful to create new features from these variables rather than strictly working with raw data. For instance, it is useful to create statistical features over moving windows of time series data, a task that Pivotal’s Big Data Suite is well-equipped to perform rapidly using GPDB (Pivotal’s MPP Database) or HAWQ (Pivotal’s SQL-on-Hadoop engine), MADlib, and procedural language extensions such as PL/R and PL/Python. These tools can be used separately or in tandem for the fast computation of hundreds of features over time windows, using billions of rows of time series data. Pivotal Greenplum Database, for example, has built-in support for window functions that enhance the ability to work with time series data, a topic covered in detail in a previous series of time series analysis blog posts.
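As an illustration of statistical features over moving windows, the sketch below computes the mean, standard deviation, minimum, and maximum over a trailing window of a single time series in plain Python. The function name, the window size, and the sample torque values are assumptions made for the example; in a database, a window function with a trailing row frame would compute roughly the same quantities in SQL.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_features(values, window):
    """Trailing-window statistics for each position in a time series.

    Positions that do not yet have a full window of history yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    buf = deque(maxlen=window)  # automatically discards the oldest reading
    features = []
    for v in values:
        buf.append(v)
        if len(buf) < window:
            features.append(None)
            continue
        features.append({
            "mean": mean(buf),
            "std": pstdev(buf),  # population std dev over the window
            "min": min(buf),
            "max": max(buf),
        })
    return features


# Hypothetical torque readings at consecutive time steps.
torque = [10.0, 12.0, 11.0, 15.0, 14.0]
for t, feats in enumerate(rolling_features(torque, window=3)):
    print(t, feats)
```

Each raw series can produce several such features per window size, which is how a few dozen sensor channels quickly turn into hundreds of model inputs.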

Once features have been generated, the next step is to build a model to solve the problem at hand. Predictive analytics can be performed on a number of aspects of drilling operations using drill rig sensor and operator data. Problems include predicting the rate of penetration, the occurrence of equipment failure in a chosen future time window, and the remaining life of equipment, to name a few. The choice of algorithm must fit the problem statement. For instance, to predict the occurrence of equipment failure in a chosen future time window, one could consider algorithms such as Logistic Regression, Elastic Net Regularized Regression and Support Vector Machines, among others. To predict the remaining life of any equipment, one could use Cox Proportional Hazards Regression. These algorithms (and more) can be executed in parallel and at scale using MADlib on Pivotal GPDB or HAWQ.
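To show how failure prediction in a future time window can be framed as binary classification, here is a small logistic regression fitted by batch gradient descent in plain Python. It is a teaching sketch under stated assumptions, not MADlib's implementation or the model built in the work described here: the single feature (a rolling vibration statistic), the labels, the learning rate, and the number of epochs are all invented for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # derivative of log loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability of failure in the chosen future window."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical training set: one rolling vibration feature per row;
# label 1 means a failure was logged within the following time window.
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
print(predict_proba(w, b, [0.15]), predict_proba(w, b, [0.95]))
```

The labels come from joining operator-logged failure events to the sensor-derived feature rows, which is why the data integration step described earlier has to be reliable before any model is trained.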

Predictive analytics for drilling operations enables oil and gas companies to take steps towards achieving zero unplanned downtime. This can provide businesses with increased efficiency and can reduce the cost and risk of drilling and maintaining wells. The application of big data analytics technologies and techniques empowers organizations to fully utilize their data, develop a comprehensive data integration framework for multiple complex data sources, build and operationalize predictive models, and ultimately gain a competitive advantage by leveraging the entire big data analytics pipeline.


About the Author

Rashmi Raghu

Rashmi Raghu is a Principal Data Scientist at Pivotal with a focus on the Internet-of-Things and applications in the Energy sector. Her work has spanned diverse industry problems, from uncovering patterns & anomalies in massive datasets to predictive maintenance. She holds a Ph.D. in Mechanical Engineering with a minor in Management Science & Engineering from Stanford University. Her doctoral work focused on the development of novel computational models of the cardiovascular system to aid disease research. Prior to that she obtained Master’s and Bachelor’s degrees in Engineering Science from the University of Auckland, New Zealand.
