Data As The New Oil: Producing Value for the Oil & Gas Industry

April 28, 2015 Rashmi Raghu

The popular perception of data as the new oil is very apt, particularly within the oil and gas industry. During a recent webinar for Data Science Central, I discussed the topic of producing value for the oil & gas industry with big data analytics and data science (joint work done with Niels Kasch and Kaushik Das).

While large volumes of data from sources such as seismic surveys are not new to the oil & gas industry, increasingly cost-effective data collection methods, storage, and computational resources are lowering the barriers to storing large amounts of data from sensors and other sources. Companies can leverage these data sources to improve logistics, business operations and more. By using a wide breadth of data sources and big data technologies and techniques, oil and gas companies can improve efficiency, realize new business opportunities, and enhance decision making.

There are certainly challenges and roadblocks for companies that want to draw new opportunities and insights from the increase in data production and analysis. As in many industries, oil and gas companies may have data sources and skilled practitioners within silos, often due to legacy software and policies. These technologies and domain experts may be unable to work and communicate with one another. As a result, an organization may not know which data sources it has access to or what it can leverage.

The promise of data as “the new oil” is realized when we can tap into its value in a meaningful, cross-functional way to enhance decision-making. It is precisely through this collaboration and access to all available data sources that business value is realized. In contrast to the siloed approaches common in many industries, a Data Lake model enables data to be stored centrally and curated in a meaningful way, and provides businesses with a comprehensive view of the truth. The integration of data assets leads to more informed, powerful models, and never-before-realized opportunities based on model insights. Moreover, businesses that operationalize the real-time application of predictive models where appropriate can enhance their ability to respond rapidly to new events.

Data-driven use cases within oil and gas companies include predictive maintenance of equipment through the modeling of function and failure and the optimization of maintenance schedules, seismic imaging and inversion analysis, reservoir simulation and management, production optimization, supply chain optimization, as well as energy trading.

Perhaps most significant among these is the impact predictive analytics will have on drilling operations. Drilling wells is expensive, and equipment failure compounds the cost. According to one operator, as reported by The American Oil & Gas Reporter in April 2014, drilling motor damage can account for 35% of rig non-productive time and cost $150,000 per incident. Given that there were over 800,000 oil & gas wells in the US as of 2009 (according to data.gov), the total cost of such incidents could amount to billions of dollars. The goals of predictive maintenance efforts are to increase efficiency, reduce costs, and move towards zero unplanned downtime by predicting equipment function and failure and establishing an early warning system; to optimize drilling operation parameters; and to improve health and safety while reducing environmental risks. The data for these tasks comes from a number of sources, including sensor-enabled machinery and reports logged by operators. By building an effective business data lake and applying data science techniques, a business can track and predict drilling equipment function and failure, an important step towards early warning systems that help achieve zero unplanned downtime.

Complex use cases such as these can be tackled using a well-defined data science process that the Pivotal Data Science team has successfully employed in many engagements. The process includes a number of steps:

a problem formulation phase where the goal is to ensure the problem being solved is relevant to the business and stakeholders
a data step to explore the data and build the right feature set
a modeling step wherein we move from answering ‘what, where, and when’ to ‘why and what if’
an application phase to build a framework to integrate the model into decision-making processes.

The blog post “The Eightfold Path of Data Science” provides a comprehensive discussion of this data science methodology.

Pivotal’s Big Data Suite can be utilized throughout all phases of the data science and analytics cycle. Pivotal’s products provide companies with the platform to integrate data from multiple sources, across data warehouses and rig operators, and the ability to analyze both structured and unstructured data in a unified manner. They also support the development of complex and extensible predictive models at scale.

Predictive analytics use cases for drilling operations include prediction of drill rate-of-penetration (ROP) and prediction of equipment failure. There are a number of relevant data sources to consider when performing predictive analytics for drilling operations, such as drill rig sensor data and data logged by drill operators. In the case of drill rig sensor data, the features captured include depth, ROP, RPM, torque, weight on bit, and much more, potentially accounting for billions of records depending on the number of wells and drilling duration. It is also important to consider operator data, such as event details, drill bit details, failures, and component details, potentially accounting for tens to hundreds of thousands of records or more.

Building analytical models using such data sets requires a comprehensive framework for performing data integration at scale, which includes data cleansing and the standardization of columns. As in many big data analytics use cases, this presents a number of challenges: data sources do not necessarily use consistent entries in the features and columns that link them, there is the potential for errors in manually entered data, and sensor measurements can return invalid values due to malfunction. Consider the problem of not having consistent entries for rig/well names across data sources. One way to solve this problem is to standardize these columns in all data sources by deriving a canonical representation of the columns using regular expression transformations. Another option is to join well names from different data sources using string distance computations and fuzzy matching.
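The two approaches to reconciling rig/well names can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the implementation used in the work described here: the function names, the normalization rules, the 0.85 similarity threshold, and the sample well names are all assumptions chosen for the example, and the standard library's difflib stands in for whatever string distance measure a real pipeline would use.

```python
import re
from difflib import SequenceMatcher


def canonical_well_name(raw: str) -> str:
    """Derive a canonical form of a rig/well name (hypothetical rules)."""
    name = raw.upper()
    # Collapse punctuation and whitespace runs into a single space.
    name = re.sub(r"[^A-Z0-9]+", " ", name)
    # Strip leading zeros from standalone numbers, e.g. "007" becomes "7".
    name = re.sub(r"\b0+(\d)", r"\1", name)
    return name.strip()


def best_match(name: str, candidates: list[str], threshold: float = 0.85):
    """Return the candidate most similar to `name`, or None if none is close enough.

    Similarity is difflib's ratio on the canonical forms; the threshold is an
    assumed value that would need tuning on real data.
    """
    if not candidates:
        return None
    target = canonical_well_name(name)
    scored = [
        (SequenceMatcher(None, target, canonical_well_name(c)).ratio(), c)
        for c in candidates
    ]
    score, match = max(scored)
    return match if score >= threshold else None


# Made-up well names, for illustration only.
sensor_wells = ["Eagle-Ford #007", "Permian 12"]
print(canonical_well_name("  Eagle-Ford #007 "))   # EAGLE FORD 7
print(best_match("Eagel Ford 7", sensor_wells))    # typo still resolves to "Eagle-Ford #007"
```

Canonicalization handles systematic differences (case, separators, zero padding), while the fuzzy match catches residual typos in manually entered names. In practice the two are usually combined, as here, by comparing canonical forms rather than raw strings.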

Once such data integration and data cleansing issues have been resolved, the next step is understanding correlations in the available data. For useful results, summary statistics and correlations between variables need to be computed at scale for thousands of variable combinations. MADlib, an open-source library of machine learning algorithms implemented for the distributed computing SQL products that are part of the Pivotal Big Data Suite, can be particularly useful in such problems. MADlib includes many relevant algorithms, such as a parallel implementation of the ‘summary’ function (a generic function that produces summary statistics from any data table, much like R’s summary function) and Pearson’s correlation (a function that measures the degree of linear dependence between two variables).
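To make the two computations concrete, here is a small pure-Python sketch of per-column summary statistics and pairwise Pearson correlation. It does not use or reproduce MADlib's API; it only shows what is being computed, on a single machine, for a table small enough to fit in memory. The column names and values are synthetic.

```python
import math
from itertools import combinations
from statistics import mean


def summarize(values):
    """Basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(table):
    """Correlation for every pair of columns in a dict of name -> values."""
    return {
        (a, b): pearson(table[a], table[b])
        for a, b in combinations(sorted(table), 2)
    }


# Synthetic values, for illustration only.
table = {
    "rop": [10.0, 12.0, 14.0, 16.0],
    "torque": [5.0, 6.0, 7.0, 8.0],
    "wob": [4.0, 3.0, 2.0, 1.0],
}
for column, values in table.items():
    print(column, summarize(values))
for pair, r in pairwise_correlations(table).items():
    print(pair, round(r, 3))
```

The number of column pairs grows quadratically with the number of variables, which is why a parallel, in-database implementation matters once there are thousands of combinations over billions of rows.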

When working with use cases that have a complex feature set consisting of multiple time series, it is often useful to create new features from these variables rather than strictly working with raw data. For instance, it is useful to create statistical features over moving windows of time series data, a task that Pivotal’s Big Data Suite is well-equipped to perform rapidly using GPDB (Pivotal’s MPP Database) or HAWQ (Pivotal’s SQL-on-Hadoop engine), MADlib, and procedural language extensions such as PL/R and PL/Python. These tools can be used separately or in tandem for the fast computation of hundreds of features over time windows, using billions of rows of time series data. Pivotal Greenplum Database, for example, has built-in support for window functions that enhance the ability to work with time series data, a topic covered in detail in a previous series of time series analysis blog posts.
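As a rough illustration of statistical features over a moving window, the sketch below computes a rolling mean, standard deviation, minimum, and maximum over a single time series in plain Python. It is an assumption-laden stand-in for what would be done with SQL window functions or PL/Python inside the database; the function name, the choice of statistics, and the sample torque readings are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_features(values, window):
    """Compute statistics over a trailing window for each point in a series.

    Positions with fewer than `window` observations so far yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    buf = deque(maxlen=window)  # oldest value is dropped automatically
    features = []
    for v in values:
        buf.append(v)
        if len(buf) < window:
            features.append(None)
        else:
            features.append(
                {
                    "mean": mean(buf),
                    "std": pstdev(buf),  # population standard deviation
                    "min": min(buf),
                    "max": max(buf),
                }
            )
    return features


# Synthetic torque readings, for illustration only.
torque = [1.0, 2.0, 3.0, 4.0]
for t, feats in enumerate(rolling_features(torque, window=2)):
    print(t, feats)
```

Each raw series can yield several derived features per window size, so a handful of sensors and window lengths quickly produces the hundreds of features mentioned above.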

Once features have been generated, the next step is to build a model to solve the problem at hand. Predictive analytics can be performed on a number of aspects of drilling operations using drill rig sensor and operator data. Problems include predicting the rate of penetration, the occurrence of equipment failure in a chosen future time window, and the remaining life of equipment, to name a few. The choice of algorithm must fit the problem statement. For instance, to predict the occurrence of equipment failure in a chosen future time window, one could consider algorithms such as Logistic Regression, Elastic Net Regularized Regression and Support Vector Machines, among others. To predict the remaining life of equipment, one could use Cox Proportional Hazards Regression. These algorithms (and more) can be executed in parallel and at scale using MADlib on Pivotal GPDB or HAWQ.
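One way to frame failure prediction in a future time window is as binary classification: each time step is labeled 1 if a failure occurs within the next few steps, and a classifier is fit to features observed at that step. The sketch below shows this framing with a hand-written, single-feature logistic regression trained by gradient descent. It is a toy illustration under stated assumptions, not MADlib's implementation; the function names, learning rate, epoch count, and the synthetic feature values are all hypothetical.

```python
import math


def label_failure_within(failure_flags, horizon):
    """labels[t] is 1 if any failure occurs in steps t+1 .. t+horizon, else 0."""
    return [
        int(any(failure_flags[t + 1 : t + 1 + horizon]))
        for t in range(len(failure_flags))
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Failure flags per time step (synthetic); label with a 2-step look-ahead.
flags = [0, 0, 0, 1, 0]
print(label_failure_within(flags, horizon=2))  # [0, 1, 1, 0, 0]

# Synthetic, normalized feature values with their look-ahead labels.
feature = [0.1, 0.2, 0.3, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(feature, labels)
print("P(failure | x=0.9) =", round(sigmoid(w * 0.9 + b), 3))
```

The labeling step is what encodes the "chosen future time window": changing the horizon changes the question being asked of the model, independent of which classifier is used.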

Predictive analytics for drilling operations enables oil and gas companies to take steps towards achieving zero unplanned downtime. This can provide businesses with increased efficiency and can reduce the cost and risk of drilling and maintaining wells. The application of big data analytics technologies and techniques empowers organizations to fully utilize their data, develop a comprehensive data integration framework for multiple complex data sources, build and operationalize predictive models, and ultimately gain a competitive advantage by leveraging the entire big data analytics pipeline.

About the Author

Rashmi Raghu is a Principal Data Scientist at Pivotal with a focus on the Internet-of-Things and applications in the Energy sector. Her work has spanned diverse industry problems, ranging from uncovering patterns & anomalies in massive datasets to predictive maintenance. She holds a Ph.D. in Mechanical Engineering with a minor in Management Science & Engineering from Stanford University. Her doctoral work focused on the development of novel computational models of the cardiovascular system to aid disease research. Prior to that she obtained Master’s and Bachelor’s degrees in Engineering Science from the University of Auckland, New Zealand.
