Engineering Application of Data Science

Shahab D. Mohaghegh - March 2017

Data Science in Engineering is not confined to applied statistics. To maximize efficiency and spur practical innovations, engineering domain experts must be trained to become expert Data Science practitioners. Objectives of Engineering Application of Data Science include:

  • Advancing the art and science of engineering problem solving, design, and uncertainty quantification with extensive incorporation of Data Science.
  • Training the next generation of engineers and scientists with practical knowledge and expertise in the art and science of Data Mining, Machine Learning, and Artificial Intelligence.

When it comes to “Engineering Application of Data Science”, there is a spectrum. On one end of this spectrum are those who display a "religious" view toward first-principles physics as the sole foundation of engineering and believe that engineering is all about physics and its inter-relation with mathematics and absolutely nothing else. On the other end of the spectrum are those who have a similar "religious" belief toward statistics. The idea they subscribe to, whether it be physics or statistics, is more a "belief" that should not and must not be challenged than a never-ending quest to find the best solution. I have found myself in continuous disagreement with both of these camps.

The “non-wavering believers” in physics and geology, whom I usually call “traditionalists”, cannot even imagine that there might be other ways of solving problems without first understanding the underlying physics, formulating the governing equations, and then using math to solve (analytically or numerically) the developed set of equations. The human brain never uses this path to find solutions (think of the simple problem of catching a ball that is thrown to you). Artificial Intelligence (AI) is an attempt to mimic the method that the brain uses to solve problems. Many physics-based problems are now solved this way using AI. Driverless cars are an example that probably everyone can relate to.

On the opposite side are the “true believers” in statistics. They “believe” that “all” problems can be solved with data. Of course this may not be too far from the truth (in a hypothetical world), but the problem is that these folks cannot see that there is a major difference between how problems are solved using statistical methods and how they are solved using machine learning. They cannot digest the difference between stochastic data models and data generated by a well-defined and sometimes well-understood physical phenomenon. Of course, the more advanced ones in this group call machine learning and AI statistical methods. I do not have a problem with that, as long as they do not try to completely ignore the physics (or geology) or try to fit pre-determined probability distributions, or a pre-determined equation (or set of equations, whether linear or pre-defined non-linear, aka multivariate analysis), to every dataset that they see, regardless of its origin. I call these people curve-fitting experts. To most of them the relationship between correlation and causation is either irrelevant or can be explained with simple logic, even though the explanation may not make much sense.

To this group, it makes no difference whether the data is extracted from social media or has been generated using numerical simulation models (reservoir simulators or computational fluid dynamics [CFD]). Domain expertise to them is not very relevant and may only be needed to justify their discovered correlations to the non-statisticians, no matter how shady the explanation may end up being. However, some of these individuals who have recently entered our industry through start-ups have learned not to say that out loud, and sometimes they even hire a few engineers or geo-scientists in their organizations, just to avert criticism or to be able to use the right abbreviations and terminology.

I disagree with both ends of this spectrum. To me, it does not matter which set of tools is used to accomplish the objectives; what really matters is to find the set of tools that can offer proper and practical solutions. My definition of “proper” and “practical” solutions relates, first and foremost, to accurate “Prediction”. The resulting solution (model) must be able to generate accurate predictions of the behavior of the system that is being modeled. Once this condition is met, then and only then, it is important for the solution (model) to be able to provide information regarding the nature of the process and shed light on the reasons behind the model’s behavior. If the predictions are not accurate, the provided information cannot be trusted, regardless of whether the model exhibits “internal consistency”.

As you may have noticed, “Data Science” has become a buzzword and a marketing scheme, and most recently it has started to lose some of its original attraction as a scientific approach to finding proper and practical solutions. This is, to a large extent, because the marketing schemes employed by some start-ups with substantial venture capital investments, which usually initiate projects in the operating companies from the top (usually at the CEO level), end up generating mediocre solutions at best and, most of the time, disappointing results. In other words, they have generally failed to deliver what they have promised. Another major mistake in the adoption of Data Science by some companies in our industry, one that has contributed to the lack of substantial success (at best, minimal return on investment), is the development of Data Science and/or Data Analytics groups within organizations that are led and overwhelmingly staffed by non-domain experts. The Data Science and/or Data Analytics groups in these organizations (including major service companies) are directed by statisticians and/or AI experts with little to no understanding of the oil and gas industry and the complexities associated with subsurface, wellbore, and surface facility modeling that are the forte of domain experts such as drilling, completion, reservoir, and production engineers as well as geo-scientists. Of course, some of these operating companies are not prepared to publicly admit these failures (a large number of their professionals admit them in private) due to the amount of investment and publicity that followed their decisions. Let’s take a closer look at the vital importance of domain expertise in the application of Data Science in the oil and gas industry.

Since its introduction as a discipline in the mid-90s, “Data Science” has been used as a synonym for applied statistics. Today, Data Science is used in multiple disciplines and is enjoying immense popularity. What has been causing much confusion is the lack of distinction between the applications of Data Science to physics-based versus non-physics-based disciplines. Such distinctions surface once Data Science is applied to industrial applications and starts to move beyond simple academic exercises.

So what is the difference between Data Science as it is applied to physics-based versus non-physics-based disciplines? When Data Science is applied to non-physics-based problems, it is merely applied statistics. Application of Data Science in social networks and social media, consumer relations, demographics, or politics (some may even add medical and/or pharmaceutical sciences to this list) takes a purely statistical form, since there are no sets of governing partial differential (or other mathematical) equations that have been developed to model human behavior or the response of human biology to drugs. In such cases (non-physics-based areas), the relationship between correlation and causation cannot be resolved using physical experiments; usually, as long as the correlations are not absurd, they are justified or explained by scientists and statisticians using psychological, sociological, or biological reasoning.

On the other hand, when Data Science is applied to physics-based problems such as self-driving cars, multi-phase fluid flow in reactors (CFD), or multi-phase fluid flow in porous media (reservoir simulation), it is a completely different story. The interactions between the parameters that are of interest in physics-based problem solving, despite their complex nature, have been understood (to a large extent) and modeled by scientists and engineers for decades. Therefore, treating the data that is generated from such phenomena (regardless of whether it is measured by sensors or generated by simulation) as just numbers that need to be processed in order to learn their interactions (as is done using statistical tools) is a gross mistreatment and over-simplification of the problem, and hardly ever generates useful results. That is why many such attempts have, at best, resulted in unattractive and mediocre outcomes. So much so that many engineers (and scientists) have concluded that Data Science has few serious applications in industrial and engineering disciplines.

The question may arise: if the interactions between the parameters that are of interest to engineers and scientists have been understood and modeled for decades, then how could Data Science contribute to industrial and engineering problems? The answer is: “considerable (and sometimes game changing and transformational) increase in the efficiency (and even accuracy) of the problem solving”. So much so that it may change a solution from an academic exercise into a real-life solution. For example, CO2-EOR or water-flooding optimization and field development planning in large, mature oil and gas fields with hundreds of wells, or uncertainty quantification associated with reservoir characteristics or operational conditions, require examination of hundreds of thousands and sometimes millions of scenarios. Large, complex numerical simulation models (with millions of cells) that take hours for a single run (even on High Performance Computing [HPC] systems or a large number of GPUs) cannot provide the speed that is required for such tasks. Accomplishing such objectives requires accurate and comprehensive, full-field, dynamic reservoir models that can run in a few seconds. Using Artificial Intelligence and Machine Learning to develop history-matched, data-driven models is a viable solution for such objectives.
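The speed argument above can be illustrated with a toy sketch in Python. It is not the history-matched, full-field modeling workflow described in this article. It only shows the general idea of a fast data-driven proxy: learn from a limited number of expensive runs, check the predictions on runs the model has never seen, and only then screen a very large number of scenarios. Everything in it is an assumption made for illustration: `slow_simulator` is an invented formula standing in for a numerical simulator, the three inputs are generic scaled parameters, and the learner is a small single-hidden-layer network with fixed random hidden weights whose output weights are fit by ridge least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

def slow_simulator(x):
    """Hypothetical stand-in for one expensive numerical simulation run.

    x has shape (n, 3): three scaled inputs in [0, 1]. The formula is
    invented for illustration; a real run would take hours.
    """
    return x[:, 0] * (1.0 - np.exp(-3.0 * x[:, 1])) + 0.3 * x[:, 2] ** 2

def hidden_features(x, w, b):
    """Fixed random tanh hidden layer plus a constant (bias) column."""
    h = np.tanh(x @ w + b)
    return np.hstack([h, np.ones((x.shape[0], 1))])

def fit_surrogate(x, y, n_hidden=100, ridge=1e-6):
    """Fit only the output weights, by ridge-regularized least squares."""
    w = rng.normal(0.0, 2.0, size=(x.shape[1], n_hidden))
    b = rng.uniform(-2.0, 2.0, size=n_hidden)
    h = hidden_features(x, w, b)
    a = h.T @ h + ridge * np.eye(h.shape[1])
    beta = np.linalg.solve(a, h.T @ y)
    return w, b, beta

def predict(model, x):
    """Evaluate the fitted surrogate on an array of scenarios."""
    w, b, beta = model
    return hidden_features(x, w, b) @ beta

# A few hundred "expensive" runs are used for training.
x_train = rng.random((400, 3))
y_train = slow_simulator(x_train)
model = fit_surrogate(x_train, y_train)

# Blind validation: prediction accuracy on runs never used in training.
x_test = rng.random((200, 3))
y_test = slow_simulator(x_test)
resid = y_test - predict(model, x_test)
r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_test - y_test.mean()) ** 2)

# Only after validation: screen a large number of scenarios in one call.
scenarios = rng.random((100_000, 3))
forecasts = predict(model, scenarios)
best = scenarios[np.argmax(forecasts)]

print(f"blind-test R^2: {r2:.4f}")
print("best scenario found:", best)
```

The blind-test check comes before the scenario screening on purpose: a proxy that does not predict unseen runs accurately should not be trusted to rank scenarios.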

There is a flourishing future for Data Science as the new generation of engineers and scientists is exposed to it and starts using it in everyday life. The solution for clarifying and distinguishing the application of Data Science to physics-based disciplines, and for demonstrating its useful and game-changing applications in engineering and industry, is to develop a new generation of engineers and scientists who are well versed in the application of Data Science. In other words, the objective should be to train and develop “engineers” who understand and are capable of efficiently applying Data Science to engineering problem solving.