Today, data, both structured and unstructured, is seen as the most valuable business asset for solving problems and improving productivity. An article in Forbes says every company today is a data company! However, our clients often ask whether data can offer insights when the complete data set is not available. The short answer is YES: data can provide insights even when the complete data set is NOT available! Here is one example to illustrate this.
Crude oil is generally transported to refineries through steel pipes. Any rupture in these pipes will have adverse consequences not only for the oil company, but also for the community and the environment. So, oil companies take extreme care while transporting oil to ensure that pipelines remain the safest and cheapest mode of crude oil transportation. One mechanism for ensuring that oil pipes don’t rupture is applying predictive analytics and machine learning (ML) techniques to the available data and taking preventive and corrective actions.
Let’s say there is a hypothetical regression model that predicts the rupture of oil pipes based on 3 independent variables: operating pressure, pipe corrosion level, and the quantity and quality of the soil surrounding the pipe. Typically, the data on operating pressure and pipe corrosion levels is easily available; the operating pressure is reported by the pump and the corrosion levels are reported by the PIGs (Pipe Inspection Gauges). However, the data on the soil around the pipe might not be easy to acquire, as environmental conditions such as landslides, soil erosion, and humidity affect the quality and quantity of the soil where the pipe resides. Hence, of the 3 independent variables, data is available on only 2: operating pressure and pipe corrosion levels. So, when the regression analysis is performed to predict pipe rupture using data from only these 2 variables, fit statistics such as the adjusted R-squared value will show a poor association, because the “Soil” variable, which is statistically significant, is missing from the regression model.
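To make this concrete, here is a minimal sketch of the effect on purely synthetic data. Everything in it is an assumption made for illustration: the variable names, the coefficients (soil is deliberately given the largest weight), the noise level, and the use of a continuous "rupture risk" score as the quantity being predicted. It simply fits an ordinary least squares model once with all 3 variables and once without the soil variable, and compares the adjusted R-squared values.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept; return adjusted R-squared."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Synthetic, standardized data (hypothetical; not real pipeline measurements).
rng = np.random.default_rng(0)
n = 500
pressure = rng.normal(size=n)
corrosion = rng.normal(size=n)
soil = rng.normal(size=n)

# Assumed relationship: soil is the dominant driver of the risk score.
risk = 1.0 * pressure + 1.0 * corrosion + 3.0 * soil + rng.normal(scale=0.5, size=n)

adj_full = adjusted_r2(np.column_stack([pressure, corrosion, soil]), risk)
adj_partial = adjusted_r2(np.column_stack([pressure, corrosion]), risk)

print(f"Adjusted R-squared with soil:    {adj_full:.3f}")
print(f"Adjusted R-squared without soil: {adj_partial:.3f}")
```

Under these assumptions, the model without the soil variable explains only a small share of the variation in the risk score, which is the signal that a significant variable is missing.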
In simple words, the “Soil” variable is a significant independent variable that is needed in order to get good insights from the regression model. Given that the data on the soil is missing, not all the pertinent data is available to do the predictive analytics on pipeline rupture. The converse is also true: if a variable collected for the regression analysis turns out to be of no use in predicting pipe rupture (as indicated by a high P-value), it is better to stop gathering that data. A study by Forrester says that 73% of data collected in an enterprise is actually never used!
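The converse case can be sketched in the same way, again on synthetic data. In this illustration the variable called `unrelated` is a hypothetical measurement that plays no part in generating the risk score, and the P-values are computed with a normal approximation to the t-distribution, which is a simplification that is reasonable only for large samples. A variable like this would typically show a high P-value, while the variables that really drive the outcome show P-values close to zero.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Approximate two-sided p-values for each column of X (intercept excluded)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                          # residual degrees of freedom
    sigma2 = float(resid @ resid) / dof           # residual variance estimate
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    t_stats = beta / std_err
    # Normal approximation to the t-distribution (an assumption; fine for large n).
    return [math.erfc(abs(float(t)) / math.sqrt(2.0)) for t in t_stats[1:]]

# Synthetic, standardized data (hypothetical; not real pipeline measurements).
rng = np.random.default_rng(1)
n = 500
pressure = rng.normal(size=n)
corrosion = rng.normal(size=n)
unrelated = rng.normal(size=n)   # hypothetical variable with no effect on risk

# Assumed relationship: only pressure and corrosion drive the risk score.
risk = 1.0 * pressure + 1.0 * corrosion + rng.normal(scale=1.0, size=n)

p_pressure, p_corrosion, p_unrelated = ols_p_values(
    np.column_stack([pressure, corrosion, unrelated]), risk
)

print(f"p-value, pressure:  {p_pressure:.3g}")
print(f"p-value, corrosion: {p_corrosion:.3g}")
print(f"p-value, unrelated: {p_unrelated:.3g}")
```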
Knowing that you don’t have all the data needed to take an action is itself an insight. Good insights point in the right direction; in this example, the next step is exploring new data or finding proxy data to simulate conditions that closely model the physical environment. This is exactly why data problems can provide insights where physics problems typically don’t: data analytics problems start with a hypothesis, unlike physics problems, which are deterministic and where the outcome is known. Even when it is incomplete, data can still be a valuable resource for the organization!