Business professionals have traditionally viewed the world in concrete terms and sometimes even round numbers. That legacy perspective is black and white compared to the shades of gray that data science produces. Instead of producing a single number result such as 40%, the result is probabilistic, combining a level of confidence with a margin of error. (The statistical calculations are far more complex than that, of course.)
While two numbers are arguably twice as complicated as one, confidence and error probabilities help non-technical decision-makers:
- Think more critically about the numbers used to make decisions
- Understand that predictions are merely probabilities, not absolute “truths”
- Compare options with a greater level of precision by understanding the relative tradeoffs of each
- Engage in more meaningful and informative discussions with data scientists
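To make the pairing of a confidence level with a margin of error concrete, here is a minimal sketch that attaches a margin of error to the 40% figure mentioned above. The sample size of 1,000, the 95% confidence level, and the function name `margin_of_error` are illustrative assumptions, not details from the article. The calculation uses the simple normal approximation for a proportion, which is only one of several possible methods.

```python
import math
from statistics import NormalDist


def margin_of_error(p_hat: float, n: int, confidence: float = 0.95) -> float:
    """Normal-approximation (Wald) margin of error for a sample proportion."""
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of a proportion estimated from n observations.
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return z * standard_error


p_hat = 0.40  # the single-number result used as an example in the text
n = 1000      # hypothetical sample size, assumed for illustration
moe = margin_of_error(p_hat, n)

print(f"Estimate: {p_hat:.0%} +/- {moe:.1%} at 95% confidence")
print(f"Interval: {p_hat - moe:.1%} to {p_hat + moe:.1%}")
```

Under these assumptions the margin works out to roughly three percentage points, and it shrinks as the sample grows, which is one way two options can be compared on more than their headline numbers.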
In fact, there are several reasons why data science isn’t an exact science, some of which are described below.
“When we’re doing data science effectively, we’re using statistics to model the real world, and it’s not clear that the statistical models we develop accurately describe what’s going on in the real world,” said Ben Moseley, associate professor of operations research at Carnegie Mellon University’s Tepper School of Business. “We might define some probability distribution, but it isn’t even clear the world acts according to some probability distribution.”
You may or may not have all the data you need to answer a question. Even if you have all the data you need, there may be data quality problems that could cause biased, skewed, or otherwise undesirable outcomes. Data scientists call this “garbage in, garbage out.”
According to Gartner, “Poor data quality destroys business value” and costs organizations an average of $15 million per year in losses.
If you lack some of the data you need, then the results will be inaccurate because the data doesn’t accurately represent what you’re trying to measure. You may be able to get the data from an external source, but bear in mind that third-party data may also suffer from quality problems. A current example is COVID-19 data, which is recorded and reported differently by different sources.
“If you don’t give me good data, it doesn’t matter how much of that data you give me. I’m never going to extract what you want out of it,” said Moseley.
It’s been said that if one wants better answers, one should ask better questions. Better questions come from data scientists working together with domain experts to frame the problem. Other considerations include assumptions, available resources, constraints, goals, potential risks, potential benefits, success metrics, and the form of the question.
“Sometimes it’s unclear what is the right question to ask,” said Moseley.
Data science is sometimes viewed as a panacea or magic. It’s neither.
“There are significant limitations to data science [and] machine learning,” said Moseley. “We take a real-world problem and turn it into a clean mathematical problem, and in that transformation, we lose a lot of information because you have to streamline it somehow to focus on the key aspects of the problem.”
A model may work very well in one context and fail miserably in another.
“It’s important to be clear that this model is only true in given circumstances. These are boundary conditions,” said Berkeley College Professor Darshan Desai. “And when these boundary conditions are not met, the assumptions are not valid, so the model needs to be revisited.”
Even within the same use case, a prediction model can be inaccurate. For example, a churn model based on historical data might place more weight on recent purchases than older purchases or vice versa.
“The first thing that comes to mind is to build a prediction based on the existing data that you have, but when you build the churn prediction model based on the existing data that you have, you are discounting the future data that you will be collecting,” said Desai.
Michael Yurushkin, CTO and founder of data science company BroutonLab, said there’s a joke about data science not being an exact science because of neural networks.
“In open source neural networks, if you open GitHub and you try to replicate the results of other researchers, you will get [different] results,” said Yurushkin. “One researcher writes a paper and prepares a model. According to the requirements of confidence, you must prepare a model and show results but very often, data scientists don’t provide the model. They say, ‘I will provide [it] in the near future,’ [but] the near future doesn’t come for years.”
When a neural network is trained using stochastic gradient descent, the results depend on the random starting point. So when other researchers train the same neural network using the same method, it descends from a different random starting point and the result will be different, Yurushkin said.
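The effect of the random starting point can be sketched with a toy problem rather than a real neural network. In the example below, everything is an illustrative assumption: the loss function `(w**2 - 1)**2`, which has two equally good minima at -1 and +1, the Gaussian noise that stands in for mini-batch sampling, and the function name `train`. It is not Yurushkin's setup, only a small demonstration of the same idea.

```python
import random


def train(seed: int, steps: int = 500, lr: float = 0.05) -> float:
    """Run noisy gradient descent on the toy loss (w**2 - 1)**2."""
    rng = random.Random(seed)
    # The random starting point: different seeds start in different places.
    w = rng.uniform(-2.0, 2.0)
    for _ in range(steps):
        # Exact gradient of the toy loss plus noise that imitates the
        # randomness of sampling mini-batches (an assumption of this sketch).
        grad = 4 * w * (w * w - 1) + rng.gauss(0.0, 0.1)
        w -= lr * grad
    return w


for seed in range(6):
    print(f"seed={seed}  final w={train(seed):+.3f}")

# Re-running with the same seed reproduces the same result exactly.
print(train(0) == train(0))
```

Runs that use different seeds can settle in different minima, while fixing the seed makes a run repeatable. That is why published results are hard to replicate when the seed and starting weights are not shared.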
Image recognition begins with labeled data, such as photographs that are labeled “cat” and “dog,” respectively. However, not all content is so easy to label.
“If we want to build a binary classifier for NSFW image classification, it’s difficult to say [an] image is NSFW [because] in a Middle Eastern country like Saudi Arabia or Iran, a lady wearing a bikini would be considered NSFW content, so you’d get one result. But if you [use the same image] in the United States where cultural standards and norms are totally different, then the result will be different. A lot depends on the conditions and on the initial input,” said Yurushkin.
Similarly, if a neural network has been trained to predict the type of content coming from a mobile phone using songs and photos from an iOS phone, it won’t be able to predict the same type of content coming from an Android device, and vice versa.
“Many open source neural networks that solve the facial recognition problem were tuned on a particular data set. So, if we try to use this neural network in real situations, on real cameras, it doesn’t work because the images coming from the new domain differ a bit so the neural network can’t process them in the right way. The accuracy decreases,” said Yurushkin. “Unfortunately, it’s difficult to predict in which domain the model will work well or not. There are no estimates or formulas which will help us researchers find the best one.”
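The drop in accuracy on a new domain can be illustrated with a deliberately simple stand-in for a neural network. The sketch below is built entirely on assumptions made for illustration: a one-feature threshold classifier, synthetic Gaussian data, and a `shift` parameter that imitates a new camera or device changing every input slightly. None of these come from the article.

```python
import random


def sample(rng: random.Random, n: int, shift: float = 0.0):
    """Return (feature, label) pairs; `shift` moves every feature value."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 0 is centered at 0.0 and class 1 at 2.0 in the source domain.
        feature = rng.gauss(2.0 * label, 1.0) + shift
        data.append((feature, label))
    return data


def fit_threshold(data) -> float:
    """'Train' by placing the decision threshold midway between class means."""
    means = []
    for cls in (0, 1):
        values = [x for x, y in data if y == cls]
        means.append(sum(values) / len(values))
    return (means[0] + means[1]) / 2


def accuracy(threshold: float, data) -> float:
    """Fraction of points where 'feature above threshold' matches the label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


rng = random.Random(0)
threshold = fit_threshold(sample(rng, 2000))                     # source training data
acc_source = accuracy(threshold, sample(rng, 2000))              # same domain
acc_target = accuracy(threshold, sample(rng, 2000, shift=1.5))   # shifted domain

print(f"accuracy on source domain:  {acc_source:.2f}")
print(f"accuracy on shifted domain: {acc_target:.2f}")
```

The classifier itself is unchanged between the two evaluations. Only the inputs moved, which is enough to lower accuracy, and in real systems the size and direction of such a shift is usually not known in advance.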
Lisa Morgan is a freelance writer who covers big data and BI for InformationWeek. She has contributed articles, reports, and other types of content to various publications and sites ranging from SD Times to the Economist Intelligence Unit.