Think back to high school calculus class. Calculus, the branch of mathematics invented independently by Isaac Newton and Gottfried Leibniz, lets you do amazing things: take the derivative of distance and you get velocity; take the derivative of velocity and you get acceleration. You can also go backwards, integrating acceleration to get velocity and integrating velocity to get distance.
The reason I bring this up is that differentiation and integration are an apt analogy for what happens in mining Big Data. Big Data generally starts with a data entry -- a single point. That data entry (usually a column in a row in a database) is integrated with other entries to create a fact. Facts are integrated with other facts to become information. Information, integrated with other information, becomes knowledge. Mining Big Data usually stops at the information stage.
Knowledge is an ontological map of combined information, an amalgam of facts, beliefs, predictions, concepts, ideals and metaphors, both abstract and concrete, that gives a basis for understanding any situation, object, proposition or relationship. The utilization of Big Data is nowhere near creating knowledge from its sources; it just creates the building blocks of knowledge, without any underlying understanding of the wherefores and whys.
That ability alone is amazing, but it is not enough. Let me reiterate that I am not talking about Machine Learning (although that can be added to the mix in the future). Data mining is done by a person, so you have an actual brain driving the process of trying to make sense of a huge pile of data. As such, you have some intelligent advantage over machine learning.
Information gleaned from data mining can be extremely useful to any enterprise. However, not all data is created equal: dirty data creates a lot of noisy correlations and just plain wrong information, and some data is simply not that rich in information potential. But even the best of datasets can produce spurious correlations.
Without background knowledge, spurious correlations have no value. As an example, here are some spurious correlations from http://www.tylervigen.com/
One of the graphs there shows that US spending on science, space, and technology correlates with suicides by hanging, strangulation and suffocation.
You need knowledge behind the mining to find intelligent, meaningful relationships. Here are other examples of absurd correlations:
- Number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in
- Per capita consumption of cheese (US) correlates with the number of people who died by becoming tangled in their bedsheets
- Divorce rate in Maine correlates with per capita consumption of margarine (US)
But the biggest fundamental problem with Big Data is that, lacking background knowledge, you could be finding local information and nuggets that do not translate into a universal picture, or vice versa. For example, a colleague told me about mining big data to determine the best days to run special promotions on dairy products for a small supermarket chain. When the data was dimensioned by day-of-week, there was a statistically significant decrease in demand for dairy products on Fridays. The data showed that dairy sales were way down that day, and past promotions had not worked as well as projected.
What the data mining was missing was that the dip in the sales curve was due to local conditions. Sales data from all of the stores was aggregated into a central database, and that database was mined. But the chain had many locations, and in some of them a major dairy was co-located in the same city. That dairy ran retail operations on weekdays only, and every Friday it held huge sales to clear its stock before the weekend closure. Shoppers at that subset of locations bought their dairy products at the dairy at big savings on Fridays, and that was enough to skew the aggregate data past the point of statistical significance. Yet the Friday dip did not negate the increase in sales overall; the data mining exercise merely looked like a failure when the null hypothesis was tested. The issue would have been avoided if the data had been dimensioned by geographic location, but who would have thought that a supermarket's Friday demand for milk in Cincinnati would differ from that supermarket's Friday dairy demand in Steubenville?
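The trap in the dairy story can be sketched in a few lines of Python. The store names and unit counts below are invented for illustration; the point is only that the Friday dip vanishes everywhere except the dairy town once the data is dimensioned by location as well as by day:

```python
# Hypothetical sales rows: (store, day, units sold). Numbers are invented.
sales = [
    ("Cincinnati",   "Thu", 1000), ("Cincinnati",   "Fri", 400),   # dairy co-located here
    ("Steubenville", "Thu", 1000), ("Steubenville", "Fri", 980),
    ("Columbus",     "Thu", 1000), ("Columbus",     "Fri", 990),
]

# Dimensioned by day-of-week only: Friday looks like a bad promotion day chain-wide.
by_day = {}
for store, day, units in sales:
    by_day[day] = by_day.get(day, 0) + units
print(by_day)  # the Friday total is dragged down by one location

# Dimensioned by day AND location: only the dairy town actually dips.
by_store_day = {}
for store, day, units in sales:
    by_store_day[(store, day)] = by_store_day.get((store, day), 0) + units
for store in ("Cincinnati", "Steubenville", "Columbus"):
    print(store, by_store_day[(store, "Thu")], by_store_day[(store, "Fri")])
```

The aggregate view and the dimensioned view are both arithmetically correct; only the second one supports a sound business decision, and knowing to cut the data that way is exactly the background knowledge the mining itself cannot supply.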
To recap, the fundamental problem with Big Data is that you get information, but rarely do you get knowledge. You know the who, where, what and when, but you don't know the why. And that may be the Achilles' heel of a lot of Big Data projects that fail to deliver the promised return on investment (ROI).
In many companies, you have to mine Big Data to improve the bottom line. If you find spurious correlations, or information that is just plain wrong in either a local or universal sense, then the exercise has been a waste of money, time and resources.
So how do you map a Big Data project to a reasonable ROI, especially when a fundamental flaw of mining Big Data is the missing background knowledge? The answer lies in Total Data Point Dimensioning coupled with Dynamic Cognitive Modeling, which will be the subject of a future blog post on how I see them evolving and the toolsets they will require. Stay tuned.