All Things Techie With Huge, Unstructured, Intuitive Leaps

Never Mind Data Mining -- I Made a Data Refinery

refinery - definition of refinery by the Free Online Dictionary ...
re·fin·er·y (rĭ-fī′nə-rē). n. pl. re·fin·er·ies. An industrial plant for purifying a crude substance. refinery [rɪˈfaɪnərɪ]. n pl -eries. (Business ...

OK, so I'm writing some data-mining software. Actually, I am writing a package that takes raw data, loads it into a data mart, validates the data, cleanses it, evaluates it, and then puts it into a database where I can mine it.

But I have some unique challenges. The data is collected in a Third World African country. It is in the form of surveys. The data collected is to provide health care. The data collectors are indigenous people who are provided with a cell phone, survey instruments, and are paid to go out and survey people. They get paid per completed survey, and in a country where it is tough to make a dollar, this is a plum job.

The problem is that this is a country with a brokerage economy where everyone cheats. This country is known around the world for sending out email scams. A respected figure once said that if cheating were the pinnacle of civilization, then this country would be the most civilized in the world. It is not uncommon for the data collectors to sit on a street corner and make up surveys instead of going house to house.

So, how do you get around that? Luckily, the mobile devices have GPS. When a survey is sent, the longitude and latitude are sent with it. The surveys are broken down into several steps. The first crew goes out, enumerates the houses, and records the GPS coordinates of each house. Each house is given a control number. Every following survey must match the control number, and its GPS coordinates must match too. My software does that.
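The GPS cross-check described above can be sketched roughly like this. The field names (control_no, lat, lon) and the 100-metre tolerance are my assumptions for illustration, not the actual schema or threshold used:

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def survey_matches_enumeration(survey, enumerated, tolerance_m=100.0):
    """Accept a follow-up survey only if its control number matches an
    enumerated house AND its GPS fix is within tolerance of that house.
    (Hypothetical field names; tolerance is an assumption.)"""
    house = enumerated.get(survey["control_no"])
    if house is None:
        return False  # unknown control number: likely fabricated
    dist = haversine_m(survey["lat"], survey["lon"], house["lat"], house["lon"])
    return dist <= tolerance_m
```

The tolerance absorbs normal handset GPS error; a survey submitted from a street corner kilometres away from the enumerated house fails the check outright.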

One of the surveys asks how many members there are in the household. The following surveys ask the same question. When the surveys arrive by HTTP or GPRS, they all go into a database as raw data.

My software takes it out and evaluates it. The surveys come in at a couple of thousand a day -- far too many to evaluate manually. My software has to take the bulk of the crude material, sift out the good stuff, and then operate on the good stuff.

It struck me that before I can do data mining, I have to do some data refining. The incoming surveys are like ore. The nuggets of real, live data are buried in the junk: the partially completed surveys, the fraudulent ones, the failed transmissions and corrupt ones, and the ones that are partially true with the rest made up. Just like ore, I have to purify the data before I can operate on it. It struck me that I have built a data refinery.

In various TED talks, I heard the stat that the whole world, up to the time of the internet, had produced five exabytes of data. We now produce five exabytes in a couple of days. A lot of it is pure crap: funny jokes, red-neck anti-Obama diatribes, emails from a bunch of Russians who think that I have a small penis, and all sorts of other spam.

Most of the exabytes of data that we generate are like ore. They need to be refined. It is possible to extract knowledge from the stream of fake Viagra emails and Facebook updates, and it can be monetized, but you have to separate the wheat from the chaff. It struck me that my data refinery can do the job.

There are lessons to be learned from my experience, and those lessons form the basis of a data refinery.

The first step is to get rid of the fake and fraudulent stuff. That is objective one for a data refinery. The second step is to identify the partial data and determine whether it can be salvaged. The third step is to cleanse the dirty but real data. And the fourth step is to put it all together in a clean place where you can operate on it and monetize it.
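The four steps above can be sketched as a single pipeline. The predicate and transform functions here are deliberately left as placeholders, since how each stage actually works is exactly what the post keeps vague:

```python
def refine(raw_surveys, is_fraudulent, is_partial, salvage, cleanse):
    """Run raw surveys through the four refinery stages. The callables are
    hypothetical plug-in points, not the real implementation."""
    clean = []
    for s in raw_surveys:
        if is_fraudulent(s):      # step 1: discard the fake and fraudulent
            continue
        if is_partial(s):         # step 2: try to salvage partial data
            s = salvage(s)
            if s is None:         # unsalvageable -> drop it
                continue
        clean.append(cleanse(s))  # step 3: cleanse the dirty but real data
    return clean                  # step 4: a clean pile, ready to load and mine
```

The ordering matters: fraud detection runs first so that no effort is wasted salvaging or cleansing records that will be thrown away anyway.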

There you go -- the four basic steps of creating a data refinery. I am being deliberately vague on how this is done, because there is money in doing this. Race you to the patent office.
