All Things Techie With Huge, Unstructured, Intuitive Leaps
Showing posts with label datamining. Show all posts

Super Bowl XLVIII Predictions From A Datamining Geek


First of all, let me say that these predictions will probably be somewhat wrong.  But what if they aren't?  These predictions are made without looking at the football performance of the two teams, Seattle and Denver.  Rather, they are made by analyzing social media and datamining the web for crowd chatter.  The premise is that the crowd is always right, and perhaps it will be on this one.  I like to call this technique "crowd mining".  So let's get to it.

My research and dubious math skills, going on crowd behavior alone, predict that the winner will be the Seattle Seahawks.  After I crunched the numbers, I was disheartened to learn that Peyton Manning is the Broncos quarterback.  He is a formidable force, and I am glad that I didn't recognize this fact before I started this exercise; otherwise it would have skewed my results.  It may also be the reason why these predictions could be wrong.

But what numbers does the crowd sentiment suggest?

Seattle 31, Denver 24

Then I decided to apply some statistical normalization to the numbers, and I got the following result:

Seattle 27, Denver 21


So, I am going to do something that I have never done before and lay a bet on those two results.  Various pundits figure that over $100 million will be bet on this Super Bowl.

As a further step in statistical analysis, you have to take into account fat tails, or rare random events.  Suppose that for some reason there is a blowout, an injury, a team that can't play well in the weather, or the awesome Manning offense being brought down by the Seattle defense.  So, to cover the entire range of possibilities, the following are the complete fat-tail results.  The left column is the Seahawks and the right column is the Broncos.  If these predictions are accurate and we get a weird game, whether very high-scoring or very low-scoring, here is the range of predicted results.

Seattle       Denver

3       to     0
7       to     3
10      to     7
13      to     10
17      to     13
21      to     14
24      to     17
27      to     20
27      to     21
30      to     23
31      to     24
33      to     26
35      to     27
35      to     28
36      to     29
37      to     30
38      to     35

It will be interesting to see how these geekazoid predictions turn out.

Never Mind Data Mining -- I Made a Data Refinery

refinery - definition of refinery by the Free Online Dictionary ...

www.thefreedictionary.com/refinery
re·fin·er·y  n. pl. re·fin·er·ies.  An industrial plant for purifying a crude substance.

OK, so I'm writing some datamining software. Actually, I am writing a package to take raw data, load it into a data mart, validate the data, cleanse it, evaluate it, and then put it into a database where I can mine it.

But I have some unique challenges. The data is collected in a Third World African country, in the form of surveys, and it is used to provide health care. The data collectors are indigenous people who are given a cell phone and survey instruments and are paid to go out and survey people. They get paid per completed survey, and in a country where it is tough to make a dollar, this is a plum job.

The problem is that this country has a brokerage economy where everyone cheats. This country is known around the world for sending out email scams. A respected figure once said that if cheating were the pinnacle of civilization, then this country would be the most civilized in the world. It is not uncommon for the data collectors to sit on a street corner and make up surveys instead of going house to house.

So, how do you get around that? Luckily, the mobile devices have GPS. When the collectors send a survey, the longitude and latitude are sent with it. The surveys are broken down into several steps. The first crew goes out, enumerates the houses, and records the GPS coordinates of each house. Each house is given a control number. The following surveys must then match the control number, and the GPS coordinates must match. My software does that.
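The control-number and GPS check described above can be sketched in a few lines. This is a minimal illustration, not my actual package: the field names (control_number, lat, lon) and the 100-metre tolerance are assumptions for the example, and the distance comes from the standard haversine formula.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes, in metres."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def survey_matches_enumeration(survey, enumerated, tolerance_m=100):
    """Accept a follow-up survey only if its control number was issued by
    the enumeration crew and its GPS fix is close to the enumerated house."""
    house = enumerated.get(survey["control_number"])
    if house is None:
        return False  # control number never enumerated: likely fabricated
    dist = haversine_m(survey["lat"], survey["lon"], house["lat"], house["lon"])
    return dist <= tolerance_m
```

A survey sent from a street corner kilometres away from the enumerated house fails the distance check, and a made-up control number fails the lookup, so both kinds of fabrication are caught before the data goes any further.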

One of the surveys asks how many members there are in the household, and the following surveys ask the same question. When the surveys arrive, by HTTP or GPRS, they all go into a database as raw data.

My software takes the raw data out and evaluates it. The surveys come in at a couple of thousand a day, which is far too many to evaluate manually. My software has to take the bulk of the crude material, sift out the good stuff, and then operate on the good stuff.

It struck me that before I can do data mining, I have to do some data refining. The incoming surveys are like ore. The nuggets of real, live data are buried in the junk: the partially completed surveys, the fraudulent ones, the failed transmissions and corrupt ones, and the ones that are partially true with the rest made up. Just like ore, I have to purify the data before I can operate on it. It struck me that I have built a data refinery.

In various TED talks, I heard the stat that the whole world, up to the time of the internet, had produced 5 exabytes of data. We now produce 5 exabytes in a couple of days. A lot of it is pure crap: funny jokes, red-neck anti-Obama diatribes, emails from a bunch of Russians who think that I have a small penis, and all sorts of other spam.

Most of the exabytes of data that we generate are like ore. They need to be refined. It is possible to extract knowledge from the stream of fake Viagra emails and Facebook updates, and it can be monetized, but you have to separate the wheat from the chaff. It struck me that my data refinery can do the job.

There are lessons to be learned from my experience, and those lessons form the basis of a data refinery.

The first step is to get rid of the fake and fraudulent stuff; that is objective one for a data refinery. The second step is to identify the partial data and determine whether it can be salvaged. The third step is to cleanse the dirty but real data. And the fourth step is to put it all together in a clean place where you can operate on it and monetize it.
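The four steps above can be sketched as a generic pipeline. This is only an illustrative skeleton, not the real implementation: the rule functions are passed in as parameters because every data source needs its own definitions of what counts as fraudulent, partial, salvageable, and clean.

```python
def refine(raw_records, is_fraudulent, is_partial, salvage, cleanse):
    """Sketch of the four data-refinery steps over a batch of raw records."""
    clean = []
    for record in raw_records:
        # Step 1: discard fake and fraudulent records outright.
        if is_fraudulent(record):
            continue
        # Step 2: try to salvage partial records; drop the hopeless ones.
        if is_partial(record):
            record = salvage(record)
            if record is None:
                continue
        # Step 3: cleanse the dirty-but-real data.
        record = cleanse(record)
        # Step 4: land it in a clean store, ready to be mined.
        clean.append(record)
    return clean
```

In practice each step would write to its own staging table so that rejected records can be audited, but the ordering is the point: fraud filtering first, salvage second, cleansing third, loading last.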

There you go -- the four basic steps of creating a data refinery. I am being deliberately vague on how this is done, because there is money in doing this. Race you to the patent office.

The Semantic Web and a Possible Rules Engine that Rocks

The entry below on the putative consciousness of Google got me to thinking about "The Semantic Web". It was, and is, an initiative of the W3C to make all web pages machine-readable.

A good example of making dumb web pages smart is the "Apples for Sale" example. Picture this: an HTML web page has apples for sale. It is a simple page. There is a picture of an apple, a piece of text that says "Apples For Sale", another piece of text that says $1.00, and another piece of text that says "Each". A machine reading that web page's HTML would not know that it is a commerce page offering something for sale. It would not know that $1.00 is the price, that apples are the object being offered for sale, or that "Each" is the unit relating the price to the quantity.

The Semantic Web would change all that. It would mark up a web page to associate all of that meaning with the HTML so that a machine could sort through it.

A few years back, the "next big thing" was a rules engine. A rules engine would be incorporated into an application, and if the business rules change, you wouldn't have to change your application. You would just change a rules file that the rules engine read.

I used a rules engine for a network policy tool that decided which server would provide what services in a LAN. I expected rules engines to progress a lot further, but they have become sidelines rather than mainstream.

How a rules engine fits into the Semantic Web is that a Rule Interchange Format is part of its infrastructure. One must agree on rules if machines are to read and understand web pages. Rules engines can be reactive or goal-directed (forward chaining or backward chaining). For example, a forward-chaining rules engine reacts to new facts, telling humans or other machines when inventory items are getting low, while a backward-chaining rules engine reasons back from a goal, such as calculating loan risk during a credit application.
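A forward-chaining engine is simple enough to sketch in a few lines. This is a toy fixed-point loop, assuming facts are plain strings and each rule is a (condition set, consequence) pair; real engines use the Rete algorithm to avoid re-checking every rule on every pass.

```python
def forward_chain(facts, rules):
    """Minimal forward chaining: fire every rule whose condition set is
    satisfied by the current facts, and repeat until nothing new is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for condition, consequence in rules:
            if condition <= facts and consequence not in facts:
                facts.add(consequence)
                changed = True
    return facts

# Hypothetical inventory rules: low stock triggers a reorder, and a reorder
# plus an open supplier produces a purchase order.
rules = [
    ({"inventory_low"}, "reorder"),
    ({"reorder", "supplier_open"}, "po_issued"),
]
```

Starting from the facts {"inventory_low", "supplier_open"}, the first pass derives "reorder" and the second pass derives "po_issued": the engine pushes forward from data to conclusions, which is exactly the reactive behavior described above.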

Rules engines have not been widely used and, in my shortsighted humble opinion, it is because they are bulky, non-intuitive, and put a performance hit on applications. However, I may have an algorithm for a rules engine that rocks.

Consider the following code. It is written in the Rule Interchange Format (RIF), which is a W3C Recommendation:

Prefix(ex )
(* ex:rule_1 *)
Forall ?customer ?purchasesYTD (
  If And( ?customer#ex:Customer
          ?customer[ex:purchasesYTD->?purchasesYTD]
          External(pred:numeric-greater-than(?purchasesYTD 5000)) )
  Then Do( Modify(?customer[ex:status->"Gold"]) ) )

The RIF is entirely based on "if (some condition) then (do this)". What this bit of Rule Interchange Format code does is let a commercial entity check each customer's year-to-date purchases and, if they are greater than $5,000, upgrade that customer's status to "Gold".

The thought struck me that one could have a rules engine that operated directly on the database. It would parse the RIF language and automagically convert it to SQL. (I will race you to the patent office on this idea.)

My rules engine would create an SQL statement that opens a cursor with "SELECT * FROM CustomerTable WHERE purchasesYTD > 5000.00". Then I would loop through the cursor and update each customer's status to Gold.
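The Gold-status rule above can be pushed even further into the database: instead of looping through a cursor row by row, the whole rule compiles to a single set-based UPDATE. A minimal sketch using Python's sqlite3 module, with a hypothetical Customers table invented for the example:

```python
import sqlite3

def apply_gold_rule(conn, threshold=5000):
    """Compile the RIF-style rule straight to SQL: one set-based UPDATE
    replaces the fetch-and-loop over a cursor."""
    conn.execute(
        "UPDATE Customers SET status = 'Gold' WHERE purchasesYTD > ?",
        (threshold,),
    )
    conn.commit()

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (name TEXT, purchasesYTD REAL, status TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [("Ada", 7200.0, "Regular"), ("Bob", 1800.0, "Regular")],
)
apply_gold_rule(conn)
```

Letting the database engine do the filtering and updating in one statement is also why this approach avoids the performance hit that gives rules engines a bad name.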

The great thing about this is that a rules engine that rocks would revolutionize data mining and database reporting. The more I think about it, the more I am convinced that this could be the NEXT BIG THING in data mining.

And as for the Semantic Web, in my opinion it is a no-go. Who is going to mark up the few billion pages that are already out there? The entire history of the Internet won't be re-worked either, so it will be useless to the Semantic Web. I see this function being done at a single point, at the web server level, with context engines that recognize content and mark up each page as they serve it. Now that is a workable plan.

I'd write more on this, but I have to open up an IDE and test this rules engine idea. Later.