All Things Techie With Huge, Unstructured, Intuitive Leaps

Process Mining From Event Logs -- An Untapped Resource And Wave of The Future


A couple of years ago, I was searching for untapped horizons in data mining, and I came across a course given by Professor Wil van der Aalst, who pioneered the technology of business process mining from server event logs. Naturally, I signed up. It was a fascinating course, not only for its in-depth, non-trivial treatment of gleaning knowledge from data, but because it got my creative juices flowing about where the technique could be applied elsewhere. I was so intrigued with the possibilities that I created a Google Scholar alert for Professor van der Aalst's publications. The latest alert arrived on January 31st, for a paper entitled "Connecting databases with process mining". The link is here: http://repository.tue.nl/858271 It was this paper that triggered this article.

I am a huge proponent of AI, machine learning and analytics. In machine learning, you gather large datasets, clean the data, section the data into smaller sets for training and evaluation, and then train a model over hundreds, perhaps thousands, of training epochs until the probability of gaining the sought-after knowledge crosses an appropriate threshold. Machine intelligence is a huge field of endeavor, and it is becoming a major part of everyday life. However, it is time-consuming to teach the machine and get it right. Professor van der Aalst's area of expertise can provide a better way. Let me explain:
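The gather-split-train-until-threshold loop just described can be sketched in a few lines of Python. This is a toy illustration with an invented, linearly separable dataset and a bare perceptron, not a production pipeline:

```python
import random

random.seed(0)
# Toy dataset (invented): the label is 1 when x + y > 1
points = [(random.random(), random.random()) for _ in range(200)]
data = [((x, y), 1 if x + y > 1 else 0) for x, y in points]

# Section the data into training and evaluation sets
train, evaluation = data[:150], data[150:]

def accuracy(w, b, rows):
    hits = sum(1 for (x, y), label in rows
               if (1 if w[0] * x + w[1] * y + b > 0 else 0) == label)
    return hits / len(rows)

# Train a perceptron over many epochs until evaluation accuracy
# crosses a chosen threshold
w, b, lr = [0.0, 0.0], 0.0, 0.1
for epoch in range(1000):
    for (x, y), label in train:
        pred = 1 if w[0] * x + w[1] * y + b > 0 else 0
        error = label - pred
        w[0] += lr * error * x
        w[1] += lr * error * y
        b += lr * error
    if accuracy(w, b, evaluation) >= 0.95:
        break

print(round(accuracy(w, b, evaluation), 2))
```

The point is the shape of the loop, not the model: even this toy version needs many passes over the data before the accuracy threshold is crossed, which is exactly the time cost mentioned above.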

My particular interest is that I am building a semantic blockchain to record all of the data coupled to vehicles, autonomous or not. A blockchain, of course, is an immutable data ledger that is tamper-evident, autonomous in operation, disintermediates third parties and is outage-resistant. Autonomous vehicles will, by law, be required to log every move, keep records of their software revisions, and keep records of things like post-crash behavior.

I immediately saw the possibilities of using this data. Suppose that you are in an autonomous vehicle, and that vehicle has never been on a tricky roadway that you need to navigate to get to your destination. Your car doesn't know the route parameters, but thousands of other autonomous vehicles have driven it, including many with your kind of operating system and software. With the connected car, your vehicle would know its GPS coordinates and query a system for the driving details of this piece of roadway that is unknown to the computer. Instead of requiring intense computational ability to navigate, a recipe with driving features could be downloaded.

Rather than garnering those instructions from repeated training epochs in machine learning, one could apply process mining to the logs to extract the knowledge required. There are already semantic methods of communicating processes, from decision trees to Petri nets, and if the general process were already known to the machine, it would reduce the computational load. As a matter of fact, each vehicle could have a process mining module to extract high-level algorithms for the roads that it drives regularly. That in itself would reduce the computational load on the vehicles. A vehicle would know in advance where the stop signs are, for example, and you wouldn't have YouTube videos of self-driving cars going through red lights and stop signs.
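As a rough illustration of what process mining pulls out of logs, here is a minimal Python sketch that counts directly-follows relations in an invented event log. Discovery algorithms such as the alpha miner build Petri nets from exactly this kind of relation; the traces and activity names below are made up for the example, and real tools like ProM implement far richer discovery:

```python
from collections import defaultdict

# Hypothetical event log: one trace of activities per vehicle trip
# (activity names are invented for illustration)
event_log = [
    ["approach", "slow_down", "stop_sign", "proceed"],
    ["approach", "slow_down", "stop_sign", "yield", "proceed"],
    ["approach", "slow_down", "stop_sign", "proceed"],
]

# Count directly-follows pairs -- the raw material that discovery
# algorithms use to derive a process model such as a Petri net
follows = defaultdict(int)
for trace in event_log:
    for a, b in zip(trace, trace[1:]):
        follows[(a, b)] += 1

for (a, b), n in sorted(follows.items()):
    print(f"{a} -> {b}: {n}")
```

From these counts, a discovery algorithm can infer, for instance, that "stop_sign" always precedes "proceed" — which is precisely the kind of pre-digested knowledge a vehicle could download instead of re-learning.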

It goes a lot further than autonomous vehicles. This concept of creating high-level machine processes from event logs can be applied to fields as diverse as robotic manufacturing and cloud server monitoring, and to numerous fields where human operators or real-world human judgement are required.

Process mining could either eliminate machine learning in a lot of instances, or it could supplement it, with a mix of technologies. The aim is the same, which is aggregating data into information and integrating information into knowledge, both for humans and machines.

This process mining business reminds me of the history behind Bayesian inference. The Reverend Thomas Bayes formulated the equations relating probability to prior belief. They sat on a dusty shelf for over 200 years before being re-purposed for computer inference and machine intelligence. I think that Professor van der Aalst's methodologies will be re-purposed for things yet un-imagined, and it will not take 200 years to come to fruition.



Sentiment Analysis And Data Mining To Understand The World's Problems of Today


I was genuinely perplexed. The world is a vastly different place than I envisioned it as a teenager. It seemed that the continued enlightenment and scientific advancement in the years from post-World War II to the turn of the millennium would bring the world into a less chaotic global village, with a greater degree of peace, stability and economic well-being. In many respects, the world has regressed.

Purely for my own understanding, I decided to try to figure out some reasons for the current problems of the world, using my skills in data mining. I took twenty top international news sites, and by scraping their content with open source tools, I had a collection, a snapshot of the microcosm of the world today. That collection would make a good starting point for a list of the major problems of the world.

To do some preliminary research into the world's problems, I decided to see what research was out there in the public domain. Eurobarometer had actually conducted a poll across the length and breadth of Europe, and came up with the following list of the top ten major world problems:

  • #10 Don't Know
  • #9 Proliferation Of Nuclear Weapons
  • Tied #7 Armed Conflict
  • Tied #7 Spread Of Infectious Disease
  • #6 The Increasing Global Population
  • #5 Availability Of Energy
  • #4 International Terrorism
  • #3 The Economic Situation
  • #2 Climate Change
  • #1 Poverty, Hunger And Lack Of Drinking Water

It is interesting that two percent of the people in Europe answered with "Don't Know".  This was the reason that I conducted this exercise in the first place.

After I had my collection of data from the news sources, I decided to do a bottom-up analysis of the news.  I tagged each story with a tag that generally summarized the theme of the story.  I had a lot of tags, and at that point, I needed to do some feature engineering by adding a layer of abstraction to the tags, so that the stories could be grouped for sameness.  I kept adding layers of abstraction until I got a manageable number of tags, and then did a bottom-up Naive Bayes classification of the tags.  The classifiers neatly categorized the stories.
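The tag-then-classify step can be sketched as a tiny multinomial Naive Bayes classifier. The headlines and tags below are invented stand-ins for the scraped stories; a real pipeline would use a proper NLP library and far more data:

```python
from collections import Counter, defaultdict
import math

# Hypothetical tagged headlines (invented for illustration)
training = [
    ("drought ruins crops across region", "Environment"),
    ("floods displace thousands", "Environment"),
    ("minority group denied voting rights", "Alienation"),
    ("protesters say they are shut out of politics", "Alienation"),
]

# Fit a multinomial Naive Bayes model: word counts per tag
word_counts = defaultdict(Counter)
tag_counts = Counter()
for text, tag in training:
    tag_counts[tag] += 1
    word_counts[tag].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    best_tag, best_score = None, float("-inf")
    for tag in tag_counts:
        # log prior + log likelihood with add-one smoothing
        score = math.log(tag_counts[tag] / sum(tag_counts.values()))
        total = sum(word_counts[tag].values()) + len(vocab)
        for w in text.split():
            score += math.log((word_counts[tag][w] + 1) / total)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

print(classify("crops fail after drought"))
```

With layers of abstraction added to the tags, as described above, each abstracted tag becomes a class and the classifier groups the stories for sameness.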

I didn't just want a grocery list of the problems. I was looking for something deeper. I was looking for answers related to the human condition, and how we, as a varied group of humans who inhabit this earth, feel, react, create and possibly solve these problems. Consequently, I created another layer of abstraction for a broad-brush category of problems that condensed the list into a smaller but cogent set relating directly to the human condition. Once I had the bottom-up tag analysis done, I decided to do a top-down sentiment analysis of my problem tags. It would be interesting to see how my analysis would fare against the Eurobarometer analysis.

Don't forget, my list came from news sites, so it represents a snapshot of what was at the forefront during this particular time period. Here is my list of twelve issues:
  • Africa Issues
  • Alienation/Marginalization of peoples/societies/groups
  • Business Sector Wars/Competition
  • Economic Structural Change
  • Environment 
  • Globalization
  • Mass Media/Censorship/Subjectivity
  • Migrant Problems
  • Nationalism
  • Partisanship
  • Religious Fundamentalism/Jihadism/Religious Wars
  • Technology Frontiers/Problems
The differences between my list and the Eurobarometer list were apparent. Africa was not on their list, whereas it was represented as its own category in the news of late, and indirectly in the migrant issues (although the migrant issues were a global phenomenon, including the Caribbean, where Haitians fleeing their homeland are causing problems in neighboring countries).

In trying to understand root causes, one of the surprising inclusions on my list was Alienation/Marginalization of peoples/societies/groups. This included stories about gay rights, Kurdish struggles in other countries, Sunni versus Shia, Basques versus Spaniards, etc.

So how did my sentiment analysis turn out? As it turns out, for my limited study, the environment is the number one issue in terms of global problems. Here is the list, with the percentage of stories connected to each issue.
    
  • Environment: 54.21%
  • Alienation/Marginalization of peoples/societies/groups: 23.29%
  • Mass Media/Censorship/Subjectivity: 8.48%
  • Migrant Problems: 3.92%
  • Globalization: 1.75%
  • Business Sector Wars: 1.71%
  • Technology Frontiers: 1.67%
  • Partisanship: 1.35%
  • Nationalism: 1.26%
  • Religious Fundamentalism: 1.03%
  • Africa Issues: 0.95%
  • Economic Structural Change: 0.36%
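Percentages like these fall out of a simple share-of-total calculation over the story counts per tag. The counts below are invented placeholders, not the actual figures behind the list:

```python
from collections import Counter

# Hypothetical story counts per abstracted tag (invented numbers)
counts = Counter({"Environment": 1271, "Alienation": 546, "Mass Media": 199})

# Convert raw counts into the percentage share of all stories
total = sum(counts.values())
shares = {tag: round(100 * n / total, 2) for tag, n in counts.items()}

for tag, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{tag}: {pct}%")
```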
It seems that most conflicts in the world arise from the number two problem: Alienation/Marginalization of peoples/societies/groups. This is probably the root cause of most social problems facing any area of the globe today. Everyone wants and needs their own place in the sun, and others are trying to prevent them from having it, for a whole range of reasons.

People also seem to be concerned about their sources of information.  Right wing groups accuse the mainstream media of liberal bias.  Conservative news sites are mocked as Faux News. It seems that in the plethora of information sources, everyone has a hidden agenda, and folks are concerned about it. Objective information is very hard to find, with the democratization of information dissemination on the internet.

There is no need to further expound on migrant problems, which came in at number 4 on my list. It is hugely topical.

There are still worries about globalization, but it doesn't have the same impact as the people or environment related stories.

It is interesting that business and technology appear on the list of problems. Business has the general sentiment of being anti-humanistic and profit for profit's sake at the expense of the human condition. Technology is seen as a threat with artificial intelligence, killer robots and job destroyers.

The next two categories are somewhat related: partisanship and nationalism. They are both 'people interacting in their countries' stories. Partisanship is now rampant, with a gridlocked Congress versus the president and the Confederate flag issue, while nationalism is seen in various venues around the globe: Scotland wants to exit the United Kingdom, Great Britain wants to exit the European Union, the Basque Country and Catalonia want to exit from Spain, Quebec wanted to separate from Canada, ad infinitum.

Religious fundamentalism is inexplicably rising. There seems to be a growing intolerance between the mainstream and fundamentalism. This is seen not only in the Muslim world, but also in the US, where a city clerk refused to issue marriage licenses to gays because of fundamentalist religious beliefs. We have seen Baptist churches picketing the funerals of slain American soldiers on religious grounds. Who would have predicted this shift 30 years ago? I would be interested in knowing why there is a swing to fundamentalism in the modern world. In broad brush strokes, this seems to be a struggle of progression versus regression, and it is inexplicable to rational thought.

Africa is low on the list, but concerning. Africa was the site of proxy wars between the superpowers over the last 60 years or more, and now there is currency collapse, armed conflict, epidemics, partisan in-fighting, loss of democracy and pretty much any social, economic or environmental ill that anyone can name. Africa's troubles create instability in the global village.

And bringing up the bottom of the list is fundamental economic change. Long-term jobs are being replaced by the gig economy. Manufacturing is undergoing fundamental changes. The biggest profits now come from virtual paper transactions on Wall Street, with the one-percenters jerking the economy around with their financial derivatives and dark markets.

Certainly this exercise has opened a window and shed some light for me, but as usual, answers to these issues are elusive and complex, and in many cases there are no apparent ones. Life does seem to go on.

How To Be A Billionaire Using Big Data and Machine Learning in Three Easy Paradigms


1) Download Wireshark and load it onto a laptop with the biggest hard disk you can find.
2) Go to the airport, sit there all day using the free airport WiFi, and turn on the capture function in Wireshark.
3) Use data mining and machine learning on the datasets.

The billion dollar platform idea will emerge from the data. Guarantee it.

Giving The Shaft To Data Mining And Obfuscating IBM & Twitter's Privacy Intrusion on Your Life


Those b*st*rds are going too far. Even though I am a data miner, I have a great concern as a data privacy advocate. Essentially Twitter & IBM are teaming up to mine your Twitter Stream to monetize your posts. They will take your tweets and try to sell crap to you, or worse, sell your data to other companies.

Here's how it will work. If you post that your mother died, you will see crematorium or undertaker ads. Tweet about spending some time in the hospital, and you might pay a higher health insurance premium, because they will sell that info to insurance companies. The same goes for tweets about driving fast. Tweet about your kid going to college, and you will get a full-court press on everything from college choices to clothes for university life.

It sucks. It just isn't right. You have a few choices. You can vote with your feet and leave Twitter. I have already left Facebook and LinkedIn. Twitter is my last stand.

You can carry on, but in a previous blog post, I mentioned that the most dangerous thing about Big Data Mining, is that data mining can make assumptions about you that simply aren't true, and you may be categorized into a list that you don't want to be on. It could affect your job, your security clearance, your credit score or who knows what.

You could self-censor, but censorship is wrong, even self-censoring.

I like the last option - f*ck with the machine learning, and deep learning and data-mining.  How? Obfuscate.  Here are a few things that I will do.

1) Disable all location services for tweets.
2) Disable location tagging for the photos that your smartphone takes. The phone writes the location into the EXIF data, along with the date, time, camera type, etc.
3) Google for a free EXIF editor, and remove all EXIF data from your pics.
4) Do not put your actual location in your bio. For example, I follow a dude whose location is: Where I Have To Be
5) Put in a fake town where you live. If you have a dog named Rover, put down that you live in Roverville.  You can still keep your same state.
6) Never use your middle name or initial. It's just one more authentication factor.
7) When social media streams are mined using NLP, or Natural Language Processing, an important part of it is finding "possessive determiners". Don't use them. Possessive determiners are words like my, your, her, etc. If you tweet "Its my birthday", even the dumbest NLP data mining machine can pick it up. However, if you say "Welcome to Birthdayville, Population Me", not even the smartest NLP machine can pick that up. Get rid of possessive determiners in your tweets.
8) Practice Typoglycemia.  http://en.wikipedia.org/wiki/Typoglycemia  Here is an example that would totally screw up a deep learning machine:

"I cdn'uolt blveiee taht I cluod aulaclty uesdnatnrd waht I was rdanieg: the phaonmneel pweor of the hmuan mnid. Aoccdrnig to a rseearch taem at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Scuh a cdonition is arppoiatrely cllaed Typoglycemia."
"Amzanig huh? Yaeh and you awlyas thguoht slpeling was ipmorantt."

9) Use slang. If your gas-pedal foot itches to drive a BMW, call it a beamer or a beemer, and don't capitalize the word.

10) Use alternate spelling. Ime a bygg phan of Neel Yoongs mewsic.

11) Throw in rand o m   s pac es   in yo ur  sente nce.  Or e*ven the od*d star will do.

12) Never tweet your age, your spouse or partner (I see "married to @sweetiePie" all the time) or any other personal information. It is okay to list your employer or academic institution, and that leaves a lot of room to fool the NLP machines if you work at Big Blue, or teach @ the Yard (thanks to the Harvard profs that follow me -- appreciate it).
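Tips 7 and 8 above are easy to automate. Here is a rough Python sketch (the possessive-determiner list is partial and the scrambling is naive) that strips possessive determiners from a tweet and then applies typoglycemia to what remains:

```python
import random
import re

random.seed(42)

# A partial list of English possessive determiners
POSSESSIVES = r"\b(my|your|his|her|its|our|their)\b"

def scramble(word):
    # Typoglycemia: keep the first and last letters, shuffle the interior
    if len(word) <= 3 or not word.isalpha():
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def obfuscate(tweet):
    # Drop possessive determiners, then scramble the remaining words
    stripped = re.sub(POSSESSIVES, "", tweet, flags=re.IGNORECASE)
    return " ".join(scramble(w) for w in stripped.split())

print(obfuscate("Its my birthday at the airport"))
```

A human can still puzzle out the result, but a bag-of-words miner keyed on exact tokens and possessive determiners gets much less to work with.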

Using these simple tips will cause the data mining algorithms and perceptrons scanning your feed to take a pass on what you type. Now is the time to bowdlerize or obfuscate your account.

I think that the bigger answer is to start up a new hybrid of Twitter and Facebook that guarantees information privacy. But in the meantime, let's be careful out there as to what we post. And remember, it's not that difficult to deke out smart machines.


Re-Booting & Reforming Democracy With Big Data ~ The Box Carries the Vox


Winston Churchill stood in the British Parliament and spoke the following words:

"Many forms of Government have been tried and will be tried in this world of sin and woe. No one pretends that democracy is perfect or all-wise. Indeed, it has been said that democracy is the worst form of government except all those other forms that have been tried from time to time."

When the American Founding Fathers created a unique concept of Western liberal democracy, it was in fact a great experiment operating under the principles of liberalism. This includes protecting the rights of the individual; fair, free and competitive elections between multiple distinct political parties; a separation of powers into different branches of government; the rule of law in everyday life as part of an open society; and the equal protection of human rights, civil rights, civil liberties and political freedoms for all persons (paraphrased from Wikipedia).

However, the mechanism that they put into place to administer this democracy was very much a kludge, a compromise to best accommodate the will of the people, taking into account the pragmatic aspects of their place and time in history.

Something has happened between then and now. Nowadays, the will of the people is being subverted and distorted by political partisanship, and it is not for the good of the country and the people. The gridlock and dysfunction in the American Congress is a prime example (if con is the opposite of pro, then is Congress the opposite of progress?). And in the American Senate, when they do the roll call, half of them answer "Not guilty!" You don't have to go far to find examples of how the mechanism of government fails to democratically represent the will of the people. The idea of thousands of people being represented by a person whose vote and interest can be bought by a business lobby somehow sucks all of the air out of the room of democracy.

Well, times have changed since 1776, but the ways of government have not. With the advent of the technological age, it is time to update, enhance and empower the forces of democracy through the judicious application of technology, communications and data management.

I still believe that political parties are necessary. As human beings we will always have ideological differences, and no matter how batsh*t crazy some people are, they still have a right to vote and express their opinion. Churchill again once said that the biggest argument against democracy is to speak to the average voter for five minutes. So you will get the weirdos who think that owning an assault rifle will protect them from a drone strike when their elected, democratic government chooses to attack them in their bunker amid the 10 years of rice stocks mixed with prepper gadgets stored on the shelves therein. There will always be the snake-handlers, the wanna-be polygamists, the Ovary Overlords who want to legislate women's reproductive rights, and the folks who want to throw out the science curriculum in schools and replace it with learned treatises on Adam and Eve domesticating the dinosaurs. All these have a right to a voice in the democracy.

Political parties also define policy, which is important in government. Policy is the course by which the government steers. We don't want a rudderless ship, so we still need legislators to debate policy. But when they come up with legislation specific to policy implementation, I want my direct say.

In the days of 1776, it took a week to get from Philadelphia to New York. There were no telephones. You couldn't track people down on the farm for their views. Times have changed. We have Big Data. We have the technology and communications tools to hear from everyone. We have the infrastructure to empower all voices. With computer data collection, we can collect hundreds of millions of pieces of data in minutes. And we can machine-collate them in real time.

So, what if everyone with a Social Security Number had a private encryption key? Whenever legislation came up for a vote, we would all vote on it. Vox populi. The voice of the people can speak and be heard. The legislation would be put to a vote, and we the people would respond. We could all directly vote on the legislation and the laws that affect us. Being digital in this age has given a voice to the voiceless and nameless. Data science can be our rescuer and our salvation.
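A minimal sketch of the signed-vote idea, using only Python's standard library. An HMAC with a per-voter secret stands in here for the real public-key signature scheme an actual system would require; the bill ID and key are invented for the example:

```python
import hashlib
import hmac

# Hypothetical secret key issued to one voter (a stand-in for a
# real private key tied to their Social Security record)
voter_key = b"hypothetical-secret-issued-to-one-voter"

def sign_vote(bill_id, choice, key):
    # Bind the voter's choice to a specific piece of legislation
    message = f"{bill_id}:{choice}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify_vote(bill_id, choice, signature, key):
    # Recompute the signature and compare in constant time
    expected = sign_vote(bill_id, choice, key)
    return hmac.compare_digest(expected, signature)

sig = sign_vote("HR-1234", "yea", voter_key)
print(verify_vote("HR-1234", "yea", sig, voter_key))   # True
print(verify_vote("HR-1234", "nay", sig, voter_key))   # False: tampered vote
```

The design point is that a tally machine can verify every vote without being able to alter it, which is what makes the "hundreds of millions of pieces of data in minutes" idea trustworthy rather than just fast.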

This would make it harder for big business and lobbies to affect democracy. They would have to convince entire populations of their point of view, and in doing so, they would have to make it in the interest of the population. It would be the great leveling ground in the current incarnation of democracy.

Do we have the guts to change the way that we enact democracy? We still have a Digital Divide, where a significant portion of the population doesn't participate, and cannot participate in digital online life. We have education issues. We have a built-in inertial brake for radical change. The people who benefit from the current state of dysfunction want to keep it that way. So it will be an uphill battle, but Big Data can reform democracy and put the power back into the hands of the people, where it belongs.

Back to Winston Churchill one more time to close my argument. Upon being offered the Order of the Garter after a particularly humiliating defeat in the election of 1945, he said, "Why should I accept the Order of the Garter, when I have already been given the Order of the Boot?" It is time to give the old, tired mechanisms of democracy the boot.

Data Mining And Ethics - Revenue Over Rights?


I was reading a WIRED magazine article where the author said that his liberal arts degree on his resume was viewed in the same way as a face tattoo in a job interview for an investment banking position. However, a liberal arts degree is probably the qualification of an up and coming job title for data mining companies -- the CEthO or the Chief Ethics Officer. Once data mining and machine learning gets to the next level, coupled with the Internet of Everything, there will be huge privacy and ethics issues to contend with.

The big question of ethics that will arise is: "Is it okay to make money off information about people that I glean from mining my data?"

Several far-fetched but not-so-far-fetched scenarios come to mind. I am reminded of the data mining done by Target stores, when they deduced that a 15-year-old girl was pregnant from her purchases of face cream combined with a certain brand of vitamins. Suppose that I was a data miner for a drug store chain, and I could find a strong correlation between a person buying certain antacids and, a few months later, being diagnosed with an ulcer requiring expensive stomach surgery. A health insurance company would be highly interested in knowing that. Should we sell the information on people that we discover? It would be an incredibly lucrative revenue stream.
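That antacid-to-ulcer correlation could be quantified with a simple lift calculation. The records below are invented for illustration; a lift above 1 means the two events co-occur more often than chance would predict:

```python
# Hypothetical purchase/diagnosis records (invented for illustration)
customers = [
    {"antacid": True,  "ulcer": True},
    {"antacid": True,  "ulcer": True},
    {"antacid": True,  "ulcer": False},
    {"antacid": False, "ulcer": False},
    {"antacid": False, "ulcer": False},
    {"antacid": False, "ulcer": True},
]

n = len(customers)
p_antacid = sum(c["antacid"] for c in customers) / n
p_ulcer = sum(c["ulcer"] for c in customers) / n
p_both = sum(c["antacid"] and c["ulcer"] for c in customers) / n

# Lift > 1: antacid buyers are diagnosed more often than chance
lift = p_both / (p_antacid * p_ulcer)
print(round(lift, 2))
```

The ethics question is not whether this arithmetic works — it plainly does — but whether the result should ever leave the building.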

Ethics was never a question in the good old days of business. McDonald's grew its fast-food empire by putting toys into its Happy Meals, creating an obese America by targeting and hooking children. One nutritionist noted that the Chicken McNuggets in a Happy Meal were nutritionally worse than deep-fried cake. At the time, it was seen as a slick marketing move.

Data mining is in the same stage that McDonald's was fifty years ago. It is a solution looking for a problem to monetize, so it is like the Wild West until legislators, sober second-thought minds and ethicists add some ground rules to the field. I personally know of a data miner who set up in a Caribbean jurisdiction that has little to no privacy law. Big companies ship him their data, complete with personal information, and he ships back every bit of intelligence that he finds. His revenues are out of this world. It is almost the ethical equivalent of having consumer products made in the Third World by child labor. I can see the day when data becomes a commodity that is bought, sold and traded. There will be export laws on data -- especially data with personally identifiable information in it.

But there are ethical questions closer to home. Suppose that my employer expects me to mine data, and I discover an untapped revenue stream that is extremely easy to exploit. Do I tell my employer, or give my notice and create a start-up to exploit that situation? What is the ethical course of action?

We have a long way to go with applying ethics to data mining. Ethics is a lot like beauty -- it is in the eye of the beholder. I am reminded of a story that a shopkeeper told regarding ethics. He said, "I was closing the store when a customer came in at the last minute and made a large purchase with a hundred-dollar bill. After he had left and I closed the shop, as I was counting the money, I noticed that there was actually another hundred-dollar bill stuck to the bill that the last customer had tendered. Immediately, a question of ethics arose. I was wondering if I had to tell my business partner or not."

And that perfectly explains why we need some ethical boundaries in data mining.

Impressed with Microsoft -- finally -- Azure




I, like a lot of other geeks, have become greatly disillusioned with Microsoft in the past several years. I saw them as anti-innovative fat cats protecting a revenue stream that did no favors for its users, and becoming a stodgy, quaint grandparent in a tech world, while thinking it was still the same sex object that it was in its early daze.

Microsoft, in my opinion, has hung on too long to its archaic operating system, which is essentially one big kludge on top of a stack of kludges, turtles all the way down to the bare silicon. All of their innovations in almost every endeavor, from tablets to phones to music services, have been market failures because they stubbornly resisted changes to their bloated, digital-cholesterol-clogged operating system. If they truly wanted to be innovative, they would ditch it in favor of a brand of QNX or Linux for a sleek, less vulnerable system. Back in the early daze of the 8086 microprocessor, I saw a QNX system boot from one floppy disk, and in its day, that was amazing.

Now that I got that off my chest, I must grudgingly admit that Microsoft has lit a spark that impresses me with their Azure big data suite.  If they are going to re-invent themselves and breathe new life back into the corporation and become innovative again, then Azure might be the vehicle.

Big Data is where it is at, and where it is going to be if we want to manage and monetize the Internet of Everything, and Microsoft is trying to create and promulgate products to that end with Azure. I only became aware of Azure when several members of the Azure team followed me on Twitter, and when I checked them out, I realized that this wasn't Bill Gates' Microsoft. I really liked what I saw.

Azure offers data analysis as a service, and it has a free component. It is done in a quasi-cloud environment, and from what I see, once you graduate from the newbie class, the prices are okay. The good news is that there is a link to some pretty nifty free tools. Here is the link:

https://datamarket.azure.com/browse?query=machine%20learning&price=free

The tools are varied, useful and intriguing.

Microsoft just may have a chance to dominate the market. Their thin edge of the wedge with Azure is great, but they must follow the template of Microsoft Word when it started to dominate the marketplace. Back in the day, personal computers were useful, but not that useful when it came to creating documents electronically. The IBM Selectric typewriter was the weapon of choice for using up reams of paper. Then along came the word processor. Dr. An Wang made a fortune from inventing computer memory, and then sunk his money into Wang Labs, headquartered in Lowell, Massachusetts. The Wang word processor became ubiquitous for several years. It was a dedicated piece of hardware and tightly coupled software that didn't do anything except create formatted documents. Prior to that, electronic documents were printed on a dot matrix or impact printer without stylings. (The Wang OS was the first OS that I successfully hacked.)

Microsoft Word came out and essentially destroyed the word processor. It was an order of magnitude cheaper and easier to use. It is still the dominant document creator to this day. Microsoft needs to do the same thing with Azure.

Right now, a lot of the Azure products use the statistical language R. Other plugins calculate linear regressions and all sorts of stuff like standard deviations. Microsoft needs to hide that under a big layer of abstraction and make all of it invisible to the end user. Picture an end user who runs a niche cafe in a hip town. Their point-of-sale and computer system collects metrics, metadata and machine data. The owners of this data have no idea what it can tell them or how they can increase their revenue streams. They don't know Bayesian inference from degree of confidence.
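That layer of abstraction might look something like this: a sketch (with invented sales numbers and phrasing) that wraps a plain least-squares trend line in language a cafe owner can act on, with the statistics jargon hidden away:

```python
# Sketch of the "layer of abstraction" idea: fit a least-squares
# trend line, but report it in plain English (numbers invented)
def trend_in_plain_english(label, values):
    n = len(values)
    xs = range(n)
    mean_x, mean_y = sum(xs) / n, sum(values) / n
    # Ordinary least-squares slope over equally spaced periods
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
             / sum((x - mean_x) ** 2 for x in xs))
    direction = "rising" if slope > 0 else "falling"
    return f"{label} is {direction} by about {abs(slope):.1f} per week."

weekly_sales = [410, 445, 430, 470, 490, 510]
print(trend_in_plain_english("Coffee revenue", weekly_sales))
```

The cafe owner never sees the word "regression" — just a sentence they can make a decision with. That is the Word-for-the-common-person move applied to analytics.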

Microsoft needs to build data analysis for the common person, like they built word processing for the common person. If they do that, they will take their company into the next century. If not, Azure will be the biggest Edsel of the tech industry. However, for the first time in a long time, I like what I am seeing come out of Microsoft.


Process Mining, OpenStack and Possibly a New Java Framework


In my process data mining course, on the internal forums,  an OpenStack developer asked how the event logs from using OpenStack could be used in process mining.  This is how I replied:

First of all, let me congratulate you on OpenStack. I am both a user, and I use the services of an OpenStack-driven Platform-as-a-Service to host the development of my mobile apps.

I can see several potentially huge benefits if you incorporated process mining into the OpenStack platform. For example, spammers now use the OpenStack virtual machine concept to set up a virtual machine, do their spamming or hacking, and then tear down the machine, never to be seen again. If you had a signature or a process model of this activity, you could theoretically intercept it while it is happening.

Another possibility: if you recorded the timestamp every time the software creates or instantiates a virtual machine, or performs any other lifecycle event, you could provide a QoS (quality of service) metric, both for monitoring the cloud and for detecting limitations caused by hardware, software or middleware bottlenecks.
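A minimal sketch of that timestamp idea, assuming a simplified event format (the VM identifiers, activity names and tuple layout here are my invention, not actual OpenStack log fields):

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical lifecycle events: (vm_id, activity, timestamp).
events = [
    ("vm-1", "create_request", "2013-05-01 10:00:00"),
    ("vm-1", "scheduled",      "2013-05-01 10:00:04"),
    ("vm-1", "running",        "2013-05-01 10:00:31"),
    ("vm-2", "create_request", "2013-05-01 11:00:00"),
    ("vm-2", "scheduled",      "2013-05-01 11:00:02"),
    ("vm-2", "running",        "2013-05-01 11:00:45"),
]

def step_durations(events):
    """Average seconds spent between consecutive lifecycle steps, per step pair."""
    by_vm = defaultdict(list)
    for vm, activity, ts in events:
        by_vm[vm].append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), activity))
    samples = defaultdict(list)
    for trace in by_vm.values():
        trace.sort()  # order each VM's events by timestamp
        for (t0, a0), (t1, a1) in zip(trace, trace[1:]):
            samples[(a0, a1)].append((t1 - t0).total_seconds())
    return {step: sum(v) / len(v) for step, v in samples.items()}

print(step_durations(events))
# e.g. a slow ('scheduled', 'running') average points at a provisioning bottleneck
```

A spike in the average duration of any one step pair is exactly the kind of middleware bottleneck signal described above.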

I can also see process mining picking up misconfigured parameters that degrade service quality, by detecting missing setup steps in the process. In other words, an arc bypassing a required region of the process model would indicate that required steps were skipped.

This course has inspired me to start working on a Java framework (maybe a ProM plugin) that runs on an independent thread (maybe in an OpenStack incarnation), monitors activity on a server, compares it to ideal processes in real time, and flags someone if a crucial process deviates. I think that I could get this going in a timely fashion.
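The core of that monitor is a simple conformance check: compare each live trace against the transitions the ideal process allows. A sketch, in Python for brevity (I intend to build the real thing in Java), with entirely hypothetical activity names:

```python
# The "ideal process" as a set of allowed activity transitions.
# A real system would mine this model from historical event logs.
ALLOWED = {
    ("start", "authenticate"),
    ("authenticate", "provision"),
    ("provision", "run"),
    ("run", "teardown"),
}

def check_trace(trace):
    """Return the first transition that deviates from the ideal process,
    or None if the trace conforms."""
    for step in zip(["start"] + trace, trace):
        if step not in ALLOWED:
            return step  # flag this deviation to an operator
    return None

print(check_trace(["authenticate", "provision", "run", "teardown"]))  # conforms
print(check_trace(["provision", "run"]))  # skipped authentication: flagged
```

Running this check as each event arrives, rather than on a batch of logs after the fact, is what turns process mining into real-time monitoring.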

Once again, this course has opened my eyes to potential methodologies and algorithms that can be applied to non-traditional fields.

Note: ProM is an open source process mining tool. The data mining course is given by the Eindhoven University of Technology in the Netherlands.

Process Mining, Data Mining, Explicit & Implicit Events


The course in process data mining given by Professor Wil van der Aalst from the Eindhoven University of Technology in the Netherlands, has opened my eyes to a few elements in data mining that I had not considered.

At first blush, the course looks like it would be quite useful for finding bottlenecks in processes like order fulfillment, patient treatment in a hospital, service calls or a manufacturing environment, and it is. But to an eCommerce platform builder like myself, it offers amazing insights that I had never considered before taking this course.

Professor van der Aalst introduces a layer of abstraction (perhaps a double layer of abstraction) by defining any process as a Petri net derived from an event log. Here is an example of a Petri net (taken from Wikipedia):


The P's are places and the T's are transitions. In the abstract model, tokens (the black dots) mark various spots in the process. A transition consumes a token from each of its input places and produces a token in each of its output places. The arrival of a token at a specific place records an explicit behavior in the process. So how did this help me?
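The token game above can be sketched in a few lines. This is a toy net of my own (the place and transition names are illustrative, not the ones in the Wikipedia figure):

```python
# Each transition consumes one token from every input place and
# produces one token in every output place.
transitions = {
    "t1": {"in": ["p1"], "out": ["p2", "p3"]},   # a parallel split
    "t2": {"in": ["p2", "p3"], "out": ["p4"]},   # a synchronizing join
}

def fire(marking, t):
    """Fire transition t if it is enabled; return the new marking."""
    spec = transitions[t]
    if any(marking.get(p, 0) < 1 for p in spec["in"]):
        raise ValueError(t + " is not enabled")
    m = dict(marking)
    for p in spec["in"]:
        m[p] -= 1
    for p in spec["out"]:
        m[p] = m.get(p, 0) + 1
    return m

m = fire({"p1": 1}, "t1")   # token in p1 is consumed; p2 and p3 each get one
m = fire(m, "t2")           # t2 waits for both tokens, then puts one in p4
print(m)
```

The join transition `t2` only fires once both `p2` and `p3` hold tokens, which is exactly how a Petri net expresses synchronization.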

I do data mining to enhance the revenue stream on our eCommerce platform (see the blog entry below this one). My previous data mining efforts dealt with implicit events. Sure, we had an event log, but we looked only at the final event, say a customer purchasing something, and tried to find associations that drove the purchase (attributes or resources like price, color, time of day, the customer's past buys, etc.). The act of making the purchase was captured in the event logs, with timestamps of the various navigations, but the events leading up to the purchase were implicit events that we never measured. With the event logs we have explicit behaviors, and from them we can define the purchase process for each customer. So we started making process maps of the online events that led to the purchase. In short, we began to look at the explicit events.
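The first step in that process mapping is just grouping the raw log into per-customer traces. A minimal sketch, with a made-up clickstream format (customer names, timestamps and activities are all hypothetical):

```python
from collections import defaultdict

# Hypothetical clickstream event log: (customer, timestamp, activity).
log = [
    ("alice", 1, "view_item"),
    ("bob",   2, "search"),
    ("alice", 3, "add_to_cart"),
    ("alice", 4, "purchase"),
    ("bob",   5, "view_item"),
]

def traces(log):
    """Group events by customer and order them by timestamp,
    turning a flat log into explicit per-customer process traces."""
    by_customer = defaultdict(list)
    for customer, ts, activity in sorted(log, key=lambda e: e[1]):
        by_customer[customer].append(activity)
    return dict(by_customer)

print(traces(log))
# alice's trace ends in 'purchase'; bob's does not
```

These traces are what a process discovery algorithm would consume to build the process map.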

Where will this take us? It will show us the activities and processes leading to a high-value event for us: a purchase. We isolate high-value process events, and by mapping customer behavior to those events, we can evaluate and refine which customers are likely to end up making an online purchase. Those customers we can then treat with kid gloves.

In essence, when a new customer starts creating events in our event logs that match behavior leading to a purchase, we gain insight into the probability of an online purchase. This data is extremely valuable: we can put the customer on our valued-customer list and, using other data mining techniques, suggest other things the customer is interested in and get more sales.
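That probability can be estimated very simply from historical traces: given the prefix of events a new customer has produced so far, what fraction of past traces with the same prefix ended in a purchase? A sketch, with invented activity names:

```python
# Historical per-customer traces (hypothetical data).
history = [
    ["search", "view_item", "add_to_cart", "purchase"],
    ["search", "view_item", "leave"],
    ["view_item", "add_to_cart", "purchase"],
    ["view_item", "leave"],
]

def purchase_probability(prefix):
    """Fraction of historical traces starting with this prefix
    that contain a purchase; None if no trace matches."""
    matching = [t for t in history if t[:len(prefix)] == prefix]
    if not matching:
        return None  # no evidence yet for this behavior
    return sum("purchase" in t for t in matching) / len(matching)

print(purchase_probability(["view_item"]))                 # 0.5
print(purchase_probability(["view_item", "add_to_cart"]))  # 1.0
```

As the customer's trace grows, the estimate sharpens, which is the signal for moving them onto the valued-customer list.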

To recap, we can now measure explicit behaviors instead of inferring implicit ones from such limited metrics as past buying behavior. We add a whole new dimension to enhancing the shopping experience for our users, and thereby enhance our bottom-line revenue stream.

As in life, so in data mining: it pays to pay attention to the explicit things. Process mining is an incredibly efficient way to deduce the explicit behaviors that lead to desired outcomes on our platforms.

Never Mind Data Mining -- I Made a Data Refinery

refinery (n., pl. refineries): an industrial plant for purifying a crude substance.
(The Free Online Dictionary, www.thefreedictionary.com/refinery)

OK, so I'm writing some data mining software. Actually, I am writing a package to take raw data, load it into a data mart, validate the data, cleanse it, evaluate it, and then put it into a database where I can mine it.

But I have some unique challenges. The data is collected in a Third World African country. It is in the form of surveys. The data collected is to provide health care. The data collectors are indigenous people who are provided with a cell phone, survey instruments, and are paid to go out and survey people. They get paid per completed survey, and in a country where it is tough to make a dollar, this is a plum job.

The problem is that this country has a brokerage economy where everyone cheats. This country is known around the world for sending out email scams. A respected figure once said that if cheating were the pinnacle of civilization, then this country would be the most civilized in the world. It is not uncommon for the data collectors to sit on a street corner and make up surveys instead of going house to house.

So how do you get around that? Luckily, the mobile devices have GPS. When a survey is sent, the longitude and latitude go with it. The surveys are broken down into several steps. The first crew goes out, enumerates the houses and records the GPS coordinates of each house, and each house is given a control number. The following surveys must then match the control number, and the GPS coordinates must match. My software checks that.
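The check itself is straightforward: look up the control number in the enumeration registry and reject any survey whose GPS fix is too far from the enumerated house. A sketch with made-up control numbers, coordinates and tolerance (the real thresholds would depend on GPS accuracy in the field):

```python
import math

# Hypothetical registry from the enumeration crew: control number -> (lat, lon).
registry = {"HH-0421": (6.5244, 3.3792)}

def distance_m(a, b):
    """Approximate great-circle distance in metres (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))

def survey_is_plausible(control, coords, tolerance_m=100):
    """Reject surveys with an unknown control number, or whose GPS fix
    is farther from the enumerated house than the tolerance."""
    return control in registry and distance_m(registry[control], coords) <= tolerance_m

print(survey_is_plausible("HH-0421", (6.5245, 3.3793)))  # near the house
print(survey_is_plausible("HH-0421", (6.6000, 3.3792)))  # a street corner far away
```

A collector sitting on a street corner inventing surveys fails this check immediately, because every fabricated record carries the same wrong coordinates.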

One of the surveys asks how many members there are in the household. The following surveys ask the same question. When the surveys arrive by HTTP or GPRS, they all go into a database as raw data.

My software takes them out and evaluates them. The surveys come in at a couple of thousand a day, far too many to evaluate manually. My software has to take the bulk of the crude material, sift out the good stuff, and then operate on the good stuff.

It struck me that before I can do data mining, I have to do some data refining. The incoming surveys are like ore. The nuggets of real, live data are buried in the junk: the partially completed surveys, the fraudulent ones, the failed transmissions and corrupt ones, and the ones that are partially true and the rest made up. Just like ore, I have to purify the data before I can operate on it. It struck me that I have built a data refinery.

In various TED talks, I heard the stat that the whole world, up to the time of the internet, had produced five exabytes of data. We now produce five exabytes in a couple of days. A lot of it is pure crap. There are funny jokes, red-neck anti-Obama diatribes, emails from a bunch of Russians who think that I have a small penis, and all sorts of other spam.

Most of the exabytes of data that we generate are like ore. They need to be refined. It is possible to extract knowledge from the stream of fake Viagra emails and Facebook updates, and it can be monetized, but you have to separate the wheat from the chaff. It struck me that my data refinery can do the job.

There are lessons to be learned from my experience, and those lessons form the basis of a data refinery.

The first step is to get rid of the fake and fraudulent stuff; that is objective one for a data refinery. The second step is to identify the partial data and determine whether it can be salvaged. The third step is to cleanse the dirty but real data. And the fourth step is to put it all together in a clean place where you can operate on it and monetize it.
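The four steps chain together naturally as a pipeline. A deliberately simplified sketch on hypothetical survey records (the field names and the fraud-score threshold are my invention, standing in for the real checks described above):

```python
# Hypothetical raw survey records after ingestion.
surveys = [
    {"id": 1, "control": "HH-1", "household_size": 4,    "fraud_score": 0.1},
    {"id": 2, "control": "HH-2", "household_size": None, "fraud_score": 0.2},
    {"id": 3, "control": "HH-3", "household_size": 9,    "fraud_score": 0.9},
]

def refine(raw):
    # Step 1: get rid of the fake and fraudulent stuff.
    genuine = [s for s in raw if s["fraud_score"] < 0.5]
    # Step 2: identify partial data and decide if it can be salvaged
    # (here we simply drop it; a real refinery might re-query the collector).
    complete = [s for s in genuine if s["household_size"] is not None]
    # Step 3: cleanse the dirty but real data (normalize types, trim, etc.).
    cleaned = [{**s, "household_size": int(s["household_size"])} for s in complete]
    # Step 4: land it all in one clean place, ready to be mined.
    return cleaned

print([s["id"] for s in refine(surveys)])  # only record 1 survives refining
```

Each stage narrows the stream, so by the time the data reaches the mining database, only the purified ore remains.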

There you go -- the four basic steps of creating a data refinery. I am being deliberately vague on how this is done, because there is money in doing this. Race you to the patent office.