All Things Techie With Huge, Unstructured, Intuitive Leaps

Process Mining From Event Logs -- An Untapped Resource And Wave of The Future


A couple of years ago, I was searching for untapped horizons in data mining, and I came across a course given by Professor Wil van der Aalst, who pioneered the technology of mining business processes from server event logs. Naturally, I signed up for the course. It was a fascinating course, not only for its in-depth, non-trivial treatment of gleaning knowledge from data, but because it got my creative juices flowing about where the techniques could be applied elsewhere. I was so intrigued with the possibilities that I created a Google Scholar alert for Professor van der Aalst's publications. The latest alert arrived on January 31st, for a paper entitled "Connecting databases with process mining". The link is here: http://repository.tue.nl/858271 It was this paper that triggered this article.

I am a huge proponent of AI, Machine Learning and Analytics. In Machine Learning, you gather large datasets, clean the data, section it into smaller sets for training and evaluation, and then train a model through hundreds, perhaps thousands, of training epochs until the probability of gaining the sought-after knowledge crosses an appropriate threshold. Machine intelligence is a huge field of endeavor, and it is becoming a major part of everyday life. However, it is time-consuming to teach the machine and get it right. Professor van der Aalst's area of expertise can provide a better way. Let me explain:

My particular interest is that I am building a semantic blockchain to record all of the data coupled to vehicles, autonomous or not. Blockchain, of course, is an immutable ledger of truth that operates autonomously, disintermediates third parties and is outage-resistant. Autonomous vehicles will, by law, be required to log every move and keep records of their software revisions, post-crash behavior and so on.

I immediately saw the possibilities of using this data. Suppose that you are in an autonomous vehicle, and that vehicle has never been on a tricky roadway that you need to navigate to get to your destination. Your car doesn't know the route parameters, but thousands of other autonomous vehicles have driven it, including many with your kind of operating system and software. With the connected car, your vehicle would know its GPS coordinates and query a system for the driving details of this piece of roadway that is unknown to its computer. Instead of requiring intense computational ability to navigate, it could download a recipe of driving features.

Rather than garnering those instructions from repeated training epochs in machine learning, one could apply process mining to the logs to extract the required knowledge. There are already semantic methods of communicating processes, from decision trees to Petri nets, and if the general process were already known to the machine, the computational load would be reduced. As a matter of fact, each vehicle could have a process mining module that extracts high-level algorithms for the roads it drives regularly, which in itself would reduce the vehicle's computational load. It would know in advance where the stop signs are, for example, and you wouldn't have YouTube videos of self-driving cars going through red lights and stop signs.
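As a minimal sketch of what such a process mining module might do, here is a directly-follows count over traces -- the first step most process discovery algorithms take before building a Petri net. The event names and traces are hypothetical, purely for illustration:

```python
from collections import defaultdict

def directly_follows(event_log):
    """Count how often activity a is directly followed by activity b
    across all traces (one trace per vehicle trip over the road)."""
    counts = defaultdict(int)
    for trace in event_log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return dict(counts)

# Hypothetical traces logged by vehicles driving the same stretch of road
log = [
    ["enter_road", "slow_down", "stop_sign", "proceed", "exit_road"],
    ["enter_road", "slow_down", "stop_sign", "proceed", "exit_road"],
    ["enter_road", "stop_sign", "proceed", "exit_road"],
]

dfg = directly_follows(log)
# ("stop_sign", "proceed") appears in every trace, so the mined model
# learns that a stop sign sits on this stretch of road.
print(dfg[("stop_sign", "proceed")])  # → 3
```

Real process mining tools go much further (frequencies, concurrency, conformance checking), but even this simple graph encodes "always stop here" without a single training epoch.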

It goes a lot further than autonomous vehicles. This concept of creating high-level machine processes from event logs can be applied to fields as diverse as robotic manufacturing, cloud server monitoring and numerous others where human operators or real-world human judgement are currently required.

Process mining could either eliminate machine learning in a lot of instances, or it could supplement it, with a mix of technologies. The aim is the same, which is aggregating data into information and integrating information into knowledge, both for humans and machines.

This process mining business reminds me of the history behind Bayesian inference. The Reverend Thomas Bayes derived the equations relating probability to prior belief. They sat on a dusty shelf for over 200 years before being re-purposed for computer inference and machine intelligence. I think that Professor van der Aalst's methodologies will likewise be re-purposed for things yet unimagined, and it will not take 200 years to come to fruition.



Connected Autonomous Cars, Big Data, and Not Re-inventing The Wheel


Smart Roads Need Not Be So Smart

The introduction of technologies into daily life lets us let go of old paradigms and jettison conventional ideas. I was in a deep conversation last night at dinner with a philosopher friend, telling him that I was working with automotive blockchain as a true ledger -- especially for self-driving cars. I mentioned that perhaps we would need smart roads, or smart road-sign sensors, to indicate things like speed limits to the autonomous car.

We got into a discussion on how self-driving cars will change everything about mobility -- even the concept of your car sitting in a parking lot all day. For example, after your self-driving car drops you off at work, you can send it out to work for money as an Uber car, and it comes to pick you up after your work day is done. Or you can send it home.

My friend opined that with this and other technologies, one is only limited by the imagination as to what can be implemented. He didn't think that we would need smart roads. He pointed out that using Big Data, the computational load of self-driving cars could be significantly reduced. We wouldn't need smart roads hardware embedded in geographic locations. It was brilliant.

Here is how it will work. My blockchain is intended as a vehicle black box recorder. Everything with the connected car is recorded in real time. This includes GPS coordinates, date, time, and all of the instructions issued by the operating system of the vehicle to drive a particular stretch of road. Here is the clever bit.

Suppose all of this data is uploaded to a central repository and is searchable. A connected autonomous vehicle, upon entering a specific roadway, would access this information. Through Big Data analytics, it would know the average driving conditions and speeds for the time of day, season of the year, rush hour, rain, sleet and snow, and it would know the salient features of the roadway. For example, you won't have self-driving cars running red lights or stop signs like you see on YouTube now, because those features would be known in advance. The car would know where to watch out for vehicles exiting a driveway (based on a history of cars stopping to let those vehicles out). In other words, you would have a smart roadway without sensors and without Internet of Things (IoT) indicators. It would be like Google Street View for autonomous vehicles. Vehicles would be able to search, find and download roadway features, and use them to navigate without an intense computational load on the car's operating system. The onboard driving system would only have to detect anomalies and other traffic. You would not be re-inventing a computational feature map every time you went down that road.
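A minimal sketch of the lookup side of such a repository, keyed by the GPS coordinates of a road segment. The `ROAD_FEATURES` structure, the feature names and the distance threshold are all hypothetical illustrations, not any real system's API:

```python
import math

# Hypothetical repository: road-segment features keyed by the (lat, lon)
# of the segment start. Feature names are illustrative only.
ROAD_FEATURES = {
    (43.6532, -79.3832): {"speed_limit": 50,
                          "stop_signs_km": [0.12, 0.48],
                          "rush_hour_avg_speed": 22},
}

def nearest_segment(lat, lon, repo, max_km=0.5):
    """Return the features of the closest known segment within max_km,
    or None if no segment is close enough."""
    best, best_d = None, max_km
    for (slat, slon), feats in repo.items():
        # Equirectangular approximation: fine at street scale
        d = math.hypot((lat - slat) * 111.32,
                       (lon - slon) * 111.32 * math.cos(math.radians(slat)))
        if d < best_d:
            best, best_d = feats, d
    return best

feats = nearest_segment(43.6530, -79.3830, ROAD_FEATURES)
print(feats["speed_limit"])  # → 50
```

The onboard system would merge these downloaded features with live sensor data, so its own computation is reduced to handling anomalies and traffic rather than rediscovering the road.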

Smart roads would be smart because there would be a driving-instruction history created by thousands of vehicles on how to navigate these roads. They would be mapped with a GIS system that included driving parameters.

It would be the Google search engine for the brain inside your car. I am sure that Google has already thought of this concept. They were forward-looking enough to start Street View, but there is always room for a better mousetrap hatched by a disrupter. The disruption in this case is to present the driving parameters in a way that will be understood by all self-driving cars. Therein lies the next billion-dollar play.

Sentiment Analysis And Data Mining To Understand The World's Problems of Today


I was genuinely perplexed. The world is a vastly different place than I envisioned it as a teenager. It seemed that the continued enlightenment and scientific advancement in the years from post-World War II to the turn of the millennium would bring the world into a less chaotic global village, with a greater degree of peace, stability and economic well-being for mankind. In many respects, the world has regressed.

Purely for my own understanding, I decided to try to figure out some of the reasons for the current problems of the world, using my skills in data mining. I took twenty top international news sites and, by scraping their content with open source tools, assembled a collection: a snapshot of the world today. Encapsulated in that collection would be a good starting point for a list of the major problems of the world.

To do some preliminary research into the world's problems, I decided to see what research was out there in the public domain. Eurobarometer had actually conducted a poll across the length and breadth of Europe, and came up with the following list of the top ten major world problems:

  • #10 Don't Know
  • #9 Proliferation Of Nuclear Weapons
  • Tied #7 Armed Conflict
  • Tied #7 Spread Of Infectious Disease
  • #6 The Increasing Global Population
  • #5 Availability Of Energy
  • #4 International Terrorism
  • #3 The Economic Situation
  • #2 Climate Change
  • #1 Poverty, Hunger And Lack Of Drinking Water

It is interesting that two percent of the people in Europe answered with "Don't Know".  This was the reason that I conducted this exercise in the first place.

After I had my collection of data from the news sources, I decided to do a bottom-up analysis of the news. I tagged each story with a tag that generally summarized its theme. I had a lot of tags, and at that point I needed to do some feature engineering, adding a layer of abstraction to the tags so that the stories could be grouped by similarity. I kept adding layers of abstraction until I got a manageable number of tags, and then did a bottom-up Naive Bayes classification of the tags. The classifiers neatly categorized the stories.
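A minimal from-scratch sketch of that classification step -- a multinomial Naive Bayes with Laplace smoothing over word counts. The training stories and tag names below are toy illustrations, not my actual dataset:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_stories):
    """Build word counts per tag, tag frequencies, and the vocabulary."""
    word_counts, tag_counts, vocab = defaultdict(Counter), Counter(), set()
    for text, tag in labeled_stories:
        words = text.lower().split()
        word_counts[tag].update(words)
        tag_counts[tag] += 1
        vocab.update(words)
    return word_counts, tag_counts, vocab

def classify(text, model):
    """Return the tag with the highest log posterior for the text."""
    word_counts, tag_counts, vocab = model
    total = sum(tag_counts.values())
    best_tag, best_score = None, float("-inf")
    for tag in tag_counts:
        score = math.log(tag_counts[tag] / total)          # log prior
        denom = sum(word_counts[tag].values()) + len(vocab)
        for w in text.lower().split():
            # Laplace (add-one) smoothing so unseen words don't zero out
            score += math.log((word_counts[tag][w] + 1) / denom)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag

# Toy training data with abstracted tags
train = [
    ("drought floods emissions warming", "Environment"),
    ("refugees border crossing asylum", "Migrant Problems"),
    ("warming sea levels emissions", "Environment"),
]
model = train_nb(train)
print(classify("emissions and warming rise", model))  # → Environment
```

In practice the abstraction layers matter more than the classifier: the same code works whether the tags are raw story themes or the condensed categories several layers up.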

I didn't just want a grocery list of the problems. I was looking for something deeper: answers related to the human condition, and how we, as the varied group of humans who inhabit this earth, feel, react, create and possibly solve these problems. Consequently, I created another layer of abstraction, a broad-brush category of problems that condensed the list into a smaller but cogent set relating directly to the human condition. Once I had the bottom-up tag analysis done, I decided to do a top-down sentiment analysis of my problem tags. It would be interesting to see how my analysis would compare with the Eurobarometer analysis.

Don't forget, my list came from news sites, so it represents a snapshot of what was at the forefront during this particular time period. Here is my list of twelve issues:
  • Africa Issues
  • Alienation/Marginalization of peoples/societies/groups
  • Business Sector Wars/Competition
  • Economical Structural Change
  • Environment 
  • Globalization
  • Mass Media/Censorship/Subjectivity
  • Migrant Problems
  • Nationalism
  • Partisanship
  • Religious Fundamentalism/Jihadism/Religious Wars
  • Technology Frontiers/Problems

The differences between my list and the Eurobarometer list were apparent. Africa was not on the Eurobarometer list, whereas it was represented as its own category in the recent news, and indirectly in the migrant issues (although the migrant issues are a global phenomenon, including the Caribbean, where Haitians fleeing their homeland are causing problems in the neighboring countries).

In trying to understand root causes, one of the surprising inclusions on my list was Alienation/Marginalization of peoples/societies/groups. This included stories about gay rights, Kurdish struggles in other countries, Sunni versus Shia, Basques versus Spaniards, etc.

So how did my sentiment analysis turn out? For my limited study, the environment is the number one issue in terms of global problems. Here is the list, with the percentage of stories connected to each issue.
    
  • Environment: 54.21%
  • Alienation/Marginalization of peoples/societies/groups: 23.29%
  • Mass Media/Censorship/Subjectivity: 8.48%
  • Migrant Problems: 3.92%
  • Globalization: 1.75%
  • Business Sector Wars: 1.71%
  • Technology Frontiers: 1.67%
  • Partisanship: 1.35%
  • Nationalism: 1.26%
  • Religious Fundamentalism: 1.03%
  • Africa Issues: 0.95%
  • Economical Structural Change: 0.36%
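The percentages above are simply each tag's share of the tagged stories. A minimal sketch of that arithmetic, with hypothetical story counts (my actual counts are not reproduced here):

```python
from collections import Counter

# Hypothetical per-story tags after the final layer of abstraction
story_tags = (["Environment"] * 62 +
              ["Alienation/Marginalization"] * 27 +
              ["Mass Media/Censorship"] * 10 +
              ["Migrant Problems"] * 5)

counts = Counter(story_tags)
total = sum(counts.values())
# Each tag's share of all tagged stories, as a percentage
shares = {tag: round(100 * n / total, 2) for tag, n in counts.items()}
print(shares["Environment"])  # → 59.62
```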

It seems that most conflicts in the world arise from the number two problem: Alienation/Marginalization of peoples/societies/groups. This is probably the root cause of most social problems facing any area of the globe today. Everyone wants and needs their own place in the sun, and others are trying to prevent them from having it, for a whole range of reasons.

People also seem to be concerned about their sources of information.  Right wing groups accuse the mainstream media of liberal bias.  Conservative news sites are mocked as Faux News. It seems that in the plethora of information sources, everyone has a hidden agenda, and folks are concerned about it. Objective information is very hard to find, with the democratization of information dissemination on the internet.

There is no need to further expound on migrant problems, which came in at number 4 on my list. It is hugely topical.

There are still worries about globalization, but it doesn't have the same impact as the people or environment related stories.

It is interesting that business and technology appear on the list of problems. The general sentiment is that business is anti-humanistic, pursuing profit for profit's sake at the expense of the human condition. Technology is seen as a threat, with artificial intelligence, killer robots and job destruction.

The next two categories are somewhat related: partisanship and nationalism. They are both 'people interacting in their countries' stories. Partisanship is now rampant, with a gridlocked Congress versus the president and the Confederate flag issue, while nationalism is seen in various venues around the globe: Scotland wants to exit the United Kingdom, Great Britain wants to exit the European Union, the Basque Country and Catalonia want to exit Spain, Quebec wanted to separate from Canada, ad infinitum.

Religious fundamentalism is inexplicably rising. There seems to be growing intolerance between the mainstream and fundamentalism. This is seen not only in the Muslim world, but also in the US, where a city clerk refused to issue marriage licences to gay couples because of fundamentalist religious beliefs. We have seen Baptist churches picketing the funerals of slain American soldiers on religious grounds. Who would have predicted this shift 30 years ago? I would be interested in knowing why there is a swing to fundamentalism in the modern world. In broad brush strokes, this seems to be a struggle of progression versus regression, and it is inexplicable to rational thought.

Africa is low on the list, but concerning. It was the site of proxy wars between the superpowers over the last 60 years or more, and now there is currency collapse, armed conflict, epidemics, partisan in-fighting, loss of democracy and pretty much any social, economic or environmental ill that anyone can name. Africa creates instability in the global village.

Bringing up the bottom of the list is fundamental economic change. Long-term jobs are being replaced by the gig economy. Manufacturing is undergoing fundamental changes. The biggest profits now come from virtual paper transactions on Wall Street, with the one-percenters jerking the economy around with their financial derivatives and dark markets.

Certainly this exercise has opened the window and shed some light for me, but as usual, answers to these issues are elusive and complex, and in many cases there are no apparent ones. Life does seem to go on.

Conquering The Time Domain in Marketing With Big Data & Analytics


Our platform sells big ticket items -- it remarkets and wholesales used cars. The supply chain is well defined. A new car dealer takes in a car on trade. He really doesn't want to, because most used cars are not moneymakers. If a car sits on the used car lot forever, it loses money instead of making money, because the new car inventory underlying that trade-in is usually financed. To complete the deal cycle of used car trade-in -> new car purchase -> used car sale to recoup the money, the used car has to sell quickly.

Secondhand car dealers in small markets are experts at what sells and for how much, and what the market is willing to pay. They have intense local knowledge of their geographic domain.  A lot of the time, new car dealers do not have that expertise and/or knowledge.

Coupled to this fact is that, in spite of the parameters of make, model, year and condition, there is no uniform valuation for a used vehicle. It varies by area, time of year, color of vehicle, local economy and a million and one other factors. Folks like Black Book try to standardize the valuation process, but at best they are only a rough guide based on auction prices around the continent.

As we have shown in this article, the Black Book paradigm of gleaning value from auctions is not accurate, because up to two-thirds of all vehicles are remarketed through relationship-based wholesaling and never hit the auction floor.

Coupled to that, there is no "real price" for any used vehicle. What a vehicle sells for is based on what the new car dealer has in it (a combination of what he thinks the vehicle is worth and the discount that he allowed on the new car bought with the trade-in). A good example: on our platform recently, a dealer had $9,000 in an SUV. That was the reserve price he put on the vehicle, because that is what he needed to make the deal profitable. He let market forces dictate the ultimate price, but he needed $9,000. The SUV sold for $27,000 in the fair and equitable marketplace on our platform. So what was the vehicle worth? It was worth $9,000 to one person and triple that to another. This is why we introduced crowd-sourced valuations into our platform.

But there is one other element in marketing that transcends specific sectors, and that is the time element.  Currently, a light manufacturer will do a run of product, and try to flog it off to wholesalers, retailers, online markets etc.  It costs money to hold the product in inventory.

Technology such as 3D printing and print on demand for digital books alleviates some inventory build-up, but generally the time domain is huge in merchandising and marketing.  What I mean by that, is that inventory is built up, and disposed of over time at ever-changing prices based on supply and demand. There is a measurable, considerable cost to storing inventory.

As pointed out above, in the automobile remarketing industry that we are in, the time domain is a negative one. The longer an item stays in inventory, the less it is worth, and the larger the drag on the bottom line. Positive revenue stream is based on timely sales.

To conquer the time domain, we used Big Data to our advantage. We coupled it with our relationship-based sales paradigm described in the above link, and as it turns out, this piece of technology was patentable; we have foundation patents pending in that area.

This is how it works. The whole idea is to move inventory quickly. We have mapped the buyer/seller network relationships (a social network media type of construct) with trusted buyer zones based on previous commercial relationships. This is the first step in the process that we have created.  The product is offered to this trusted network group for a limited time (in our case, four hours is a norm). If the product does not sell, what then? As the clock ticks, money is lost.

The second step involves Analytics. We use Big Data to find, in our customer base and in other databases, the best and most frequent kind of buyer for this product. The machine assembles a top-10 list based on a proprietary algorithm for sifting through Big Data, and offers the product to that ad hoc group of buyers for a limited time. The really nice part is that once buyers find out about the top ten, we have a potential revenue stream where they will pay for early market information and a chance at a deal.

When that time expires, the platform has the smarts to move the inventory to the next phase of selling. In our case, it goes to general auction to the open group of buyers, and if that fails, the platform has the technology and ability to transfer the inventory to a classified type of listing.
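The staged workflow above can be sketched as a simple fall-through pipeline. The stage functions below are stand-ins for the real timed offer windows, and the example numbers come from the SUV story earlier in this post:

```python
def run_sale(item, stages):
    """Walk an item through successive selling stages until one succeeds.
    Each stage is (name, sell_fn); sell_fn returns a sale price, or None
    when the stage's time window expires without a sale."""
    for name, sell_fn in stages:
        price = sell_fn(item)
        if price is not None:
            return name, price
    return "unsold", None

# Hypothetical stage functions; real ones would run timed offers
stages = [
    ("trusted_network", lambda item: None),   # 4-hour trusted-group window: no sale
    ("analytics_top10", lambda item: None),   # Big Data top-10 buyers: no sale
    ("open_auction",    lambda item: 27000),  # general auction sells the SUV
    ("classified",      lambda item: None),   # fallback classified listing
]

print(run_sale({"reserve": 9000}, stages))  # → ('open_auction', 27000)
```

The key design point is the ordering: the cheap, relationship-based stages run first, and the open market is the fallback rather than the starting point.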

Our competitive advantage is that we have conquered the time domain with relationship-based social network selling for the first step and the use of Big Data for the second step. Our competitors use the third step as their first step.

Big Data has a huge advantage in conquering the time domain. Suppose, as a manufacturer or even a retailer, you had a platform to sell all of your inventory in a specified time frame. With a platform such as ours, adapted to other fields, you could commoditize your inventory and, using relationship-based selling coupled with Big Data, have your inventory dispersed just as it was about to leave the factory floor or arrive at a shipping dock. Big Data will even tell you how much inventory to order and make.

Merchandising and selling will change drastically in the next few years, and those that don't adopt the Analytics/Machine Learning paradigm will bite the dust.