Time For Data DNA
I'm thinking of having my DNA tested because most days, I feel that I'm not me. Weird ideas keep popping into my head. Yesterday I was thinking about data & analytics.
As a data science aficionado, I like to keep every scrap of data. I like to find the things that the data tells me. Most of the time it tells me that my life is boring and I should go outside for a walk, or find a way to get a private jet or something. But a thought with more gravitas struck me. Mother Nature doesn't keep a database of every animal, plant or cognitively-impaired Trumpkin ever produced. The pattern is kept in DNA. Whenever she wants to play dice with the universe, Cupid sends an arrow through the hearts of polar opposites in the fat tails of a population distribution, and interesting things happen -- usually in a manner where interesting is like circus-interesting.
So back to databases. Other than keeping an entry of identifying things like name and address, why do we have to have multitudinous tables of metadata about a specific humanoid? Why can't we have a classifier that stores a shorthand model like DNA? This would be helpful for merchants holding customer data. After all, any merchant has a finite number of customer types. There has to be a better data way.
(originally appeared as a Linked In post: https://www.linkedin.com/in/ken-bodnar-57b635133/ )
Process Mining From Event Logs -- An Untapped Resource And Wave of The Future
A couple of years ago, I was searching for untapped horizons in data mining, and I came across a course given by Professor Wil van der Aalst, who pioneered the technology of business process mining from server event logs. Naturally I signed up. It was a fascinating course, not only for its in-depth and non-trivial treatment of gleaning knowledge from data, but because it got my creative juices flowing about where else the techniques could be applied. I was so intrigued with the possibilities that I created a Google Scholar Alert for Professor van der Aalst's publications. The latest alert arrived on January 31st, for a paper entitled "Connecting databases with process mining". The link is here: http://repository.tue.nl/858271 It was this paper that triggered this article.
I am a huge proponent of AI, Machine Learning and Analytics. In Machine Learning, you gather large datasets, clean the data, section it into smaller sets for training and evaluation, and then train an AI machine over hundreds, perhaps thousands, of training epochs until the probability of gaining the sought-after knowledge crosses an appropriate threshold. Machine intelligence is a huge field of endeavor, and it is progressing to be a major part of everyday life. However, it is time-consuming to teach the machine and get it right. Professor van der Aalst's area of expertise can provide a better way. Let me explain:
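To make that workflow concrete, here is a minimal toy sketch of the loop described above -- gather, clean, split, then train until an accuracy threshold is crossed. The data, the little perceptron model, and all thresholds are illustrative stand-ins, not a production pipeline:

```python
import random

def clean(rows):
    # toy data-cleaning step: drop records with missing values
    return [r for r in rows if None not in r]

def split(rows, train_frac=0.8):
    # section the data into training and evaluation sets
    random.shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def accuracy(w, b, rows):
    hits = sum((w[0]*x1 + w[1]*x2 + b > 0) == (y == 1) for x1, x2, y in rows)
    return hits / len(rows)

def train(rows, threshold=0.95, max_epochs=1000):
    # run training epochs until accuracy crosses the sought-after threshold
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        for x1, x2, y in rows:
            pred = 1 if w[0]*x1 + w[1]*x2 + b > 0 else 0
            err = y - pred          # perceptron update rule
            w[0] += 0.1 * err * x1
            w[1] += 0.1 * err * x2
            b += 0.1 * err
        if accuracy(w, b, rows) >= threshold:
            break
    return w, b

random.seed(42)
data = [(0.5, None, 1)]             # one dirty record, cleaned out below
while len(data) < 200:
    x1, x2 = random.random(), random.random()
    if abs(x1 + x2 - 1) > 0.1:      # keep a margin so the toy model converges
        data.append((x1, x2, 1 if x1 + x2 > 1 else 0))
train_set, eval_set = split(clean(data), 0.8)
w, b = train(train_set)
print(accuracy(w, b, eval_set))
```

Even this toy version shows why the process is slow: the knowledge arrives only after many passes over the data.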
My particular interest is that I am building a semantic blockchain to record all of the data coupled to vehicles, autonomous or not. Blockchain, of course, is an immutable, true data ledger that is itself autonomous in operation, disintermediates third parties, and is outage-resistant. Autonomous vehicles will, by law, be required to log every move, keep records of their software revisions, and keep records of things like post-crash behavior.
I immediately saw the possibilities of using this data. Suppose that you are in an autonomous vehicle, and that vehicle has never been on a tricky roadway that you need to navigate to get to your destination. Your car doesn't know the route parameters, but thousands of other autonomous vehicles have driven it, including many with your kind of operating system and software. With the connected car, your vehicle would know its GPS coordinates and query a system for the driving details of this piece of roadway that is unknown to its computer. Instead of requiring intense on-board computation to navigate, a recipe of driving features could be downloaded.
Rather than garnering those instructions from repeated training epochs in machine learning, one could apply process mining to the logs to extract the knowledge required. There are already semantic methods of communicating processes, from decision trees to Petri nets, and if the general process were already known to the machine, it would reduce the computational load. As a matter of fact, each vehicle could have a process mining module to extract high-level algorithms for the roads that it drives regularly. That in itself would reduce the computational load on the vehicle. It would know in advance where the stop signs are, for example, and you wouldn't have YouTube videos of self-driving cars going through red lights and stop signs.
It goes a lot further than autonomous vehicles. This concept of creating high-level machine processes from event logs can be applied to fields as diverse as robotic manufacturing and cloud server monitoring, and to numerous domains where human operators or real-world human judgement are currently required.
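As a rough sketch of the idea, a common first step in process mining is to build a directly-follows graph from the event log: a count of how often one activity immediately follows another. The driving events below are hypothetical, invented for illustration:

```python
from collections import defaultdict

def directly_follows(event_log):
    # count how often activity a is directly followed by activity b
    # across all traces (one trace per vehicle trip, in this sketch)
    counts = defaultdict(int)
    for trace in event_log:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return dict(counts)

# hypothetical driving-event traces logged by three vehicles on one road
log = [
    ["enter_road", "slow_down", "stop_sign", "proceed", "exit_road"],
    ["enter_road", "slow_down", "stop_sign", "proceed", "exit_road"],
    ["enter_road", "stop_sign", "proceed", "exit_road"],
]
dfg = directly_follows(log)
print(dfg[("stop_sign", "proceed")])  # → 3
```

From counts like these, process mining tools derive higher-level process models (such as Petri nets) that a vehicle could consume directly, with no training epochs at all.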
Process mining could either eliminate machine learning in a lot of instances, or it could supplement it, with a mix of technologies. The aim is the same, which is aggregating data into information and integrating information into knowledge, both for humans and machines.
This process mining business reminds me of the history behind Bayesian inference. The Reverend Thomas Bayes worked out the equations relating probability and prior belief. They sat on a dusty shelf for some 200 years before being re-purposed for computer inference and machine intelligence. I think that Professor van der Aalst's methodologies will be re-purposed for things yet un-imagined, and it will not take 200 years to come to fruition.
Connected Autonomous Cars, Big Data, and Not Re-inventing The Wheel
Smart Roads Need Not Be So Smart
The introduction of technologies into daily life lets us let go of old paradigms and ways of doing things. It also lets us jettison conventional ideas. I was in a deep conversation last night at dinner with a philosopher friend and I was telling him that I was working with automotive blockchain as a true ledger -- especially for self-driving cars. I mentioned that perhaps we would need smart roads or smart road sign sensors to indicate things like speed limits and such to the autonomous car.
We got into a discussion on how self-driving cars will change everything about mobility -- even the concept of your car sitting in a parking lot all day. For example, after your self-driving car drops you off at work, you can send it out to work for money as an Uber car, and it comes to pick you up after your work day is done. Or you can send it home.
My friend opined that with this and other technologies, one is only limited by the imagination as to what can be implemented. He didn't think that we would need smart roads. He pointed out that using Big Data, the computational load of self-driving cars could be significantly reduced. We wouldn't need smart roads hardware embedded in geographic locations. It was brilliant.
Here is how it will work. My blockchain is intended as a vehicle black box recorder. Everything with the connected car is recorded in real time. This includes GPS coordinates, date, time, and all of the instructions issued by the operating system of the vehicle to drive a particular stretch of road. Here is the clever bit.
Suppose all of this stuff is uploaded to a central repository, and is searchable. The connected autonomous vehicle, upon entering a specific roadway, would access this information. Through Big Data analytics, it would know the average driving conditions and speed for the time of day, season of the year, rush hour, rain, sleet, or snow, and it would know the salient features of the roadway. For example, you won't have self-driving cars running red lights or stop signs like you see on YouTube now, because those features will be known in advance. The car will know things like where to watch out for other vehicles exiting a driveway (based on a history of cars stopping to let those vehicles out).
In other words, you will have a smart roadway without sensors and without Internet of Things (IoT) indicators. It will be like Google Street View for autonomous vehicles. The vehicles will be able to search, find and download roadway features, and use those features to navigate, without an intense computational load on the car's operating system. The onboard driving system would only have to detect anomalies and other traffic. You would not be re-inventing a computational feature map every time you went down that road.
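A minimal sketch of such a lookup, assuming a hypothetical central repository keyed by quantized GPS coordinates. Every key, field name, and value here is invented for illustration:

```python
def tile_key(lat, lon, precision=3):
    # quantize GPS coordinates to a grid cell (~100 m at precision=3)
    return (round(lat, precision), round(lon, precision))

# hypothetical central repository of crowd-sourced roadway recipes
repository = {
    tile_key(43.6534, -79.3841): {
        "speed_limit_kph": 50,
        "features": ["stop_sign", "crosswalk"],
        "avg_speed_rush_hour_kph": 22,
    },
}

def roadway_profile(lat, lon):
    # the vehicle queries by its own coordinates instead of re-deriving the scene
    return repository.get(tile_key(lat, lon))

profile = roadway_profile(43.65342, -79.38411)
print(profile["features"])  # → ['stop_sign', 'crosswalk']
```

The point of the sketch is the shape of the transaction: a cheap key lookup replaces an expensive on-board computation.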
Smart roads would be smart because there would be a driving-instruction history created by thousands of vehicles on how to navigate these roads. They would be mapped with a GIS system that included driving parameters.
It would be the Google search engine for the brain inside your car. I am sure that Google has already thought of this concept. They were forward-looking enough to start Street View, but there is always room for a better mousetrap hatched by a disrupter. The disruption in this case, is to present the driving parameters in a way that will be understood by all self-driving cars. Therein lies the next billion dollar play.
How To Be A Billionaire Using Big Data and Machine Learning in Three Easy Paradigms
1) Download Wireshark and load it onto a laptop with the biggest hard disk that you can find.
2) Go to the airport, sit there all day using the free airport WiFi, and turn on the record function in Wireshark.
3) Use data-mining and machine learning on the datasets.
The billion-dollar platform idea will emerge from the data. I guarantee it.
Future Job Category ~ Data Grader And Goal-Oriented Valuator
The one thing that I have learned from designing a remarketing platform is that there is a market for absolutely everything. This was exemplified by my trip to Siberia. Near Finland's border, I discovered a Russian millionaire who made his money from crap. Literally. He was a chicken farmer. Up on the northern peninsula toward Finland, the ground is rocky and devoid of nutrients. Chicken crap is an amazing fertilizer, full of nitrogen. He would trade piles of chicken crap for used Swedish cars -- mostly beaters that the Finnish farmers had bought cheaply. Getting a car in Russia is tough if you are not in a major center, and he marked up his cars considerably. Within a few years, he became a millionaire. Like I say, there is a market for everything.
I was watching a video on the modern fur trade, from the trapping right down to the auction. It was fascinating, because when we think of the fur trade, we think of glamorous women in fur coats on the runway. But in the case of furs, there is an industrial market for crap furs. The Chinese are the biggest buyers. They will buy rabbit fur, and furs that are not good enough for clothing, and use them for toys, novelties, lining the inside of boots -- wherever.
So there is a job description in the fur-purveying process called a fur grader. Before the auction, he and his staff go through the bales and bundles of furs and group them according to quality and type. Bales for sale consist of similar types and grades, so that the quality of each bale is consistent.
As I was watching this, it struck me that the same thing will happen to data. There will be a data grader who will assemble datasets, grade them, evaluate their marketability, cleanse the data, and put it on a data exchange. In the case of valuable datasets, there will be a data auction. And if you are thinking of starting a data exchange, have I got the platform for you !!!
You will also get the arbitrageur and the day trader of data. If I see a dataset that is going cheap, I just may pick it up. You see, I will have machine learning on my side, so I can process it. I will do goal-oriented mining and create multiple datasets of differing things with differing values. Much like capital partners buy an ailing company and break it up, selling the parts for more than the whole was worth, the same will be done for data.
It's a brave new world out there. Folks in the industry used to think that the never-ending, endlessly multiplying streams of data were a scourge. They are actually a raw material and an asset. After all, there is a market for everything.
Big Data, Data Mining and Machine Learning in Advertising
The advertising industry has been pretty much on the ball when it comes to exploiting audiences and corporations in the name of spreading a brand message. However, Google has eaten their lunch when it comes to advertising revenues, and the sole reason is that Big Data is the most effective way to target a demographic, and Google is the king of Big Data. Google knows what people want to buy from the searches that consumers do. They have the stats on what makes someone click, when they click, what will best make them click, where they are predicted to live, and pretty much everything that an advertiser wants to know to reach an audience.
The laggards in the industry are the big and small advertising agencies and brokers. I can predict the biggest, most profitable players in advertising over the next 5 years just by examining who is adopting data mining, big data and machine learning in the advertising field.
Advertising has been bought and sold at a very coarse level of granularity. For example, it's a slam dunk that if you want to get a brand message across, the quickest way is to buy a Super Bowl ad. But not everyone is a Fortune 500 company that can afford an ad in the ... the ... well ... Super Bowl of Advertising.
So how will advertising agencies monetize themselves and create huge and diverse revenue streams in the very near future?
Here is an example of how it will work. With Big Data and Machine Learning, the advertising industry can operate like a futures exchange for media placement. They will craft the brand message. Big Data and machine learning will not only tell them what the most effective demographic will be, but will do it with a fine degree of granularity as well. Advertising will be more goal-oriented. If an advertiser wants to find new markets in a new demographic, Big Data and Machine Learning will tell the advertiser where to do it. They will also outline the optimal time, the optimal vehicle, the optimal message and the optimal leitmotif of the brand message. If advertisers want to expand their market penetration within their current target demographic, they will know where to do that as well.
A finer degree of granularity will mean real-time media monitoring with a control room full of Bloomberg-like stock market screens. If Reddit is trending at a million hits per hour with content classified as appealing to the 25-35 year old demographic, it may be time to buy a spot from an ad placement futures exchange and pop the content onto the site within a couple of minutes. Advertising space will be bought in bulk and traded in smaller elements, like a commodity exchange, in real time. It is media-adaptive advertising.
If CNN has an exclusive breaking story and millions are flocking to their site, the viewing demographic can be machine-analyzed in real time (a Bayesian probability, again gleaned from millions of training epochs) and a highly effective ad can be placed, with a return on investment (ROI) that would be stratospheric compared to a regular buy.
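A toy version of such a real-time buying rule might look like the following. The thresholds, field names, and segment label are all invented for illustration; a real exchange would involve live feeds and auction mechanics:

```python
# hypothetical real-time buying rule for an ad-placement futures exchange
TREND_THRESHOLD = 1_000_000   # hits per hour before we consider buying
TARGET_SEGMENT = "25-35"      # demographic the campaign is aimed at

def should_buy_spot(site_stats):
    # buy only when the property is trending AND its audience matches
    # the campaign's target demographic
    return (site_stats["hits_per_hour"] >= TREND_THRESHOLD
            and site_stats["audience_segment"] == TARGET_SEGMENT)

stats = {"site": "reddit.com", "hits_per_hour": 1_200_000,
         "audience_segment": "25-35"}
print(should_buy_spot(stats))  # → True
```

The interesting engineering is everything around this rule: the live classification of audiences and the exchange that lets a spot be bought within minutes.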
Amazon or Google will develop a real-time ad engine platform (RTADS -- or they will buy one from guys like me), and it will be the next big thing. App developers will be paid to incorporate RTADS (Real Time Ad Delivery System) so that trending, topical media events can be exploited with embedded, dynamic advertising based on who is consuming the media. As a matter of fact, the next Bloomberg-sized fortune will be made on the optimal ad-monitoring station for advertising insertion. It is a function that is unknown today; tomorrow, the developers and maintainers of such systems will have a coherent job description for it.
Machine Learning can predict good wine years and classic vintages. It can predict which films will win the Oscars. It can guide robots and munitions. Once the advertising agencies see the harnessed power of Big Data, Data Mining and Machine Learning, they will convert quickly, and eventually those players operating in the old paradigms will die. It is only a matter of time.
It's a great time to be alive if you are into Machine Learning, Artificial Intelligence, Data Mining and Big Data.
Re-Booting & Reforming Democracy With Big Data ~ The Box Carries the Vox
Winston Churchill stood in the British Parliament and spoke the following words:
"Many forms of Government have been tried and will be tried in this world of sin and woe. No one pretends that democracy is perfect or all-wise. Indeed, it has been said that democracy is the worst form of government except all those other forms that have been tried from time to time."
When the American Founding Fathers created a unique concept of Western liberal democracy, it was in fact a great experiment enshrining concepts operating under the principles of liberalism: protecting the enshrined rights of the individual; fair, free, and competitive elections between multiple distinct political parties; a separation of powers into different branches of government; the rule of law in everyday life as part of an open society; and the equal protection of human rights, civil rights, civil liberties, and political freedoms for all persons (paraphrased from Wikipedia).
However, the mechanism that they put into place to administer this democracy was very much a kludge, a compromise to best accommodate the will of the people, taking into account the pragmatic aspects of their place and time in history.
Something has happened between then and now. Nowadays, the will of the people is being subverted and distorted by political partisanship, and not for the good of the country and the people. The gridlock and dysfunction in the American Congress is a prime example (if con is the opposite of pro, then is Congress the opposite of progress?). And in the American Senate, when they do the roll call, half of them answer "Not guilty!". You don't have to go far to find examples of how the mechanism of government fails to democratically represent the will of the people. The idea of thousands of people being represented by a person whose vote and interest can be bought by a business lobby somehow sucks all of the air out of the room of democracy.
Well, times have changed since 1776, but the ways of government have not. With the advent of the technological age, it is time to update, enhance, and empower the forces of democracy through the judicious application of technology, communications and data management.
I still believe that political parties are necessary. As human beings we will always have ideological differences, and no matter how batsh*t crazy some people are, they still have a right to vote and express their opinion. Churchill also quipped that the best argument against democracy is a five-minute conversation with the average voter. So you will get the weirdos who think that owning an assault rifle will protect them from a drone strike when their elected, democratic government chooses to attack them in their bunker, amid the ten years of rice stocks mixed with prepper gadgets stored on the shelves therein. There will always be the snake-handlers, the wanna-be polygamists, the Ovary Overlords who want to legislate women's reproductive rights, and the folks who want to throw out the science curriculum in schools and replace it with learned treatises on Adam and Eve domesticating the dinosaurs. All of these have a right to a voice in the democracy.
Political parties also define policy, which is important in government. Policy is the course by which the government steers. We don't want a rudderless ship, so we still need legislators to debate policy. But when they come up with legislation specific to policy implementation, I want my direct say.
In the days of 1776, it took a week to get from Philadelphia to New York. There were no telephones. You couldn't track people down on the farm for their views. Times have changed. We have Big Data. We have the technology and communications tools to hear from everyone. We have the infrastructure to empower all voices. With computer data collection, we can collect hundreds of millions of pieces of data in minutes. And we can machine-collate them in real time.
So, what if everyone with a Social Security Number had a private encryption key? Whenever legislation came up for a vote, we could all vote on it. Vox populi. The voice of the people can speak and be heard. The legislation would be put to a vote, and we the people would respond. We could all directly vote on the legislation and the laws that affect us. Being digital in this age has put a voice to the voiceless and nameless. Data Science can be our rescuer and our salvation.
This would make it harder for big business and lobbies to affect democracy. They would have to convince entire populations of their point of view, and in doing so, they would have to make it in the interest of the population. It would be the great leveling ground in the current incarnation of democracy.
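As a sketch of the mechanics -- and only a sketch, since a real voting system would need public-key signatures, ballot secrecy, and far stronger protections -- each voter's key could sign a ballot that anyone holding the registry can verify. Everything here, including the standard placeholder SSN, is hypothetical:

```python
import hashlib
import hmac
import secrets

# hypothetical registry: each Social Security Number holder is issued a key
# (the SSN below is the conventional placeholder, not a real number)
voter_keys = {"123-45-6789": secrets.token_bytes(32)}

def cast_vote(ssn, bill_id, choice):
    # sign the ballot with the voter's private key
    msg = f"{bill_id}:{choice}".encode()
    sig = hmac.new(voter_keys[ssn], msg, hashlib.sha256).hexdigest()
    return {"ssn": ssn, "bill_id": bill_id, "choice": choice, "sig": sig}

def verify_vote(ballot):
    # anyone holding the registry can check the ballot was not tampered with
    msg = f"{ballot['bill_id']}:{ballot['choice']}".encode()
    expected = hmac.new(voter_keys[ballot["ssn"]], msg,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, ballot["sig"])

ballot = cast_vote("123-45-6789", "HR-1234", "yea")
print(verify_vote(ballot))  # → True
```

Scaled to hundreds of millions of keyholders, tallying becomes a straightforward data collection problem -- which is exactly the kind of problem Big Data infrastructure already solves.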
Do we have the guts to change the way that we enact democracy? We still have a Digital Divide, where a significant portion of the population doesn't participate, and cannot participate in digital online life. We have education issues. We have a built-in inertial brake for radical change. The people who benefit from the current state of dysfunction want to keep it that way. So it will be an uphill battle, but Big Data can reform democracy and put the power back into the hands of the people, where it belongs.
Back to Winston Churchill one more time to close my argument. Upon being offered the Order of the Garter after a particularly humiliating defeat in the election of 1945, he said, "Why should I accept the Order of the Garter, when I have already been given the Order of the Boot?" It is time to give the old, tired mechanisms of democracy the boot.
Data Mining And Ethics - Revenue Over Rights?
I was reading a WIRED magazine article where the author said that the liberal arts degree on his resume was viewed the same way as a face tattoo in a job interview for an investment banking position. However, a liberal arts degree is probably the qualification for an up-and-coming job title at data mining companies -- the CEthO, or Chief Ethics Officer. Once data mining and machine learning get to the next level, coupled with the Internet of Everything, there will be huge privacy and ethics issues to contend with.
The big question of ethics that will arise is: "Is it okay to make money off information about people that I glean from mining my data?"
Several far-fetched but not-so-far-fetched scenarios come to mind. I am reminded of the data mining done by Target when they deduced that a 15-year-old girl was pregnant from her purchases of face cream combined with a certain brand of vitamins. Suppose that I was a data miner for a drug store chain, and I could find a strong correlation between a person buying certain antacids and, a few months later, being diagnosed with an ulcer requiring expensive stomach surgery. A health insurance company would be highly interested in knowing that. Should we sell the information that we discover about people? It would be an incredibly lucrative revenue stream.
Ethics was never a question in the good old days of business. McDonald's grew their fast food empire by putting toys into their Happy Meals and creating an obese America by targeting and hooking the children. One nutritionist noted that Chicken McNuggets in a Happy Meal were nutritionally worse than deep-fried cake. At the time, it was seen as a slick marketing move.
Data mining is at the same stage that McDonald's was fifty years ago. It is a solution looking for a problem to monetize. It is like the Wild West, and will be until legislators, sober second-thought minds and ethicists add some ground rules to the field. I personally know of a data miner who set up in a Caribbean jurisdiction that has little to no privacy law. Big companies ship him their data, complete with personal information, and he ships back every bit of intelligence that he finds. His revenues are out of this world. It is almost the ethical equivalent of having consumer products made in the Third World by child labor. I can see the day where data becomes a commodity that is bought, sold and traded. There will be export laws on data -- especially data with personally identifiable information in it.
But there are ethical questions closer to home. Suppose that my employer expects me to mine data, and I discover an untapped revenue stream that is extremely easy to exploit. Do I tell my employer, or give my notice and create a start-up to exploit that situation? What is the ethical course of action?
We have a long way to go with applying ethics to data mining. Ethics is a lot like beauty -- it is in the eye of the beholder. I am reminded of a story that a shopkeeper told regarding ethics. He said, "I was closing the store when a customer came in at the last minute and made a large purchase with a hundred dollar bill. After he had left and I closed the shop, as I was counting the money, I noticed that there was actually another hundred dollar bill stuck to the bill that the last customer had tendered. Immediately a question of ethics arose: did I have to tell my business partner or not?"
And that perfectly explains why we need some ethical boundaries in data mining.
The Fundamental Problem With Big Data Mining -- Missing Knowledge and ROI
Think back to high school calculus class. Calculus is the branch of mathematics invented by Isaac Newton (and Leibniz) that lets you do amazing things. If you take the derivative of distance, you get velocity and if you take the derivative of velocity, you get acceleration. You can go backwards and take the integral of acceleration to get velocity and do the same to velocity and get distance.
The reason that I bring this up is that the process of differentiation and integration is an inverse analogue of what happens in the mining of Big Data. Big Data generally starts with a data entry -- a single point. That data entry (usually a column in a row of a database), in conjunction with other entries, is integrated to create a fact. Facts are integrated with other facts to become information. Information, integrated with other information, becomes knowledge. Mining Big Data usually stops at the information stage.
Knowledge is an ontological map of combined information, creating both abstract and concrete ideations -- an amalgam of fact, belief, prediction, concepts, ideals and metaphors that gives a basis of understanding about any situation, object, proposition or relationship. The utilization of Big Data is nowhere near creating knowledge from its sources. It just creates the building blocks of knowledge, without any underlying understanding of the wherefores and whys.
That ability alone is amazing in itself, but it is not enough. Let me reiterate that I am not talking about Machine Learning (although that can be put into the mix in the future). Data mining is done by a person, so you have an actual brain driving the process of trying to make sense of a huge pile of data. As such, you can have some intelligent advantage over machine learning.
Information gleaned from Data Mining can be extremely useful to any enterprise. However, not all data is created equal: dirty data creates a lot of noisy correlations and just plain wrong information, and some data is simply not that rich in information potential. But even the best of datasets can produce spurious correlations.
Without background knowledge, spurious correlations have no value. As an example, one graph from http://www.tylervigen.com/ shows that US spending on science, space, and technology correlates with suicides by hanging, strangulation and suffocation.
You need knowledge behind the mining to find intelligent, meaningful relationships. Here are other examples of absurd correlations:
- Number of people who drowned by falling into a swimming pool correlates with the number of films Nicolas Cage appeared in
- Per capita consumption of cheese (US) correlates with the number of people who died by becoming tangled in their bedsheets
- Divorce rate in Maine correlates with per capita consumption of margarine (US)
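It is easy to manufacture such correlations yourself: two independent random walks will routinely show a strong Pearson correlation despite having no relationship at all. A small demonstration, with no external libraries (the series names are just invented labels):

```python
import random
from math import sqrt

def pearson(xs, ys):
    # plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def random_walk(n):
    # cumulative sum of random steps: a trend with no meaning behind it
    x, out = 0.0, []
    for _ in range(n):
        x += random.gauss(0, 1)
        out.append(x)
    return out

random.seed(0)
cheese_consumption = random_walk(20)   # invented stand-in series
bedsheet_deaths = random_walk(20)      # invented stand-in series
r = pearson(cheese_consumption, bedsheet_deaths)
print(round(r, 2))  # frequently far from 0, despite zero causal link
```

Run it with different seeds and you will regularly see |r| above 0.5 -- which is exactly why a correlation coefficient without background knowledge proves nothing.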
But the biggest fundamental problem with Big Data is that, lacking background knowledge, you could be finding local information and nuggets that do not translate into a universal picture, or vice versa. For example, a colleague told me about mining big data to determine the best days to run special promotions on dairy products for a small supermarket chain. When the data was dimensioned by day-of-week, there was a statistically significant decrease in demand for dairy products on Fridays. The data showed that dairy sales were way down that day, and past promotions had not worked as well as projected.
What the data mining was missing was that the dip in the sales curve was due to local conditions. Data was aggregated from all of the stores into a central database, and the sales database was mined. However, the chain had many locations, and in some of them a major dairy was co-located in the same city. That dairy had retail operations only on weekdays, and on Fridays it would run huge sales to clear its stock before the weekend closure. At that subset of locations, folks would buy their dairy products at huge savings at the dairy on Fridays, and it was enough to skew the aggregate data past the point of statistical significance. Yet the statistical significance did not negate increased sales overall, and the data mining exercise appeared to be a failure when the null hypothesis was tested. The issue would have been avoided if the data had been dimensioned by geographic location -- but who would have thought that a supermarket's demand for milk on a Friday in Cincinnati would differ from its Friday dairy demand in Steubenville?
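The fix my colleague needed can be sketched in a few lines: the same records, grouped once by weekday alone and once by (location, weekday). The numbers are invented to mirror the anecdote:

```python
from collections import defaultdict

# invented sales records mirroring the anecdote: (location, weekday, dairy sales)
sales = [
    ("Cincinnati", "Thu", 100), ("Cincinnati", "Fri", 40),
    ("Steubenville", "Thu", 98), ("Steubenville", "Fri", 95),
]

def mean_sales(rows, key):
    # average sales grouped by whatever dimension(s) `key` extracts
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: sum(v) / len(v) for k, v in groups.items()}

# dimensioned by weekday only, the Friday dip looks chain-wide
by_day = mean_sales(sales, key=lambda r: r[1])        # {'Thu': 99.0, 'Fri': 67.5}
# dimensioned by location as well, the dip is isolated to one city
by_loc_day = mean_sales(sales, key=lambda r: (r[0], r[1]))
```

The aggregate view averages a Cincinnati anomaly into a chain-wide "trend"; the second grouping pins the dip to the one city with the co-located dairy.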
To recap, the fundamental problem with Big Data is that you get information, but rarely do you get knowledge. You know the who, where, what and when, but you don't know the why. And that may be the Achilles' heel of a lot of Big Data projects that do not deliver the promised Return On Investment (ROI).
In many companies, you have to mine Big Data to improve the bottom line. If you find spurious correlations, or information that is just plain wrong in either a local or universal sense, then the exercise has been a waste of money, time and resources.
So how do you map a Big Data Project to a reasonable ROI, especially since a fundamental flaw of mining Big Data is missing background knowledge? The answer lies in Total Data Point Dimensioning, coupled with Dynamic Cognitive Modeling, and the subject of a future blog post on how I see it evolving and the toolsets necessary for it. Stay tuned.
How To Be The Next Big Data, Machine-Learning Millionaire (in 3 Easy Paradigms)
I have a special talent. I make other people rich -- extremely rich! The first time that it happened, was in the early 1990's. I invented a new type of golf tee. The lawyer that I hired patented it under an umbrella proxy over which he had power of attorney. He said it was necessary for the financing of it. It was a long story, but I never saw a dime. He is retired in Turks & Caicos. It was particularly painful to find one of my designs on the golf course, now that the patent is expired.
The second time, I made a pile myself. It was during the tech boom, and the tech crash took us out with the speed of a tsunami. The third time was when I was consulting as a technical architect to a G8 government. We were sitting in a scrum, and one of the team members mentioned that the telecom giant Nortel was trading at 75 cents a share. A few short months earlier, it had been at $130 per share. This team member said that it might be worthwhile to throw ten grand at it. I said "yeah, yeah, let's do it" and promptly forgot. A young programmer on our team believed in my endorsement of the stock and threw much more than that at it. He got out when it reached $16 a share. Do the math. Nortel eventually collapsed, but our intrepid friend made such a pile that he bought a BMW and never got out of his pajamas for the next few years.
The last time I made someone rich was during an idle conversation with an elderly Manhattan-based writer last May (May 2014). He was a meditating Buddhist who lived simply and had a pile of cash to bet on the next big thing. He asked me what the next big thing would be. I told him that it would be the Internet of Everything.
He asked me who would be the big player in the Internet of Everything. I told him that Sierra Wireless (stock symbol SWIR) had foundation patents and the potential to be the next Google or Apple. Since May (a short nine months ago) he has doubled his money. He thinks that I am a genius. Needless to say, I didn't get on the ride with him.
And now, you too can benefit from my largess and become a millionaire in the field of Big Data and Machine Learning. You can do it in three easy paradigms.
Paradigm 1: Write a universal, lightweight sensor data transfer protocol. Use JSON or XML. It is dead easy. And actually, you don't even have to do it -- I did it for you in this blog entry! For the ultra-lazy putative millionaire, here is an example:
For that, I propose my handy-dandy XML-based Universal Sensor Transfer Protocol -- except instead of XML, it is STML, or Sensor Transfer Markup Language. Here is what it looks like:
<?xml version="1.0" encoding="utf-8"?>
<sensor>
<name>Caliente Temp Sensor</name>
<serial_no>000-000-001</serial_no>
<units>degrees</units>
<scale>Fahrenheit</scale>
<reading>65.9</reading>
<timestamp>22/10/2014:20:26</timestamp>
</sensor>
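To show how little work the consuming side is, here is a minimal Python sketch that reads such an STML payload into a plain record. The element names follow the example above; the parser itself is purely illustrative:

```python
import xml.etree.ElementTree as ET

# Hypothetical STML payload (element names taken from the example above;
# the XML declaration is omitted since fromstring doesn't need it).
payload = """
<sensor>
  <name>Caliente Temp Sensor</name>
  <serial_no>000-000-001</serial_no>
  <units>degrees</units>
  <scale>Fahrenheit</scale>
  <reading>65.9</reading>
  <timestamp>22/10/2014:20:26</timestamp>
</sensor>
"""

def parse_stml(doc: str) -> dict:
    """Parse one STML sensor reading into a plain dict."""
    root = ET.fromstring(doc)
    record = {child.tag: (child.text or "").strip() for child in root}
    record["reading"] = float(record["reading"])  # the one numeric field
    return record

record = parse_stml(payload)
print(record["name"], record["reading"])  # Caliente Temp Sensor 65.9
```

A JSON flavor of the same protocol would just be this dict serialized with `json.dumps`.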
Paradigm 2: Use open source tools like Apache Tomcat and MySQL to write a RESTful service that pops all of the sensor readings into a database.
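A minimal sketch of the storage layer such a RESTful service would call, here in Python with SQLite purely for illustration (the table layout and function name are invented):

```python
import json
import sqlite3

# In-memory database standing in for MySQL; a real deployment would
# swap in a proper connection. Schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE readings (
    serial_no TEXT, units TEXT, scale TEXT, reading REAL, ts TEXT)""")

def store_reading(body: str) -> None:
    """What a POST /readings handler would do with the request body."""
    r = json.loads(body)
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
        (r["serial_no"], r["units"], r["scale"], r["reading"], r["timestamp"]),
    )
    conn.commit()

store_reading('{"serial_no": "000-000-001", "units": "degrees", '
              '"scale": "Fahrenheit", "reading": 65.9, '
              '"timestamp": "22/10/2014:20:26"}')
print(conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # 1
```

The REST framework (Tomcat servlet, Flask route, whatever) is just a thin wrapper around that one insert.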
Paradigm 3: Using your favorite machine-learning platform, input the data and train the living crap out of it, preferably in real time, to make ultra-smart houses, ultra-smart factories, ultra-smart utilities, etc., etc. Everyone will want one of your platforms, because the system will be fire-and-forget, as the military guys say of intelligent systems. The machine will learn what is normal, call someone when it ain't, and send back feedback to optimize whatever the sensor controls, making life smarter, easier and better. It will save everyone time, energy and human work hours. AND IT WILL MAKE YOU FRIGGIN' RICH.
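The "learn what is normal, call someone when it ain't" step can be sketched with a simple statistical baseline. This Python toy flags readings more than a few standard deviations from the learned norm; it is an illustration, not a production detector:

```python
from statistics import mean, stdev

def is_anomaly(history, value, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations away
    from what the sensor's history says is normal."""
    if len(history) < 2:
        return False  # not enough data yet to know what normal is
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

normal = [65.1, 65.9, 66.2, 65.5, 65.8, 66.0, 65.4]  # learned baseline
print(is_anomaly(normal, 65.7))   # False: within the learned band
print(is_anomaly(normal, 92.3))   # True: time to call someone
```

A real system would retrain the baseline continuously as readings stream in.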
And here is the disrupter idea for the disruptive idea:
The cutesy coder guys will offload the training to the cloud and push the results to a smartphone.
There you go. You are welcome. This idea is a sure-fire winner to make you a millionaire. I would do this project myself, except that I am too busy being Chief Technology Officer of our company. Also, I am working on a recreational pharmaceutical company with a new designer drug offering. We are combining birth control pills with LSD so that you can take a trip without the kids. Oughta be a slam-dunk as well!
Oh, and be sure to sign up in the box to the lower right for my occasional non-obtrusive emails with further app ideas, cogitation on Deep Learning and AI, and futurism thoughts on tech. There will be a few monetizable ideas there as well.
Conquering The Time Domain in Marketing With Big Data & Analytics
Our platform sells big-ticket items -- it remarkets and wholesales used cars. The supply chain is well defined. A new car dealer takes in a car on trade. He really doesn't want to, because most used cars are not moneymakers. If a trade-in sits on the dealer's used car lot forever, it loses money instead of making it, because the new car inventory underlying that trade-in is usually financed. To complete the deal cycle of used car trade-in -> new car purchase -> used car sale to recoup the money, the used car has to sell quickly.
Secondhand car dealers in small markets are experts at what sells and for how much, and what the market is willing to pay. They have intense local knowledge of their geographic domain. A lot of the time, new car dealers do not have that expertise and/or knowledge.
Coupled to this fact is that, in spite of the parameters of make, model, year and condition, there is no uniform valuation for a used vehicle. It varies by time of year, color, geographical location, local economy and a million and one other factors. Folks like Black Book try to standardize valuation for the process, but at best they offer only a rough guide based on auction prices around the continent.
As we have shown in this article, the Black Book paradigm of gleaning value from auctions is not accurate because up to two-thirds of all vehicles are remarketed through relationship-based wholesaling, and never hit the auction floor.
Coupled to that, there is no "real price" for any used vehicle. What a vehicle sells for is based on what the new car dealer has in it (a combination of what he thinks the vehicle is worth and the discount he allowed on the new car bought with the trade-in). A good example: on our platform recently, a dealer had $9,000 in an SUV. That was the reserve price he put on the vehicle, because that is what he needed to make the deal profitable. He let market forces dictate the ultimate price, but he needed $9,000. The SUV sold for $27,000 in the fair and equitable marketplace on our platform. So what was the vehicle worth? It was worth $9,000 to one person and triple that to another. This is why we introduced crowd-sourced valuations into our platform.
But there is one other element in marketing that transcends specific sectors, and that is the time element. Currently, a light manufacturer will do a run of product, and try to flog it off to wholesalers, retailers, online markets etc. It costs money to hold the product in inventory.
Technology such as 3D printing and print on demand for digital books alleviates some inventory build-up, but generally the time domain is huge in merchandising and marketing. What I mean by that, is that inventory is built up, and disposed of over time at ever-changing prices based on supply and demand. There is a measurable, considerable cost to storing inventory.
As pointed out in the automobile remarketing industry that we are in, the domain of time is a negative one. The longer an item stays in inventory, the less it is worth, and the larger the drag on the bottom line. Positive revenue stream is based on timely sales.
To conquer the time domain, we used Big Data to our advantage. We coupled it with our relationship-based sales paradigm described in the above link, and as it turns out, the piece of technology was patentable, and we have foundation patents pending in that area.
This is how it works. The whole idea is to move inventory quickly. We have mapped the buyer/seller network relationships (a social network media type of construct) with trusted buyer zones based on previous commercial relationships. This is the first step in the process that we have created. The product is offered to this trusted network group for a limited time (in our case, four hours is a norm). If the product does not sell, what then? As the clock ticks, money is lost.
The second step involves Analytics. We use Big Data to find in our customer base, and in other databases, who is the best and most frequent kind of buyer for this product. The machine assembles a top-10 list based on a proprietary algorithm of sifting through Big Data, and offers it to that ad hoc group of buyers for a limited time. The really nice part, is that once buyers find out about the top ten, we have a potential revenue stream where they will pay for early market information and a chance at a deal.
When that time expires, the platform has the smarts to move the inventory to the next phase of selling. In our case, it goes to general auction to the open group of buyers, and if that fails, the platform has the technology and ability to transfer the inventory to a classified type of listing.
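The escalating phases described above can be sketched as a simple state machine. The tier names and time limits here are hypothetical stand-ins, not our actual platform values:

```python
# Ordered sale phases: each tier gets a time window, and when the clock
# runs out the platform escalates the inventory to the next tier.
TIERS = [
    ("trusted_network", 4 * 60),   # trusted buyer zones, ~4 hours
    ("analytics_top10", 4 * 60),   # best-fit buyers found via Big Data
    ("open_auction",    24 * 60),  # general auction to the open group
    ("classifieds",     None),     # open-ended classified listing
]

def next_phase(current: str):
    """Return the phase the platform escalates to when time expires,
    or None once the pipeline is exhausted."""
    names = [name for name, _ in TIERS]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None

print(next_phase("trusted_network"))  # analytics_top10
print(next_phase("classifieds"))      # None: end of the pipeline
```

The competitive point is the ordering: competitors start where this pipeline ends.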
Our competitive advantage, is that we have conquered the time domain with relationship-based social network selling for the first step, and the use of Big Data for the second step. Our competitors use the third step as their first step.
Big Data has a huge advantage in conquering the time domain. Suppose as a manufacturer, or even a retailer you had a platform to sell all of your inventory in a specified time-frame. With a platform such as ours, adapted to other fields, you could commoditize your inventory, and using relationship-based selling coupled with Big Data, you could have your inventory dispersed just as it was about to leave the factory floor, or arrive on a shipping dock. Big Data will even tell you how much inventory to order and make.
Merchandising and selling will all change drastically in the next few years, and those that don't adopt the Analytics/Machine Learning paradigm, will bite the dust.
An End To Dangerous Big Data Stalking
You are being stalked. Every website that you visit may add a stalker in the form of tracking cookies to your browser. They know where you have been. And with just a modicum of inference they know who you are.
This web tracking is pervasive. It all goes into a big database. If for some reason, you enter your name on a form, and the form is transmitted to the website in what is known as an HTTP Post, they will harvest your name. But even without your name, they will know what demographic you belong to. They will know your financial standing and how much you earn. They will know what music you listen to and what clothes you buy. And all of this information is processed without the benefit of human eyes sorting and classifying this data. Machine Learning is pervasive.
But here is what is most dangerous about these stalkers. They can make the wrong inference, and put you on a watch list that may be impossible to get off, or you may not even know about. Here is a scenario that could make you a terrorist according to Big Data and Machine Learning.
You are sipping your morning coffee looking at Facebook, and you see a heartbreaking picture of a child caught in the clutches of war in the Middle East. You "Like" the photo. Then it is time for you to go to the airport. You are flying business class and are given a choice of food. There are Halal meals. You are an adventurous foodie, so you tick one to try it. Coupled to that, you have an aisle seat. Then you check your Twitter feed. Someone posts about "Freedom of Religion", and you favorite the tweet. In the business section of a European website, you see the ad for a hedge fund that promises great returns. You click for more information. What you don't know is that you have put the Big Data Digital Stalkers into overdrive, and you are now a person of interest to several agencies.
As it turns out, the photo that you "Liked" was posted by a terrorist group to garner sympathy. All of the "Likes" are collected as possible links to these terrorists. You are in another database because you chose Halal food instead of the bacon cheeseburger. The aisle seat is problematic. Hijackers do not take window seats. The "Freedom of Religion" tweet was sponsored by the Muslim Anti-Defamation League. Into another database you go. The hedge fund promising great returns is headquartered in the Cayman Islands. The IRS is suddenly interested in you.
The most dangerous thing about Big Data Stalkers is that they make Bayesian inferences, which are probabilities. Probabilities are just that; they are not certainty. Even with a 99% probability, the next event in the sample space could be the one the probability did not predict. Machine Learning and Big Data Stalkers are a clear and present danger to personal privacy.
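A quick Bayes' theorem calculation shows just how badly probabilistic flags misfire. Assuming a hypothetical base rate of one genuine threat per 100,000 people and a model that is "99% accurate" both ways, the vast majority of flagged people are innocent:

```python
def p_guilty_given_flagged(base_rate, sensitivity, false_positive_rate):
    """Bayes' theorem: P(guilty | flagged)."""
    p_flag = (sensitivity * base_rate
              + false_positive_rate * (1 - base_rate))
    return sensitivity * base_rate / p_flag

# Hypothetical numbers: 1 in 100,000 is a genuine threat; the model
# catches 99% of them and falsely flags 1% of everyone else.
posterior = p_guilty_given_flagged(1e-5, 0.99, 0.01)
print(f"{posterior:.4%}")  # roughly 0.1%: over 99.9% of flags are wrong
```

The rarer the behavior being hunted, the worse the false-positive flood gets.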
The other intrusion on your life from Big Data Stalking is the tracking done by commercial enterprises. They aim to learn absolutely everything they can about you, because they can sell that data. Big Data can produce new or enhanced revenue streams. Is there a way out of this?
I say that there can be. With a paradigm shift, the consumers of Big Data can get what they want, and your privacy can be protected. How you ask? With a little dash of technology.
Let's suppose that you turn the tables and consent to limited data tracking. That tracking is bowdlerized, meaning that sensitive personal stuff is obfuscated or removed by an app on your device -- cell phone, tablet or computer. The data is then sold to the highest bidder, and you are paid for it. Everyone is happy, and you, the consumer, benefit from the data collection.
As for the other stuff, technology can help too. I am a huge proponent of Artificial Intelligence. Suppose that you had a proxy digital assistant called Blocker. Blocker would surf the web for you, executing your Likes and Dislikes while retaining your anonymity. Blocker would run on a proxy service, so even IP addresses would be hidden. On top of that, it would surf in anonymous mode. If there were no personal user data to be had, your privacy would be protected. The data flow wouldn't be entirely impeded, because through content analysis one could still make pretty good inferences about the humans behind any wall. For example, a grandma living in Norway probably isn't listening to rap music, but her grandson might be.
So, with a bit of different thinking, we can mitigate the dangers of Big Data Stalkers. The unfortunate thing is that many denizens of the Internet don't know, or don't care, about the Stalkers.
The End of the Big Data Fad ~ Introducing Data Flow
If I were a venture capitalist, which one day I hope to be, I wouldn't fund companies and start-ups that process reams and reams of Big Data or Dark Data. Big Data, as we know it, is a flash in the pan, and it will disappear just like the Atari.
Yes, we will have the Internet of Everything generating more data than anyone can handle. Yes, we will have data generated by every single one of the billions of humans inhabiting this planet. Yes, we will have data generated by the trillions of devices that are or will be connected. But gathering huge lots of data and then batch processing it is an unsustainable model.
When I was just starting out in adult life, one of my neighbors was a draft dodger from the Vietnam War. He was, and is, a pacifist in the positive sense. He didn't trust the motives of the US government after the McCarthy communist witch hunts, and his buddies were dying in foreign jungles for an unfathomable reason, in a war they couldn't win. So he came to Canada. He had a degree in civil engineering, but he landed in Silicon Valley North and started working for a start-up. It was an exciting time. The computer was becoming ubiquitous, and almost every industry was crying for some sort of computerization in an age when there weren't any off-the-shelf software packages.
He joined a start-up and his job was to write the software for a data digestor for a shoe manufacturing company. The company would do a run of shoe components (soles, uppers, large, small, all sizes, ladies, men's, children, brown, black, purple) and they didn't know how many parts of what. They generally ran a line until they ran out of component pieces, and then laboriously switched over to another make, color and size.
It was the perfect application for a data digestor. Every time a component batch was made, someone would swipe a pre-programmed card and the result would go to a collector, and then to a database that could be queried, and management could do some actual planning by matching what components they had and knowing the manufacturing run limit. Big data wasn't very big then, but it was still an issue.
After the data digestor was delivered, the start-up ran through all of their money including the shoe company money, leaving my friend as the last employee. His wife was a registered nurse so he lived off her income while he refined the data digestor. He worked for shares of the company abandoned by the other employees. He was paid his putative salary in shares and the shares were valued at ten cents. After a year of refinement and frugal living, the company was bought for its data digestor, and my old neighbor is wealthy to this day.
All this to say: Big Data is going to grow exponentially, and we can't handle it the way we do now, with old paradigms. Big Data has to be digested, mined and made sense of as it is created. It can't be allowed to accumulate; otherwise, by the time we get around to it, the effort will be akin to emptying the ocean with a teaspoon.
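The "digest it as it is created" idea can be sketched as an incremental aggregator that keeps only running summaries and never stores the raw stream. This Python toy is illustrative only:

```python
# A stream digest in the spirit of the shoe-factory digestor: each event
# updates running totals, so storage stays constant no matter how much
# data flows through.
class StreamDigest:
    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.maximum = float("-inf")

    def ingest(self, value: float) -> None:
        """Fold one new reading into the running summaries."""
        self.count += 1
        self.total += value
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count

digest = StreamDigest()
for reading in (3.0, 5.0, 4.0, 8.0):   # arrives one event at a time
    digest.ingest(reading)
print(digest.count, digest.mean, digest.maximum)  # 4 5.0 8.0
```

The point is the shape, not the statistics: per-event processing with bounded state is what Data Flow looks like, versus hoarding everything for a batch job later.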
Because these old paradigms are still in play, in the long term I would short the stock of SAP and SAS and all of the old-school players. A better way will come along, and just as the tire-cord industry disappeared when Michelin invented the radial tire, we will have a new something-else.
So if I were a venture capitalist, I would bet my money on creative, innovative paradigms, for men and machines, that make sense of data as it is created. If we don't develop a sure-fire, universal way of doing that, then in a few short years, even if every molecule of silicon in every mountain in the world were transmuted into semiconductor memory, it still wouldn't be enough.
A bigger philosophical question is, "How much of this data is valuable?". That question will be answered by clever minds who can monetize pieces of it. The economic world is very Darwinian.
As for Big Data, we will bury that sucker. Its child will live on, though, and we will call it Data Flow. Data Flow software will be ubiquitous and highly necessary.
Big Data Retro Art
You don't see much Big Data art, so I decided to create some out of retro comic art. I then tweeted the above pic on my Twitter account, and it was immediately favorited and re-tweeted, even though it was in the middle of the Patriots/Ravens post-season game. Enjoy!
Process Mining & Me
Image copyright by Professor Wil van der Aalst, Eindhoven University of Technology
I am taking process mining to a different arena, using the basic methodology and event logs. I understand the necessity for well-defined processes in relation to things like ISO 9001, quality management and the achievement of Six Sigma. I started my career as a circuit designer in a military electronics shop with a major designer/manufacturer, and not only did we have to have incredibly good yields from the fab shop, but the equipment we sold to NATO forces had to be reliable. Process improvement meant saving time, money and resources and creating optimal performance.
However I have moved on and now I am using process discovery in the opposite sense -- in e-Commerce to improve revenue streams. Essentially, we have a captive platform where self-identified industry insiders buy and sell to each other on a wholesale level. Our platform has several areas where our clients spend time. They can create trusted buyer zones with their circle of buyers and sellers (platform enabled geo-location). They can create packages and offer them for sale to platform escalated groups. They can invoke software robots to buy and sell for them. They can offer and buy and sell from classified listings. In short, we want to map the processes of how our customers use our platform, and hence optimize the UIX or User Interface Experience to maximize revenue.
We have event logs and timestamps for everything, from when customers log in, to when they change their buyer/seller groups, consign inventory, make offers and counter-offers, or browse the listings. However, the event logs and timestamps are not located in one database table. The challenge was to create an effective case id to tie the disparate event logs together. Luckily, our platform is built on Java classes, Java Server Pages, Facelets, Taglets and the whole J2EE environment. As a result, we have a discrete, serializable session, and by simply stamping each event in the disparate event logs with the cached, system-generated session id, we will have created a powerful customer-analysis tool on our platform.
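The session-id stitching can be sketched as follows. This Python toy (our platform is actually J2EE; the log and field names here are invented) merges per-feature event logs into timestamp-ordered case traces keyed by session id:

```python
# Two disparate event logs, each already stamped with the session id
# that serves as the process-mining case id.
login_log = [
    {"session": "s1", "ts": "2016-01-05T09:00", "event": "login"},
]
offer_log = [
    {"session": "s1", "ts": "2016-01-05T09:12", "event": "make_offer"},
    {"session": "s1", "ts": "2016-01-05T09:03", "event": "browse_listings"},
]

def build_case_log(*logs):
    """Merge per-feature event logs into per-session traces,
    ordered by timestamp (ISO strings sort chronologically)."""
    cases = {}
    for log in logs:
        for e in log:
            cases.setdefault(e["session"], []).append(e)
    for trace in cases.values():
        trace.sort(key=lambda e: e["ts"])
    return cases

trace = build_case_log(login_log, offer_log)["s1"]
print([e["event"] for e in trace])
# ['login', 'browse_listings', 'make_offer']
```

Once every event carries the session id, a process-discovery tool can read these traces directly.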
This will enable us to take things a step further. You have heard of responsive UIX designs that adapt to whatever device is using the platform. The process discovery outlined above will enable us to push the boundaries of responsive design to create a machine-customized UIX that facilitates customer behavior on our platform to maximize revenue stream. Each customer will have a process map based on past behavior, and that process map will generate the UIX with a custom menu that is different for each customer type.
Our previous data mining looked for relationships between product groups and buying behavior. It looked at time-domain factors and essentially all sorts of data dimensions, using Bayesian inference on the interrelationships between those dimensions to enhance revenue stream.
I realize that this doesn't exactly fit the accepted semantics of what a process is in the context of this course, but in a larger sense we are discovering the buying process, or the behavior processes, on a trade platform that lead to facilitating buying behavior in our users. It adds event processing to our data mining, and this is where this course adds value for me.
The Future of Big Data
A brand new type of derivative of the future that will be traded like options and stocks.
Financial institutions like Goldman Sachs and others have a penchant for doing business solely to make money for the sake of making money. They were the ones that pushed toxic collateralized debt obligations into the banking system -- essentially a risky type of security in the sub-prime mortgage field that caused the economic meltdown of 2008.
There are other types of so-called securities, or derivatives, that bankers love to bet on to make money. Blythe Masters, a banker at JPMorgan Chase, invented the credit default swap. It is a derivative -- a synthetic, derived investment instrument used for hedging loans. The way it works is that when a loan is made, a buyer can purchase a type of bet that the loan will not default. As long as the loan doesn't default, the buyer pays a premium to the seller of the CDS. If the loan defaults, the seller must pay the value of the loan. You do not have to own the loan to participate. It is like one huge casino between financial institutions. Huge amounts swing on small margins.
Other financial derivatives are puts and calls which are a bet on whether a security (stock) will go up or down, and of course options are derivatives. These derivatives are traded like commodities in a secondary market -- meaning they are not connected to their underlying stock or companies.
The inventor of the CDS or credit default swap has been called the woman who invented financial weapons of mass destruction. The reason for this is that big money swings in the balance should an "event" occur which triggers a payout. Credit default swaps can be bought and sold as well as other derivatives.
The first to market with any type of new financial instrument is the big winner. Financial institutions cannot resist the urge to make money at any and every opportunity, and there is a big opportunity opening up.
I am here to predict the next type of derivative that will hit the market. It will be the BDD or Big Data Derivative. Big Data is defined as huge amounts of data that corporations generate. It can include machine-generated data, sales data, manufacturing data, personnel data, web visit data or any other kind of information stored.
Until the advent of fast processors, servers with almost unlimited capacity and bandwidth to burn, processing this data was almost impossible. Now there is a virtual deluge of data, and it can be extremely valuable. The cycle works like this: data is mined for information; information is integrated into knowledge; knowledge is used to generate money.
Data mining is a burgeoning field, and although there are formalized methodologies, it is still wide open; anyone knowledgeable in statistics and higher math can craft algorithms and formulae to tease valuable knowledge out of the data for profit. Just as financial institutions develop proprietary algorithms for computer trading, companies develop proprietary data mining algorithms.
I predict that Big Data will become a commodity, treated like precious metal ore. A company can choose to mine and refine it for its own revenue streams, or trade it like copper, cotton or pork bellies. The Big Data from one company can be valuable to a whole host of other companies. Even mundane data like machine cycles in a manufacturing environment can be processed for value engineering or economic modeling of new ventures. There are as many uses for processed Big Data as there are business endeavors. The companies that adopt the knowledge from Big Data will have a tactical edge over those who don't.
So what are some of the elements of the commerce side of Big Data? Someone will make a ton of money with classification algorithms. Other quants will come up with algorithms to value it, and it will remain the last bastion of true arbitrage, because one man's scrap is another man's gold. Some techno-freaks will invent classifiers built into edge databases or database engine structures designed for real time intelligence. There is a universe of possibilities.
There are so many possible money-making opportunities with Big Data, that derivatives will become standard instruments of trade and money-making. A white paper is currently being authored. To reserve your copy please send an email to DataPrivacy@mail.com.
The Dark Side of Big Data
There is a dark side to big data: personal privacy. There are obvious privacy risks in the accidental or intended disclosure of collected "hard" personal data, but to my way of thinking the real danger is derived or predictive data built with mathematical constructs like Bayesian inference. Using large datasets, these tools are melded into business intelligence cubes that work wonders in improving the bottom line, but they violate privacy in a fundamental way: they predict human behaviors from inferential probability that may have a large degree of error in individual cases, yet be useful enough on a macro scale to improve the bottom line. A good example of this is credit scores. Just because people who exhibit behaviors A and B tend to default on loans 55 percent more often than people who don't, it doesn't follow that everyone in that demographic will default, yet they are all judged as if they will.
The real danger of this predictive stuff comes from aggregators who combine predictive data with actual personal data and sell it to other companies. Judgements will be made that may be untrue, but may result in denial of things like college entrance, handgun ownership, club memberships, professional certifications and career choices (suppose that you are of a certain height, and the data says that people of that height do not do well in a particular professional sport -- yet we all know stories of the little guy who could), and other life events where some sort of body has authority over certain aspects of our lives.
One of the current thrusts of Big Data is to find non-intuitive behavioral predictors. For example, we have heard of Target department stores sending pregnancy coupons to a 15-year-old girl. Her parents threw a fit, until they discovered that their daughter was actually pregnant. Target figured it out using probabilities, having found that in a certain demographic, a correlation of beauty products and vitamins led to buying pregnancy items five months later. Supermarkets have long known to put beer and diapers together on a Saturday, and it results in a large increase in sales. (Wife sends hubby to the store for diapers, but the big game is on later in the weekend and hubby's buddies are coming over.) All this is fine and dandy while it happens on an anonymous level, but when this sort of predictive stuff is applied to identifying data, it could become dangerous.
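The beer-and-diapers effect is usually measured with "lift": how much more often two items co-occur than chance alone would predict. A toy Python sketch with made-up baskets:

```python
# Five imaginary shopping baskets; the numbers are invented to show the
# mechanics, not real retail data.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
]

def lift(a: str, b: str) -> float:
    """Lift = P(a and b) / (P(a) * P(b)); > 1 means the pair co-occurs
    more often than independence would predict."""
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    return p_ab / (p_a * p_b)

print(lift("diapers", "beer"))  # about 1.67: well above chance
```

A lift near 1 is noise; the dangerous part begins when lifts like this are computed over identified individuals instead of anonymous baskets.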
What is a CIO or CTO to do? To my way of thinking, the chief responsibility is to management, shareholders and the bottom line, and not to the privacy of the masses. Business is the last venue of civilized men for uncivilized warfare, and as a result, I am predicting a further erosion of privacy from Big Data. It is a force majeure, an unstoppable tsunami of assaults against our privacy that will rival any effort of the NSA or any other organization intent on cataloging the behaviors of the masses.