Anecdotal Observations about Artificial Neural Networks and Multilayer Perceptrons
I like experimenting. I have had a lot of fun trying to embed knowledge in the thresholds of a massively parallel artificial neural network, specifically multi-layer perceptrons. The field of artificial intelligence is a space where one can create magnificent experiments without the physical ramifications of things going very wrong very quickly, as they can in other experiments using high-energy explosives, raw high-voltage electricity or caustic, extremely fast exothermic reactions. Everything happens in the five pounds of laptop sitting on your knees, without smoke, blown fuses or having to call the fire department.
I created my own Java framework with each perceptron being its own object. I am told that object-oriented programming is the most advanced form of programming, and in my quest for a possible Nobel Prize, I have to use the most advanced tools available to me. My perceptrons were connected by axon objects which held the outputs and fed them into the next hidden layer. It was intended to be a fine example of silicon mimicry of noggin topography.
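A minimal sketch of that object layout looks something like the following. All class and field names here are my illustrative guesses, not the original framework's:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: each perceptron is its own object, and an axon object carries
// its output signal forward into the next layer.
class Axon {
    double signal; // the value carried to the next layer
}

class Perceptron {
    final List<Axon> inputs = new ArrayList<>();
    final double[] weights; // one weight per input axon
    double bias;
    final Axon output = new Axon();

    Perceptron(int numInputs) {
        weights = new double[numInputs];
    }

    // Weighted sum of the incoming axon signals plus bias,
    // squashed through tanh.
    void fire() {
        double v = bias;
        for (int i = 0; i < inputs.size(); i++) {
            v += weights[i] * inputs.get(i).signal;
        }
        output.signal = Math.tanh(v);
    }
}
```

Wiring one perceptron's `output` axon into another's `inputs` list is what chains the layers together.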
I didn't do anything fancy with the activation function. I did not use a Heaviside step function, even though I rather admire the works of the very eccentric Oliver Heaviside ( http://en.wikipedia.org/wiki/Oliver_Heaviside ), who showed that the mathematics involved in understanding Einstein's theories is less complex than the mathematics describing what happens inside an electric power cable. But je digress. I chose a sigmoid activation function, and there are two common ones:
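In their standard textbook forms, the two common sigmoids are the logistic function and the hyperbolic tangent:

```latex
\varphi(v) = \frac{1}{1 + e^{-v}} \quad \text{(logistic)}
\qquad\qquad
\varphi(v) = \tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}} \quad \text{(hyperbolic tangent)}
```

The logistic squashes into (0, 1); tanh squashes into (-1, 1).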
I chose tanh because, for the back propagation, I knew that the derivative of tanh is 1 - tanh^2, and that would save me some coding time on the weight adjustments. For the input to each node, I used the standard formula -- the fancy-schmancy-looking equation that merely states that the induced local field is the sum of all of the inputs multiplied by their weights, including a bias and a bias weight:
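Written out in the usual textbook notation (my symbols, not necessarily the original's), the induced local field of node j is:

```latex
v_j = \sum_{i=0}^{n} w_{ji}\, x_i ,
\qquad x_0 = 1 \;\text{(bias input)}, \quad w_{j0} \;\text{(bias weight)}
```

Folding the bias in as a fixed input of 1 with its own weight is what lets one formula cover both.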
The back propagation uses gradient descent, and the weight correction is given by the following (hence the need to know the derivative of the sigmoid function):
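In the standard delta-rule form (again, my notation), the gradient-descent correction is:

```latex
\Delta w_{ji} = \eta\, \delta_j\, x_i ,
\qquad
\delta_j = e_j\, \varphi'(v_j) = e_j \left( 1 - \tanh^2(v_j) \right)
```

where eta is the learning rate, e_j is the error at node j, and delta_j is the local gradient -- which is exactly where the derivative of the sigmoid comes in.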
which again is a very smart-looking, complicated-looking equation that adds some real provenance of superior intelligence to this blog entry. Any kindergarten prodigy could understand the relationship between the local gradient and the derivative of the activation (or squashing) function, used to nudge the weights toward correctness when the perceptron gives an answer as dumb as a doorpost, or as an NFL player hit on the head one too many times.
So having said all of that, these are my observations about multi-layer perceptrons:
1) Bigger is not better
It takes a hell of a lot of training epochs to get a network with a large number of hidden layers to move significantly toward a more correct output. I start getting better approximations to the training set more quickly with fewer layers. I naively thought that the more, the merrier and much smarter. Fewer layers get smarter quicker. There is a proviso, though. Deep layers, like Deep Throat, do eventually give more satisfying results. They give a measure of accuracy on very complex inputs. However, it takes a hell of a lot of training and spinning of silicon gears to get there. Moi -- I am the impatient type who likes results quickly even though, like the smoked bloaters that I bought yesterday, they are a bit off.
2) It pays to be explicit. Since I was using a sigmoid activation function, I figured that the hidden layers would act as some huge flip-flop or boolean gate array and magically come up with an answer with a minimum number of neurons. By this I mean: suppose that you had a problem with three inputs, and the hypothesis value of the outputs was three, two, one or zero. Since the output neurons can be activated or not, those four possible output values can be represented in binary by only two neurons (counting from zero to three in binary goes: 00, 01, 10, 11). I soon learned to be explicit. If you have four hypothetical output values, you should have four output neurons in the output layer -- one per value -- to minimize training epochs.
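The two target encodings, side by side (method names are mine, for illustration):

```java
import java.util.Arrays;

// Sketch of the two target encodings for a four-value hypothesis (0..3):
// compact binary versus explicit one-neuron-per-value.
public class TargetEncoding {
    // Binary encoding: two output neurons, e.g. 3 -> {1, 1}
    static double[] binary(int hypothesis) {
        return new double[] { (hypothesis >> 1) & 1, hypothesis & 1 };
    }

    // Explicit ("one-hot") encoding: four output neurons, e.g. 3 -> {0, 0, 0, 1}
    static double[] oneHot(int hypothesis) {
        double[] target = new double[4];
        target[hypothesis] = 1.0;
        return target;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(binary(3))); // [1.0, 1.0]
        System.out.println(Arrays.toString(oneHot(3))); // [0.0, 0.0, 0.0, 1.0]
    }
}
```

The explicit encoding costs more output neurons, but each one only has to learn "my value or not my value", which is what seems to cut the training epochs down.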
3) It pays to be a wild and crazy guy. With eager anticipation, I fired up my newly made artificial intelligence machine, expecting a Frankenstein equivalent of Einstein to machine-learn and do my tasks for me, and generally make me look brilliant. I figured that I was on the verge of artificial genius to enhance my brain capacity, which I already figured to be roughly the size of a small planet. So when it came to setting weights and biases, I either went with the integer 1 or a random integer, and figured that the back propagation would clean up and get me to the appropriate figure. Again, that would be the case if I had the patience to sit through a few million training epochs. When my perceptrons had the intellectual ability of Popeye the Sailor Man, I was sorely tempted to give up, until I started doing crazy things with the initial weights. In one case, I got satisfaction by starting with a value of 10^-3.
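A sketch of that initialization tweak -- small random values on the order of 10^-3 instead of 1s or whole random integers (names are mine, not the original framework's):

```java
import java.util.Random;

// Initialize weights roughly uniformly in (-0.001, 0.001). Small starting
// weights keep tanh in its near-linear region, so the early gradients
// are healthy instead of saturated.
public class WeightInit {
    static double[] smallRandomWeights(int n, long seed) {
        Random rng = new Random(seed);
        double[] w = new double[n];
        for (int i = 0; i < n; i++) {
            w[i] = (2.0 * rng.nextDouble() - 1.0) * 1e-3;
        }
        return w;
    }
}
```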
4) It pays to initially log everything. I was getting fabulous results in early testing with minimal training. I figured that I was well on the way to creating my own silicon-based Watson that would reside on my laptop at my beck and call. However, as the number of training epochs climbed, the weight correction stalled and the logs reported that the local gradient was NaN, or Not a Number. Several WTF sessions resulted, and it took the travail of logging everything to discover that the phi equation, where I was supposed to calculate the local gradient, had a mistake: I was squaring Math.tan instead of Math.tanh. It was disappointing to learn that a programming error initially added a lot of accuracy and amazement to my artificial intelligence machine, but as it progressed, it got dumber and dumber. I suppose that it's a good model for the intellectual capacity of a human being's Life journey, but that wasn't what I was aiming for.
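The bug in a nutshell (method names are mine; the derivative of tanh(v) is 1 - tanh^2(v)):

```java
// The one-letter bug: tan instead of tanh compiles fine and only
// misbehaves numerically, much later.
public class PhiDerivative {
    // What I had typed -- wrong: tan is periodic and unbounded,
    // so this can go hugely negative or blow up.
    static double buggy(double v) {
        return 1.0 - Math.pow(Math.tan(v), 2);
    }

    // The fix: derivative of tanh, always in (0, 1].
    static double correct(double v) {
        double t = Math.tanh(v);
        return 1.0 - t * t;
    }
}
```

A well-behaved derivative stays in (0, 1]; the buggy version happily produces values like -1.4, which is the kind of thing only a log file will show you.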
5) One of the most difficult things about using multi-layer perceptrons is framing the tool such that its functioning maps clearly to a hypothesis value, with a minimal amount of jumping through loops and hoops and a pile of remedial programming. In other words, to make the thing do real-life work (instead of constructing a piece of software that autonomously mimics an exclusive-OR gate, as most tutorials in this field do), you have to design it such that real-world inputs can be mapped into the morphology, topology and operating methodology of an artificial neural net, and the outputs can be mapped to significant hypothesis values.
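Even the humblest part of that mapping chore needs code. One hedged example (the feature and its bounds are invented for illustration): scaling a raw real-world reading into tanh's comfortable [-1, 1] range.

```java
// Linearly rescale a raw feature from [min, max] into [-1, 1],
// the range where a tanh-based net is happiest.
public class InputMapping {
    static double scale(double raw, double min, double max) {
        return 2.0 * (raw - min) / (max - min) - 1.0;
    }
}
```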
In other words, a high-school girl can code a functional multi-layer perceptron machine (and one has -- if you watch TED talks, she diagnoses cancer with it); however, it takes a bit of real work to make it solve real-life problems. But when you do, machine learning is one of the most sublime achievements of the human race. The machines achieve a level of logic that their carbon-based creators cannot. And that is why Dr. Stephen Hawking says that artificial intelligence poses a threat to mankind. I am not worried about the threat of artificial intelligence. I am more worried about the threat of a fanatic with a bunch of explosives strapped to his chest. It is only logical, and it doesn't take very many training epochs to figure that one out.