All Things Techie With Huge, Unstructured, Intuitive Leaps
Showing posts with label multilayer perceptrons. Show all posts
Showing posts with label multilayer perceptrons. Show all posts

Standard Artificial Neural Network Template



For the past few weeks, I have been thinking about having a trained artificial neural network on a computer, transferring it to another computer or another operating system, or even selling the trained network in a digital exchange in the future.

It really doesn't matter in what programming language artificial neural networks are written in.  They all have the same parameters, inputs, outputs, weights, biases etc.  All of these values are particularly suited to be fed into the program using XML document based on an .XSD schema or a light-weight protocol like JSON.  However, to my knowledge, this hasn't been done, so I took it upon myself to crack one out.

It is not only useful in creating portability in a non-trained network, but it also has data elements for a trained network as well, making the results of deep learning, machine learning and AI training portable and available.

Even if there are existing binaries, creating a front end to input the values would take minimal programming, re-programming or updating.

I also took the opportunity to make it extensible and flexible. Also there are elements that are not yet developed (like an XML XSD tag for a function) but I put the capability in, once it is developed.

There are a few other interesting things included.  There is the option to define more than one activation function. The values for the local gradient, the alpha and other parameters are included for further back propagation.

There is room to include a link to the original dataset to which these nets were trained (it could be a URL, a directory pathway, a database URL etc).  There is an element to record the number of training epochs.  With all of these information, the artificial neural net can be re-created from scratch.

There is extensibility in case this network is chained to another. There is the added data dimension in case other type of neurons are invented such as accumulators, or neurons that return a probability.

I put this .xsd template on Github as a public repository. You can download it from here:

http://github.com/kenbodnar/ann_template

Or if you wish, here is the contents of the .xsd called ann.xsd.  It is heavily commented for convenience.


<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:element name="artificial_neural_network">
    <xs:complexType>
      <xs:sequence>
        <!-- The "name" element is the name of the network. They should have friendly names that can be referred to if it ever goes up for sale, rent, swap, donate, or promulgate.-->
        <xs:element name="name" type="xs:string" minOccurs="1" maxOccurs="1"/>
        <!-- The "id" element is optional and can be the pkid if the values of this network are stored in an SQL (or NOSQL) database, to be called out and assembled into a network on an ad hoc basis-->
        <xs:element name="id" type="xs:integer" minOccurs="0" maxOccurs="1"/>
        <!-- The "revision" element is for configuration control-->
        <xs:element name="revision" type="xs:string" minOccurs="1" maxOccurs="1"/>
        <!-- The "revision_history" is optional and is an element to describe changes to the network -->
        <xs:element name="revision_history" type="xs:string" minOccurs="0" maxOccurs="1"/>
        <!-- The "classification element" is put in for later use. Someone will come up with a classification algorithm for types of neural nets.There is room for a multiplicity of classifications-->
        <xs:element name="classification" type="xs:string" minOccurs="0" maxOccurs="0"/>
        <!-- The "region" element is optional and will be important if the networks are chained together, and the neurons have different functions than a standard neuron, like an accumulator or a probability computer
        and are grouped by region, disk, server, cloud, partition, etc-->
        <xs:element name="region" type="xs:string" minOccurs="0" maxOccurs="1"/>
        <!-- The "description" element is an optional field, however a very useful one.-->
        <xs:element name="description" type="xs:string" minOccurs="0" maxOccurs="1"/>
        <!-- The "creator" element is optional and denotes who trained these nets -->
        <xs:element name="creator" type="xs:string" minOccurs="0" maxOccurs="1"/>
        <!-- The "notes" element is optional and is self explanatory-->
 <xs:element name="notes" type="xs:string" minOccurs="0" maxOccurs="1"/>
        <!-- The source element defines the origin of the data. It could be a URL -->
 <xs:element name="dataset_source" type="xs:string" minOccurs="0" maxOccurs="1"/>
        <!-- This optional element, together with the source data helps to recreate this network should it go wonky -->
        <xs:element name="number_of_training_epochs" type="xs:integer" minOccurs="0" maxOccurs="1"/>
        <!-- The "number_of_layers" defines the total-->
        <xs:element name="number_of_layers" type="xs:integer" minOccurs="1" maxOccurs="1"/>
        <xs:element name="layers">
          <xs:complexType>
            <xs:sequence>
              <!-- Repeat as necessary for number of layers-->
              <xs:element name="layer" type="xs:string" minOccurs="1" maxOccurs="1">
                <xs:complexType>
                  <xs:sequence>
                    <!-- Layer Naming and Neuron Naming will ultimately have a recognized convention eg. L2-N1 is Layer 2, Neuron #1-->
                    <xs:element name="layer_name" type="xs:string" minOccurs="0" maxOccurs="1"/>
                    <!-- number of neurons is for the benefit of an object-oriented constructor-->
                    <xs:element name="number_of_neurons" type="xs:integer" minOccurs="1" maxOccurs="1"/>
                    <!-- defining the neuron this is repeated as many times as necessary-->
                    <xs:element name="neuron">
                      <xs:complexType>
                        <xs:sequence>
                          <!--optional ~  currently it could be a perceptron, but it could also be a new type, like an accumulator, or probability calculator-->
                          <xs:element name="type" type="xs:string" minOccurs="0" maxOccurs="1"/>
                          <!-- name is optional ~ name will be standardized eg. L1-N1 layer/neuron pair. The reason is that there might be benefit in synaptic joining of this layer to other networks and one must define the joins -->
                          <xs:element name="name" type="xs:string" minOccurs="0" maxOccurs="1"/>
                          <!-- optional ~ again, someone will come up with a classification system-->
                          <xs:element name="neuron_classification" type="xs:string" minOccurs="0" maxOccurs="1"/>
                          <!-- number of inputs-->
                          <xs:element name="number_of_inputs" type="xs:integer" minOccurs="1" maxOccurs="1"/>
                          <!-- required if the input layer is also an output layer - eg. sigmoid, heaviside etc-->
                          <xs:element name="primary_activation_function_name" type="xs:string" minOccurs="0" maxOccurs="1"/>
                          <!-- ~ optional - there is no such thing as a xs:function - yet, but there could be in the future -->
                          <xs:element name="primary_activation_function" type="xs:function" minOccurs="0" maxOccurs="1"/>
                          <!-- in lieu of an embeddable function, a description could go here ~ optional -->
                          <xs:element name="primary_activation_function_description" type="xs:string" minOccurs="0" maxOccurs="1"/>
                          <!-- possible alternate activation functions eg. sigmoid, heaviside etc-->
                          <xs:element name="alternate_activation_function_name" type="xs:string" minOccurs="0" maxOccurs="1"/>
                          <!-- ~ optional - there is no such thing as a xs:function - yet, but there could be in the future -->
                          <xs:element name="alternate__activation_function" type="xs:function" minOccurs="0" maxOccurs="1"/>
                          <!-- in lieu of an embeddable function, a description could go here ~ optional -->
                          <xs:element name="alternate__activation_function_description" type="xs:string" minOccurs="0" maxOccurs="1"/>
                          <!-- if this is an output layer or requires an activation threshold-->
                          <xs:element name="activation_threshold" type="xs:double" minOccurs="1" maxOccurs="1"/>
                          <xs:element name="learning_rate" type="xs:double" minOccurs="1" maxOccurs="1"/>
                          <!-- the alpha or the 'movement' is used in the back propagation formula to calculate new weights-->
                          <xs:element name="alpha" type="xs:double" minOccurs="1" maxOccurs="1"/>
                          <!-- the local gradient is used in back propagation-->
                          <xs:element name="local_gradient" type="xs:double" minOccurs="1" maxOccurs="1"/>
                          <!-- inputs as many as needed-->
                          <xs:element name="input">
                            <xs:complexType>
                              <xs:sequence>
                                <!-- Inputs optionally named in case order is necessary for definition -->
                                <xs:element name="input_name" type="xs:string" minOccurs="0" maxOccurs="1"/>
                                <!-- use appropriate type-->
                                <xs:element name="input_value_double" type="xs:double" minOccurs="0" maxOccurs="unbounded"/>
                                <!-- use appropriate type-->
                                <xs:element name="input_value_integer" type="xs:integer" minOccurs="0" maxOccurs="unbounded"/>
                                <!-- weight for this input-->
                                <xs:element name="input_value_weight" type="xs:double" minOccurs="1" maxOccurs="1"/>
                                <!-- added as a convenince for continuation of back propagation if the network is relocated, moved, cloned, etc-->
                                <xs:element name="input_value_previous_weight" type="xs:double" minOccurs="1" maxOccurs="1"/>
                              </xs:sequence>
                            </xs:complexType>
                          </xs:element>
                          <!-- end of input-->
                          <!-- bias start-->
                          <xs:element name="bias">
                            <xs:complexType>
                              <xs:sequence>
                                <xs:element name="bias_value" type="xs:double" minOccurs="1" maxOccurs="1"/>
                                <xs:element name="bias_value_weight" type="xs:double" minOccurs="1" maxOccurs="1"/>
                                <!-- added as a convenince for continuation of back propagation if the network is relocated, moved, cloned, etc-->
                                <xs:element name="bias_value_previous_weight" type="xs:double" minOccurs="1" maxOccurs="1"/>
                              </xs:sequence>
                            </xs:complexType>
                          </xs:element>
                          <!-- end of bias-->
                          <xs:element name="output">
                            <xs:complexType>
                              <xs:sequence>
                                <!-- outputs optionally named in case order is necessary for definition -->
                                <xs:element name="output_name" type="xs:string" minOccurs="0" maxOccurs="1"/>
                                <xs:element name="output_value_double" type="xs:double" minOccurs="0" maxOccurs="unbounded"/>
                                <!-- hypothetical value is a description of what it means if the neuron activates and fires as output if this is the last layer-->
                                <xs:element name="hypothetical_value" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
                              </xs:sequence>
                            </xs:complexType>
                          </xs:element>
                          <!-- end of output-->
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                    <!-- end of neuron-->
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <!-- end of layer-->
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <!-- end of layers-->
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- network-->
</xs:schema>

I hope this helps someone. This is open source. Please use it and pass it on if you find it useful.

Perils of Overtraining in AI Deep Learning


When we partnered with a local university department of Computer Science to create some Artificial Neural Networks (ANNs) for our platform, we gave them several years of data to play with.  They massaged the input data, created an ANN machine and ran training epochs to kingdom come.

 The trouble with ANNs, is that you can over-train them.  This means that they respond  specifically for the data set in a highly accurate manner, but they are not general enough to accurately process new data.  To put it in general terms, their point-of-view is too narrow, and encompasses only the data that they were trained on.

In the training process, I was intuitively guessing that the learning rate and improved accuracy would improve in an exponential manner with each iterative training epoch.  I was wrong.  Here is a graph showing that the learning rate is rather linear than exponential in the training cycle.


So the minute that the graph stops being linear, is when you stop training.  However, as our university friends found out, they had no way to regress the machine to exactly one training epoch back.  They had no record of the weights, biases, adjusted weights, etc of the epoch after the hours of back propagation or learning, and as a result, they had to re-run all of the training.

Me, I had a rather primitive way of saving the states of the neurons and layers. I mentioned it before. I wrote my machine in Java using object oriented programming, and those objects have the ability to be serialized.  In other words, binary objects in memory can be preserved in a state, written to disk, and then resurrected to be active in the last state that they were in.  Kind of like freezing a body cryogenically, but having the ability to bring it back to life.

So after every training epoch, I serialize the machine.  If I over-train the neural nets, I can get a signal by examining and/or plotting the error rates which are inverse to the accuracy of the nets. In the above graph, once the function stops being linear, I know that I am approaching the over-training event horizon.  Then I can regress with my save serialized versions of the AI machine.

Then the Eureka moment struck me! I had discovered a quick and easy cure for over-training.

I had in a previous blog article, a few down from here (or http://coderzen.blogspot.com/2015/01/brain-cells-for-sale-need-for.html ) I made the case for a standardized AI machine where you could have an XML or JSON lightweight representation of the layers, inputs, number of neurons, outputs and even hypothetical value mappings for the outputs, and then you wouldn't need to serialize the whole machine.  At the end of every training epoch, you just output the recipe for the layers, weights, biases etc, and you could revert to an earlier training incarnation by inputting a new XML file or a JSON object.

It's really time to draw up the .XSD schema for the standardized neuron. I want it to be open source. It would be horrible to be famous for thinking of a having a standardized neural net. Besides, being famous is just a job.

A Returned-Probability Artificial Neural Network - The Quantum Artificial Neural Network


Artificial Neural Networks associated with Deep Learning, Machine Learning using supervised and unsupervised learning are fairly good at figuring out deterministic things. For example they can find an open door for a robot to enter. They can find patterns in a given matrix or collection, or field.

However, sometimes there is no evident computability function. In other words, suppose that you are looking at an event or action that results from a whole bunch of unknown things, with a random bit of chaos thrown in.  It is impossible to derive a computable function without years of study and knowing the underlying principles. And even then, it still may be impossible to quantify with an equation, regression formula or such.

But Artificial Neural Nets can be trained to identify things without actually knowing anything about the background causes.  If you have a training set with the answers or results of size k (k being a series of cases), then you can always train your Artificial Neural Networks or Multilayer Perceptrons on k-1 sets, and evaluate how well you are doing with the last set. You measure the error rate and back propagate, and off you go to another training epoch if necessary.

This is happening with predicting solar flares and the resultant chaos that it cause with electronics and radio communications when these solar winds hit the earth.  Here is a link to the article, where ANN does the predicting:

http://www.dailymail.co.uk/sciencetech/article-2919263/The-computer-predict-SUN-AI-forecasts-devastating-solar-flares-knock-power-grids-Earth.html

In this case, the ANN's have shown that there is a relationship between vector magnetic fields of the surface of the sun, the solar atmosphere and solar flares.  That's all well and dandy for deterministic events, but what if the determinism was a probability and not a direct causal relationship mapped to its input parameters? What if there were other unknown or unknownable influence factors?

That's were you need an ANN (Artificial Neural Network) to return a probability as the hypothesis value. This is an easy task for a stats package working on database tables, churning out averages, probabilities, degrees of confidence, standard deviations etc, but I am left wondering if it could be done internally in the guts of the artificial neuron.

The artificial neuron is pretty basic. It sums up all of the inputs and biases multiplied by their weights, and feeds the result to an activation function.  It does this many times over in many layers.  What if you could encode the guts of the neuron to spit out the probability of the results of what is being inputted? What if somehow you changed the inner workings of the perceptron or neuron to calculate the probability.  It seems to me that the activation function is somehow ideally suited to adaptation to do this, because it can be constructed to deliver an activation value of between 0 and 1, which matches probability notation.

Our human brains work well with fuzziness in our chaotic world.   We unconsciously map patterns and assign probabilities to them. There is another word for fuzzy values. It is a "quantum" property. The more you know about one property of an object, the less you know about another.  Fuzziness. The great leap forward for Artificial Neural Networks, is to become Quantum and deliver a probability.  Once we can get an Artificial Neural Net machine to determine probability, then we can apply Bayesian mechanics. That's when it can make inferences, and get a computer on the road to thinking from first principles -- by things that it has learned by itself.

Brain Cells For Sale ~ The Need For Standardization of Artificial Neural Nets


When it comes to Artificial Neural Networks, the world is awash with roll-your-own. Everyone has their own brand and implementation.  Although the theory and practice is well thought out, tested and put into use, the implementation in almost every case is different. In our company, we have a partner university training artificial neural nets for our field of endeavor as a research project for graduate students.

Very few roll-your-own ANN's or Artificial Neural Networks are object-oriented in terms of the way they are programmed. This is because it is easier to have a monolithic program where each layer resides in an array, and the neurons can input and output to each other easily.  All ANNs are coded in everything from Java, to C, to C++, to C# to kiddie scripting.  I am here to preach today, that there should be a standard Artificial Neuron.  To be more explicit, the standardization should be in the recipe for layers, inputs, weights, biases and outputs.  Let me explain.

While the roll-your-own is efficient for each application, it has several major drawbacks.  Let me go through some of them.

The first one is portability. We have a multitude of platforms on everything from Windows to Linux, to Objective C in the iOS native format, to QNX to folks putting Artificial Neural Networks on silicon, and programming right down to the bare metal, or the semi-metals that dope the silicon matrix in the transistor junctions of the chips. We need to be able to run a particular set of specifically trained neural nets on a variety of platforms.

The multiplicity of platforms was seen early on and as a result, we had strange things developed like CORBA or Common Object Request Broker Architecture being formulated ( http://en.wikipedia.org/wiki/Common_Object_Request_Broker_Architecture ). CORBA came about in the early 1990's in its initial incarnations however it is bulky and adds a code-heavy layer of abstraction to each platform when you want to transport silicon brainiacs like a multilayer perceptron machine. The idea of distributed computing is an enticing one, but due to a large variety of factors, including security and the continued exponential multiplication of integrated transistors on a chip according to Moore's Law, it is a concept that has been obviated for the present time.

My contention, is that if you had a standard for a Neural Net, then you wouldn't have to call some foreign memory or code object from a foreign computer. You would just use a very simple light-weight data protocol to transfer post-learning layers, weights and biases (like JSON)  and bingo -- you can replicate smartness on a new machine in minutes without access to training data, or the time spent training the artificial neural net. It would be like unpacking a thinker in a box. You could be dumber than a second coat of paint, but no one would notice, because your mobile phone did your thinking for you.

There is another aspect to this, and it is the commercial aspect.  If I came across a unique data set, and trained a bunch of neural networks to predict stuff in the realm of that data set, I potentially could have a bunch of very valuable neural nets that I could sell to you.  All that you would have is pay me the money, download my neural net recipe with its standardized notation, and be in business generating your own revenue stream. It wouldn't matter what platform, operating system or chip set that your computer or device used -- the notation for the recipe of the artificial neural network would be agnostic to the binaries.

We are in a very strange time, with the underpinnings of our society changing at a very fast pace.  My contention is that the very nature of employment may change for many people.  We will not longer need to import cheap goods from China that fill the dollar stores. You will order the recipe for a 3D printer and make whatever you need.  This paradigm alone will kill many manufacturing jobs. As a result, the nature of work will change.  People will find a niche, and supply the knowledge in that niche that can be utilized or even materialize that knowledge into what they need.   We will transcend the present paradigm of people supporting themselves by making crafts and selling them on Etsy or writing books and selling them on Amazon.  People will make and sell knowledge products, and one could sell trained neural nets for any field of endeavor.

Just as rooms full of Third World country young men game all day and sell the rewards online to impatient first world gamers, you will have people spending days and weeks training neural nets and sell them on an online marketplace.

That day is coming shortly, and the sooner that we have a standard for Artificial Neural Net recipes, the sooner that we will see intelligence embedded in devices and trained neural nets for sale. You can count on it.

These thoughts were spawned on my daily walk, and you can bet that I have already started to create a schema for a neural net transference, as well as a Java Interface for one version of a standardized neural net.  Stay tuned.

Synaptic Pruning in Artificial Neural Networks and Multilayer Perceptrons

What happens in a baby's mind is fascinating.  While the baby is sleeping, it processes all of the information that its senses took in, and puts in through a huge Mixmaster creating all sorts of connections to memory, storage, logic and emotions.  I love the way that Mother Nature plays dice.  The baby's brain makes synaptic connections between bits of data that are also inappropriate. This is hugely beneficial because once these connections are made, then the logic circuits can evaluate if they are sound and reflect the outside world.  A baby's brain multiplies in size 5 times until it reaches adulthood, largely from creation of synapses or links to neurons (plus other biological infrastructure functions).  This is why a child's imagination is so fertile.

Then we have synaptic pruning near the onset of puberty. ( http://en.wikipedia.org/wiki/Synaptic_pruning ). Once we start thinking about sex, we start pruning the synapses that we think are inappropriate.  The cartoon below gives a very simplistic diagram of pruning inappropriate synapses.  I use the word inappropriate in the sense of what is considered inappropriate by adults and keen rationalists or fairy-tale dogmatics.



How did I get onto this?  I saw a tweet by a hard-core religion fundamentalist who stated that neuroplasticity was the deity's way of fixing a brain.  (In that context, I think that he was implying neural re-wiring to fix apostasy, homosexuality, atheism, and everything else that he didn't approve of.)  I had heard of neuroplasticity but I googled it to ascertain the current scientific thinking of it.  Simply, neuroplasticity is the rewiring or creation of synapsis to take over functions of the brain that have been destroyed by trauma, injury and/or accident.  For example, it has been reported that brain function controlling say motor activity has been discovered in a portion of the brain not known for that activity in an accident victim.  The term synaptic pruning was in this article, and I had to investigate the term.

Once I googled it, it reminded me of the works of Dr. Stephen L. Thaler, PhD.  He has a raft of scientific discovery and patents, and he was an early adopter of artificial neural networks. ( http://imagination-engines.com/iei_founder.php ). In a nutshell, he did some work in Cognition, Consciousness and Creativity in artificial neural networks for which he holds patents.  He discovered that if you randomly destroyed neurons in a massive array of artificial neural networks, as the network was expiring, it came up with creative outputs or solutions.  As a result, he added another layer of neurals nets to observe this.  In essence by killing off neurons randomly, he was doing synaptic pruning of a sort.

Let me quote from Dr. Thaler's website:

After witnessing some really great ideas emerge from the near-death experience of artificial neural networks, Thaler decided to add additional nets to automatically observe and filter for any emerging brainstorms. From this network architecture was born the Creativity Machine (US Patent 5,659,666). Thaler has proposed such neural cascade as a canonical model of consciousness in which the former net manifests what can only be called a stream of consciousness while the second net develops an attitude about the cognitive turnover within the first net (i.e., the subjective feel of consciousness). In this theory, all aspects of both human and animal cognition are modeled in terms of confabulation generation. Thaler is therefore both the founder and architect of confabulation theory and the patent holder for all neural systems that contemplate, invent, and discover via such confabulations.


The idea then struck me, that perhaps it wasn't necessary to destroy the neuron in the network to achieve what Dr. Thaler saw, but rather just do the synaptic pruning, by randomly destroying inputs (and as a result their weights) in the hidden layers of multilayer perceptrons.

After the connection was destroyed, you would still run the AI machine including back propagation and see what comes out.  What a fascinating concept, and I am itching to try this once I find the time.

I am sure that all sorts of people might think that Dr. Thaler is a nutbar, but those were the same people who thought that Benoit Mandelbrot's ideas on fractal geometry were child's play with no practical applications.  Or we see how the works of the Rev. Thomas Bayes who is a relative unknown, publishing only two papers in his lifetime, and dying in 1761 postulated the important Bayesian inference used in Machine Learning.

So Artificial Neural Networks come and go in popularity in the computing field.  I am sure that Dr. Thaler is onto something, and for some strange reason, his theories may pan out to be seminal in the field of machine consciousness that way that Alan Turing's ideas became pivotal in this modern age of technology.  And somewhere in there, synaptic pruning will take place, and it just may not be a footnote in the development of artificial consciousness.

If you are looking for ideas for a master or doctoral thesis, you are welcome.





What I learned from playing with Artificial Neural Nets and Multilayer Perceptrons


Anecdotal Observations about Artificial Neural Networks and Multilayer Perceptrons

I like experimenting. I have had a lot of fun of trying to embed knowledge in the thresholds of a massively parallel Artificial Neural Networks, specifically multi-layer perceptrons.  The field of artificial intelligence is a space where one can create magnificent experiments without the physical ramifications of things going very wrong very quickly, say in other experiments using high energy explosives, raw high voltage electricity or caustic, extremely fast exothermic reactions. Everything happens in the five pounds of laptop sitting on your knees without smoke, or blowing fuses or having to call the fire department.

  I created my own Java framework with each perceptron being its own object.  I am told that object oriented programming is the most advanced form of programming, and in my quest for a possible Nobel Prize, I have to use the most advanced tools available to me.  My perceptrons were connected by axon objects which held the outputs and fed them into the next hidden layer. It was intended to be a fine example of silicon mimicry of noggin topography.

I didn't do anything fancy with the activation function. Rather than a Heaviside function, even though I rather admire the works of the very eccentric Oliver Heaviside ( http://en.wikipedia.org/wiki/Oliver_Heaviside ).  He showed that the math involved in understanding Einstein's theories is less than complex in the mathematics of describing what happens inside an electric power cable.  But je digress.   I chose a sigmoid activation function and there are two common ones:

φ(v) = 1 / (1 + e^(-v))   (the logistic function)

φ(v) = tanh(v) = (e^v - e^(-v)) / (e^v + e^(-v))

I chose tanh because, for the back propagation, I knew that the derivative of tanh was 1 - tanh^2, and that would save me some time coding the weight adjustments for back propagation.  I used the standard formula for the correction to the weights on the nodes which minimizes the error of the output, given by the fancy-schmancy-looking equation that merely states the sum of all of the inputs multiplied by their weights, including a bias and bias weight:

v = Σ_i (w_i · x_i) + w_b · b,   y = φ(v)

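The weighted sum and squashing step can be sketched in Java as a tiny perceptron object. This is a minimal illustration, not the author's actual framework; the class and method names are my own invention.

```java
// A minimal sketch of the induced local field and tanh activation.
// Names (Perceptron, inducedField, fire) are illustrative only.
public class Perceptron {
    double[] weights;
    double bias; // bias weight on a fixed input of 1

    Perceptron(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    // v = sum(w_i * x_i) + b  -- the weighted sum of inputs plus the bias
    double inducedField(double[] inputs) {
        double v = bias;
        for (int i = 0; i < weights.length; i++) {
            v += weights[i] * inputs[i];
        }
        return v;
    }

    // y = tanh(v) -- the sigmoid squashing function
    double fire(double[] inputs) {
        return Math.tanh(inducedField(inputs));
    }
}
```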
The back propagation uses gradient descent and is given by the following (hence the need to know the derivative of the sigmoid function):
Δw_i = η · δ · x_i,   where the local gradient δ = e · φ'(v) and φ'(v) = 1 - tanh^2(v)
which again is a very smart-looking, complicated-looking equation that adds some real provenance of superior intelligence to this blog entry. Any kindergarten prodigy could understand the concept of relating the local gradient to the derivative of the activation or squashing function to get an approximation of the correct weights when the perceptron gives an answer as dumb as a doorpost, or an NFL player hit on the head one too many times.
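The gradient-descent correction, with the tanh derivative trick, might look something like this in Java. Again, a hedged sketch with invented names, not the author's code:

```java
// A sketch of the weight correction for an output neuron, assuming
// tanh activation so that phi'(v) = 1 - tanh^2(v).
public class BackProp {
    // local gradient: delta = e * phi'(v), with phi = tanh
    static double localGradient(double error, double inducedField) {
        double t = Math.tanh(inducedField);
        return error * (1.0 - t * t); // the time-saving derivative of tanh
    }

    // weight correction: delta_w = eta * delta * input
    static double weightCorrection(double eta, double delta, double input) {
        return eta * delta * input;
    }
}
```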

So having said all of that, these are my observations about multi-layer perceptrons:

1) Bigger is not better

It takes a hell of a lot of training epochs to get a network with a large number of hidden layers to move significantly toward a more correct output. I start getting better approximations of the training set more quickly with fewer layers. I naively thought the more, the merrier and much smarter.  Fewer layers get smarter quicker. There is a proviso, though. Deep layers, like Deep Throat, do eventually give more satisfying results, providing a measure of accuracy on very complex inputs. However, it takes a hell of a lot of training and spinning of silicon gears to get there. Moi -- I am the impatient type who likes results quickly even though, like the smoked bloaters that I bought yesterday, they are a bit off.

2) It pays to be explicit.  Since I was using a sigmoid activation function, I figured that the hidden layers would act as some huge flip-flop or boolean gate array and magically come up with an answer with a minimum number of neurons. By this I mean: suppose that you had a problem with three inputs, and the hypothesis values of the output were three, two, one or zero. Since the output neurons can be activated or not, the four possible output values can be represented by only two neurons (counting from zero to three in binary goes 00, 01, 10, 11).  I soon learned to be explicit. If you have four hypothetical output values, you should have four output neurons in the output layer to minimize training epochs.
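The "be explicit" lesson is essentially one-hot encoding: one output neuron per hypothesis value rather than packing four values into two binary neurons. A minimal sketch (class and method names are my own):

```java
// One output neuron per hypothesis value (one-hot encoding).
public class OneHot {
    // Encode hypothesis value k (0..n-1) as a target vector with a single 1.0
    static double[] encode(int k, int n) {
        double[] target = new double[n];
        target[k] = 1.0;
        return target;
    }

    // Decode: the most strongly activated output neuron wins
    static int decode(double[] outputs) {
        int best = 0;
        for (int i = 1; i < outputs.length; i++) {
            if (outputs[i] > outputs[best]) best = i;
        }
        return best;
    }
}
```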

3) It pays to be a wild and crazy guy. With eager anticipation, I fired up my newly made artificial intelligence machine, expecting a Frankenstein equivalent of Einstein to machine-learn and do my tasks for me, and generally make me look brilliant.  I figured that I was on the verge of artificial genius to enhance my brain capacity, which I already figured to be roughly the size of a small planet. So when it came to setting weights and biases, I either went with the integer 1 or a random integer, and figured that the back propagation would clean up and get me to the appropriate figure.  Again, that would be the case if I had the patience to sit through a few million training epochs.  When my perceptrons had the intellectual ability of Popeye the Sailor Man, I was sorely tempted to give up, until I started doing crazy things with the initial weights. In one case, I got satisfaction by starting with a value of 10^-3.
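One way to sketch that "crazy" initialization: small random weights on the order of 10^-3, which keep a tanh neuron out of its saturated region where gradients barely move. This is an illustrative helper, not the author's code:

```java
import java.util.Random;

// Small random weight initialization: values uniform in (-scale, scale),
// e.g. scale = 1e-3 rather than starting at 1 or a random integer.
public class WeightInit {
    static double[] smallRandomWeights(int n, double scale, long seed) {
        Random rng = new Random(seed);
        double[] w = new double[n];
        for (int i = 0; i < n; i++) {
            w[i] = (2.0 * rng.nextDouble() - 1.0) * scale;
        }
        return w;
    }
}
```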

4) It pays to initially log everything. I was getting fabulous results in early testing with minimal training. I figured that I was well on the way to creating my own silicon-based Watson that would reside on my laptop at my beck and call. However, as the number of training epochs climbed, the weight correction stalled and the logs reported that the local gradient was NaN, or not a number. Several WTF sessions resulted, and it took the travail of logging everything to discover that the phi equation, where I was supposed to calculate the local gradient, had a mistake: I was squaring Math.tan instead of Math.tanh. It was disappointing to learn that a programming error initially added a lot of accuracy and amazement to my artificial intelligence machine, but as it progressed, it got dumber and dumber. I suppose that it's a good model for the intellectual capacity of a human being's life journey, but that wasn't what I was aiming for.
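The bug is easy to reproduce: tanh is bounded, so 1 - tanh^2(v) always lands in (0, 1], while tan is unbounded and periodic, so 1 - tan^2(v) can swing wildly negative and eventually poison the gradients. A sketch of the two derivatives side by side (my own illustration):

```java
// The correct derivative versus the one-letter typo.
public class GradientBug {
    // phi'(v) = 1 - tanh^2(v): always in (0, 1], well-behaved for back propagation
    static double correctDerivative(double v) {
        double t = Math.tanh(v);
        return 1.0 - t * t;
    }

    // The bug: Math.tan instead of Math.tanh -- unbounded and periodic
    static double buggyDerivative(double v) {
        double t = Math.tan(v);
        return 1.0 - t * t;
    }
}
```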

5) One of the most difficult things about using multi-layer perceptrons is framing the tool such that its functioning maps clearly to a hypothesis value with a minimal amount of jumping through loops and hoops and a pile of remedial programming. In other words, to make the thing do real-life work (instead of constructing a piece of software that autonomously mimics an exclusive OR gate, as most tutorials in this field do), you have to design it such that real-world inputs can be mapped into the morphology, topology and operating methodology of an artificial neural net, and the outputs can be mapped to significant hypothesis values.

In other words, a high school girl can code a functional multi-layer perceptron machine (and one has -- if you watch TED talks, she diagnoses cancer with it); however, it takes a bit of real work to make it solve real-life problems. But when you do, machine learning is one of the most sublime achievements of the human race. The machines achieve a level of logic that their carbon-based creators cannot.  And that is why Dr. Stephen Hawking says that Artificial Intelligence poses a threat to mankind.  I am not worried about the threat of artificial intelligence. I am more worried about the threat of a fanatic with a bunch of explosives strapped to his chest. It is only logical, and it doesn't take very many training epochs to figure that one out.