In the last chapter, we presented an overview of different neural network models. In this chapter, we continue the broad discussion of neural networks with two important topics: Learning and Training. Here are key questions that we would like to answer:
• How do neural networks learn?
• What does it mean for a network to learn ?
• What differences are there between supervised and unsupervised learning ?
• What training regimens are in common use for neural networks?
There are many varieties of neural networks. In the final analysis, as we have discussed briefly in Chapter 4 on network modeling, all neural networks do one or more of the following :
• Pattern classification
• Pattern completion
• Data clustering
• Function evaluation
A neural network, in any of the previous tasks, maps a set of inputs to a set of outputs. This nonlinear mapping can be thought of as a multidimensional mapping surface. The objective of learning is to mold the mapping surface according to a desired response, either with or without an explicit training process.
A network can learn when training is used, or the network can learn also in the absence of training. The difference between supervised and unsupervised training is that, in the former case, external prototypes are used as target outputs for specific inputs, and the network is given a learning algorithm to follow and calculate new connection weights that bring the output closer to the target output. Unsupervised learning is the sort of learning that takes place without a teacher. For example, when you are finding your way out of a labyrinth, no teacher is present. You learn from the responses or events that develop as you try to feel your way through the maze. For neural networks, in the unsupervised case, a learning algorithm may be given but target outputs are not given. In such a case, data input to the network gets clustered together; similar input stimuli cause similar responses.
When a neural network model is developed and an appropriate learning algorithm is proposed, it would be based on the theory supporting the model. Since the dynamics of the operation of the neural network is under study, the learning equations are initially formulated in terms of differential equations. After solving the differential equations, and using any initial conditions that are available, the algorithm could be simplified to consist of an algebraic equation for the changes in the weights. These simple forms of learning equations are available for your neural networks.
At this point of our discussion you need to know what learning algorithms are available, and what they look like. We will now discuss two main rules for learning—Hebbian learning, used with unsupervised learning and the delta rule, used with supervised learning. Adaptations of these by simple modifications to suit a particular context generate many other learning rules in use today. Following the discussion of these two rules, we present variations for each of the two classes of learning: supervised learning and unsupervised learning.
Learning algorithms are usually referred to as learning rules. The foremost such rule is due to Donald Hebb. Hebb’s rule is a statement about how the firing of one neuron, which has a role in the determination of the activation of another neuron, affects the first neuron’s influence on the activation of the second neuron, especially if it is done in a repetitive manner. As a learning rule, Hebb’s observation translates into a formula for the difference in a connection weight between two neurons from one iteration to the next, as a constant [mu] times the product of activations of the two neurons. How a connection weight is to be modified is what the learning rule suggests. In the case of Hebb’s rule, it is adding the quantity [mu]aiaj, where ai is the activation of the ith neuron, and aj is the activation of the jth neuron to the connection weight between the ith and jth neurons. The constant [mu] itself is referred to as the learning rate. The following equation using the notation just described, states it succinctly:
[Delta]wij = [mu]aiaj
As you can see, the learning rule derived from Hebb’s rule is quite simple and is used in both simple and more involved networks. Some modify this rule by replacing the quantity ai with its deviation from the average of all as and, similarly, replacing aj by a corresponding quantity. Such rule variations can yield rules better suited to different situations.
For example, the output of a neural network being the activations of its output layer neurons, the Hebbian learning rule in the case of a perceptron takes the form of adjusting the weights by adding [mu] times the difference between the output and the target. Sometimes a situation arises where some unlearning is required for some neurons. In this case a reverse Hebbian rule is used in which the quantity [mu]aiaj is subtracted from the connection weight under question, which in effect is employing a negative learning rate.
In the Hopfield network of Chapter 1, there is a single layer with all neurons fully interconnected. Suppose each neuron’s output is either a + 1 or a – 1. If we take [mu] = 1 in the Hebbian rule, the resulting modification of the connection weights can be described as follows: add 1 to the weight, if both neuron outputs match, that is, both are +1 or –1. And if they do not match (meaning one of them has output +1 and the other has –1), then subtract 1 from the weight.
The delta rule is also known as the least mean squared error rule (LMS). You first calculate the square of the errors between the target or desired values and computed values, and then take the average to get the mean squared error. This quantity is to be minimized. For this, realize that it is a function of the weights themselves, since the computation of output uses them. The set of values of weights that minimizes the mean squared error is what is needed for the next cycle of operation of the neural network. Having worked this out mathematically, and having compared the weights thus found with the weights actually used, one determines their difference and gives it in the delta rule, each time weights are to be updated. So the delta rule, which is also the rule used first by Widrow and Hoff, in the context of learning in neural networks, is stated as an equation defining the change in the weights to be affected.
Suppose you fix your attention to the weight on the connection between the ith neuron in one layer and the jth neuron in the next layer. At time t, this weight is wij(t) . After one cycle of operation, this weight becomes wij(t + 1). The difference between the two is wij(t + 1) - wij(t), and is denoted by [Delta]wij . The delta rule then gives [Delta]wij as :
[Delta]wij = 2[mu]xi(desired output value – computed output value)j
Here, [mu] is the learning rate, which is positive and much smaller than 1, and xi is the ith component of the input vector.
Supervised neural network paradigms to be discussed include :
• Feedforward Backpropagation network
• Statistical trained networks (Boltzmann/Cauchy machines)
• Radial basis function networks
The Perceptron and the Adaline use the delta rule; the only difference is that the Perceptron has binary output, while the Adaline has continuous valued output. The Feedforward Backpropagation network uses the generalized delta rule, which is described next.
While the delta rule uses local information on error, the generalized delta rule uses error information that is not local. It is designed to minimize the total of the squared errors of the output neurons. In trying to achieve this minimum, the steepest descent method, which uses the gradient of the weight surface, is used. (This is also used in the delta rule.) For the next error calculation, the algorithm looks at the gradient of the error surface, which gives the direction of the largest slope on the error surface. This is used to determine the direction to go to try to minimize the error. The algorithm chooses the negative of this gradient, which is the direction of steepest descent. Imagine a very hilly error surface, with peaks and valleys that have a wide range of magnitude. Imagine starting your search for minimum error at an arbitrary point. By choosing the negative gradient on all iterations, you eventually end up at a valley. You cannot know, however, if this valley is the global minimum or a local minimum. Getting stuck in a local minimum is one well-known potential problem of the steepest descent method. You will see more on the generalized delta rule in the chapter on backpropagation (Chapter 7).
The Boltzmann machine (and Cauchy machine) uses probabilities and statistical theory, along with an energy function representing temperature. The learning is probabilistic and is called simulated annealing. At different temperature levels, a different number of iterations in processing are used, and this constitutes an annealing schedule. Use of probability distributions is for the goal of reaching a state of global minimum of energy. Boltzmann distribution and Cauchy distribution are probability distributions used in this process. It is obviously desirable to reach a global minimum, rather than settling down at a local minimum.
Figure clarifies the distinction between a local minimum and a global minimum. In this figure you find the graph of an energy function and points A and B. These points show that the energy levels there are smaller than the energy levels at any point in their vicinity, so you can say they represent points of minimum energy. The overall or global minimum, as you can see, is at point B, where the energy level is smaller than that even at point A, so A corresponds only to a local minimum. It is desirable to get to B and not get stopped at A itself, in the pursuit of a minimum for the energy function. If point C is reached, one would like the further movement to be toward B and not A. Similarly, if a point near A is reached, the subsequent movement should avoid reaching or settling at A but carry on to B. Perturbation techniques are useful for these considerations.
Local and global minima.
Sometimes in simulated annealing, first a subset of the neurons in the network are associated with some inputs, and another subset of neurons are associated with some outputs, and these are clamped with probabilities, which are not changed in the learning process. Then the rest of the network is subjected to adjustments. Updating is not done for the clamped units in the network. This training procedure of Geoffrey Hinton and Terrence Sejnowski provides an extension of the Boltzmann technique to more general networks.
Although details of radial basis functions are beyond the scope of this book, it is worthwhile to contrast the learning characteristics for this type of neural network model. Radial basis-function networks in topology look similar to feedforward networks. Each neuron has an output to input characteristic that resembles a radial function (for two inputs, and thus two dimensions). Specifically, the output h(x) is as follows:
h(x) = exp [ (x - u)2 / 2[sigma]2 ]
Here, x is the input vector, u is the mean, and [sigma] is the standard deviation of the output response curve of the neuron. Radial basis function (RBF) networks have rapid training time (orders of magnitude faster than backpropagation) and do not have problems with local minima as backpropagation does. RBF networks are used with supervised training, and typically only the output layer is trained. Once training is completed, a RBF network may be slower to use than a feedforward Backpropagation network, since more computations are required to arrive at an output.
Unsupervised neural network paradigms to be discussed include:
• Hopfield Memory
• Bidirectional associative memory
• Fuzzy associative memory
• Learning vector quantizer
• Kohonen self-organizing map
Unsupervised learning and self-organization are closely related. Unsupervised learning was mentioned in Chapter 1, along with supervised learning. Training in supervised learning takes the form of external exemplars being provided. The network has to compute the correct weights for the connections for neurons in some layer or the other. Self-organization implies unsupervised learning. It was described as a characteristic of a neural network model, ART1, based on adaptive resonance theory (to be covered in Chapter 10). With the winner-take-all criterion, each neuron of field B learns a distinct classification. The winning neuron in a layer, in this case the field B, is the one with the largest activation, and it is the only neuron in that layer that is allowed to fire. Hence, the name winner take all.
Self-organization means self-adaptation of a neural network. Without target outputs, the closest possible response to a given input signal is to be generated. Like inputs will cluster together. The connection weights are modified through different iterations of network operation, and the network capable of self-organizing creates on its own the closest possible set of outputs for the given inputs. This happens in the model in Kohonen’s self-organizing map.
Kohonen’s Linear Vector Quantizer (LVQ) described briefly below is later extended as a self-organizing feature map. Self-organization is also learning, but without supervision; it is a case of self-training. Kohonen’s topology preserving maps illustrate self-organization by a neural network. In these cases, certain subsets of output neurons respond to certain subareas of the inputs, so that the firing within one subset of neurons indicates the presence of the corresponding subarea of the input. This is a useful paradigm in applications such as speech recognition. The winner-take-all strategy used in ART1 also facilitates self-organization.
Suppose the goal is the classification of input vectors. Kohonen’s Vector Quantization is a method in which you first gather a finite number of vectors of the dimension of your input vector. Kohonen calls these codebook vectors. You then assign groups of these codebook vectors to the different classes under the classification you want to achieve. In other words, you make a correspondence between the codebook vectors and classes, or, partition the set of codebook vectors by classes in your classification.
Now examine each input vector for its distance from each codebook vector, and find the nearest or closest codebook vector to it. You identify the input vector with the class to which the codebook vector belongs.
Codebook vectors are updated during training, according to some algorithm. Such an algorithm strives to achieve two things: (1), a codebook vector closest to the input vector is brought even closer to it; and (two), a codebook vector indicating a different class is made more distant from the input vector.
For example, suppose (2, 6) is an input vector, and (3, 10) and (4, 9) are a pair of codebook vectors assigned to different classes. You identify (2, 6) with the class to which (4, 9) belongs, since (4, 9) with a distance of [radic]13 is closer to it than (3, 10) whose distance from (2, 6) is [radic]17. If you add 1 to each component of (3, 10) and subtract 1 from each component of (4, 9), the new distances of these from (2, 6) are [radic]29 and [radic]5, respectively. This shows that (3, 10) when changed to (4, 11) becomes more distant from your input vector than before the change, and (4, 9) is changed to (3, 8), which is a bit closer to (2, 6) than (4, 9) is.
Training continues until all input vectors are classified. You obtain a stage where the classification for each input vector remains the same as in the previous cycle of training. This is a process of self-organization.
The Learning Vector Quantizer (LVQ) of Kohonen is a self-organizing network. It classifies input vectors on the basis of a set of stored or reference vectors. The B field neurons are also called grandmother cells, each of which represents a specific class in the reference vector set. Either supervised or unsupervised learning can be used with this network. (See Figure Layout for Learning Vector Quantizer.)
Layout for Learning Vector Quantizer.
The Hopfield memory, Bidirectional Associative memory and Fuzzy Associative memory are all unsupervised networks that perform pattern completion, or pattern association. That is, with corrupted or missing information, these memories are able to recall or complete an expected output. Gallant calls the training method used in these networks as one-shot learning, since you determine the weight matrix as a function of the completed patterns you wish to recall just once. An example of this was shown in Chapter 4 with determination of weights for the Hopfield memory.
ART1 is the first neural network model based on adaptive resonance theory of Carpenter and Grossberg. When you have a pair of patterns such that when one of them is input to a neural network the output turns out to be the other pattern in the pair, and if this happens consistently in both directions, then you may describe it as resonance. We discuss in Chapter 8 bidirectional associative memories and resonance. By the time training is completed, and learning is through, many other pattern pairs would have been presented to the network as well. If changes in the short-term memory do not disturb or affect the long-term memory, the network shows adaptive resonance. The ART1 model is designed to maintain it. Note that this discussion relates largely to stability.
Learning, convergence, and stability are matters of much interest. As learning is taking place, you want to know if the process is going to halt at some appropriate point, which is a question of convergence. Is what is learned stable, or will the network have to learn all over again, as each new event occurs? These questions have their answers within a mathematical model with differential equations developed to describe a learning algorithm. Proofs showing stability are part of the model inventor’s task. One particular tool that aids in the process of showing convergence is the idea of state energy, or cost, to describe whether the direction the process is taking can lead to convergence.
The Lyapunov function, discussed later in this chapter, is found to provide the right energy function, which can be minimized during the operation of the neural network. This function has the property that the value gets smaller with every change in the state of the system, thus assuring that a minimum will be reached eventually. The Lyapunov function is discussed further because of its significant utility for neural network models, but briefly because of the high level of mathematics involved. Fortunately, simple forms are derived and put into learning algorithms for neural networks. The high-level mathematics is used in making the proofs to show the viability of the models.
Alternatively, temperature relationships can be used, as in the case of the Boltzmann machine, or any other well-suited cost function such as a function of distances used in the formulation of the Traveling Salesman Problem, in which the total distance for the tour of the traveling salesman is to be minimized, can be employed. The Traveling Salesman Problem is important and well-known. A set of cities is to be visited by the salesman, each only once, and the aim is to devise a tour that minimizes the total distance traveled. The search continues for an efficient algorithm for this problem. Some algorithms solve the problem in a large number but not all of the situations. A neural network formulation can also work for the Traveling Salesman Problem. You will see more about this in Chapter 15.
Suppose you have a criterion such as energy to be minimized or cost to be decreased, and you know the optimum level for this criterion. If the network achieves the optimum value in a finite number of steps, then you have convergence for the operation of the network. Or, if you are making pairwise associations of patterns, there is the prospect of convergence if after each cycle of the network operation, the number of errors is decreasing.
It is also possible that convergence is slow, so much so that it may seem to take forever to achieve the convergence state. In that case, you should specify a tolerance value and require that the criterion be achieved within that tolerance, avoiding a lot of computing time. You may also introduce a momentum parameter to further change the weight and thereby speed up the convergence. One technique used is to add a portion of the previous change in weight.
Instead of converging, the operation may result in oscillations. The weight structure may keep changing back and forth; learning will never cease. Learning algorithms need to be analyzed in terms of convergence as being an essential algorithm property.
Neural networks are dynamic systems in the learning and training phase of their operation, and convergence is an essential feature, so it was necessary for the researchers developing the models and their learning algorithms to find a provable criterion for convergence in a dynamic system. The Lyapunov function, mentioned previously, turned out to be the most convenient and appropriate function. It is also referred to as the energy function. The function decreases as the system states change. Such a function needs to be found and watched as the network operation continues from cycle to cycle. Usually it involves a quadratic form. The least mean squared error is an example of such a function. Lyapunov function usage assures a system stability that cannot occur without convergence. It is convenient to have one value, that of the Lyapunov function specifying the system behavior. For example, in the Hopfield network, the energy function is a constant times the sum of products of outputs of different neurons and the connection weight between them. Since pairs of neuron outputs are multiplied in each term, the entire expression is a quadratic form.
Besides the applications for which a neural network is intended, and depending on these applications, you need to know certain aspects of the model. The length of encoding time and the length of learning time are among the important considerations. These times could be long but should not be prohibitive. It is important to understand how the network behaves with new inputs; some networks may need to be trained all over again, but some tolerance for distortion in input patterns is desirable, where relevant. Restrictions on the format of inputs should be known.
An advantage of neural networks is that they can deal with nonlinear functions better than traditional algorithms can. The ability to store a number of patterns, or needing more and more neurons in the output field with an increasing number of input patterns are the kind of aspects addressing the capabilities of a network and also its limitations.
Sometimes neural networks are used as adaptive filters, the motivation for such an architecture being selectivity. You want the neural network to classify each input pattern into its appropriate category. Adaptive models involve changing of connection weights during all their operations, while nonadaptive ones do not alter the weights after the phase of learning with exemplars. The Hopfield network is often used in modeling a neural network for optimization problems, and the Backpropagation model is a popular choice in most other applications. Neural network models are distinguishable sometimes by their architecture, sometimes by their adaptive methods, and sometimes both. Methods for adaptation, where adaptation is incorporated, assume great significance in the description and utility of a neural network model.
For adaptation, you can modify parameters in an architecture during training, such as the learning rate in the backpropagation training method for example. A more radical approach is to modify the architecture itself during training. New neural network paradigms change the number or layers and the number of neurons in a layer during training. These node adding or pruning algorithms are termed constructive algorithms. (See Gallant for more details.)
The analogy for a neural network presented at the beginning of the chapter was that of a multidimensional mapping surface that maps inputs to outputs. For each unseen input with respect to a training set, the generalization ability of a network determines how well the mapping surface renders the new input in the output space. A stock market forecaster must generalize well, otherwise you lose money in unseen market conditions. The opposite of generalization is memorization. A pattern recognition system for images of handwriting, should be able to generalize a letter A that is handwritten in several different ways by different people. If the system memorizes, then you will not recognize the letter A in all cases, but instead will categorize each letter A variation separately. The trick to achieve generalization is in network architecture, design, and training methodology. You do not want to overtrain your neural network on expected outcomes, but rather should accept a slightly worse than minimum error on your training set data. You will learn more about generalization in Chapter 14.
Learning and training are important issues in applying neural networks. Two broad categories of network learning are supervised and unsupervised learning. Supervised learning provides example outputs to compare to while unsupervised learning does not. During supervised training, external prototypes are used as target outputs and the network is given a learning algorithm to follow and calculate new connection weights that bring the output closer to the target output. You can refer to networks using unsupervised learning as self-organizing networks, since no external information or guidance is used in learning. Several neural network paradigms were presented in this chapter along with their learning and training characteristics.