A Survey of Neural Network Models

Neural Network Models

You were introduced in the preceding pages to the Perceptron model, the Feedforward network, and the Hopfield network. You learned that the differences between the models lie in their architecture, encoding, and recall. We aim now to give you a comprehensive picture of these and other neural network models. We will show details and implementations of some networks in later chapters.

The models we briefly review in this chapter are the Perceptron, Hopfield, Adaline, Feed-Forward Backpropagation, Bidirectional Associative Memory, Brain-State-in-a-Box, Neocognitron, Fuzzy Associative Memory, ART1, and ART2. C++ implementations of some of these and the role of fuzzy logic in some will be treated in the subsequent chapters. For now, our discussion will be about the distinguishing characteristics of a neural network. We will follow it with the description of some of the models.

Layers in a Neural Network

A neural network has its neurons divided into subgroups, or fields, and elements in each subgroup are placed in a row, or a column, in the diagram depicting the network. Each subgroup is then referred to as a layer of neurons in the network. A great many models of neural networks have two layers, quite a few have one layer, and some have three or more layers. A number of additional, so-called hidden layers are possible in some networks, such as the Feed-forward backpropagation network. When the network has a single layer, the input signals are received at that layer, processing is done by its neurons, and output is generated at that layer. When more than one layer is present, the first field is for the neurons that supply the input signals for the neurons in the next layer.

Every network has a layer of input neurons, but in most of the networks, the sole purpose of these neurons is to feed the input to the next layer of neurons. However, there are feedback connections, or recurrent connections in some networks, so that the neurons in the input layer may also do some processing. In the Hopfield network you saw earlier, the input and output layers are the same. If any layer is present between the input and output layers, it may be referred to as a hidden layer in general, or as a layer with a special name taken after the researcher who proposed its inclusion to achieve certain performance from the network. Examples are the Grossberg and the Kohonen layers. The number of hidden layers is not limited except by the scope of the problem being addressed by the neural network.

A layer is also referred to as a field. Then the different layers can be designated as field A, field B, and so on, or shortly, FA, FB.

Single-Layer Network

A neural network with a single layer is also capable of processing for some important applications, such as integrated circuit implementations or assembly line control. The most common capability of the different models of neural networks is pattern recognition. But one network, called the Brain-State-in-a-Box, which is a single-layer neural network, can do pattern completion. Adaline is a network with A and B fields of neurons, but aggregation or processing of input signals is done only by the field B neurons.

The Hopfield network is a single-layer neural network. The Hopfield network makes an association between different patterns (heteroassociation) or associates a pattern with itself (autoassociation). You may characterize this as being able to recognize a given pattern. The idea of viewing it as a case of pattern recognition becomes more relevant if a pattern is presented with some noise, meaning that there is some slight deformation in the pattern, and if the network is able to relate it to the correct pattern.

The Perceptron technically has two layers, but has only one group of weights. We therefore still refer to it as a single-layer network. The second layer consists solely of the output neuron, and the first layer consists of the neurons that receive input(s). Also, the neurons in the same layer, the input layer in this case, are not interconnected, that is, no connections are made between two neurons in that same layer. On the other hand, in the Hopfield network, there is no separate output layer, and hence, it is strictly a single-layer network. In addition, the neurons are all fully connected with one another.

Let us spend more time on the single-layer Perceptron model and discuss its limitations, and thereby motivate the study of multilayer networks.

XOR Function and the Perceptron

The ability of a Perceptron in evaluating functions was brought into question when Minsky and Papert proved that a simple function like XOR (the logical function exclusive or) could not be correctly evaluated by a Perceptron. The XOR logical function, f(A,B), is as follows:

| A | B | f(A,B) = XOR(A,B) |
|---|---|-------------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

To summarize the behavior of the XOR, if both inputs are the same value, the output is 0, otherwise the output is 1.

Minsky and Papert showed that it is impossible to come up with the proper set of weights for the neurons in the single layer of a simple Perceptron to evaluate the XOR function. The reason is that such a Perceptron, one with a single layer of neurons, can evaluate a function only if the function is linearly separable, that is, only if the inputs can be separated according to their function values. The concept of linear separability is explained next. But let us show you first why the simple Perceptron fails to compute this function.

Since there are two arguments for the XOR function, there would be two neurons in the input layer, and since the function’s value is one number, there would be one output neuron. Therefore, you need two weights, w1 and w2, and a threshold value θ. Let us now look at the conditions to be satisfied by the w’s and the θ so that the outputs corresponding to given inputs would be as for the XOR function.

First the output should be 0 if inputs are 0 and 0. The activation works out as 0. To get an output of 0, you need 0 < θ. This is your first condition. The table shows this and two other conditions you need, and why.

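Conditions on Weights

| Input | Activation | Output | Needed Condition |
|-------|------------|--------|------------------|
| 0, 0 | 0 | 0 | 0 < θ |
| 1, 0 | w1 | 1 | w1 > θ |
| 0, 1 | w2 | 1 | w2 > θ |
| 1, 1 | w1 + w2 | 0 | w1 + w2 < θ |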

From the first three conditions, you can deduce that the sum of the two weights has to be greater than θ, which has to be positive itself: adding lines 2 and 3 gives w1 + w2 > 2θ, and 2θ > θ because θ > 0 by line 1. Line 4 is inconsistent with lines 1, 2, and 3, since line 4 requires the sum of the two weights to be less than θ. This affirms the contention that it is not possible to compute the XOR function with a simple Perceptron.
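The inconsistency can also be checked numerically. The following is a minimal sketch, not a proof: it tries every combination of w1, w2, and θ on a coarse grid (the grid itself and the helper names are assumptions made for illustration) and looks for a combination that reproduces a given truth table with the rule that the neuron fires only when its activation exceeds the threshold. AND and OR are included only for contrast.

```python
from itertools import product

def perceptron_output(w1, w2, theta, a, b):
    """Fire (output 1) only when the weighted sum exceeds the threshold."""
    return 1 if a * w1 + b * w2 > theta else 0

def find_weights(truth_table, grid):
    """Return the first (w1, w2, theta) on the grid that reproduces
    truth_table, or None if no grid point does."""
    for w1, w2, theta in product(grid, repeat=3):
        if all(perceptron_output(w1, w2, theta, a, b) == out
               for (a, b), out in truth_table.items()):
            return (w1, w2, theta)
    return None

# Hypothetical search grid: -2.0 to 2.0 in steps of 0.25.
GRID = [k / 4 for k in range(-8, 9)]

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

xor_solution = find_weights(XOR, GRID)
and_solution = find_weights(AND, GRID)
or_solution = find_weights(OR, GRID)

print("XOR:", xor_solution)   # no grid point works
print("AND:", and_solution)
print("OR: ", or_solution)
```

The search finds weights for AND and OR but none for XOR, in line with the argument from the four conditions. A finite grid cannot rule out every real-valued choice; it is the algebraic argument that does that.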

Geometrically, the reason for this failure is that the inputs (0, 1) and (1, 0) with which you want output 1, are situated diagonally opposite each other, when plotted as points in the plane, as shown below in a diagram of the output (1=T, 0=F):

 F     T
 T     F

You can’t separate the T’s and the F’s with a straight line. This means that you cannot draw a line in the plane in such a way that (0, 1) -> T and (1, 0) -> T lie on one side of the line while (1, 1) -> F and (0, 0) -> F lie on the other.

Linear Separability

Linearly separable means that a linear barrier or separator (a line in the plane, a plane in three-dimensional space, or a hyperplane in higher dimensions) exists such that the inputs that give rise to one value of the function all lie on one side of the barrier, while the inputs that do not yield that value lie on the other side. A hyperplane is a surface in a higher dimension, defined by a linear equation in much the same way as a line in the plane and a plane in three-dimensional space.

To make the concept a little bit clearer, consider a problem that is similar but, let us emphasize, not the same as the XOR problem.

Imagine a cube of 1-unit length for each of its edges, lying in the positive octant of an xyz-rectangular coordinate system with one corner at the origin. The other corners or vertices are at points with coordinates (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), and (1, 1, 1). Call the origin O, and the seven points listed A, B, C, D, E, F, and G, respectively. Then any two faces opposite to each other are linearly separable, because you can define the separating plane as the plane halfway between these two faces and parallel to them.

For example, consider the faces defined by the set of points O, A, B, and C and by the set of points D, E, F, and G. They are parallel and 1 unit apart, as you can see in the figure Separating plane. The separating plane for these two faces can be any one of many possible planes: any plane in between them and parallel to them. One example, for simplicity, is the plane that passes through the points (1/2, 0, 0), (1/2, 0, 1), (1/2, 1, 0), and (1/2, 1, 1). Of course, you need only specify three of those four points because a plane is uniquely determined by three points that are not all on the same line. So if the first set of points corresponds to a value of, say, +1 for the function, and the second set to a value of –1, then a single-layer Perceptron can determine, through some training algorithm, the correct weights for the connections, even if you start with the weights being initially all 0.

Separating plane.

Consider the set of points O, A, F, and G. This set of points cannot be linearly separated from the other vertices of the cube. In this case, it would be impossible for the single-layer Perceptron to determine the proper weights for the neurons in evaluating the type of function we have been discussing.
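To make the contrast between the two vertex sets concrete, here is a small sketch that trains a single neuron on the cube. The text only says "some training algorithm"; the classic Perceptron learning rule used here, the bias term standing in for a negative threshold, and the epoch limit are all assumptions chosen for illustration.

```python
VERTICES = {
    "O": (0, 0, 0), "A": (0, 0, 1), "B": (0, 1, 0), "C": (0, 1, 1),
    "D": (1, 0, 0), "E": (1, 0, 1), "F": (1, 1, 0), "G": (1, 1, 1),
}

def train_perceptron(positive, max_epochs=100):
    """Perceptron learning rule with targets +1/-1, starting from zero weights.

    Returns (weights, bias, epochs_used) once a full pass over the vertices
    produces no error, or None if errors remain after max_epochs passes.
    """
    w = [0.0, 0.0, 0.0]
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for name, x in VERTICES.items():
            target = 1 if name in positive else -1
            s = sum(wi * xi for wi, xi in zip(w, x)) + bias
            predicted = 1 if s > 0 else -1
            if predicted != target:
                # Move the separating plane toward the misclassified vertex.
                w = [wi + target * xi for wi, xi in zip(w, x)]
                bias += target
                errors += 1
        if errors == 0:
            return w, bias, epoch
    return None

opposite_faces = train_perceptron({"O", "A", "B", "C"})
diagonal_set = train_perceptron({"O", "A", "F", "G"})

print("faces OABC vs DEFG:", opposite_faces)
print("set OAFG vs BCDE:  ", diagonal_set)
```

With the opposite faces the rule settles on a separating plane after a few passes. With O, A, F, and G it never completes an error-free pass, so the function returns None. Running out of epochs is only a symptom; the underlying reason is that no separating plane exists for that set.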

A Second Look at the XOR Function: Multilayer Perceptron

By introducing a set of cascaded Perceptrons, you have a Perceptron network, with an input layer, middle or hidden layer, and an output layer. You will see that the multilayer Perceptron can evaluate the XOR function as well as other logic functions (AND, OR, MAJORITY, etc.). The absence of the separability that we talked about earlier is overcome by having a second stage, so to speak, of connection weights.

You need two neurons in the input layer and one in the output layer. Let us put in a hidden layer with two neurons. Let w11, w12, w21, and w22 be the weights on connections from the input neurons to the hidden layer neurons. Let v1 and v2 be the weights on the connections from the hidden layer neurons to the output neuron.

We will select the w’s (weights) and the threshold values θ1 and θ2 at the hidden layer neurons, so that the input (0, 0) generates the hidden layer output vector (0, 0), and the input vector (1, 1) generates (1, 1), while the inputs (1, 0) and (0, 1) generate (0, 1) as the hidden layer output. The inputs to the output layer neuron would be from the set {(0, 0), (1, 1), (0, 1)}. These three vectors are separable, with (0, 0) and (1, 1) on one side of the separating line, while (0, 1) is on the other side.

We will select the v’s (weights) and τ, the threshold value at the output neuron, so as to make the inputs (0, 0) and (1, 1) cause an output of 0 for the network, and the input (0, 1) cause an output of 1. The network layout, with the labels of weights and threshold values inside the nodes representing hidden layer and output neurons, is shown in the figure Example network. The table Results for the Perceptron with One Hidden Layer gives the results of operation of this network.

Example network.

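Results for the Perceptron with One Hidden Layer.

| Input | Hidden Layer Activations | Hidden Layer Outputs | Output Neuron Activation | Output of Network |
|-------|--------------------------|----------------------|--------------------------|-------------------|
| (0, 0) | (0, 0) | (0, 0) | 0 | 0 |
| (1, 1) | (0.3, 0.6) | (1, 1) | 0 | 0 |
| (0, 1) | (0.15, 0.3) | (0, 1) | 0.3 | 1 |
| (1, 0) | (0.15, 0.3) | (0, 1) | 0.3 | 1 |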


Note:  The activation should exceed the threshold value for a neuron to fire. Where the output of a neuron is shown to be 0, it is because the internal activation of that neuron fell short of its threshold value.
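The tabulated results can be reproduced with a short forward pass. In this sketch the weights are the ones implied by the table entries (0.15 on both connections into the first hidden neuron, 0.3 on both into the second, and v1 = -0.3, v2 = 0.3). The threshold values appear only in the figure, so θ1 = θ2 = 0.2 and τ = 0.2 are assumptions chosen to be consistent with the table.

```python
def step(activation, threshold):
    """A neuron fires (1) only if its activation exceeds its threshold."""
    return 1 if activation > threshold else 0

# W[i][j]: weight from input neuron i+1 to hidden neuron j+1 (implied by the table).
W = [[0.15, 0.3],
     [0.15, 0.3]]
THETA = [0.2, 0.2]   # assumed hidden-layer thresholds
V = [-0.3, 0.3]      # hidden-to-output weights implied by the table
TAU = 0.2            # assumed output threshold

def xor_network(a, b):
    """Return (hidden activations, hidden outputs, output activation, output)."""
    inputs = (a, b)
    hidden_act = [sum(inputs[i] * W[i][j] for i in range(2)) for j in range(2)]
    hidden_out = [step(hidden_act[j], THETA[j]) for j in range(2)]
    out_act = sum(hidden_out[j] * V[j] for j in range(2))
    return hidden_act, hidden_out, out_act, step(out_act, TAU)

for pattern in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(pattern, xor_network(*pattern))
```

Any thresholds with 0.15 ≤ θ1 < 0.3, θ2 < 0.3 (and θ2 ≥ 0), and 0 ≤ τ < 0.3 would give the same outputs; the particular values above are just one convenient choice.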


Example of the Cube Revisited

Let us return to the example of the cube with vertices at the origin O, and the points labeled A, B, C, D, E, F, and G. Suppose the set of vertices O, A, F, and G give a value of 1 for the function to be evaluated, and the other vertices give a –1. The two sets are not linearly separable as mentioned before. A simple Perceptron cannot evaluate this function.

Can the addition of another layer of neurons help? The answer is yes. What would be the role of this additional layer? It will do the final processing for the problem after the previous layer has done some preprocessing. The preprocessing layer can do two separations, in the sense that the set of eight vertices can be separated, or partitioned, into three separable subsets. If this partitioning also collects like vertices within each subset, meaning those that map onto the same value for the function, the network will succeed in its task of evaluating the function when the aggregation and thresholding are done at the output neuron.

Strategy

So the strategy is first to consider the set of vertices that give a value of +1 for the function and determine the minimum number of subsets that can each be identified as separable from the rest of the vertices. It is evident that since the vertices O and A lie on one edge of the cube, they can form one subset that is separable. The other two vertices that correspond to the value +1 for the function, F and G, can form a second subset that is separable, too. We need not bother with further partitioning the subset of the last four vertices. It is clear that one new layer of three neurons, one of which fires for the inputs corresponding to the vertices O and A, one for F and G, and the third for the rest, will then facilitate the correct evaluation of the function at the output neuron.

Details

The table lists the vertices and their coordinates, together with a flag that indicates to which subset in the partitioning the vertex belongs. Note that you can think of the action of the Multilayer Perceptron as that of evaluating the intersection and union of linearly separable subsets.

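Partitioning of Vertices of a Cube

| Vertex | Coordinates | Subset |
|--------|-------------|--------|
| O | (0,0,0) | 1 |
| A | (0,0,1) | 1 |
| B | (0,1,0) | 2 |
| C | (0,1,1) | 2 |
| D | (1,0,0) | 2 |
| E | (1,0,1) | 2 |
| F | (1,1,0) | 3 |
| G | (1,1,1) | 3 |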

The network, which is a two-layer Perceptron, meaning two layers of weights, has three neurons in the first layer and one output neuron in the second layer. Remember that we are counting those layers in which the neurons do the aggregation of the signals coming into them using the connection weights. The first layer with the three neurons is what is generally described as the hidden layer, since the second layer is not hidden and is at the extreme right in the layout of the neural network. The table gives an example of the weights you can use for the connections between the input neurons and the hidden layer neurons. There are three input neurons, one for each coordinate of the vertex of the cube.

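Weights for Connections Between Input Neurons and Hidden Layer Neurons.

| Input Neuron # | Hidden Layer Neuron # | Connection Weight |
|----------------|-----------------------|-------------------|
| 1 | 1 | 1 |
| 1 | 2 | 0.1 |
| 1 | 3 | -1 |
| 2 | 1 | 1 |
| 2 | 2 | -1 |
| 2 | 3 | -1 |
| 3 | 1 | 0.2 |
| 3 | 2 | 0.3 |
| 3 | 3 | 0.6 |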

Now we give the weights for the connections between the three hidden-layer neurons and the output neuron.

Weights for Connections Between the Hidden-Layer Neurons and the Output Neuron

| Hidden Layer Neuron # | Connection Weight |
|-----------------------|-------------------|
| 1 | 0.6 |
| 2 | 0.3 |
| 3 | 0.6 |

It is not apparent whether or not these weights will do the job. To determine the activations of the hidden-layer neurons, you need these weights, and you also need the threshold value at each neuron that does processing. A hidden-layer neuron will fire, that is, will output a 1, if the weighted sum of the signals it receives is greater than the threshold value. If the output neuron fires, the function value is taken as +1, and if it does not fire, the function value is –1.

Neural Network for Cube Example

Threshold Values

| Layer | Neuron | Threshold Value |
|-------|--------|-----------------|
| hidden | 1 | 1.8 |
| hidden | 2 | 0.05 |
| hidden | 3 | -0.2 |
| output | 1 | 0.5 |

Performance of the Perceptron

When you input the coordinates of the vertex G, which has 1 for each coordinate, the first hidden-layer neuron aggregates these inputs and gets a value of 2.2. Since 2.2 is more than the threshold value of the first neuron in the hidden layer, that neuron fires, and its output of 1 becomes an input to the output neuron on the connection with weight 0.6. But you need the activations of the other hidden-layer neurons as well. Let us describe the performance with coordinates of G as the inputs to the network. The table describes this:

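Results with Coordinates of Vertex G as Input

| Vertex/Coordinates | Hidden Layer Neuron # | Weighted Sum | Comment | Activation | Contribution to Output | Sum |
|--------------------|-----------------------|--------------|---------|------------|------------------------|-----|
| G: 1, 1, 1 | 1 | 2.2 | >1.8 | 1 | 0.6 | |
| | 2 | -0.6 | <0.05 | 0 | 0 | |
| | 3 | -1.4 | <-0.2 | 0 | 0 | 0.6 |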

The weighted sum at the output neuron is 0.6, and it is greater than the threshold value 0.5. Therefore, the output neuron fires, and at the vertex G, the function is evaluated to have a value of +1.

The table shows the performance of the network with the rest of the vertices of the cube. You will notice that the network computes a value of +1 at the vertices, O, A, F, and G, and a –1 at the rest:

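Results with Other Inputs

| Vertex/Coordinates | Hidden Layer Neuron # | Weighted Sum | Comment | Activation | Contribution to Output | Sum |
|--------------------|-----------------------|--------------|---------|------------|------------------------|-----|
| O: 0, 0, 0 | 1 | 0 | <1.8 | 0 | 0 | |
| | 2 | 0 | <0.05 | 0 | 0 | |
| | 3 | 0 | >-0.2 | 1 | 0.6 | 0.6* |
| A: 0, 0, 1 | 1 | 0.2 | <1.8 | 0 | 0 | |
| | 2 | 0.3 | >0.05 | 1 | 0.3 | |
| | 3 | 0.6 | >-0.2 | 1 | 0.6 | 0.9* |
| B: 0, 1, 0 | 1 | 1 | <1.8 | 0 | 0 | |
| | 2 | -1 | <0.05 | 0 | 0 | |
| | 3 | -1 | <-0.2 | 0 | 0 | 0 |
| C: 0, 1, 1 | 1 | 1.2 | <1.8 | 0 | 0 | |
| | 2 | -0.7 | <0.05 | 0 | 0 | |
| | 3 | -0.4 | <-0.2 | 0 | 0 | 0 |
| D: 1, 0, 0 | 1 | 1 | <1.8 | 0 | 0 | |
| | 2 | 0.1 | >0.05 | 1 | 0.3 | |
| | 3 | -1 | <-0.2 | 0 | 0 | 0.3 |
| E: 1, 0, 1 | 1 | 1.2 | <1.8 | 0 | 0 | |
| | 2 | 0.4 | >0.05 | 1 | 0.3 | |
| | 3 | -0.4 | <-0.2 | 0 | 0 | 0.3 |
| F: 1, 1, 0 | 1 | 2 | >1.8 | 1 | 0.6 | |
| | 2 | -0.9 | <0.05 | 0 | 0 | |
| | 3 | -2 | <-0.2 | 0 | 0 | 0.6* |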


*The output neuron fires, as this value is greater than 0.5 (the threshold value); the function value is +1.
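The whole evaluation can be repeated mechanically. The sketch below runs a forward pass of the cube network over all eight vertices, using the connection weights and threshold values tabulated above. It assumes that input neurons 1, 2, and 3 carry the x, y, and z coordinates, and that the three hidden-to-output weights 0.6, 0.3, and 0.6 belong to hidden neurons 1, 2, and 3 in that order.

```python
VERTICES = {
    "O": (0, 0, 0), "A": (0, 0, 1), "B": (0, 1, 0), "C": (0, 1, 1),
    "D": (1, 0, 0), "E": (1, 0, 1), "F": (1, 1, 0), "G": (1, 1, 1),
}

# W_HIDDEN[i][j]: weight from input neuron i+1 to hidden neuron j+1.
W_HIDDEN = [
    [1.0, 0.1, -1.0],
    [1.0, -1.0, -1.0],
    [0.2, 0.3, 0.6],
]
THETA_HIDDEN = [1.8, 0.05, -0.2]
W_OUT = [0.6, 0.3, 0.6]   # assumed order: hidden neurons 1, 2, 3
THETA_OUT = 0.5

def evaluate(vertex):
    """Return (hidden weighted sums, hidden outputs, output sum, function value)."""
    sums = [sum(vertex[i] * W_HIDDEN[i][j] for i in range(3)) for j in range(3)]
    fired = [1 if s > t else 0 for s, t in zip(sums, THETA_HIDDEN)]
    out_sum = sum(f * w for f, w in zip(fired, W_OUT))
    value = 1 if out_sum > THETA_OUT else -1
    return sums, fired, out_sum, value

results = {name: evaluate(xyz) for name, xyz in VERTICES.items()}
for name, (sums, fired, out_sum, value) in results.items():
    print(name, VERTICES[name], "hidden outputs:", fired, "function value:", value)
```

The function value comes out as +1 at O, A, F, and G and as -1 at B, C, D, and E, which is the partition the network was designed to realize.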


Other Two-layer Networks

Many important neural network models have two layers. The Feedforward backpropagation network, in its simplest form, is one example. Grossberg and Carpenter’s ART1 paradigm uses a two-layer network. The Counterpropagation network has a Kohonen layer followed by a Grossberg layer. Bidirectional Associative Memory (BAM), Boltzmann Machine, Fuzzy Associative Memory, and Temporal Associative Memory are other two-layer networks. For autoassociation, a single-layer network could do the job, but for heteroassociation or other such mappings, you need at least a two-layer network. We will give more details on these models shortly.

Many Layer Networks

Kunihiko Fukushima’s Neocognitron, noted for identifying handwritten characters, is an example of a network with several layers. Some previously mentioned networks can also be multilayered from the addition of more hidden layers. It is also possible to combine two or more neural networks into one network by creating appropriate connections between layers of one subnetwork to those of the others. This would certainly create a multilayer network.

Connections Between Layers

You have already seen some difference in the way connections are made between neurons in a neural network. In the Hopfield network, every neuron was connected to every other in the one layer that was present in the network. In the Perceptron, neurons within the same layer were not connected with one another, but the connections were between the neurons in one layer and those in the next layer. In the former case, the connections are described as being lateral. In the latter case, the connections are forward and the signals are fed forward within the network.

Two other possibilities also exist. The first is that all the neurons in any layer may have extra connections, with each neuron connected to itself. The second is that there are connections from the neurons in one layer to the neurons in a previous layer, in which case there is both forward and backward signal feeding. This occurs if feedback is a feature of the network model. The type of layout for the network neurons and the type of connections between the neurons constitute the architecture of the particular model of the neural network.

Instar and Outstar

Outstar and instar are terms defined by Stephen Grossberg for ways of looking at neurons in a network. A neuron in a web of other neurons receives a large number of inputs from outside the neuron’s boundaries. This is like an inwardly radiating star, hence the term instar. Also, a neuron may be sending its output to many other destinations in the network. In this way it is acting as an outstar. Every neuron is thus simultaneously both an instar and an outstar. As an instar it receives stimuli from other parts of the network or from outside the network. Note that the neurons in the input layer of a network primarily have connections away from them to the neurons in the next layer, and thus behave mostly as outstars. Neurons in the output layer have many connections coming to them and thus behave mostly as instars. A neural network performs its work through the constant interaction of instars and outstars.

A layer of instars can constitute a competitive layer in a network. An outstar can also be described as a source node with some associated sink nodes that the source feeds to. Grossberg identifies the source input with a conditioned stimulus and the sink inputs with unconditioned stimuli. Robert Hecht-Nielsen’s Counterpropagation network is a model built with instars and outstars.

Weights on Connections

Weight assignments on connections between neurons not only indicate the strength of the signal that is being fed for aggregation but also the type of interaction between the two neurons. The type of interaction is one of cooperation or of competition. The cooperative type is suggested by a positive weight, and the competition by a negative weight, on the connection. The positive weight connection is meant for what is called excitation, while the negative weight connection is termed an inhibition.

Initialization of Weights

Initializing the network weight structure is part of what is called the encoding phase of a network operation. There are several encoding algorithms, differing by model and by application. You may have gotten the impression that the weight matrices used in the examples discussed in detail thus far have been arbitrarily determined, or that if there is a method of setting them up, you have not been told what it is.

It is possible to start with randomly chosen values for the weights and to let the weights be adjusted appropriately as the network is run through successive iterations. This also makes the setup easier, since no weights have to be worked out in advance. For example, under supervised training, if the error between the desired and computed output is used as a criterion in adjusting weights, then one may as well set the initial weights to zero and let the training process take care of the rest. The small example that follows illustrates this point.

A Small Example

Suppose you have a network with two input neurons and one output neuron, with forward connections between the input neurons and the output neuron, as shown in the figure Neural network with forward connections. The network is required to output a 1 for the input patterns (1, 0) and (1, 1), and the value 0 for (0, 1) and (0, 0). There are only two connection weights, w1 and w2.

Neural network with forward connections.

Let us initially set both weights to 0. You also need a threshold function. Let us use the following threshold function, which is slightly different from the one used in a previous example:

$$f(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}$$

The reason for modifying this function is that if f(x) has value 1 when x = 0, then no matter what the weights are, the output will work out to 1 with input (0, 0). This makes it impossible to get a correct computation of any function that takes the value 0 for the arguments (0, 0).

Now we need to know by what procedure we adjust the weights. The procedure we would apply for this example is as follows.

  If the output with input pattern (a, b) is as desired, then do not adjust the weights.

  If the output with input pattern (a, b) is smaller than what it should be, then increment each of w1 and w2 by 1.

  If the output with input pattern (a, b) is greater than what it should be, then subtract 1 from w1 if the product aw1 is not smaller than 1 (that is, if that connection contributed to the activation), and adjust w2 similarly.

The table shows what takes place when we follow these procedures, and at what values the weights settle:

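Adjustment of Weights

| step | w1 | w2 | a | b | activation | output | comment |
|------|----|----|---|---|------------|--------|---------|
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | desired output is 1; increment both w’s |
| 2 | 1 | 1 | 1 | 1 | 2 | 1 | output is what it should be |
| 3 | 1 | 1 | 1 | 0 | 1 | 1 | output is what it should be |
| 4 | 1 | 1 | 0 | 1 | 1 | 1 | output is 1; it should be 0. |
| 5 | | | | | | | subtract 1 from w2 |
| 6 | 1 | 0 | 0 | 1 | 0 | 0 | output is what it should be |
| 7 | 1 | 0 | 0 | 0 | 0 | 0 | output is what it should be |
| 8 | 1 | 0 | 1 | 1 | 1 | 1 | output is what it should be |
| 9 | 1 | 0 | 1 | 0 | 1 | 1 | output is what it should be |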
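The adjustment procedure can be written out as a short loop. This is a sketch under one stated assumption: the third rule is read as "subtract 1 from a weight when the product of that weight and its input is at least 1", which is the reading that reproduces step 5 of the table (only w2 is reduced). The function names, the pattern order, and the pass limit are illustrative choices.

```python
def f(x):
    """Threshold function of this example: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0

def adjust(w1, w2, a, b, desired):
    """Apply the three adjustment rules once for input pattern (a, b)."""
    output = f(a * w1 + b * w2)
    if output < desired:          # output too small: increment both weights
        return w1 + 1, w2 + 1
    if output > desired:          # output too large: reduce contributing weights
        if a * w1 >= 1:           # assumed reading of the third rule
            w1 -= 1
        if b * w2 >= 1:
            w2 -= 1
    return w1, w2                 # unchanged when the output is as desired

TARGETS = {(1, 1): 1, (1, 0): 1, (0, 1): 0, (0, 0): 0}

def train(max_passes=20):
    """Cycle through the patterns until a full pass changes nothing."""
    w1, w2 = 0, 0
    for _ in range(max_passes):
        changed = False
        for (a, b), desired in TARGETS.items():
            new_w = adjust(w1, w2, a, b, desired)
            changed = changed or new_w != (w1, w2)
            w1, w2 = new_w
        if not changed:
            return w1, w2
    return None

final_weights = train()
print("final weights:", final_weights)
```

Starting from zero weights, the loop settles at w1 = 1 and w2 = 0, the same values the table ends with.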

This example is not of a network for pattern matching. If you think about it, you will realize that the network is designed to fire if the first digit in the pattern is a 1, and not otherwise. An analogy for this kind of problem is determining whether a given image contains a specific object in a specific part of the image, such as the dot that should occur in the letter i.

If the initial weights are chosen prudently, so that they have some relevance to the problem, then the speed of operation can be increased, in the sense that convergence is achieved with fewer iterations than otherwise. Thus, encoding algorithms are important. We now present some of the encoding algorithms.

Initializing Weights for Autoassociative Networks

Consider a network that is to associate each input pattern with itself and which gets binary patterns as inputs. Make a bipolar mapping on the input pattern. That is, replace each 0 by –1. Call the mapped pattern the vector $x$, when written as a column vector. The transpose, the same vector written as a row vector, is $x^T$. You will get a square matrix of order the size of $x$ when you form the product $x x^T$. Obtain similar matrices for the other patterns you want the network to store. Add these matrices to give you the matrix of weights to be used initially, as we did in the chapter Constructing a Neural Network. This process can be described with the following equation:

$$W = \sum_i x_i x_i^T$$
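As an illustration of this sum of outer products, the following sketch builds the weight matrix for two small binary patterns. The patterns and helper names are made up for the example, and the final recall check (multiplying a stored pattern by W) is an added sanity test rather than part of the encoding itself.

```python
def to_bipolar(pattern):
    """Bipolar mapping: replace each 0 by -1."""
    return [1 if bit == 1 else -1 for bit in pattern]

def outer(x, y):
    """Outer product x y^T as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]

def autoassociative_weights(binary_patterns):
    """W = sum over the patterns of x x^T, with x the bipolar form of each pattern."""
    n = len(binary_patterns[0])
    W = [[0] * n for _ in range(n)]
    for pattern in binary_patterns:
        x = to_bipolar(pattern)
        xxT = outer(x, x)
        W = [[W[i][j] + xxT[i][j] for j in range(n)] for i in range(n)]
    return W

# Two hypothetical 4-bit patterns whose bipolar forms are orthogonal.
patterns = [[1, 0, 1, 0], [1, 1, 0, 0]]
W = autoassociative_weights(patterns)
for row in W:
    print(row)

# Recall check: W x is a positive multiple of x for each stored pattern.
x1 = to_bipolar(patterns[0])
recalled = [sum(W[i][j] * x1[j] for j in range(4)) for i in range(4)]
print("W x1 =", recalled)
```

Because the two bipolar vectors are orthogonal, W x1 equals 4 x1, so thresholding at zero returns the stored pattern unchanged.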

Weight Initialization for Heteroassociative Networks

Consider a network that is to associate one input pattern with another pattern and that gets binary patterns as inputs. Make a bipolar mapping on the input pattern. That is, replace each 0 by –1. Call the mapped pattern the vector $x$ when written as a column vector. Get a similar bipolar mapping for the corresponding associated pattern. Call it $y$. You will get a matrix of size $x$ by size $y$ when you form the product $x y^T$. Obtain similar matrices for the other patterns you want the network to store. Add these matrices to give you the matrix of weights to be used initially. The following equation restates this process:

$$W = \sum_i x_i y_i^T$$
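The heteroassociative case differs only in pairing each input with a different output vector. The sketch below uses two invented pattern pairs (3-bit inputs, 2-bit outputs). The recall step, which thresholds $x^T W$ at zero, is an assumption added to show that the matrix does associate each x with its y; it is not prescribed by the text.

```python
def to_bipolar(pattern):
    """Bipolar mapping: replace each 0 by -1."""
    return [1 if bit == 1 else -1 for bit in pattern]

def heteroassociative_weights(pairs):
    """W = sum over the pairs of x y^T, with x and y in bipolar form."""
    rows, cols = len(pairs[0][0]), len(pairs[0][1])
    W = [[0] * cols for _ in range(rows)]
    for x_bits, y_bits in pairs:
        x, y = to_bipolar(x_bits), to_bipolar(y_bits)
        for i in range(rows):
            for j in range(cols):
                W[i][j] += x[i] * y[j]
    return W

def recall(W, x_bits):
    """Assumed recall step: threshold x^T W at zero to get a bipolar vector."""
    x = to_bipolar(x_bits)
    sums = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
    return [1 if s > 0 else -1 for s in sums]

# Hypothetical pattern pairs (input pattern, associated pattern).
pairs = [([1, 0, 1], [1, 0]), ([1, 1, 0], [0, 1])]
W = heteroassociative_weights(pairs)
for row in W:
    print(row)
print(recall(W, [1, 0, 1]), recall(W, [1, 1, 0]))
```

The matrix has as many rows as x has components and as many columns as y has, matching the "size x by size y" description.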

On Center, Off Surround

One of the many interesting paradigms you encounter in neural network models and theory is the strategy winner takes all. If there should be one winner emerging from a crowd of neurons in a particular layer, there needs to be competition. Since everybody is for himself in such a competition, in this case every neuron for itself, it would be necessary to have lateral connections that indicate this circumstance. The lateral connections from any neuron to the others should have a negative weight. Alternatively, the neuron with the highest activation is considered the winner and only its weights are modified in the training process, leaving the weights of the others the same. Winner takes all means that only one neuron in that layer fires and the others do not. This can happen in a hidden layer or in the output layer.

In another situation, when a particular category of input is to be identified from among several groups of inputs, there has to be a subset of the neurons that are dedicated to seeing it happen. In this case, inhibition increases for distant neurons, whereas excitation increases for the neighboring ones, as far as such a subset of neurons is concerned. The phrase on center, off surround describes this phenomenon of distant inhibition and near excitation.
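The second form of winner takes all described above, where the neuron with the highest activation wins and only its weights change, can be sketched as follows. The text does not say how the winner's weights are modified, so the update used here (moving the winner's weights a fraction of the way toward the input) and the rate of 0.5 are assumptions made purely for illustration, as are the sample weights and input.

```python
def winner_takes_all(weights, x):
    """Return the index of the neuron with the highest activation and the
    layer output, in which only that neuron fires."""
    activations = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    winner = max(range(len(activations)), key=activations.__getitem__)
    outputs = [1 if j == winner else 0 for j in range(len(weights))]
    return winner, outputs

def update_winner(weights, x, winner, rate=0.5):
    """Assumed update: move only the winner's weights toward the input."""
    updated = [list(w) for w in weights]
    updated[winner] = [w_i + rate * (x_i - w_i)
                       for w_i, x_i in zip(weights[winner], x)]
    return updated

# Hypothetical layer of three neurons with two inputs each.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [0.2, 0.8]

winner, outputs = winner_takes_all(weights, x)
new_weights = update_winner(weights, x, winner)
print("winner:", winner, "outputs:", outputs)
print("weights after update:", new_weights)
```

Only one entry of the output vector is 1, and only the corresponding row of the weight matrix changes; the other neurons keep their weights.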

Weights also are the prime components in a neural network, as they reflect on the one hand the memory stored by the network, and on the other hand the basis for learning and training.

Inputs

You have seen that mutually orthogonal or almost orthogonal patterns are required as stable stored patterns for the Hopfield network, which we discussed before for pattern matching. Similar restrictions are found also with other neural networks. Sometimes it is not a restriction, but the purpose of the model makes natural a certain type of input. Certainly, in the context of pattern classification, binary input patterns make problem setup simpler. Binary, bipolar, and analog signals are the varieties of inputs. Networks that accept analog signals as inputs are for continuous models, and those that require binary or bipolar inputs are for discrete models. Binary inputs can be fed to networks for continuous models, but analog signals cannot be input to networks for discrete models (unless they are fuzzified). With input possibilities being discrete or analog, and the model possibilities being discrete or continuous, there are potentially four situations, but one of them where analog inputs are considered for a discrete model is untenable.

An example of a continuous model is where a network is to adjust the angle by which the steering wheel of a truck is to be turned to back up the truck into a parking space. If a network is supposed to recognize characters of the alphabet, a means of discretization of a character allows the use of a discrete model.

What are the types of inputs for problems like image processing or handwriting analysis? Remembering that artificial neurons, as processing elements, aggregate their inputs by using connection weights, and that the output neuron uses a threshold function, you know that the inputs have to be numerical. A handwritten character can be superimposed on a grid, and the input can consist of the cells in each row of the grid where a part of the character is present. In other words, the input corresponding to one character will be a set of binary or gray-scale sequences containing one sequence for each row of the grid. A 1 in a particular position in the sequence for a row shows that the corresponding pixel is present (black) in that part of the grid, while a 0 shows it is not. The size of the grid has to be big enough to accommodate the largest character under study, as well as the most complex features.
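A minimal sketch of this grid encoding follows. The 5 x 5 grid, the letter drawn on it, and the use of '#' for a black cell are all illustrative assumptions; the point is only that each row becomes one binary sequence and that the rows together form the numerical input.

```python
# A hypothetical 5 x 5 grid holding the letter T; '#' marks a black cell.
CHARACTER = [
    "#####",
    "..#..",
    "..#..",
    "..#..",
    "..#..",
]

def encode_rows(grid):
    """One binary sequence per row: 1 where the character is present, else 0."""
    return [[1 if cell == "#" else 0 for cell in row] for row in grid]

def flatten(rows):
    """Concatenate the row sequences into a single input vector."""
    return [bit for row in rows for bit in row]

rows = encode_rows(CHARACTER)
inputs = flatten(rows)

for row in rows:
    print(row)
print("number of inputs:", len(inputs))
```

A network for this grid would need 25 input neurons, one per cell. A gray-scale encoding would replace the 0/1 entries with intensity values.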

Outputs

The output from some neural networks is a spatial pattern that can include a bit pattern, in some a binary function value, and in some others an analog signal. The type of mapping intended for the inputs determines the type of outputs, naturally. The output could be one of classifying the input data, or finding associations between patterns of the same dimension as the input.

The threshold functions do the final mapping of the activations of the output neurons into the network outputs. But the outputs from a single cycle of operation of a neural network may not be the final outputs, since you would iterate the network into further cycles of operation until you see convergence. If convergence seems possible, but is taking an awful lot of time and effort, that is, if it is too slow to learn, you may assign a tolerance level and settle for the network to achieve near convergence.

The Threshold Function

The output of any neuron is the result of thresholding, if any, of its internal activation, which, in turn, is the weighted sum of the neuron’s inputs. Thresholding sometimes is done for the sake of scaling down the activation and mapping it into a meaningful output for the problem, and sometimes for adding a bias. Thresholding (scaling) is important for multilayer networks to preserve a meaningful range across each layer’s operations. The most often used threshold function is the sigmoid function. A step function or a ramp function or just a linear function can be used, as when you simply add the bias to the activation. The sigmoid function accomplishes mapping the activation into the interval [0, 1]. The equations are given as follows for the different threshold functions just mentioned.

The Sigmoid Function

More than one function goes by the name sigmoid function. They differ in their formulas and in their ranges. They all have a graph similar to a stretched letter s. We give below two such functions. The first is the hyperbolic tangent function with values in (–1, 1). The second is the logistic function and has values between 0 and 1. You therefore choose the one that fits the range you want. The graph of the sigmoid logistic function is given in Fig. 5.3.

1.  $f(x) = \tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$

2.  $f(x) = 1 / (1 + e^{-x})$

Note that the first function here, the hyperbolic tangent function, can also be written as $1 - 2e^{-x} / (e^{x} + e^{-x})$ after adding and also subtracting $e^{-x}$ to the numerator, and then simplifying. If now you multiply in the second term both numerator and denominator by $e^{x}$, you get $1 - 2 / (e^{2x} + 1)$. As x approaches -∞, this function goes to -1, and as x approaches +∞, it goes to +1. On the other hand, the second function here, the sigmoid logistic function, goes to 0 as x approaches -∞, and to +1 as x approaches +∞. You can see this if you rewrite $1 / (1 + e^{-x})$ as $1 - 1 / (1 + e^{x})$, after manipulations similar to those above.

You can think of equation 1 as the bipolar equivalent of binary equation 2. Both functions have the same shape.
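
As a small illustration, the two sigmoid functions can be written in C++ as below. The function names are our own choices for this sketch. The hyperbolic tangent is coded in its rewritten form, 1 - 2/(e^(2x) + 1), so that it can be compared with the library function std::tanh.

```cpp
#include <cmath>

// Logistic sigmoid (function 2): maps any activation into (0, 1).
double logistic(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Hyperbolic tangent (function 1) in the rewritten form
// 1 - 2 / (e^(2x) + 1): maps any activation into (-1, 1).
double bipolarSigmoid(double x) {
    return 1.0 - 2.0 / (std::exp(2.0 * x) + 1.0);
}
```

The bipolar and binary versions are related by tanh(x) = 2 * logistic(2x) - 1, which is one way to see that both have the same shape and differ only in scaling and range.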

Figure The sigmoid function shows the graph of the sigmoid logistic function (number 2 of the preceding list).

The sigmoid function.

The Step Function

The step function is also frequently used as a threshold function. The function is 0 to start with and remains so up to some threshold value θ. Just to the right of θ it jumps to 1, and it then remains at the level 1. In general, a step function can have a finite number of points at which jumps of equal or unequal size occur. When the jumps are equal and at many points, the graph will resemble a staircase. We are interested in a step function that goes from 0 to 1 in one step, as soon as the argument exceeds the threshold value θ. You could also have two values other than 0 and 1 in defining the range of values of such a step function. A graph of the step function follows in Figure The step function.


The step function.


Note:  You can think of a sigmoid function as a fuzzy step function.
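
A sketch of the single-jump step function in C++ follows. The names stepFunction and steepLogistic, and the gain parameter, are assumptions made for this illustration. The second function is only there to make the note above concrete: a logistic function with a large gain, centered at θ, behaves almost like the step.

```cpp
#include <cmath>

// Step threshold: stays at 'low' up to and including theta, and jumps to
// 'high' as soon as the argument exceeds theta. The defaults give the
// 0-to-1 step; pass -1 and 1 for a bipolar step.
double stepFunction(double x, double theta, double low = 0.0, double high = 1.0) {
    return (x > theta) ? high : low;
}

// A logistic function centered at theta. 'gain' is a hypothetical
// steepness parameter: the larger it is, the closer the curve comes
// to the step function above.
double steepLogistic(double x, double theta, double gain) {
    return 1.0 / (1.0 + std::exp(-gain * (x - theta)));
}
```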


The Ramp Function

To describe the ramp function simply, first consider a step function that makes a jump from 0 to 1 at some point. Instead of letting it take a sudden jump like that at one point, let it gradually gain in value, along a straight line (which looks like a ramp), over a finite interval, rising from an initial 0 to a final 1. Thus, you get a ramp function. You can think of a ramp function as a piecewise linear approximation of a sigmoid. The graph of a ramp function is illustrated in Figure Graph of a ramp function.


Graph of a ramp function.

Linear Function

A linear function is a simple one given by an equation of the form:  f(x) = αx + β

When α = 1, the application of this threshold function amounts to simply adding a bias equal to β to the sum of the inputs.

Applications

As briefly indicated before, the areas of application generally include auto- and heteroassociation, pattern recognition, data compression, data completion, signal filtering, image processing, forecasting, handwriting recognition, and optimization. The type of connections in the network and the type of learning algorithm used must be chosen to suit the application. For example, a network with lateral connections can do autoassociation, while a feed-forward type can do forecasting.

Some Neural Network Models

Adaline and Madaline

Adaline is the acronym for adaptive linear element, due to Bernard Widrow and Marcian Hoff. It is similar to a Perceptron. Inputs are real numbers in the interval [–1,+1], and learning is based on the criterion of minimizing the average squared error. Adaline has a high capacity to store patterns. Madaline stands for many Adalines and is a neural network that is widely used. It is composed of field A and field B neurons, and there is one connection from each field A neuron to each field B neuron. Figure The Madaline model shows a diagram of the Madaline.

The Madaline model.
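
The text above does not spell out how an Adaline reduces its squared error. The following sketch assumes the Widrow-Hoff least-mean-squares (LMS) rule, in which each weight is nudged in proportion to the error and to its own input. The struct layout, the bias term, and the learning rate eta are choices made for this illustration, not details from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A single Adaline with a linear output, trained by the LMS rule.
struct Adaline {
    std::vector<double> weights;
    double bias = 0.0;

    explicit Adaline(std::size_t inputCount) : weights(inputCount, 0.0) {}

    // Linear output: weighted sum of the inputs plus the bias.
    double output(const std::vector<double>& x) const {
        assert(x.size() == weights.size());
        double sum = bias;
        for (std::size_t i = 0; i < x.size(); ++i) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    // One LMS step on a single pattern. Returns the squared error
    // measured before the weights are changed.
    double train(const std::vector<double>& x, double target, double eta) {
        double error = target - output(x);
        for (std::size_t i = 0; i < x.size(); ++i) {
            weights[i] += eta * error * x[i];
        }
        bias += eta * error;
        return error * error;
    }
};
```

Repeatedly presenting the four bipolar patterns of a logical AND, with a small eta such as 0.05, moves the weights toward the least-squares fit, after which the sign of the output matches the target for every pattern.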

Backpropagation

The Backpropagation training algorithm for training feed-forward networks was developed by Paul Werbos, and later by Parker, and by Rumelhart and McClelland. This type of network configuration is the most common in use, due to its ease of training. It is estimated that over 80% of all neural network projects in development use backpropagation. In backpropagation, the learning cycle has two phases, one to propagate the input pattern through the network and the other to adapt the output by changing the weights in the network. It is the error signals that are backpropagated in the network operation to the hidden layer(s). The portion of the error signal that a hidden-layer neuron receives in this process is an estimate of the contribution of that neuron to the output error. When the weights of the connections are adjusted on this basis, the squared error, or some other metric, is reduced in each cycle and finally minimized, if possible.

Figure for Backpropagation Network
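
The two phases of the learning cycle can be sketched for a very small network. The example below assumes two inputs, two hidden neurons, one output neuron, the logistic threshold function, and squared error. The starting weights are arbitrary illustrative values, and the learning rate eta is a parameter of this sketch rather than something prescribed by the text.

```cpp
#include <cmath>

// A tiny 2-2-1 feed-forward network trained by backpropagation.
struct TinyNet {
    double wHidden[2][2] = {{0.5, -0.4}, {0.3, 0.8}};  // [hidden neuron][input]
    double bHidden[2] = {0.1, -0.1};
    double wOut[2] = {0.7, -0.6};
    double bOut = 0.05;
    double hidden[2] = {0.0, 0.0};  // hidden outputs from the last forward pass

    static double logistic(double a) { return 1.0 / (1.0 + std::exp(-a)); }

    // Phase 1: propagate the input pattern through the network.
    double forward(double x0, double x1) {
        for (int j = 0; j < 2; ++j) {
            hidden[j] = logistic(wHidden[j][0] * x0 + wHidden[j][1] * x1 + bHidden[j]);
        }
        return logistic(wOut[0] * hidden[0] + wOut[1] * hidden[1] + bOut);
    }

    // Phase 2: backpropagate the error signal and adjust the weights.
    // Returns the squared error measured before the adjustment.
    double trainStep(double x0, double x1, double target, double eta) {
        double y = forward(x0, x1);
        double error = target - y;
        double deltaOut = error * y * (1.0 - y);  // error signal at the output
        for (int j = 0; j < 2; ++j) {
            // Share of the error signal passed back to hidden neuron j,
            // computed with the output weight as it was before this update.
            double deltaHidden = deltaOut * wOut[j] * hidden[j] * (1.0 - hidden[j]);
            wOut[j] += eta * deltaOut * hidden[j];
            wHidden[j][0] += eta * deltaHidden * x0;
            wHidden[j][1] += eta * deltaHidden * x1;
            bHidden[j] += eta * deltaHidden;
        }
        bOut += eta * deltaOut;
        return error * error;
    }
};
```

Calling trainStep repeatedly on the same pattern, for example inputs (1, 0) with target 1 and eta = 0.5, drives the squared error for that pattern down from one cycle to the next. Training on a full set of patterns cycles through them in the same way.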

Bidirectional Associative Memory

Bidirectional Associative Memory (BAM) and other models described in this section were developed by Bart Kosko. BAM is a network with feedback connections from the output layer to the input layer. It associates a member of the set of input patterns with a member of the set of output patterns that is the closest, and thus it does heteroassociation. The patterns can have binary or bipolar values. If all possible input patterns are known, the matrix of connection weights can be determined as the sum of matrices, each obtained by taking the matrix product of an input vector (as a column vector) with the transpose of its associated output vector (written as a row vector).

The pattern obtained from the output layer in one cycle of operation is fed back at the input layer at the start of the next cycle. The process continues until the network stabilizes on all the input patterns. The stable state so achieved is described as resonance, a concept used in the Adaptive Resonance Theory.
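
A sketch of BAM encoding and recall with bipolar patterns follows. It builds the weight matrix as a sum of products, one per stored pair, of an input vector (as a column) with its associated output vector (as a row), and then passes a pattern between the layers. The function names, the type aliases, and the convention that a zero activation leaves a neuron at its previous value are assumptions of this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;
using Mat = std::vector<std::vector<int>>;

// Weight matrix: sum over all stored pairs of (input column) x (output row).
Mat bamWeights(const std::vector<Vec>& xs, const std::vector<Vec>& ys) {
    assert(!xs.empty() && xs.size() == ys.size());
    Mat w(xs[0].size(), std::vector<int>(ys[0].size(), 0));
    for (std::size_t k = 0; k < xs.size(); ++k)
        for (std::size_t i = 0; i < w.size(); ++i)
            for (std::size_t j = 0; j < w[i].size(); ++j)
                w[i][j] += xs[k][i] * ys[k][j];
    return w;
}

// Bipolar threshold. A zero activation keeps the previous value.
int threshold(int activation, int previous) {
    if (activation > 0) return 1;
    if (activation < 0) return -1;
    return previous;
}

// Input layer to output layer.
Vec recallForward(const Mat& w, const Vec& x, const Vec& yPrev) {
    Vec y(yPrev);
    for (std::size_t j = 0; j < y.size(); ++j) {
        int a = 0;
        for (std::size_t i = 0; i < x.size(); ++i) a += x[i] * w[i][j];
        y[j] = threshold(a, yPrev[j]);
    }
    return y;
}

// Output layer back to input layer (the feedback connections).
Vec recallBackward(const Mat& w, const Vec& y, const Vec& xPrev) {
    Vec x(xPrev);
    for (std::size_t i = 0; i < x.size(); ++i) {
        int a = 0;
        for (std::size_t j = 0; j < y.size(); ++j) a += w[i][j] * y[j];
        x[i] = threshold(a, xPrev[i]);
    }
    return x;
}
```

One cycle of operation is a forward recall followed by a backward recall. Repeating the cycle until neither layer changes gives the stable state described above.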

Fuzzy Associative Memories are similar to Bidirectional Associative Memories, except that association is established between fuzzy patterns. Chapter FAM: Fuzzy Associative Memory deals with Fuzzy Associative Memories.

Temporal Associative Memory

Another type of associative memory is temporal associative memory. Amari, a pioneer in the field of neural networks, constructed a Temporal Associative Memory model that has feedback connections between the input and output layers. The forte of this model is that it can store and retrieve spatiotemporal patterns. An example of a spatiotemporal pattern is a waveform of a speech segment.

Brain-State-in-a-Box

Introduced by James Anderson and others, this network differs from the single-layer fully connected Hopfield network in that Brain-State-in-a-Box uses what we call recurrent connections as well. Each neuron has a connection to itself. With target patterns available, a modified Hebbian learning rule is used. The adjustment to a connection weight is proportional to the product of the desired output and the error in the computed output. You will see more on Hebbian learning in Chapter Learning and Training. This network is adept at noise tolerance, and it can accomplish pattern completion. Figure A Brain-State-in-a-Box network shows a Brain-State-in-a-Box network.

A Brain-State-in-a-Box network.

What’s in a Name?

More like what’s in the box? Suppose you find the following: there is a square box and its corners are the locations for an entity to be. The entity is not at one of the corners, but is at some point inside the box. The next position for the entity is determined by working out the change in each coordinate of the position, according to a weight matrix and a squashing function. This process is repeated until the entity settles down at some position. The choice of the weight matrix is such that when the entity reaches a corner of the square box, its position is stable and no more movement takes place. You would perhaps guess that the entity finally settles at the corner nearest to its initial position within the box. It is said that this kind of example is the reason for the name Brain-State-in-a-Box for the model. Its forte is that it represents linear transformations. Some type of association of patterns can be achieved with this model. If an incomplete pattern is associated with a completed pattern, it would be an example of autoassociation.
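
The movement of the entity inside the box can be sketched as below. The update used here, in which the change in each coordinate is a small multiple alpha of the weighted sum and the squashing function clips each coordinate to the interval [-1, 1], is a common way to write the model and is assumed for this illustration. The function names and the parameters alpha and maxCycles are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One move of the entity: each coordinate changes by alpha times the
// weighted sum of the current coordinates, and the result is squashed
// (clipped) so that the entity stays inside the box.
std::vector<double> bsbStep(const std::vector<std::vector<double>>& w,
                            const std::vector<double>& x, double alpha) {
    assert(w.size() == x.size());
    std::vector<double> next(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(w[i].size() == x.size());
        double change = 0.0;
        for (std::size_t j = 0; j < x.size(); ++j) change += w[i][j] * x[j];
        next[i] = std::max(-1.0, std::min(1.0, x[i] + alpha * change));
    }
    return next;
}

// Repeat the move until the position no longer changes, or until
// maxCycles moves have been made.
std::vector<double> bsbSettle(const std::vector<std::vector<double>>& w,
                              std::vector<double> x, double alpha, int maxCycles) {
    for (int cycle = 0; cycle < maxCycles; ++cycle) {
        std::vector<double> next = bsbStep(w, x, alpha);
        if (next == x) break;  // settled: no more movement
        x = next;
    }
    return x;
}
```

With the 2 by 2 identity matrix as the weight matrix and alpha = 0.5, an entity starting at (0.2, -0.1) moves outward and settles at the corner (1, -1), the corner nearest to where it started.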

Counterpropagation

This is a neural network model developed by Robert Hecht-Nielsen that has one or two additional layers between the input and output layers. If there is one, the middle layer is a Grossberg layer with a bunch of outstars. In the other case, a Kohonen layer, or a self-organizing layer, follows the input layer, and in turn is followed by a Grossberg layer of outstars. The model has the distinction of considerably reducing training time. With this model, you gain a tool that works like a look-up table.

Neocognitron

Compared to all other neural network models, Fukushima’s Neocognitron is more complex and ambitious. It demonstrates the advantages of a multilayered network. The Neocognitron is one of the best models for recognizing handwritten symbols. Many pairs of layers called the S layer, for simple layer, and C layer, for complex layer, are used. Within each S layer are several planes containing simple cells. Similarly, within each C layer there are an equal number of planes containing complex cells. The input layer does not have this arrangement and is like an input layer in any other neural network.

The number of planes of simple cells and of complex cells within a pair of S and C layers being the same, these planes are paired, and the complex plane cells process the outputs of the simple plane cells. The simple cells are trained so that the response of a simple cell corresponds to a specific portion of the input image. If the same part of the image occurs with some distortion, in terms of scaling or rotation, a different set of simple cells responds to it. The complex cells output to indicate that some simple cell they correspond to did fire. While simple cells respond to what is in a contiguous region in the image, complex cells respond on the basis of a larger region. As the process continues to the output layer, the C-layer component of the output layer responds, corresponding to the entire image presented in the beginning at the input layer.

Adaptive Resonance Theory

ART1 is the first model for adaptive resonance theory for neural networks developed by Gail Carpenter and Stephen Grossberg. This theory was developed to address the stability–plasticity dilemma. The network is supposed to be plastic enough to learn an important pattern. But at the same time it should remain stable when, in short-term memory, it encounters some distorted versions of the same pattern.

The ART1 model has A and B field neurons, a gain, and a reset as shown in Figure The ART1 network. There are top-down and bottom-up connections between neurons of fields A and B. The neurons in field B have lateral connections as well as recurrent connections. That is, every neuron in this field is connected to every other neuron in this field, including itself, in addition to the connections to the neurons in field A. The external input (or bottom-up signal), the top-down signal, and the gain constitute three elements of a set, of which at least two should be a +1 for the neuron in the A field to fire. This is what is termed the two-thirds rule. Initially, therefore, the gain would be set to +1. The idea of a single winner is also employed in the B field. The gain would not contribute in the top-down phase; actually, it will inhibit. The two-thirds rule helps move toward stability once resonance, or equilibrium, is obtained. A vigilance parameter ρ is used to determine the reset. The vigilance parameter corresponds to the degree to which the resonating category can be predicted. The part of the system that contains gain is called the attentional subsystem, whereas the rest, the part that contains reset, is termed the orienting subsystem. The top-down activity corresponds to the orienting subsystem, and the bottom-up activity relates to the attentional subsystem.

The ART1 network.
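
The two-thirds rule lends itself to a very short sketch. The function name and the use of bool values for the three signals (true standing for +1) are choices made for this illustration.

```cpp
// Two-thirds rule for a field A neuron in ART1: the neuron fires only
// if at least two of its three signals are on.
bool fieldAFires(bool externalInput, bool topDownSignal, bool gain) {
    int active = (externalInput ? 1 : 0) + (topDownSignal ? 1 : 0) + (gain ? 1 : 0);
    return active >= 2;
}
```

Initially there is no top-down signal, so with the gain on, a neuron fires exactly when its external input is on. In the top-down phase the gain no longer contributes, and the neuron fires only where the external input and the top-down signal agree.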

In ART1, classification of an input pattern in relation to stored patterns is attempted, and if unsuccessful, a new stored classification is generated. Training is unsupervised. There are two versions of training: slow and fast. They differ in the extent to which the weights are given the time to reach their eventual values. Slow training is governed by differential equations, and fast training by algebraic equations.

ART2 is the analog counterpart of ART1, which is for discrete cases. These are self-organizing neural networks, as you can surmise from the fact that training is present but unsupervised. The ART3 model is for recognizing a coded pattern through a parallel search, and was developed by Carpenter and Grossberg. It tries to emulate the activities of chemical transmitters in the brain during what can be construed as a parallel search for pattern recognition.

Summary

The basic concepts of neural network layers, connections, weights, inputs, and outputs have been discussed. An example of how adding another layer of neurons in a network can solve a problem that could not be solved without it is given in detail. A number of neural network models are introduced briefly. Learning and training, which form the basis of neural network behavior, have not been included here, but will be discussed in the following chapter.