Neural Network Algorithms

Not all neural networks are created equal; the efficiency of a neural network is determined by how it “learns.” The power of a neural network lies in its ability to “remember” past data and provide classifications based on it. Past inputs are “remembered” through the values of the network’s weights. The most widely used training technique is supervised learning. Supervised learning requires that a training set of data whose classifications are already known be shown to the network one sample at a time. Each time, the weights are adjusted so that the network produces the desired output for the given inputs. Back-propagation, radial-basis, and delta-rule training algorithms are among the most popular and versatile.
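
The supervised-learning loop described above can be sketched in a few lines of Python. The network object and its predict and adjust_weights methods are hypothetical placeholders used only to show the shape of the procedure, not an actual library API:

    def train(network, training_set, epochs):
        """Sketch of supervised learning with a training set of known classifications."""
        for _ in range(epochs):
            for inputs, desired in training_set:       # classifications are known in advance
                actual = network.predict(inputs)        # show one sample to the network
                error = desired - actual                # compare desired and actual output
                network.adjust_weights(inputs, error)   # nudge weights toward the desired output
        return network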

Back-propagation was one of the first training algorithms developed. It is widely used for its simplicity; however, it is far from the best, and the problems it encounters are common to most training algorithms. To train a back-propagation network, every weight in the network must first be initialized to a small random number. The numbers must be random so that each neuron adopts a different set of weights. The training data are then shown to the network one sample at a time. For every training sample, the desired output is compared to the actual output, and each neuron’s weights are adjusted according to the amount of error it contributed. After many iterations, or epochs, the weights reach values that offer minimal error. This seemingly simple process can take a tremendous amount of time: ‘teaching’ the XOR classification to a simple network consisting of 2 input, 2 hidden, and 1 output neuron using back-propagation can take over 500,000 epochs to reach an acceptable (1%) error level. Fahlman and Lebiere identified two likely culprits: the step-size problem and the moving-target problem.
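
As a concrete sketch of this procedure, the following Python/NumPy code trains a 2-input, 2-hidden, 1-output network on the XOR classification with plain back-propagation. The sigmoid activation, learning rate, random seed, and stopping threshold are illustrative assumptions rather than details from the text, and how many epochs it takes (or whether it stalls in a local minimum) depends on the random initial weights:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # XOR training set: inputs and their known classifications.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)

    # Every weight starts as a small random number so that each neuron
    # adopts a different set of weights.
    W1 = rng.uniform(-0.5, 0.5, (2, 2))   # input -> hidden weights
    b1 = rng.uniform(-0.5, 0.5, (1, 2))
    W2 = rng.uniform(-0.5, 0.5, (2, 1))   # hidden -> output weights
    b2 = rng.uniform(-0.5, 0.5, (1, 1))

    eta = 0.5                             # learning rate (step size)

    for epoch in range(500000):
        sq_err = 0.0
        # The training set is shown to the network one sample at a time.
        for x, t in zip(X, T):
            x, t = x.reshape(1, 2), t.reshape(1, 1)
            h = sigmoid(x @ W1 + b1)              # forward pass: hidden layer
            y = sigmoid(h @ W2 + b2)              # forward pass: output layer
            e = t - y                             # desired output vs. actual output
            sq_err += (e ** 2).item()
            d_out = e * y * (1 - y)               # error attributed to the output neuron
            d_hid = (d_out @ W2.T) * h * (1 - h)  # error attributed to each hidden neuron
            W2 += eta * h.T @ d_out               # adjust each weight by its share of the error
            b2 += eta * d_out
            W1 += eta * x.T @ d_hid
            b1 += eta * d_hid
        if sq_err / len(X) < 1e-4:                # roughly a 1% error level
            break

    print(epoch, sq_err / len(X))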

The back-propagation algorithm makes adjustments by computing the derivative, or slope, of the network error with respect to each weight. It attempts to minimize the overall error by descending this slope toward the minimum value for every weight, advancing one step down the slope each epoch. If the network takes steps that are too large, it may jump past the global minimum. If it takes steps that are too small, it may settle in local minima or take an inordinate amount of time to reach the global minimum. The ideal step size for a given problem requires detailed, higher-order derivative analysis, a task the algorithm does not perform.
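
The effect of step size can be seen even on a one-dimensional error surface. The sketch below uses an arbitrary surface E(w) = w^2 + 2*sin(3w), chosen for illustration and not taken from the text; with the same update rule, a large step keeps jumping past the minima, while a small step settles into a nearby local minimum instead of the global one:

    import math

    def grad(w):
        # Slope dE/dw of the illustrative error surface E(w) = w**2 + 2*sin(3*w).
        return 2 * w + 6 * math.cos(3 * w)

    def descend(w, step, epochs):
        for _ in range(epochs):
            w -= step * grad(w)   # one step down the slope per epoch
        return w

    # Steps that are too large repeatedly overshoot the minima and never settle.
    print(descend(2.5, step=0.9, epochs=100))
    # Small steps converge, but to the local minimum near w = 1.4
    # rather than the global minimum near w = -0.5.
    print(descend(2.5, step=0.01, epochs=2000))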

The moving-target problem appears because the weights of each neuron are adjusted independently. An advantage of a large network is that each neuron becomes a specialized feature detector; its weights become tuned to identify a specific characteristic of its inputs. As the weights are altered, each neuron’s role becomes increasingly defined. However, back-propagation does not coordinate this development; several neurons may identify a particular feature (e.g., feature A) and ignore another (feature B). When feature A’s error signal is eliminated, the error from feature B remains. The neurons may then abandon feature A and begin focusing on feature B. Over numerous epochs the neurons ‘dance’ between feature A and feature B, and it may take several thousand epochs before both features are identified at the same time.