Standard Steepest Descent with a Momentum term
QuickProp proposed by Scott Fahlman
Directional minimization algorithms
   - Conjugated Gradients
   - Transversal Gradients
   - simple Minimum Search

Only deterministic, gradient-based algorithms are implemented. The reason is that they are much faster than stochastic algorithms, and the problem of local minima can be addressed by using the dynamic structure during training. All algorithms process the entire training set before updating the neuron interconnections (the so-called offline or global approach). One such pass is called an iteration here.
The gradient vector g (or the steepest descent direction) used in the following algorithms is the negative of the derivative of the network error function E with respect to the network interconnection coefficients w: $g = -\partial E / \partial w$. This gradient is averaged over all training events.

Standard Steepest Descent with a Momentum term

Each iteration takes a step $\Delta w_n$ (in the interconnection coefficient space) in the direction of steepest descent g of the error surface.

$\Delta w_n = \alpha g_n + \gamma \Delta w_{n-1}, \quad \Delta w_0 = 0, \quad \alpha > 0, \quad 0 \le \gamma < 1$

The $\alpha$ parameter is called LearnRate in the network's Training Setup dialog window. It should be a small, positive value. Reduce the default value if the algorithm seems unstable or oscillates in the final state (try $\alpha$ between 0.001 and 0.1). The $\gamma$ parameter is called Momentum and takes values in the range [0, 1). This parameter speeds up training and makes it more resistant to statistical fluctuations of the error surface (or its small local minima); however, values close to 1 may result in oscillations around the destination minimum.
Variations of this algorithm are called xxMomentumOptimize in the network's Training Setup dialog window.
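
The momentum update rule can be sketched in Python with NumPy as follows. This is a minimal illustration and not NetMaker code: the function name `momentum_descent`, the `grad_fn` callback and the toy quadratic error surface are assumptions made for the example.

```python
import numpy as np

def momentum_descent(grad_fn, w0, learn_rate=0.1, momentum=0.5, iterations=200):
    """Steepest descent with a momentum term (illustrative sketch).

    grad_fn(w) is assumed to return g = -dE/dw, the descent direction
    averaged over the whole training set.
    """
    w = np.asarray(w0, dtype=float).copy()
    dw = np.zeros_like(w)                      # step before the first iteration is 0
    for _ in range(iterations):
        g = grad_fn(w)                         # descent direction for this iteration
        dw = learn_rate * g + momentum * dw    # new step = LearnRate * g + Momentum * old step
        w += dw
    return w

# Toy error surface (assumption): E(w) = 0.5 * |w - target|^2, so g = -(w - target).
target = np.array([1.0, -2.0])
w_final = momentum_descent(lambda w: -(w - target), w0=[0.0, 0.0])
```

With these hypothetical settings the weights settle at `target`; raising `momentum` toward 1 on the same surface makes the trajectory oscillate around the minimum for longer, which matches the warning above.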


QuickProp proposed by Scott Fahlman (1988)

QuickProp is much more efficient than steepest descent. It uses the second derivative of the error surface with respect to each weight to adjust the step size taken in each iteration. The first step is taken according to the steepest descent rule; the $\alpha$ parameter for this iteration is called InitStep. It has almost no influence on the efficiency of the algorithm, but it should usually be kept below 3.0 to ensure stability. Steps in the following iterations depend on the second derivative of the error surface:

$\Delta w_n = \Delta w_{n-1} \, g_n / (g_{n-1} - g_n)$

Here w and g denote a single weight and the corresponding gradient component: the changes are calculated for each weight separately. The MaxStepRatio parameter limits rapid increases of the weight changes. Its default value of 1.75 is adequate for most tasks. Weight changes calculated according to the QuickProp rule may be modified slightly by changes calculated with the steepest descent algorithm. This may improve efficiency, but in some cases it also leads to unstable behavior of the algorithm. The amount of modification is given by the DeltaMomentum parameter (the default value of 0.0 turns the modification off; 0.01 is a good starting value).
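
A rough Python sketch of the QuickProp rule is given below. It is not NetMaker's implementation. The exact way MaxStepRatio and DeltaMomentum enter the update is an assumption here (the ratio of successive steps is clipped, and the steepest descent term is simply added), as are the function name `quickprop`, the guard against a vanishing denominator and the toy error surface.

```python
import numpy as np

def quickprop(grad_fn, w0, init_step=0.1, max_step_ratio=1.75,
              delta_momentum=0.0, iterations=30):
    """QuickProp sketch; grad_fn(w) is assumed to return g = -dE/dw per weight."""
    w = np.asarray(w0, dtype=float).copy()
    g_prev = grad_fn(w)
    dw = init_step * g_prev                  # first iteration: steepest descent step
    w += dw
    for _ in range(iterations - 1):
        g = grad_fn(w)
        denom = g_prev - g
        ratio = np.zeros_like(w)
        ok = np.abs(denom) > 1e-12           # assumption: freeze weights with a flat gradient
        ratio[ok] = g[ok] / denom[ok]        # per-weight factor g_n / (g_{n-1} - g_n)
        # Assumption: MaxStepRatio caps the size of a step relative to the previous one.
        ratio = np.clip(ratio, -max_step_ratio, max_step_ratio)
        dw = ratio * dw + delta_momentum * g # optional steepest descent admixture
        w += dw
        g_prev = g
    return w

# Toy separable error surface (assumption): E = 0.5 * sum(c * (w - target)^2).
curvature = np.array([1.0, 4.0])
target = np.array([1.0, -2.0])
w_final = quickprop(lambda w: -curvature * (w - target), w0=[0.0, 0.0])
```

On a quadratic surface the unclipped rule jumps straight to the minimum of each weight's parabola, so in this example the only iterations spent before that jump are the ones where the clip is active.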


Directional minimization algorithms

Algorithms in this group search for the minimum of the error surface along a fixed direction. To speed up the search, the changes are relatively large at the beginning (Step0 parameter) and are reduced as the algorithm closes in on the minimum point. When the minimum is found, a new direction is calculated according to the algorithm-specific rule. The Tolerance / MinStep parameters determine how precisely the minimum should be located.

- Conjugated Gradients

Under some assumptions, this algorithm ensures the minimal number of direction changes needed to reach the minimum. Each new direction $d_n$ is calculated as:

$d_n = g + \gamma d_{n-1}$,

where: $\gamma = (g - d_{n-1})^T g \,/\, (d_{n-1}^T d_{n-1})$

- simple Minimum Search

Each new direction is exactly the direction of the gradient g at the minimum point found along the previous search direction, so this is a simplified version of the Conjugated Gradients algorithm with $\gamma = 0$.

- Transversal Gradients

This algorithm changes the search direction at the point where the current gradient direction is perpendicular to the current search direction.
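
The directional search and the direction update can be sketched as below. This is only an illustration under stated assumptions: the step reversal and halving scheme in `line_minimum` is one possible way to realize "large steps first, smaller near the minimum", the `step0` and `min_step` arguments merely mimic the Step0 and MinStep parameters, and the conjugate branch follows the $\gamma$ formula exactly as written in this section. Transversal Gradients is not covered.

```python
import numpy as np

def line_minimum(error_fn, w, d, step0=1.0, min_step=1e-6):
    """Search along d: large steps first, reversed and halved after an overshoot."""
    step, e_best = step0, error_fn(w)
    while abs(step) > min_step:
        w_try = w + step * d
        e_try = error_fn(w_try)
        if e_try < e_best:
            w, e_best = w_try, e_try      # improvement: keep going with the same step
        else:
            step *= -0.5                  # overshoot: turn back with a smaller step
    return w

def directional_minimize(error_fn, grad_fn, w0, conjugate=True, iterations=50):
    """conjugate=True uses gamma as written above; False gives simple Minimum Search."""
    w = np.asarray(w0, dtype=float)
    d = grad_fn(w)                        # first direction: steepest descent, g = -dE/dw
    for _ in range(iterations):
        w = line_minimum(error_fn, w, d)
        g = grad_fn(w)
        dd = d @ d
        gamma = (g - d) @ g / dd if (conjugate and dd > 0.0) else 0.0
        d = g + gamma * d                 # new search direction
    return w

# Toy quadratic error surface (assumption): E = 0.5 w^T A w - b^T w, minimum at A w = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
error_fn = lambda w: 0.5 * w @ A @ w - b @ w
grad_fn = lambda w: -(A @ w - b)
w_cg = directional_minimize(error_fn, grad_fn, [0.0, 0.0], conjugate=True)
w_ms = directional_minimize(error_fn, grad_fn, [0.0, 0.0], conjugate=False)
```

Both variants end up near the solution of `A w = b`, which is (0.2, 0.4) for this toy surface.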


Each neuron's response is calculated as a function of S, where S is the dot product of the neuron's input vector and weight vector (the bias is also included in S). This is the typical approach for MLP networks (other network types, such as RBF, may use a distance measure rather than the dot product). The power of neural processing lies in the nonlinearity of the activation function $f_{act}$. A network can approximate almost any function using only a simple $f_{act}$ combined with optimized weights in multiple neuron units. In most tasks the exact shape of $f_{act}$ is not very important, because the network does the job by optimizing the weights. The most common choice is therefore the sigmoid (logistic) function, whose derivative is simple to calculate. This unipolar function (and its bipolar equivalent, the hyperbolic tangent) is also extensively optimized for calculation speed in the NetMaker code.
However, different $f_{act}$ are suited to different applications: the sigmoid and hyperbolic tangent functions are usually used in classification tasks, while the arctangent has a smoother shape and may give better results in approximation tasks (it also does not saturate as quickly as the sigmoid function, which may speed up training in some cases).
The plots below may help you choose the proper function. If you are not sure, try different functions, but be aware that a network mixing different nonlinearities may be difficult to train. To make such a plot in NetMaker, choose the menu Edit / Add Graph / Functions.
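
For readers who want to compare the functions numerically, a short NumPy sketch follows. The function names are made up for this example, and the scaling of the arctangent to the range (-1, 1) is an assumption; NetMaker's exact definitions may differ.

```python
import numpy as np

def sigmoid(s):
    """Unipolar logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_deriv(s):
    y = sigmoid(s)
    return y * (1.0 - y)                 # simple derivative: y * (1 - y)

def tanh_deriv(s):
    return 1.0 - np.tanh(s) ** 2         # bipolar counterpart of the sigmoid

def atan_bipolar(s):
    """Arctangent scaled to (-1, 1); the 2/pi scaling is an assumption."""
    return (2.0 / np.pi) * np.arctan(s)

def atan_bipolar_deriv(s):
    return (2.0 / np.pi) / (1.0 + s ** 2)

def neuron_response(inputs, weights, bias, f_act=sigmoid):
    """Response of one MLP neuron: f_act applied to S = inputs . weights + bias."""
    s = np.dot(inputs, weights) + bias
    return f_act(s)

# Far from zero the arctangent keeps a much larger derivative than tanh,
# which is the slower saturation mentioned above.
slope_tanh = tanh_deriv(5.0)
slope_atan = atan_bipolar_deriv(5.0)
```

Comparing `slope_atan` with `slope_tanh` at S = 5 shows a difference of roughly two orders of magnitude in favor of the arctangent.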

Unipolar activation functions.

Unipolar activation derivatives.


Bipolar activation functions.

Bipolar activation derivatives.


The network minimizes the error function $E = \frac{1}{N} \sum_i e_i$ during the training process (NetMaker uses batch learning, so the error E is averaged over all N training events). Eight functions are currently available:

- MSE (mean squared error, default):
$e = (t - o)^2$; this function is the most common choice, but read the descriptions of the other functions too!

- Pow4:
$e = (t - o)^4$; focuses training on the events with a large distance between the desired and obtained network output. It is usually chosen when the tails of the network output distributions are much more important than other ranges (for example, when you need an extremely pure selection in which few events survive, at the cost of decreased network performance in the range of moderate purities and efficiencies).

- IAtanh1, IAtanh2:
integrated inverse hyperbolic tangent of (t - o); IAtanh1 is suitable for unipolar and centered sigmoid output layer types, IAtanh2 should be used with bipolar output layer types; note that a network with a linear output layer may exceed the allowed range of arguments for these functions; these functions have an effect similar to Pow4, but have an almost linear derivative around e = 0, which is favorable for the training algorithms.

- ITanh:
integrated hyperbolic tangent of (t - o); ITanh is suitable for all output layer types; this function has exactly the opposite effect to Pow4 and the IAtanh functions: the influence of events with a large network error value is somewhat suppressed. This is useful if you expect outliers or gross measurement errors in your training data (and you will be surprised how often this function improves the network performance); the effect is stronger for bipolar output layers; the function also has an almost linear derivative around e = 0.

Asymmetric functions focus the training on one of the network output distribution tails; use these functions also if an overvalued network output is much more painful than an undervalued one, or vice versa:
- Asymm1, Asymm2:
$e = [a (t - o)]^2 / [a (t - o) + 1]$, where a is a scaling factor: a = 0.75 (Asymm1, for unipolar output layers), a = 0.4 (Asymm2, for bipolar output layers);
- AsymmL:
$e = \exp(-(t - o)) + (t - o) - 1$, with an unlimited range of (t - o) and therefore suitable for all types of output layer.

t - desired network output value (target vector element); o - obtained network output value (output vector element).
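
The error functions that are given above as explicit formulas can be written down directly, as in the sketch below. The function names are chosen for this example only, and the IAtanh and ITanh variants are left out because their exact expressions are not stated here.

```python
import numpy as np

def mse(t, o):
    return (t - o) ** 2

def pow4(t, o):
    return (t - o) ** 4

def asymm(t, o, a=0.75):
    """Asymm1 for a = 0.75 (unipolar outputs), Asymm2 for a = 0.4 (bipolar outputs)."""
    x = a * (t - o)
    return x ** 2 / (x + 1.0)

def asymm_l(t, o):
    d = t - o
    return np.exp(-d) + d - 1.0

def network_error(e_fn, targets, outputs):
    """E = (1/N) * sum of e_i over all training events (batch learning)."""
    t = np.asarray(targets, dtype=float)
    o = np.asarray(outputs, dtype=float)
    return np.mean(e_fn(t, o))

# The same distance |t - o| = 0.5 costs more when the output is overvalued (o > t).
over = asymm(0.0, 0.5)    # output above target
under = asymm(1.0, 0.5)   # output below target
E = network_error(mse, [1.0, 0.0], [0.5, 0.5])
```

With these formulas `over` is larger than `under`, so for the asymmetric functions as written here it is the overvalued outputs that are penalized more strongly.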

Error functions.

Error function derivatives.
