|
Standard Steepest Descent with a Momentum Term
Each iteration takes a step $\Delta w_n$ (in the space of interconnection coefficients) in the direction of greatest descent $g_n$ of the error surface:
$$\Delta w_n = \alpha g_n + \gamma \Delta w_{n-1}, \qquad \Delta w_0 = 0, \quad \alpha > 0, \quad 0 \le \gamma < 1$$
The $\alpha$ parameter is called LearnRate in the network's Training Setup dialog window. It should be a small positive value. Reduce the default value if the algorithm seems unstable or oscillates in the final state (try $\alpha$ between 0.001 and 0.1).
The $\gamma$ parameter is called Momentum and lies in the range [0, 1). It speeds up training and makes it more resistant to statistical fluctuations of the error surface (or to its small local minima); however, values close to 1 may cause oscillations around the target minimum. Variations of this algorithm are called xxMomentumOptimize in the network's Training Setup dialog window.
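The update rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NetMaker's code: the function name `momentum_descent`, the toy error surface, and the choice of taking the descent direction as the negative gradient are all hypothetical.

```python
import numpy as np

def momentum_descent(grad, w0, learn_rate=0.05, momentum=0.8, n_iter=200):
    """Steepest descent with a momentum term (illustrative sketch).

    grad(w) returns the gradient of the error; the descent direction g_n
    is assumed here to be the negative gradient.
    """
    w = np.asarray(w0, dtype=float).copy()
    dw = np.zeros_like(w)                       # delta w_0 = 0
    for _ in range(n_iter):
        g = -grad(w)                            # direction of greatest descent
        dw = learn_rate * g + momentum * dw     # alpha * g_n + gamma * delta w_{n-1}
        w += dw
    return w

def toy_grad(w):
    # Gradient of the toy surface E(w) = 0.5 * (w1^2 + 10 * w2^2).
    return np.array([1.0, 10.0]) * w

w_final = momentum_descent(toy_grad, [3.0, -2.0])
```

On this toy surface the weights spiral in toward the minimum at the origin; raising `momentum` toward 1 makes the oscillations around the minimum last noticeably longer.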
|
|
QuickProp (proposed by Scott Fahlman, 1988)
Much more efficient than steepest descent, it uses the second derivative of the error surface with respect to each weight to adjust the step size taken in each iteration. The first step is taken according to the steepest descent rule, and the $\alpha$ parameter is called InitStep for this iteration. It has almost no influence on the efficiency of the algorithm, but it should usually be kept below 3.0 to ensure stability. Steps in the following iterations depend on the second derivative of the error surface, estimated from the change of the gradient between consecutive iterations:
$$\Delta w_n = \Delta w_{n-1} \cdot \frac{g_n}{g_{n-1} - g_n}$$
The weight and gradient values are written in italics (as scalars) to indicate that the changes are calculated for each weight separately.
A parameter called MaxStepRatio limits rapid growth of the weight changes. Its default value of 1.75 is adequate for most tasks. Weight changes calculated according to the QuickProp rule may be modified slightly by changes calculated with the steepest descent algorithm. This may improve efficiency, but in some cases it also leads to unstable behavior. The amount of modification is given by the DeltaMomentum parameter (the default value of 0.0 turns the modification off; 0.01 is a good value to start with).
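The per-weight update can be sketched roughly as follows. This is a minimal illustration, not NetMaker's implementation: the function name `quickprop`, the toy gradient, the guard against a vanishing denominator, the reading of MaxStepRatio as a cap on step growth relative to the previous step, and the reading of DeltaMomentum as the weight of an added steepest descent term are all assumptions.

```python
import numpy as np

def quickprop(grad, w0, init_step=0.05, max_step_ratio=1.75,
              delta_momentum=0.0, n_iter=50):
    """QuickProp-style update applied to each weight separately (sketch)."""
    w = np.asarray(w0, dtype=float).copy()
    g_prev = -grad(w)                 # descent direction, assumed -gradient
    dw = init_step * g_prev           # first step: steepest descent rule
    w += dw
    for _ in range(n_iter):
        g = -grad(w)
        denom = g_prev - g
        safe = np.abs(denom) > 1e-12  # assumed guard: no step if gradient is unchanged
        step = np.where(safe, dw * g / np.where(safe, denom, 1.0), 0.0)
        limit = max_step_ratio * np.abs(dw)
        step = np.clip(step, -limit, limit)   # limit growth of the weight change
        step = step + delta_momentum * g      # optional steepest descent correction
        w += step
        g_prev, dw = g, step
    return w

def toy_grad(w):
    # Gradient of the toy surface E(w) = 0.5 * (w1^2 + 10 * w2^2).
    return np.array([1.0, 10.0]) * w

w_q = quickprop(toy_grad, [3.0, -2.0])
```

On a quadratic surface the unclipped rule jumps straight to the minimum of each weight, so the cap is what keeps the early steps from growing too fast when the first step is small.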
|
|
Directional minimization algorithms
Algorithms from this group search for the minimum of the error surface along a fixed direction. To speed up the search, the changes are relatively large at the beginning (Step0 parameter) and are reduced as the algorithm approaches the minimum point. When the minimum is found, a new direction is calculated according to the algorithm-specific rule. The Tolerance / MinStep parameters specify how precisely the minimum should be located.
- Conjugated Gradients
Ensures the minimal number of direction changes while reaching the minimum, under some assumptions. Each new direction $d_n$ is calculated as:
$$d_n = g + \gamma \, d_{n-1}, \qquad \text{where} \quad \gamma = \frac{(g - d_{n-1})^T g}{d_{n-1}^T d_{n-1}}$$
- simple Minimum Search
Each new direction is exactly the direction of the gradient $g$ at the minimum point found along the previous search direction, so it is a simplified version of the Conjugated Gradients algorithm with $\gamma = 0$.
- Transversal Gradients
This algorithm changes the search direction at the point where the current gradient direction is perpendicular to the current search direction.
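As an illustration of the general scheme, the sketch below implements the simple Minimum Search variant ($\gamma = 0$): a search along a fixed direction that starts with a large step and shrinks it near the minimum, followed by a new direction taken from the gradient. The function names, the step-halving rule, and the toy error surface are assumptions made for this example and do not describe NetMaker's actual search.

```python
import numpy as np

def line_minimum(error, w, d, step0=0.5, min_step=1e-6):
    """Search along the fixed direction d, shrinking the step near the minimum."""
    step = step0
    e_best = error(w)
    while abs(step) >= min_step:
        w_try = w + step * d
        e_try = error(w_try)
        if e_try < e_best:
            w, e_best = w_try, e_try   # error decreased: keep going the same way
        else:
            step *= -0.5               # overshoot: turn back with half the step
    return w

def minimum_search(error, grad, w0, n_directions=100, step0=0.5, min_step=1e-6):
    """Repeated line searches; each new direction is the descent direction."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_directions):
        d = -grad(w)
        norm = np.linalg.norm(d)
        if norm < 1e-12:               # gradient vanished: nothing left to do
            break
        w = line_minimum(error, w, d / norm, step0, min_step)
    return w

def toy_error(w):
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def toy_grad(w):
    return np.array([1.0, 10.0]) * w

w_min = minimum_search(toy_error, toy_grad, [3.0, -2.0])
```

Here `step0` and `min_step` play the roles described for the Step0 and MinStep parameters: the first sets the initial move along the direction, the second sets how precisely each line minimum is located.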
|
|
|
|
Each neuron's response is calculated as a function of S, where S is the dot product of the neuron's input vector and its weight vector (the bias is also included in S). This is the typical approach for MLP networks; other network types, such as RBF, may use a distance measure rather than the dot product.
The power of neural processing lies in the nonlinearity of the activation function $f_{act}$. A network can approximate any function you wish (OK, let's say "almost") using only a simple $f_{act}$ combined with optimized weights in multiple neuron units. In fact, in most tasks the exact shape of $f_{act}$ is not very important, because the network does the job by optimizing the weights. The most common choice is therefore the sigmoid (logistic) function, because its derivative is simple to calculate. This unipolar function (and its bipolar equivalent, the hyperbolic tangent) is also extensively optimized for calculation speed in the NetMaker code.
However, different $f_{act}$ suit different applications: the sigmoid and hyperbolic tangent functions are usually used in classification tasks, while the arcus tangent has a smoother shape and may give better results in approximation tasks (it also does not saturate as quickly as the sigmoid function, which may speed up training in some cases). The plots below may help you choose the proper function. If you are not sure, try different functions, but be aware that a network mixing different nonlinearities may be difficult to train. To make such a plot in NetMaker, choose the menu Edit / Add Graph / Functions.
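The neuron response described above can be sketched as follows. The function names and the $2/\pi$ scaling of the arcus tangent (chosen here so that its range matches the bipolar range of the hyperbolic tangent) are assumptions for illustration; NetMaker's own definitions may differ.

```python
import numpy as np

def sigmoid(s):
    """Unipolar logistic function, range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_derivative(s):
    """The derivative reuses the function value, which makes it cheap."""
    y = sigmoid(s)
    return y * (1.0 - y)

def arctan_act(s):
    """Arcus tangent scaled to the range (-1, 1); the scaling is an assumption."""
    return (2.0 / np.pi) * np.arctan(s)

def neuron_response(x, w, bias, f_act):
    """Response f_act(S), where S is the dot product of inputs and weights plus bias."""
    s = np.dot(x, w) + bias
    return f_act(s)

# Example: S = 1*0.5 + 2*(-0.25) + 0 = 0, so the sigmoid neuron answers 0.5.
y = neuron_response([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)

# Saturation: at S = 5 the hyperbolic tangent is already much closer to its
# limit of 1 than the scaled arcus tangent.
tanh_at_5 = np.tanh(5.0)
arctan_at_5 = arctan_act(5.0)
```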
|
|
|
The network minimizes the error function $E = \frac{1}{N} \sum_i e_i$ during the training process (NetMaker uses batch learning, so the error $E$ is averaged over all training events). Eight functions are currently available:
- MSE (mean squared error, default): $e = (t - o)^2$; this function is the most common choice, but read the descriptions of the other functions too!
- Pow4: $e = (t - o)^4$; focuses training on the events with a large distance between the desired and the obtained network output. It is usually chosen when the tails of the network output distributions are much more important than other ranges (for example, when you need an extremely pure selection in which few events survive, at the cost of decreased network performance in the range of moderate purities and efficiencies).
- IAtanh1, IAtanh2: integrated hyperbolic arcus tangent of $(t - o)$; IAtanh1 is suitable for unipolar and centered sigmoid output layer types, while IAtanh2 should be used with bipolar output layer types. Note that a network with a linear output layer may exceed the allowed range of arguments for these functions. These functions have an effect similar to Pow4, but have an almost linear derivative around $e = 0$, which is favorable for training algorithms.
- ITanh: integrated hyperbolic tangent of $(t - o)$; ITanh is suitable for all output layer types. This function has exactly the opposite effect to Pow4 and the IAtanh functions: the influence of events with a large network error is slightly suppressed. This is useful if you expect outliers or gross measurement errors in your training data (and you will be surprised how often this function improves the network performance). The effect is stronger for bipolar output layers. The function also has an almost linear derivative around $e = 0$.
Asymmetric functions focus the training on one of the tails of the network output distribution; also use these functions if an overvalued network output is much more painful than an undervalued one, or vice versa:
- Asymm1, Asymm2: $e = \frac{[a (t - o)]^2}{a (t - o) + 1}$, where $a$ is a scaling factor: $a = 0.75$ (Asymm1, for unipolar output layers), $a = 0.4$ (Asymm2, for bipolar output layers);
- AsymmL: $e = \exp(-(t - o)) + (t - o) - 1$, with an unlimited range of $(t - o)$ and therefore suitable for all types of output layer.
$t$ - desired network output value (target vector element);
$o$ - obtained network output value (output vector element).
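The functions that are given above in closed form can be compared with a short sketch. Only MSE, Pow4, Asymm1/Asymm2 and AsymmL are included, because the exact scaling of the integrated functions is not stated here; the function names and sample values are hypothetical. Note that the scaling factor $a$ keeps the denominator of the Asymm functions positive over the whole output range: $a (t - o) \ge -0.75$ for unipolar and $\ge -0.8$ for bipolar outputs.

```python
import numpy as np

def mse(t, o):
    return (t - o) ** 2

def pow4(t, o):
    return (t - o) ** 4

def asymm(t, o, a):
    """Asymm1 uses a = 0.75 (unipolar outputs), Asymm2 uses a = 0.4 (bipolar)."""
    x = a * (t - o)
    return x ** 2 / (x + 1.0)

def asymm_l(t, o):
    d = t - o
    return np.exp(-d) + d - 1.0

def mean_error(e, targets, outputs, **params):
    """Batch error E: the per-event error e averaged over all N events."""
    return np.mean(e(targets, outputs, **params))

targets = np.array([1.0, 0.0, 1.0])
outputs = np.array([0.5, 0.5, 1.0])

E_mse = mean_error(mse, targets, outputs)       # (0.25 + 0.25 + 0) / 3
E_pow4 = mean_error(pow4, targets, outputs)     # (0.0625 + 0.0625 + 0) / 3

# Same absolute deviation of 0.5, opposite sign: the overvalued output
# (t = 0, o = 0.5) is penalized more than the undervalued one (t = 1, o = 0.5).
e_over = asymm(0.0, 0.5, a=0.75)
e_under = asymm(1.0, 0.5, a=0.75)
```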
|
| [Figure: Error functions.] |
| [Figure: Error function derivatives.] |
|
|
|