Tricks for Training Neural Nets Faster


WIP Alert: This is a work in progress. Current information is probably correct, but more content will be added in the future.

General rules: use SGD (mini-batches) for large, possibly redundant datasets; use full-batch learning for small datasets.

Initializing Weights

Use small random weights, with scale inversely proportional to the square root of the fan-in of the receiving unit.

WHY: Break symmetry, avoid saturation, makes training faster for units with high fan-in
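
A minimal numpy sketch of fan-in-scaled initialization (the function name and shapes are illustrative):

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    rng = np.random.default_rng(seed)
    # Std ~ 1/sqrt(fan_in): units with many incoming connections get
    # proportionally smaller weights, so their total input stays in the
    # non-saturated range of the activation function.
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

W = init_weights(256, 128)  # std of entries is roughly 1/sqrt(256) = 0.0625
```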

Multiply the learning rate by a constant

Equal to the fan-in of the receiving unit.

WHY: makes training faster for units with high fan-in

Shifting Inputs

Subtract the mean over all inputs, and use tanh (which is zero-centred) as the activation function.

WHY: makes gradient descent more efficient (easier error surface)

Rescaling Inputs

To unit variance.

WHY: makes gradient descent more efficient (easier error surface)
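
Shifting and rescaling can be done together with per-component statistics (a minimal numpy sketch; the data is made up):

```python
import numpy as np

X = np.array([[10.0, 200.0],
              [12.0, 180.0],
              [14.0, 220.0]])

# Shift: zero mean per input component; rescale: unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```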

Decorrelating Input Components

Linear neurons only.

  • apply PCA and drop components with smaller eigenvalues
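
A sketch of the PCA step with numpy (synthetic data; the cutoff of keeping two components is arbitrary, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=500)  # make one component correlated

Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(Xc)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
keep = eigvecs[:, -2:]                  # drop the smallest-eigenvalue component
X_decorr = Xc @ keep                    # decorrelated: covariance is diagonal
```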

Make the learning rate smaller towards the end of learning

But not too soon or you may stop learning before you should.

WHY: make your net more resistant to fluctuations that could mess up what you've already learned
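
One common way to shrink the rate late in training is a step decay schedule (a sketch; the function and constants are illustrative, not a prescription):

```python
def decayed_lr(base_lr, step, decay_steps, decay_rate=0.5):
    # Halve the learning rate every `decay_steps` updates. Decaying too
    # early can stop learning before it should.
    return base_lr * decay_rate ** (step // decay_steps)

decayed_lr(0.1, 0, 1000)     # 0.1
decayed_lr(0.1, 2500, 1000)  # 0.025
```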

Use momentum

Also, Nesterov momentum

The gradient is used to change the velocity of the weight update rather than the position directly: the base step is accelerated while successive gradients agree and decelerated when they disagree.

A small base learning rate plus high momentum usually works better than just a high learning rate.

WHY: makes your net more resistant to noise, fluctuations in the inputs.
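
A minimal sketch of classical momentum (hyperparameter values are illustrative):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    # The gradient changes the velocity; the velocity changes the weights.
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

w, v = np.zeros(2), np.zeros(2)
g = np.array([1.0, -1.0])  # pretend the gradient is constant
for _ in range(3):
    w, v = momentum_step(w, g, v)
# While the gradient keeps its sign, the step size grows each iteration.
```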

Use individual learning rates for each unit

While the gradient for a unit stays positive, keep making the learning rate larger (for that unit only), and reduce it when you get a negative gradient.

WHY: quickly exploit "good" weights but quickly change them if it gets bad (gradient is opposite sign)

Use individual local gains

It's an adaptive learning rate method.

Start off with a global learning rate of 1 and, for each weight, add 0.05 to its gain if successive gradients do not change sign, and multiply the gain by 0.95 (i.e. decrease it) if the sign does change.

  • limit the gains to a prefixed range

  • use slightly larger mini-batches to prevent fluctuations

WHY: different weights may need to vary differently according to their magnitude, unit fan-in, etc.
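
The gain update above can be sketched in numpy like this (the clipping range is an illustrative choice for the "prefixed range" mentioned above):

```python
import numpy as np

def update_gains(gains, grad, prev_grad, lo=0.1, hi=10.0):
    # Additive increase while the gradient keeps its sign, multiplicative
    # decrease when it flips; clip the gains to a fixed range.
    same_sign = grad * prev_grad > 0
    gains = np.where(same_sign, gains + 0.05, gains * 0.95)
    return np.clip(gains, lo, hi)

gains = np.ones(3)               # global starting rate of 1
g1 = np.array([0.5, -0.2, 0.1])
g2 = np.array([0.4, 0.3, 0.2])   # second component changes sign
gains = update_gains(gains, g2, g1)
# gains -> [1.05, 0.95, 1.05]
```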

Use rmsprop

It's an adaptive learning rate method.

Use the magnitude of recent gradients to normalize the current gradient.

Instead of using the current gradient to update the learning rate, divide the current gradient by the square root of the mean of the squares (RMS) of the previous gradients.

WHY: more resistance to fluctuations, noise
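
A minimal rmsprop sketch (hyperparameter values are the commonly cited defaults, shown for illustration):

```python
import numpy as np

def rmsprop_step(w, grad, sq_avg, lr=0.001, decay=0.9, eps=1e-8):
    # Keep a decaying average of squared gradients, then divide the current
    # gradient by its root: steps shrink where gradients have been large.
    sq_avg = decay * sq_avg + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg

w, sq = np.zeros(1), np.zeros(1)
w, sq = rmsprop_step(w, np.array([2.0]), sq)
```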

Cost Function: use Cross Entropy rather than quadratic error

This is probably the default cost function in most toolkits, but sometimes it isn't.

WHY: Cross Entropy has nicer partial derivatives, makes learning quicker (the larger the errors, the larger the gradient)
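
The effect is easy to see with a sigmoid output unit: under quadratic error the output-layer gradient carries a sigma'(z) factor that vanishes when the unit saturates, while under cross entropy that factor cancels and the gradient is just y - t (a sketch with made-up numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, t = 5.0, 0.0  # a confidently wrong prediction
y = sigmoid(z)

grad_quadratic = (y - t) * y * (1 - y)  # sigma'(z) = y(1-y) shrinks the gradient
grad_xent = y - t                       # cross entropy: the factor cancels

# grad_quadratic ~ 0.0066, grad_xent ~ 0.9933: the larger the error,
# the larger the cross-entropy gradient.
```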

