Paper Summary: Learning to Forget: Continual Prediction with LSTM
Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT
This paper introduces the "forget gate" to LSTM cells, which learns when the cell should "reset" its state. This gate was not present in the original 1997 LSTM paper by Hochreiter and Schmidhuber.
WHY
If LSTMs are applied to continuous input streams (rather than pre-segmented training sequences), cells without forget gates let the internal state grow indefinitely in magnitude, which eventually saturates the output squashing function and makes the network stop learning.
HOW
The CEC (Constant Error Carousel) weight, a constant \(1.0\) in vanilla LSTM, is replaced by the activation of a learned forget gate, which decides at each time step how much of the cell state should be memorized and how much discarded.
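A rough sketch of the modified update (notation loosely follows the paper: \(s_c\) is the cell state, \(y^{in}\) and \(y^{\varphi}\) the input- and forget-gate activations, \(g\) the input squashing function, \(net_c\) the cell's net input):

\[
s_c(t) = y^{\varphi}(t)\, s_c(t-1) + y^{in}(t)\, g\big(net_c(t)\big)
\]

Vanilla LSTM is the special case \(y^{\varphi}(t) \equiv 1\), so the old state is always carried over unchanged; with the forget gate, the local error flow through the CEC scales by \(\partial s_c(t) / \partial s_c(t-1) = y^{\varphi}(t)\) instead of staying fixed at \(1\).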
CLAIMS/QUOTES
RNNs are bad: "[...] standard RNNs fail to learn in the presence of time lags greater than 5-10 discrete time steps between relevant input events and target signals."
Constant Error Carousel (CEC): "The CEC's solve the vanishing error problem: in the absence of new input or error signals to the cell, the CEC's local error back flow remains constant, neither growing nor decaying. [...] This is why LSTM can bridge arbitrary time lags between input events and target signals."
State size growth: "The internal states tend to grow linearly." (See the numerical sketch after this list.)
Learning rates: Using an exponentially decaying learning rate improves results in the continuous learning case.
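The state-growth claim is easy to see numerically. Below is a toy sketch (my own illustration, not the paper's code; the weights are arbitrary made-up constants): a single memory cell driven by a constant input stream accumulates state roughly linearly when the CEC weight is fixed at \(1.0\), but stays bounded once a forget gate scales the old state down.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def run_cell(steps: int, use_forget_gate: bool) -> float:
    """Run one memory cell over a constant input stream and return the final state s_c."""
    s_c = 0.0
    x = 1.0  # a never-ending, constant "continual" input stream
    for _ in range(steps):
        y_in = sigmoid(0.5 * x)                 # input gate activation
        g = math.tanh(0.5 * x)                  # squashed cell input
        if use_forget_gate:
            y_forget = sigmoid(-1.0 + 0.5 * x)  # forget gate (illustrative weights)
        else:
            y_forget = 1.0                      # vanilla LSTM: CEC weight fixed at 1.0
        s_c = y_forget * s_c + y_in * g         # cell state update
    return s_c

for steps in (10, 100, 1000):
    print(steps,
          round(run_cell(steps, use_forget_gate=False), 2),  # grows roughly linearly
          round(run_cell(steps, use_forget_gate=True), 2))   # stays bounded
```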
EXTENDS/USES
- Hochreiter and Schmidhuber, 1997: Long Short-Term Memory
NOTES
- What the authors call a "continual" problem is what other people refer to as "online" learning.1
References
IEEE Xplore: Learning to Forget: Continual Prediction with LSTM
1: Not to be confused with real-time ML.