Paper Summary: Long Short-Term Memory
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
What
The authors introduce the Long Short-Term Memory (LSTM) model.
They provide:
1) A new architecture for Recurrent Neural Networks (RNNs)
2) A new strategy for deciding what gets saved in the internal state of RNNs.
Why
Because existing RNNs are not good at learning problems where the relevant temporal effects are far apart (over 1,000 time steps).
Because applying Backpropagation Through Time (BPTT)[1] is computationally expensive; in addition, gradients tend to either explode or vanish exponentially.
How
A mechanism for keeping error propagation under control (i.e. neither exploding nor vanishing) is proposed, called the Constant Error Carrousel (CEC).
Gates are introduced so that:
- Important state learned within recurrent cells is kept protected from noisy perturbations (input gates)
- Only relevant cell state is propagated to the rest of the network (output gates)
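A minimal sketch of one forward step of a single memory cell, following the 1997 formulation (no forget gate); the weight names, shapes and tanh nonlinearities here are my own illustrative choices, not the paper's notation:

```python
import numpy as np

def lstm_cell_step(x, h_prev, c_prev, params):
    """One forward step of a single LSTM memory cell (1997 version: no forget gate).

    x      : input vector at this time step
    h_prev : previous cell output
    c_prev : previous internal (CEC) state
    params : dict of weight matrices/biases (illustrative names, not the paper's)
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Input gate: decides how much of the candidate update is written to the state.
    i = sigmoid(params["W_i"] @ x + params["U_i"] @ h_prev + params["b_i"])
    # Output gate: decides how much of the state is exposed to the rest of the net.
    o = sigmoid(params["W_o"] @ x + params["U_o"] @ h_prev + params["b_o"])
    # Candidate update computed from the current input and the previous output.
    g = np.tanh(params["W_c"] @ x + params["U_c"] @ h_prev + params["b_c"])

    # Constant Error Carrousel: the state's self-connection has a fixed weight of
    # 1.0, so the state accumulates additively instead of being repeatedly rescaled.
    c = c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```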
LSTMs vs Regular Recurrent Neural Nets
LSTMs are one particular version of Recurrent Neural Nets (RNNs).
While vanilla RNNs just connect the previous time step of a given cell to itself, LSTMs add the enhancements mentioned above so as to:
- prevent errors from exploding or vanishing (which happens when we try to use vanilla RNNs to keep many time steps in memory; see the sketch below)
- better decide which past error signals should be kept and which ones shouldn't.
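For contrast, a minimal sketch (mine, not the paper's) of one vanilla RNN step: the whole hidden state is rewritten through a squashing nonlinearity at every step, and backpropagated errors get multiplied by the same recurrent matrix over and over, which is why they scale exponentially over long lags.

```python
import numpy as np

def vanilla_rnn_step(x, h_prev, W_x, W_h, b):
    # The entire hidden state is recomputed at each step. Gradients flowing back
    # through h_prev are repeatedly multiplied by W_h (and tanh derivatives),
    # which is what makes them explode or vanish over many time steps.
    return np.tanh(W_x @ x + W_h @ h_prev + b)
```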
Claims
LSTMs perform better than their vanilla RNN counterparts in all tasks analyzed, including:
- tasks with very short-term temporal relationships (LSTMs are faster and more accurate than RNNs)
- tasks with long-term temporal relationships (LSTMs work whereas RNNs fail altogether)
Constant Error Carrousel (CEC)
The Constant Error Carrousel (CEC) is an extra constraint placed upon the weights: each memory cell's self-recurrent connection is fixed at 1.0 with a linear activation, so the error signal flowing back through the cell is neither amplified nor shrunk.
This helps gradients not vanish or explode (see the toy illustration below).
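A toy illustration (mine, not from the paper) of why that fixed weight of 1.0 matters: if a backpropagated error is repeatedly scaled by the recurrent weight, anything other than 1.0 shrinks or blows it up over long lags.

```python
# An error signal repeatedly scaled by the recurrent weight across 1000 time
# steps (ignoring activations and other weights for simplicity).
for w in (0.9, 1.0, 1.1):
    err = 1.0
    for _ in range(1000):
        err *= w
    print(f"recurrent weight {w}: error after 1000 steps ~ {err:.3g}")
# 0.9 -> vanishes (~1.7e-46), 1.0 -> stays constant (the CEC case), 1.1 -> explodes (~2.5e+41)
```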
Notes
LSTMs are a type of Recurrent Neural Net.
The authors mention that, surprisingly, simply guessing network weights at random and testing one set after another (without any optimization) works for solving simple problems.
Memory cells are called short-term memory because network weights themselves are our long-term memory.
LSTMs do differentiate between recent and older signals, but not very precisely;
- e.g. an LSTM probably won't learn that a temporal effect takes place after exactly 99 time steps (as opposed to 100 time steps).
My 2¢
Any problem where data is inherently sequential can be modelled by RNNs and, by extension, by LSTMs:
- real-valued time series data (asset prices, weather, signals, etc.) with and without noise on features and on targets (see the sketch at the end of this section)
- natural language, text
- natural language, speech
- video data
- simple arithmetic (addition, subtraction and even multiplication)
- order information (e.g. target depends on whether some values are before or after another given value)
The XOR (exclusive-or) operation continues to be a hard problem, even for LSTMs.
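As a usage sketch for the noisy time-series case (not from the paper; it assumes PyTorch is available, and nn.LSTM is the modern variant with a forget gate rather than the exact 1997 architecture):

```python
import math
import torch
import torch.nn as nn

# Toy task: predict the next value of a noisy sine wave from the previous 50 values.
# All sizes and hyperparameters are arbitrary illustrative choices.
seq_len, batch, n_features, hidden = 50, 32, 1, 64

lstm = nn.LSTM(input_size=n_features, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 1)

t = torch.linspace(0, 8 * math.pi, seq_len + 1)
series = torch.sin(t) + 0.1 * torch.randn(batch, seq_len + 1)   # noisy sine waves
x, y = series[:, :-1].unsqueeze(-1), series[:, -1:]             # inputs and next-step target

opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()), lr=1e-3)
for _ in range(200):
    out, _ = lstm(x)                  # out: (batch, seq_len, hidden)
    pred = head(out[:, -1, :])        # use the last time step's output for the prediction
    loss = nn.functional.mse_loss(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```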
[1]: BPTT is the simplest (but not the most efficient) way to adapt backpropagation to RNNs. Basically, you unroll the net so that each previous time step is treated as if it were another layer.