Paper Summary: Long Short-Term Memory-Networks for Machine Reading

Last updated: 26 Dec 2022

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

The authors present an enhancement to how attention is used in LSTMs, namely intra-attention (also known as self-attention).

They name the resulting model the LSTMN (Long Short-Term Memory-Network)¹

HOW

In the LSTMN, the attention mechanism is added within the encoder itself, whereas in previous implementations it was added between the encoder and the decoder.
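To make the idea of attention inside the recurrent unit more concrete, here is a minimal NumPy sketch of a single LSTMN step, based on my reading of the paper's equations. It is an illustration under assumptions, not the authors' implementation: the parameter names (`W_h`, `W_x`, `W_s`, `v`, `W_i`, `W_f`, `W_o`, `W_c`), the helpers `init_params` and `run_lstmn`, the random initialisation, the missing bias terms and the zero initial state are all my own choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d_in, d, k, seed=0):
    """Random parameters (hypothetical names and shapes, no biases)."""
    rng = np.random.default_rng(seed)
    p = {
        "W_h": rng.normal(size=(k, d)),     # projects each previous hidden state
        "W_x": rng.normal(size=(k, d_in)),  # projects the current input
        "W_s": rng.normal(size=(k, d)),     # projects the previous attention summary
        "v": rng.normal(size=k),            # turns each projection into a scalar score
    }
    for name in ("W_i", "W_f", "W_o", "W_c"):
        p[name] = rng.normal(size=(d, d + d_in))
    return p

def lstmn_step(x_t, H, C, h_tilde_prev, p):
    """One step: attend over the tapes H, C (one row per past step), then update."""
    # Intra-attention: score every previous hidden state against the current input.
    scores = np.tanh(H @ p["W_h"].T + x_t @ p["W_x"].T + h_tilde_prev @ p["W_s"].T) @ p["v"]
    s = softmax(scores)                     # attention weights over past tokens
    h_tilde = s @ H                         # adaptive summary of past hidden states
    c_tilde = s @ C                         # adaptive summary of past memory cells
    # Standard LSTM gates, but fed with the attention summary instead of h_{t-1}.
    z = np.concatenate([h_tilde, x_t])
    i = sigmoid(p["W_i"] @ z)
    f = sigmoid(p["W_f"] @ z)
    o = sigmoid(p["W_o"] @ z)
    c_hat = np.tanh(p["W_c"] @ z)
    c_t = f * c_tilde + i * c_hat
    h_t = o * np.tanh(c_t)
    return h_t, c_t, h_tilde, s

def run_lstmn(X, p, d):
    """Run over a sequence X of shape (T, d_in), growing the tapes by one row per token."""
    H = np.zeros((1, d))                    # assumption: tapes start with a zero state
    C = np.zeros((1, d))
    h_tilde = np.zeros(d)
    weights = []
    for x_t in X:
        h_t, c_t, h_tilde, s = lstmn_step(x_t, H, C, h_tilde, p)
        H = np.vstack([H, h_t])
        C = np.vstack([C, c_t])
        weights.append(s)
    return H[1:], C[1:], weights

params = init_params(d_in=4, d=3, k=5)
X = np.random.default_rng(1).normal(size=(6, 4))
H_out, C_out, weights = run_lstmn(X, params, d=3)
print(H_out.shape, [len(w) for w in weights])
```

The point to notice is that the state update reads from a weighted mix of all earlier hidden and memory vectors (the "tapes") rather than only from the previous step, and the attention weights in `weights` are what induce the relations between tokens.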

The authors present two ways of integrating self-attention into LSTM encoder-decoders:

"Shallow Fusion": Use an encoder-decoder in which both the encoder and the decoder use self-attention
"Deep Fusion": Use an encoder-decoder in which both the encoder and the decoder use inter-attention as well as self-attention

Figure: on the left, the Shallow Fusion integration technique; on the right, the Deep Fusion technique, where the encoder and the decoder have both regular and self-attention.

WHY

Traditional LSTMs with attention may have a hard time storing knowledge that:

Requires remembering long sequences of text
Has structure (other than sequential ordering)

Traditional LSTMs have to recursively compress the knowledge in their memory cells at each time step; this makes it harder for them to represent finer concepts accurately.

CLAIMS

Language modelling
- LSTMN beats traditional LSTMs with the same memory size (as measured by perplexity)
Sentiment Analysis
- LSTMN beats traditional LSTMs on this task (measured by accuracy)
- But a CNN (called T-CNN) was better than both the LSTMN and traditional LSTMs
Natural Language Inference (textual entailment)
- LSTMN beats traditional LSTMs on this task (measured by accuracy)
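
For reference, the language modelling claim above is stated in terms of perplexity (lower is better). A small sketch of how that metric is computed from the probabilities a model assigns to the correct tokens; the function name `perplexity` and the toy probabilities are my own and do not come from the paper.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that gives every correct token probability 0.25 is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
uniform_like = perplexity([0.25, 0.25, 0.25, 0.25])

# A more confident model gets a lower perplexity on the same tokens.
confident = perplexity([0.9, 0.8, 0.7, 0.9])

print(uniform_like, confident)
```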

QUOTES

On self-attention: "A key idea behind the LSTMN is to use attention for inducing relations between tokens"

NOTES

The model is tested on the following tasks: language modeling, sentiment analysis, and natural language inference
The term "self-attention" doesn't seem to show up in this article - they call it "intra-attention" (as opposed to Bahdanau's "inter-attention")
There was no pre-training (self-supervised or otherwise)
- But they used pretrained embeddings

References

Cheng et al., 2016: Long Short-Term Memory-Networks for Machine Reading

Footnotes

1: "Memory networks" refers back to Weston et al., 2015: Memory Networks

Felipe 25 Dec 2022 26 Dec 2022 paper-summary attention sequence-learning