Paper Summary: A Simple but Tough-to-beat Baseline for Sentence Embeddings

Last updated: 13 May 2018

Table of Contents

WHAT
HOW
CLAIMS
NOTES

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

It's an unsupervised method that builds a sentence embedding from the embeddings of the individual words in the sentence.

HOW

1) Compute the weighted average of the word vectors in the sentence, where the weight of each word \(w\) is its SIF (Smooth Inverse Frequency):

$$ SIF(w)=\frac{a}{a+p(w)} $$

where \(a\) is a hyper-parameter and \(p(w)\) is the estimated word frequency in the corpus.
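
A minimal Python sketch of step 1, mostly as a note to self. It assumes sentences are already tokenized into lists of words and that `word_vectors` is a plain dict from word to a list of floats; the function names, the default `a=1e-3` and the way \(p(w)\) is estimated (token count divided by total tokens in the given sentences) are my own choices for illustration, not something taken from the paper's code.

```python
from collections import Counter


def sif_weights(sentences, a=1e-3):
    """Map each word to a / (a + p(w)), with p(w) estimated from token counts."""
    counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(counts.values())
    return {word: a / (a + count / total) for word, count in counts.items()}


def weighted_average(sentence, word_vectors, weights):
    """SIF-weighted average of the word vectors in one tokenized sentence."""
    dim = len(next(iter(word_vectors.values())))
    result = [0.0] * dim
    for word in sentence:
        vector = word_vectors[word]
        for i in range(dim):
            result[i] += weights[word] * vector[i]
    # Divide by the sentence length, so this is an average and not a sum.
    return [x / len(sentence) for x in result]


# Toy usage with made-up 2-dimensional vectors (hypothetical data).
corpus = [["the", "cat"], ["the", "dog"]]
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "dog": [1.0, 1.0]}
weights = sif_weights(corpus)
embedding = weighted_average(corpus[0], vectors, weights)
```

Frequent words such as "the" get a larger \(p(w)\) and therefore a smaller weight, so they contribute less to the average than rarer words.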

2) From each sentence embedding obtained in step 1), subtract its projection onto the first principal component of the matrix whose columns are all the sentence embeddings.
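
Step 2 could be sketched with NumPy roughly as follows. The function name is hypothetical, and I am assuming the sentence embeddings are stacked as rows (not columns), so the direction to remove is the first right singular vector. I am also assuming no mean-centering before the SVD; if a centered PCA is wanted, subtract the column means first.

```python
import numpy as np


def remove_first_component(embeddings):
    """Subtract from each row its projection onto the first singular vector."""
    X = np.asarray(embeddings, dtype=float)      # shape: (n_sentences, dim)
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    u = vt[0]                                    # unit vector, shape: (dim,)
    # X @ u gives each sentence's coordinate along u; the outer product
    # turns those coordinates back into vectors along u.
    return X - np.outer(X @ u, u)


# Toy usage with made-up sentence embeddings (hypothetical data).
sentence_embeddings = np.array([
    [1.0, 2.0, 0.5],
    [0.3, -1.0, 2.0],
    [2.0, 0.1, 0.4],
])
cleaned = remove_first_component(sentence_embeddings)
```

After this step every sentence embedding is orthogonal to the removed direction, which is the component shared most strongly across all sentences.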

CLAIMS

It's a simple, unsupervised approach, but it performs better (in both unsupervised and supervised tasks) than more complex methods that need supervision, such as RNNs and LSTMs.

NOTES

In the experiments, TF-IDF weighted GloVe embeddings also had satisfactory results, sometimes better than all other methods (supervised or otherwise).

References

Arora et al. 2017: A Simple but Tough-to-beat Baseline for Sentence Embeddings

Felipe, 13 May 2018. Tags: paper-summary, embeddings, compositionality, natural-language-processing
