Paper Summary: A Simple but Tough-to-beat Baseline for Sentence Embeddings
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
It's an unsupervised method for building a sentence embedding from the embeddings of the individual words in the sentence.
HOW
- 1) Compute the weighted average of the word vectors in the sentence, where the weight of each word \(w\) is its Smooth Inverse Frequency (SIF):
$$ SIF(w)=\frac{a}{a+p(w)} $$
where \(a\) is a hyper-parameter and \(p(w)\) is the estimated word frequency in the corpus.
- 2) From each sentence embedding obtained in step 1), subtract its projection onto the first principal component of the matrix whose columns are all the sentence embeddings.
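The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names `sif_weight` and `sif_embeddings`, the toy vectors and frequencies, and the default `a=1e-3` are all assumptions made for the example. It also assumes that the weighted sum is divided by the number of known words in the sentence, that out-of-vocabulary words are skipped, and that the first principal component is taken from an uncentered SVD. Sentences are stored as rows here, so the first right singular vector plays the role of the first principal component of the column-wise matrix described in step 2.

```python
import numpy as np


def sif_weight(p_w, a=1e-3):
    """SIF(w) = a / (a + p(w)); rarer words get weights closer to 1."""
    return a / (a + p_w)


def sif_embeddings(sentences, word_vectors, word_prob, a=1e-3):
    """Sketch of the two steps.

    sentences:    list of token lists
    word_vectors: dict mapping token -> 1-D np.ndarray (same dimension for all)
    word_prob:    dict mapping token -> estimated corpus frequency p(w)
    a:            smoothing hyper-parameter (the default is an assumption)
    """
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))

    # Step 1: SIF-weighted average of the word vectors of each sentence.
    for i, tokens in enumerate(sentences):
        known = [t for t in tokens if t in word_vectors]
        if not known:
            continue  # no known words: keep the zero vector
        weights = np.array([sif_weight(word_prob.get(t, 0.0), a) for t in known])
        vectors = np.stack([word_vectors[t] for t in known])
        emb[i] = weights @ vectors / len(known)

    # Step 2: remove each embedding's projection onto the first
    # principal direction (first right singular vector, rows = sentences).
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)


# Toy data, purely illustrative.
word_vectors = {
    "the": np.array([1.0, 0.0, 0.0]),
    "cat": np.array([0.0, 1.0, 0.0]),
    "dog": np.array([0.0, 0.0, 1.0]),
}
word_prob = {"the": 0.05, "cat": 0.001, "dog": 0.001}
sentences = [["the", "cat"], ["the", "dog"], ["cat", "dog"]]

embeddings = sif_embeddings(sentences, word_vectors, word_prob)
print(embeddings.shape)
```

With these toy frequencies, the frequent word "the" gets a weight of roughly 0.02 while "cat" and "dog" get 0.5, which is the intended effect of the SIF weighting: common words contribute little to the sentence vector.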
CLAIMS
- It's a simple and unsupervised approach but it performs better (in unsupervised and supervised tasks) than more complex methods that need supervision, like RNNs and LSTMs.
NOTES
- In the experiments, TF-IDF weighted GloVe embeddings also had satisfactory results, sometimes better than all other methods (supervised or otherwise).