Paper Summary: Distributed Representations of Sentences and Documents
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
Extends the insights and methods behind word embeddings (Word2Vec) to learn embeddings for larger blocks of text, such as paragraphs and whole documents.
Two variants are described: PV-DM (Distributed Memory) and PV-DBOW (Distributed Bag of Words)
HOW
It depends on the variant:
PV-DM
In the PV-DM variant, you create a dummy token for each document and then do normal CBOW-style learning.
The difference is that, for every context used to predict the target word, the dummy token's vector is concatenated with (or averaged with) the context word vectors. The same dummy token is shared by every context that belongs to the same document.
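To make the idea concrete, here is a toy sketch of my own (not the paper's code) showing how PV-DM training examples can be constructed: each example pairs a document id (the "dummy token") and a context window with the target word.

```python
# Toy sketch of PV-DM training-pair construction (illustrative only).
# Each training example: (document id, context word list) -> target word.
def pv_dm_pairs(documents, window=2):
    """documents: list of token lists; yields (doc_id, context, target)."""
    for doc_id, tokens in enumerate(documents):
        for i, target in enumerate(tokens):
            # Context words around the target, within the window.
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if context:
                # doc_id plays the role of the dummy document token: its vector
                # is concatenated/averaged with the context word vectors.
                yield doc_id, context, target

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["dogs", "chase", "cats"]]
for pair in pv_dm_pairs(docs):
    print(pair)
```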
PV-DBOW
In the PV-DBOW variant, the input/output pairs are more similar to the Skip-gram variant of Word2Vec.
You construct pairs with the dummy document token as input (as above) and random words from that document as output. This forces the model to learn paragraph vectors that are good at predicting the words in that paragraph.
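Again, a toy sketch of my own (not the paper's code) of how the PV-DBOW input/output pairs can be generated: the paragraph vector alone has to predict words sampled from its document.

```python
# Toy sketch of PV-DBOW training-pair construction (illustrative only).
# Each training example: document id -> a word sampled from that document.
import random

def pv_dbow_pairs(documents, samples_per_doc=4, seed=0):
    """documents: list of token lists; yields (doc_id, word) pairs."""
    rng = random.Random(seed)
    for doc_id, tokens in enumerate(documents):
        for _ in range(samples_per_doc):
            # The paragraph vector alone must predict words from its document.
            yield doc_id, rng.choice(tokens)

docs = [["the", "cat", "sat", "on", "the", "mat"],
        ["dogs", "chase", "cats"]]
for pair in pv_dbow_pairs(docs):
    print(pair)
```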
WHY
Because the authors want to generalize word embeddings to larger blocks of text such as sentences, paragraphs and documents.
CLAIMS
Works better than the methods available at the time for building higher-level embeddings, because those usually relied on structures such as parse trees, which restricts them to sentence-level text.
State-of-the-art results in the IMDB Sentiment Classification Task (using a single-layer logistic net as classifier)
NOTES
The methods described in this paper are implemented as Doc2Vec in the Gensim NLP framework (see the sketch at the end of these notes).
PV-DM alone is better than PV-DBOW alone, but a mixture of the two is more stable.
They also test their document representation on an information retrieval task (interestingly, just averaging word vectors gives worse results than simple BOW features).
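For reference, here is a minimal sketch (mine, not from the paper) of training both variants with Gensim's Doc2Vec and concatenating their paragraph vectors, in line with the paper's recommendation to combine them. Parameter names follow Gensim 4.x (`dm=1` selects PV-DM, `dm=0` selects PV-DBOW; older versions expose document vectors as `docvecs` instead of `dv`), and the toy corpus is made up.

```python
# Sketch using Gensim's Doc2Vec (API as of Gensim 4.x; verify against your version).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

raw_docs = [["the", "cat", "sat", "on", "the", "mat"],
            ["dogs", "chase", "cats"]]
tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(raw_docs)]

# dm=1 -> PV-DM, dm=0 -> PV-DBOW.
pv_dm = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, dm=1, epochs=40)
pv_dbow = Doc2Vec(tagged, vector_size=50, min_count=1, dm=0, epochs=40)

# Concatenate the two paragraph vectors per document; the result can then be fed
# to a simple classifier (e.g. logistic regression), as in the IMDB experiments.
features = np.array([np.concatenate([pv_dm.dv[i], pv_dbow.dv[i]])
                     for i in range(len(raw_docs))])
print(features.shape)  # (num_docs, 100)
```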