Paper Summary: Distributed Representations of Sentences and Documents

Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.

WHAT

Extend the insights and methods used for word embeddings so as to learn embeddings for larger blocks of text, such as paragraphs and whole documents.

Two variants are described: PV-DM (Distributed Memory) and PV-DBOW (Distributed Bag of Words).

HOW

The approach depends on the variant:

PV-DM

In the PV-DM variant, you create one dummy token per document, each with its own embedding, and then proceed with normal CBOW-style learning.

The difference is that, for every context you use to predict the target word, you concatenate the dummy token's vector with the context word vectors (or average them together). Note that the same dummy token is used for every context that belongs to the same document.
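
To make the input construction concrete, here is a minimal sketch of how PV-DM training examples could be built. It is not the authors' code: the function names, the `DOC_<n>` token naming scheme, and the symmetric window are all assumptions made for illustration, and no actual training happens here.

```python
def pv_dm_examples(docs, window=2):
    """Yield (doc_token, context_words, target_word) triples, CBOW-style.

    `docs` is a list of tokenized documents. The "DOC_<n>" naming is a
    hypothetical convention; any unique per-document ID would do.
    """
    for doc_id, words in enumerate(docs):
        doc_token = f"DOC_{doc_id}"  # same token for every context in this document
        for i, target in enumerate(words):
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            yield doc_token, left + right, target


def combine_inputs(doc_vec, context_vecs, mode="concat"):
    """Combine the document vector with the context word vectors.

    mode="concat" joins everything into one long vector;
    mode="average" takes the element-wise mean (all vectors must share a size).
    """
    if mode == "concat":
        return list(doc_vec) + [x for vec in context_vecs for x in vec]
    vecs = [doc_vec] + list(context_vecs)
    return [sum(column) / len(vecs) for column in zip(*vecs)]


docs = [["the", "cat", "sat"], ["dogs", "bark"]]
examples = list(pv_dm_examples(docs, window=1))
# examples[1] is ("DOC_0", ["the", "sat"], "cat")

concatenated = combine_inputs([1.0, 3.0], [[3.0, 5.0]], mode="concat")
# concatenated is [1.0, 3.0, 3.0, 5.0]
averaged = combine_inputs([1.0, 3.0], [[3.0, 5.0]], mode="average")
# averaged is [2.0, 4.0]
```

In a real model, the combined vector would be fed to a softmax (or a cheaper approximation) over the vocabulary to predict the target word, and both word vectors and document vectors would be updated by gradient descent.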

PV-DBOW

In the PV-DBOW variant, the input/output pairs are more similar to the Skip-gram variant of Word2Vec.

You construct pairs with a dummy document token as input (as above) and a random word from that document as output. This forces the model to learn paragraph vectors that are good at predicting words in that paragraph.
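
A rough sketch of how such input/output pairs could be generated is shown below. This is a simplification under stated assumptions: the function name, the `DOC_<n>` token naming, and the `samples_per_doc` and `seed` parameters are hypothetical, and words are drawn uniformly from the whole document instead of from a sampled text window.

```python
import random


def pv_dbow_examples(docs, samples_per_doc=3, seed=0):
    """Yield (doc_token, output_word) pairs for PV-DBOW-style training.

    The input is only the document token; no context words are used.
    The "DOC_<n>" naming is a hypothetical convention.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    for doc_id, words in enumerate(docs):
        doc_token = f"DOC_{doc_id}"
        for _ in range(samples_per_doc):
            # Assumption: sample uniformly from the whole document.
            yield doc_token, rng.choice(words)


docs = [["the", "cat", "sat"], ["dogs", "bark"]]
pairs = list(pv_dbow_examples(docs, samples_per_doc=3))
```

Because word order and context are ignored, this variant only needs to store the document vectors and the output layer, which makes it lighter than PV-DM.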

WHY

Because the authors want to generalize word embeddings to larger blocks of text such as sentences, paragraphs and documents.

CLAIMS

  • Works better than the methods available at the time for building higher-level embeddings, because those usually relied on things like parse trees, which restrict the models to sentences only.

  • State-of-the-art results on the IMDB Sentiment Classification Task (using a single-layer logistic net as classifier).

NOTES

  • The methods described in this paper are available under the name Doc2Vec in the Gensim NLP framework.

  • PV-DM alone is better than PV-DBOW alone, but a combination of the two is more stable.

  • They also test their document representation on an information retrieval task (interestingly, just averaging word vectors gives worse results than simple BOW features).
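
The two vector-level operations mentioned in these notes can be sketched in a few lines. This assumes the two variants are combined by concatenating their per-document vectors, and that the word-vector baseline is a plain unweighted mean. The function names and toy vectors are hypothetical and not taken from the paper or from Gensim.

```python
def concat_vectors(dm_vec, dbow_vec):
    """Join the PV-DM and PV-DBOW vectors of one document into one feature vector."""
    return list(dm_vec) + list(dbow_vec)


def average_word_vectors(words, word_vecs):
    """Baseline document representation: mean of the known word vectors.

    Words missing from `word_vecs` are skipped.
    """
    known = [word_vecs[w] for w in words if w in word_vecs]
    if not known:
        raise ValueError("no known words in document")
    return [sum(column) / len(known) for column in zip(*known)]


# Toy vectors, made up for illustration only.
features = concat_vectors([0.1, 0.2], [0.3])
# features is [0.1, 0.2, 0.3]

word_vecs = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
baseline = average_word_vectors(["good", "movie", "unseen"], word_vecs)
# baseline is [0.5, 0.5]
```

Either kind of vector can then be passed to a downstream classifier or used for similarity search. The averaging baseline throws away everything except which words appear, which may be part of why it does poorly in the retrieval experiment.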

