Paper Summary: From Word to Sense Embeddings: A Survey on Vector Representations of Meaning


Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


The authors survey techniques for training sense vectors, which model each word sense separately, as opposed to word vectors, which represent each word with a single vector.


Word-sense Disambiguation (WSD) is one of the most important NLP/CL tasks; it is one of the tasks included in SemEval.

It is a task that's useful on its own and it can help other downstream tasks such as Semantic Role Labelling (SRL), Machine Translation (MT) and even Information Retrieval (IR).


The authors divide approaches to building sense vectors into two classes:

  • Unsupervised models use raw text corpora to learn word senses automatically, i.e. by analyzing the types of contexts each word appears in.

  • Knowledge-based techniques use external sources of sense information such as WordNet synsets.


  • One of the main limitations of representing words in the vector space model (VSM) is the meaning conflation deficiency: the inability of word vectors to discriminate among several meanings of a single word.

  • Usage of sense embeddings instead of word embeddings enhances performance in multiple NLP downstream tasks.


  • According to the authors, most traditional ways to find word senses (so-called two-pass learning) are related to finding all contexts in which a word appears (neighbour words) and then clustering them. Each cluster thereby found is a separate word sense.

  • Traditional word embedding strategies (Word2Vec, GloVe, etc.) train a single, static representation for each word. Some newer methods use dynamic representations instead; these slightly modify the representation of each word depending on its context at inference time. These are called Contextualized Embeddings. See Context2Vec below.

  • Retrofitting embeddings means adding a post-processing step to already-trained word embeddings: in addition to the original neural objective function, the vectors are fine-tuned to maximize some other function (e.g. similarity as measured by WordNet).

  • Multiple strategies rely on disambiguating a text corpus before training. This means replacing each ambiguous word (one with more than one sense) with a sense-tagged token. A regular embedding algorithm trained on this dataset is then forced to learn a different representation for each sense. For instance, a disambiguated text would look like this:

    • "Cancer cells#biology are among the ..."
    • "... the main challenge for electric car manufacturers is to build efficient battery cells#energy-storage..."
    • "... each of the many organization cells#organizational-unit has been disbanded."
    • "... some animals have undifferentiated cells#biology when they are embryos."
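
A minimal sketch of why the tagging helps, assuming the `word#sense` convention from the examples above. The helper name `cooccurrence_counts`, the window size and the shortened toy sentences are all hypothetical. A real pipeline would pass the tagged tokens to an embedding trainer rather than count co-occurrences, but the point is the same: each sense becomes its own vocabulary entry with its own contexts.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count context words within `window` tokens of each target token."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

# Toy sense-tagged corpus (illustrative sentences, already tokenized).
tagged = [
    "cancer cells#biology divide quickly".split(),
    "efficient battery cells#energy-storage store charge".split(),
    "embryos have undifferentiated cells#biology".split(),
]

counts = cooccurrence_counts(tagged)

# Each sense is a separate vocabulary entry with its own context statistics.
sense_tokens = sorted(t for t in counts if t.startswith("cells#"))
print(sense_tokens)
print(sorted(counts["cells#biology"]))
print(sorted(counts["cells#energy-storage"]))
```

Because `cells#biology` and `cells#energy-storage` never share a row, any model trained on these statistics has no reason to merge them into one vector.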

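The two-pass idea mentioned earlier (collect the contexts of a word, then cluster them) can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, and the greedy word-overlap clustering is a deliberately simple stand-in for whatever clustering algorithm a real system would use over context vectors.

```python
def collect_contexts(sentences, target, window=3):
    """Pass 1: gather the set of neighbouring words for each occurrence of `target`."""
    contexts = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                contexts.append({tokens[j] for j in range(lo, hi) if j != i})
    return contexts

def cluster_contexts(contexts, min_shared=1):
    """Pass 2: greedy clustering. A context joins the first cluster that shares
    at least `min_shared` words with it; otherwise it starts a new cluster."""
    clusters = []  # each cluster: {"vocab": set of words, "members": context indices}
    for idx, ctx in enumerate(contexts):
        for cluster in clusters:
            if len(cluster["vocab"] & ctx) >= min_shared:
                cluster["vocab"] |= ctx
                cluster["members"].append(idx)
                break
        else:
            clusters.append({"vocab": set(ctx), "members": [idx]})
    return clusters

# Toy corpus (illustrative); "cells" is the ambiguous target word.
corpus = [
    "cancer cells divide in tissue".split(),
    "battery cells store electric charge".split(),
    "embryo cells divide rapidly".split(),
    "lithium battery cells lose charge".split(),
]

contexts = collect_contexts(corpus, "cells")
clusters = cluster_contexts(contexts)
members = [c["members"] for c in clusters]
print(members)  # each inner list is one induced sense
```

On this toy corpus the biology contexts and the battery contexts end up in separate clusters, each of which would be treated as one sense. Overlap on function words would easily break this naive version, which is one reason real systems cluster weighted context vectors instead.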

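Retrofitting, as described earlier, can be illustrated with a small iterative-averaging sketch: each vector is repeatedly moved toward the vectors of its lexicon neighbours while being anchored to its original position. Everything here is an assumption for illustration: the `retrofit` function, the `alpha` and `beta` weights, and the hand-written synonym pairs standing in for a resource such as WordNet. It is not necessarily the formulation used by any method covered in the survey.

```python
def retrofit(vectors, neighbours, alpha=1.0, beta=1.0, iterations=10):
    """Pull each vector toward its lexicon neighbours while keeping it
    close to its original (pre-trained) value."""
    new = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in new]
            if word not in new or not nbrs:
                continue
            total = alpha + beta * len(nbrs)
            # Weighted average of the original vector and the current neighbour vectors.
            new[word] = [
                (alpha * vectors[word][d] + beta * sum(new[n][d] for n in nbrs)) / total
                for d in range(len(vectors[word]))
            ]
    return new

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy 2-d "pre-trained" vectors and hypothetical synonym pairs.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [-1.0, -1.0]}
neighbours = {"happy": ["glad"], "glad": ["happy"]}

fitted = retrofit(vectors, neighbours)
before = sq_dist(vectors["happy"], vectors["glad"])
after = sq_dist(fitted["happy"], fitted["glad"])
print(before, after)
```

The synonyms end up closer together than they started, while words with no lexicon neighbours (here `table`) keep their original vectors.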