Paper Summary: Context is Everything: Finding Meaning Statistically in Semantic Spaces

Last updated:

Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.


  • Introduces the concept of CoSal (Contextual Salience), which extends the IDF term in TF-IDF to include not just the relative rarity of a word with respect to the corpus but also how rare that word is in the document.

    • In other words, each word is weighted differently depending not just on the document it is in but also on what that document's word distribution looks like.
  • Trains a sentence embedding model using CoSal, which beats SkipThought (Kiros et al 2014) and bi-LSTM (Peters et al 2017) on SentEval.


The motivation is that TF-IDF and related measures have drawbacks, such as needing a large number of documents before the rarity of a word with respect to the corpus can be inferred reliably.
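To make that drawback concrete for myself, here is a minimal TF-IDF sketch in plain Python. It is not from the paper: the `tf_idf` helper, the toy documents, and the particular variant (relative term frequency times unsmoothed log inverse document frequency) are my own assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, only for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

With only three documents, a word's document frequency can only be 1, 2, or 3, so the IDF term can take only three distinct values. That coarseness is one way to see why the rarity estimate needs many documents.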


  • To calculate the CoSal of a word embedding they multiply the global (corpus) covariance and the local (document or sentence) covariance.

  • They use Mahalanobis distance (M-distance) with normalized embeddings rather than simple Euclidean distance, because M-distance takes into account the covariance of the dataset.

  • To build sentence embeddings from individual CoSal word vectors they calculate the sigmoid (to weight individual contributions) of the distance between each word in the sentence and the sentence average (as per Arora et al 2017).
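The steps above can be sketched roughly as follows with NumPy. This is my own simplified reading, not the authors' implementation: it uses a single covariance matrix estimated from random stand-in "corpus" vectors instead of combining global and local covariances, it standardizes the distances before the sigmoid, and the function names and toy data are all hypothetical.

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of x from mean, given an inverse covariance."""
    diff = x - mean
    # Clamp at zero to guard against tiny negative values from round-off.
    return float(np.sqrt(max(diff @ cov_inv @ diff, 0.0)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sentence_embedding(word_vecs, cov):
    """Weighted average of word vectors; words far from the sentence mean
    (in Mahalanobis terms) receive larger weights."""
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular cov
    mean = word_vecs.mean(axis=0)
    dists = np.array([mahalanobis(v, mean, cov_inv) for v in word_vecs])
    # Assumption: standardize distances so the sigmoid is centred on the
    # typical word in this sentence.
    z = (dists - dists.mean()) / (dists.std() + 1e-8)
    weights = sigmoid(z)
    weights = weights / weights.sum()
    return weights @ word_vecs, weights

rng = np.random.default_rng(0)
# Stand-in for corpus word embeddings (500 words, 4 dimensions).
corpus_vecs = rng.normal(size=(500, 4))
cov = np.cov(corpus_vecs, rowvar=False)

# A toy "sentence": three near-identical words plus one outlier (index 3).
sentence = np.vstack([
    rng.normal(scale=0.1, size=(3, 4)),
    np.full((1, 4), 3.0),
])
embedding, weights = sentence_embedding(sentence, cov)
print(weights)
```

In this toy sentence the outlier word lies farthest from the sentence mean, so it receives the largest weight. The sigmoid turns a moderate gap in distance into a large gap in contribution.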


  • "Words that are slightly more contextually important contribute much more to the meaning of a sentence than words which are only slightly less contextually important."

  • "TF-IDF is not good because it is good at detecting unusual words, but because the unusualness of a word happens to be a good proxy for their contextual unusualness."

  • "With very small datasets (Many fewer words than dimensions), using lemmatization can improve performance."

  • Outperforms TF-IDF in all SentEval tasks while needing fewer training examples.

  • Generalizing this method to produce document embeddings does not work very well.


  • Bag-of-words: any process that turns a document into features without taking word order into account (i.e. a bag holds no order).

  • TF-IDF: a specific way to weight bag-of-words features. Assumes that words that occur in just a few documents in the CORPUS are more informative.

  • The sentence embeddings trained here are in the same space as the word embeddings, so if you take a word whose vector lies close to a sentence's vector, that word will be a rough approximation of the meaning of the sentence.
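As a rough illustration of that last note, the sketch below looks up the words whose vectors lie closest to a sentence vector in a shared space. The vocabulary, the vectors, and the `nearest_words` helper are made up for the example, and cosine similarity is my own choice of closeness measure here, not necessarily what the paper uses.

```python
import numpy as np

def nearest_words(sentence_vec, vocab, word_vecs, k=3):
    """Return the k words most similar (by cosine) to the sentence vector."""
    unit_words = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    unit_sent = sentence_vec / np.linalg.norm(sentence_vec)
    sims = unit_words @ unit_sent
    top = np.argsort(-sims)[:k]  # indices of the k largest similarities
    return [(vocab[i], float(sims[i])) for i in top]

# Hypothetical 3-dimensional embeddings, only for illustration.
vocab = ["cat", "dog", "car", "road"]
word_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
])
# A made-up sentence vector that sits between "cat" and "dog".
sentence_vec = np.array([0.85, 0.15, 0.05])

print(nearest_words(sentence_vec, vocab, word_vecs, k=2))
```

Because the sentence vector sits between the "cat" and "dog" vectors, those two words come out as its nearest neighbours, which is the sense in which a nearby word approximates the sentence.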