Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
TextRank is an approach for extracting keywords and key phrases from text documents and for performing (extractive) text summarization.
It is based on Google's PageRank ranking algorithm for web pages.
Many NLP tasks can be modelled as an (unsupervised) search for the most relevant words, phrases or sentences in a document.
It is therefore useful to leverage an efficient ranking algorithm such as PageRank to that end.
For Keyword Extraction, TextRank adapts PageRank so that nodes are keywords and edges are weighted by the number of co-occurrences between them.[1]
- Unlike PageRank, TextRank edges are weighted, to encode different levels of "linking" between keywords.
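To make the keyword graph concrete, here is a minimal sketch of a co-occurrence graph and a weighted PageRank iteration over it. The function names (`cooccurrence_graph`, `weighted_pagerank`), the damping factor of 0.85, the fixed iteration count and the toy token list are all my own assumptions for illustration; the sketch also assumes the tokens have already been filtered down to candidate words.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected graph whose edge weights count co-occurrences.

    Two tokens co-occur when they fall inside the same span of `window`
    tokens, so window=2 links only immediately adjacent tokens.
    """
    weights = defaultdict(lambda: defaultdict(float))
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + window]:
            if left != right:
                weights[left][right] += 1.0
                weights[right][left] += 1.0
    # Return plain dicts so later lookups cannot create keys by accident.
    return {node: dict(nbrs) for node, nbrs in weights.items()}


def weighted_pagerank(weights, damping=0.85, iterations=50):
    """Run a fixed number of weighted PageRank updates (assumed settings)."""
    scores = {node: 1.0 for node in weights}
    # Total edge weight attached to each node, used to normalize its votes.
    out_weight = {node: sum(nbrs.values()) for node, nbrs in weights.items()}
    for _ in range(iterations):
        new_scores = {}
        for node in weights:
            incoming = sum(
                weights[nbr][node] / out_weight[nbr] * scores[nbr]
                for nbr in weights[node]
            )
            new_scores[node] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores


# Hypothetical, already-filtered token sequence.
tokens = ["linear", "constraints", "linear", "equations", "linear", "systems"]
graph = cooccurrence_graph(tokens, window=2)
scores = weighted_pagerank(graph)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The highest-scoring entries of `ranked` would be taken as keywords. A convergence threshold instead of a fixed iteration count would be a natural refinement.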
For Sentence Extraction, nodes are sentences and edges are the extent of overlap between them.[2]
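The sentence overlap can be sketched as a small similarity function. This version assumes the normalization divides the number of shared tokens by the sum of the logarithms of the two sentence lengths, which is how I read the paper. The function name `sentence_similarity` and the zero-denominator guard are my own additions.

```python
import math


def sentence_similarity(tokens_a, tokens_b):
    """Count shared tokens, normalized by the log-lengths of both sentences."""
    denominator = math.log(len(tokens_a)) + math.log(len(tokens_b))
    if denominator == 0:
        # Both sentences have a single token, so log(1) + log(1) == 0.
        return 0.0
    shared = set(tokens_a) & set(tokens_b)
    return len(shared) / denominator


# Hypothetical tokenized sentences.
first = "the cat sat on the mat".split()
second = "the dog sat on the rug".split()
similarity = sentence_similarity(first, second)
```

These similarities would serve as edge weights in the sentence graph, which is then ranked in the same way as the keyword graph.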
- The unsupervised algorithm presented in this article obtains better results (Precision, F1-score) than the then state-of-the-art supervised approaches for keyword extraction.
- The ROUGE metric (text similarity) has been found to be highly correlated with actual human evaluations.
The baseline tasks were Keyword Extraction and Sentence Extraction (i.e. Extractive Summarization).
Keyword Extraction was applied to abstracts, not full-length texts.
The best results for Keyword Extraction were obtained using a window size of 2 and undirected edges.
Post-processing steps are used to form multiword key phrases from keywords.
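One plausible form of this post-processing is to join keywords that sit next to each other in the original text. The sketch below assumes exactly that; the function name `collapse_keywords` and the example inputs are hypothetical, and the paper's actual procedure may differ in detail.

```python
def collapse_keywords(tokens, keywords):
    """Join runs of adjacent keywords, in original token order, into phrases."""
    phrases = []
    current = []
    for token in tokens:
        if token in keywords:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    # Deduplicate while keeping first-seen order.
    return list(dict.fromkeys(phrases))


# Hypothetical document tokens and selected keywords.
tokens = "systems of linear constraints over natural numbers".split()
keywords = {"linear", "constraints", "natural", "numbers"}
phrases = collapse_keywords(tokens, keywords)
```

Here `phrases` holds "linear constraints" and "natural numbers", because each pair of keywords is adjacent in the token sequence.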
1: Co-occurrence within a window of 2 to 10 words, i.e. not only immediate co-occurrence.
2: Overlap here means the number of common tokens, normalized by the lengths of the two sentences (the paper divides by the sum of the logarithms of the two sentence lengths).