Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
The authors adapt a method from image classification to text.
It is an example of multi-instance learning because each sample (i.e. document) is viewed as a bag of features and these are used to train an SVM classifier.
Originally, this was used to represent images as a bag of visual elements (see references).
The source document is first split into "segments" using the TextTiling algorithm (with sentence boundaries as the initial candidates), so that each document becomes a "bag" of segments. The bags are then grouped into k clusters with k-medoids, using the Hausdorff distance between bags.
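The clustering step can be sketched roughly as follows. This is only a minimal illustration under several assumptions of mine: each bag is already a NumPy array of segment vectors (TextTiling and vectorisation are not shown), instances are compared with Euclidean distance, the standard (maximal) Hausdorff distance is used, and the medoids are initialised with a simple farthest-first rule. The names `hausdorff` and `k_medoids` are hypothetical helpers, not taken from the paper.

```python
import numpy as np

def hausdorff(bag_a, bag_b):
    """Standard (maximal) Hausdorff distance between two bags of shape [n, d]."""
    # Pairwise Euclidean distances between every instance in A and in B.
    dists = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    # Largest nearest-neighbour distance, taken in both directions.
    return max(dists.min(axis=1).max(), dists.min(axis=0).max())

def k_medoids(dist, k, n_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    # Assumption: farthest-first initialisation, chosen here for determinism.
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < k:
        medoids.append(int(dist[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        # Assign every bag to its nearest medoid.
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members) == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: the member with the smallest total distance to the others.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Toy bags: two "documents" near the origin and two near (5, 5).
bags = [
    np.array([[0.0, 0.0], [0.1, 0.0]]),
    np.array([[0.0, 0.1]]),
    np.array([[5.0, 5.0], [5.1, 5.0]]),
    np.array([[5.0, 5.2]]),
]
dist = np.array([[hausdorff(a, b) for b in bags] for a in bags])
medoids, labels = k_medoids(dist, k=2)
```

On this toy data the two bags near the origin end up in one cluster and the two near (5, 5) in the other.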
Each bag is mapped into a k-dimensional array, where the i-th element measures how well the bag fits into the i-th cluster. This mapping is used as a representation of the document, which is then classified using an SVM.
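A rough sketch of this mapping is given below. It assumes, as my own reading, that "how well a bag fits a cluster" is measured by the Hausdorff distance from the bag to that cluster's medoid; `embed_bags` is a hypothetical helper name, the toy bags are made up, and the medoids are hand-picked here rather than learned.

```python
import numpy as np

def hausdorff(bag_a, bag_b):
    """Standard (maximal) Hausdorff distance between two bags of shape [n, d]."""
    dists = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=2)
    return max(dists.min(axis=1).max(), dists.min(axis=0).max())

def embed_bags(bags, medoid_bags):
    """Map each bag to a k-dimensional vector of distances to the k medoids."""
    return np.array([[hausdorff(bag, m) for m in medoid_bags] for bag in bags])

# Toy bags of 2-d segment vectors (illustrative data only).
bags = [
    np.array([[0.0, 0.0], [0.1, 0.0]]),
    np.array([[0.0, 0.1]]),
    np.array([[5.0, 5.0], [5.1, 5.0]]),
    np.array([[5.0, 5.2]]),
]
# Assumption: bags 0 and 2 were selected as medoids by a previous clustering step.
medoid_bags = [bags[0], bags[2]]

# One row per document, one column per cluster (smaller value = better fit).
features = embed_bags(bags, medoid_bags)
```

The rows of `features` are ordinary fixed-length vectors, so they could be passed to any standard SVM implementation (for example scikit-learn's `SVC`), with the multi-label part handled by whichever decomposition is preferred.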
The method performs better than baseline multi-label methods such as Binary Relevance (with SVM), ML-kNN and Label Powerset.
- This paper
- This is one of the earliest papers on the subject of multi-instance, multi-label learning.
- TextTiling is used to segment a document.