Paper Summary: Multi-instance multi-label learning for automatic tag recommendation
Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
The authors adapt a method from image classification to text.
It is an example of multi-instance learning because each sample (i.e. document) is viewed as a bag of instances (here, text segments) rather than as a single feature vector; a representation derived from each bag is then used to train an SVM classifier.
Originally, this was used to represent images as bags of visual elements (see references).
HOW
The source document is first split into "segments" using the TextTiling algorithm (with sentence boundaries as the initial candidates), so that each document becomes a "bag" of segments. The bags are then clustered with k-medoids, using the Hausdorff distance between bags.
Each bag is mapped to a k-dimensional vector, where the i-th element measures how well the bag fits the i-th cluster (for example, its distance to that cluster's medoid). This vector is used as the representation of the document, which is then classified using an SVM.
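To make the bag-to-vector step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes each segment has already been turned into a numeric vector (the toy 2-D points below are made up), uses the maximal Hausdorff distance with Euclidean distance between instances, and uses a simple alternating k-medoids. The function names (`hausdorff`, `k_medoids`, `bag_to_vector`) are hypothetical, and the final SVM step is left out.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two instance vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hausdorff(bag_a, bag_b):
    """Maximal Hausdorff distance between two bags of instance vectors."""
    def directed(xs, ys):
        # Worst case, over xs, of the distance to the closest element of ys.
        return max(min(euclidean(x, y) for y in ys) for x in xs)
    return max(directed(bag_a, bag_b), directed(bag_b, bag_a))


def k_medoids(bags, k, n_iter=20, seed=0):
    """Simple alternating k-medoids over bags; returns indices of medoid bags."""
    rng = random.Random(seed)
    n = len(bags)
    dist = [[hausdorff(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)
    for _ in range(n_iter):
        # Assignment step: each bag joins the cluster of its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest total
        # distance to the other members (keep the old one if a cluster is empty).
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


def bag_to_vector(bag, bags, medoids):
    """Represent a bag by its Hausdorff distance to each of the k medoids."""
    return [hausdorff(bag, bags[m]) for m in medoids]


# Toy data (assumption): four "documents", each a bag of two segment vectors.
bags = [
    [(0.0, 0.0), (0.1, 0.2)],
    [(0.2, 0.1), (0.0, 0.3)],
    [(5.0, 5.0), (5.2, 4.9)],
    [(4.8, 5.1), (5.1, 5.3)],
]
medoids = k_medoids(bags, k=2)
features = [bag_to_vector(b, bags, medoids) for b in bags]
print(medoids, features)
```

Each row of `features` is a fixed-length vector, so it can be handed to any standard single-instance classifier. Here a smaller value means a better fit, since the entries are distances.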
CLAIMS
The method is reported to perform better than baseline multi-label methods such as Binary Relevance (with SVM), ML-kNN and Label Powerset.
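As a reminder of what two of these baselines do to the training targets, here is a small sketch. The tag sets and function names are made up for illustration and do not come from the paper. Binary Relevance trains one binary classifier per label, while Label Powerset treats each distinct label combination as a single class.

```python
def binary_relevance_targets(label_sets, all_labels):
    """One binary target list per label: 1 if the document carries that label."""
    return {label: [int(label in s) for s in label_sets] for label in all_labels}


def label_powerset_targets(label_sets):
    """One multi-class target list: each distinct label combination is a class id."""
    class_ids = {}
    targets = []
    for s in label_sets:
        key = frozenset(s)
        if key not in class_ids:
            class_ids[key] = len(class_ids)
        targets.append(class_ids[key])
    return targets, class_ids


# Hypothetical tag sets for four documents.
tags = [{"python", "ml"}, {"ml"}, {"python", "ml"}, {"web"}]
all_labels = sorted(set().union(*tags))

br = binary_relevance_targets(tags, all_labels)
lp, classes = label_powerset_targets(tags)
print(br)
print(lp)
```

Binary Relevance ignores correlations between labels, while Label Powerset captures them but can only predict label combinations seen during training.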
References
Shen et al. 2009: Multi-instance multi-label learning for automatic tag recommendation
- This paper
Zhou and Zhang 2006 (NIPS): Multi-Instance Multi-Label Learning with Application to Scene Classification
- This is one of the earliest papers on the subject of multi-instance, multi-label learning.
Hearst 1997: TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages
- TextTiling is used to segment a document.