Summarization of Multi-Document Topic Hierarchies using Submodular Mixtures

Ramakrishna Bairi, Rishabh Iyer, Ganesh Ramakrishnan, Jeff Bilmes


Abstract

We study the problem of summarizing DAG-structured topic hierarchies over a given set of documents. Example applications include automatically generating Wikipedia disambiguation pages for a set of articles, and generating candidate multi-labels for preparing machine learning datasets (e.g., for text classification, functional genomics, and image classification). Unlike previous work, which focuses on clustering the set of documents using the topic hierarchy as features, we directly pose the problem as a submodular optimization problem on a topic hierarchy using the documents as features. Desirable properties of the chosen topics include document coverage, specificity, topic diversity, and topic homogeneity, each of which, we show, is naturally modeled by a submodular function. Other information, provided say by unsupervised approaches such as LDA and its variants, can also be utilized by defining a submodular function that expresses coherence between the chosen topics and this information. We use a large-margin framework to learn convex mixtures over the set of submodular components. We empirically evaluate our method on the problem of automatically generating Wikipedia disambiguation pages using human-generated clusterings as ground truth. We find that our framework improves upon several baselines according to a variety of standard evaluation metrics including the Jaccard Index, F1 score, and NMI, and moreover, scales to extremely large problems.
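To make the notion of a submodular mixture concrete, the following is a minimal sketch, not the paper's actual formulation: it combines a facility-location coverage term with a pairwise diversity penalty under hypothetical similarity matrices, and maximizes the mixture with the standard greedy heuristic. All function names, matrices, and mixture weights here are illustrative assumptions.

```python
import numpy as np

def facility_location(S, sim):
    # Coverage: each document is credited with its best similarity to a
    # selected topic; this is a classic monotone submodular function.
    if not S:
        return 0.0
    return float(sim[:, sorted(S)].max(axis=1).sum())

def diversity(S, topic_sim):
    # Diversity: penalize pairwise similarity among selected topics
    # (negated sum of a supermodular penalty, hence submodular).
    S = sorted(S)
    return -sum(topic_sim[i, j] for a, i in enumerate(S) for j in S[a + 1:])

def mixture(S, sim, topic_sim, w=(1.0, 0.5)):
    # Convex-style mixture of submodular components; in the paper the
    # weights are learned with a large-margin method, here they are fixed.
    return w[0] * facility_location(S, sim) + w[1] * diversity(S, topic_sim)

def greedy(n_topics, k, objective):
    # Greedy selection: repeatedly add the topic with the largest marginal
    # gain. For monotone submodular objectives this carries a (1 - 1/e)
    # guarantee; with the diversity penalty it is a common heuristic.
    S = []
    for _ in range(k):
        base = objective(S)
        best, best_gain = None, -np.inf
        for t in range(n_topics):
            if t in S:
                continue
            gain = objective(S + [t]) - base
            if gain > best_gain:
                best, best_gain = t, gain
        S.append(best)
    return S

# Toy instance: 3 documents, 4 candidate topics (values are made up).
doc_topic_sim = np.array([[0.9, 0.1, 0.0, 0.2],
                          [0.1, 0.8, 0.0, 0.1],
                          [0.0, 0.1, 0.9, 0.2]])
topic_topic_sim = np.array([[1.0, 0.1, 0.1, 0.5],
                            [0.1, 1.0, 0.1, 0.5],
                            [0.1, 0.1, 1.0, 0.5],
                            [0.5, 0.5, 0.5, 1.0]])

chosen = greedy(4, 3, lambda S: mixture(S, doc_topic_sim, topic_topic_sim))
```

On this toy instance the greedy procedure selects the three specific topics (0, 1, 2) that each cover one document, and skips the generic topic 3 that overlaps with all of them, illustrating how coverage and diversity interact in the mixture.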