Content Models for Survey Generation: A Factoid-Based Evaluation

Rahul Jha, Catherine Finegan-Dollak, Ben King, Reed Coke, Dragomir Radev


Abstract

We present a new factoid-annotated dataset, containing 3,425 sentences from 7 topics in natural language processing, for evaluating content models for scientific survey article generation. We also introduce HitSum, a novel HITS-based content model for automated survey article generation that exploits the lexical network structure between sentences from citing and cited papers. Using the factoid-annotated data, we conduct a pyramid evaluation and compare HitSum with two previous state-of-the-art content models: C-Lexrank, a network-based content model, and TopicSum, a Bayesian content model. Our experiments show that the new content model captures useful survey-worthy information, outperforming C-Lexrank by 4% and TopicSum by 7% in pyramid evaluation.
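The abstract's HITS-based ranking can be illustrated with a small sketch. In the HITS framework, hub and authority scores are computed by power iteration over a weighted graph; here we assume (for illustration only, this is not the paper's implementation) a bipartite matrix whose rows are sentences from citing papers (hubs) and whose columns are sentences from cited papers (authorities), with edge weights standing in for lexical similarity.

```python
import numpy as np

def hits_scores(W, iters=50, tol=1e-9):
    """Power-iterate hub/authority scores on a weight matrix W (hubs x authorities)."""
    hubs = np.ones(W.shape[0])
    auths = np.ones(W.shape[1])
    for _ in range(iters):
        new_auths = W.T @ hubs                  # authority = weighted sum of hub scores
        new_auths /= np.linalg.norm(new_auths)  # normalize to unit length
        new_hubs = W @ new_auths                # hub = weighted sum of authority scores
        new_hubs /= np.linalg.norm(new_hubs)
        converged = (np.allclose(new_hubs, hubs, atol=tol)
                     and np.allclose(new_auths, auths, atol=tol))
        hubs, auths = new_hubs, new_auths
        if converged:
            break
    return hubs, auths

# Toy 3x2 similarity matrix: 3 citing sentences, 2 cited sentences.
W = np.array([[0.9, 0.1],
              [0.4, 0.6],
              [0.0, 0.8]])
hub, auth = hits_scores(W)
# Cited sentences with high authority scores would be candidate survey-worthy content.
```

In this toy example, the second cited sentence accumulates more similarity mass and so receives the higher authority score; a content model built this way would rank its sentence as the stronger survey candidate.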