Demographic Factors Improve Classification Performance

Dirk Hovy


Abstract

Extra-linguistic factors influence language use, and are accounted for by speakers and listeners. Most natural language processing (NLP) tasks to date, however, treat language as uniform. This assumption can harm performance. We investigate the effect of including demographic information on performance in a variety of text-classification tasks. We find that by including age or gender information, we consistently and significantly improve performance over demographic-agnostic models. These results hold across three text-classification tasks in five languages.