What’s in a Domain? Analyzing Genre and Topic Differences in Statistical Machine Translation

Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp, Christof Monz


Abstract

Domain adaptation is an active field of research in statistical machine translation (SMT), but so far most work has ignored the distinction between the topic and the genre of documents. In this paper we quantify and disentangle the impact of genre and topic differences on translation quality by introducing a new data set with controlled topic and genre distributions. In addition, we perform a detailed analysis showing that differences across topics explain translation performance differences across genres only to a limited degree, and that genre-specific errors are attributable more to model coverage than to suboptimal scoring of translation candidates.