Preliminary Program

Coupled Sequence Labeling on Heterogeneous Annotations: POS Tagging as a Case Study

Zhenghua Li, Jiayuan Chao, Min Zhang, Wenliang Chen

Abstract

In order to effectively utilize multiple datasets with heterogeneous annotations, this paper proposes a coupled sequence labeling model that can directly learn and infer two heterogeneous annotations simultaneously, and to facilitate discussion we use Chinese part-of-speech (POS) tagging as our case study. The key idea is to bundle two sets of POS tags together (e.g. ``[NN,n]''), and build a conditional random field (CRF) based tagging model in the enlarged space of bundled tags with the help of ambiguous labelings. To train our model on two non-overlapping datasets that each has only one-side tags, we transform a one-side tag into a set of bundled tags by considering all possible mappings at the missing side and derive an objective function based on ambiguous labelings. The key advantage of our coupled model is to provide us with the flexibility of 1) incorporating joint features on the bundled tags to implicitly learn the loose mapping between heterogeneous annotations, and 2) exploring separate features on one-side tags to overcome the data sparseness problem of using only bundled tags. Experiments on benchmark datasets show that our coupled model significantly outperforms the state-of-the-art baselines on both one-side POS tagging and annotation conversion tasks. The codes and newly annotated data are released for non-commercial usage.