Inducing Word and Part-of-Speech with Pitman-Yor Hidden Semi-Markov Models

Kei Uchiumi, Hiroshi Tsukahara, Daichi Mochihashi


Abstract

We propose a nonparametric Bayesian model for joint unsupervised word segmentation and part-of-speech tagging. Extending a previous model for word segmentation, our model, called a Pitman-Yor Hidden Semi-Markov Model (PYHSMM), can be viewed as a method for building a class $n$-gram language model directly on raw strings while integrating character-level and word-level information. Experiments on standard datasets for Japanese, Chinese, and Thai show that it outperforms previous results, yielding state-of-the-art accuracies. The model can also serve to analyze the structure of a language whose words are not identified a priori.