This tutorial presents a pattern-based empirical approach to meaning representation and computation. It is a response to the finding by corpus linguists that "most meanings require the presence of more than one word for their normal realization". The tutorial shows how patterns are built from corpus evidence using machine learning methods, and discusses potential applications of patterns. It is intended for an audience with heterogeneous competences but with a common interest in corpus linguistics and computational models for meaning-related tasks in NLP. The goal is to equip the audience with a better understanding of the role played by patterns in natural language, an operative command of the methodology used to acquire patterns, and a forum in which to discuss their utility in NLP applications.
The relatively recent explosion of corpus-driven research has shown that intermediate text representations (ITRs), built from the bottom up, using corpus examples towards a complex representation of phrases, play an important role in dealing with the meaning disambiguation problem. It has been shown that it is possible to identify and to learn corpus patterns that encode the information that accounts for the senses of the verb and its arguments in context. These patterns link the syntactic structure of verbal phrases and the semantic types of their argument fillers via the role that each of these plays in the disambiguation of the phrase as a whole. The available solutions developed so far range from supervised to totally unsupervised approaches. The patterns obtained encode the information necessary for handling the meaning of each word individually as well as that of the phrase as a whole. As such, they are instrumental in building better language models for the contexts matched by such patterns. The semantic types used in pattern representation play a discriminative role; the patterns are therefore sense discriminative and as such can be used in word sense disambiguation and other meaning-related tasks. The meaning of a pattern as a whole is expressed as a set of basic implicatures. The implicatures are instrumental in textual entailment, semantic similarity, paraphrase generation, etc.
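To make the idea concrete, the following is a minimal sketch (with invented verbs, semantic types, and implicatures, purely for illustration) of how a sense-discriminative pattern might be represented and matched: each pattern pairs a verb's syntactic slots, filled with semantic types, with the implicature that the pattern selects.

```python
# Hypothetical sketch: sense-discriminative corpus patterns for the verb "fire".
# Pattern slots, semantic types, and implicatures are illustrative, not drawn
# from any actual pattern dictionary.
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    verb: str
    subject_type: str   # semantic type expected in the subject slot
    object_type: str    # semantic type expected in the object slot
    implicature: str    # meaning of the phrase as a whole

# The semantic types of the argument fillers discriminate the verb sense.
PATTERNS = [
    Pattern("fire", "Human", "Human",
            "[[Human 1]] dismisses [[Human 2]] from employment"),
    Pattern("fire", "Human", "Firearm",
            "[[Human]] discharges [[Firearm]]"),
]

def disambiguate(verb, subject_type, object_type):
    """Return the implicature of the first pattern matched by the typed clause."""
    for p in PATTERNS:
        if (p.verb, p.subject_type, p.object_type) == (verb, subject_type, object_type):
            return p.implicature
    return None  # an unmatched clause may be an exploitation of a norm

print(disambiguate("fire", "Human", "Firearm"))
# → [[Human]] discharges [[Firearm]]
```

A clause whose typed arguments match no stored pattern falls outside the norms, which is where the theory of exploitations (discussed in the tutorial) takes over.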
The corpus patterns methodology is designed to offer a viable solution to meaning representation. The techniques we present are widely applicable in NLP and they deal efficiently with data sparseness and open domain expression of semantic relationships. We show how including a corpus pattern module into an NLP system is beneficial for accurately and consistently resolving textual entailment, paraphrase generation and semantic similarity.
The tutorial is divided into three main, interconnected parts: (1) Probabilities, Corpus Patterns, and the Theory of Norms and Exploitations; (2) Inducing Semantic Types and Task-Oriented Semantic Ontologies; and (3) Machine Learning and Applications of Corpus Patterns.
1. Discovering Computable Semantic Properties of Verb Phrases
Why do we need patterns? How to analyse corpus data; lexical statistics; the theory of linguistic norms and exploitations; the Sketch Engine; sense-discriminative patterns.
2. Semantic Types and Ontologies
Inducing semantic types and ontologies; grouping and processing lexical sets according to their semantic types; argument structures, semantic frames, and patterns.
3. Statistical Models for Corpus Pattern Recognition and Extraction from Corpora
Finite state Markov chains and branching processes; Naive Bayesian and Gaussian random fields for computing conditional probabilities over semantic types; latent Dirichlet allocation for unsupervised pattern extraction; probably approximately correct (PAC) models and statistical query models; the joint source-channel model for recognition of normal patterns in text; recognising exploitations; using patterns in tasks such as computing textual entailment, paraphrase generation, and measuring textual similarity.
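As a flavour of the kind of model covered in this part, here is a hedged Naive Bayes sketch for computing conditional probabilities of a verb sense given the semantic types of its argument fillers. The counts and type labels below are invented for illustration; a real model would be estimated from pattern-annotated corpus data.

```python
# Hypothetical sketch: Naive Bayes over semantic types. Each observation is
# (subject_type, object_type, sense) for occurrences of one verb; a new typed
# clause is scored by P(sense) * prod_slot P(type | sense), with Laplace
# smoothing. The training data is invented, not from a real corpus.
from collections import Counter, defaultdict

observations = [
    ("Human", "Human", "dismiss"),
    ("Human", "Human", "dismiss"),
    ("Human", "Firearm", "shoot"),
    ("Human", "Projectile", "shoot"),
    ("Human", "Firearm", "shoot"),
]

prior = Counter(sense for _, _, sense in observations)
slot_counts = defaultdict(Counter)  # (slot, sense) -> Counter over types
for subj, obj, sense in observations:
    slot_counts[("subj", sense)][subj] += 1
    slot_counts[("obj", sense)][obj] += 1

def score(sense, subj_type, obj_type, alpha=1.0, vocab=4):
    """Unnormalised P(sense) * P(subj_type|sense) * P(obj_type|sense)."""
    p = prior[sense] / sum(prior.values())
    for slot, t in (("subj", subj_type), ("obj", obj_type)):
        c = slot_counts[(slot, sense)]
        p *= (c[t] + alpha) / (sum(c.values()) + alpha * vocab)
    return p

best = max(prior, key=lambda s: score(s, "Human", "Firearm"))
print(best)
# → shoot
```

The naive independence assumption over argument slots is what keeps the model tractable on sparse data; the tutorial contrasts this with richer models (Gaussian random fields, latent Dirichlet allocation) that relax it.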
THE TUTORS
Patrick Hanks is a lexicographer and corpus linguist. He currently holds two research professorships: one at the Research Institute of Information and Language Processing in the University of Wolverhampton, the other at the Bristol Centre for Linguistics in the University of the West of England (UWE, Bristol).
Elisabetta Jezek has been teaching Syntax and Semantics and Applied Linguistics at the University of Pavia since 2001. Her research interests and areas of expertise are lexical semantics, verb classification, argument structure theory, event structure in syntax and semantics, corpus annotation, and computational lexicography.
Daisuke Kawahara is an Associate Professor at Kyoto University. He is an expert in the areas of parsing, knowledge acquisition and information analysis. He teaches graduate classes in natural language processing. His current work is focused on automatic induction of semantic frames, semantic parsing, verb polysemic classes, and verb sense disambiguation.
Octavian Popescu is a researcher at IBM's T. J. Watson Research Center, Yorktown, working on computational semantics with a focus on corpus patterns for meaning processing. His work is focused on models for word sense disambiguation, textual entailment and paraphrase acquisition. He taught various NLP graduate courses in computational semantics at Trento University (IT), Colorado University at Boulder (US) and University of Bucharest (RO).
As a truly international team united by a common interest in corpus patterns, we are looking forward to meeting you in Beijing. -- Patrick, Elisabetta, Daisuke and Octavian.