Progress in wide-coverage parsing has been made in recent years by combining rule-based parsing, using grammars automatically induced from hand-annotated treebanks, with statistical models of semantically and pragmatically relevant properties of parses, such as head-word dependencies. Until recently, such parsers have been based on highly overgenerating context-free covering grammars. The analyses that these grammars yield depart in important respects from interpretable structures. In particular, they fail to include the long-range "deep" semantic dependencies that are involved in relative and coordinate constructions.
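To make the point about long-range dependencies concrete, here is a minimal, purely illustrative sketch (not the parser described in the talk) of CCG's application and composition rules over categories written as strings, showing how a relative pronoun with category (N\N)/(S/NP) recovers the "deep" object dependency in a phrase like "man that Mary saw":

```python
# Illustrative toy implementation of three CCG combinatory rules over
# string-encoded categories. All category assignments are textbook
# examples, not the treebank-derived lexicon from the talk.

def unwrap(cat):
    """Remove one pair of redundant outer parentheses, if present."""
    if cat.startswith('(') and cat.endswith(')'):
        depth = 0
        for i, c in enumerate(cat):
            depth += (c == '(') - (c == ')')
            if depth == 0 and i < len(cat) - 1:
                return cat  # outer '(' closes early, so parens are needed
        return cat[1:-1]
    return cat

def split_cat(cat):
    """Split at the rightmost outermost slash: (result, slash, argument)."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        c = cat[i]
        if c == ')':
            depth += 1
        elif c == '(':
            depth -= 1
        elif depth == 0 and c in '/\\':
            return cat[:i], c, cat[i + 1:]
    return None  # atomic category

def forward(left, right):
    """Forward application: X/Y  Y  =>  X."""
    s = split_cat(unwrap(left))
    if s and s[1] == '/' and unwrap(s[2]) == unwrap(right):
        return unwrap(s[0])
    return None

def backward(left, right):
    """Backward application: Y  X\\Y  =>  X."""
    s = split_cat(unwrap(right))
    if s and s[1] == '\\' and unwrap(s[2]) == unwrap(left):
        return unwrap(s[0])
    return None

def compose(left, right):
    """Forward composition: X/Y  Y/Z  =>  X/Z."""
    l, r = split_cat(unwrap(left)), split_cat(unwrap(right))
    if l and r and l[1] == r[1] == '/' and unwrap(l[2]) == unwrap(r[0]):
        x = unwrap(l[0])
        return (f"({x})" if split_cat(x) else x) + '/' + r[2]
    return None

# Ordinary clause: "Mary saw John" => S, via application alone.
vp = forward("(S\\NP)/NP", "NP")       # "saw John" : S\NP
s = backward("NP", vp)                 # "Mary saw John" : S

# Relative clause: type-raised "Mary" S/(S\NP) composes with "saw"
# (S\NP)/NP to give "Mary saw" : S/NP -- a constituent with the object
# gap still visible -- which "that" (N\N)/(S/NP) then consumes,
# recovering the long-range dependency between "saw" and "man".
mary_saw = compose("S/(S\\NP)", "(S\\NP)/NP")   # "Mary saw" : S/NP
rel_clause = forward("(N\\N)/(S/NP)", mary_saw)  # "that Mary saw" : N\N
```

The key design point this illustrates is that CCG's composition and type-raising rules make "Mary saw" a genuine constituent of category S/NP, so the dependency filled by the relative pronoun falls out of ordinary category combination rather than a separate gap-threading mechanism.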
The paper reviews some recent experiments in automatically extracting an expressive CCG grammar from the Penn Treebank, and its use for wide-coverage parsing with statistical head-dependency models also derived from the treebank. These parsers achieve state-of-the-art coverage and speed, and perform well on hard sentences involving long-range dependencies. The paper reports experiments on porting the parser to a corpus of questions for a successful application to the TREC QA task, including the provision of semantic interpretations in the form of Discourse Representation Structures (DRS).
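For readers unfamiliar with DRS, a Discourse Representation Structure is a "box" pairing a set of discourse referents with conditions over them. The following is a hypothetical minimal encoding for illustration only; it does not reproduce the output format of the system described in the talk:

```python
# Toy sketch of a Discourse Representation Structure (DRT-style "box"):
# a set of discourse referents plus conditions over them. The example
# sentence and predicate names are invented for illustration.
from dataclasses import dataclass

@dataclass
class DRS:
    referents: list[str]   # discourse referents introduced by the sentence
    conditions: list[str]  # predications over those referents

    def __str__(self):
        # Render in the conventional flat notation [refs | conditions].
        return "[{} | {}]".format(", ".join(self.referents),
                                  ", ".join(self.conditions))

# "Mary saw a man": the indefinite introduces referent y alongside x.
drs = DRS(["x", "y"], ["mary(x)", "man(y)", "saw(x, y)"])
print(drs)  # [x, y | mary(x), man(y), saw(x, y)]
```

Because the referents sit in the box alongside the conditions, later sentences in a discourse can add conditions that mention y, which is what makes DRS a natural target representation for QA over multi-sentence text.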
Error analysis shows that the main obstacle to improved performance is poor coverage in the treebank-derived lexicon. If I have time, I will talk about work in progress on unsupervised generalization of the lexicon using unlabelled text.
Mark Steedman is Professor in the School of Informatics at the University of Edinburgh. He received his PhD from the University of Edinburgh in 1973. He re-joined the University of Edinburgh in 1998, after teaching at the Universities of Warwick, Edinburgh, and most recently Pennsylvania, where he was Professor in the Department of Computer and Information Science. He was an Alfred P. Sloan Foundation Visiting Fellow at the University of Texas at Austin in 1980/81, and a Visiting Professor at Penn in 1986/87. He is a Fellow of the British Academy, the Royal Society of Edinburgh, and the American Association for Artificial Intelligence.
His research interests cover issues in computational linguistics, artificial intelligence, computer science, and cognitive science, including the syntax and semantics of natural language, parsing and comprehension of natural language discourse by humans and by machine, wide-coverage statistical parsing, and spoken natural language generation using Combinatory Categorial Grammar (CCG). Much of his current NLP research addresses issues in spoken discourse and dialogue, especially the meaning of intonation and prosody, and in wide-coverage parsing with CCG. Some of his research concerns the analysis of music by humans and machines.