Note: Students starting LT projects are advised to enrol in 433-327 Knowledge Technologies, 433-687 Knowledge Technologies, or 433-611 Web Search and Text Analysis.
Recent technological advances have afforded people any-time, any-place
access to information, while enabling providers to track people's
activities in physical environments. This project will involve
research to integrate the fields of language technology and user
modelling, and to develop a system for synchronised navigation of
hypertext and information-rich physical environments. The work will
be done in collaboration with
the Kubadji Project, involving
a team of researchers from CSSE, the Department of Information
Systems, Monash University, and the Melbourne Museum.
Contact: Tim Baldwin
Named entity recognition (NER) is the task of identifying mentions of people, organisations, places, dates, etc. in natural language data. For example, we might analyse the sentence:
Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.
as:
[PERSON Wolff], currently a journalist in [LOCATION Argentina], played with [PERSON Del Bosque] in the final years of the seventies in [ORGANISATION Real Madrid].
Conventionally, NER is treated as a sequential tagging task, with no regard for the logical structure of the text or discourse cohesion. However, an NE often recurs throughout a text, with subsequent references made by way of an abbreviated form, either established explicitly at the time of the first mention or based on social conventions (e.g. referring to a person by their surname). This research aims to use basic discourse models to “track” or co-index an NE throughout a text, and use the accumulated textual evidence to boost NER performance.
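As a rough illustration of the co-indexing idea, the sketch below propagates the label of an earlier full mention to a later surname-only mention. It is a minimal sketch under stated assumptions, not the project's method: the function name `coindex_mentions`, the input format (a list of surface strings with an optional label from some upstream tagger) and the "last token is the surname" convention are all hypothetical.

```python
def coindex_mentions(mentions):
    """Co-index mentions given in document order.

    mentions: list of (surface, label_or_None) pairs (assumed input format).
    Returns a list of (surface, label, entity_id) triples.
    """
    alias_to_entity = {}  # lowercased alias -> (entity_id, label)
    resolved = []
    next_id = 0
    for surface, label in mentions:
        key = surface.lower()
        if key in alias_to_entity:
            # Seen before (in full or as a registered alias): reuse its identity.
            entity_id, known_label = alias_to_entity[key]
            resolved.append((surface, label or known_label, entity_id))
            continue
        entity_id = next_id
        next_id += 1
        if label is not None:
            alias_to_entity[key] = (entity_id, label)
            if label == "PERSON":
                # Assumed convention: later references may use the last token only.
                surname = surface.split()[-1].lower()
                alias_to_entity.setdefault(surname, (entity_id, label))
        resolved.append((surface, label, entity_id))
    return resolved


# Invented mentions; None marks a mention the upstream tagger left unlabelled.
mentions = [
    ("Jane Smith", "PERSON"),
    ("Acme Corp", "ORGANISATION"),
    ("Smith", None),
    ("Jones", None),
]
resolved = coindex_mentions(mentions)
```

Here the unlabelled "Smith" inherits the PERSON label and entity id of "Jane Smith", while "Jones" stays unlabelled because nothing earlier in the text supports it.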
Contact: Tim Baldwin
For many of the world's languages, voice pitch is used to signal lexical and grammatical contrasts (cf. the way we use stress in English to distinguish CONtest from conTEST). This project will use methods in digital signal processing and machine learning to explore the prosodic structure of a language from Papua New Guinea, in collaboration with researchers in the UK and USA. (Here's an example of how voice pitch is used to distinguish words in a PNG language: waímé “tree kangaroo”, waíme “house rat”, waimé “needle”.)
Contact: Steven Bird,
During 2010, the BOLD:PNG
project is creating an audio archive for a hundred languages
in Papua New Guinea. This project will investigate the workflows
for collecting oral literature, and develop a prototype tool to be
deployed on low-power netbooks for use in remote field locations,
to manage all aspects of the data collection process.
It will combine methods from data modelling, interface design,
and digital signal processing.
Contact: Steven Bird,
It is easy to create multimedia recordings and share them online using
services like YouTube. However, these sites can only be navigated by
topic. How could a collection be organised so that people could study
the content systematically? This project will involve research into
architectures for collaborative online annotation, and the development
of a system for manipulating time-aligned, interlinear audio
transcriptions. The architecture could be extended to allow for
plugins ranging from simple audio concordancers to sophisticated
knowledge discovery tools, together with illustrative applications to
the analysis of conversation, rhetoric, or dialect. Application to
languages other than English, including endangered languages, will be
possible.
Contact: Steven Bird,
Contact: Steven Bird, Greg Restall
The biggest single source of error in part-of-speech (POS) tagging is unknown words. Traditionally, unknown word handling has been carried out on the fly, based on affix-based similarity to known words, or analysis of the lexical composition of the unknown word. In this research, we will explore avenues for improving unknown word handling by way of more sophisticated models of word similarity and of word and document context. We will also investigate methods for effectively trawling existing lexical resources to form a gold-standard POS lexicon for use in existing POS taggers.
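The traditional affix-based baseline mentioned above can be sketched as a suffix back-off guesser: an unknown word receives the most frequent tag among known words sharing its longest matching suffix. This is a minimal sketch only; the function names, the toy lexicon, the tag labels and the default tag are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_suffix_model(lexicon, max_suffix_len=3):
    """Count tags per word-final suffix. lexicon: iterable of (word, tag)."""
    suffix_tags = defaultdict(Counter)
    for word, tag in lexicon:
        w = word.lower()
        for n in range(1, min(max_suffix_len, len(w)) + 1):
            suffix_tags[w[-n:]][tag] += 1
    return suffix_tags


def guess_tag(word, suffix_tags, max_suffix_len=3, default="NN"):
    """Back off from the longest suffix to the shortest; else use the default."""
    w = word.lower()
    for n in range(min(max_suffix_len, len(w)), 0, -1):
        counts = suffix_tags.get(w[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return default


# Toy lexicon, invented for illustration only.
lexicon = [
    ("running", "VBG"), ("jumping", "VBG"),
    ("quickly", "RB"), ("slowly", "RB"),
    ("nation", "NN"),
]
model = build_suffix_model(lexicon)
guessed = guess_tag("blogging", model)
```

A guesser of this kind sees only the word form, which is the limitation the project description targets by bringing in word and document context.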
Contact: Tim Baldwin
Documents are typically logically structured into text segments adhering to a particular genre of prose. E.g., a newspaper article will typically have a headline (written in “headlinese”, i.e. in present tense and without articles: “Government Announces Budget Deficit”), followed by the main prose of the article. In this project, we seek to develop a classification of text segment genre (e.g. headline, classified, personal, spoken, standard written, ...) along broadly syntactic and lexical lines, and determine the logical structure of a given document according to this classification. One area of interest is the interaction between document segmentation/genre classification and text classification.
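To make the "broadly syntactic and lexical" cues concrete, the sketch below computes a few surface features of a text segment and applies a crude headline test. It is an illustrative heuristic under stated assumptions, not a proposed classifier: the feature names, the thresholds and the function names are hypothetical, and a real system would presumably learn such decisions from annotated data.

```python
ARTICLES = {"a", "an", "the"}


def segment_features(text):
    """Return simple surface features for one text segment."""
    tokens = text.split()
    words = [t.strip(".,;:!?\"'").lower() for t in tokens]
    words = [w for w in words if w]
    n_words = len(words) or 1
    n_tokens = len(tokens) or 1
    return {
        # Headlinese tends to drop articles.
        "article_rate": sum(w in ARTICLES for w in words) / n_words,
        # Headlines are often written in title case.
        "capitalised_rate": sum(t[0].isupper() for t in tokens) / n_tokens,
        "ends_with_period": text.rstrip().endswith("."),
    }


def looks_like_headline(text):
    """Crude rule with assumed thresholds, for illustration only."""
    f = segment_features(text)
    return (
        f["article_rate"] == 0
        and f["capitalised_rate"] > 0.8
        and not f["ends_with_period"]
    )


is_headline = looks_like_headline("Government Announces Budget Deficit")
```

Features like these could feed a standard classifier over the segment genres listed above, one segment at a time.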
Contact: Tim Baldwin
Word sense disambiguation (WSD) is the task of determining the sense of a given word in context, based on a fixed sense inventory. Traditionally, WSD research has focused on using only the immediate lexical context of target words to perform the disambiguation, with moderate success. Additionally, while sense inventories are often organised hierarchically based on synonym, hyponym and hypernym relations, WSD methods have traditionally ignored this ontological structure and adopted a “bag of senses” style approach. In this research, we will make use of three novel sources of information in carrying out supervised WSD: syntactico-semantic (rich syntactic annotation from a precision grammar, and logical form-style dependencies), ontological (the hierarchical data provided in the dictionary definitions) and domain data.
Contact: Tim Baldwin
Translation memories are a powerful aid to human translators: they maintain an incremental database of past translations and suggest partial or complete translations for novel inputs based on this translation history. They rely crucially on the assumption that strings which are similar in one language will have correspondingly similar translations in a second language. This project will implement a range of translation similarity methods and analyse their efficacy in translation tasks involving a range of language pairs, as well as in broader-ranging applications that rely on some notion of similarity.
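The retrieval step of a translation memory can be sketched as a fuzzy lookup over stored source strings. The example below uses the character-based `SequenceMatcher.ratio()` from Python's standard `difflib` module as one possible similarity method; the function name `best_match`, the 0.6 threshold and the toy English-French memory are assumptions made for illustration, not part of the project.

```python
from difflib import SequenceMatcher


def best_match(query, memory, threshold=0.6):
    """Return (score, source, target) for the most similar stored source,
    or None if no entry reaches the threshold.

    memory: list of (source, target) translation pairs.
    """
    best = None
    for source, target in memory:
        score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None


# Toy translation memory, invented for illustration only.
memory = [
    ("The printer is out of paper.", "L'imprimante n'a plus de papier."),
    ("Press the start button.", "Appuyez sur le bouton."),
]
suggestion = best_match("The printer is out of ink.", memory)
```

Swapping in a different scoring function (for example word-level edit distance or n-gram overlap) changes only the line that computes `score`, which makes it straightforward to compare similarity methods against each other.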
Contact: Tim Baldwin and Steven Bird.