Note: Students starting LT projects are advised to take Human Language Technology as a directed study subject in first semester.
Recent technological advances have afforded people any-time, any-place
access to information, while enabling providers to track people's
activities in physical environments. This project will involve
research to integrate the fields of language technology and user
modelling, and to develop a system for synchronised navigation of
hypertext and information rich physical environments. The work will
be done in collaboration with
the Kubadji Project, involving
a team of researchers from CSSE, the Department of Information
Systems, Monash University, and the Melbourne Museum.
Contact: Steven Bird, Tim Baldwin
It is easy to create multimedia recordings and share them online using
services like YouTube. However, these sites can only be navigated by
topic. How could a collection be organised so that people could study
the content systematically? This project will involve research into
architectures for collaborative online annotation, and the development
of a system for manipulating time-aligned, interlinear audio
transcriptions. The architecture could be extended to allow for
plugins ranging from simple audio concordancers to sophisticated
knowledge discovery tools, together with illustrative applications to
the analysis of conversation, rhetoric, or dialect. Application to
languages other than English, including endangered languages, will be
possible.
Contact: Steven Bird
Search engines let us perform keyword queries over unstructured text.
Database engines let us perform structured queries over relations.
Powerful technologies are now in widespread use for both
unstructured and structured data. This project will examine
semistructured data: hierarchically organised data with
optional, repeatable, and partially-ordered elements. Large text
collections will be automatically parsed to produce a database of
trees, and an Ajax-based graphical interface will be developed to
permit these trees to be queried. The system will be used by
linguists to study the way English grammar is evolving on the web.
[This project will extend the work of the
NSF QLDB project.]
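As a rough illustration of what a structural query over a database of parse trees might look like, here is a minimal Python sketch. The tuple-based tree encoding, the helper names (`subtrees`, `find`, `leaves`) and the example sentence are all assumptions made for illustration; they are not the query language or storage format of the project itself.

```python
from typing import Iterator, List, Tuple, Union

# Assumed encoding: a tree is (label, [children]); a leaf is a plain word string.
Tree = Union[str, Tuple[str, list]]

def subtrees(tree: Tree) -> Iterator[Tree]:
    """Yield every non-leaf subtree, top-down and left to right."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1]:
        yield from subtrees(child)

def leaves(tree: Tree) -> List[str]:
    """Return the words covered by a tree, in order."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1]:
        words.extend(leaves(child))
    return words

def find(tree: Tree, parent: str, child: str) -> List[Tree]:
    """Return subtrees labelled `parent` with an immediate child labelled `child`."""
    hits = []
    for t in subtrees(tree):
        labels = [c[0] for c in t[1] if not isinstance(c, str)]
        if t[0] == parent and child in labels:
            hits.append(t)
    return hits

# Hypothetical parsed sentence.
sentence = ("S", [
    ("NP", [("DT", ["the"]), ("NN", ["linguist"])]),
    ("VP", [("VBD", ["queried"]),
            ("NP", [("DT", ["the"]), ("NN", ["treebank"])])]),
])

# Query: verb phrases that directly contain a noun phrase.
for match in find(sentence, "VP", "NP"):
    print(" ".join(leaves(match)))
```

A real system would need indexing to run such queries over large collections; this sketch only shows the shape of a parent-child constraint on tree nodes.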
Contact: Steven Bird, James Bailey
Argument maps are box-and-line diagrams that visually lay out the
reasoning and evidence for and against a statement or claim. A good
map clarifies and organizes thinking by showing the logical
relationships between thoughts that are expressed simply and
precisely. This project may take two forms. One would be to analyze
argument maps in order to detect unstated assumptions and missing
evidence. Another would be to explore methods for automatically
generating naturalistic argumentative prose from argument maps. The
project will involve collaboration
with Austhink, a local company
that is developing argument visualisation software.
Contact: Steven Bird
The languages of India pose interesting challenges for the development of language technologies, a fact that has been recognized by the Indian Government through its TDIL initiative. In this project you will extend NLTK to support language processing tasks for an Indian language. Languages of particular interest are Bangla, Hindi, Marathi and Telugu.
Contact: Steven Bird
Named entity recognition (NER) is the task of identifying mentions of people, organisations, places, dates, etc. in natural language data. For example, we might analyse the sentence:
Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.
as:
[PERSON Wolff], currently a journalist in [LOCATION Argentina], played with [PERSON Del Bosque] in the final years of the seventies in [ORGANISATION Real Madrid].
Conventionally, NER is treated as a sequential tagging task, with no regard for the logical structure of the text or discourse cohesion. However, it is often the case that an NE recurs regularly throughout a text, with subsequent references being made by way of an abbreviated form established explicitly at the time of the first mention, or based on social conventions (e.g. referring to a person by their surname). This research aims to use basic discourse models to “track” or co-index an NE throughout a text, and use the accumulated textual evidence to boost NER performance.
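One very simple way to picture the co-indexing idea is sketched below in Python. It assumes a hypothetical upstream tagger that has already produced a list of mentions with tentative labels (or `None` where it abstained), links a mention to an earlier chain when its tokens are a subset of that chain's tokens, and relabels every member of a chain with the chain's majority label. The linking rule, the function name `propagate_labels` and the example mentions are illustrative assumptions, not the discourse model the project proposes.

```python
from collections import Counter

def propagate_labels(mentions):
    """mentions: list of (text, label_or_None) in document order.

    Returns a new list in which every mention in a co-indexed chain
    carries the majority label observed anywhere in that chain.
    """
    chains = []  # each chain: {"tokens": set, "votes": Counter, "members": [index]}
    for i, (text, label) in enumerate(mentions):
        tokens = set(text.lower().split())
        # Assumed linking rule: an abbreviated form reuses tokens of an earlier mention.
        chain = next((c for c in chains if tokens <= c["tokens"]), None)
        if chain is None:
            chain = {"tokens": tokens, "votes": Counter(), "members": []}
            chains.append(chain)
        chain["members"].append(i)
        if label is not None:
            chain["votes"][label] += 1

    relabelled = list(mentions)
    for chain in chains:
        if chain["votes"]:
            best = chain["votes"].most_common(1)[0][0]
            for i in chain["members"]:
                relabelled[i] = (mentions[i][0], best)
    return relabelled

# Hypothetical tagger output: one mention unlabelled, one mislabelled.
mentions = [
    ("Vicente Del Bosque", "PERSON"),
    ("Del Bosque", None),
    ("Real Madrid", "ORGANISATION"),
    ("Bosque", "LOCATION"),
    ("Del Bosque", "PERSON"),
]
print(propagate_labels(mentions))
```

Here the unlabelled and mislabelled short forms inherit PERSON from the chain started by the full name. Subset matching is deliberately naive: it would wrongly merge unrelated entities that share a token, which is one reason a proper discourse model is needed.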
Contact: Tim Baldwin
The biggest single source of error in part-of-speech (POS) tagging is unknown words. Traditionally, unknown word handling has been carried out on the fly, based on affix-based similarity to known words, or analysis of the lexical composition of the unknown word. In this research, we will explore avenues for improving unknown word handling by way of more sophisticated models of word similarity and of word and document context. We will also investigate methods for effectively trawling existing lexical resources to form a gold-standard POS lexicon for use in existing POS taggers.
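For reference, the traditional affix-based approach mentioned above can be sketched in a few lines of Python. This is a baseline illustration only: the toy lexicon, the maximum suffix length of 3, the fallback tag `NN` and the function names are assumptions, and the project aims to go beyond this kind of guesser.

```python
from collections import Counter, defaultdict

def train_suffix_guesser(lexicon, max_len=3):
    """Count which tags co-occur with each word-final suffix of up to max_len characters."""
    table = defaultdict(Counter)
    for word, tag in lexicon:
        w = word.lower()
        for n in range(1, min(max_len, len(w)) + 1):
            table[w[-n:]][tag] += 1
    return table

def guess_tag(word, table, max_len=3, default="NN"):
    """Back off from the longest known suffix to the shortest, then to a default tag."""
    w = word.lower()
    for n in range(min(max_len, len(w)), 0, -1):
        votes = table.get(w[-n:])
        if votes:
            return votes.most_common(1)[0][0]
    return default

# Toy lexicon of known (word, tag) pairs; a real one would come from a tagged corpus.
lexicon = [
    ("running", "VBG"), ("jumping", "VBG"),
    ("quickly", "RB"), ("slowly", "RB"),
    ("tables", "NNS"), ("chairs", "NNS"),
    ("walked", "VBD"),
]
table = train_suffix_guesser(lexicon)

for unknown in ["blogging", "googled", "tweets"]:
    print(unknown, guess_tag(unknown, table))
```

The guesser looks only at the word form, so it cannot use the sentence or document in which the unknown word appears, which is exactly the information the proposed research would bring in.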
Contact: Tim Baldwin
Documents are typically logically structured into text segments adhering to a particular genre of prose. E.g., a newspaper article will typically have a headline (written in “headlinese”, i.e. in present tense and without articles: “Government Announces Budget Deficit”), followed by the main prose of the article. In this project, we seek to develop a classification of text segment genre (e.g. headline, classified, personal, spoken, standard written, ...) along broadly syntactic and lexical lines, and determine the logical structure of a given document according to this classification. One area of interest is the interaction between document segmentation/genre classification and text classification.
Contact: Tim Baldwin
Word sense disambiguation (WSD) is the task of determining the sense of a given word in context, based on a fixed sense inventory. Traditionally, WSD research has focused on using only the immediate lexical context of target words to perform the disambiguation, with moderate success. Additionally, while sense inventories are often organised hierarchically based on synonym, hyponym and hypernym relations, WSD methods have traditionally ignored this ontological structure and adopted a “bag of senses” style approach. In this research, we will make use of three novel sources of information in carrying out supervised WSD: syntactico-semantic (rich syntactic annotation from a precision grammar, and logical form-style dependencies), ontological (the hierarchical data provided in the dictionary definitions) and domain data.
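To give a feel for how hierarchical information can feed into disambiguation, the Python sketch below uses a simplified Lesk-style gloss overlap in which each sense's gloss is extended with the glosses of its hypernym ancestors. This is a swapped-in, unsupervised technique chosen for brevity, not the supervised method the project proposes, and the sense identifiers, glosses, hypernym links and stopword list are all invented for the example.

```python
# Hypothetical sense inventory: sense id -> (gloss, hypernym id or None).
SENSES = {
    "bank_river": ("sloping land beside a river or lake", "landform"),
    "bank_money": ("institution that accepts deposits and lends money", "organisation"),
    "landform": ("natural feature of land shaped by water and erosion", None),
    "organisation": ("group founded for a financial or social purpose", None),
}

STOPWORDS = {"a", "an", "the", "or", "and", "of", "for", "to", "that",
             "by", "beside", "from", "had", "where"}

def signature(sense_id):
    """Collect content words from a sense's gloss and from all its hypernym ancestors."""
    words = set()
    current = sense_id
    while current is not None:
        gloss, parent = SENSES[current]
        words |= set(gloss.split()) - STOPWORDS
        current = parent
    return words

def disambiguate(context, candidates):
    """Pick the candidate sense whose signature shares most words with the context.

    Ties go to the earliest candidate in the list.
    """
    ctx = set(context.lower().split()) - STOPWORDS
    return max(candidates, key=lambda s: len(ctx & signature(s)))

candidates = ["bank_river", "bank_money"]
print(disambiguate("she opened an account to deposit money for a financial emergency", candidates))
print(disambiguate("they fished from the river where erosion had cut the land", candidates))
```

In both examples part of the overlap comes from a hypernym's gloss rather than the sense's own gloss, which is the point of the illustration: a flat "bag of senses" would discard that evidence.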
Contact: Tim Baldwin
Translation memories are a powerful aid to human translators: they maintain an incremental database of past translations and suggest partial or complete translations for novel inputs based on this translation history. They rely crucially on the assumption that strings which are similar in one language will have correspondingly similar translations in a second language. This project will implement a range of translation similarity methods and analyse their efficacy in translation tasks involving a range of language pairs, as well as in broader-ranging applications that rely on some notion of similarity.
Contact: Tim Baldwin, Steven Bird
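A minimal Python sketch of translation memory lookup follows, using the standard library's `difflib.SequenceMatcher` over word tokens as one possible similarity method among the many the project would compare. The 0.6 threshold, the function names and the two English-French memory entries are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Token-level similarity ratio between two source strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def best_match(query, memory, threshold=0.6):
    """Return (score, source, target) for the most similar stored source string,
    or None if the memory is empty or nothing reaches the threshold."""
    if not memory:
        return None
    scored = [(similarity(query, src), src, tgt) for src, tgt in memory]
    best = max(scored, key=lambda item: item[0])
    return best if best[0] >= threshold else None

# Toy translation memory of (source, target) pairs.
memory = [
    ("the printer is out of paper", "l'imprimante n'a plus de papier"),
    ("please restart the computer", "veuillez redémarrer l'ordinateur"),
]

hit = best_match("the printer is out of ink", memory)
if hit is not None:
    score, src, tgt = hit
    print(f"{score:.2f}  {src}  ->  {tgt}")
```

The retrieved translation is only a suggestion for the translator to edit: the query shares five of six tokens with the stored source, but the differing word still has to be translated by hand.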