Language Technology Projects for Students

Note: Students starting LT projects are advised to take Human Language Technology as a directed study subject in first semester.

Museum Hypernavigation

Recent technological advances have afforded people any-time, any-place access to information, while enabling providers to track people's activities in physical environments. This project will involve research to integrate the fields of language technology and user modelling, and to develop a system for synchronised navigation of hypertext and information-rich physical environments. The work will be done in collaboration with the Kubadji Project, involving a team of researchers from CSSE, the Department of Information Systems, Monash University, and the Melbourne Museum.

Contact: Steven Bird, Tim Baldwin

Analyzing Talk

It is easy to create multimedia recordings and share them online using services like YouTube. However, these sites can only be navigated by topic. How could a collection be organised so that people could study the content systematically? This project will involve research into architectures for collaborative online annotation, and the development of a system for manipulating time-aligned, interlinear audio transcriptions. The architecture could be extended to allow for plugins ranging from simple audio concordancers to sophisticated knowledge discovery tools, together with illustrative applications to the analysis of conversation, rhetoric, or dialect. Application to languages other than English, including endangered languages, will be possible.
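To make the idea of a simple concordancer over time-aligned transcriptions concrete, here is a minimal sketch in Python. The Segment record, its field names, and the sample transcript are all invented for illustration; they are assumptions, not a format the project prescribes.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One time-aligned stretch of transcribed speech (hypothetical format)."""
    start: float   # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def concordance(segments, keyword):
    """Return the segments whose text contains keyword as a whole word."""
    keyword = keyword.lower()
    hits = []
    for seg in segments:
        # Crude tokenisation: split on whitespace, strip common punctuation.
        words = [w.strip(".,?!").lower() for w in seg.text.split()]
        if keyword in words:
            hits.append(seg)
    return hits


# Invented sample data, for illustration only.
TRANSCRIPT = [
    Segment(0.0, 2.4, "A", "So where did you grow up?"),
    Segment(2.4, 5.1, "B", "I grew up near the river."),
    Segment(5.1, 7.0, "A", "Was the river close to town?"),
]

for hit in concordance(TRANSCRIPT, "river"):
    print(f"{hit.start:6.1f}-{hit.end:6.1f}  {hit.speaker}: {hit.text}")
```

Because each hit carries its start and end times, an interface built on this could jump straight to the matching point in the recording.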

Contact: Steven Bird

Googling Trees

Search engines let us perform keyword queries over unstructured text. Database engines let us perform structured queries over relations. Powerful technologies are now in widespread use for both unstructured and structured data. This project will examine semistructured data: hierarchically organised data with optional, repeatable, and partially-ordered elements. Large text collections will be automatically parsed to produce a database of trees, and an Ajax-based graphical interface will be developed to permit these trees to be queried. The system will be used by linguists to study the way English grammar is evolving on the web. [This project will extend the work of the NSF QLDB project.]

Contact: Steven Bird, James Bailey

Analyzing argument maps

Argument maps are box-and-line diagrams that visually lay out the reasoning and evidence for and against a statement or claim. A good map clarifies and organizes thinking by showing the logical relationships between thoughts that are expressed simply and precisely. This project may take two forms. One would be to analyze argument maps in order to detect unstated assumptions and missing evidence. Another would be to explore methods for automatically generating naturalistic argumentative prose from argument maps. The project will involve collaboration with Austhink, a local company that is developing argument visualisation software.

Contact: Steven Bird

Technologies for Indian Languages

The languages of India pose interesting challenges for the development of language technologies, a fact that has been recognized by the Indian Government through its TDIL initiative. In this project you will extend NLTK to support language processing tasks for an Indian language. Languages of particular interest are Bangla, Hindi, Marathi and Telugu.

Contact: Steven Bird

Discourse-structured, co-indexed named entity recognition

Named entity recognition (NER) is the task of identifying mentions of people, organisations, places, dates, etc. in natural language data. For example, we might analyse the sentence:

Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

as:

[PERSON Wolff], currently a journalist in [LOCATION Argentina], played with [PERSON Del Bosque] in the final years of the seventies in [ORGANISATION Real Madrid].

Conventionally, NER is treated as a sequential tagging task, with no regard for the logical structure of the text or discourse cohesion. However, it is often the case that an NE recurs with regularity throughout a text, with subsequent references being made by way of an abbreviated form established explicitly at the time of the first mention, or based on social conventions (e.g. referring to a person by their surname). This research aims to use basic discourse models to “track” or co-index an NE throughout a text, and use the accumulated textual evidence to boost NER performance.
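The surname convention mentioned above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the discourse model the project will develop: it assumes a first-pass tagger has already produced (start, end, label) spans, and it assumes that the last token of a multi-word PERSON mention is a surname. The function name and the example sentence are invented.

```python
def coindex(tokens, mentions):
    """Link later single-token surname mentions to an earlier full PERSON mention.

    tokens   -- list of word strings
    mentions -- list of (start, end, label) spans from a first-pass tagger,
                with end exclusive
    Returns a sorted list of spans, including the newly linked short mentions.
    """
    # Token positions already covered by a first-pass mention.
    covered = set()
    for start, end, _ in mentions:
        covered.update(range(start, end))

    # Assumed convention: the final token of a multi-word PERSON is a surname.
    short_forms = {}
    for start, end, label in mentions:
        if label == "PERSON" and end - start > 1:
            short_forms.setdefault(tokens[end - 1], label)

    linked = list(mentions)
    for i, tok in enumerate(tokens):
        if i not in covered and tok in short_forms:
            linked.append((i, i + 1, short_forms[tok]))
    return sorted(linked)


# Invented example: the tagger found the full name but missed the bare surname.
tokens = "Anna Smith joined the club . Smith later became a journalist .".split()
first_pass = [(0, 2, "PERSON")]

print(coindex(tokens, first_pass))  # [(0, 2, 'PERSON'), (6, 7, 'PERSON')]
```

A real system would also need to handle multi-token surnames such as "Del Bosque", explicitly introduced abbreviations, and ambiguous short forms, which is where a discourse model becomes necessary.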

Contact: Tim Baldwin

Part-of-speech tagging and unknown words

The biggest single source of error in part-of-speech (POS) tagging is unknown words. Traditionally, unknown word handling has been carried out on the fly, based on affix-based similarity to known words, or analysis of the lexical composition of the unknown word. In this research, we will explore avenues for improving unknown word handling by way of more sophisticated models of word similarity and of word and document context. We will also investigate methods for effectively trawling existing lexical resources to form a gold-standard POS lexicon for use in existing POS taggers.
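For readers unfamiliar with the traditional approach, the following Python sketch shows one simple form of affix-based guessing: it counts which tags co-occur with each word-final suffix in a known lexicon, then tags an unknown word by backing off from its longest matching suffix to its shortest. The toy lexicon, the three-character suffix limit, and the NN default are assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def build_suffix_table(lexicon, max_len=3):
    """Count tags for each word-final suffix of up to max_len characters."""
    table = defaultdict(Counter)
    for word, tag in lexicon:
        for n in range(1, min(max_len, len(word)) + 1):
            table[word[-n:]][tag] += 1
    return table


def guess_tag(word, table, max_len=3, default="NN"):
    """Back off from the longest suffix to the shortest; else use default."""
    for n in range(min(max_len, len(word)), 0, -1):
        counts = table.get(word[-n:])
        if counts:
            return counts.most_common(1)[0][0]
    return default


# Toy lexicon with Penn Treebank-style tags, for illustration only.
LEXICON = [
    ("walking", "VBG"), ("running", "VBG"),
    ("quickly", "RB"), ("slowly", "RB"),
    ("tables", "NNS"), ("chairs", "NNS"),
]

table = build_suffix_table(LEXICON)
print(guess_tag("blogging", table))  # VBG, via the suffix "ing"
print(guess_tag("happily", table))   # RB, via the suffix "ly"
```

The weakness of this baseline is visible even here: it sees only the word's own spelling, and ignores the sentence and document in which the unknown word occurs.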

Contact: Tim Baldwin

Document segmentation/genre classification

Documents are typically logically structured into text segments adhering to a particular genre of prose. E.g., a newspaper article will typically have a headline (written in “headlinese”, i.e. in present tense and without articles: “Government Announces Budget Deficit”), followed by the main prose of the article. In this project, we seek to develop a classification of text segment genre (e.g. headline, classified, personal, spoken, standard written, ...) along broadly syntactic and lexical lines, and determine the logical structure of a given document according to this classification. One area of interest is the interaction between document segmentation/genre classification and text classification.

Contact: Tim Baldwin

Feature-rich word sense disambiguation

Word sense disambiguation (WSD) is the task of determining the sense of a given word in context, based on a fixed sense inventory. Traditionally, WSD research has focused on using only the immediate lexical context of target words to perform the disambiguation, with moderate success. Additionally, while sense inventories are often organised hierarchically based on synonym, hyponym and hypernym relations, WSD methods have traditionally ignored this ontological structure and adopted a “bag of senses” style approach. In this research, we will make use of three novel sources of information in carrying out supervised WSD: syntactico-semantic (rich syntactic annotation from a precision grammar, and logical form-style dependencies), ontological (the hierarchical data provided in the dictionary definitions) and domain data.

Contact: Tim Baldwin

Translation similarity toolkit

Translation memories are a powerful aid to human translators: they maintain an incremental database of past translations and suggest partial or complete translations for novel inputs based on this translation history. They rely crucially on the assumption that strings which are similar in one language will have correspondingly similar translations in a second language. This project will implement a range of translation similarity methods and analyse their efficacy in translation tasks involving a range of language pairs, and also broader-ranging applications relying on some notion of similarity.
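A minimal sketch of translation memory lookup is given below, using Python's standard difflib.SequenceMatcher over word sequences as the similarity measure. This is just one of many possible similarity methods, chosen here because it needs no external libraries; the function names, the 0.6 threshold, and the two-entry memory are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Word-level similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def best_match(source, memory, threshold=0.6):
    """Return (score, stored_source, stored_target) for the closest entry.

    Returns None if the memory is empty or nothing reaches the threshold.
    """
    if not memory:
        return None
    scored = [(similarity(source, src), src, tgt) for src, tgt in memory]
    score, src, tgt = max(scored, key=lambda item: item[0])
    return (score, src, tgt) if score >= threshold else None


# Invented English-French memory, for illustration only.
MEMORY = [
    ("the printer is out of paper", "l'imprimante n'a plus de papier"),
    ("please restart the computer", "veuillez redémarrer l'ordinateur"),
]

hit = best_match("the printer is out of ink", MEMORY)
if hit is not None:
    score, src, tgt = hit
    print(f"{score:.2f}  {src!r} -> {tgt!r}")
```

The retrieved translation is only a suggestion for the translator to edit; how well source-side similarity predicts the usefulness of that suggestion is exactly the question the project sets out to study.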

Contact: Tim Baldwin, Steven Bird