Language Technology Projects for Students

Note: Students starting LT projects are advised to enrol in 433-327 Knowledge Technologies, 433-687 Knowledge Technologies, or 433-611 Web Search and Text Analysis.

Museum Hypernavigation

Recent technological advances have afforded people any-time, any-place access to information, while enabling providers to track people's activities in physical environments. This project will involve research to integrate the fields of language technology and user modelling, and to develop a system for synchronised navigation of hypertext and information-rich physical environments. The work will be done in collaboration with the Kubadji Project, involving a team of researchers from CSSE, the Department of Information Systems, Monash University, and the Melbourne Museum.

Contact: Tim Baldwin

Discourse-structured, co-indexed named entity recognition

Named entity recognition (NER) is the task of identifying mentions of people, organisations, places, dates, etc. in natural language data. E.g., we might analyse the sentence:

Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.
[PERSON Wolff], currently a journalist in [LOCATION Argentina], played with [PERSON Del Bosque] in the final years of the seventies in [ORGANISATION Real Madrid].

Conventionally, NER is treated as a sequential tagging task, with no regard for the logical structure of the text or discourse cohesion. However, it is often the case that an NE recurs with regularity throughout a text, with subsequent references made by way of an abbreviated form established explicitly at the time of the first mention, or based on social conventions (e.g. referring to a person by their surname). This research aims to use basic discourse models to “track” or co-index an NE throughout a text, and use the cumulative textual evidence to boost NER performance.
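As a rough illustration of the co-indexing idea (a sketch only, not part of the project specification), one simple baseline is a post-processing pass over the output of a conventional sequential tagger: record the entity type of each confidently tagged word, then relabel later untagged occurrences of those words, so a surname-only re-mention inherits the label of the full first mention. The `coindex` function and its flat tag scheme are hypothetical.

```python
def coindex(tokens, tags):
    """tokens: list of words; tags: parallel list of entity labels
    ('O' for non-entities, else e.g. 'PERSON', 'LOCATION').
    Returns a copy of tags in which later re-mentions of any
    previously tagged entity word are relabelled."""
    seen = {}  # word -> entity type, collected from tagged mentions
    for word, tag in zip(tokens, tags):
        if tag != 'O':
            seen[word] = tag
    out = list(tags)
    for i, (word, tag) in enumerate(zip(tokens, tags)):
        if tag == 'O' and word in seen:
            out[i] = seen[word]  # inherit label from earlier mention
    return out
```

For example, if “Del Bosque” is tagged PERSON at first mention but a later bare “Bosque” is missed by the tagger, the pass above recovers it. A real system would need to handle ambiguous abbreviated forms and discourse segment boundaries rather than matching words globally.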

Contact: Tim Baldwin

Computational Modelling of Prosody in Natural Speech

For many of the world's languages, voice pitch is used to signal lexical and grammatical contrasts (cf. the way we use stress in English to distinguish CONtest from conTEST). This project will use methods in digital signal processing and machine learning to explore the prosodic structure of a language from Papua New Guinea, in collaboration with researchers in the UK and USA. (Here's an example of how voice pitch is used to distinguish words in a PNG language: waímé 'tree kangaroo', waíme 'house rat', waimé 'needle'.)
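A core subtask here is estimating the pitch (fundamental frequency) contour of recorded speech. A minimal sketch of one standard approach, autocorrelation-based F0 estimation, is shown below; the function name and parameter choices are illustrative, and a production system would add windowing, voicing detection, and smoothing across frames.

```python
import math

def estimate_f0(samples, rate, fmin=75.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) of a mono signal chunk
    by picking the autocorrelation peak within the lag range that
    corresponds to plausible speech pitch (fmin..fmax Hz)."""
    lo = int(rate / fmax)  # smallest lag (highest pitch) considered
    hi = int(rate / fmin)  # largest lag (lowest pitch) considered
    best_lag, best_r = lo, float('-inf')
    for lag in range(lo, hi + 1):
        # unnormalised autocorrelation at this lag
        r = sum(samples[i] * samples[i + lag]
                for i in range(len(samples) - lag))
        if r > best_r:
            best_r, best_lag = r, lag
    return rate / best_lag
```

Applied frame-by-frame, this yields the pitch track from which the rising/falling contours that distinguish words like waímé and waimé could be measured.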

Contact: Steven Bird

Language Engineering in the Field: Preserving 100 Endangered Languages in New Guinea

During 2010, the BOLD:PNG project is creating an audio archive for a hundred languages in Papua New Guinea. This project will investigate the workflows for collecting oral literature, and develop a prototype tool to be deployed on low-power netbooks for use in remote field locations, to manage all aspects of the data collection process. It will combine methods from data modelling, interface design, and digital signal processing.

Contact: Steven Bird

Analyzing Talk

It is easy to create multimedia recordings and share them online using services like YouTube. However, these sites can only be navigated by topic. How could a collection be organised so that people could study the content systematically? This project will involve research into architectures for collaborative online annotation, and the development of a system for manipulating time-aligned, interlinear audio transcriptions. The architecture could be extended to allow for plugins ranging from simple audio concordancers to sophisticated knowledge discovery tools, together with illustrative applications to the analysis of conversation, rhetoric, or dialect. Application to languages other than English, including endangered languages, will be possible.

Contact: Steven Bird

Projects with the Natural Language Toolkit

The Natural Language Toolkit (NLTK) is an open source library containing implementations of many natural language processing algorithms in Python. A wide range of projects are described in the Project section of the NLTK issue tracker, covering state-of-the-art statistical NLP algorithms and multilingual processing.

Contact: Steven Bird, Greg Restall

Part-of-speech tagging and unknown words

The biggest single source of error in part-of-speech (POS) tagging is unknown words. Traditionally, unknown word handling has been carried out on the fly, based on affix-based similarity to known words, or analysis of the lexical composition of the unknown word. In this research, we will explore avenues for improving unknown word handling by way of more sophisticated models of word similarity, and of word- and document-level context. We will also investigate methods for effectively trawling existing lexical resources to form a gold-standard POS lexicon for use in existing POS taggers.
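To make the traditional affix-based baseline concrete, here is a minimal sketch (hypothetical function names, not a fixed project component): tag distributions are counted for word-final character n-grams in the training lexicon, and an unknown word is tagged by backing off from its longest known suffix.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_suffix=3):
    """Count tag frequencies for each word-final character n-gram
    (up to max_suffix characters) in a (word, tag) training list."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, max_suffix + 1):
            if len(word) > n:
                counts[word[-n:]][tag] += 1
    return counts

def guess_tag(word, counts, max_suffix=3, default='NN'):
    """Guess the tag of an unknown word, backing off from the
    longest matching suffix to shorter ones."""
    for n in range(max_suffix, 0, -1):
        suffix = word[-n:]
        if suffix in counts:
            return counts[suffix].most_common(1)[0][0]
    return default
```

For example, a model trained on words like running/VBG and table/NN will tag the unseen word “swimming” as VBG via the suffix “ing”. The research described above would replace this crude lookup with richer similarity and context models.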

Contact: Tim Baldwin

Document segmentation/genre classification

Documents are typically logically structured into text segments adhering to a particular genre of prose. E.g., a newspaper article will typically have a headline (written in “headlinese”, i.e. in present tense and without articles: “Government Announces Budget Deficit”), followed by the main prose of the article. In this project, we seek to develop a classification of text segment genre (e.g. headline, classified, personal, spoken, standard written, ...) along broadly syntactic and lexical lines, and determine the logical structure of a given document according to this classification. One area of interest is the interaction between document segmentation/genre classification and text classification.
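As a sense of the kind of lexical evidence available, the sketch below computes a few toy features of the sort a segment-genre classifier might use; the feature set is purely illustrative (headlinese, for instance, tends to omit articles and capitalise content words), and the project would develop a far richer syntactic and lexical characterisation.

```python
ARTICLES = {'a', 'an', 'the'}

def headline_features(segment):
    """Toy lexical features for distinguishing 'headlinese' from
    standard written prose."""
    words = segment.lower().split()
    n = len(words)
    article_rate = (sum(w in ARTICLES for w in words) / n) if n else 0.0
    return {
        'length': n,                      # headlines are short
        'article_rate': article_rate,     # headlines drop articles
        'all_words_capitalised': all(     # title-case convention
            w[0].isupper() for w in segment.split() if w.isalpha()),
    }
```

On “Government Announces Budget Deficit” these features fire as expected (no articles, all words capitalised), whereas ordinary article prose shows a substantial article rate.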

Contact: Tim Baldwin

Feature-rich word sense disambiguation

Word sense disambiguation (WSD) is the task of determining the sense of a given word in context, based on a fixed sense inventory. Traditionally, WSD research has focused on using only the immediate lexical context of target words to perform the disambiguation, with moderate success. Additionally, while sense inventories are often organised hierarchically based on synonym, hyponym and hypernym relations, WSD methods have traditionally ignored this ontological structure and adopted a “bag of senses” style approach. In this research, we will make use of three novel sources of information in carrying out supervised WSD: syntactico-semantic (rich syntactic annotation from a precision grammar, and logical form-style dependencies), ontological (the hierarchical data provided in the dictionary definitions) and domain data.
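For contrast with the feature-rich approach proposed above, the classic lexical-context-only baseline can be stated in a few lines. This is a simplified Lesk-style disambiguator (the function and sense inventory below are hypothetical): it picks the sense whose dictionary gloss shares the most words with the target's context, using no syntactic, ontological, or domain information at all.

```python
def lesk(context_words, senses):
    """senses: mapping sense_id -> definition gloss (a string).
    Return the sense whose gloss has the largest word overlap
    with the context of the target word."""
    context = {w.lower() for w in context_words}

    def overlap(gloss):
        return len(context & set(gloss.lower().split()))

    return max(senses, key=lambda s: overlap(senses[s]))
```

The proposed research would augment exactly this kind of model with precision-grammar syntax, the hierarchical structure of the sense inventory, and domain evidence.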

Contact: Tim Baldwin

Translation similarity toolkit

Translation memories are a powerful aid to human translators: they maintain an incremental database of past translations, and suggest partial or complete translations for novel inputs based on this translation history. They rely crucially on the assumption that strings which are similar in one language will have correspondingly similar translations in a second language. This project will implement a range of translation similarity methods and analyse their efficacy in translation tasks involving a range of language pairs, as well as in broader applications relying on some notion of string similarity.
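One of the simplest similarity methods the toolkit would include is character-level edit (Levenshtein) distance; the sketch below pairs it with a naive translation-memory lookup. The `best_match` interface is illustrative only, and a practical system would use indexing rather than a linear scan.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed by
    dynamic programming over one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def best_match(query, memory):
    """memory: dict mapping past source strings to their translations.
    Return the translation of the closest stored source string,
    plus a length-normalised similarity score in [0, 1]."""
    src = min(memory, key=lambda s: edit_distance(query, s))
    sim = 1 - edit_distance(query, src) / max(len(query), len(src))
    return memory[src], sim
```

A fuzzy match with a high score would be offered to the translator as a candidate; the project would compare this against word-level, syntax-aware, and other similarity measures across language pairs.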

Contact: Tim Baldwin and Steven Bird.