An Intelligent Search Infrastructure
for Language Resources on the Web
Special Research Initiatives - E-Research SR0567353
Amount(s): 2005: AU$49,018; 2006: AU$49,018
Project Activities: 2005
In 2005, the CIs engaged in a range of preliminary activities:
- CI Hughes participated in the ARC/ARIIC E-Research Workshop in November 2005
as a representative of this project, presenting a poster describing the proposed research.
- CI Hughes has continued the development of the language crawler software,
and continued to run large-scale web crawling experiments in unsupervised language
data acquisition from the web. At the time of this report, more than 1.5 million
URLs with "interesting" language content have been identified for the more than
7000 languages represented in the Ethnologue classification of languages.
- CI Baldwin and CI Hughes have established a research collaboration with the
Language Observatory Project (http://www.language-observatory.org), in particular
with Professor Yoshiki Mikami from Nagaoka University of Technology (Japan) and
Dr Virach Sornlertlamvanich from the Thai Computational Linguistics Laboratory
(TCL Lab, Thailand). A visit to Japan to scope technical collaboration is
anticipated in March 2006. Our ARC-funded research is largely complementary
to the Language Observatory Project and to other activities at TCL Lab.
- CI Hughes has led technical engagement with the Advanced Research Computing
unit at The University of Melbourne, resulting in the availability of suitable
computational and data storage infrastructure for the project.
- CI Hughes has supervised a summer intern (Peter Lee) working on the
customisation of digital repository software (E-Prints) to suit the needs of
language archives. This software will be used in a later stage of this project,
acting as the digital repository for the Language Archive to be built in Q3.
- CI Hughes has established a research collaboration with Dr James Hogan at
Queensland University of Technology and a PhD student (Asgeir Frimannsson)
working in the area of software localization. In particular, the language
crawler component of this project identifies, by language, various types of
translation and localization resources that are relevant to software engineering
processes; Hogan and Frimannsson are direct consumers of specific portions
of the language crawl data.
Last Updated:
Mon Jan 23 13:17:34 EST 2006