Language Technology Seminar Series


Title: Hybrid back-transliteration and its application to word pair extraction from comparable corpora

Speaker: Slaven Bilac (Google, ex Tokyo Institute of Technology)

Location: ICT Building, L2.06

Date: Thursday, 12 April 2005

Time: 1pm

Abstract:

Transliterating words and names from one language to another is a frequent and highly productive phenomenon. For example, the English word "cache" is transliterated into Japanese as kyasshu. Transliteration is lossy, since important distinctions in the source word are not always preserved in the process. Hence, automatically converting transliterated words back into their original form (i.e. back-transliteration) is a real challenge. Nonetheless, because it is widely applicable in machine translation (MT) and cross-language information retrieval (CLIR), it is an interesting problem from a practical point of view.

In this presentation, I will describe a method to automatically produce back-transliterations based on a hybrid model combining grapheme-based (i.e. spelling) and phoneme-based (i.e. pronunciation) information with statistical word segmentation. Each transliterated string is first segmented into individual words and then encoded as a Weighted Finite State Transducer (WFST). Final back-transliterations are produced by cascading composition of this WFST with WFSTs representing the transliteration model and language model.
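
As a rough illustration only (not the speaker's WFST implementation), the idea of combining spelling-based and pronunciation-based evidence with a language model can be sketched as a noisy-channel ranking. The linear interpolation weight `alpha`, the function names, and all probabilities below are assumptions made up for this example.

```python
import math

def hybrid_score(lm_prob, grapheme_prob, phoneme_prob, alpha=0.5):
    """Log-score of one English candidate for a given transliteration.

    lm_prob:       language-model probability of the English word
    grapheme_prob: spelling-based probability of the transliteration given the word
    phoneme_prob:  pronunciation-based probability of the same
    alpha:         hypothetical interpolation weight (1.0 = spelling only)
    """
    channel = alpha * grapheme_prob + (1.0 - alpha) * phoneme_prob
    if lm_prob <= 0.0 or channel <= 0.0:
        return float("-inf")
    return math.log(lm_prob) + math.log(channel)

def rank_candidates(candidates, alpha=0.5):
    """Return (word, score) pairs sorted from best to worst."""
    scored = [
        (word, hybrid_score(lm, g, p, alpha))
        for word, (lm, g, p) in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy, invented probabilities for the transliteration "kyasshu":
# word -> (language model, grapheme model, phoneme model)
toy_candidates = {
    "cache": (0.003, 0.20, 0.30),
    "cash": (0.004, 0.02, 0.30),
    "cashew": (0.0005, 0.02, 0.05),
}

hybrid_ranking = rank_candidates(toy_candidates, alpha=0.5)
phoneme_only_ranking = rank_candidates(toy_candidates, alpha=0.0)
```

With these invented numbers, the two homophones tie on the phoneme model, so the ranking depends on how much weight the spelling-based component receives. A real system would perform this search by composing transducers rather than enumerating candidates.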

Furthermore, I will describe a method for extracting transliteration pairs from comparable corpora. The proposed method exploits the structure of comparable corpora to extract a large subset of similarly distributed English words for each Japanese transliteration, and then relies on phonetic similarity (i.e. back-transliteration) to find the best match in this subset. Back-transliteration also produces a similarity score, which can be used to rank the extracted pairs.
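
The two-stage extraction described above might be sketched as follows. This is a simplified illustration under stated assumptions: distributional similarity is approximated by cosine similarity over per-document counts, and the back-transliteration score is replaced by a plain string-similarity stand-in (`difflib.SequenceMatcher`) on the romanized form. The function names and all counts are hypothetical.

```python
import math
from difflib import SequenceMatcher

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extract_pair(romaji, ja_profile, en_profiles, top_k=2):
    """Return (english_word, similarity) for one Japanese transliteration.

    Stage 1: keep the top_k English words whose distribution across the
             comparable documents is most similar to the Japanese word's.
    Stage 2: pick the shortlisted word most similar to the romanized form
             (a stand-in for a real back-transliteration score).
    """
    shortlist = sorted(
        en_profiles,
        key=lambda w: cosine(ja_profile, en_profiles[w]),
        reverse=True,
    )[:top_k]
    scored = [(w, SequenceMatcher(None, romaji, w).ratio()) for w in shortlist]
    return max(scored, key=lambda pair: pair[1])

# Invented counts of each word in four aligned document groups.
ja_profile = [3, 0, 2, 0]            # profile of "kyasshu"
en_profiles = {
    "cache": [2, 0, 3, 0],
    "memory": [3, 1, 2, 0],
    "election": [0, 4, 0, 3],
}

best_word, best_score = extract_pair("kyasshu", ja_profile, en_profiles)
```

The returned similarity plays the role of the score used to rank extracted pairs: pairs with higher scores would be listed first.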


Disclaimer: This page, its contents and style, are the responsibility of the author and do not necessarily represent the views, policies or opinions of The University of Melbourne.