Transliterating words and names from one language to another is a frequent and highly productive phenomenon. For example, the English word "cache" is transliterated into Japanese as kyasshu. Transliteration is information-losing, since important distinctions are not always preserved in the process. Hence, automatically converting transliterated words back into their original form (i.e., back-transliteration) is a real challenge. Nonetheless, due to its wide applicability in machine translation (MT) and cross-language information retrieval (CLIR), it is an interesting problem from a practical point of view.
In this presentation, I will describe a method for automatically producing back-transliterations based on a hybrid model that combines grapheme-based (i.e., spelling) and phoneme-based (i.e., pronunciation) information with statistical word segmentation. Each transliterated string is first segmented into individual words and then encoded as a Weighted Finite State Transducer (WFST). Final back-transliterations are produced by composing this WFST, in a cascade, with WFSTs representing the transliteration model and the language model.
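The cascade idea above can be sketched at the word level. This is only a toy illustration, not the actual system: real implementations compose character-level WFSTs (e.g., with a toolkit such as OpenFst), and the candidate strings and weights below are invented for the example. Weights are treated as costs (lower is better), so composing two stages sums the costs along each path:

```python
# Hypothetical word-level stand-in for the WFST cascade described in the talk.
# Each stage maps an input string to weighted output candidates; weights are
# costs (e.g., negative log probabilities), so lower is better.

# Invented transliteration-model candidates for the romanized Japanese input.
translit_model = {
    "kyasshu": [("cache", 1.2), ("cash", 0.9), ("kashu", 3.5)],
}

# Invented language-model costs for English words.
language_model = {"cache": 2.0, "cash": 2.5, "catch": 2.2}


def compose(stage1, stage2):
    """Compose two weighted string relations, summing costs along each path."""
    out = {}
    for x, candidates in stage1.items():
        for y, c1 in candidates:
            if y in stage2:
                out.setdefault(x, []).append((y, c1 + stage2[y]))
    return out


def best(cascade, x):
    """Return the lowest-cost output for input x, or None if there is none."""
    cands = cascade.get(x)
    return min(cands, key=lambda t: t[1])[0] if cands else None


cascade = compose(translit_model, language_model)
print(best(cascade, "kyasshu"))  # → cache (cost 1.2 + 2.0 beats cash's 0.9 + 2.5)
```

Note how the language model rescues "cache" even though the transliteration model alone slightly prefers "cash"; this is the benefit of composing the models rather than taking the best candidate from a single stage.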
Furthermore, I will describe a method for extracting transliteration pairs from comparable corpora. The proposed method exploits the structure of comparable corpora to extract, for each Japanese transliteration, a large subset of similarly distributed English words, and then relies on phonetic similarity (i.e., back-transliteration) to find the best match in this subset. Back-transliteration also produces a similarity score, which can be used to rank the extracted pairs.
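The matching-and-ranking step might look like the following sketch. The talk does not specify the similarity function, so normalized edit distance over romanized strings is a stand-in assumption here, and the candidate list is invented:

```python
# Hypothetical sketch: pick the best English match for a back-transliterated
# term, scored so that identical strings get 1.0 and very different ones
# approach 0.0. Normalized edit distance is an assumption, not the talk's
# actual similarity measure.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def best_match(back_transliteration, candidates):
    """Return (score, word) for the highest-scoring candidate."""
    return max((similarity(back_transliteration, c), c) for c in candidates)


# e.g., matching a back-transliteration of "kyasshu" against similarly
# distributed English words drawn from the comparable corpora (invented list):
score, word = best_match("cache", ["catch", "cash", "cache"])
print(word, score)  # → cache 1.0
```

The same scores can then be used to order all extracted pairs, so that a downstream consumer can apply a confidence threshold or inspect the most reliable pairs first.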