The heuristic of choosing the first listed sense in a dictionary is a very powerful one, and WSD systems often fail to outperform it, particularly those that do not use hand-tagged training data. Many researchers use the first sense heuristic as a back-off method when sufficient information is not available from the context. However, the heuristic relies on hand-tagged data, which is costly to produce. Problems arise when there is little or no data for a given word in available resources. Furthermore, whilst there are hand-tagged corpora available for some languages, the frequency distribution of the senses of a word depends on the type of text one is looking at. For example, one would expect a different predominant sense for "star" if one were looking at scientific astronomy reports compared with popular news. We present work on the use of an automatically acquired thesaurus and the WordNet similarity package to rank WordNet senses automatically from raw textual corpora. The results are promising when evaluated against the gold standards provided by SemCor and Senseval data. We will demonstrate that this technique is superior to heuristics from manually created gold standards when dealing with low-frequency words or in domain-specific settings.
Diana McCarthy is a UK Royal Society Dorothy Hodgkin fellow in the Department of Informatics at Sussex University. She completed her PhD at Sussex in 2001 and worked there as a post-doc on a number of national and EU projects before starting the Royal Society fellowship in February 2005. Prior to the PhD she worked as a speech and language therapist, and in the commercial sector for companies producing a variety of AI products.