Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we built a Web Corpus by downloading web pages, yielding a 'topic-diverse' collection of 10 billion words of English. We show that for context-sensitive spelling correction, using the Web Corpus gives better results than using a search engine. For thesaurus extraction, the Web Corpus achieves overall results similar to those of a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.
(James will also give an update on other ongoing projects at Sydney University.)
James Curran is an ARC Postdoctoral Fellow in the School of Information Technologies at the University of Sydney. He is interested in statistical approaches to Natural Language Processing, ranging from theoretical and low-level component development through to high-level systems development in Question Answering and Information Extraction.