Language Technology Seminar Series

Title: A Web Text Corpus for Natural Language Processing

Speaker: James Curran (University of Sydney)

Location: ICT Building, L2.06

Date: Friday 12 May 2006

Time: 1-2pm


Web text has been used successfully as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we built a Web Corpus by downloading web pages, producing a 'topic-diverse' collection of 10 billion words of English. We show that, for context-sensitive spelling correction, the Web Corpus gives better results than search engine hit counts. For thesaurus extraction, it achieves overall results similar to those of a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.

(James will also give an update on other ongoing projects at the University of Sydney.)


James Curran is an ARC Postdoctoral Fellow in the School of Information Technologies at the University of Sydney. He is interested in statistical approaches to Natural Language Processing ranging from theoretical and low-level component development through to high-level systems development in Question Answering and Information Extraction.
