Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining

publication date

  • April 2013

start page

  • 87

end page

  • 109

issue

  • 1

volume

  • 35

International Standard Serial Number (ISSN)

  • 0219-1377

Electronic International Standard Serial Number (EISSN)

  • 0219-3116

abstract

  • Named Entity Recognition and Classification (NERC) is an important component of applications such as Opinion Tracking, Information Extraction, and Question Answering. When these applications must work in several languages, NERC becomes a bottleneck because its development requires language-specific tools and resources such as lists of names or annotated corpora. This paper presents a lightly supervised system that acquires lists of names and linguistic patterns from large raw text collections in Western languages, starting from only a few seeds per class selected by a human expert. Experiments have been carried out with English and Spanish news collections and with the Spanish Wikipedia. Evaluation of NE classification on standard datasets shows that the NE lists achieve high precision and that contextual patterns increase recall significantly. The system would therefore be helpful for applications where annotated NERC data are not available, such as those that must handle several Western languages or information from different domains.
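
The seed-based acquisition loop summarized in the abstract can be illustrated with a minimal sketch. This is not the paper's system: the function name `bootstrap`, the use of a single preceding word as a "pattern", the capitalization test for candidate names, the support threshold, and the toy sentences are all assumptions chosen to keep the example short.

```python
def bootstrap(sentences, seeds, rounds=2, min_support=2):
    """Toy bootstrapping loop (illustrative assumption, not the paper's method).

    Alternates between (1) learning left-context words that precede known
    names and (2) harvesting new capitalized tokens that follow those words.
    """
    names = {cls: set(items) for cls, items in seeds.items()}
    patterns = {cls: set() for cls in seeds}
    tokenized = [s.split() for s in sentences]

    for _ in range(rounds):
        # Step 1: keep a left-context word as a pattern only if at least
        # `min_support` distinct known names of the class follow it.
        for cls, known in names.items():
            support = {}
            for tokens in tokenized:
                for i in range(1, len(tokens)):
                    if tokens[i] in known:
                        support.setdefault(tokens[i - 1], set()).add(tokens[i])
            patterns[cls] |= {
                word for word, found in support.items()
                if len(found) >= min_support
            }

        # Step 2: any capitalized token that follows a learned pattern
        # is added to the name list of that class.
        for cls, learned in patterns.items():
            for tokens in tokenized:
                for i in range(1, len(tokens)):
                    if tokens[i - 1] in learned and tokens[i][:1].isupper():
                        names[cls].add(tokens[i])

    return names, patterns


# Hypothetical toy corpus and seeds, for illustration only.
sentences = [
    "The mayor of Madrid spoke",
    "The mayor of Paris spoke",
    "The mayor of Lisbon spoke",
    "He works for Google today",
    "She works for Nokia today",
    "They work for Siemens today",
]
seeds = {"LOC": ["Madrid", "Paris"], "ORG": ["Google", "Nokia"]}
names, patterns = bootstrap(sentences, seeds)
```

In this toy run the two seeds per class are enough to learn one context word per class, which in turn picks up one unseen name per class. A realistic system would need richer patterns and some scoring or filtering step to limit the drift that such loops are prone to.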

subjects

  • Computer Science

keywords

  • named entity recognition and categorization; information extraction; multilingual natural language processing; bootstrapping algorithms