
About

This project is devoted to building a large multilingual semantic network through the application of novel techniques for semantic analysis specifically targeted at the Wikipedia corpus. The driving hypothesis of the project is that the structure of Wikipedia can be effectively used to create a highly structured graph of world knowledge in which nodes correspond to entities and concepts described in Wikipedia, while edges capture ontological relations such as hypernymy and meronymy.

Special emphasis is placed on exploiting the multilingual information available in Wikipedia in order to improve the performance of each semantic analysis tool. Significant research effort is therefore aimed at developing tools for word sense disambiguation, reference resolution and the extraction of ontological relations that use multilingual reinforcement and the consistent structure and focused content of Wikipedia to solve these tasks accurately.

An additional research challenge is the effective integration of inherently noisy evidence from multiple Wikipedia articles in order to increase the reliability of the overall knowledge encoded in the global Wikipedia graph. Computing probabilistic confidence values for every piece of structural information added to the network is an important step in this integration, and it is also meant to provide increased utility for downstream applications. The proposed highly structured semantic network complements existing semantic resources and is expected to have a broad impact on a wide range of natural language processing applications in need of large-scale world knowledge.

The project is a collaboration between the Language and Information Technologies group at the University of North Texas and the Natural Language Processing group at Ohio University. The project is sponsored by the National Science Foundation under awards #1018613 and #1018590.

People

Data

  • WPCat -- a Wikipedia taxonomic relation dataset. It contains ten text files, each corresponding to one root category from Wikipedia. Each file contains a directed acyclic graph of categories and titles sampled automatically from the Wikipedia category graph as descendants of the corresponding root category. Node-to-parent and node-to-root pairs have been manually annotated for is-a and instance-of relations. More details can be found in: Mike Chen and Razvan Bunescu, Taxonomic Relation Extraction from Wikipedia: Datasets and Algorithms, Technical Report, June 2011. [download]
  • WPCoref -- a Wikipedia (co)reference dataset. It contains three large Wikipedia articles (John Williams, Barack Obama, and The New York Times) that were manually annotated with coreference and reference information. Coreference relations were annotated for all markable noun phrases, following conventions similar to the MUC guidelines. Furthermore, each coreference chain was manually linked to the Wikipedia title that describes the corresponding entity, if such a title exists. The files are in the AIF format recognized by the Callisto annotation interface. More details can be found in: Razvan Bunescu, (Co)Reference Resolution in Wikipedia, Technical Report, August 2011. [download]

Publications

  • Carmen Banea and Rada Mihalcea, Word Sense Disambiguation with Multilingual Features, International Conference on Computational Semantics (IWCS 2011), Oxford, UK, January 2011. [pdf]
  • Samer Hassan and Rada Mihalcea, Corpus-based and Knowledge-based Measures of Semantic Relatedness, in Proceedings of the American Association for Artificial Intelligence (AAAI 2011), San Francisco, August 2011. [pdf]
  • Mike Chen and Razvan Bunescu, Taxonomic Relation Extraction from Wikipedia: Datasets and Algorithms, Technical Report, June 2011. [pdf]
  • Razvan Bunescu, (Co)Reference Resolution in Wikipedia, Technical Report, August 2011. [pdf]


Last modified 08/31/2011