About
This project is devoted to building a large multilingual semantic
network through the application of novel techniques for semantic
analysis specifically targeted at the Wikipedia corpus. The driving
hypothesis of the project is that the structure of Wikipedia can be
effectively used to create a highly structured graph of world knowledge
in which nodes correspond to entities and concepts described in
Wikipedia, while edges capture ontological relations such as hypernymy
and meronymy. Special emphasis is given to exploiting the multilingual
information available in Wikipedia in order to improve the performance
of each semantic analysis tool. Significant research effort is therefore
aimed at developing tools for word sense disambiguation, reference
resolution and the extraction of ontological relations that use
multilingual reinforcement and the consistent structure and focused
content of Wikipedia to solve these tasks accurately. An additional
research challenge is the effective integration of inherently noisy
evidence from multiple Wikipedia articles in order to increase the
reliability of the overall knowledge encoded in the global Wikipedia
graph. Computing probabilistic confidence values for every piece of
structural information added to the network is an important step in this
integration, and it is also meant to provide increased utility for
downstream applications. The proposed highly structured semantic network
complements existing semantic resources and is expected to have a broad
impact on a wide range of natural language processing applications in
need of large-scale world knowledge.
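As a rough illustration of the confidence-weighted graph described above, the sketch below merges noisy evidence from several articles into one confidence value per edge. It is only a minimal sketch under stated assumptions: the noisy-OR combination rule, the function names `noisy_or` and `build_graph`, and the sample evidence tuples are all hypothetical and are not taken from the project's tools or datasets.

```python
from collections import defaultdict


def noisy_or(confidences):
    """Combine confidences for one edge, assuming independent sources.

    Assumption: noisy-OR is used here only as one simple combination
    rule; the project does not state which rule it uses.
    """
    p_no_support = 1.0
    for p in confidences:
        p_no_support *= (1.0 - p)
    return 1.0 - p_no_support


def build_graph(evidence):
    """Group evidence by (source, relation, target) and merge confidences."""
    grouped = defaultdict(list)
    for source, relation, target, confidence in evidence:
        grouped[(source, relation, target)].append(confidence)
    return {edge: noisy_or(values) for edge, values in grouped.items()}


# Hypothetical evidence extracted from different articles:
# (source node, relation, target node, extractor confidence)
evidence = [
    ("The New York Times", "instance-of", "Newspaper", 0.6),
    ("The New York Times", "instance-of", "Newspaper", 0.5),
    ("Newspaper", "is-a", "Publication", 0.7),
]

graph = build_graph(evidence)
for (source, relation, target), confidence in sorted(graph.items()):
    print(f"{source} --{relation}--> {target}: {confidence:.2f}")
```

Under this rule, an edge supported by several weak extractions ends up with a higher confidence than any single extraction. That is one way to picture how redundant evidence across articles could increase reliability.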
The project is a collaboration between the Language and Information Technologies group at the University of North Texas and the Natural Language Processing group at Ohio University. The project is sponsored by the National Science Foundation under awards #1018613 and #1018590.
People
Data
- WPCat -- a Wikipedia taxonomic relation dataset. It contains ten text files, each corresponding to one root
category from Wikipedia. Each file contains a directed acyclic graph of categories and titles sampled automatically
from the Wikipedia category graph as descendants of the corresponding root category. Node-to-parent and node-to-root
pairs have been manually annotated for is-a and instance-of relations. More details can be found in: Mike Chen and
Razvan Bunescu, Taxonomic Relation Extraction from Wikipedia: Datasets and Algorithms, Technical Report, June 2011.
[download]
- WPCoref -- a Wikipedia (co)reference dataset. It contains three large Wikipedia articles (John Williams, Barack
Obama, and The New York Times) that were manually annotated with coreference and reference information. Coreference
relations were annotated for all markable noun phrases, similar to the MUC guidelines. Furthermore, each coreference
chain was manually linked to the Wikipedia title that describes the corresponding entity, if such a title exists. The
files are in the AIF format recognized by the Callisto annotation interface. More details can be found in:
Razvan Bunescu, (Co)Reference Resolution in Wikipedia, Technical
Report, August 2011. [download]
Publications
- Carmen Banea and Rada Mihalcea, Word Sense Disambiguation with
Multilingual Features, International Conference on Semantic Computing,
Oxford, UK, January 2011. [pdf]
- Samer Hassan and Rada Mihalcea, Corpus-based and Knowledge-based
Measures of Semantic Relatedness, in Proceedings of the American
Association for Artificial Intelligence (AAAI 2011), San Francisco,
August 2011. [pdf]
- Mike Chen and Razvan Bunescu, Taxonomic Relation Extraction from
Wikipedia: Datasets and Algorithms, Technical Report, June 2011. [pdf]
- Razvan Bunescu, (Co)Reference Resolution in Wikipedia, Technical
Report, August 2011. [pdf]
Last modified 08/31/2011