Downloads
Various software modules and data sets, made available under the terms of the GNU General Public License. Both data and software are distributed without any warranty.
GWSD: Unsupervised Graph-based Word Sense Disambiguation
- GWSD is a system for unsupervised, all-words, graph-based word sense disambiguation. download GWSD 1.0 (September 13, 2007)
Ravi Sinha and Rada Mihalcea, Unsupervised Graph-based Word Sense Disambiguation Using Measures of Word Semantic Similarity, in Proceedings of the IEEE International Conference on Semantic Computing (ICSC 2007), Irvine, CA, September 2007. pdf
Rada Mihalcea, Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling, in Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, October 2005. pdf
Affective Text: Data Annotated for Emotions and Polarity
- Affective Text is a data set consisting of 1000 test headlines and 200 development headlines, each annotated with the six Ekman emotions and with polarity orientation. download (July 13, 2007).
Carlo Strapparava and Rada Mihalcea, SemEval-2007 Task 14: Affective Text, in Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval 2007), Prague, Czech Republic, June 2007. pdf
Read more about the task here.
SenseLearner: All-Words Word Sense Disambiguation Tool
- SenseLearner 2.0 download (June 13, 2005).
- Changes in version 2.0: a client-server model that allows for significantly faster tagging; a simpler input file format (the SemCor-like format is no longer required).
- SenseLearner 1.0 (beta) download (Nov 18, 2004).
Benchmark for the evaluation of back-of-the-book indexing systems
- A benchmark for the evaluation of systems for back-of-the-book indexing download. The benchmark is described in:
Andras Csomai and Rada Mihalcea, Creating a Testbed for the Evaluation of Automatically Generated Back-of-the-book Indexes, in Proceedings of the Conference on Computational Linguistics and Intelligent Text Processing (CICLing), LNCS, Mexico City, February 2006. pdf
FrameNet - WordNet verb sense mapping
- FnWnVerbMap 1.0 download. A mapping between verb lexical units in FrameNet II and verb senses in WordNet. The mapping process is described in:
Lei Shi and Rada Mihalcea, Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing, CICLing 2005, Mexico. pdf
Resources and Tools for Romanian NLP
- Romanian corpus of newspaper articles (and two novels), 50 million words. [research purposes only - send a request to rada at cs unt edu]
- Romanian sense tagged data, 39 ambiguous words download
- Romanian-English parallel texts, sentence-aligned, 1 million words (each side) download [research purposes only - send a request to rada at cs unt edu]
- Romanian-English word aligned data (2003) download
- See also the webpage of the HLT/NAACL 2003 workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond for related tools & resources.
- Romanian-English word aligned data (2005) download
- See also the webpage of the ACL 2005 workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond for related tools & resources.
- Romanian-English dictionary (38,000 entries) download
- For other resources and tools for Romanian, see the ConsiLR webpage.
Open Mind Word Expert Sense Tagged Data
- OMWE 1.0: Sense tagged data for 288 nouns, created within the Open Mind Word Expert framework during one year of activity. download
- OMWE 2.0: Sense tagged data for nouns, verbs, adjectives, created within the Open Mind Word Expert framework. These data sets were used during the Senseval-3 evaluations.
TWA Sense Tagged Data
- Sense tagged data for six words with two-way ambiguities (bass, crane, motion, palm, plant, tank). download
Resources for Word Alignment
- Word aligned data for Romanian-English, English-French.
- Parallel texts for training.
- Code for word alignment evaluation.
All of these are available from the webpage of the HLT/NAACL 2003 workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond.
SemCor
Texts semantically annotated with WordNet 1.6 senses (created at Princeton University) and automatically mapped to WordNet 1.7 and WordNet 1.7.1.
- SemCor 1.6 download
- SemCor 1.7 download
- SemCor 1.7.1 download
- SemCor 2.0 download
- SemCor 2.1 download
- SemCor 3.0 download
WordNet mappings
Mappings between synset offsets in various WordNet versions.
- WordNet 1.6 - 1.7 download
- WordNet 1.6 - 1.7.1 download
- WordNet 1.7 - 1.7.1 download
- WordNet 1.6 - 2.0 download
- WordNet 1.7.1 - 2.0 download
Senseval-2 and Senseval-3 English all-words data converted into SemCor format
- Senseval-2 English all-words converted into SemCor format. download
- Senseval-3 English all-words converted into SemCor format. download
Text Filtering
- Evaluation software for text filtering systems. It implements normalized utility, F-measure, precision, and recall, as defined in the TREC 2002 Filtering task. Usage is straightforward and closely follows the TREC 2002 Filtering guidelines. download
- More soon...
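The four measures named in the Text Filtering entry can be sketched as follows. This is not the distributed evaluation software, only a minimal Python illustration. The class and function names are hypothetical. The constants (credit of 2 per relevant document retrieved, penalty of 1 per non-relevant document retrieved, utility floor of -0.5, and beta of 0.5 for the F-measure) are assumptions recalled from the TREC 2002 Filtering track and should be checked against the track guidelines.

```python
from dataclasses import dataclass


@dataclass
class FilteringCounts:
    """Per-topic counts for a filtering run (hypothetical container)."""
    relevant_retrieved: int      # relevant documents the system delivered
    nonrelevant_retrieved: int   # non-relevant documents the system delivered
    relevant_missed: int         # relevant documents the system did not deliver


def precision(c: FilteringCounts) -> float:
    retrieved = c.relevant_retrieved + c.nonrelevant_retrieved
    return c.relevant_retrieved / retrieved if retrieved else 0.0


def recall(c: FilteringCounts) -> float:
    relevant = c.relevant_retrieved + c.relevant_missed
    return c.relevant_retrieved / relevant if relevant else 0.0


def f_beta(c: FilteringCounts, beta: float = 0.5) -> float:
    """F-measure; beta < 1 weights precision more heavily than recall."""
    p, r = precision(c), recall(c)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


def scaled_utility(c: FilteringCounts, credit: float = 2.0,
                   penalty: float = 1.0, min_nu: float = -0.5) -> float:
    """Linear utility, normalized by its maximum and rescaled to [0, 1]."""
    relevant = c.relevant_retrieved + c.relevant_missed
    if relevant == 0:
        raise ValueError("normalized utility is undefined without relevant documents")
    utility = credit * c.relevant_retrieved - penalty * c.nonrelevant_retrieved
    normalized = utility / (credit * relevant)
    # Clip at the floor so a single very poor topic cannot dominate an average.
    return (max(normalized, min_nu) - min_nu) / (1 - min_nu)


if __name__ == "__main__":
    counts = FilteringCounts(relevant_retrieved=6,
                             nonrelevant_retrieved=2,
                             relevant_missed=4)
    print(precision(counts), recall(counts), f_beta(counts), scaled_utility(counts))
```

Under these assumptions, a system that delivers nothing for a topic with at least one relevant document scores 1/3 on the scaled utility rather than 0, because the floor shifts the zero-utility point upward.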
QA Data Set: Annotated questions
- Annotations for about 5,500 questions used in an analysis of information requests. Questions are drawn from the Excite log and from the TREC QA benchmark. This is the data set used in the experiments reported in:
Rada Mihalcea, The Semantic Wildcard, in Proceedings of the LREC 2002 Workshop on "Using Semantics for Information Retrieval and Filtering: State of the Art and Future Research", Las Palmas, Spain, May 2002.
| | Excite | TREC |
|---|---|---|
| Annotated data | What Which | What Which |
| Question types | What Which | What Which |