Projects

From Language and Information Technologies



Current

Lexical semantics

Graph-based methods for natural language processing

Project webpage

Many language processing applications can be modeled as graphs. These data structures naturally encode the meaning and structure of a cohesive text, and closely follow associative or semantic memory representations. The activation or ranking of nodes in such graphs mimics, to some extent, the functioning of human memory, and can be turned into a rich source of knowledge for several language processing applications. The goal of this research project is to investigate and develop techniques for applying spreading activation and graph-based ranking models to text processing, and to explore the benefits that such models, inspired by psychological theories of human memory, bring to NLP tasks. The outcome will be a unifying framework for applying spreading activation and node and relation ranking to text-based graph structures, together with an in-depth analysis of the applicability of these models to automatic text processing. The project is funded by the Texas Advanced Research Program and Google.
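As a rough illustration of graph-based ranking over text, the sketch below builds an undirected word co-occurrence graph and scores its nodes with a PageRank-style iteration. It is a minimal example under stated assumptions, not the project's actual framework: the function names, the window size, the damping factor, and the sample sentence are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Link each token to the tokens that follow it within `window` positions.

    The window size is an illustrative assumption, not a value from the project.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def rank_nodes(graph, damping=0.85, iterations=50):
    """Run a PageRank-style update on an undirected graph.

    Each node shares its current score equally among its neighbours.
    """
    n = len(graph)
    scores = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) / n + damping * incoming
        scores = new_scores
    return scores


# Hypothetical sample input; a real system would tokenize and filter the text first.
tokens = "graph ranking models rank nodes in a graph built from text".split()
graph = build_cooccurrence_graph(tokens)
scores = rank_nodes(graph)
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_words)
```

On an undirected graph like this one, the scores track node degree fairly closely, so words that co-occur with many different neighbours rise to the top. Spreading activation would use the same graph but propagate activation outward from a chosen set of seed nodes instead of iterating to a global ranking.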

Finding important information in unstructured text

Project webpage

The vast majority of the information we deal with in everyday life consists of raw, unstructured text, where the most important facts or concepts are not always readily available, but hidden among the myriad details that accompany them. To handle and digest the sheer amount of information we are exposed to in this information age, more sophisticated procedures are required to unveil the important parts of a text and to allow us to process more information in less time. The goal of this project is to develop robust and accurate techniques to automatically extract important information from unstructured text, in the form of phrases or entire sentences, enabling tasks such as keyword extraction and extractive summarization. This project is funded by Google.
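To make the idea of extractive summarization concrete, the sketch below selects the sentences whose content words are most frequent in the text. This is a simple word-frequency baseline chosen for illustration, not the technique developed in the project; the stopword list, the function names, and the sample text are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "for", "from", "by"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    selected = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in selected]


# Hypothetical sample text.
text = (
    "Keyword extraction finds important phrases. "
    "Summarization selects important sentences. "
    "The weather was pleasant."
)
print(summarize(text, n_sentences=1))
```

The summary is extractive because it reuses sentences from the input verbatim and generates no new text. Keyword extraction can be sketched the same way by ranking words or phrases instead of sentences.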

PicNet: a pictorial knowledge-base

Project webpage

Sentiment and subjectivity analysis

Project webpage

There has been a great deal of research in Natural Language Processing on automatically extracting facts and topics. But full-fledged knowledge discovery from text will require not only facts, but also sentiments, affect, and opinions (subjectivity). The goal of this research project is to develop accurate methods for the construction of resources and automatic methods for affect and sentiment analysis in multiple languages.

Multilingual text processing

Project webpage

The emphasis of many empirical approaches to Machine Translation (MT) is on widely used languages such as English, French, Mandarin, or Spanish. Such research has been driven by the availability of large parallel corpora such as the Canadian (French-English) and Hong Kong (Mandarin-English) Hansards. However, the world is ever-changing, and current events often create an immediate need for language processing capabilities in languages for which there are few existing online corpora, lexicons, or processing tools. Moreover, a significant percentage of the world's 7,200 spoken languages are close to extinction, so there is an increasing need for sustained conservation efforts for such endangered languages. We have two overall goals in this project. First, to develop techniques for the derivation of tools and linguistic resources for minority or under-studied languages where online corpora and natural language processing tools are in short supply. Second, to collect large parallel texts for less widely used languages, and thereby create corpora suitable to sustain research in Machine Translation.

Computational humour

Project webpage

Previous

SenseLearner: All-Words Word Sense Disambiguation

Project webpage includes software download, publications, and a demo.

Word Sense Disambiguation (WSD) is a core task in natural language processing and is considered essential for major applications like text understanding, common sense reasoning, and machine translation. Previous research on WSD has produced good disambiguation schemes for the relatively few words for which training data has been available. In contrast, there have been few attempts to create systems that disambiguate all words in open text. In this project, we conduct exploratory research on various WSD techniques to enable the development of a tool for semantic tagging of all words in open text. The project was partially funded by NSF.

TeachComputers: Distributed Knowledge Capture with Volunteer Contributions over the Web

Project webpage

The TeachComputers project is a collaborative effort to make computers more intelligent by integrating knowledge collected from Web users. It includes several Web-based activities, such as Open Mind Word Expert, RSDNet, etc. The project has already resulted in several large annotated data sets that were used in international evaluations (Senseval). Joint work with Timothy Chklovski.


SPOT: Semantic Parsing for Open Text

Project webpage

SPOT is a rule-based semantic parser that relies on a frame dataset (FrameNet) and a semantic network (WordNet) to identify semantic relations between words in open text, as well as shallow semantic features associated with concepts in the text. Parsing semantic structures allows semantic units and constituents to be accessed and processed in a more meaningful way than syntactic parsing does, moving the automation of natural language understanding to a higher level.


(Semi)automatic Generation of Lexically Annotated Corpora

Ambiguity is inherent to human language. Successful solutions for automatic resolution of ambiguity in natural language often require large amounts of annotated data to achieve good levels of accuracy. While recent advances in Natural Language Processing (NLP) have brought significant improvements in the performance of NLP methods and algorithms, there has been relatively little progress on addressing the problem of obtaining annotated data required by some of the highest-performing algorithms. As a consequence, many of today's NLP applications experience severe data bottlenecks. The goal of this project is to investigate a large range of methods for building annotated corpora, which will address the data annotation bottleneck currently faced by many language processing applications.
