Talks
From Language and Information Technologies
THE AUTOMATED PROCESSING OF BILINGUAL DISCOURSE
Thamar Solorio, University of Texas at Dallas
Friday, July 18, 2008, 11:30am, F223
Abstract:
Code-switching is an interesting linguistic phenomenon commonly observed in highly bilingual communities. It consists of mixing languages in the same conversational event. Despite its popularity, this type of discourse has received very little attention from the natural language processing community. Most of the work in this area attempts to solve problems where the language samples, either spoken or written, are monolingual.
We recently started working on developing a part-of-speech tagger for Spanish-English code-switched text. In this talk I will discuss results of different approaches to solve the tagging problem by taking advantage of existing resources for both languages. The long-term goal of this research is to develop a full syntactic parser for English-Spanish code-switched text, commonly known as Spanglish, that can be exploited to tackle higher-level tasks on mixed-language sources. Although the work is focused on English-Spanish bilingual discourse, the knowledge acquired from this project can later be extended to other language combinations.
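As an illustration only, one way to reuse existing monolingual resources is to guess the language of each token and route it to the matching monolingual tagger. The abstract does not say which approaches were tried, so this is a hypothetical sketch: the toy lexicons, `guess_language`, and `tag_code_switched` are assumptions standing in for real taggers and a real language identifier.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lexicons standing in for real monolingual POS taggers.
EN_LEXICON: Dict[str, str] = {
    "i": "PRON", "want": "VERB", "to": "PART", "go": "VERB", "the": "DET",
}
ES_LEXICON: Dict[str, str] = {
    "a": "ADP", "la": "DET", "tienda": "NOUN", "con": "ADP", "hermana": "NOUN",
}


def make_tagger(lexicon: Dict[str, str]) -> Callable[[str], str]:
    """Build a lookup tagger; unknown tokens get the placeholder tag 'X'."""
    def tagger(token: str) -> str:
        return lexicon.get(token.lower(), "X")
    return tagger


def guess_language(token: str) -> str:
    """Assumed heuristic: Spanish only if the token is in the Spanish
    lexicon and not in the English one; otherwise default to English."""
    t = token.lower()
    if t in ES_LEXICON and t not in EN_LEXICON:
        return "es"
    return "en"


def tag_code_switched(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Return (token, guessed language, tag) for each token."""
    taggers = {"en": make_tagger(EN_LEXICON), "es": make_tagger(ES_LEXICON)}
    result = []
    for tok in tokens:
        lang = guess_language(tok)
        result.append((tok, lang, taggers[lang](tok)))
    return result


if __name__ == "__main__":
    for row in tag_code_switched("I want to go a la tienda".split()):
        print(row)
```

A real system would replace the lexicon lookups with trained taggers and would need a better strategy for tokens that are valid in both languages.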
KEYWORDS IN THE MIST: AUTOMATED KEYWORD EXTRACTION FOR VERY LARGE DOCUMENTS AND BACK OF THE BOOK INDEXING
Andras Csomai
Friday, March 28, 2008, 10am, F223
Abstract:
This thesis addresses the problem of automatic keyphrase extraction from large documents and back of the book indexing. The potential benefits of automating this process are far-reaching, from improving information retrieval in digital libraries to saving countless man-hours by helping professional indexers create back of the book indexes.
The thesis introduces a new methodology to evaluate automated systems, which allows for a detailed, comparative analysis of several techniques for keyphrase extraction. We introduce and evaluate both supervised and unsupervised techniques, designed to balance the resource requirements of an automated system and the best achievable performance.
Additionally, a number of novel features are proposed, including a statistical informativeness measure based on Chi statistics; an encyclopedic feature that taps into the vast knowledge base of Wikipedia to establish the likelihood of a phrase referring to an informative concept; and a linguistic feature based on sophisticated semantic analysis of the text using current theories of discourse comprehension.
The resulting keyphrase extraction system is shown to outperform the current state of the art in supervised keyphrase extraction by a large margin. Moreover, a fully automated back of the book indexing system based on the keyphrase extraction system was shown to lead to back of the book indexes closely resembling those created by human experts.
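To make the statistical informativeness feature mentioned in the abstract concrete, here is a minimal sketch that assumes "Chi statistics" refers to Pearson's chi-square statistic on a 2x2 contingency table. It compares how often a phrase occurs in a document with how often it occurs in a background corpus. The function names and the choice of table are assumptions for illustration, not the thesis's actual formulation.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def informativeness(phrase_in_doc: int, doc_size: int,
                    phrase_in_bg: int, bg_size: int) -> float:
    """Score a phrase by how strongly its rate in the document departs
    from its rate in a background corpus (assumed setup).

    Rows: document vs. background. Columns: phrase vs. any other phrase.
    """
    a = phrase_in_doc
    b = doc_size - phrase_in_doc
    c = phrase_in_bg
    d = bg_size - phrase_in_bg
    return chi_square_2x2(a, b, c, d)


if __name__ == "__main__":
    # A phrase far more frequent in the document than in the background.
    print(informativeness(50, 1000, 50, 100000))
    # A phrase occurring at the same rate in both (0.5%): scores zero.
    print(informativeness(5, 1000, 500, 100000))
```

The statistic alone does not say whether the phrase is over- or under-represented in the document, so a practical feature would also check the direction of the difference.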
LEARNING LANGUAGE FROM ITS PERCEPTUAL CONTEXT
Raymond Mooney, University of Texas at Austin
Friday, March 7, 2008, 11:30-12:30pm, F223
Abstract:
Current systems that learn to process natural language require laboriously constructed human-annotated training data. Ideally, a computer would be able to acquire language like a child by being exposed to linguistic input in the context of a relevant but ambiguous perceptual environment. As a step in this direction, we present a system that learns language from sportscasts of simulated soccer games. The training data consists of textual human commentaries on Robocup simulation games. A set of possible meanings for each comment is automatically constructed from game event traces. Our previously developed systems for learning to parse and generate natural language (KRISP and WASP) were augmented to learn from this data and then commentate novel games. The system is evaluated based on its ability to parse sentences into correct meanings and generate accurate descriptions of game events. Human evaluation was also conducted on the overall quality of the generated sportscasts and compared to human-generated commentaries.
Bio:
Raymond J. Mooney is a Professor in the Department of Computer Sciences at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana/Champaign. He is an author of over 150 published research papers, primarily in the areas of machine learning and natural language processing. He was program co-chair for the 2006 AAAI Conference on Artificial Intelligence, general chair of the 2005 Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, and co-chair of the 1990 International Conference on Machine Learning. He is a Fellow of the American Association for Artificial Intelligence and recipient of best paper awards from the National Conference on Artificial Intelligence, the SIGKDD International Conference on Knowledge Discovery and Data Mining, and the Annual Meeting of the Association for Computational Linguistics. His recent research has focused on learning for natural-language processing, text mining for bioinformatics, statistical relational learning, and transfer learning.
THE STANFORD WORDNET PROJECT: AUTOMATIC ACQUISITION OF KNOWLEDGE FROM TEXT
Rion Snow, Stanford
Thursday, November 8, 2007, 11:00-12:30pm, F223
Abstract:
This talk describes our recent work in learning semantic relations and WordNet-like taxonomies from English text. We use machine learning methods to learn the hypernym (is a kind of) and coordinate term (is similar to) relations, and propose a model for inferring taxonomies that combines heterogeneous evidence sources for maximal benefit. Our work has resulted in the Stanford Wordnet Project, which currently offers an augmented version of WordNet with 400,000 additional automatically-inferred hyponyms.
Bio:
Rion Snow is a PhD Candidate in Computer Science at Stanford University, working with Professors Andrew Ng and Dan Jurafsky. Rion works at the intersection of machine learning and natural language processing, with a focus on computational semantics. He leads the Stanford Wordnet Project, which aims at learning large-scale semantic networks automatically from natural text. His work on automatically inferring semantic taxonomies recently received the Best Paper Award at the 2006 conference of the Association for Computational Linguistics.
THE ENHANCEMENT OF MACHINE TRANSLATION FOR LOW DENSITY LANGUAGES USING WEB-GATHERED PARALLEL TEXTS
MS THESIS DEFENSE
Michael Mohler, UNT
Monday, October 8, 2007, 1:00-2:00pm
Abstract:
The majority of the world's languages are poorly represented in informational media like radio, television, newspapers, and the Internet. Translation into and out of these languages may offer a way for speakers of these languages to interact with the wider world, but current statistical machine translation models are only effective with a large corpus of parallel texts - texts in two languages that are translations of one another - which most languages lack.
This thesis describes the Babylon project, which attempts to alleviate this shortage by supplementing existing parallel texts with texts gathered automatically from the Web - specifically targeting pages that contain text in a pair of languages. Results indicate that parallel texts gathered from the Web can be effectively used as a source of training data for machine translation, and can significantly improve the translation quality for text in a similar domain. However, the small quantity of high-quality low-density language parallel texts on the Web remains a significant obstacle.
INFORMATION EXTRACTION BEYOND DOCUMENT COLLECTIONS
ACQUISITION OF CLASS ATTRIBUTES FROM QUERY LOGS
Marius Pasca, Google Inc.
Monday, April 2, 2007, 11:30-12:30pm
Abstract:
As part of a large effort to acquire large repositories of facts from unstructured text on the Web, a seed-based framework for textual information extraction allows for weakly supervised extraction of class attributes (e.g., "side effects" and "generic equivalent" for drugs) from anonymized query logs. The extraction is guided by a small set of seed attributes, without any need for handcrafted extraction patterns or further domain-specific knowledge. The attributes of classes pertaining to various domains of interest to Web search users have accuracy levels significantly exceeding current state of the art. Inherently noisy search queries are shown to be a highly valuable, albeit unexplored, resource for Web-based information extraction, for the task of class attribute extraction.
Bio:
Marius Pasca is a senior research scientist in the research group at Google. He earned a Ph.D. degree in Computer Science from Southern Methodist University, Dallas, Texas in December 2001, and an M.Sc. degree in Computer Science from Joseph Fourier University, Grenoble, France in June 1998. He is the author of the book "Open-domain question answering from large text collections", published in April 2003. Current research interests include factual information extraction from unstructured text and advanced matching functions for information retrieval.
THE LANGUAGE OF SPACE AND TIME
AND SOME COMPUTATIONAL ATTEMPTS TO UNDERSTAND IT
Paul Morarescu, University of Texas at Dallas
Thursday, March 1, 2007, 6:00pm-7:00pm
Abstract:
In 1514 A.D., Nicolaus Copernicus marked one of the greatest moments in the history of science with the first outline of his heliocentric model of the Solar System. Five hundred years and millions of astronomical observations later, the Sun, the Moon and other celestial bodies still "rise" and "set" in many languages of the world. The sky is up both for the Canadians and the Australians, although their "ups" are in opposite directions. Our languages indicate that, for most practical purposes, we still refer to our world as flat, stationary, and located in the center of the Universe.
These languages (English included) describe a world that hosts a large number of seemingly curious phenomena such as fences that "run" from one wall to another, stock markets and prices that are often "rising," "falling," "crashing" or "skyrocketing," roads that "take" people to their destinations, people who are "in" love and have "high" expectations until they grow "up" and get "over" them, and people who put their thoughts "into" words and "pass" them to their "close" friends overseas. On the same flat world, it "is going" to rain tomorrow although nothing is going anywhere. Finally, an abstract concept such as time is known to be "long" or "short," "ahead" of us or "back," to "fly" like an arrow, and in general to "be money."
This, however, is not a presentation about metaphors or an argument for their linguistic pervasiveness. For that, I recommend Lakoff, Johnson, and Levin. I will still remind you that metaphors are advertised on some common trucks in Greece - kudos to Guy Deutscher. Then, I will focus on the computational implications of the growing cognitive linguistic consensus that most metaphors are born in space. That is, they are generated by our age-long habit of employing the concepts derived from our visual perception of the physical world in the description of virtually all other topics that we talk or think about, no matter how abstract they are. Countless cross-linguistic examples illustrate this thesis. At this point, an essential question arises in the computational linguist's mind: if language is really a metaphor of our perception of space, could we use a computational model of space to model the language as well?
To attempt an answer to this question, we will embark on a journey to learn about the semantics of space and motion in the complementary frameworks of lexical, compositional, constructional and cognitive semantic theories. We will contemplate the linguistic connection between space and time, and wonder whether Einstein was really the first to think about it. Diachronic and synchronic arguments, including etymological ones, will take us as far back in time as the origin of the distinction between content and function words. The linguistic meaning itself may appear to us as a function of space and time, among other variables. We will introduce the "localist" theories of language such as Jackendoff's Lexical-Conceptual Semantics (LCS). We will compare Fillmore's frame semantics with Talmy's (or Langacker's) cognitive grammar based on Figure, Ground, Frame of Reference and Psychological space. These concepts borrowed from Gestalt psychology apply naturally both to objects located in space and to events located in time and causal chains. They explain why our language is so much better at describing some scenes than others. We explore a possible mapping between frame semantics and cognitive semantics, which allows us to use the FrameNet annotated corpus to test Talmy's theory. This, in turn, enables us to ask and automatically answer new questions, and approach spatial reasoning (entailment) tasks. A brief review of existing models of spatial reasoning, including geometrical, logical and qualitative ones, is in order. Enter the ontologies, and the connections with domains such as robotics, bioinformatics and geographical information systems.
OK, maybe we won't have time for all of these. But we will try.
MEDIATED INFORMATION RETRIEVAL: THE WEBCLUSTER AND THE MIR PROJECTS
Prof. Gheorghe Muresan, Rutgers University
Friday, November 10, 2006, 11:30am-12:30pm, F223
Abstract:
Mediated Information Retrieval represents an interaction model that addresses a problem well-documented in information-seeking literature: users are frequently unable to articulate a query that clearly and comprehensively expresses their information need. This can be attributed to the information need being too ambiguous and not clearly defined in the user's mind, to a lack of problem domain knowledge on the part of the searcher, to a lack of understanding of a retrieval system's conceptual model, or to an inability to use a certain required query syntax.
We propose to address this problem by designing a system that emulates the human search mediator, or intermediary. The system can help a user explore a domain of interest, learn its structure, terminology and key concepts, and clarify and refine an information need; it can also help an information-seeker generate high-quality queries that can be submitted to the web or other such large and heterogeneous document collections. Alternatively, it can act as an intelligent agent and monitor a target collection, e.g. the web, for new documents that match the user's profile. Although the original system was implemented and tested only for single users, it was designed as an information assistant, as well as a collaboration intermediary. User profiles, coding topical interests, are built as the users search for and interact with information; a combination of explicit and implicit relevance feedback is used to assess the relevance of information objects. Recommendations can be made explicitly or automatically to other users with similar interests, as judged by their profiles.
The talk will describe the mediated retrieval concept, will discuss various design decisions, and will report on experiments conducted as part of the WebCluster project, at the Robert Gordon University, and the MIR project, at Rutgers University.
About the speaker: Gheorghe Muresan is an Assistant Professor in the School of Communication, Library and Information Science, Rutgers University. His research interest is in Information Retrieval, with particular focus on interactive IR, personalization, context modeling and clustering. As part of Interactive TREC 2002 and 2003 he investigated the effect of system support on user queries and the correlation between statistical properties of queries and retrieval effectiveness. A follow-up project, based on Rutgers logs from Interactive TREC, is looking at differences between queries submitted by different subjects for the same assigned topic, at query reformulation, and at statistical models of predicting query quality. In HARD TREC 2004 and 2005, Gheorghe investigated the use of metadata for personalization, and compared models and sources of expansion terms for query improvement. Currently, he is doing follow-up work to those projects, and working on a project proposal on combining aspects of personalization.
TRAIT PROCESS VARIABLES AS INDICES OF SYSTEMIC ANOMALOUS BEHAVIOR IN VIRTUAL INTERACTION
Thomas Kidenda, University of North Texas
Friday, November 3, 2006, 4:00pm-5:30pm, ISB 218
Abstract:
In contrast to users' experience in pre-Web IR-systems, the integration of virtual interfaces into a formerly visceral information search-space transforms user-system interaction(s) into a distinctly more obscure process. Theoretically, the contemporary information search space and interaction scenarios now present unprecedented variables that implicitly challenge the disciplinary truisms long held in LIS about the dynamics of users' cognitive processes, users' conceptual maps, users' information seeking behavior, and user-systems interactions. Of particular interest to this study are the pitfalls of designing systems around a perceived model of a canonical or typical user. Thus, whereas it has long been recognized that it is important to take into account some significant characteristics of people when designing interactive IR-systems, therein lies the problem, insofar as systems function implicitly on the basis of a model of a traditional canonical user. Inherently, it follows that the working assumptions and dynamics of such systems are tailored to the perceived existence or pursuit of a user type and/or cognitive type that accounts for exemplars. Accordingly, this study posits that when systems exhibit and/or users employ navigational strategies, search criteria or interaction techniques that appear to tend towards non-conforming, anomalous, non-normative, or atypical paths and outcomes, this potentially raises questions about what we know about non-normative users and/or the IR systems that possibly engender this uncharacteristic category of users.
Therefore, the prospect that manifestations of non-normative trait behavior can reliably function as indices of systemic anomalies insofar as systems fail to optimally adapt to users' goals is intriguing and worth investigating.
NATURAL LANGUAGE INTERFACES TO DATABASES
Yohan Chandra, University of North Texas
Thursday, October 5, 2006, 4:00pm - 5:30pm, NTRP F223
Abstract:
In today's information era, databases represent one of the major sources of information. They are the storage format of choice for information in a large variety of fields, ranging from patient databases in the medical domain, flight and hotel information in the tourism field, employee information in various companies, stock prices, movie schedules, and many others. In order to obtain information from a database, one needs to formulate a query in such a way that the computer will understand it. Unfortunately, not everybody is able to write such queries, and this task proves particularly difficult for those who lack a computer science background. On the other hand, people can effortlessly and naturally communicate using their natural language, which, however, computers cannot understand.
Natural Language Interfaces to Databases (NLIDB) are systems that aim to bridge this gap, and automatically translate natural language sentences to database queries. This thesis proposes a novel approach to NLIDB, using graph-based models. The system starts by collecting as much information as possible from existing databases and sentences, and transforms this information into a knowledge base for the system. Given a new question, the system will use this knowledge to analyse and translate the sentence into its corresponding database query statement. The graph-based NLIDB system uses English as the natural language, a relational database model, and SQL as the formal query language. In experiments performed with natural language questions run against a large database containing information about U.S. geography, the system showed good performance compared to the state-of-the-art in the field.
AN APPROACH TOWARDS SELF-SUPERVISED CLASSIFICATION USING CYC
Kino Coursey, University of North Texas
Thursday, October 5, 2006, 2:00pm - 3:30pm, NTRP F223
Abstract:
Due to the long duration required to perform manual knowledge entry by human knowledge engineers, it is desirable to find methods to automatically acquire knowledge about the world by accessing online information. In this work we examine using the Cyc ontology to guide the creation of Naive Bayes classifiers to provide knowledge about items described in Wikipedia articles. Given an initial set of Wikipedia articles the system uses the ontology to create positive and negative training sets for the classifiers in each category. The order in which classifiers are generated and are used to test articles is also guided by the ontology. The research conducted shows that a system can be created that utilizes statistical text classification methods to extract information from an ad-hoc generated information source like Wikipedia for use in a formal semantic ontology like Cyc. Benefits and limitations of the system are discussed along with future work.
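The idea of letting an ontology decide which articles count as positive or negative examples can be sketched as follows. Everything here is a toy assumption: the `PARENT` taxonomy and `SEEDS` articles are invented for illustration and are not Cyc or Wikipedia content, and treating every article outside a category's subtree as a negative is only one plausible choice. The classifier is a plain multinomial Naive Bayes with add-one smoothing.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy taxonomy (child -> parent); not real Cyc content.
PARENT: Dict[str, str] = {
    "Dog": "Animal", "Cat": "Animal", "Animal": "Thing",
    "Car": "Vehicle", "Vehicle": "Thing",
}

# Hypothetical seed articles: (text, category).
SEEDS: List[Tuple[str, str]] = [
    ("the dog barks and fetches", "Dog"),
    ("the cat purrs and sleeps", "Cat"),
    ("the car has an engine and wheels", "Car"),
]


def ancestors(category: str) -> List[str]:
    """Walk up the taxonomy from a category to the root."""
    chain = []
    while category in PARENT:
        category = PARENT[category]
        chain.append(category)
    return chain


def training_sets(category: str) -> Tuple[List[str], List[str]]:
    """Positives: articles labeled with the category or a descendant.
    Negatives (assumption): every other seed article."""
    pos, neg = [], []
    for text, label in SEEDS:
        if label == category or category in ancestors(label):
            pos.append(text)
        else:
            neg.append(text)
    return pos, neg


def train(pos: List[str], neg: List[str]):
    """Collect per-class word counts, vocabulary, and document counts."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text in pos:
        counts["pos"].update(text.split())
    for text in neg:
        counts["neg"].update(text.split())
    vocab = set(counts["pos"]) | set(counts["neg"])
    docs = {"pos": len(pos), "neg": len(neg)}
    return counts, vocab, docs


def is_member(model, text: str) -> bool:
    """Multinomial Naive Bayes decision with add-one smoothing."""
    counts, vocab, docs = model
    total_docs = docs["pos"] + docs["neg"]
    scores = {}
    for label in ("pos", "neg"):
        if docs[label] == 0:
            continue  # no examples for this class, so it cannot win
        score = math.log(docs[label] / total_docs)
        total_words = sum(counts[label].values())
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log(
                    (counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get) == "pos"


if __name__ == "__main__":
    animal_model = train(*training_sets("Animal"))
    print(is_member(animal_model, "a dog sleeps"))
    print(is_member(animal_model, "engine wheels"))
```

Training one such classifier per category, in an order dictated by the taxonomy, would follow the same pattern.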
PARSING ARABIC DIALECTS
Mona Diab, Columbia University
Friday, September 22, 2006, 11:30am - 12:30pm, NTRP F223
Abstract:
The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect natural language processing tools such as parsers. In this talk, we address the problem of parsing transcribed spoken Levantine Arabic (LA). We do not assume the existence of any annotated LA corpus (except for development and testing), nor of a parallel LA-MSA corpus. Instead, we use explicit knowledge about the relation between LA and MSA.
About the speaker:
Dr. Mona Diab is a research scientist at Columbia University, working in the Natural Language Processing group in the Center for Computational Learning Systems. She did her postdoctoral research work at Stanford in the Natural Language Processing lab, after receiving her doctoral degree from the University of Maryland College Park.
PREVIOUS SEMINARS
- Martha Palmer, University of Pennsylvania, April 1, 2005, 11:00am. Putting Meaning into Your Trees [abstract]
- Paul Thompson, Dartmouth College, February 11, 2005, 7:00pm. Intelligence and Security Informatics [abstract]
- Vasile Rus, U.Memphis, November 12, 2004, 11:00am. Using World Knowledge for Question Answering. [abstract]
- Diane Cook, A.I. Lab, UT Arlington, September 17, 2004, 11:00am. [abstract]
- Bob Parks, Wordsmyth, "Dictionaries: The Art and Craft of Lexicography", March 31 2004, 11:30am, F223. [abstract]
- Ted Pedersen, "NLP Research at UMD", February 26, 2004, 11:30am, F219.
- Lei Shi, "A General Purpose Semantic Parser Using FrameNet and WordNet", November 26, 2003, 11am (GAB 320). [abstract]
- Fernando Gomez, "A Computational View of Verb Predicates and Semantic Roles", November 21, 2003, 11am. [abstract]
- Carlo Strapparava, "Getting Serious About the Development of Computational Humor", August 8, 2003, 11am. [abstract]
- Li Yang, "Building an Intelligent Filtering System with Idea Indexing", July 8, 2003. [abstract]
- Guili Sun, "XML Based agent scripts and inference mechanisms", May 8, 2003. [abstract]
- Ted Pedersen, "Using Measures of Semantic Relatedness for Word Sense Disambiguation", April 3, 2003, 5pm. [abstract]
- Sebastian Hammer, "Seminar on Information Systems and Human Computer Interaction", March 13, 2003, 5pm. [abstract]
- Klaus Truemper, "Futile Questioning in Intelligent Systems", February 14, 2003, 11am. [abstract]