Natural Language Processing


Machines that speak with us (Spoken Dialogue Systems) rely heavily on accurate transcription of the speech signal into readable text. When the system has low confidence in the automatic speech recognition (ASR) of a caller's utterance, a typical dialogue strategy requires the system to repeat its best guess and ask for confirmation. This leads to unnatural interactions and dissatisfied callers. Our novel methodology, wizard ablation, collects simulated human-system dialogues that vary in controlled ways, in order to investigate the problem-solving strategies people would use if their abilities and options were restricted to be more like a machine's. Our testbed application, the CheckItOut dialogue system, is modeled on a corpus of telephone transactions between patrons and librarians that we collected at New York City's Andrew Heiskell Braille & Talking Book Library. (Loqui is a Latin phrase meaning "I speak"; because the "I" in the case of an ablated wizard is neither the wizard nor the system, we like the alliterative allusion to Loki (lo-kee), the Norse god of mischief.)
Investigating human strategies for handling system errors in Spoken Dialogue Systems (SDS)
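The typical confirmation strategy described above can be sketched in a few lines. This is only an illustration of the general idea, not code from CheckItOut: the `AsrHypothesis` type, the `next_system_move` function, and the threshold value of 0.6 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class AsrHypothesis:
    """Best guess from a speech recognizer (hypothetical structure)."""
    text: str
    confidence: float  # assumed to lie in [0, 1]


# Hypothetical cutoff; a real system would tune this on data.
CONFIRM_THRESHOLD = 0.6


def next_system_move(hyp, threshold=CONFIRM_THRESHOLD):
    """Accept a confident hypothesis; otherwise repeat it and ask for confirmation."""
    if hyp.confidence >= threshold:
        return ("accept", hyp.text)
    return ("confirm", f'Did you say "{hyp.text}"?')


if __name__ == "__main__":
    print(next_system_move(AsrHypothesis("the hobbit", 0.9)))
    print(next_system_move(AsrHypothesis("the habit", 0.3)))
```

A single fixed threshold like this is what produces the repetitive confirmation requests mentioned above; the strategies people choose under similar restrictions are what the wizard ablation studies set out to observe.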


CADIM: Columbia Arabic Dialect Modeling

Arabic Dialect Modeling for Speech and Natural Language Processing


A suite of tools for morphological disambiguation, POS tagging, diacritization, lexicalization, stemming and other tasks.

A System for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization [1]

Much work has been done on specific natural language processing tasks for Arabic, such as tokenization, diacritization, morphological disambiguation, part-of-speech (POS) tagging, stemming and lemmatization. The MADA system, along with TOKAN, provides one solution to all of these problems.



The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences

The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences. CATiB contrasts with previous approaches to Arabic treebanking in its emphasis on faster production, at the cost of some constraints on linguistic richness. Two basic ideas inspire the CATiB approach. First, CATiB avoids annotating redundant linguistic information that can be determined automatically from syntactic and morphological analysis, e.g., nominal case. Second, CATiB uses a linguistic representation and terminology inspired by the long tradition of Arabic syntactic studies.


A toolkit for Arabic tokenization, POS tagging and Base Phrase Chunking

AMIRA is a successor suite to the ASVMTools. The AMIRA toolkit includes a clitic tokenizer (TOK), a part-of-speech tagger (POS) and a base phrase chunker (BPC), which acts as a shallow syntactic parser. The technology of AMIRA is based on supervised learning with no explicit dependence on knowledge of deep morphology. Hence, in contrast to systems such as MADA, it relies on surface data to learn generalizations. The tools share a unified framework that casts each of the component problems as a classification problem.
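To make the classification framing concrete, the sketch below turns a chunked sentence into one training instance per token, where each instance pairs surface features with a label. It is a generic illustration, not AMIRA's code or feature set: the function names, the two-token context window, the `<PAD>` symbol, and the English toy sentence with BIO-style chunk labels are all assumptions chosen for readability.

```python
def window_features(tokens, i, size=2):
    """Surface features for token i: the words in a fixed window around it."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # Positions outside the sentence get a padding symbol.
        feats[f"w[{offset}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


def to_instances(tokens, labels):
    """Cast sequence labeling as classification: one (features, label) pair per token."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [(window_features(tokens, i), labels[i]) for i in range(len(tokens))]


if __name__ == "__main__":
    # Toy data: B-/I- prefixes mark the beginning/inside of a base phrase.
    tokens = ["the", "library", "opens"]
    labels = ["B-NP", "I-NP", "B-VP"]
    for feats, label in to_instances(tokens, labels):
        print(label, feats)
```

The same instance format would serve tokenization, tagging, or chunking by changing only the unit being classified and the label set; any supervised classifier could then be trained on the resulting pairs.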
