CADIM: Columbia Arabic Dialect Modeling
The Arabic language is actually a collection of dialects with important phonological, morphological, lexical, and syntactic differences. However, throughout the Arab world, the standard written language is the same, Modern Standard Arabic (MSA), that is also used in some official spoken communication (newscasts, parliamentary debates). MSA is based on Classical Arabic and is itself not a native spoken language. This situation has important negative consequences for Arabic automatic speech recognition (ASR) and natural language processing (NLP): since the spoken dialects are not officially written, it is costly to obtain adequate corpora to use for training the kind of ASR and NLP tools commonly in use today, for example, language models for ASR. Experience has shown that using MSA text for language models is ineffective in improving dialect ASR.
This project aims at devising a way to hierarchically specify the morphology and syntax of a group of closely related languages/dialects. The syntax of a seed language (MSA) is automatically extracted from an existing treebank and is augmented by hand as needed, while the syntax of related dialects is specified manually to the extent that it differs from that of other dialects. A similar approach is pursued for morphology and the lexicon. These formal specifications are then used to derive transducers among the dialects. To test the utility of this approach, the transducers are used to convert MSA corpora to (an approximation of) dialect text. These "created" corpora in turn are used to train language models for the dialect, with the expectation of improving dialect ASR over the baseline in which only small dialect corpora or (large) MSA corpora are used for language modeling.
The project has the potential to improve the quality of ASR for Arabic dialects and, more generally, to increase the understanding of how closely related languages can be modeled formally. The developed NLP tools for Arabic dialects will be made available to the research community.