A System for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization

Much work has been done on individual natural language processing tasks for Arabic, such as tokenization, diacritization, morphological disambiguation, part-of-speech (POS) tagging, stemming and lemmatization. The MADA system, together with TOKAN, provides one solution to all of these problems.

Our approach distinguishes between the problems of morphological analysis (what are the possible readings of a word out of context?) and morphological disambiguation (which reading is correct in a specific context?). Morphological analysis and disambiguation are handled in the MADA component of our system. Once a morphological analysis is chosen in context, we can determine its full POS tag, lemma and diacritization. Knowing the morphological analysis also allows us to tokenize and stem deterministically.
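The split between the two steps can be illustrated with a minimal Python sketch. It is not MADA's implementation: the `Analysis` class, the `LEXICON` table, the coarse POS labels and the feature-matching score are all hypothetical stand-ins, and the only data taken from this page is the list of five readings of والى given under Arabic Processing Challenges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    segmentation: str      # segmented form as written on this page
    gloss: str             # English gloss as written on this page
    pos: str               # hypothetical coarse POS label, for illustration only
    has_conjunction: bool  # hypothetical feature: does the reading start with و+ 'and'?


# Toy lexicon holding the five readings of one ambiguous word.
LEXICON = {
    "والى": [
        Analysis("والي", "ruler", "noun", False),
        Analysis("و+الى+ي", "and to me", "prep", True),
        Analysis("و+ألي", "and I follow", "verb", True),
        Analysis("و+آل+ي", "and my clan", "noun", True),
        Analysis("و+آلي", "and automatic", "adj", True),
    ],
}


def analyze(word):
    """Out-of-context step: return every candidate reading of the word."""
    return LEXICON.get(word, [])


def disambiguate(word, predicted):
    """In-context step: keep the candidate that agrees with the most
    feature values predicted from the context (a toy scoring rule)."""
    candidates = analyze(word)
    if not candidates:
        return None

    def score(analysis):
        return sum(getattr(analysis, name) == value
                   for name, value in predicted.items())

    return max(candidates, key=score)
```

Here `predicted` stands in for whatever the context suggests, for example `{"pos": "noun", "has_conjunction": True}`. Once a single `Analysis` is selected, everything attached to it (segmentation, gloss, POS) comes for free, which is the point the paragraph above makes about POS tags, lemmas and diacritization.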

Since there are many different ways to tokenize Arabic (tokenization is a convention adopted by researchers), the TOKAN component allows the user to specify any tokenization scheme that can be generated from disambiguated analyses. The tokenized version is produced using the ARAGEN generator (Habash 2004).
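As a rough illustration of why tokenization becomes deterministic once an analysis is chosen, the following Python sketch derives several tokenizations from one disambiguated analysis. The `Morpheme` type, the role names and the `tokenize` function are assumptions made for this example and are not TOKAN's scheme language. The sketch also joins unsplit morphemes by plain concatenation, whereas the real system regenerates word forms with a generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str
    role: str  # hypothetical role names: "conj", "prep", "stem", "pron"


def tokenize(morphemes, split_roles):
    """Split off only the clitics whose role is listed in split_roles.

    Split proclitics are marked with a trailing '+', split enclitics with a
    leading '+'. Everything else stays attached to its neighbour.
    """
    stem_index = next(i for i, m in enumerate(morphemes) if m.role == "stem")
    tokens = []
    pending = ""  # unsplit proclitic material waiting for the next token
    for i, m in enumerate(morphemes):
        if i < stem_index:
            if m.role in split_roles:
                tokens.append(pending + m.surface + "+")
                pending = ""
            else:
                pending += m.surface
        elif i == stem_index:
            tokens.append(pending + m.surface)
            pending = ""
        elif m.role in split_roles:
            tokens.append("+" + m.surface)
        else:
            tokens[-1] += m.surface  # enclitic stays attached
    return tokens


# The reading و+آل+ي 'and my clan' as a morpheme sequence.
and_my_clan = [Morpheme("و", "conj"), Morpheme("آل", "stem"), Morpheme("ي", "pron")]
```

Calling `tokenize(and_my_clan, set())` leaves the word whole, `{"conj"}` separates only the conjunction, and `{"conj", "pron"}` separates both clitics. The same analysis therefore supports any scheme that can be stated in terms of its morphemes.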

Arabic Processing Challenges

The Arabic language raises many challenges for natural language processing (NLP).

First, Arabic is a morphologically complex language. The morphological analysis of a word consists of determining the values of a large number of (partially orthogonal) features, such as basic part-of-speech (e.g., noun or verb), voice, gender, number, and information about the clitics. For Arabic, this gives us about 333,000 theoretically possible completely specified morphological analyses. In contrast, English morphological tagsets usually have about 50 tags, which cover all morphological variations.

Second, Arabic orthographic rules cause some parts of words to be deleted or modified when cliticization occurs. For example, the Taa-Marbuta appears as a regular Taa when followed by a pronominal clitic. Simple segmentation of the pronominal clitic without recovering the Taa-Marbuta could cause unnecessary ambiguity or add to the sparsity problem.

Third, Arabic is written with optional diacritics that primarily specify short vowels; they are usually absent, which contributes to ambiguity.

Finally, the writing system also shows different levels of specificity in spelling some letters, e.g., أ can be spelled without the Hamza (ء) as ا, and ي can be spelled without the dots as ى.

The complexity of the morphology together with the underspecification of the orthography creates a high degree of ambiguity. On average, a word form in the Penn Arabic Treebank (PATB; Maamouri et al. 2004) has about 12 morphological analyses. For example, the word والى can be analyzed as والي 'ruler', و+الى+ي 'and to me', و+ألي 'and I follow', و+آل+ي 'and my clan' or و+آلي 'and automatic'. Each of these cases has a different diacritization.
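The orthographic points above can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the system's own procedure: the function names are hypothetical, the mapping of إ and آ to ا is a common normalization convention added here alongside the أ and ى cases named in the text, and `known_stems` is a stand-in for a real lexicon.

```python
import re

# Arabic diacritic marks from tanween (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Collapse variant spellings onto one letter.
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # أ -> ا (named in the text)
    "\u0625": "\u0627",  # إ -> ا (assumption: common convention)
    "\u0622": "\u0627",  # آ -> ا (assumption: common convention)
    "\u0649": "\u064A",  # ى -> ي (named in the text)
})


def normalize(word):
    """Drop diacritics and collapse spelling variants, so that forms
    differing only in these respects compare equal."""
    return DIACRITICS.sub("", word).translate(NORMALIZE)


def restore_taa_marbuta(stem, known_stems):
    """After a pronominal clitic is split off, a stem ending in a regular
    Taa (ت) may really end in Taa-Marbuta (ة). Restore it only when the
    restored form is a known stem (known_stems is a hypothetical lexicon)."""
    if stem.endswith("\u062A"):
        candidate = stem[:-1] + "\u0629"
        if candidate in known_stems:
            return candidate
    return stem
```

Under this normalization, والى and والي become the same string, which is exactly why the written form والى can stand for the reading والي 'ruler'. Collapsing variants in this way makes matching easier but also merges readings that the fuller spelling would keep apart, such as و+ألي and و+آلي.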

