Natural Language Processing
A System for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization 
Much work has addressed individual natural language processing tasks for Arabic, such as tokenization, diacritization, morphological disambiguation, part-of-speech (POS) tagging, stemming, and lemmatization. The MADA system, together with TOKAN, provides a single solution to all of these problems.
The Columbia Arabic Treebank (CATiB) is a database of syntactic analyses of Arabic sentences. In contrast to previous approaches to Arabic treebanking, CATiB emphasizes faster production, with some constraints on linguistic richness. Two basic ideas inspire the CATiB approach. First, CATiB avoids annotating redundant linguistic information that can be determined automatically from syntactic and morphological analysis, e.g., nominal case. Second, CATiB uses linguistic representations and terminology inspired by the long tradition of Arabic syntactic studies.
AMIRA is a successor suite to ASVMTools. The AMIRA toolkit includes a clitic tokenizer (TOK), a part-of-speech tagger (POS), and a base phrase chunker (BPC), i.e., a shallow syntactic parser. AMIRA is based on supervised learning with no explicit dependence on knowledge of deep morphology; in contrast to systems such as MADA, it therefore relies on surface data to learn generalizations. The tools share a unified framework that casts each component problem as a classification problem.
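To illustrate what casting a component problem as classification can look like, here is a minimal Python sketch for clitic tokenization. It is not AMIRA's code or feature set: the B/I labeling scheme, the character-window features, the function names, and the transliterated example string are all assumptions chosen for illustration. Each character of a word becomes one training instance whose label says whether it begins a new token (B) or continues the current one (I), and whose features come only from the surface string.

```python
from typing import Dict, List, Tuple


def encode_labels(segmented: str, sep: str = "+") -> Tuple[str, List[str]]:
    """Turn a segmented word into its surface string and per-character B/I labels."""
    chars: List[str] = []
    labels: List[str] = []
    for segment in segmented.split(sep):
        for i, ch in enumerate(segment):
            chars.append(ch)
            # The first character of each segment begins a token; the rest continue it.
            labels.append("B" if i == 0 else "I")
    return "".join(chars), labels


def window_features(word: str, i: int, size: int = 2) -> Dict[str, str]:
    """Surface-only features: the characters in a window around position i."""
    feats: Dict[str, str] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        # "_" is a padding symbol (hypothetical) for positions outside the word.
        feats[f"c[{offset}]"] = word[j] if 0 <= j < len(word) else "_"
    return feats


def training_instances(segmented: str) -> List[Tuple[Dict[str, str], str]]:
    """One (features, label) pair per character, ready for any classifier."""
    word, labels = encode_labels(segmented)
    return [(window_features(word, i), labels[i]) for i in range(len(word))]


def decode_labels(word: str, labels: List[str], sep: str = "+") -> str:
    """Rebuild a segmented word from per-character B/I predictions."""
    out: List[str] = []
    for ch, label in zip(word, labels):
        if label == "B" and out:
            out.append(sep)
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Illustrative transliterated string, not taken from any AMIRA resource.
    word, labels = encode_labels("w+b+ktAb")
    print(word, labels)
    print(decode_labels(word, labels))
```

Any supervised classifier could be trained on the pairs returned by `training_instances`; at prediction time, `decode_labels` turns the predicted labels back into a segmented word. POS tagging and chunking can be framed in the same way, with one label per token instead of one per character.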