Tagging Arabic Text


Khaled Al-Sham'aa
Tagging is the process of identifying names, places, dates, and other nouns and noun phrases that establish the meaning of a body of text. It is critical to software systems that process large amounts of unstructured data from sources such as email, document files, and the Web.

Arabic words are classified into three main classes: verb, noun and particle. Verbs are divided into three subclasses (past verbs, present verbs, etc.); nouns into forty-six subclasses (e.g. active participle, passive participle, exaggeration pattern, adjectival noun, adverbial noun, infinitive noun, common noun, pronoun, quantifier, etc.); and particles into twenty-three subclasses (e.g. additional, resumption, indefinite, conditional, conformational, prohibition, imperative, optative, reasonal, dubious, etc.). It is from these three main classes that the rest of the language is derived.

The most important aspect of this system of describing Arabic is that all the subclasses of these three main classes inherit properties from their parent classes.

Arabic grammarians describe Arabic as being derived from three main categories: noun, verb and particle.
Arabic is very rich in categorising words, and has a class for almost every form of word imaginable. For example, there are classes for nouns of instrument, nouns of place and time, nouns of activity and so on. If we tried to use all the subclasses described by Arabic grammarians, the tagset would soon exceed two or three hundred tags. For this reason, we have chosen only the main classes. Because every class inherits from a parent class, however, it would be quite simple to extend this tagset to include more subclasses.
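
The inheritance described above can be pictured as a small tree of tags. The following Python sketch is only an illustration under stated assumptions, not the tagger's actual implementation: the `Tag` class, its `lookup` and `main_class` methods, and the `inflects_for_tense` property are hypothetical names chosen for this example. It shows how a subclass picks up properties from its parent, and how new subclasses can be added without changing the three main classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Tag:
    """One node in a hypothetical tag hierarchy (illustrative only)."""
    name: str
    parent: Optional["Tag"] = None
    properties: Dict[str, bool] = field(default_factory=dict)

    def lookup(self, key: str) -> Optional[bool]:
        """Return a property from this tag or, failing that, its nearest ancestor."""
        node = self
        while node is not None:
            if key in node.properties:
                return node.properties[key]
            node = node.parent
        return None

    def main_class(self) -> "Tag":
        """Walk up to the root, which is one of the three main classes."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node


# The three main classes. The property name and values are assumptions
# made for this example, not part of the described tagset.
NOUN = Tag("noun", properties={"inflects_for_tense": False})
VERB = Tag("verb", properties={"inflects_for_tense": True})
PARTICLE = Tag("particle", properties={"inflects_for_tense": False})

# A few subclasses named in the text, added without touching the main classes.
ACTIVE_PARTICIPLE = Tag("active participle", parent=NOUN)
PAST_VERB = Tag("past verb", parent=VERB)
CONDITIONAL = Tag("conditional", parent=PARTICLE)

# A subclass defines no properties of its own here; it inherits them.
print(PAST_VERB.lookup("inflects_for_tense"))
# A fine-grained tag can always be collapsed back to its main class.
print(ACTIVE_PARTICIPLE.main_class().name)
```

Under this layout, restricting the tagset to the main classes amounts to calling `main_class()` on any finer tag, and extending the tagset means adding one more `Tag` with the appropriate parent.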