Tagging Arabic Text
This process, identifying names, places, dates, and other nouns and noun
phrases that establish the meaning of a body of text, is critical to software
systems that process large amounts of unstructured data coming from sources such as email,
document files, and the Web.
Arabic words are classified into three main classes,
namely verb, noun and particle. Verbs are divided into three subclasses (Past
verbs, Present verbs, etc.); nouns into forty-six subclasses (e.g. Active participle,
Passive participle, Exaggeration pattern, Adjectival noun, Adverbial noun, Infinitive
noun, Common noun, Pronoun, Quantifier, etc.); and particles into twenty-three subclasses
(e.g. Additional, Resumption, Indefinite, Conditional, Conformational, Prohibition,
Imperative, Optative, Reasonal, Dubious, etc.). It is from these three main classes that
the rest of the language is derived.
The most important aspect of this system of describing Arabic is that all the subclasses
of these three main classes inherit properties from the parent classes.
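The inheritance relationship can be pictured with a minimal Python sketch. It is an illustration only, not the system described here: the WordClass type, the all_properties helper and the property names ("case", "definiteness", "derived_from_verb") are assumptions chosen for the example, while the class names come from the lists above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WordClass:
    """A node in a hypothetical word-class hierarchy."""
    name: str
    parent: Optional["WordClass"] = None
    properties: frozenset = frozenset()

    def all_properties(self) -> frozenset:
        """Return this class's own properties plus those of every ancestor."""
        inherited = self.parent.all_properties() if self.parent else frozenset()
        return inherited | self.properties


# Main class; the property names are illustrative assumptions.
NOUN = WordClass("Noun", properties=frozenset({"case", "definiteness"}))

# Subclasses declare only what is specific to them.
COMMON_NOUN = WordClass("Common noun", parent=NOUN)
ACTIVE_PARTICIPLE = WordClass(
    "Active participle",
    parent=NOUN,
    properties=frozenset({"derived_from_verb"}),
)

print(sorted(COMMON_NOUN.all_properties()))
print(sorted(ACTIVE_PARTICIPLE.all_properties()))
```

Each subclass stores only its own properties and looks up the rest through its parent, so anything stated once for Noun applies to every noun subclass.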
Arabic grammarians describe Arabic as being derived from three
main categories: noun, verb and particle.
Arabic is very rich in categorising words, and contains classes for almost every form
of word imaginable. For example, there are classes for nouns of instruments, nouns of
place and time, nouns of activity and so on. If we tried to use all the subclasses
described by Arabic grammarians, the size of the tagset would soon reach more than two
or three hundred tags. For this reason, we have chosen only the main classes. However,
because every subclass inherits from a parent class, it would be quite simple to extend
this tagset to include more subclasses.
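As a rough sketch of why such an extension is cheap, the tagset can be stored as a table mapping each tag to its parent. The tag names (NOUN_INSTRUMENT, NOUN_PLACE_TIME) and the functions add_subclass and main_class below are hypothetical and serve only to illustrate the idea; they are not the tagset's actual labels.

```python
# Each tag maps to its parent tag; main classes have no parent.
PARENT = {
    "NOUN": None,
    "VERB": None,
    "PARTICLE": None,
}


def add_subclass(tag: str, parent: str) -> None:
    """Register a new, finer-grained tag under an existing one."""
    if parent not in PARENT:
        raise KeyError(f"unknown parent tag: {parent}")
    if tag in PARENT:
        raise ValueError(f"tag already defined: {tag}")
    PARENT[tag] = parent


def main_class(tag: str) -> str:
    """Walk up the hierarchy until a main class is reached."""
    while PARENT[tag] is not None:
        tag = PARENT[tag]
    return tag


# Extending the tagset is one line per new subclass (hypothetical names).
add_subclass("NOUN_INSTRUMENT", "NOUN")
add_subclass("NOUN_PLACE_TIME", "NOUN")

print(main_class("NOUN_INSTRUMENT"))
```

Because every fine-grained tag can be collapsed back to its main class, text tagged with the extended tagset remains comparable with text tagged using only the main classes.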