Arabic Soundex Class


Khaled Al-Sham'aa
Terms that are often misspelled can be a problem for database designers. Names, for example, vary in length, can have unusual spellings, and are not unique. Words can be misspelled or have multiple spellings, especially across different cultures or national sources.

To solve this problem, we need phonetic algorithms that can find similar-sounding terms and names. Such a family of algorithms exists; its members are called Soundex algorithms, after the first patented version.

A Soundex search algorithm takes a word, such as a person's name, as input and produces a character string which identifies a set of words that are (roughly) phonetically alike. It is very handy for searching large databases when the user has incomplete data.
The original Soundex algorithm was developed by Robert C. Russell and Margaret K. Odell and first patented in 1918. The method is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which in turn are based on where you put your lips and tongue to make the sounds.
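To make the idea concrete, here is a minimal sketch of the classic English coding, in which consonants that sound alike share a digit. The digit groups are the standard ones. The names SOUNDEX_CLASSES, soundexDigit() and soundexSketch() are hypothetical and chosen only for this illustration; this is not the code of the class described here.

```php
<?php
// Digit classes of the classic English Soundex. Letters missing from this
// table (vowels, H, W, Y) carry no digit.
const SOUNDEX_CLASSES = [
    '1' => 'BFPV',
    '2' => 'CGJKQSXZ',
    '3' => 'DT',
    '4' => 'L',
    '5' => 'MN',
    '6' => 'R',
];

// Return the digit for one uppercase letter, or '' if it has none.
function soundexDigit(string $letter): string
{
    foreach (SOUNDEX_CLASSES as $digit => $letters) {
        if (strpos($letters, $letter) !== false) {
            return (string) $digit; // numeric array keys come back as ints
        }
    }
    return '';
}

// Hypothetical helper: build a 4-character key (first letter + 3 digits).
function soundexSketch(string $word): string
{
    $letters = preg_replace('/[^A-Z]/', '', strtoupper($word));
    if ($letters === '') {
        return '';
    }
    $key  = $letters[0];
    $last = soundexDigit($letters[0]);
    for ($i = 1; $i < strlen($letters) && strlen($key) < 4; $i++) {
        $ch = $letters[$i];
        if ($ch === 'H' || $ch === 'W') {
            continue; // H and W do not separate two equal codes
        }
        $digit = soundexDigit($ch);
        if ($digit !== '' && $digit !== $last) {
            $key .= $digit;
        }
        $last = $digit; // a vowel clears $last, so a repeated code is kept
    }
    return str_pad($key, 4, '0'); // pad short keys with zeros
}

echo soundexSketch('Robert'), ' ', soundexSketch('Rupert'), PHP_EOL;
```

Both names collapse to the same key because B and P fall in the same digit class, which is exactly the property a phonetic search relies on.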

PHP provides a built-in soundex() function, but it is limited to English and other Latin-based languages. The PHP manual describes it as follows: "Soundex keys have the property that words pronounced similarly produce the same soundex key, and can thus be used to simplify searches in databases where you know the pronunciation but not the spelling. This soundex function returns a string 4 characters long, starting with a letter."
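As a small illustration of that property, the following sketch groups a few names by the key that PHP's built-in soundex() returns, roughly as a database lookup might. The sample names and the $byKey variable are arbitrary choices made for this example.

```php
<?php
// Group names by their Soundex key; names that sound alike land together.
$names = ['Robert', 'Rupert', 'Rubin', 'Euler', 'Ellery'];

$byKey = [];
foreach ($names as $name) {
    $byKey[soundex($name)][] = $name;
}

print_r($byKey);
```

Robert and Rupert end up under one key, as do Euler and Ellery, while Rubin gets a key of its own.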

We developed this class as an Arabic counterpart to the English Soundex. It takes an Arabic input string encoded in the windows-1256 character set and returns a Soundex key equivalent to the one that PHP's standard soundex() function produces for English and other Latin-based languages. This is possible because the original algorithm works on how characters sound, not on the meaning of the word itself.
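One way to picture this is to map Arabic letters to Latin letters with a similar sound and pass the result to soundex(). The sketch below does this under loud assumptions: the table ARABIC_TO_LATIN_SKETCH and the function arabicSoundexSketch() are hypothetical, the table covers only a handful of letters, and the input is assumed to be UTF-8 (a windows-1256 string could first be converted with iconv('WINDOWS-1256', 'UTF-8', $raw)). It is not the class's actual implementation.

```php
<?php
// Hypothetical, deliberately tiny transliteration table (UTF-8 keys).
// It is NOT the table used by the class; a real one would cover the whole
// alphabet and deal with diacritics.
const ARABIC_TO_LATIN_SKETCH = [
    'ا' => 'a', 'ب' => 'b', 'د' => 'd', 'ر' => 'r',
    'س' => 's', 'م' => 'm', 'ن' => 'n', 'ي' => 'y',
];

// Hypothetical helper: transliterate, then reuse PHP's built-in soundex().
function arabicSoundexSketch(string $utf8Word): string
{
    // strtr() with an array replaces every mapped letter in one pass.
    return soundex(strtr($utf8Word, ARABIC_TO_LATIN_SKETCH));
}

// Two similar-sounding names with different spellings share one key.
var_dump(arabicSoundexSketch('سمير') === arabicSoundexSketch('سامر'));
```

Reusing soundex() keeps the resulting keys in the same four-character format as keys for Latin-script names, which is the equivalence the class aims for.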