KwIndex

Note: This documentation is a draft and incomplete. If you use KwIndex, I encourage you to also take a look at the source code.

What is it?

KwIndex is yet another library to facilitate full text indexing using PHP and MySQL.

Why KwIndex?

Admittedly, KwIndex is not the fastest full text indexing package available, nor the most scalable, nor the one that produces the smallest index, but it suffices for many small to medium applications. Its main advantage is that you can add documents to and remove documents from the index easily and cheaply, without rebuilding the whole index. This makes it a good fit for fast-changing document sets. It is not suited to document sets that are gigabytes or more in size.

Where can I get KwIndex?

http://steven.haryan.to/php/KwIndex.tar.gz

How do I install it?

KwIndex is a PHP library, so you first need PHP working on your site. KwIndex stores the full text index in several MySQL tables, so you also need access to a MySQL database.

To test KwIndex, download the package and extract it to a web-accessible directory on your server, then access test.php from that directory. The test suite will tell you whether KwIndex is fully working on your system.

Once the tests pass, copy KwIndex.lib to your PHP library path (e.g., /usr/local/lib/php/), or simply keep it in whichever directory holds the scripts that require it.

How do I use it?

  1. require "KwIndex.lib";
  2. Tell KwIndex how to retrieve the documents

    Subclass KwIndex and override the document_sub() method.

    class MyKwIndex extends KwIndex {
    	function document_sub($doc_ids) {
    		# ...
    	}
    }
    

    document_sub() is the method that KwIndex consults whenever it needs documents, which are identified by positive integers. The method should accept an array of document ids and return an associative array of documents, keyed by document id. Because document retrieval is delegated to you, KwIndex can index database columns, external text files, or even remote web pages, depending on how you write document_sub().

    For example, if you want to index text files named "1.txt", "2.txt", ... "1000.txt", you would write document_sub() something like this:

    function &document_sub($doc_ids) {
    	$path = "/home/yola/files";
    	$docs = array();
    
    	foreach ($doc_ids as $id) {
    		$filename = "$path/$id.txt";
    		# skip files that are missing or unreadable
    		$fd = @fopen($filename, "r");
    		if (!$fd) continue;
    		# fread() needs a positive length, so handle empty files separately
    		$size = filesize($filename);
    		$docs[$id] = $size > 0 ? fread($fd, $size) : "";
    		fclose($fd);
    	}
    	return $docs;
    }
    

    So when KwIndex wants to index documents 5 to 10, it will invoke document_sub() with code like the following:

    $docs = $this->document_sub(array(5, 6, 7, 8, 9, 10));
    
    and document_sub() should then return an associative array like this:
    array(
    	5 => "content of file 5.txt...",
    	6 => "content of file 6.txt ...",
    	7 => "content of file 7.txt ...",
    	8 => "content of file 8.txt ...",
    	9 => "content of file 9.txt ...",
    	10 => "content of file 10.txt ..."
    )
    

    You can use any set of positive integers as document ids, as long as each id is unique.

    Another example: if your text lives in a database column (say, the CONTENT column of the ARTICLES table, with the document ids in the ID column), you would code document_sub() like this:

    function &document_sub($doc_ids) {
    	$linkid = $this->linkid;
    	$docs = array();
    
    	# let's select the documents in a single query
    	$res = mysql_query("select ID, CONTENT from ARTICLES ".
    	                     "where ID in (".join(',', $doc_ids).")",
    	                   $linkid);
    
    	while($row = mysql_fetch_row($res)) {
    		$docs[ $row[0] ] = $row[1];
    	}
    	return $docs;
    }
    

    $linkid is the MySQL link identifier, which is stored in the object when you pass linkid or hostname/username/password to the KwIndex constructor. You can of course use another link identifier if the documents live in a different database from the one that stores the full text index.

    You can preprocess the document text any way you want in document_sub(), for example by stripping HTML tags or filtering out garbage. One thing to note: document_sub() must be able to return several documents at once (those whose ids are given in the $doc_ids array) so that KwIndex can index documents in batches. Indexing larger batches needs more memory, but it is much faster than indexing one document at a time.
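
    As a rough illustration of such preprocessing, here is a minimal sketch that reduces an HTML page to plain text. The function name clean_document_text() is hypothetical and not part of KwIndex; it relies only on standard PHP string functions, and it assumes the input is UTF-8.

    ```php
    <?php
    // Hypothetical helper (not part of KwIndex): turn an HTML page into
    // plain text before returning it from document_sub().
    function clean_document_text($html) {
        // Drop script/style blocks first; strip_tags() alone would remove
        // the tags but keep the code between them as indexable words.
        $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
        // Put a space before every tag so that words separated only by
        // markup do not run together once the tags are gone.
        $text = str_replace('<', ' <', $text);
        $text = strip_tags($text);
        // Decode entities such as &amp; into the characters they stand for.
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        // Collapse runs of whitespace into single spaces.
        return trim(preg_replace('/\s+/', ' ', $text));
    }

    $sample = '<h1>Dogs</h1><p>Fleas &amp; ticks</p><script>var x = 1;</script>';
    $clean  = clean_document_text($sample);
    ```

    Inside document_sub() you would then store clean_document_text($raw) instead of the raw text for each document id.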

  3. Indexing documents

    Once you have written the appropriate document_sub() method, create an instance of your KwIndex subclass and call the add_document() method:

    $kw = new MyKwIndex(array(
                          "db_name" => "MYDB",
                          "linkid"  => $linkid
                       ));
    
    The KwIndex constructor takes an associative array of arguments/options. The only required options are db_name and linkid. If you do not supply linkid, supply hostname, username, and password instead, and KwIndex will call mysql_connect() for you.

    If you cannot create the instance, you may not have supplied the correct database username/password. The user you supply needs sufficient privileges to read and write the database, and also to list its tables (read access to the mysql database of the MySQL server).

    KwIndex will create several tables of its own if they do not already exist.

    To index a batch of documents, pass their ids to the add_document() method. For example:

    $kw->add_document(array(1, 2, 3, 4, 5)); # index document 1 to 5
    $kw->add_document(range(100,200)); # index document 100 to 200
    

    Indexing larger batches will result in greater speed, but will require more memory to store the document contents while indexing.
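
    One way to balance speed against memory is to split a long list of ids into fixed-size batches. The sketch below assumes the add_document() method named above accepts an array of ids; index_in_batches() and the RecordingIndex stand-in class are hypothetical and exist only so the example runs without a database.

    ```php
    <?php
    // Hypothetical helper (not part of KwIndex): index $doc_ids in batches
    // of at most $batch_size ids. $kw is any object with an add_document()
    // method that takes an array of document ids.
    function index_in_batches($kw, $doc_ids, $batch_size = 100) {
        $batches = 0;
        foreach (array_chunk($doc_ids, $batch_size) as $batch) {
            $kw->add_document($batch);
            $batches++;
        }
        return $batches; // number of add_document() calls made
    }

    // Stand-in for a KwIndex subclass: it only records the batches it
    // receives, so the sketch can be run on its own.
    class RecordingIndex {
        public $calls = array();
        function add_document($doc_ids) { $this->calls[] = $doc_ids; }
    }

    $kw = new RecordingIndex();
    $n  = index_in_batches($kw, range(1, 250), 100);
    ```

    With a real index you would pass your KwIndex subclass instance instead of the stand-in, and tune the batch size to the memory you can spare.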

  4. Updating the index

    Whenever a document changes its content, it will need to be reindexed. To do this, just use update_document():

    $kw->update_document(array(3,6,7));
    

    If you no longer want certain documents to be indexed (e.g., to keep them from showing up in searches), use remove_document():

    $kw->remove_document(array(6,7,8));
    

    To remove the whole index, use remove_index():

    $kw->remove_index();
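
    To decide which of these calls to make after your document set changes, you can compare id lists. This is a minimal sketch under the assumption that your application keeps its own record of which ids it has indexed; plan_index_sync() is a hypothetical helper, not part of KwIndex.

    ```php
    <?php
    // Hypothetical helper (not part of KwIndex).
    //   $indexed_ids: ids your application has already indexed
    //   $current_ids: ids of the documents that exist now
    //   $changed_ids: ids whose content changed since they were indexed
    function plan_index_sync($indexed_ids, $current_ids, $changed_ids) {
        return array(
            // exist now but were never indexed
            'add'    => array_values(array_diff($current_ids, $indexed_ids)),
            // changed, already indexed, and still present
            'update' => array_values(
                array_intersect($changed_ids, $indexed_ids, $current_ids)),
            // indexed earlier but no longer present
            'remove' => array_values(array_diff($indexed_ids, $current_ids)),
        );
    }

    $plan = plan_index_sync(array(1, 2, 3, 4), array(2, 3, 4, 5, 6), array(3, 6));
    ```

    Each non-empty list in $plan can then be passed to the matching method: 'add' to add_document(), 'update' to update_document(), and 'remove' to remove_document().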
    
  5. Doing searches

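    The ability to search a large amount of text quickly is the whole point of full text indexing. To do this in KwIndex, use the search() method. search() takes an associative array of arguments/options. The options used in the examples below are:

    words: the words to search for, separated by spaces
    num: the maximum number of results to return
    start: the position of the first result to return, counting from 1
    boolean: set to 'OR' to match documents that contain any of the words; by default, a document must contain all of them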

    search() returns an array of matching document ids.

    Some examples:

    # return all document ids that contain all these words
    $doc_ids = $kw->search(array('words'=>'long-haired german shepherd'));
    
    # return only at most 10 results that contain all these words
    $doc_ids = $kw->search(array('words'=>'german shepherd dog', 'num'=>10));
    
    # return the 11th to 20th results
    $doc_ids = $kw->search(array('words'=>'german shepherd dog', 'num'=>10, 'start'=>11));
    
    # return all documents that match any of these words
    $doc_ids = $kw->search(array('words'=>'flea tick louse mite', 'boolean'=>'OR'));
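
    The 'num' and 'start' options lend themselves to paging through results. The sketch below builds the options array for a given page; it assumes 'start' counts from 1, as the "11th to 20th" example above suggests, and search_page_options() is a hypothetical helper, not part of KwIndex.

    ```php
    <?php
    // Hypothetical helper (not part of KwIndex): build the search() options
    // for page $page of the results, where the first page is page 1.
    function search_page_options($words, $page, $per_page = 10) {
        $page = max(1, (int) $page); // treat anything below 1 as page 1
        return array(
            'words' => $words,
            'num'   => $per_page,
            // assumption: 'start' is 1-based, so page 2 of 10 starts at 11
            'start' => ($page - 1) * $per_page + 1,
        );
    }

    $opts = search_page_options('german shepherd dog', 2);
    ```

    You would then call $kw->search($opts) to fetch that page of document ids.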
    
  6. An example/demo

    Try it out: http://wholesomesoft.com/kwindex-demo/demo.php
    Source code: http://wholesomesoft.com/kwindex-demo/demo.phps

So, how slow/fast is it?

You can expect KwIndex to be generally slower than other indexing packages. This is because KwIndex stores each hit in a separate MySQL table row, to make it easy to add hits to and remove hits from the index. Here are the numbers from our server machine, a dual PIII.

The documents are stored in BLOB columns of a MySQL database. Number of documents: 34921. Total size of text: 68M. The set contains 140k unique words. The vectorlist table stores 4.75 million hits (records), and its size is 49M (the index size is 55M).

One-word searches typically take 0.5-1 second, 4-word AND searches typically take 2 seconds, and 4-word OR searches typically take 3 seconds. YMMV. That kind of speed might be enough for your needs, or it might not.
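
Since the numbers depend heavily on your hardware and document set, it may be worth timing searches on your own system. Here is a minimal sketch of a timing wrapper; time_call() is a hypothetical helper, not part of KwIndex, and the demonstration call uses strtoupper() only so the example runs without a database.

```php
<?php
// Hypothetical helper (not part of KwIndex): call $callback with $args and
// report both its return value and the elapsed wall-clock time in seconds.
function time_call($callback, $args = array()) {
    $t0     = microtime(true);
    $result = call_user_func_array($callback, $args);
    return array(
        'result'  => $result,
        'seconds' => microtime(true) - $t0,
    );
}

// Demonstration with a built-in function.
$timing = time_call('strtoupper', array('german shepherd'));
```

To time a real search, you could pass array($kw, 'search') as the callback and array(array('words' => 'german shepherd')) as the arguments, then read $timing['seconds'].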

However, adding and removing documents is not expensive at all, and you can build the index on the fly as documents come in or become available. This is how we index more than a year's worth of news articles at satunet.com and Kafegaul, a network of Indonesian news portal sites now owned by M-Web.

See also

Copyright

© 2000, Steven Haryanto