bot_recognizer: a class for recognizing bots and dispatching handlers
Many internet crawlers and spiders regularly scan your sites: they index content,
automatically post messages through your forms, and perform many other tasks (some of them quite malicious).
You may want your site to be bot-aware, so that the HTML it returns depends on which bot is reading it.
You will certainly want to protect your site from destructive actions such as posting spam on forums or harvesting email addresses.
This class makes those tasks easier.
The basic idea is not original: check the requester's IP address and User-Agent string.
The search is done in a local database of bot definitions.
(Of course, this will not protect against bots that spoof their IP address or user-agent string.)
Bot definitions can be stored in a simple delimited text file or in a DB table.
When an SQL DB is used to store bot definitions ("DB storage" mode), this class is Zend Framework aware: it can use one of the Zend_Db_Adapter_*
classes to access the data.
Bot recognition can analyze both the IP address and the user-agent string (the default), or only one of them.
Currently the class provides the following functions:
- Loading "initial" bot definitions from a delimited text file (a sample is included in the distribution);
- Importing the initial bot definitions into an SQL table, if you plan to use DB storage mode;
- Importing additional bot definitions from external files (local files or internet URLs);
at the moment the only supported format is that of the iplists.com bot collections (see the links at the bottom of this document);
- Detecting the bot "id" from the client's IP address and/or user-agent string.
When the Dispatch() method is used, the corresponding user procedure can be called.
Dispatching can call a unique function for each bot (if desired), or one function for all bots of the same "type":
e.g. function 1 for "indexing" bots, function 2 for email harvesters, etc.
Bots registered as "malicious" can share one common handler function.
Dispatch() can be limited to a "working time" interval of the day.
- Debugging site behavior for a chosen bot, by setting an "emulated" bot IP address and user-agent string.
Installation on the site
- Place the PHP module bot_recognizer.php and the "bot-defs.txt" file into your site's folder. If Zend Framework is not used on your site,
as_dbutils.php must be copied too.
- Manually add any bot definitions you need to the provided "bot-defs.txt" file.
- If you are going to use an SQL database to store bot definitions, adjust and run the script import-botdefs.php once,
so that the "initial" base is created from the "bot-defs.txt" file. This script re-creates the table in the DB (dropping the previous data),
so it can be run each time you change bot-defs.txt.
Simple code example
# using dispatch method example
require_once('bot_recognizer.php');
$botrec = new CBotRecognizer();
$botrec->RegisterActionForBots('MicrosoftBots','msn');
$botrec->RegisterActionForBots('GoogleBots','google');
$botrec->Dispatch();
echo "Hello, dear Human !"; // shown only if neither MicrosoftBots nor GoogleBots was called.
#...
function MicrosoftBots() {
die("This text is shown for Microsoft spiders (msn) !");
}
function GoogleBots() {
die("This text is shown for Google bots");
}
# using GetBotId() approach:
require_once('bot_recognizer.php');
$botrec = new CBotRecognizer();
$botid = $botrec->GetBotId();
if($botrec->IsMaliciousBot()) die("You are a malicious bot. Get lost !");
switch($botid) {
case 'google' : case 'msn' :
echo "This is Microsoft or Google bot. Let them in...";
break;
case CBotRecognizer::UNDEFINED_BOT:
die("some undefined bot detected !");
break;
case false: # GetBotId() returns false for a normal browser; "case 0" would also match any bot id string in PHP before 8.0
echo "Hello, dear Human !"; // shown only if no bot was detected.
break;
default: # all other detected bots...
echo "Bot $botid detected, no particular action for it !";
break;
}
Using the class CBotRecognizer
Minimal use is shown in the example above.
All working parameters keep their defaults: bot definitions are loaded from "bot-defs.txt", which must be placed in the same
folder as the bot_recognizer.php class module, and the search uses both the IP address and the user-agent string.
The bot definitions file is a text file delimited with the "|" character. Each line should contain 4 to 6 delimited values:
- short bot name (returned when this bot is recognized); we'll call it the "bot id"
- starting IP address
- ending IP address
- substring that identifies this bot if found in the User-Agent string
- optional parameter (integer) setting this bot's type
- optional parameter (1) marking this bot as "malicious"
File bot-defs.txt fragment
msn|65.55.211.113|65.55.211.119|msnbot
msn|65.55.232.22|65.55.232.22|msnbot
google|66.249.71.22|66.249.71.139|Googlebot
alexa|67.202.54.191|67.202.54.191|ia_archiver
yahoo|72.30.142.240|72.30.142.240|Yahoo!
webalta|76.73.62.242|76.73.62.242|webalta crawler
# some malicious bots from kloth.net, "1" in 5-th field means "malicious bot"
rdprm.gouv.qc.ca|207.96.148.8|207.96.148.8||1
easydl|76.10.155.74|76.10.155.74|EasyDL|1
More than one line can have the same bot id, because one company may run multiple crawlers on multiple IP addresses,
with multiple user-agent strings. Visit iplists.com
and look at any bot list (Google's, for example) to see what this means.
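The file format described above can be read with a few lines of PHP. The following is only a sketch, not the class's own loader: the function name parse_bot_def_line() is hypothetical, and it assumes the six-field layout from the field list (type in the 5th field, "malicious" flag in the 6th). The real loader may treat the optional fields differently.

```php
<?php
// Hypothetical helper, not part of bot_recognizer.php: parses one line of a
// "|"-delimited definitions file into an associative array.
function parse_bot_def_line($line) {
    $line = trim($line);
    // Skip blank lines and "#" comments.
    if ($line === '' || $line[0] === '#') {
        return null;
    }
    $f = explode('|', $line);
    if (count($f) < 4) {
        return null; // fewer than 4 fields: not a valid definition
    }
    return array(
        'id'        => $f[0],
        'ipfrom'    => $f[1],
        'ipto'      => $f[2],
        'agent'     => $f[3],
        // Assumption: field 5 is the bot type and field 6 the malicious flag.
        'type'      => isset($f[4]) ? (int)$f[4] : 0,
        'malicious' => isset($f[5]) ? (int)$f[5] : 0,
    );
}

$def = parse_bot_def_line('google|66.249.71.22|66.249.71.139|Googlebot');
echo $def['id'], ' ', $def['ipfrom'], '-', $def['ipto'], "\n";
```

Comment lines and lines with fewer than four fields yield null, so a caller can simply skip them.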
In "DB storage" mode there are two ways to access the bot definitions in the SQL DB:
1. Using the database wrapper class CDbEngine from as_dbutils.php, which is included in the distribution:
require_once('as_dbutils.php'); // database access wrapper
require_once('bot_recognizer.php');
$mydb = new CDbEngine(DBTYPE_MYSQL,'localhost','user','password','mydatabase');
$botrec = new CBotRecognizer(array('dbobject'=>$mydb));
#...
$botrec->Dispatch();
2. If your site uses Zend Framework (ZF), its database access classes can be used
(the MySQL PDO adapter is used in our examples):
require_once('Zend/Db.php');
require_once('Zend/Db/Table.php');
require_once('Zend/Db/Adapter/Pdo/Mysql.php');
require_once('bot_recognizer.php');
# create Zend db adapter object...
$mydb = new Zend_Db_Adapter_Pdo_Mysql(
array( 'host'=> 'localhost',
'username' => 'user',
'password' => 'password',
'dbname' => 'mydatabase'));
Zend_Db_Table::setDefaultAdapter($mydb);
# the rest is identical:
$botrec = new CBotRecognizer(array('dbobject'=>$mydb));
#...
$botrec->Dispatch();
Before using "DB storage" mode, you have to create the table for the bot definitions list in your DB.
This can be done in two ways:
- By calling the method CBotRecognizer::CreateBotDefTable(); this generates the empty table [prefix]bot_definitions.
(Warning: if a table with the same name exists, it will be dropped,
so don't forget to check your existing tables and use an appropriate name prefix!)
- By calling the method CBotRecognizer::LoadBotDefinitionsFile('',true): this function (re)creates the empty table (by calling CreateBotDefTable)
and loads the initial bot definitions from the bot-defs.txt file. You will probably want to edit bot-defs.txt before running LoadBotDefinitionsFile()
if you have your own bot definitions to be caught and processed.
Method list with descriptions
CBotRecognizer([$param]) - class constructor.
The passed $param should be an associative array containing any of the following parameters (key - meaning):
tableprefix - changes the prefix of the database table used to store bot definitions. The default is botrec_, so the table is named
"botrec_bot_definitions".
dbobject - a "database access" object, which must be created and prepared for work before the constructor is called.
It can be a Zend_Db_Adapter instance or a CDbEngine object (defined in as_dbutils.php).
The database connection parameters must be set before calling the constructor (see the examples above), because CBotRecognizer does not
open the database connection itself.
By default a CBotRecognizer object is created in "file storage" mode, so bot definitions are loaded from a delimited text file.
Passing the dbobject parameter switches to "DB storage" mode, so bot definitions are searched in the SQL database.
If no dbobject is passed, the constructor tries to load the initial bot definitions from bot-defs.txt (or from another file, if the 'sourcefile'
parameter is passed).
searchmode - one of the values
CBotRecognizer::SEARCH_IP_ONLY,
CBotRecognizer::SEARCH_IP_OR_AGENT
or CBotRecognizer::SEARCH_AGENT_ONLY.
The default is SEARCH_IP_OR_AGENT (equal to 1): bot identification checks both the
client IP address and the user-agent string, and if either matches, the corresponding bot name (id) is returned.
The other values enable only one search method: by IP address or by user-agent string (substring).
sourcefile - the name of the "initial" bot definitions file. By default the class tries to load data from the bot-defs.txt file in the same folder
as the bot_recognizer.php script.
worktime - sets the "active time" for dispatching.
When the current time is outside the passed interval, Dispatch() does nothing and just returns.
The valid format is a string "HH:MI-HH:MI" or a two-element array: array($time_from, $time_to). Time values must have leading zeroes.
For example, if you pass the interval "05:00-07:00", Dispatch() will work only from 05 to 07 AM.
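The "worktime" check can be pictured with the sketch below. It is an illustration only, not the class's code: in_work_time() is a hypothetical name, both interval ends are assumed to be inclusive, and intervals that cross midnight are not handled.

```php
<?php
// Hypothetical helper: returns true when $now ("HH:MI") lies inside the
// interval given as "HH:MI-HH:MI" or as array($from, $to).
function in_work_time($interval, $now) {
    if (is_string($interval)) {
        $interval = explode('-', $interval);
    }
    list($from, $to) = $interval;
    // Zero-padded "HH:MI" strings sort in the same order as the times they
    // represent, so plain string comparison is enough.
    return strcmp($now, $from) >= 0 && strcmp($now, $to) <= 0;
}

// In real use $now would be date('H:i'); a fixed value keeps the example stable.
var_dump(in_work_time('05:00-07:00', '06:30'));
```

The string comparison only works because of the leading zeroes, which is one plausible reason the format requires them.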
CreateBotDefTable() - creates an empty table for bot definitions in the connected database. The table name is [passed_prefix]bot_definitions.
The default name prefix is "botrec_", so the default table name is botrec_bot_definitions.
EmulateBot($ip [, $agent]) - sets "emulated" bot parameters.
Call this function when you want to test your page generation for a specific bot. Subsequent calls of GetBotId()
and Dispatch() will run as if the client request came from the specified IP address with $agent as its user-agent string.
SetSearchMode($mode) - explicitly changes the search mode. $mode must be one of the values CBotRecognizer::SEARCH_IP_ONLY (0),
CBotRecognizer::SEARCH_IP_OR_AGENT (1) or CBotRecognizer::SEARCH_AGENT_ONLY (2).
GetBotId($ua='',$ip='') - searches the bot definitions for the client's IP address and/or user-agent string and returns the id of the bot found.
Return value: the bot id (string); CBotRecognizer::UNDEFINED_BOT (-1) if some "undefined" bot is recognized
(by one of the predefined words, like 'crawl' or 'spider', found in the user-agent string); or false if a "normal browser" is accessing your page.
The optional parameters $ua and $ip make the method identify the bot for a specific user-agent string and/or IP address. For example,
this can be useful when analyzing site access logs.
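The lookup and its three kinds of return value can be sketched as follows. This is not the class's real code: find_bot() and the global constants are hypothetical stand-ins, the case-insensitive substring match is an assumption, and so is applying the 'crawl'/'spider' fallback only when the user-agent is being searched.

```php
<?php
// Search-mode values as documented for SetSearchMode().
define('SEARCH_IP_ONLY', 0);
define('SEARCH_IP_OR_AGENT', 1);
define('SEARCH_AGENT_ONLY', 2);
define('UNDEFINED_BOT', -1);

// Hypothetical lookup. Each definition is array(id, ipfrom, ipto, agent)
// with the addresses stored as integers.
function find_bot(array $defs, $ip, $ua, $mode = SEARCH_IP_OR_AGENT) {
    foreach ($defs as $d) {
        list($id, $from, $to, $agent) = $d;
        $ipHit = ($mode != SEARCH_AGENT_ONLY) && $ip >= $from && $ip <= $to;
        $uaHit = ($mode != SEARCH_IP_ONLY) && $agent !== ''
              && stripos($ua, $agent) !== false;
        if ($ipHit || $uaHit) {
            return $id;
        }
    }
    if ($mode != SEARCH_IP_ONLY) {
        // Assumed fallback: generic words in the user-agent mark an undefined bot.
        foreach (array('crawl', 'spider') as $word) {
            if (stripos($ua, $word) !== false) {
                return UNDEFINED_BOT;
            }
        }
    }
    return false; // looks like a normal browser
}

$defs = array(
    array('msn', ip2long('65.55.211.113'), ip2long('65.55.211.119'), 'msnbot'),
);
var_dump(find_bot($defs, ip2long('65.55.211.115'), 'Mozilla/5.0'));
```

Because the result can be a string, -1 or false, callers should compare it with === (or use "case false:" in a switch) rather than rely on loose comparison with 0.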
LoadBotDefinitionsFile($srcfile='', $clearexisting=false) - loads bot definitions from the specified text file.
If "file storage" is active, the bot definitions are added to an internal array for the current "searching session" only.
So with "file storage" you should call this method every time before recognizing a bot, whereas with "DB storage" it
should be called only once per file, to load all your text files into the database.
If the optional second parameter $clearexisting is true, the current bot definitions are cleared before loading.
In "DB storage" mode the SQL table for bot definitions is (re)created if $clearexisting is true.
ImportBotsFromUrl($bot_id, $url,$file_type=0,$malicious=0) can be used to load bot definitions from internet sources
or locally saved files in iplists.com format:
# fragment from google.txt file, downloaded from www.iplists.com:
# UA "Mediapartners-Google/2.1"
# UA "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# UA "gsa-crawler (Enterprise; S4-E9LJ2B82FJJAA; me@mycompany.com)"
209.185.108
209.185.253
209.85.238
209.85.238.11
209.85.238.4
...
As you can see, some lines contain a user-agent string, while others contain an IP address or a subnet.
Method parameters:
$bot_id - the bot's short identifier, which will be returned to you when this bot is recognized.
$url - the local file name or internet URL to load.
$malicious - the "malicious" flag for the bot being loaded.
ImportBotsFromUrl() parses the file, collects all user-agent strings and IP addresses, merges IP ranges to optimize the database,
and loads the result into the SQL table or into memory (depending on the active storage mode).
IP "addresses" like 209.85.238 (which identify a subnet) are converted to the corresponding range (209.85.238.0 - 209.85.238.255).
Note that all overlapping, nested or "adjacent" IP ranges are merged.
For example, the collection of ranges 64.68.80, 64.68.81, 64.68.82 is converted to the single IP range 64.68.80.0 - 64.68.82.255.
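The subnet expansion and range merging described above can be sketched like this. The helper names (prefix_to_range(), octets_to_int(), merge_ranges()) are hypothetical and the code is not taken from bot_recognizer.php; it only illustrates the idea, and it uses closures, so it needs PHP 5.3 or later.

```php
<?php
// Convert "a.b.c.d", or a shorter prefix such as "a.b.c", to
// array(first, last) of integer addresses.
function prefix_to_range($prefix) {
    $parts = explode('.', $prefix);
    $low  = array_pad($parts, 4, '0');   // missing octets start at 0
    $high = array_pad($parts, 4, '255'); // and end at 255
    return array(octets_to_int($low), octets_to_int($high));
}

function octets_to_int(array $o) {
    return ((int)$o[0] * 16777216) + ((int)$o[1] * 65536)
         + ((int)$o[2] * 256) + (int)$o[3];
}

// Merge overlapping, nested and adjacent ranges.
function merge_ranges(array $ranges) {
    usort($ranges, function ($a, $b) {
        if ($a[0] == $b[0]) return 0;
        return ($a[0] < $b[0]) ? -1 : 1;
    });
    $merged = array();
    foreach ($ranges as $r) {
        $last = count($merged) - 1;
        if ($last >= 0 && $r[0] <= $merged[$last][1] + 1) {
            // Touches or overlaps the previous range: extend it.
            $merged[$last][1] = max($merged[$last][1], $r[1]);
        } else {
            $merged[] = $r;
        }
    }
    return $merged;
}

$ranges = array_map('prefix_to_range', array('64.68.81', '64.68.80', '64.68.82'));
$merged = merge_ranges($ranges);
echo count($merged), "\n";
```

Sorting by the starting address first means each range only has to be compared with the last merged one, so the whole merge is a single pass.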
AddBotDefinition($botid, $ipfrom, $ipto=0, $useragent='', $bottype=0, $malicious=0) -
adds one definition to the current storage (memory if "file storage" is active, or the SQL table {prefix}bot_definitions otherwise).
$botid - string bot identifier that will be returned by the GetBotId() method.
$ipfrom - starting IP address, as an integer value! An IP address in octet form "nnn.nnn.nnn.nnn" must be converted with FromIpToX32().
$ipto - ending IP address. If empty, the registered IP range is $ipfrom...$ipfrom.
$useragent - substring that can be found in this bot's user-agent string. For example, "msn" for Microsoft MSN bots.
$bottype - the type of this bot (according to your own classification).
$malicious - whether this bot is malicious (0 or 1).
SetHandlerForBots($callbackfnc,$botlist) - registers a callback function that will be called from the Dispatch() method when one of the
bots from the passed list is recognized.
$callbackfnc - string, the name of an existing function to be called.
$botlist - string or array containing the list of bot ids to be "dispatched". A string value can hold more than one bot id, delimited with
any of the characters "|" "," ";" .
These values for $botlist are equivalent: "msn|google|yahoo" and array('msn','google','yahoo').
If you want to register a callback function for all undefined bots, use this call:
SetHandlerForBots($callbackfnc,CBotRecognizer::UNDEFINED_BOT).
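The equivalence of the string and array forms of $botlist can be illustrated with a small sketch. The function normalize_bot_list() is hypothetical and only shows one way such a list could be split; the class may do it differently.

```php
<?php
// Hypothetical helper: normalizes a bot list given as a string or an array.
function normalize_bot_list($botlist) {
    if (is_array($botlist)) {
        return array_values($botlist);
    }
    // Split on any of the delimiters "|", "," or ";" and drop empty items.
    return preg_split('/[|,;]/', (string)$botlist, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(normalize_bot_list('msn|google;yahoo'));
```

Inside a character class the "|" needs no escaping, so the three delimiters can be listed as they are.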
SetHandlerForTypes($callbackfnc,$bottype) registers one handler function for all bots of one "type".
The bot type is whatever you set in your registered bot definitions, so there are no limitations here.
You can use type "0" for "indexing" bots (as the major type), type "1" for email harvesters, type "3" for spam bots, etc. Once all bots are sorted into these types,
you can use SetHandlerForTypes instead of (or together with) SetHandlerForBots.
If a unique handler function is registered for some bot, it is called rather than the function for all bots of its type.
SetMaliciousHandler($funcname) sets one callback function for ALL bots that have a non-zero "malicious" flag.
If set, this function is called for all malicious bots regardless of their ids or types and of any callback functions registered for them.
Note that SetHandlerForBots() (and optionally SetMaliciousHandler()) need to be called only if you use the Dispatch() method.
IsMaliciousBot() returns true if detected bot is marked as malicious.
GetBotType() returns detected bot type.
Dispatch() performs dispatching: it gets the bot id (by calling the GetBotId() method) and calls the corresponding callback function.
For a malicious bot, the function registered by SetMaliciousHandler() is called instead of any other.
If you have set "worktime" in the class constructor, the current time is checked against the "worktime" interval; outside it, Dispatch() does nothing.
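The handler priority described in this section (malicious handler first, then a per-bot handler, then a per-type handler) can be sketched as follows. The function resolve_handler(), the layout of the $h array and the handler names BadBots and IndexingBots are all hypothetical; the class's internals may be organized differently.

```php
<?php
// Hypothetical sketch: picks the callback name that would be invoked.
function resolve_handler($botid, $bottype, $malicious, array $h) {
    if ($malicious && $h['malicious'] !== null) {
        return $h['malicious'];       // overrides everything else
    }
    if (isset($h['bots'][$botid])) {
        return $h['bots'][$botid];    // handler unique to this bot
    }
    if (isset($h['types'][$bottype])) {
        return $h['types'][$bottype]; // handler shared by the bot's type
    }
    return null; // assumption: with no handler registered nothing is called
}

$handlers = array(
    'malicious' => 'BadBots',
    'bots'      => array('google' => 'GoogleBots'),
    'types'     => array(0 => 'IndexingBots'),
);
echo resolve_handler('google', 0, 0, $handlers), "\n";
```

A real dispatcher would then invoke the returned name, for example with call_user_func().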
GetErrorMessage() returns the error message from the last operation (or an empty string if no error occurred).
Additional methods
FromIpToX32($ipaddr) converts the passed string with an IP address in octet notation (212.56.200.111) to the corresponding
integer value. This is the analog of the MySQL function INET_ATON().
FromX32ToIp($ipx32) performs the reverse conversion, from an integer to "nnn.nnn.nnn.nnn"; it is the analog of MySQL INET_NTOA().
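The two conversions amount to simple base-256 arithmetic. The sketch below uses the hypothetical names ip_to_x32() and x32_to_ip() as stand-ins; the class's own implementation may differ. PHP's built-in ip2long() and long2ip() do the same job, though ip2long() can return negative values on 32-bit builds.

```php
<?php
// Hypothetical stand-in for FromIpToX32(): "a.b.c.d" to an integer value.
function ip_to_x32($ipaddr) {
    $o = explode('.', $ipaddr);
    return ((int)$o[0] * 16777216) + ((int)$o[1] * 65536)
         + ((int)$o[2] * 256) + (int)$o[3];
}

// Hypothetical stand-in for FromX32ToIp(): integer value back to "a.b.c.d".
function x32_to_ip($ipx32) {
    $octets = array();
    for ($i = 0; $i < 4; $i++) {
        // Take the lowest octet, then drop it from the value.
        array_unshift($octets, (int)fmod($ipx32, 256));
        $ipx32 = floor($ipx32 / 256);
    }
    return implode('.', $octets);
}

$n = ip_to_x32('212.56.200.111');
echo $n, ' ', x32_to_ip($n), "\n";
```

Multiplication is used instead of bit shifts so that large addresses stay positive even where PHP integers are 32 bits wide.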
Links
Change log
1.00.001 (10/02/2009)