Version: @(#) $Id: markup_parser.php,v 1.66 2009/10/31 20:52:17 mlemos Exp $
Markup parser
Manuel Lemos (mlemos-at-acm.org)
Copyright © Manuel Lemos 2009
Parse HTML and other markup based documents.
Use the StartParsing function to initialize the parser. Then call the Parse function to make the class parse the markup data, which may be read from files. When you have fed the whole document data to the parser, call the FinishParsing function.
The Parse function returns arrays of tokens that describe each document element. The RewriteElement function can be used to convert the tokens back to markup document strings.
Element tokens are associated with their respective positions in the document. Positions are numbers that represent offsets relative to the beginning of the document. The GetPositionLine function can return the line and column number associated with a given document position if the track_lines variable is set to 1.
The ParseDTDExpressionValue and ParseAttributeList functions can be used to parse expressions that may appear in DTD markup elements.
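The call sequence described above can be sketched as a small driver loop. This is only an illustration under stated assumptions: StubParser is a hypothetical stand-in that lets the sketch run without markup_parser.php, its 'stub' element key and its error variable name are invented, and it does not reproduce the real token format. The parse_all helper is also hypothetical.

```php
<?php
// Hypothetical stand-in so the sketch runs without markup_parser.php.
// It mimics only the call shapes described above, not the real tokens.
class StubParser
{
    public $error = '';          // assumed name for the error message variable
    private $chunks = array();

    function StartParsing($parameters)
    {
        if (!isset($parameters['Data'])) {
            $this->error = 'no Data option was given';
            return 0;
        }
        // Pretend that each line of the data is returned by one Parse call.
        $this->chunks = explode("\n", $parameters['Data']);
        return 1;
    }

    function Parse(&$end, &$elements)
    {
        // 'stub' is an invented key, not a real token entry.
        $elements = array(array('stub' => array_shift($this->chunks)));
        $end = count($this->chunks) === 0 ? 1 : 0;
        return 1;
    }

    function FinishParsing()
    {
        return 1;
    }
}

// Feed a document through any object that exposes the three functions and
// collect every element array that Parse returns.
function parse_all($parser, $parameters, &$all_elements)
{
    $all_elements = array();
    if (!$parser->StartParsing($parameters)) {
        return 0;
    }
    do {
        if (!$parser->Parse($end, $elements)) {
            return 0;
        }
        $all_elements = array_merge($all_elements, $elements);
    } while (!$end);
    return $parser->FinishParsing();
}

$ok = parse_all(new StubParser, array('Data' => "<p>\nhello\n</p>"), $all);
```

With the real class, the same loop shape would apply to an object created from markup_parser.php in place of the stub.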
string
''
Store the message that is returned when an error occurs.
Check this variable to understand what happened when a call to any of the class functions has failed.
This class uses cumulative error handling. This means that if a class function that may fail is called while this variable is already set to an error message, due to a failure in a previous call to the same or another function, the function also fails and does nothing.
This allows programs using this class to safely call several functions that may fail and only check the failure condition after the last function call.
Just set this variable to an empty string to clear the error condition.
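The cumulative error handling pattern can be illustrated with a minimal sketch. CumulativeExample and its Step function are hypothetical and are not part of the markup parser; they only show how a pending error makes later calls fail, and how clearing the variable resumes normal operation.

```php
<?php
// Hypothetical class illustrating cumulative error handling; it is not
// part of the markup parser.
class CumulativeExample
{
    public $error = '';
    public $log = array();

    function Step($name, $should_fail = false)
    {
        // A pending error makes every later call fail without doing work.
        if (strlen($this->error)) {
            return 0;
        }
        if ($should_fail) {
            $this->error = "step $name failed";
            return 0;
        }
        $this->log[] = $name;
        return 1;
    }
}

$example = new CumulativeExample;
$example->Step('first');
$example->Step('second', true);   // sets the error
$example->Step('third');          // skipped because the error is still set
$first_error = $example->error;   // checked once, after the last call

$example->error = '';             // clear the error condition
$example->Step('fourth');         // runs normally again
```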
int
0
Store the code that is returned when an error occurs.
Check this variable to understand what happened when a call to any of the class functions has failed. It may be set to several possible error codes defined as constants:
MARKUP_PARSER_ERROR_NONE - No error happened
MARKUP_PARSER_ERROR_UNEXPECTED - A condition was found that the class is not yet ready to handle
MARKUP_PARSER_ERROR_INVALID_SYNTAX - A syntax error was found
MARKUP_PARSER_ERROR_INVALID_USAGE - An invalid value was passed to a class function parameter or assigned to a class variable
int
-1
Position in the markup data or file to which the last error refers.
Check this variable to determine the relevant position of the document when a parsing error occurs. A negative value indicates that there was no error or that the last error is not associated with a specific document position.
int
8000
Maximum length of the chunks of markup data read from files that the class parses at one time.
Adjust this value according to the available memory.
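To illustrate why the chunk length bounds memory use, here is a minimal sketch of reading a file in fixed-size chunks. The read_in_chunks function is hypothetical and does not show how the class itself reads files; the tiny chunk length of 4 is chosen only to make the splitting visible.

```php
<?php
// Read a file in chunks of at most $chunk_length bytes and hand each chunk
// to a callback, so the whole file is never held in memory at once.
function read_in_chunks($file_name, $chunk_length, $callback)
{
    if (!($file = fopen($file_name, 'rb'))) {
        return 0;
    }
    while (!feof($file)) {
        $chunk = fread($file, $chunk_length);
        if ($chunk === false) {
            fclose($file);
            return 0;
        }
        if ($chunk !== '') {
            $callback($chunk);
        }
    }
    fclose($file);
    return 1;
}

// Demonstration with a temporary file and a deliberately tiny chunk length.
$name = tempnam(sys_get_temp_dir(), 'mp');
file_put_contents($name, '<b>bold</b>');
$chunks = array();
$read = read_in_chunks($name, 4, function ($chunk) use (&$chunks) {
    $chunks[] = $chunk;
});
unlink($name);
```

Note that a chunk boundary can fall in the middle of a tag, so a consumer of such chunks has to carry incomplete data over to the next chunk.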
bool
1
Specify whether the class should ignore syntax errors in malformed documents.
Set this variable to 0 if it is necessary to verify whether the markup data may be corrupted due to possible bugs in the program that generated the document.
Currently the class only ignores some types of syntax errors. Other syntax errors may still cause the Parse function to fail.
array
array()
Return a list of positions of the original document that contain syntax errors.
Check this variable to retrieve any document syntax errors that were ignored when the ignore_syntax_errors variable is set to 1.
The indexes of this array are the positions of the errors. The array values are the corresponding syntax error messages.
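As a small illustration of that array layout, the sketch below turns a position-indexed list of messages into a report in document order. The $syntax_errors sample, including its positions and messages, is hypothetical and does not come from a real parse.

```php
<?php
// Hypothetical sample shaped like the array described above:
// document position => syntax error message.
$syntax_errors = array(
    57 => 'example message A',
    12 => 'example message B',
);

// Sort by position (the array key) so the report follows document order.
ksort($syntax_errors);

$report = array();
foreach ($syntax_errors as $position => $message) {
    $report[] = sprintf('position %d: %s', $position, $message);
}
```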
bool
1
Tell the class to return the position of each document element token.
Set this variable to 0 if you do not need to know the position of each parsed markup element.
bool
0
Tell the class to keep track of the position of each document line.
Set this variable to 1 if you need to determine the line and column number associated to a given position of the parsed document.
bool
1
Tell the class to lower the case of tag and attribute names in the RewriteElement function.
Set this variable to 0 when you want to preserve the original case of the tags and attributes being rewritten.
bool
1
Tell the class to always quote attribute values in the RewriteElement function.
Set this variable to 0 when you want attribute values to be quoted only when they contain spaces, tabs or line break characters.
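The quoting rule can be sketched as follows. The quote_attribute_value function is hypothetical and covers only the decision of whether to add quotes; escaping of special characters inside the value is deliberately left out, and the actual RewriteElement behavior may differ in details.

```php
<?php
// Sketch of the quoting rule only; the function name is hypothetical and
// escaping of special characters inside the value is out of scope here.
function quote_attribute_value($value, $always_quote = 1)
{
    // strpbrk returns false when none of the listed characters occur.
    $has_whitespace = strpbrk($value, " \t\r\n") !== false;
    if ($always_quote || $has_whitespace) {
        return '"' . $value . '"';
    }
    return $value;
}

$always = quote_attribute_value('left');            // quoted by default
$bare = quote_attribute_value('left', 0);           // no whitespace, left bare
$needed = quote_attribute_value('two words', 0);    // whitespace forces quotes
```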
bool
0
Tell the class to decode all the character entities in character data or tag attributes.
Set this variable to 1 if you need to get all the character data or tag attributes with character entities already decoded.
bool
1
Tell the class to allow grave accent characters as delimiters for quoted tag attributes.
Set this variable to 0 if you want the class to be strict and not accept grave accent characters to quote tag attribute values.
bool GetPositionLine(
Get the line and column numbers of the document that correspond to a given position.
Pass the document offset number as the position to be located. Make sure the track_lines variable is set to 1 before parsing the document.
position - Position of the line to be located.
line - Returns the number of the line that corresponds to the given document position.
column - Returns the number of the column of the line that corresponds to the given document position.
This function returns 1 if the track_lines variable is set to 1 and the given position is a valid positive number that does not exceed the position of the last parsed document line.
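One common way to map an offset to a line and column is to record the offset at which each line starts and then search that list. The sketch below shows this technique; it is not the class implementation. The function names are hypothetical, only "\n" is treated as a line break, and counting lines and columns from 1 is an assumption.

```php
<?php
// Record the offset at which every line starts. Only "\n" is treated as a
// line break in this sketch.
function line_starts($data)
{
    $starts = array(0);
    $offset = 0;
    while (($newline = strpos($data, "\n", $offset)) !== false) {
        $offset = $newline + 1;
        $starts[] = $offset;
    }
    return $starts;
}

// Map a document offset to a line and column, both counted from 1 here
// (an assumption). Returns 0 for positions outside the document.
function position_to_line($starts, $length, $position, &$line, &$column)
{
    if ($position < 0 || $position >= $length) {
        return 0;
    }
    // Binary search for the last line start that is <= $position.
    $low = 0;
    $high = count($starts) - 1;
    while ($low < $high) {
        $middle = intdiv($low + $high + 1, 2);
        if ($starts[$middle] <= $position) {
            $low = $middle;
        } else {
            $high = $middle - 1;
        }
    }
    $line = $low + 1;
    $column = $position - $starts[$low] + 1;
    return 1;
}

$data = "<p>\nhi</p>\n";
$starts = line_starts($data);
// Offset 5 is the "i" of "hi", the second character on the second line.
$found = position_to_line($starts, strlen($data), 5, $line, $column);
```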
bool ParseDTDExpressionValue(
Parse the value of an element expression used in a DTD.
Use only if you need to expand entity values when parsing DTDs.
value - DTD expression value to be parsed.
expression - Array that defines the types and values of the parsed DTD expression.
Returns 1 if the given value is a valid DTD expression value.
bool ParseAttributeList(
Parse the value of an attribute list expression used in a DTD.
Use only if you need to expand attribute list values when parsing DTDs.
value - Attribute list expression value to be parsed.
attlist - Array that defines the types and values of the parsed DTD attribute list expression.
Returns 1 if the given value is a valid DTD attribute list expression value.
bool StartParsing(
Initialize the state of the markup parser.
Call this function before you start parsing the markup document, passing the file name or the data to be parsed and, optionally, other parsing option parameters.
parameters - Specifies a list of options that define how to parse the given document. Currently it has the following options:
Data - String with the markup data to be parsed
File - Name of the file from which the data to be parsed should be read instead of a static string.
DecodeEntities - Alternative way to set the option that determines whether the class should decode character entities, as described for the decode_entities variable.
Returns 1 if all parameters are correctly defined.
bool Parse(
Parse the markup document.
Call this function iteratively until the end argument is returned set to 1.
end - Indicates whether the parser has reached the end of the document.
elements - Return a sequence of associative arrays with entries that describe each document element that was parsed.
Returns 1 if there were no fatal parsing errors.
bool FinishParsing()
Close any files and release any resources allocated while the document was being parsed.
Call this function after you are done with parsing the markup document.
Returns 1 if all resources were successfully released.
bool RewriteElement(
Generate a string for a previously parsed document markup element.
Call this function for each markup element when you want to regenerate an element that was just parsed and possibly filtered.
element - Associative array that defines the type and the values of the document element to be rewritten.
markup - Return the string of the rewritten document element.
Returns 0 if it is passed an invalid element definition.