Introduction


The RtfTools package is a set of free PHP classes that operate on files containing Microsoft Rich Text Format data (*.rtf).

Each class has been designed to accomplish a specific task that may be useful if you have to process Rtf files in various ways :

  • RtfMerger : merges several Rtf documents together and generates a single output document in Rtf format.
  • RtfTemplater : processes Rtf documents that use a templating language to generate a customized output Rtf document.
    The templating language supports variables expansions, expressions evaluations, IF/ELSE constructs and FOR loops.
    The RtfTemplater class can also be used when merging multiple documents together, which makes it possible to generate customized mailings for mass printing.
  • RtfTexter : extracts text from Rtf documents, mainly for indexing purposes, with some basic formatting capabilities.
  • RtfParser : a generic Rtf parser that can be used if you need more advanced capabilities when interpreting Rtf document contents.
  • RtfBeautifier : a pretty-printer that reformats Rtf contents in a human-readable way, so that you will be able to easily compare the contents of two Rtf files by using the Unix diff or Windows windiff tools.

All the classes in the RtfTools package have been designed to be able to process Rtf documents that may be larger than the available memory. This is especially useful when you need to handle several Rtf documents at once, whose total size may exceed your current PHP memory limit.

However, you always have the choice between a version that relies on an underlying file (which allows processing files bigger than the available memory) and its twin version, which operates on Rtf contents stored as a string in memory (which allows faster processing).

For example, the RtfTemplater class is simply an abstract class which has two derived classes :

  • RtfStringTemplater, which first loads the contents of an Rtf document into memory before processing the macro-language constructs that it contains
  • RtfFileTemplater, which loads the contents of an Rtf document by blocks (blocks have a default size of 16 kilobytes) while processing on-the-fly the macro-language constructs that the document contains.

The same dichotomy exists for all the other classes, with the exception of RtfMerger.

The Overview of the RtfTools classes section provides more information about the hierarchy of the RtfTools classes, especially regarding when to use the derived classes that operate on string contents and the ones that operate directly on files.

You will also find an Examples section, that gives a general overview on how to use the classes. If you would like examples that sound a little bit more like real life, using real sample Rtf files, you can also have a look at the examples directory in the .zip file containing the latest release of the distribution (which can be downloaded here : http://www.rtftools.net/download.php?version=latest)

Finally, the Reference section gives a complete description of the RtfTools classes, their properties and their methods.

Licensing

The applicable licensing scheme for using this package is GPL V3.

Prerequisites

This package requires PHP >= 5.6.

Installation

There is no particular installation process. Just extract the files located in the sources directory of the .ZIP archive to your preferred include directory location. You can also extract the whole archive if you like.

Overview of the RtfTools classes


This section will give you an overview of how the classes in the RtfTools package are organized and why they were organized this way.

You will discover that most classes come in two versions : a string-based version which operates on a whole Rtf document directly loaded into memory, and a file-based version that loads chunks of data from an Rtf document.
Both versions provide exactly the same features, but they differ in terms of memory and cpu usage ; their constructors, of course, are also different, since they do not require the same parameters.

Finally, you will find a short discussion about when to choose the string-based version and when to choose the file-based version.

Design requirements

Classes of this package have been designed with the following requirements in mind :

  1. They should be able to process Rtf documents bigger than the available memory. This has been achieved by using the SearchableFile class, available in this package.
  2. The Rtf classes should be more or less accessible as strings and provide the ArrayAccess, Countable and Iterator interfaces.
    Additionally, they also provide string-oriented functions that have a native PHP equivalent, such as substr, strpos, etc.
  3. Rtf documents should be processed either from a string or from a file.

    Of course, it would have been ideal to have a single class able to handle both cases, by providing it with a stream wrapper for either string or file contents. But this would have contradicted requirement (2), which aims at providing a string-like interface for both string- and file-based versions.

    Another more or less ideal approach would have been to use multiple inheritance, but PHP does not support it.

    Given these requirements and technical constraints, the final choice was to use traits.

  4. The file-based version should not have too much overhead when compared to the string-based version of the same class : here again, the SearchableFile class is of great help, since it implements some kind of buffer cache, which holds in memory the most recent blocks that have been read so far.
    Actually, most systems I've tested show a 2- to 4-times performance difference (i.e., file-based versions are 2 to 4 times slower than their string-based counterparts). However, I have seen a desktop PC with a very good I/O subsystem ; on this platform, the performances were quite similar.

Introduction to the RtfTools class hierarchy

Maybe the easiest way to understand how the classes of the RtfTools package are organized is to start from the root, the RtfDocument class ; the diagram below, which uses a home-made formalism, describes the origin of it all :



Diagram explanations

The diagram above needs a few explanations :

  • The gray shapes represent entities that cannot be directly instantiated as objects : interfaces, abstract classes and traits
  • The yellow shapes represent entities that can be instantiated (classes).
  • Each line between two shapes represents a relationship, with a direction ; each line has a label that explains the kind of relationship.
    For example, the line labelled implements between the IRtfDocument and RtfDocument shapes means : "The RtfDocument class implements the IRtfDocument interface".
    Similarly, the line labelled uses between the RtfFileDocument class and the RtfFileSupport trait means : "The RtfFileDocument class uses the RtfFileSupport trait". The has member line shows that the RtfFileSupport trait has a member of type SearchableFile, which is used to operate on the underlying Rtf document.

The RtfDocument class hierarchy

Of course, the diagram above was not only an example ; it describes the various components that are articulated around the base abstract class, RtfDocument.

This diagram shows that the RtfDocument class implements the IRtfDocument interface. In reality, this is not completely true : the actual implementation of the IRtfDocument interface has been delegated to two traits, RtfStringSupport and RtfFileSupport.

But here comes the dichotomy : at the next abstraction level, the RtfDocument class splits into two final versions : RtfStringDocument and RtfFileDocument. The first one will load the contents of an Rtf document entirely into memory, while the second one will read the document contents from disk, only when they are needed.

The first approach is focused on performance, while the second one is focused on reducing memory usage.
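
The delegation described above can be sketched in a few lines of PHP. This is only an illustration of the pattern, not the package's source code : every name starting with Sketch is hypothetical, the interface is reduced to two methods, and the code assumes PHP 7.1 or later for the type declarations.

```php
<?php
// Illustration only: all Sketch* names are hypothetical, not RtfTools classes.

interface SketchDocumentInterface
{
    public function strlen(): int;
    public function substr(int $start, ?int $length = null): string;
}

// The abstract base declares the interface but implements none of its methods.
abstract class SketchDocument implements SketchDocumentInterface
{
    public $Name = '';
}

// String flavor: the whole document lives in memory.
trait SketchStringSupport
{
    private $data = '';

    protected function loadString(string $data): void
    {
        $this->data = $data;
    }

    public function strlen(): int
    {
        return strlen($this->data);
    }

    public function substr(int $start, ?int $length = null): string
    {
        $length = min($length ?? strlen($this->data), strlen($this->data) - $start);
        return $length > 0 ? (string) substr($this->data, $start, $length) : '';
    }
}

// File flavor: contents are read from disk only when they are requested.
trait SketchFileSupport
{
    private $fp;
    private $size = 0;

    protected function openFile(string $path): void
    {
        $fp = @fopen($path, 'rb');
        if ($fp === false) {
            throw new RuntimeException("Cannot open $path");
        }
        $this->fp   = $fp;
        $this->size = (int) fstat($fp)['size'];
        $this->Name = $path;
    }

    public function strlen(): int
    {
        return $this->size;
    }

    public function substr(int $start, ?int $length = null): string
    {
        $length = min($length ?? $this->size, $this->size - $start);
        if ($length <= 0) {
            return '';
        }
        fseek($this->fp, $start);
        return (string) fread($this->fp, $length);
    }
}

final class SketchStringDocument extends SketchDocument
{
    use SketchStringSupport;

    public function __construct(string $data)
    {
        $this->loadString($data);
    }
}

final class SketchFileDocument extends SketchDocument
{
    use SketchFileSupport;

    public function __construct(string $path)
    {
        $this->openFile($path);
    }
}
```

Both concrete classes expose the same two methods, so calling code does not need to know which flavor it received ; only the constructor argument differs (Rtf data for the string flavor, a file name for the file flavor).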

Classes derived from RtfDocument

Based on this modeling approach, most of the specialized classes of the RtfTools package roughly follow the same scheme. An example is given below for the RtfTemplater class :


The above diagram shows that the RtfTemplater class inherits from the RtfDocument one ; like its parent, it is an abstract class that is later specialized into two classes, RtfStringTemplater and RtfFileTemplater.

With the exception of the RtfMerger class, all other classes inherit more or less directly from RtfDocument.

String-based vs file-based classes

Now that we have understood the dichotomy between the string-based and file-based classes, there is one big question that may come to your mind : "How do I choose between a string-based version and a file-based one ?". Here are a few hints, which are not to be taken as absolute truths :

Choose the string-based version of an RtfDocument class when :

  • You know that your memory limit will never be reached by the files you have to process
  • You want higher performance

Conversely, choose the file-based version when :

  • You know that some of the files you have to process may exhaust your available memory due to their size
  • You are not that much concerned with performance ; this is the case for example of batch scripts

Whichever solution you choose, please keep in mind that the API remains exactly the same, whether you use the string-based version of a class or its file-based counterpart.
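
If you want to make this decision at run time, one possible approach is to compare the size of the input file with the PHP memory limit. The sketch below is only a rough heuristic : both function names are hypothetical, and the safety factor of 4 is an arbitrary assumption (a document usually needs more memory than its size on disk once it is being processed).

```php
<?php
// Hypothetical helpers; the default safety factor of 4 is an arbitrary assumption.

// Converts a php.ini size such as "128M" to bytes; "-1" means "no limit".
function sketch_ini_bytes(string $value): int
{
    $value = trim($value);
    if ($value === '' || $value === '-1') {
        return PHP_INT_MAX;
    }
    $number = (int) $value;     // "128M" is cast to 128
    switch (strtoupper(substr($value, -1))) {
        case 'G': return $number * 1024 * 1024 * 1024;
        case 'M': return $number * 1024 * 1024;
        case 'K': return $number * 1024;
        default:  return $number;
    }
}

// Returns 'string' when the file comfortably fits in memory, 'file' otherwise.
function sketch_pick_flavor(int $fileSize, int $memoryLimit, int $safetyFactor = 4): string
{
    return $fileSize * $safetyFactor < $memoryLimit ? 'string' : 'file';
}
```

A caller could pass filesize ( 'sample.rtf' ) and sketch_ini_bytes ( ini_get ( 'memory_limit' ) ), then instantiate either the string-based or the file-based class depending on the result.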


Examples


You will find below a few examples on how to use the various classes from the RtfTools package.

You will also find running examples in the examples directory of the .ZIP archive containing the RtfTools package.

Processing a template Rtf document

Merging multiple Rtf documents together

Merging Rtf files is fairly simple ; first, create an instance of the RtfMerger class ; you can supply a list of files to be merged together, or add them later by calling the Add() method :

    include ( 'path/to/RtfMerger.phpclass' ) ;

    $merger = new RtfMerger ( 'sample1.rtf', 'sample2.rtf' ) ;
    $merger -> Add ( 'sample3.rtf' ) ;

The above example specified the names of the files to be merged ; but you can also give objects inheriting from the RtfDocument class, such as in the example below :

    $merger = new RtfMerger ( ) ;
    $merger -> Add ( new RtfFileDocument ( 'sample3.rtf' ) ) ;
    $merger -> Add ( new RtfStringDocument ( file_get_contents ( 'sample4.rtf' ) ) ) ;

    $template_variables = [ 'a' => 'this is variable A', 'b' => 'this is variable b' ] ;
    $merger -> Add ( new RtfFileTemplater ( 'sample5.rtf', $template_variables ) ) ;

Related class : RtfMerger

Extracting text from an Rtf document

Extracting text from an Rtf document is easy ; the following example extracts plain text contents from files "sample1.rtf" and "sample2.rtf", and puts them in files "sample1.txt" and "sample2.txt", respectively. The plain text contents of file "sample2.rtf" are echoed on the standard output :

    include ( 'path/to/RtfTexter.phpclass' ) ;

    // Use the string-based version of the class for the first file
    $contents = file_get_contents ( 'sample1.rtf' ) ;
    $doc = new RtfStringTexter ( $contents ) ;
    $doc -> SaveTo ( 'sample1.txt' ) ;

    // Use the file-based version of the class for the second file
    $doc = new RtfFileTexter ( 'sample2.rtf' ) ;
    echo $doc -> AsString ( ) ;
    $doc -> SaveTo ( 'sample2.txt' ) ;

Related class : RtfTexter

Pretty-printing Rtf document contents

The following example processes two files, sample1.rtf and sample2.rtf, and writes their pretty-printed output to files sample1.txt and sample2.txt, respectively :

    include ( 'path/to/RtfBeautifier.phpclass' ) ;

    // Use the string-based version of the class for the first file
    $contents = file_get_contents ( 'sample1.rtf' ) ;
    $doc = new RtfStringBeautifier ( $contents ) ;
    $doc -> SaveTo ( 'sample1.txt' ) ;

    // Use the file-based version of the class for the second file
    $doc = new RtfFileBeautifier ( 'sample2.rtf' ) ;
    $doc -> SaveTo ( 'sample2.txt' ) ;

Now, if you are running Unix, you can type the following command to compare the contents of both documents :

$ diff sample1.txt sample2.txt | more

On Windows systems, you can use the Windiff command, which graphically displays its comparison results :

C:\ > windiff sample1.txt sample2.txt

(the windiff command can be downloaded here : http://www.grigsoft.com/download-windiff.htm)

Related class : RtfBeautifier

Parsing an Rtf file


Class reference


RtfDocument class

The RtfDocument class is an abstract class from which all other classes of the RtfTools package inherit (with the exception of RtfMerger).

It supports the IRtfDocument interface, but it does not implement the methods declared in it : this role is delegated to the RtfStringSupport and RtfFileSupport traits, that are later used by specialized (non-abstract) classes such as RtfStringDocument and RtfFileDocument.

You may notice that there is a mix of naming conventions for method names ; some use joined words with their first letter uppercased, some use lowercase words separated by underscores.
The AsString and SaveTo methods were initially the only methods designed for public usage, but some other methods were later considered to be of interest to the outside world ; this is why :

  • Methods with uppercased first letters in their name, such as GetCompoundTag, are public static methods that can operate on fragments of Rtf code
  • Methods whose name consists only of lowercase words separated by underscores were initially private, but were considered to be of potential interest to the outside world. They operate on instances of RtfTools classes.


Class diagram

If you have read the Overview of the RtfTools classes section, then you are already familiar with this diagram :



Constructor

The RtfDocument constructor does not accept any parameters of its own ; it simply delegates instantiation to the __specialized_construct method of the RtfStringSupport and RtfFileSupport traits, passing all the arguments it received.

You can have a look at the String and File support traits section later in this chapter for an explanation of how parent and derived class constructors pass their parameters to each other.

Methods

public static function DecodeSpecialChars ( $contents, $convert_accents = false )

Decodes characters using the Rtf notation \'xy, where x and y are hexadecimal digits, and replaces them with their Ansi counterparts.

Parameters :

  • $contents (string) :
    Rtf text to be decoded.
  • $convert_accents (boolean) :
    When true, escape sequences representing accented characters will be replaced with their unaccented ascii equivalent.

Return value :

Returns the input text, with all special characters converted.

Notes :

The following conversions apply :
  • Accented characters are replaced with their unaccented counterparts if the $convert_accents parameter is true.
  • All Rtf constructs specifying single or double quotes are replaced with their ascii equivalent
  • Non-breaking spaces (\~) are replaced with a single space
  • Carriage returns and newlines are suppressed
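
To make the \'xy notation more concrete, here is a minimal sketch of the first step only (replacing each \'xy sequence with the byte it designates). It is not the implementation of DecodeSpecialChars : the function name is hypothetical, and quotes, non-breaking spaces, newlines and accent conversion are deliberately left out.

```php
<?php
// Hypothetical helper, not the library's DecodeSpecialChars implementation.
function sketch_decode_hex_escapes(string $rtf): string
{
    return (string) preg_replace_callback(
        "/\\\\'([0-9a-fA-F]{2})/",              // a backslash, a quote, two hex digits
        function (array $match): string {
            return chr(hexdec($match[1]));      // the byte with that character code
        },
        $rtf
    );
}
```

The resulting byte is expressed in the Ansi code page of the document ; converting it to another encoding such as UTF-8 would be a separate step.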

public static function EscapeString ( $value )

Some strings (designated as #PCDATA in the Rtf specifications) may contain characters that could be interpreted as Rtf instructions ; such characters are :

  • "\", which starts the name of a tag, such as "\rtf1"
  • "{", which starts a nested construct
  • "}", which ends a nested construct
All these characters must be represented as "\\", "\{" and "\}", respectively

Parameters :

  • $value (string) :
    String to be escaped.

Return value :

Returns the escaped value.
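
The escaping rule itself fits in one line. The following sketch uses a hypothetical function name and is not taken from the package ; it relies on the array form of the builtin strtr() function, which processes the string in a single pass, so that the backslashes it inserts are not escaped a second time.

```php
<?php
// Hypothetical helper showing the escaping rule described above.
function sketch_escape_rtf(string $value): string
{
    return strtr($value, ['\\' => '\\\\', '{' => '\\{', '}' => '\\}']);
}
```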

public static function GetCompoundTag ( $data, $tag, $offset = 0, $include_tag = false )

Extracts a compound tag from Rtf data, handling multiple nesting levels if necessary.

For example, the color table present in the header part of an Rtf document has the following structure :

    {\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;
    \red0\green255\blue0;\red255\green0\blue255;}

The \colortbl tag is enclosed within curly braces. The GetCompoundTag method locates such a tag in the Rtf data supplied by the $data parameter, and returns the enclosed contents, without the curly braces.

Parameters :

  • $data (string) :
    Rtf data where the tag is to be searched.
  • $tag (string) :
    An Rtf tag, including the leading backslash, such as "\fonttbl".
  • $offset (integer) :
    Position in $data where the search should begin.
  • $include_tag (boolean) :
    When true, the searched tag will also be included in the output result.

Return value :

Returns the tag contents (including nested tags) if found, or false otherwise.

public static function get_document_start ( )

Rtf documents contain a header part and a body part. To process a document body, we need of course to be sure where the header part ends and where the body part starts.

Unfortunately, there is no precise point in an Rtf document that says : "this is the end of the header, and the start of the body part".

A header is made of several parts, such as tags that define the character set used globally in the document, as well as compound structures such as font tables, color tables, style sheets and so on.
Most of these elements are optional, which could render things a little bit tricky ; fortunately, there is a certain order to respect when specifying them. For example, the optional font table must appear, when specified, before the optional color table, which in turn must appear before the stylesheet table, etc.

The get_document_start method is able to locate the end of the very last part of a document header, which signals the start of the body part.

If no header end has been found (which should not happen except for very ill-formed documents), then the get_document_start method will try to locate the first \sectd (section start) or \pard (reset paragraph settings to their defaults).
In such a case, you may lose formatting styles that can occur just between the end of the header and the \sectd or \pard tags.

Return value :

Returns the byte offset, in the Rtf document, of the start of the document body.

public static function ToClosingDelimiter ( $string, $start = 0 )

Suppose that you have a compound statement such as the following font table, located inside Rtf contents :

    (some Rtf contents)
    {\fonttbl
        {\f1 ... {\panose ...} Times New Roman;}
        {\f2 ... Arial;}}
    (some other Rtf contents)

The ToClosingDelimiter method will find the last closing brace, provided that you supply the index of the opening brace in the Rtf document.

Although the RtfStringDocument and RtfFileDocument classes implement a to_closing_delimiter method that searches string or file contents until a closing brace has been found, it is sometimes handy to do it on a simple string. This is why there is also a generic closing delimiter search method that operates on strings, whatever the underlying document implementation looks like (string-based or file-based).

Like its specialized counterpart, this method is able to handle nested constructs.

Parameters :

  • $string (string) :
    A string containing Rtf data.
  • $start (integer) :
    The start of the compound construct whose ending brace is to be located. This parameter MUST point to the next character AFTER the opening brace.

Return value :

Returns the byte offset of the closing brace of the compound construct starting at offset $start - 1, or false if the supplied Rtf data has imbalanced nested opening/closing braces.
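
The search for a closing brace in nested constructs comes down to counting nesting levels. The sketch below follows the convention described above ($start points just after the opening brace) but it is only an assumption about how such a search can be written : the function name is hypothetical, and binary data introduced by the \bin tag is not handled.

```php
<?php
// Hypothetical sketch; returns the offset of the matching "}" or false.
function sketch_to_closing_delimiter(string $rtf, int $start = 0)
{
    $depth  = 1;                    // $start is located just after an opening brace
    $length = strlen($rtf);

    for ($i = $start; $i < $length; $i++) {
        switch ($rtf[$i]) {
            case '\\':
                $i++;               // skip the next character: \{ \} and \\ are not delimiters
                break;
            case '{':
                $depth++;
                break;
            case '}':
                if (--$depth === 0) {
                    return $i;
                }
                break;
        }
    }
    return false;                   // unbalanced braces
}
```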

public static function TwipsToCm ( $value )

Converts a value expressed in twips (1/1440 of an inch) to centimeters.
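
The arithmetic is straightforward : one inch is 1440 twips and 2.54 centimeters, so a value in twips is divided by 1440 and multiplied by 2.54 (which gives about 566.93 twips per centimeter). The constant and function names below are hypothetical and only illustrate the conversion.

```php
<?php
// Hypothetical names illustrating the twips-to-centimeters arithmetic.
const SKETCH_TWIPS_PER_INCH = 1440;
const SKETCH_CM_PER_INCH    = 2.54;

function sketch_twips_to_cm($twips): float
{
    return $twips / SKETCH_TWIPS_PER_INCH * SKETCH_CM_PER_INCH;
}
```

For instance, a width of 11906 twips corresponds to roughly 21 cm.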

Properties

protected $DecodingTable

This table is used by the DecodeSpecialChars method to replace special character specifications of the form \'xy (where x and y are hexadecimal digits providing an ascii code) with their ascii equivalent.

It also provides translations for the following characters :

  • Single quotes specified by the \lquote or \rquote tags, or by an Ascii character greater than 127, will be replaced with their ascii equivalent (ascii 0x27)
  • Double quotes, such as English quotes or French angle brackets, will be replaced with their ascii equivalent (ascii 0x22)
  • Non-breaking spaces specified by the \~ tag will be replaced by a single ascii space (0x20)
  • Carriage returns and line feeds are suppressed.

protected $DecodingTableWithAccents

This table is used by the DecodeSpecialChars method to replace accented characters with their ascii equivalent without accents, when the $convert_accents parameter is true.

public $Name

For file-based documents, contains the name of the supplied input Rtf document.
For string-based documents, contains an empty string.

protected $RecordSize

Contains the record size used when writing output documents.
The default record size is 4 MB for string-based classes, and 16 KB for file-based ones.

Constants

TWIPS_PER_CM

Number of twips per centimeter.

TOKEN_* constants

Each TOKEN_* constant represents a syntactic element of an Rtf file :
  • TOKEN_INVALID : Represents an invalid character encountered in the input stream. Such a situation should never happen, however.
  • TOKEN_LBRACE : An opening brace ("{"). This signals the start of a compound Rtf construct.
  • TOKEN_RBRACE : A closing brace ("}") that signals the end of a compound construct.
  • TOKEN_CONTROL_WORD : A control word, such as \par (paragraph start). The Microsoft Rtf Specifications use the general term of control word. In this help document, we use the more restrictive term tag when talking about constructs that start with a backslash followed by letters.
  • TOKEN_CONTROL_SYMBOL : Special kinds of control words that start with a backslash character but are not followed by a sequence of letters that form a word ; this is the case for example for \~ (unbreakable space), \- (optional hyphen), \_ (non-breaking hyphen), etc.
  • TOKEN_CHAR : An Ansi character specification of the form \'xy, where xy represents the character code.
  • TOKEN_ESCAPED_CHAR : One of the very few characters that need to be escaped when placed in plain text : \{, \} and \\.
  • TOKEN_PCDATA : The term PCDATA has the same meaning as the one specified by Microsoft : it specifies character (plain text) data.
  • TOKEN_SDATA : Hexadecimal data specified as a sequence of hexadecimal characters (0 through 9 and A through F). Such a sequence can be found in picture elements (\pict tags) where the text found before the last closing brace represents the image data itself, encoded in hexadecimal.
  • TOKEN_BDATA : Binary data that comes right after a control word. The only example I have so far is related to the \bin control word.
  • TOKEN_NEWLINE : Newlines are not significant in Rtf contents, they are simply ignored (in fact, the whole contents of an Rtf file could be written on a single line). However, being able to make such distinction is useful for classes such as the RtfBeautifier class.
These constants are mainly used by the RtfParser class, but since RtfTexter relies on RtfParser, their definition has been put at the RtfDocument class level.
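
To give an idea of how Rtf contents break down into such elements, here is a small tokenizer sketch. It is not the RtfParser implementation : the function name is hypothetical, token types are returned as plain strings rather than as the TOKEN_* constants, and hexadecimal picture data (TOKEN_SDATA) and binary data (TOKEN_BDATA) receive no special treatment.

```php
<?php
// Hypothetical sketch: token names are plain strings, not the TOKEN_* constants.
function sketch_rtf_tokens(string $rtf): array
{
    // Order matters: the more specific backslash forms are tried first.
    $patterns = [
        'LBRACE'         => '\{',
        'RBRACE'         => '\}',
        'CHAR'           => '\\\\\'[0-9a-fA-F]{2}',
        'ESCAPED_CHAR'   => '\\\\[{}\\\\]',
        'CONTROL_WORD'   => '\\\\[a-zA-Z]+(?:-?[0-9]+)? ?',
        'CONTROL_SYMBOL' => '\\\\[^a-zA-Z]',
        'NEWLINE'        => '\r\n?|\n',
        'PCDATA'         => '[^\\\\{}\r\n]+',
    ];

    $tokens = [];
    $offset = 0;
    $length = strlen($rtf);

    while ($offset < $length) {
        $type = 'INVALID';
        $text = $rtf[$offset];      // fallback: consume a single character

        foreach ($patterns as $name => $pattern) {
            // \G anchors the match at the current offset
            if (preg_match('/\G(?:' . $pattern . ')/', $rtf, $match, 0, $offset)) {
                $type = $name;
                $text = $match[0];
                break;
            }
        }

        $tokens[] = [$type, $text];
        $offset  += strlen($text);
    }
    return $tokens;
}
```

Each entry of the returned array is a pair made of a token type and the matched text ; concatenating the texts gives back the original input.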

IRtfDocument interface

The methods declared in the IRtfDocument interface are implemented by all classes derived from the RtfDocument class.
This means that they work the same way, whether you are using a string-based or file-based class inheriting from RtfDocument.

All classes inheriting from RtfDocument (and therefore, supposed to implement the IRtfDocument interface) implement the ArrayAccess, Countable and Iterator interfaces.

This means for example that calling the count() builtin function on an object inheriting from the RtfDocument class will return the number of characters present in your Rtf document :

    $doc = new RtfStringDocument ( file_get_contents ( 'sample.rtf' ) ) ;
    echo count ( $doc ) ;

Note that it will return the number of characters in the Rtf code, not the number of characters of the plain text.

You can iterate through each character of the Rtf data present in your document by using a for loop :

    $doc = new RtfFileDocument ( 'sample.rtf' ) ;

    for ( $i = 0, $count = count ( $doc ) ; $i < $count ; $i ++ )
        echo "CHAR at position $i = [{$doc [$i]}]\n" ;

Note that you can use array index notation to retrieve an individual character, such as in $doc [$i].

Similarly, you can use a foreach loop to iterate through individual characters :

    $doc = new RtfFileDocument ( 'sample.rtf' ) ;

    foreach ( $doc as $ch )
    {
        // Do something with $ch...
    }
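
If you are curious about how an object can behave this way, the following self-contained sketch wraps a plain string and implements the three builtin interfaces. It is not the RtfTools code : the class name is hypothetical, the sequence is read-only for brevity, and the code is written for PHP 8.0 or later (it uses the mixed type).

```php
<?php
// Hypothetical class; written for PHP 8.0+ because it uses the "mixed" type.
final class SketchCharSequence implements ArrayAccess, Countable, Iterator
{
    private string $data;
    private int $position = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    // Countable: count($object) returns the number of characters.
    public function count(): int
    {
        return strlen($this->data);
    }

    // ArrayAccess: $object[$i] returns one character.
    public function offsetExists(mixed $offset): bool
    {
        return is_int($offset) && $offset >= 0 && $offset < strlen($this->data);
    }

    public function offsetGet(mixed $offset): mixed
    {
        return $this->offsetExists($offset) ? $this->data[$offset] : null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        throw new LogicException('This sketch is read-only');
    }

    public function offsetUnset(mixed $offset): void
    {
        throw new LogicException('This sketch is read-only');
    }

    // Iterator: foreach walks through the characters one by one.
    public function rewind(): void   { $this->position = 0; }
    public function valid(): bool    { return $this->position < strlen($this->data); }
    public function current(): mixed { return $this->data[$this->position]; }
    public function key(): mixed     { return $this->position; }
    public function next(): void     { $this->position++; }
}
```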

public function AsString ( )

Returns the contents of the underlying Rtf document as a string.

public function SaveTo ( $filename )

Saves the current document to the specified file.

public function get_contents ( )

Returns the whole contents of the underlying Rtf document, as a string.

public function strchr ( $cset, $start = 0 )

Searches for the first character in the Rtf document that is present in the $cset string, starting at the character position specified by $start.

This function behaves like a mix between the builtin strchr() and strcspn() functions.
Unlike strchr, it is able to search for the first occurrence of a character belonging to a given set of characters (and not only for a single character) ; but unlike the strcspn() function, it will return the offset of the first character found (strcspn actually returns the length of the longest segment that does not include any character specified by $cset).

The reason for this is that most of the classes belonging to the RtfTools package need to parse Rtf contents ; most of their needs consist in finding the next character having semantics in the Rtf language : backslash, opening and closing brace.

The method returns the offset of the found character, or false otherwise.
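
On a plain string, the same behavior can be obtained from the builtin strcspn() function by adding the start offset to the length it returns. The helper below is a hypothetical sketch of that idea, not the method implemented by the package.

```php
<?php
// Hypothetical helper working on a plain string.
function sketch_strchr(string $data, string $cset, int $start = 0)
{
    if ($start < 0 || $start >= strlen($data)) {
        return false;
    }
    // strcspn() returns the length of the segment free of any $cset character
    $position = $start + strcspn($data, $cset, $start);

    return $position < strlen($data) ? $position : false;
}
```

Calling it with the set '\\{}' finds the next backslash or brace, which are the characters that carry meaning in Rtf contents.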

public function strlen ( )

Returns the number of characters present in the underlying Rtf document.
The following instructions are equivalent (at least from a semantic point of view) :

    echo count ( $doc ) ;
    echo $doc -> strlen ( ) ;

public function strpos ( $searched_string, $start = 0 )

Searches the underlying Rtf document for the string specified by the $searched_string parameter, starting at the character offset specified by $start.

The method returns the offset of the found string, or false otherwise.

public function substr ( $start, $length = false )

Returns a substring of the underlying Rtf document.
This function behaves like the builtin substr() function.

public function write ( $fp, $start, $length = false )

Writes characters from the underlying Rtf document, starting at the offset specified by the $start parameter, to the file resource specified by $fp.

If the $length parameter has been specified, only this number of characters will be written to the output file ; otherwise, all the characters from $start until the end of file will be written.
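
Writing a range of characters record by record, rather than in one piece, keeps memory usage bounded. The sketch below shows the idea on a plain string ; it is an assumption about the general technique, not the package's code : the function name and parameter order are hypothetical, null (instead of false) stands for "until the end", and the 16384-byte default simply reuses the record size quoted elsewhere in this document.

```php
<?php
// Hypothetical helper; writes $data[$start .. $start + $length) to $fp in records.
function sketch_write_range($fp, string $data, int $start, ?int $length = null, int $recordSize = 16384): int
{
    $end     = $length === null ? strlen($data) : min(strlen($data), $start + $length);
    $written = 0;

    for ($position = $start; $position < $end; $position += $recordSize) {
        // The last record may be shorter than $recordSize
        $chunk  = substr($data, $position, min($recordSize, $end - $position));
        $result = fwrite($fp, $chunk);
        if ($result === false) {
            throw new RuntimeException('Write failed');
        }
        $written += $result;
    }
    return $written;
}
```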

public function to_closing_delimiter ( $start = 0 )

Searches for the closing delimiter of a compound construct, starting at the character offset specified by the $start parameter.
By convention, it is assumed that $start MUST point to the character AFTER the opening brace of the compound construct.

You can have a look at the RtfDocument::ToClosingDelimiter method for a more detailed explanation.

String and File support traits

The RtfStringSupport and RtfFileSupport traits have two characteristics in common :

  • They implement all the methods provided by the IRtfDocument interface and not implemented by the RtfDocument class.
    Of course, each implementation depends on whether the underlying document is stored as a string or as a file. You can have a look at the IRtfDocument interface for more explanations about these common methods.
  • They provide a pseudo-constructor, __specialized_construct, that will be called by the RtfDocument constructor, which passes it the arguments given by derived classes such as RtfBeautifier, RtfTexter, RtfTemplater and RtfParser.

RtfStringSupport trait

The specialized constructor of the RtfStringSupport trait has the following signature :

    protected function __specialized_construct ( $rtfdata, $chunk_size ) ;

The parameters are the following :
  • $rtfdata (string) :
    Rtf data coming from an Rtf document (most probably by using the file_get_contents() builtin function).
  • $chunk_size (integer) :
    Specifies the record size to be used when generating output Rtf data. The default record size is 4 MB.
    This record size may vary depending on the calling class.

This "constructor" will set the $Name property of the parent RtfDocument class to an empty string, since it does not correspond to any existing file.

RtfFileSupport trait

The specialized constructor of the RtfFileSupport trait has the following signature :

    protected function __specialized_construct ( $rtffile, $record_size = 16384, $cache_size = 8 ) ;

The parameters are the following :
  • $rtffile (string) :
    Name of the input file containing Rtf data. An exception will be thrown if the file is not accessible or does not exist.
  • $record_size (integer) :
    Specifies the record size to be used when generating output Rtf data. The default record size is 16 KB.
  • $cache_size (integer) :
    Indicates how many buffers of $record_size bytes should be cached in memory.
    Caching is useful when your search operations involve going back and forth between consecutive records, or when you have to recall the contents of a record that has been previously read. Most of the time, this avoids unnecessary disk reads.

This trait has a member, $SearchableFile, which is an instance of the SearchableFile class. It maps the underlying Rtf document to a cached memory object which allows for string searches, extractions and so on.
Although the latest release of the SearchableFile class is included in the latest releases of the RtfTools package, you can also find it at phpclasses.org.
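
The interplay between the record size and the number of cached records can be illustrated with a small sketch. It is NOT the SearchableFile class : the class name, its methods and the least-recently-used eviction policy are all assumptions made for the sake of the example ; only the two default values (16384 and 8) come from the signature above.

```php
<?php
// Hypothetical class: it only illustrates record caching and is NOT SearchableFile.
final class SketchRecordCache
{
    private $fp;
    private $recordSize;
    private $cacheSize;
    private $records   = [];    // record number => contents, least recently used first
    public  $diskReads = 0;     // counts real reads, to make the cache effect visible

    public function __construct(string $path, int $recordSize = 16384, int $cacheSize = 8)
    {
        $fp = @fopen($path, 'rb');
        if ($fp === false) {
            throw new RuntimeException("Cannot open $path");
        }
        $this->fp         = $fp;
        $this->recordSize = $recordSize;
        $this->cacheSize  = $cacheSize;
    }

    // Returns one record, reading it from disk only if it is not cached.
    public function record(int $number): string
    {
        if (isset($this->records[$number])) {
            $data = $this->records[$number];
            unset($this->records[$number]);         // re-insert as most recently used
            $this->records[$number] = $data;
            return $data;
        }

        fseek($this->fp, $number * $this->recordSize);
        $data = (string) fread($this->fp, $this->recordSize);
        $this->diskReads++;

        $this->records[$number] = $data;
        if (count($this->records) > $this->cacheSize) {
            reset($this->records);
            unset($this->records[key($this->records)]);     // evict the oldest record
        }
        return $data;
    }

    // Returns the character at a byte offset, or '' past the end of the file.
    public function charAt(int $offset): string
    {
        $record = $this->record(intdiv($offset, $this->recordSize));
        return $record[$offset % $this->recordSize] ?? '';
    }
}
```

Reading characters that belong to the same record, or to a record that is still cached, does not touch the disk ; the $diskReads counter only increases when a record has to be fetched again.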

String and File document classes

As is the case for almost all the classes of the RtfTools package, you will have to decide at some point whether to use the string-based versions (consuming more memory, but less cpu and I/O) or the file-based versions (consuming much less memory, but more I/O).

Both versions provide exactly the same features ; the choice is thus driven by the amount of data you will have to process, and how much memory and cpu usage are available to you.

Although those classes are not of great interest by themselves (you can only perform searches on the initial data, extract portions of it, and write contents to an output file), they have been designed so that the RtfMerger class only has to work with objects inheriting from the RtfDocument class.

They have different constructors, however : you will discover them in the following sections.

RtfStringDocument class

public function __construct ( $rtfdata, $chunk_size = 4 * 1024 * 1024 )

Loads an Rtf document into memory. Look at the RtfStringSupport trait for an explanation about the constructor's parameters.

A typical usage could be :

$doc = new RtfStringDocument ( file_get_contents ( 'sample.rtf' ) ) ;

RtfFileDocument class

public function __construct ( $file, $record_size = 16384, $cache_size = 8 )

Opens an Rtf document stored in a file ; its contents are read from disk only when they are needed. Look at the RtfFileSupport trait for an explanation about the constructor's parameters.

A typical usage could be :

$doc = new RtfFileDocument ( 'sample.rtf' ) ;

RtfBeautifier class

The goal of the RtfBeautifier class is to take an Rtf document and to produce a pretty-printed output.

But why would you want to pretty-print Rtf documents ? Suppose that you have two Rtf documents whose contents are almost identical, and that you want to compare them.

Since raw Rtf data can have several instructions grouped on the same line, you will have to spot the differences between two files that may have lines of Rtf data that are hundreds of characters long.

Comparing data formatted in such a way can be a brain-killer ; suppose for example that the files you need to compare both have the same line, but one is 700 characters long while the other one is 705 characters long, because some \pard tag has been inserted somewhere within.

When using tools such as the Unix diff or the Windows windiff command, you will find in the output that those lines differ in both files, but you will have to visually compare a 700-character line with a 705-character one. It will be a tough task to identify that there is an additional \pard tag located inside the line of the second file.

This is where the RtfBeautifier class comes into play : it is a debugging aid that takes a file and pretty-prints it by putting every Rtf syntactic element on a separate line, taking care of indentation levels.

Pretty-printing an Rtf document is very simple ; consider the following PHP script which takes file sample1.rtf as input, and generates an output file, sample1.txt, containing the pretty-printed contents ; it then repeats the same process with file sample2.rtf :

<?php
include ( 'path/to/RtfBeautifier.phpclass' ) ;
$beautifier = new RtfFileBeautifier ( 'sample1.rtf' ) ;
$beautifier -> SaveTo ( 'sample1.txt' ) ;
$beautifier = new RtfFileBeautifier ( 'sample2.rtf' ) ;
$beautifier -> SaveTo ( 'sample2.txt' ) ;

Now you are able to compare files sample1.txt and sample2.txt using the diff or windiff commands (or whatever diff-like command you prefer).

To give you an idea of what the output of the RtfBeautifier is, consider the following Rtf sample file contents (for the sake of brevity, only the start of the file is listed here, and the same line is shown over 3 lines) :

{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff0\deff0\stshfdbch0\stshfloch0\stshfhich0\stshfbi0
\deflang1036\deflangfe1036{\fonttbl{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}
Times New Roman;}{\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;} ...

The output of the RtfBeautifier class will be :

{
    \rtf1
    \adeflang1025
    \ansi
    \ansicpg1252
    \uc1
    \adeff0
    \deff0
    \stshfdbch0
    \stshfloch0
    \stshfhich0
    \stshfbi0
    \deflang1036
    \deflangfe1036
    {
        \fonttbl
        {
            \f0
            \froman
            \fcharset0
            \fprq2
            {
                \*\panose
                02020603050405020304
            }
            Times New Roman;
        }
        {
            \f1
            \fswiss
            \fcharset0
            \fprq2
            {
                \*\panose
                020b0604020202020204
            }
            Arial;
        }
        ...

Note that, even if it looks like Rtf contents, the output of the RtfBeautifier class is not valid Rtf. Although this is not clearly stated in the Microsoft Rtf Specifications, spaces cannot be put everywhere ; for example, there must be no space or line break between an opening brace ("{") and tags such as "\fonttbl" or "\rtf1".

In conclusion, the RtfBeautifier class is definitely a debugging tool that generates output for easy comparison of Rtf files, nothing more...


Class diagram

The class diagram for the RtfBeautifier class is the following :

Constructor

The constructor of the abstract class RtfBeautifier has the following signature :

public function __construct ( $options, $indentation_size )

The parameters are the following :

  • $options (integer) :
    A set of flags that condition the pretty-printing output process. See the Constants section for more explanations about the available options.
  • $indentation_size (integer) :
    Number of spaces to be used for indenting the output.
The RtfBeautifier class is abstract so it cannot be instantiated ; you will have to look at the RtfStringBeautifier and RtfFileBeautifier sections to understand how to use the string-based and file-based versions.

Methods

public function AsString ( )

Returns the pretty-printed contents of an Rtf document as a string.

public function SaveTo ( $filename )

Pretty-prints the underlying Rtf document and saves the generated output to a file.
Note that this method does not need to load the entire document contents into memory before generating its output : it reads input data by blocks and generates output data on-the-fly.

Properties

public $Options

Options that condition the behavior of the pretty-printing process. See the Constants section for more explanations on this set of flags.

The AsString and SaveTo methods use the current value of this property to process pretty-printing options so it is safe to modify it just before calling them.

public $IndentationSize

Number of spaces to be used for each indentation level.

Constants

BEAUTIFY_* constants

The BEAUTIFY_* constants allow you to specify a set of flags that condition the process of pretty-printing. The following flags are available :
  • BEAUTIFY_GROUP_SPECIAL_WORDS :
    "Standard" Rtf tags usually have the form : "\word", followed by an optional integer parameter. However, some special tags that came after the initial Rtf specifications, or that are application-specific, may be prefixed with the special control word "\*", such as in the following example :
    \*\latentstyles

    Microsoft says that an application not recognizing such tags should simply ignore them.

    When this flag is not specified, the control word part ("\*") is printed on a separate line. For example :

    \*
    \latentstyles
  • BEAUTIFY_SPLIT_ADJACENT_WORDS :
    It is not unusual in Rtf documents to find groups of tags joined together, without any separator space (a situation which is authorized by the Rtf Specifications) ; and sometimes, you will find adjacent groups of tags separated by a space ; for example :
    \af0\afs20\alang1025 \ltrch\fcs0

    When this flag is not set, the pretty-printed output will look like :

    \af0\afs20\alang1025
    \ltrch\fcs0

    When set, the same output will look like :

    \af0
    \afs20
    \alang1025
    \ltrch
    \fcs0
  • BEAUTIFY_SPLIT_CHARS :
    Indicates whether character code control words (of the form \'xy) should be put on a separate line or not.

    For example, "En-tête" is encoded as "En-t\'eate" and the encoded version will be output as is if this flag is not specified. When specified, it will be output as :

    En-t
    \'ea
    te
  • BEAUTIFY_STRIP_IMAGE_DATA :
    Images are specified the following way in an Rtf document :
    {\pict image parameters hexdata}

    where hexdata is a sequence of hexadecimal digits (ascii 0 through 9 and a through f).

    Since images encoded this way take twice the size of the original image, and there should be no particular reason for you to compare the same image in two different Rtf documents (unless you really want to be sure that both versions are the same), setting this flag will simply strip the hexadecimal data present in the \pict group and replace it with a comment, such as in the following example :

    { \pict ... image parameters ... /* x bytes of image data not shown }
  • BEAUTIFY_STRIP_BIN_DATA :
    Works the same as the BEAUTIFY_STRIP_IMAGE_DATA flag, but for binary data.

    Binary data uses the \bin tag, which must be followed by a parameter specifying the number of bytes that follow the tag ; the following defines 6 bytes of binary data ("123}45") :

    {\bin6 123}45}

    You may notice a few things :

    • There is always a space between the \bin control word and the first byte of data.
    • The binary data can contain any character in the range 0-255, even special characters that are normally interpreted as Rtf elements, such as "{", "}" or "\".
      Since the number of bytes to process is given by the parameter of the \bin tag ("6", in the above example), we are sure that we will not have anything else to parse until we have collected that number of bytes. The first closing brace thus belongs to the binary data itself and will never be interpreted.
      Of course, once the six bytes of binary data have been collected, the last closing brace will be considered to match the opening one.
    • The above example only uses plain ascii characters ; but any byte value, from 0 to 255, is authorized.

  • BEAUTIFY_STRIP_DATA :
    Same as BEAUTIFY_STRIP_IMAGE_DATA | BEAUTIFY_STRIP_BIN_DATA.

  • BEAUTIFY_ALL :
    Enables all of the above flags.
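
The length-prefixed rule described above for the \bin tag can be sketched as follows. This is only an illustration of the principle ; the read_bin_payload function is a hypothetical helper written for this example, not the code used by the RtfBeautifier class :

```php
<?php
// Hypothetical helper, not part of RtfTools : reads one \binN payload.
// $offset must point to the backslash of the \bin control word.
// Returns [ payload, offset of the first byte after the payload ], or null.
function read_bin_payload(string $rtf, int $offset): ?array
{
    // \bin, a decimal byte count, then exactly one delimiter space
    if (!preg_match('/\G\\\\bin(\d+) /', $rtf, $match, 0, $offset)) {
        return null;
    }

    $length = (int) $match[1];
    $start  = $offset + strlen($match[0]);

    if ($start + $length > strlen($rtf)) {
        return null;                    // truncated binary data
    }

    // The payload is taken verbatim : braces and backslashes are not parsed
    return [substr($rtf, $start, $length), $start + $length];
}

$rtf = '{\bin6 123}45}';
[$data, $next] = read_bin_payload($rtf, 1);

echo $data, "\n";           // 123}45
echo $rtf[$next], "\n";     // }  (the brace that really closes the group)
```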

RtfStringBeautifier class

public function __construct ( $rtfdata, $options = self::BEAUTIFY_ALL, $indentation_size = 4, $chunk_size = 4 * 1024 * 1024 )

Creates an RtfBeautifier object, using the specified Rtf data.

A typical usage could be :

$doc = new RtfStringBeautifier ( file_get_contents ( 'sample.rtf' ) ) ;
$doc -> SaveTo ( 'sample.txt' ) ;     // Save pretty-printed contents to output file
echo $doc -> AsString ( ) ;           // Echo pretty-printed contents to standard output

The parameters are the following :

  • $rtfdata (string) :
    Rtf document data, specified as a string.
  • $options (integer) :
    A set of flags that condition the pretty-printing output process. See the Constants section for more explanations about the available options.
  • $indentation_size (integer) :
    Number of spaces to be used for indenting the output.
  • $chunk_size (integer) :
    Buffer size used when generating output by blocks.

RtfFileBeautifier class

public function __construct ( $file, $options = self::BEAUTIFY_ALL, $indentation_size = 4, $record_size = 16384 )

Creates an RtfBeautifier object, without loading the file contents into memory.

A typical usage could be :

$doc = new RtfFileBeautifier ( 'sample.rtf' ) ;
$doc -> SaveTo ( 'sample.txt' ) ;     // Save pretty-printed contents to output file
echo $doc -> AsString ( ) ;           // Echo pretty-printed contents to standard output

The parameters are the following :

  • $file (string) :
    Path to an Rtf document to be processed.
  • $options (integer) :
    A set of flags that condition the pretty-printing output process. See the Constants section for more explanations about the available options.
  • $indentation_size (integer) :
    Number of spaces to be used for indenting the output.
  • $record_size (integer) :
    Record size to be used when reading and writing data.

RtfMerger class

The RtfMerger class allows you to combine the contents of several Rtf files into a single one. It can be used for example for mass printing or for storing a set of related files into a single Rtf document.

Unlike all the other classes of this package that process Rtf contents, this class does not inherit from RtfDocument.

Merging documents together

Merging documents together is a simple three-step process :

  1. Create an object of class RtfMerger. Although Rtf documents can be specified as class constructor's parameters, you can still add more documents later by using the Add() method.
  2. (optional) Add as many files as you want using either the Add() method or the array access interface that the class implements.
  3. To create the merged document, either call the AsString() or SaveTo() method.

Class diagram

The RtfMerger acts as a container for objects inheriting from the RtfDocument class.
The diagram below shows that it contains two important member properties :

  • An array of RtfMergerDocument objects. Every document added to an RtfMerger is wrapped by this class.

    An RtfMergerDocument object holds a few associative arrays that are used when some elements in the header of the document are in conflict with the data already existing in the global header. They are used for renumbering any reference to an existing color, font, stylesheet (and more) whenever needed.

  • A global header, of class RtfMergerHeader. This header is built while processing documents, to gather important information from individual document headers.

Merging process overview

Note : a little knowledge of the Rtf Specifications would be welcome here to better understand the merging process.

Merging several Rtf documents together requires a few manipulations. Before explaining them, a short overview of the Rtf document format is needed.

Rtf documents have a header and a body part ; the Microsoft Rtf Specifications state that an Rtf document is built like this :

<file> ::= '{' <header> <body> '}'

The above description states that an Rtf document always starts with an opening brace, followed by a header part, then by a body part, and finally terminated by a closing brace.

If we have a further look at the <header> part, we will find something like this (a question mark after a construct means that it is optional) :

<header> ::= \rtf1 \fbidis? <character set> <from>? <deffont> <deflang> <fonttbl>? <filetbl>? <colortbl>? <stylesheet>? <stylerestrictions>? <listtables>? <revtbl>? <rsidtable>? <mathprops>? <generator>?

Globally, a header starts with the \rtf1 tag (an Rtf document always starts with the string {\rtf1), followed by a certain number of tags which are more or less to be seen as global document properties ; then you will see compound structures such as the font table, the color table, the style sheet table, etc.
There are two other pieces of information that are not considered as being part of the header by the Rtf Specifications, but that are indeed specific to a document ; these are :

  • \xmlnstbl : Xml namespace table, not used by the RtfMerger class
  • \info, which contains information such as the author of the document, its title, etc.

The RtfMerger class discards any information related to Xml namespaces, but it allows you to specify author information that will be put in the final document.

Tables in the header part of a document define a set of items : the color table defines the colors used in the document, the font table defines the fonts used in the document, the stylesheet table defines style sheets used in the document, and so on.

Each entry in these tables can be referred to later in the document body by using the appropriate tag (control word). For example, setting the foreground color in a paragraph can be specified with the \cfx tag, where x is the entry number in the document color table.
The same kind of process applies to the style sheet table, the font table, etc.

The problem comes when merging multiple documents together ; each document (probably) uses its own header tables for colors, fonts and stylesheets, and it may happen that an entry from document x conflicts with the same entry in document x+1.
When such a situation happens (it happens in most cases), some renumbering has to occur in document x+1 for the entries that conflict with those of document x.

This is why, during the processing of a document to be merged, tables local to the document will hold entries in the corresponding RtfMergerDocument object, indicating which references should be renumbered, because there was already an entry having that id in the global header, but with a different definition.

The global header that will be generated will include all the entries coming from the first document to be merged, plus the renumbered entries coming from the subsequent documents.

The following sections give a little bit more details about each of these elements, and explain how they are handled during the merging process. You will see that some tables require specialized handling when renumbering references to their elements.

Global document properties

The term Global document properties is used here to indicate tags (Control words, in the Microsoft terminology) that define settings at the document level. The following example specifies a default language code of 1025 when paragraph settings are reset to their default (using the \plain control word) ; it also specifies that the document uses the ansi character set (\ansi) along with code page 1036 (\ansicpg1036) :

\deflang1025\ansi\ansicpg1036

The RtfMerger class will collect all those various tags coming from the headers of the documents to be merged. However, if a tag has been found having a different parameter value in a previously processed document, it will not be overridden and a warning such as in the following example will be issued :

Tag \ansicpg value mismatch : current = 1057, previous = 1036

In its current state of development, the RtfMerger class simply ignores conflicting global document properties that may come from documents processed after the first one in the merging process.
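
The "first value wins" behavior described above could be sketched like this. The collect_properties function and its array layout are assumptions made for the example ; only the warning text follows the message shown above :

```php
<?php
// Hypothetical sketch, not the RtfMerger code : "first value wins" for
// global document properties.
// $global   : tag name => parameter value, built across documents
// $tags     : the tags found in the header of the document being added
// $warnings : receives one message per conflicting tag
function collect_properties(array &$global, array $tags, array &$warnings): void
{
    foreach ($tags as $name => $value) {
        if (!array_key_exists($name, $global)) {
            $global[$name] = $value;            // first time this tag is seen
        } elseif ($global[$name] !== $value) {
            // Keep the value of the earlier document, only report the conflict
            $warnings[] = "Tag \\$name value mismatch : current = $value, previous = {$global[$name]}";
        }
    }
}

$global   = [];
$warnings = [];

collect_properties($global, ['deflang' => 1025, 'ansicpg' => 1036], $warnings);
collect_properties($global, ['deflang' => 1025, 'ansicpg' => 1057], $warnings);

echo $global['ansicpg'], "\n";      // 1036
echo $warnings[0], "\n";            // Tag \ansicpg value mismatch : current = 1057, previous = 1036
```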

Color tables

Color tables are specified in a compound structure that starts with the \colortbl tag and contains color specifications in RGB format (using the \red, \green and \blue tags) ; color specifications are separated with a semicolon :

{\colortbl;\red255\green255\blue255;\red0\green0\blue0;...}

In the above table, 3 colors are defined :

  • A color with no specification (the specification would have taken place between the \colortbl tag and the first semicolon) : this is known as the default color of the document, and its index in the color table is #0.
  • The color white (\red255\green255\blue255), which has index #1.
  • The color black (\red0\green0\blue0), which has index #2.
  • "..." stands for : additional color specifications not listed here.

Color indexes are zero-based. The tags that reference a color within the body part of a document are :

  • \cbx, which defines the background color to be entry x in the document color table.
  • \cfx, which defines the foreground color to be entry x in the document color table.

When building a global color table regrouping all the colors referenced by the documents to be merged, the following rules apply :

  • For each color entry of each document :
    • The global color table is searched for an entry having the same characteristics (same red, green and blue values ; but be aware that additional color attributes may be specified so they are also taken into account when comparing two color specifications together).
    • If a color with the same characteristics exists in the global color table, then two situations can happen :
      • The color has the same index in both the current document color table and in the global color table ; in this case, references to colors in the document body (using the \cbx or \cfx tags) will remain as is : no renumbering is needed.
      • The global header contains the color, but uses an index which is different from the one used by the current document ; in this case, a color substitution entry is added in the RtfMergerDocument color substitution table. This table will be used later for renumbering conflicting colors when performing the merge operation.
    • The global header does not contain the color ; in this case, the new color (coming from the current document) is added to the global color table and a new color substitution entry is added in the RtfMergerDocument color substitution table for later renumbering.
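
The rules above can be summarized by the following sketch. It is a simplified illustration, not the actual RtfMerger code : the merge_color_table function is hypothetical, colors are compared as raw specification strings, and a substitution entry is only recorded when the index really changes :

```php
<?php
// Hypothetical sketch of the color rules above.
// $global : color specifications of the merged header (index = color number)
// $local  : color specifications of the document being added, in table order
// Returns the substitution table of that document : local index => global index
function merge_color_table(array &$global, array $local): array
{
    $substitutions = [];

    foreach ($local as $index => $specification) {
        $found = array_search($specification, $global, true);

        if ($found === false) {                 // unknown color : append it
            $global[] = $specification;
            $found    = count($global) - 1;
        }

        if ($found !== $index) {                // known color, but another index
            $substitutions[$index] = $found;
        }
    }

    return $substitutions;
}

$global = [];

// First document : default color, white, black
merge_color_table($global, ['', '\red255\green255\blue255', '\red0\green0\blue0']);

// Second document : default color, black, red
$substitutions = merge_color_table($global, ['', '\red0\green0\blue0', '\red255\green0\blue0']);

print_r($substitutions);    // local #1 (black) becomes #2, local #2 (red) becomes #3
```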

Font tables

Font tables are specified in a compound structure that starts with the \fonttbl tag and contains font specifications defined in nested compound structures. The following example defines three fonts, Times New Roman, Arial, and Calibri that are referenced inside the document (the Rtf code has been intentionally indented for better readability) :

{\fonttbl
    {\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}
    {\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
    {\f39\fswiss\fcharset0\fprq2{\*\panose 020f0502020204030204}Calibri;}
}

While color indexes in color tables are assigned sequentially (i.e., the first color in the color table has index 0, the second has index 1, and so on), fonts have their own numbering scheme, specified by the \f tag, followed by a font number.
We can see from the above example that the Times New Roman font has number #0, Arial number #1, and Calibri number #39.

The tags that reference a font within the body part of a document are \f and \af.

Within a single document, all font numbers are unique. However, when it comes to merging multiple documents together, you need to take care that the font numbers used in an individual document will not conflict with the font numbers used in another document so, again, there will be a renumbering operation during the merging process.

When building a global font table regrouping all the fonts defined by the documents to be merged, the following rules apply :

  • For each font definition entry of each document :
    • Build an "anonymized" version of the font definition ; an anonymized version of a font definition is built from the initial font definition found in the current document, where the \f tag has been removed, together with any space, tab, newline or carriage return.

      For example, the "anonymized" font definition for font #0 in the above example (Times New Roman), will give the following :

      {\froman\fcharset0\fprq2{\*\panose02020603050405020304}TimesNewRoman;}
    • If the global font table already contains such an anonymized entry (meaning that the two definitions are identical), then two situations can arise :
      • Both fonts have the same id (specified with the \f tag) ; in this case, no renumbering will occur.
      • The fonts have different ids ; in this case, a font substitution entry is added in the RtfMergerDocument font substitution table. This table will be used later for renumbering conflicting fonts when performing the merge operation.
    • The global header does not contain the font definition ; in this case, the new font (coming from the current document) is added to the global font table and a new font substitution entry is added in the RtfMergerDocument font substitution table for later renumbering.

Note that the fonts coming from individual documents are renumbered sequentially, starting from 0. This means that a global font table coming from the font table example above will look like this (note that the Calibri font, which was initially referred to as font #39, is now font #2) :

{\fonttbl
    {\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}
    {\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}
    {\f2\fswiss\fcharset0\fprq2{\*\panose 020f0502020204030204}Calibri;}
}
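
The anonymization and sequential renumbering of fonts could be sketched as follows. Both functions are hypothetical and written for this example ; they only illustrate the principle and assume that the \f tag is the first numbered \f tag of the definition :

```php
<?php
// Hypothetical sketch, not the RtfMerger code.

// Removes the \fN tag, then any space, tab, newline or carriage return.
function anonymize_font(string $definition): string
{
    $definition = preg_replace('/\\\\f\d+/', '', $definition, 1);

    return preg_replace('/\s+/', '', $definition);
}

// $global : anonymized definition => font number in the merged header
// $local  : font number in the current document => raw font definition
// Returns the font substitution table : local number => global number
function merge_font_table(array &$global, array $local): array
{
    $substitutions = [];

    foreach ($local as $number => $definition) {
        $key = anonymize_font($definition);

        if (!isset($global[$key])) {
            $global[$key] = count($global);     // next sequential number, from 0
        }

        if ($global[$key] !== $number) {
            $substitutions[$number] = $global[$key];
        }
    }

    return $substitutions;
}

$global        = [];
$substitutions = merge_font_table($global, [
    0  => '{\f0\froman\fcharset0\fprq2{\*\panose 02020603050405020304}Times New Roman;}',
    1  => '{\f1\fswiss\fcharset0\fprq2{\*\panose 020b0604020202020204}Arial;}',
    39 => '{\f39\fswiss\fcharset0\fprq2{\*\panose 020f0502020204030204}Calibri;}',
]);

print_r($substitutions);    // font #39 (Calibri) becomes font #2
```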

List tables

A list table is a case similar to a color table : it contains list definitions, whose id is assigned sequentially. The difference is that list numbers start at 1 while color numbers are 0-based.

The list table of a document contains list definitions that can be referenced in the body part ; they define properties such as list levels, which in turn specify attributes such as the picture to be used for bullets, the numbering scheme for this level, etc.

A list table definition is a compound statement that starts with the \listtable tag ; it contains list definitions that in turn start with the \list tag, such as in the following example (for brevity, the contents of each list have been replaced by an ellipsis) :

{\listtable {\list \listidx...} {\list \listidy...} ...}

Lists in a document are referenced by the \lsx tag, where x is the 1-based list entry index into the list table.
Each list definition must contain a \listidx tag, that specifies a unique id for this list.

When building a global list table regrouping all the lists referenced by the documents to be merged, the following rules apply :

  • For each list definition of each document :
    • The global list table is searched for an entry having the same characteristics. To do that, a "compressed" version of the list definition is created by removing any space, tab, newline or carriage return.
    • If a list with the same characteristics exists in the global list table, then two situations can happen :
      • The list has the same index in both the current document list table and in the global list table ; in this case, references to this list in the document body (using the \lsx tag) will remain as is : no renumbering is needed.
      • The global header contains the list definition, but uses an index which is different from the one used by the current document ; in this case, a list substitution entry is added in the RtfMergerDocument list substitution table. This table will be used later for renumbering conflicting lists when performing the merge operation.
    • The global header does not contain the list ; in this case, the new list (coming from the current document) is added to the global list table and a new list substitution entry is added in the RtfMergerDocument list substitution table for later renumbering.

List Override tables

List override tables are complements to existing list definitions ; there are generally two types of list overrides :

  • Overrides that specify different formatting properties for a paragraph that has to be formatted as a list
  • Overrides that specify a different start value for lists

A list override table definition is a compound statement that starts with the \listoverridetable tag ; it contains list override definitions that in turn start with the \listoverride tag, such as in the following example (for brevity, the contents of each list override have been replaced by an ellipsis) :

{\listoverridetable {\listoverride \listidx \lsa...} {\listoverride \listidy \lsb...} ...}

Each list override definition contains two important tags :

  • \listidx, which refers to the list, in the list table, having the same id
  • \lsx, the one-based number identifying this list override entry

In the document body, lists are referred to using the \lsx tag.

The process of handling conflicting list override entries is nearly the same as the one used for font tables : conflicting override list entries are "anonymized", by removing the \listid and \ls tags. This "anonymized" version is used to check if we already encountered such a list override definition in a previous document.

Depending on the comparison result, the same renumbering process as the one that is used for font definitions is applied here. The only difference is that the unique list ids (\listid tags) are renumbered together with the override list ids (\ls tags).
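
Here is a minimal sketch of what "anonymizing" a list override could look like. The anonymize_list_override function is hypothetical, and the list ids used in the sample data are made up for the example :

```php
<?php
// Hypothetical sketch, not the RtfMerger code : strips the two identifying
// tags (\listidN and \lsN) and all whitespace from a list override, so that
// overrides coming from different documents can be compared.
function anonymize_list_override(string $definition): string
{
    $definition = preg_replace('/\\\\(listid|ls)-?\d+/', '', $definition);

    return preg_replace('/\s+/', '', $definition);
}

// Same override, but with different ids and spacing (made-up values)
$a = '{\listoverride\listid1026178165\listoverridecount0\ls1}';
$b = '{\listoverride \listid77350412 \listoverridecount0 \ls3}';

echo anonymize_list_override($a), "\n";     // {\listoverride\listoverridecount0}
var_dump(anonymize_list_override($a) === anonymize_list_override($b));     // bool(true)
```

Since both anonymized versions are equal, the second override would be treated as already known, and its \listid and \ls values would simply be mapped to those of the first one.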

Stylesheet tables

A stylesheet table is a list of nested stylesheet definitions, which are a shorthand for specifying character, paragraph or section formatting.

A stylesheet table is a compound statement that starts with the \stylesheet tag ; it contains in turn stylesheet definitions that can specify any character, paragraph or section formatting tags, such as in the following (abbreviated) definition :

{\stylesheet
    {\ql \li0\ri0\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright ...}
    {\*\ts11\tsrowd\trftsWidthB3\trpaddl108\trpaddr108 ...}
    {\*\cs10 \additive \ssemihidden Default Paragraph Font;}
}

There are a few kinds of styles, which are given by a specific tag followed by the stylesheet id ; the possible style identification tags are :

  • \sx : paragraph style with id x.
  • \*\csx : character style with id x.
  • \*\dsx : section style with id x.
  • \*\tsx : table style with id x.

You will notice in the above example that the very first style (the one which starts with the \ql tag) has no id (i.e., none of the style id tags listed above appears in it). This is the default style, which is by convention numbered as style #0.

Thus, the process of renumbering style sheets in the global header will be very similar to the one used for font tables, with the following exceptions :

  • The first style not having any style identification id will have number #0
  • Several kinds of styles must be handled (character, paragraph, section and table). They will all be renumbered sequentially.

In addition, style definitions can contain tags whose parameter is a style id ; these ids must also be renumbered during the merge process. Such tags are :

  • \sbasedon : Defines the style on which the current style is based.
  • \snext : Defines the style to be used in the next paragraph marked by this style.
  • \slink : Link to a character style that shares the same font properties as this one.
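
The two stylesheet-specific points (identifying the kind and id of a style, and renumbering the tags that refer to other styles) could be sketched as follows. Both functions are hypothetical, and the sketch assumes that the style identification tag, when present, comes first in the definition :

```php
<?php
// Hypothetical sketch, not the RtfMerger code.

// Returns [ kind, id ] for one style definition : \sN (paragraph),
// \*\csN (character), \*\dsN (section) or \*\tsN (table).
function style_identity(string $definition): array
{
    if (preg_match('/^\{\s*(?:\\\\\*)?\\\\(s|cs|ds|ts)(\d+)/', $definition, $match)) {
        return [$match[1], (int) $match[2]];
    }

    return ['s', 0];        // no identification tag : default style #0
}

// Applies a substitution table (old id => new id) to the tags whose
// parameter is a style id : \sbasedon, \snext and \slink.
function renumber_style_links(string $definition, array $map): string
{
    return preg_replace_callback(
        '/\\\\(sbasedon|snext|slink)(\d+)/',
        function (array $match) use ($map) {
            $id = (int) $match[2];

            return '\\' . $match[1] . ($map[$id] ?? $id);
        },
        $definition
    );
}

print_r(style_identity('{\*\cs10 \additive Default Paragraph Font;}'));    // kind "cs", id 10

echo renumber_style_links('{\s2\sbasedon1\snext2 heading 1;}', [1 => 15, 2 => 16]), "\n";
// {\s2\sbasedon15\snext16 heading 1;}
```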

RSID

RSID (Revision Save IDs) are used for revision tracking. The merging process will remove anything related to revision tracking ; this includes :

  • The RSID id table defined in the header part of the document, and indicated by the \rsidtbl tag. It lists all the unique ids used by all the various kinds of RSID tags used throughout the document.
  • The various RSID tags themselves :
    • \rsidx : each time the document is saved, a new RSID is added into the RSID table. This tag appears only in the RSID table.
    • \insrsidx : denotes a text that has been inserted.
    • \delrsidx : identifies text marked as deleted.
    • \charrsidx : character formatting has been changed.
    • \sectrsidx : section formatting has been changed.
    • \pararsidx : paragraph formatting has been changed.
    • \tblrsidx : table formatting has been changed.
    • \rsidrootx : identifies the start of the document history (first save).

Preserving or removing RSIDs across the various documents to be merged is not a great issue in itself ; RSID numbers are normally chosen randomly, so there is little chance that one RSID from document x conflicts with an RSID coming from document y.

However, tracking the individual history of documents to be merged together inside the final merged document does not make much sense. After all, the main purpose of a merged document is not to be edited afterwards ; a typical usage will be to print it, or to store it unmodified in a database.

This is why the merging process removes any revision information, as if the whole document had been edited all at once before the first save (although a first save would create by itself a first RSID).
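
As an illustration only, removing revision information could be approximated with two regular expressions. The strip_rsid function is hypothetical ; it assumes that the RSID table is a flat {\*\rsidtbl ...} group without nested braces, and it does not try to handle every corner case of the Rtf syntax :

```php
<?php
// Hypothetical sketch, not the RtfMerger code : strips revision-tracking
// information from a piece of Rtf.
function strip_rsid(string $rtf): string
{
    // 1. The RSID table, assumed to be a flat group : {\*\rsidtbl \rsid1234...}
    $rtf = preg_replace('/\{\\\\\*\\\\rsidtbl[^{}]*\}/', '', $rtf);

    // 2. The individual tags : \rsidN, \insrsidN, \delrsidN, \charrsidN,
    //    \sectrsidN, \pararsidN, \tblrsidN and \rsidrootN.
    //    The delimiter space that may follow is kept, so that the previous
    //    control word does not get glued to the following text.
    return preg_replace('/\\\\(?:ins|del|char|sect|para|tbl)?rsid(?:root)?\d+/', '', $rtf);
}

$rtf = '{\*\rsidtbl \rsid1234\rsid5678}{\rtlch\insrsid1234\charrsid5678 Hello}';

echo strip_rsid($rtf), "\n";    // {\rtlch Hello}
```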

Shapes

Shapes in a document can take various forms : they can be text areas, geometric shapes or more complex structures. The fact is that each shape has its own id, identified by the \shplid tag.

Merging two documents having shapes with the same id may result in strange things :

  • You may find that everything is ok when doing a Print preview
  • However, the printed version may show shapes at positions that differ from the original document, or with different size, showing truncated contents, or bigger than the original version

This is why renumbering shapes so that they will all have a unique id in the merged document is really important.
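
A minimal sketch of such a renumbering is given below. The renumber_shapes function is hypothetical, and starting the shared counter at 1 is an arbitrary choice made for the example :

```php
<?php
// Hypothetical sketch, not the RtfMerger code : gives every shape a fresh id,
// using one counter shared by all the documents being merged.
function renumber_shapes(string $body, int &$next_id): string
{
    $map = [];      // old id => new id, local to the current document

    return preg_replace_callback(
        '/\\\\shplid(\d+)/',
        function (array $match) use (&$map, &$next_id) {
            if (!isset($map[$match[1]])) {
                $map[$match[1]] = $next_id++;
            }

            return '\\shplid' . $map[$match[1]];
        },
        $body
    );
}

$next_id = 1;

// Two documents using the same shape id
echo renumber_shapes('{\shp{\*\shpinst\shplid1026}}', $next_id), "\n";     // {\shp{\*\shpinst\shplid1}}
echo renumber_shapes('{\shp{\*\shpinst\shplid1026}}', $next_id), "\n";     // {\shp{\*\shpinst\shplid2}}
```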

The merging process

Now that you have a global view of what the merging process takes care of, describing its overall actions will be simpler :

  1. Before the merging process itself, whenever you add a document to be merged (through the class constructor, the Add() method or the array access methods), the transformations described in the paragraphs above are applied to the document header. Because your documents are wrapped by the RtfMergerDocument class, which performs these operations automatically, each added document gradually contributes to the merged document header (color table, font table, list table and so on).
    Normally, only renumbering operations occur on header tables coming from individual documents. There is one exception to this rule however : stylesheets, which contain formatting information that can also occur within a document body. For this reason, during this preprocessing step, their contents are processed as if they were part of the document body (see the ReplaceReferences protected method).
  2. When the merged document is generated, the Rtf code for its header is built first, using the BuildHeader() method of the GlobalHeader property of the RtfMerger object.
  3. Then, for the body part of each document, the ReplaceReferences protected method will handle the hassle of renumbering what is to be renumbered (references to colors, fonts, styles and so on), using the local renumbering table that has been built during step #1.
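
To give an idea of what step #3 involves, here is a simplified sketch of reference renumbering in a document body. The replace_references function below is a standalone illustration, not the protected ReplaceReferences method itself ; the layout of the substitution tables is an assumption, and corner cases such as escaped backslashes are ignored :

```php
<?php
// Hypothetical sketch of reference renumbering in a document body.
// $tables : tag name => substitution table (old number => new number)
function replace_references(string $body, array $tables): string
{
    $names = implode('|', array_keys($tables));     // for example : cf|cb|f

    // A single pass over the body, so that a number that has just been
    // substituted is never substituted a second time
    return preg_replace_callback(
        '/\\\\(' . $names . ')(\d+)/',
        function (array $match) use ($tables) {
            $old = (int) $match[2];

            return '\\' . $match[1] . ($tables[$match[1]][$old] ?? $old);
        },
        $body
    );
}

$colors = [1 => 2, 2 => 3];
$tables = ['cf' => $colors, 'cb' => $colors, 'f' => [39 => 2]];

echo replace_references('\f39\fs20\cf1 Hello \cb2 world', $tables), "\n";
// \f2\fs20\cf2 Hello \cb3 world
```

Note that \fs20 is left untouched : the tag name must be immediately followed by its numeric parameter to be considered a reference.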

The final merged document will only be generated when you call either the AsString or the SaveTo method.

In the first case, you will need as much memory as necessary to hold the global header plus the body parts of each document to be merged.

In the second case, you will only need enough memory to hold the biggest document body, or the global document header, whichever is the biggest.

Constructor

The RtfMerger class constructor has the following signature :

public function __construct ( [ [documents...] options] ) ;

An RtfMerger instance without any documents in it can be created this way :

$merger = new RtfMerger ( ) ;

You can then add existing documents by using the Add method or the array access methods :

$merger -> Add ( "sample1.rtf" ) ;
$merger [] = "sample2.rtf" ;
$merger [] = new RtfFileTemplater ( "sample3.rtf", $variables ) ;

You can also specify filenames or objects inheriting from the RtfDocument class to the constructor :

$merger = new RtfMerger ( "sample1.rtf", "sample2.rtf", new RtfFileTemplater ( "sample3.rtf", $variables ) ) ;

Methods

public function Add ( $args...)

Adds the specified documents to the merger object. $args can be of two types :

  • An object inheriting from the RtfDocument class, such as RtfStringDocument, RtfFileDocument, RtfStringTemplater or RtfFileTemplater.
  • A string. Depending on the options currently defined by the Options property, such a parameter will be interpreted differently :
    • If the RTFMERGE_STRINGS_AS_FILENAMES bit is set (the default), the string will be interpreted as an existing filename, and an object of type RtfFileDocument will be added to the merger object.
    • If the RTFMERGE_STRINGS_AS_DATA bit is set, the string parameter will be interpreted as Rtf data and an object of type RtfStringDocument will be added to the merger object.

Array access

The RtfMerger class implements the Countable, ArrayAccess and IteratorAggregate interfaces, which allows you to have access to the documents that have been added to the merger object :

$merger = new RtfMerger ( "sample1.rtf", "sample2.rtf" ) ;

echo count ( $merger ) ;        // Will display "2"
$doc2 = $merger [1] ;           // $doc2 will be set to the RtfFileDocument object that
                                // has been created from "sample2.rtf" file contents

// Iterate through each document
foreach ( $merger as $doc )
{
    // do something with $doc, which is of type RtfFileDocument
}
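
For readers unfamiliar with these three interfaces, here is a minimal sketch of a container that supports count(), the [] syntax and foreach. The DocumentList class is hypothetical and is not the RtfMerger implementation ; the sketch assumes PHP 8 (for the mixed type).

```php
<?php
// Hypothetical container, only meant to show the three interfaces at work
class DocumentList implements Countable, ArrayAccess, IteratorAggregate
{
    private array $documents = [] ;

    public function count ( ) : int
    { return count ( $this -> documents ) ; }

    public function getIterator ( ) : Iterator
    { return new ArrayIterator ( $this -> documents ) ; }

    public function offsetExists ( mixed $offset ) : bool
    { return isset ( $this -> documents [ $offset ] ) ; }

    public function offsetGet ( mixed $offset ) : mixed
    { return $this -> documents [ $offset ] ; }

    public function offsetSet ( mixed $offset, mixed $value ) : void
    {
        if ( $offset === null )
            $this -> documents [] = $value ;            // $list [] = ... appends
        else
            $this -> documents [ $offset ] = $value ;
    }

    public function offsetUnset ( mixed $offset ) : void
    { unset ( $this -> documents [ $offset ] ) ; }
}

$list    = new DocumentList ( ) ;
$list [] = 'sample1.rtf' ;
$list [] = 'sample2.rtf' ;

echo count ( $list ), "\n" ;
echo $list [1], "\n" ;
```

A real merger would wrap each added value in a document object inside offsetSet() instead of storing the raw string.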

public function AsString ( )

Returns the merged document contents as a string.

public function SaveTo ( $filename )

Saves the merged document contents to the specified filename.

Properties

protected $Documents

Every document added through the Add or array access methods is put in this array, after being wrapped in an RtfMergerDocument object.

Document information properties

You can define some document-information properties that will be put in the final merged document :

  • Title
  • Subject
  • Author
  • Manager
  • Company
  • Operator
  • Category
  • Keywords
  • Comment
  • Summary
  • Version

Note that the Keywords property is an array of strings.

public $Options

The $Options property is a set of RTF_MERGE_* constants that condition the behavior of the RtfMerger object.

private $GlobalHeader

The $GlobalHeader property holds an object of class RtfMergerHeader. As more documents of type RtfMergerDocument are added to the merger object, this object is complemented by the new colors, fonts, stylesheets and list definitions brought by the new documents.

When the final document is generated with either the AsString or SaveTo methods, this object returns the mandatory Rtf code needed to build the document header.

Constants

RTF_MERGE_*

The RTF_MERGE_* constants are used for the $Options property to define the behavior of the RtfMerger class during the merging process ; the property can be any combination of the following :

  • RTF_MERGER_STRINGS_AS_FILENAMES : Indicates that parameters specified as string to either the class constructor, the Add() method or the array access methods are to be considered as filenames.
    In this case, an object of class RtfFileDocument will be created for the file and added to the merger object.
    This is the default option.
  • RTF_MERGER_STRINGS_AS_DATA : Indicates that parameters specified as string to either the class constructor, the Add() method or the array access methods are to be considered as Rtf data.
    In this case, an object of class RtfStringDocument will be created from the data and added to the merger object.
  • RTF_MERGER_NONE : No specific option. Note that the default option RTF_MERGER_STRINGS_AS_FILENAMES will be set anyway.

RtfParser class

The RtfParser class is a general class that you can use to parse Rtf contents. It's a little bit more than a parser, however, because it handles special constructs that are specific to certain tags (or control words) ; it features the following :

  • Handling of special tags such as \pict, a compound construct that embeds image data ; and \bin, another compound construct that embeds arbitrary binary data whose length is specified as a parameter of the \bin control word itself.
  • Handling of special control words, which define a parameter that later affects how other related control words are to be interpreted.

Parsing an Rtf file only requires repeatedly calling the NextToken() method, which returns an object inheriting from the RtfToken class that gives all the necessary information about the next available Rtf token.

Overview

The simplest program to parse Rtf contents could look like this :

<?php
require ( 'RtfParser.phpclass' ) ;

$file = "sample.rtf" ;
$parser = new RtfFileParser ( $file ) ;

while ( ( $token = $parser -> NextToken ( ) ) !== false )
{
    // do something with $token
}

Class diagram

The class diagram for the RtfParser class is the following :

Constructor

The RtfParser abstract base class has the following constructor :

public function __construct ( ) ;

No particular parameter is required.

Methods

public function GetControlWordValue ( $word, $default = '' )

Gets the currently applicable parameter value for the specified control word.
A good example of the usefulness of this method involves Unicode characters, which are specified by the \u tag as in the following example :

\u10356

The Rtf specification states that a Unicode character is followed by one or more character symbols (using the "\'" tag), which give a fallback representation of the preceding Unicode character in the current code page :

\u10356\'a1\'b0

The number of character symbols that follow a Unicode character specification is given by the \uc tag ; in the above example, it should be written like this :

\uc2 \u10356\'a1\'b0

However, the specification states that this number (the parameter of the \uc tag) should be tracked and that a stack of applicable values depending on the current curly brace nesting level should be handled (the \uc tag may be present elsewhere in the document, not specifically before Unicode character specifications, and its default value should be 1).

So, in the above example, we have to answer the question : "What is the current value of the \uc tag ?" whenever we encounter a \u tag, to be able to determine the number of character symbols that should follow it.

For example, if the current value of the \uc tag is 1, then the following sequence will be interpreted as Unicode character #10356 followed by a single fallback character, 161 (0xa1), in the current code page ; the Unicode character is then followed by an uppercase A (\'41) :

\u10356\'a1\'41

If the current value of the \uc tag is 2, then the above sequence will be interpreted as Unicode character #10356 followed by a two-byte fallback character, 41281 (0xa141).

To be able to handle such a situation, you will first have to call the TrackControlWord() method to tell the parser that you want to track the current value of the \uc control word, as in the following example :

$parser -> TrackControlWord ( 'uc', true, 1 ) ;

Then whenever a \u control word specifying a Unicode character is encountered when parsing an Rtf document, you can call this method to retrieve the currently applicable value for \uc :

$uc_value = $parser -> GetControlWordValue ( 'uc' ) ;

The parameters are the following :

  • $word (string) :
    Control word whose current parameter value is to be retrieved, without the leading backslash.
  • $default (mixed) :
    Default value to be returned, if the control word has not been tracked using the TrackControlWord method.

The GetControlWordValue method returns either the current parameter value for the specified control word (which depends on the current nesting level if the control word is stackable), or $default otherwise.
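
The stack handling described above can be sketched as follows. This is a simplified illustration, not the RtfParser code : the StackableValue class and its method names are hypothetical. Each opening brace pushes a copy of the current value, and each closing brace restores the value of the enclosing level.

```php
<?php
// Hypothetical tracker for one stackable control word such as \uc
class StackableValue
{
    private array $stack ;

    public function __construct ( int $default )
    { $this -> stack = [ $default ] ; }

    // An opening brace starts a new level that inherits the current value
    public function open ( ) : void
    { $this -> stack [] = end ( $this -> stack ) ; }

    // A closing brace restores the value of the enclosing level
    public function close ( ) : void
    { if ( count ( $this -> stack ) > 1 ) array_pop ( $this -> stack ) ; }

    // The control word was seen with a new parameter at the current level
    public function set ( int $value ) : void
    { $this -> stack [ count ( $this -> stack ) - 1 ] = $value ; }

    public function get ( ) : int
    { return end ( $this -> stack ) ; }
}

// Walk through the equivalent of : {\uc2 {\uc1 } }
$uc = new StackableValue ( 1 ) ;            // \uc defaults to 1
$uc -> open ( ) ; $uc -> set ( 2 ) ;        // {\uc2
$uc -> open ( ) ; $uc -> set ( 1 ) ;        //    {\uc1
$inner = $uc -> get ( ) ;                   //    value inside the inner group
$uc -> close ( ) ;                          //    }
$outer = $uc -> get ( ) ;                   // value back in the outer group
$uc -> close ( ) ;                          // }
$top   = $uc -> get ( ) ;                   // value outside any group

echo "$inner $outer $top\n" ;
```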

public function IgnoreCompounds ( $list )

When parsing an Rtf document, not all control words may be of interest to you. This method allows you to supply the list of control words you want to be ignored, as an array of strings :

$parser -> IgnoreCompounds ( [ 'fonttbl', 'listtable', 'listoverridetable', 'pict' ] ) ;

When implementing a class inheriting from RtfParser, you should consider which tags (or control words) are useless for the task you want to carry out ; parsing a whole Rtf document can be slow, so telling the parser which tags can be safely ignored will improve performance. This is especially true for control words such as \pict or \bin, which can embed huge amounts of data.

public function NextToken ( )

Returns the next available token from the Rtf input stream. The returned value is of type RtfToken, or false if all tokens have been processed.

This method skips the tokens that are to be ignored, i.e. the ones that have been specified in a call to the IgnoreCompounds method.

public function Reset ( )

Resets the parser object, so that parsing can start again from the beginning.

public function SkipCompound ( )

Some Rtf constructs may not be of interest to you, depending on the result you want to achieve. Suppose for example that you do not want to interpret anything coming from the font table, which has a definition that looks like :

{\fonttbl{font definition 1}...{font definition n}}

SkipCompound allows you to continue past the closing brace that terminates the font table started by {\fonttbl, ignoring any content between these two delimiters.

Note that the function will decrement the current brace nesting level.
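
The idea of skipping a compound can be sketched on a plain string. This is only an illustration and not the SkipCompound implementation : the skip_group() function is hypothetical, and it does not handle \bin data, which may itself contain brace characters.

```php
<?php
// Hypothetical helper : given the offset of an opening brace, returns the
// offset just past its matching closing brace, or false if braces are unbalanced.
function skip_group ( string $rtf, int $start )
{
    $level  = 0 ;
    $length = strlen ( $rtf ) ;

    for ( $i = $start ; $i < $length ; $i ++ )
    {
        $ch = $rtf [ $i ] ;

        if ( $ch === '\\' )                     // skip the character after a backslash (\{ or \})
            $i ++ ;
        else if ( $ch === '{' )
            $level ++ ;
        else if ( $ch === '}' && -- $level === 0 )
            return $i + 1 ;
    }

    return false ;
}

$rtf  = '{\rtf1{\fonttbl{\f0 Arial;}{\f1 Courier;}}Hello}' ;
$next = skip_group ( $rtf, 6 ) ;                // offset 6 is the brace that opens {\fonttbl

echo substr ( $rtf, $next ), "\n" ;
```

After the call, the remaining input starts right after the font table, one nesting level below the level that was current inside it.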

public function TrackControlWord ( $word, $stackable, $default_value = false )

Tracks a control word specification in the current Rtf document. This allows you, for example, to associate raw data with a control word, such as for the \pict tags.

It also allows you to track control words whose value can be changed when entering a new nesting level and must be restored when exiting this nesting level (this is the case, for example, of the \uc control word).

Parameters are the following :

  • $word (string) :
    Control word to be tracked.
  • $stackable (boolean) :
    Indicates whether the control word parameter value can be nested within braces. This is not the case for example of \pict constructs, where the first data available will be the image contents themselves ; however, constructs such as \ucx, where "x" represents the number of bytes in a Unicode character specification that can be found afterwards, can be stacked. Every closing brace will pop the last value of x.
  • $default_value (string) :
    When non-false, specifies an initial default value for the control word. This parameter is mainly used for stackable control words.

Properties

$CurrentColumn

Returns the current column position in the parsed Rtf document. Columns are numbered from 1.

$CurrentLine

Returns the current line in the parsed Rtf document. Lines are numbered from 1.

$CurrentPosition

Returns the current byte offset from the start of the file. Byte offsets start at 0.

$NestingLevel

Current nesting level of curly braces.

RtfStringParser class

public function __construct ( $rtfdata, $chunk_size = 4 * 1024 * 1024 )

Creates an RtfParser object, using the specified Rtf data as a string.

The parameters are the following :

  • $rtfdata (string) :
    Rtf document data to be parsed.
  • $chunk_size (integer) :
    Not used.

RtfFileParser class

public function __construct ( $file, $record_size = 16384 )

Creates an RtfParser object, using the specified Rtf document.

The parameters are the following :

  • $file (string) :
    Name of the file to be parsed.
  • $record_size (integer) :
    Record size to be used when reading the Rtf document.

RtfTemplater class

The RtfTemplater class allows for processing template documents using a specific macro language, in order to generate different final Rtf documents whose contents will depend on the input you supplied. Such input is mainly given through variables with as many different values as you have documents to process.

Overview

The principle of templating documents is really simple and needs only 3 steps :

  1. Create a template document using the macro-language constructs provided by the RtfTemplater class. This document will be able to reference variables, but also expressions to be substituted in place, IF/THEN/ELSE constructs and FOR loops.
  2. As you may have guessed, the final output document will contain customized contents, depending on the input you supplied. So the next step is to instantiate an object of class RtfStringTemplater or RtfFileTemplater, providing it with the values of the variables that are used inside the template document and that will condition the contents of the final document.
  3. Use either the AsString or SaveTo methods to generate the final document.

Creating your first template

To create your first template, simply use your favorite word processor as long as it can save or export contents into Rtf format. Such an editor could be Microsoft Word, OpenOffice, LibreOffice or even Wordpad !

The following example document (let's assume this is an Rtf document) references 4 variables : TITLE, FIRSTNAME, LASTNAME and SENDER. It also uses the PHP date() function to put the current date :

Date : %( date ( 'd/m/Y' ) )%

Dear %$TITLE% %$FIRSTNAME% %$LASTNAME%,

Your reservation for the year 2016 Annual Congress of Pataphysical Scientists has been confirmed.

Regards,
%$SENDER%.

You can notice a few things from the above document template :

  • Template constructs are always enclosed with percent signs
  • References to template variables must be preceded by a dollar sign, such as in : $TITLE
  • It is possible to call PHP functions in expressions ; in the above example, we are calling the date() function.
  • Expressions must start with the string "%(" and end with ")%", otherwise the templater will try to evaluate them as variables.

Generating personalized documents

A simple script will allow us to generate personalized documents from the document template we saw in the previous section.

The first thing we need to do is to include the RtfTemplater.phpclass file :

include ( 'RtfTemplater.phpclass' ) ;

$template_file = 'example.rtf' ;        // Assume this is our example template above

Now, we will need to supply some data to generate personalized documents, using different values for the TITLE, FIRSTNAME, LASTNAME and SENDER variables referenced in our template.

To do that, we have to put individual values in an array ; the example below declares an array that contains the variable substitutions for 3 recipients :

$recipients =
[
    [
        'TITLE'     => 'Ms',
        'FIRSTNAME' => 'Jane',
        'LASTNAME'  => 'Doe',
        'SENDER'    => 'Alfred Jarry, Senior Pataphysics Engineer'
    ],
    [
        'TITLE'     => 'Mr',
        'FIRSTNAME' => 'John',
        'LASTNAME'  => 'Smith',
        'SENDER'    => 'Alfred Jarry, Senior Pataphysics Engineer'
    ],
    [
        'TITLE'     => 'Mr',
        'FIRSTNAME' => 'Peter',
        'LASTNAME'  => 'Watson',
        'SENDER'    => 'Alfred Jarry, Senior Pataphysics Engineer'
    ]
] ;

Now we can generate an output document for each entry in our $recipients array ; we will build a loop, and create a new instance of the RtfFileTemplater class (RtfTemplater itself is abstract), using our base template document and recipient data :

// PHP arrays are indexed from 0 ; output files are numbered from 1
for ( $index = 0, $count = count ( $recipients ) ; $index < $count ; $index ++ )
{
    $recipient = $recipients [ $index ] ;
    $templater = new RtfFileTemplater ( $template_file, $recipient ) ;
    $templater -> SaveTo ( "output." . ( $index + 1 ) . ".rtf" ) ;
}

Viewing the results

The sample code above will generate 3 files : "output.1.rtf", "output.2.rtf" and "output.3.rtf". Let's view one of them :

Date : 25/10/2016

Dear Ms Jane Doe,

Your reservation for the year 2016 Annual Congress of Pataphysical Scientists has been confirmed.

Regards,
Alfred Jarry, Senior Pataphysics Engineer.

That's all ! You have just created your first mailing script.

Class diagram

The class diagram for the RtfTemplater class is the following :

Templater macro-language reference

The templating pseudo-language implements a few simple control structures. All expressions can reference variables that have been passed to the constructor of the RtfStringTemplater or RtfFileTemplater class :

$variables =
[
    'VNAME1' => 'the value of vname1',
    'VNAME2' => 'the value of vname2',
    'INDEX'  => 17,
    'ARRAY'  => [ 'string a', 'string b', 'string c' ],
    'TITLE'  => 'M.'
] ;

$document = new RtfStringTemplater ( $contents, $variables ) ;

Array keys are simply variable names, which are case-sensitive, while array values represent the string that will be substituted whenever the variable is referenced in the document.

Note that in the above example, one of the variables, ARRAY, is not scalar ; such an array variable can be used in FOREACH constructs.

Language overview

The macro templating language provides the following constructs :

  • References to variables
  • Computed expressions
  • IF constructs
  • FOR, FOREACH and REPEAT loops

Every macro language construct must be surrounded by percent signs, as in the following examples :

%$VARIABLE%
%( date ( 'd/m/Y' ) )%
%FOR ( $I = 1 TO $INDEX )%

Paragraph marks (line breaks) between the enclosing percent signs of an instruction are ignored.

Compound statements such as IF or loops can be nested.

The RtfTemplater class tries to be as smart as possible when differentiating macro constructs from regular document contents.
However, some situations can make the macro language interpreter a little bit confused ; see the Coping with percent signs paragraph for an explanation on how to avoid such situations.

Expressions

Expressions can reference variables passed to the class constructor, but they can also use any operators or functions provided by PHP. Expressions are replaced with their evaluation result in the output contents.

As for the PHP language, variable names must be prefixed by the "$" sign ; for example (using our example $variables described above) :

%$VNAME1%

will be substituted with :

the value of vname1

Referencing a variable name can be considered as the simplest possible expression ; when it comes to more complex expressions, you will need to enclose them between "%(" and ")%" :

Current index : %($INDEX + 100)%
Today is : %( date ( 'd/m/Y' ) )%

An expression can use any syntactic element allowed by PHP ; in addition, you can also call builtin functions, as the date() function in the above example.

Undefined variables will be expanded to an empty string and a warning will be issued, unless the $warnings parameter of the class constructor has been set to false.

Note that variable names are case-sensitive.
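
The behavior described above for simple variable references (case-sensitive names, undefined variables expanding to an empty string with an optional warning) can be sketched as follows. The expand_variables() function is hypothetical and only handles scalar variables ; the real templater also copes with Rtf tags interspersed in the text.

```php
<?php
// Hypothetical helper : expands %$NAME% references found in $text
function expand_variables ( string $text, array $variables, bool $warnings = true ) : string
{
    return preg_replace_callback
    (
        '/%\$(\w+)%/',
        function ( $match ) use ( $variables, $warnings )
        {
            $name = $match [1] ;

            // Lookup is case-sensitive, as array keys are
            if ( array_key_exists ( $name, $variables ) )
                return (string) $variables [ $name ] ;

            if ( $warnings )
                trigger_error ( "Undefined variable $name", E_USER_WARNING ) ;

            return '' ;
        },
        $text
    ) ;
}

$variables = [ 'VNAME1' => 'the value of vname1', 'INDEX' => 17 ] ;
$result    = expand_variables ( '%$VNAME1% / %$INDEX% / [%$vname1%]', $variables, false ) ;

echo $result, "\n" ;
```

The last reference uses a lowercase name, which is not defined, so it expands to an empty string between the brackets.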

IF constructs

IF constructs are a way to conditionally include text in your output document ; as for the traditional if statements in various programming languages, the IF statement accepts an expression enclosed with parentheses. You can use any syntactic element recognized by PHP, call builtin functions and reference variables passed to the class constructor :

%IF ( $TITLE == 'M.' )%The value of TITLE is : "M"%END%

In the above example, the output document will contain the following string if the value of the $TITLE variable has been set to "M." :

The value of TITLE is : "M"

An IF construct can have as many ELSEIF alternatives as needed, and an optional ELSE statement :

%IF ( $INDEX == 19 )%
index = 19
%ELSEIF ( $INDEX == 18 )%
index = 18
%ELSE%
index is neither 19 nor 18.
%END%

Using our example $variables array where the $INDEX variable has the value 17, you will notice that the output document will contain two empty lines before the string "index is neither 19 nor 18.".

This is due to the fact that a paragraph mark (a line break) has been inserted after each ending percent sign of each IF and ELSEIF/ELSE statements. If you would like no line break to be inserted in the output, and still preserve the readability of your macro-language constructs, then you could put the ending percent at the beginning of the next line :

%IF ( $INDEX == 19 )
%index = 19
%ELSEIF ( $INDEX == 18 )
%index = 18
%ELSE
%index is neither 19 nor 18.
%END%

FOR loops

FOR loops are a way to repeat text a certain number of times. Specify a start and end index :

%FOR ( $i = 1 TO $INDEX )
%This is line #%$i%
%END%

The above example will insert 17 lines in the output document (the INDEX variable has been defined to be 17 in our variables array), from «This is line #1» to «This is line #17».

You can also specify an optional step :

%FOR ( $i = 1 TO $INDEX BY 2 )%

or :

%FOR ( $i = 1 TO $INDEX STEP 2 )%
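
For comparison, here is a plain PHP loop that should produce the same lines as a FOR construct with a step of 2. It assumes that the TO bound is inclusive (as suggested by the 17-line example above) and that BY/STEP is simply added to the index on each iteration, which this documentation does not state explicitly.

```php
<?php
$INDEX = 17 ;
$lines = [] ;

// Assumed equivalent of : %FOR ( $i = 1 TO $INDEX BY 2 )% This is line #%$i% %END%
for ( $i = 1 ; $i <= $INDEX ; $i += 2 )
    $lines [] = "This is line #$i" ;

echo implode ( "\n", $lines ), "\n" ;
```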

FOREACH loops

FOREACH loops are based on array variables (look at the 'ARRAY' entry of the $variables array in the example above).

The following instruction will output the text "string a", "string b", "string c" on separate lines :

%FOREACH ( $value IN $ARRAY )%%$value% %END%

REPEAT loops

REPEAT loops are only a shortcut for FOR loops :

%REPEAT ( $i = $INDEX )%

is equivalent to :

%FOR ( $i = 1 TO $INDEX )%

Predefined variables

The following variables are predefined and can be referenced anywhere :

  • FILENAME : Name of the input template file. Expands to an empty string for objects of class RtfStringTemplater.

Coping with percent signs

The templater class does its best in trying to distinguish control statements from pure text. It will for example correctly handle the following case :

Tax rate is : 20% some other text %$VNAME%

However, if you follow "20%" with a sign that is recognized as the start of an expression, such as an opening parenthesis :

Tax rate is : 20% (since 2016) some other text %$VNAME%

then it will try to interpret the string "% (since 2016) some other text%" as an expression and will issue a warning, because this is not a valid computed expression.

To avoid such situations, simply double the percent sign, as in the following :

Tax rate is : 20%% (since 2016)
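
If the literal text of a template is produced by a script rather than typed in a word processor, the doubling can be automated. The escape_percent() function below is a hypothetical helper, not part of the RtfTemplater class, and must only be applied to literal text, never to the macro constructs themselves.

```php
<?php
// Hypothetical helper : doubles percent signs so that they are kept as literal text
function escape_percent ( string $text ) : string
{
    return str_replace ( '%', '%%', $text ) ;
}

$literal = escape_percent ( 'Tax rate is : 20% (since 2016)' ) ;

echo $literal, "\n" ;
```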

Under the hood...

Constructor

The RtfTemplater abstract base class has the following constructor :

public function __construct ( $variables, $warnings = true )

The parameters are the following :

  • $variables (associative array) :
    An associative array which defines the variables that can be referenced from within the document template to customize its contents.

    Each array key is the name of a document variable ; the corresponding value must either be a scalar value or an array. When defined as an array value, the variable can be referenced in a %FOREACH% instruction.

  • $warnings (boolean) :
    When true, warnings will be issued if incorrect situations are found, such as undefined variables being referenced, or syntax errors in some instructions.

Methods

public function AsString ( )

Returns the preprocessed contents of an Rtf template as a string, using the variables that have been specified to the class constructor.

public function SeparateTextFromRtf ( $contents )

Separates the tags and text parts of a piece of Rtf contents.

This function is especially used for extracting template constructs delimited by percent signs. It may happen that due to user manipulations in the template document, some Rtf tags may be interspersed with real template constructs.

Imagine for example that in the string "%$VNAME2%", the "%V" and "2%" parts have been put in boldface ; the corresponding Rtf code may look like :

%$VN}{\rtlch\fcs1 \af0 \ltrch\fcs0 \lang2057\langfe1036\langnp2057\insrsid15075231 AM} {\rtlch\fcs1 \af0 \ltrch\fcs0 \b\lang2057\langfe1036\langnp2057\insrsid15075231 \charrsid15075231 E2%

This method ensures that both the original text contents (%$VNAME2%) and Rtf data will be preserved.

It returns an associative array containing two entries :

  • rtf : Rtf contents.
  • text : Pure text contents (%$VNAME2% in the above example).
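
A much simplified sketch of this separation is shown below. It is not the SeparateTextFromRtf implementation : the separate_text_from_rtf() function is hypothetical, it only recognizes plain control words and braces, and it ignores the tags that carry a text parameter, hexadecimal character symbols and binary data. The input is a shortened variant of the example above.

```php
<?php
// Hypothetical, simplified version : control words (with their delimiter space)
// and braces go to 'rtf', everything else goes to 'text'.
function separate_text_from_rtf ( string $contents ) : array
{
    $re = '/\\\\[a-zA-Z]+-?\d* ?|[{}]/' ;

    preg_match_all ( $re, $contents, $matches ) ;

    return
    [
        'rtf'  => implode ( '', $matches [0] ),
        'text' => preg_replace ( $re, '', $contents )
    ] ;
}

$parts = separate_text_from_rtf ( '%$VN}{\b0\insrsid1 AM}{\b\insrsid1 E2%' ) ;

echo $parts [ 'text' ], "\n" ;
```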

public function SaveTo ( $filename )

Saves the preprocessed contents of an Rtf template to a file, using the variables that have been specified to the class constructor.

Note that this method does not need to load the entire document contents into memory before generating its output : it reads input data by blocks and generates output data on-the-fly.
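
The block-by-block principle can be sketched with standard PHP file functions. The copy_by_blocks() function is hypothetical and performs no templating at all ; it only shows that memory usage is bounded by the record size rather than by the file size.

```php
<?php
// Hypothetical helper : copies $input to $output by blocks of $record_size bytes
// and returns the number of bytes written.
function copy_by_blocks ( string $input, string $output, int $record_size = 16384 ) : int
{
    $in    = fopen ( $input , 'rb' ) ;
    $out   = fopen ( $output, 'wb' ) ;
    $total = 0 ;

    while ( ( $block = fread ( $in, $record_size ) ) !== false && $block !== '' )
    {
        // A real templater would transform $block here before writing it
        $total += fwrite ( $out, $block ) ;
    }

    fclose ( $in ) ;
    fclose ( $out ) ;

    return $total ;
}

$source = tempnam ( sys_get_temp_dir ( ), 'rtf' ) ;
$target = tempnam ( sys_get_temp_dir ( ), 'rtf' ) ;
file_put_contents ( $source, str_repeat ( 'x', 40000 ) ) ;

$written = copy_by_blocks ( $source, $target ) ;

echo $written, "\n" ;
```

The difficulty a real implementation has to face, and that this sketch leaves out, is that a macro construct may straddle the boundary between two blocks.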

Properties

public $Variables

Document variables, as specified to the class constructor. Since this variable is public, it can be freely changed after instantiating the class.

public $Warnings

Enables/disables warnings. The initial value of this property has been passed to the constructor.

private static $TagsWithTextParameter

This internal array is used by the SeparateTextFromRtf method to identify Rtf tags that always include a text parameter, which is not to be confused with regular text coming from the document.

RtfStringTemplater class

public function __construct ( $rtfdata, $variables = [], $warnings = true, $chunk_size = 4 * 1024 * 1024 )

Creates an RtfTemplater object, using the specified Rtf data as the template.

A typical usage could be :

$variables = [ 'FIRSTNAME' => 'Jane', 'LASTNAME' => 'Doe' ] ;
$doc = new RtfStringTemplater ( file_get_contents ( 'sample.rtf' ), $variables ) ;

$doc -> SaveTo ( 'sample.txt' ) ;       // Save templated contents to output file
echo $doc -> AsString ( ) ;             // Echo templated contents to standard output

The parameters are the following :

  • $rtfdata (string) :
    Rtf document data, specified as a string.
  • $variables (associative array) :
    An associative array which defines the variables that can be referenced from within the document template to customize its contents.

    Each array key is the name of a document variable ; the corresponding value must either be a scalar value or an array. When defined as an array value, the variable can be referenced in a %FOREACH% instruction.

  • $warnings (boolean) :
    When true, warnings will be issued if incorrect situations are found, such as undefined variables being referenced, or syntax errors in some instructions.
  • $chunk_size (integer) :
    Buffer size used when generating output by blocks.

RtfFileTemplater class

public function __construct ( $file, $variables = [], $warnings = true, $record_size = 16384 )

Creates an RtfTemplater object, using the specified Rtf file as a template.

A typical usage could be :

$variables = [ 'FIRSTNAME' => 'Jane', 'LASTNAME' => 'Doe' ] ;
$doc = new RtfFileTemplater ( 'sample.rtf', $variables ) ;

$doc -> SaveTo ( 'sample.txt' ) ;       // Save templated contents to output file
echo $doc -> AsString ( ) ;             // Echo templated contents to standard output

The parameters are the following :

  • $file (string) :
    Path to an existing document
  • $variables (associative array) :
    An associative array which defines the variables that can be referenced from within the document template to customize its contents.

    Each array key is the name of a document variable ; the corresponding value must either be a scalar value or an array. When defined as an array value, the variable can be referenced in a %FOREACH% instruction.

  • $warnings (boolean) :
    When true, warnings will be issued if incorrect situations are found, such as undefined variables being referenced, or syntax errors in some instructions.
  • $record_size (integer) :
    Buffer size used for both reading the template document and generating an output custom document.

RtfTexter class

The RtfTexter class extracts text from an Rtf document.
Although it can perform some - very - basic text formatting, it is intended to be used mainly for text indexing purposes.

Extracting text from an Rtf document is pretty simple, as shown by the following example :

<?php
include ( 'path/to/RtfTexter.phpclass' ) ;

$texter = new RtfFileTexter ( 'sample.rtf' ) ;

echo $texter -> AsString ( ) ;          // Echo text contents
$texter -> SaveTo ( 'sample.txt' ) ;    // Save text contents to sample.txt

Class diagram

The class diagram for the RtfTexter class is the following :

Constructor

The constructor of the RtfTexter class has the following signature :

public function __construct ( $options = self::TEXTEROPT_ALL, $page_width = 80 )

Parameters are the following :

  • $options : A combination of TEXTEROPT_* constants that condition the format of the extracted text.
  • $page_width : Maximum page width, used when the TEXTEROPT_WRAP_TEXT flag is set in the $options parameter. Text lines longer than this width will be wrapped.

Methods

public function AsString ( )

Returns the text contents of an Rtf document as a string.

public function SaveTo ( $filename )

Saves the text contents of an Rtf document to the specified file.

protected function FormatParagraphs ( $data )

Internal method. Formats the specified paragraph(s) (which may contain several lines) to fit the width specified by the $PageWidth property.

This method is called only if the $Options property has the TEXTEROPT_WRAP_TEXT flag set.

protected function SetOptions ( $flags )

Internal method. Sets the $Options property, together with the $Eol string.

protected function TextifyData ( &$data, $nesting_level_to_reach = false )

Internal method. Processes the text data to be extracted.
Parameters are the following :

  • $data : variable that will receive the extracted text contents.
  • $nesting_level_to_reach : In some cases (such as for \headerr or \footerr tags, which contain the text to be put in headers and footers for the current section), the TextifyData() method recursively calls itself to analyze specific Rtf contents.
    This parameter tells which Rtf recursion level marks the end of the analysis (ie, the final nested braces level).

Properties

public $Eol

String used for end of lines.

public $Options

Option flags (a combination of TEXTEROPT_* constants).

public $PageWidth

Maximum width, in characters, of a page.

This setting will be enforced only if the TEXTEROPT_WRAP_TEXT flag is set for the $Options property.
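
The effect of such wrapping can be approximated with the standard PHP wordwrap() function. This sketch makes no claim about how FormatParagraphs is actually implemented ; the $page_width and $eol variables simply stand for the $PageWidth and $Eol properties.

```php
<?php
$page_width = 20 ;              // stands for the $PageWidth property
$eol        = "\r\n" ;          // stands for the $Eol property
$paragraph  = 'The quick brown fox jumps over the lazy dog' ;

// wordwrap() breaks lines at word boundaries so that none exceeds $page_width
$wrapped = wordwrap ( $paragraph, $page_width, $eol, true ) ;
$lines   = explode ( $eol, $wrapped ) ;

foreach ( $lines as $line )
    echo str_pad ( $line, $page_width ), "|\n" ;
```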

protected static $IgnoreList = [ ... ]

Compound tags that can be safely ignored during text extraction.

protected static $TranslatedCharacters = [ ... ]

Characters that must be substituted to avoid spurious data in the output. Such characters are for example the left and right double-quotes.

protected static $TranslatedTags = [ ... ]

Tags that are to be translated either to their ascii or html entity equivalents.

Constants

TEXTEROPT_* constants

The TEXTEROPT_* constants are flags that condition the text extraction process. Any combination of the following flags can be specified :

  • TEXTEROPT_INCLUDE_PAGE_HEADERS : include page headers.
    Note that page headers won't be repeated on each page unless a section break is encountered.
  • TEXTEROPT_INCLUDE_PAGE_FOOTERS : include page footers.
    Note that page footers won't be repeated on each page unless a section break is encountered.
  • TEXTEROPT_INCLUDE_PAGE_TITLES : a synonym for :
    TEXTEROPT_INCLUDE_PAGE_HEADERS | TEXTEROPT_INCLUDE_PAGE_FOOTERS
  • TEXTEROPT_USE_FORM_FEEDS : use form feeds to separate pages.
  • TEXTEROPT_WRAP_TEXT : wrap text, using the width specified for the PageWidth property.
  • TEXTEROPT_EOL_STYLE_DEFAULT : use the PHP_EOL constant for line endings.
  • TEXTEROPT_EOL_STYLE_WINDOWS : use cr/lf for line endings.
  • TEXTEROPT_EOL_STYLE_UNIX : use newlines for line endings.

RtfStringTexter class

public function __construct ( $rtfdata, $options = self::TEXTEROPT_ALL, $page_width = 80 )

Loads Rtf data for further extraction.

The parameters are the following :

  • $rtfdata : a whole Rtf document specified as a string.
  • $options : a combination of TEXTEROPT_* flags.

A typical usage could be :

$doc = new RtfStringTexter ( file_get_contents ( 'sample.rtf' ) ) ;

echo $doc -> AsString ( ) ;             // Echo text contents from file sample.rtf
$doc -> SaveTo ( 'sample.txt' ) ;       // Save text contents to file sample.txt

RtfFileTexter class

public function __construct ( $file, $options = self::TEXTEROPT_ALL, $page_width = 80 )

Loads Rtf data from the specified file for further extraction.

The parameters are the following :

  • $file : Rtf document whose contents are to be extracted. Must be an existing file.
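  • $options : a combination of TEXTEROPT_* flags.
  • $page_width : the page width used for wrapping text when the TEXTEROPT_WRAP_TEXT flag is set. Defaults to 80.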

A typical usage could be :

$doc = new RtfFileTexter ( 'sample.rtf' ) ;
echo $doc -> AsString ( ) ;              // Echo text contents from file sample.rtf
echo $doc -> SaveTo ( 'sample.txt' ) ;   // Save text contents to file sample.txt

Internal classes reference


This section provides a reference to the classes that are used internally by the RtfTools package and are not normally exposed to the outside world.

RtfMergerDocument class

Whenever a document is added to a merger object, it is wrapped by an RtfMergerDocument object which basically performs the following tasks :

  • Analyze the document header and extract tables such as the color table, the font table, the stylesheet table, etc.
  • Add those elements to the $GlobalHeader property of the parent RtfMerger object.
  • Internally keep track of the various colors, fonts, styles, lists (and so on) that need to be renumbered when the merging process occurs.
  • Renumber references to colors, fonts, styles, lists (and so on) when the merging process occurs, and return the modified document body.

Constructor

The constructor of the RtfMergerDocument class has the following signature :

public function __construct ( $parent, $document, $global_header )

Parameters are the following :

  • $parent : Parent object, of class RtfMerger
  • $document : Document object to be wrapped, inheriting from the RtfDocument class.
  • $global_header : The object that contains the global header for the merged document. This is an object of class RtfMergerHeader.

Methods

protected function ExtractColorTable ( $header )

Extracts the color table from the document header. Updates the global header accordingly and holds a table of color renumberings in case of conflicts.

protected function ExtractFontTable ( $header )

Extracts the font table from the document header. Updates the global header accordingly and holds a table of font renumberings in case of conflicts.

protected function ExtractListTable ( $header )

Extracts the list table from the document header. Updates the global header accordingly and holds a table of list renumberings in case of conflicts.

protected function ExtractOverrideListTable ( $header )

Extracts the override list table from the document header. Updates the global header accordingly and holds a table of list override renumberings in case of conflicts.

protected function ExtractStylesheetTable ( $header )

Extracts the stylesheet table from the document header. Updates the global header accordingly and holds a table of stylesheet renumberings in case of conflicts.

protected function ExtractSettings ( $header )

Extracts the various settings that can be found in a header of an Rtf document, specified as single tags.

In the current version, a warning will be issued if one of the documents has a header setting different from the first one that has been encountered. Future versions may be able to handle different setting values more gracefully.

public function GetBody ( $remove_rsid = true )

Returns the body of the underlying document, once all the renumbering operations have been applied for the color tables, font tables, stylesheet tables and so on.

protected function ReplaceReferences ( $text, $remove_rsid = false, $renumber_shapes = false )

This method is called by the GetBody method to replace any reference to colors, fonts, styles and lists with their new number in the merged document.

The $remove_rsid parameter specifies whether revision history information should be removed from the document body. Although the method's default value is false, the RtfMerger class always sets it to true.

The $renumber_shapes parameter specifies whether shapes should also be renumbered. The only reason why this parameter should be false is when processing stylesheet contents.
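
The idea of replacing references can be sketched as follows. This is a simplified illustration, not the package's actual implementation : the demo_replace_references function and the renumbering maps are hypothetical, only \fN and \cfN references are handled, and escaped backslashes in the text are not taken into account.

```php
<?php
// Hypothetical renumbering maps (old id => new id); not the package's internal structures.
$font_map  = [1 => 4, 2 => 5];
$color_map = [1 => 3];

// Rewrites \fN and \cfN references according to the maps; unknown ids are left unchanged.
function demo_replace_references(string $text, array $font_map, array $color_map): string
{
    return preg_replace_callback(
        '/\\\\(cf|f)(\d+)/',
        function (array $m) use ($font_map, $color_map): string {
            $map = ($m[1] === 'f') ? $font_map : $color_map;
            $id  = (int) $m[2];
            return '\\' . $m[1] . ($map[$id] ?? $id);
        },
        $text
    );
}

$body = '{\f1\cf1 Hello }{\f2 world}{\f9 !}';
$out  = demo_replace_references($body, $font_map, $color_map);
```

Note that a tag such as \fs24 is left untouched by this pattern, since the letters "fs" are not directly followed by the digits of a font id.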

protected function ReplaceStylesheetReferences ( )

Since stylesheets contain formatting tags, some of them may reference elements that need to be renumbered (colors, fonts, etc.). This method calls the ReplaceReferences method to perform the necessary replacements that apply to stylesheet contents.

Properties

protected $BodyOffset

Holds the byte offset, into the underlying Rtf document, of the body start.

private static $DEBUG = false

Outputs debug information when set to a combination of the RTFMERGER_DEBUG_* constants.

protected $Document

Holds the underlying document object.

protected $Parent

Holds the parent RtfMerger object.

Constants

RTFMERGER_DEBUG_* constants

The RTFMERGER_DEBUG_* constants can be used to define the RtfMergerDocument::$DEBUG property to output useful debug information :

  • RTFMERGER_DEBUG_COLOR_EXTRACTION : outputs debug information about color table entry renumberings.
  • RTFMERGER_DEBUG_FONT_EXTRACTION : outputs debug information about font table entry renumberings.
  • RTFMERGER_DEBUG_LIST_EXTRACTION : outputs debug information about list table entry renumberings.
  • RTFMERGER_DEBUG_OVERRIDE_EXTRACTION : outputs debug information about list override table entry renumberings.
  • RTFMERGER_DEBUG_STYLESHEET_EXTRACTION : outputs debug information about stylesheet table entry renumberings.
  • RTFMERGER_DEBUG_SETTINGS : outputs debug information about the header settings that have been found in the underlying document header.
  • RTFMERGER_DEBUG_ALL : enables all debug information.
  • RTFMERGER_DEBUG_NONE : disables all debug information.

RtfMergerHeader class

The RtfMergerHeader class is used internally by the RtfMerger class to collect header information from the various documents to be merged.

During this process of collecting information, the class has to be considered as passive : it is manipulated by the various RtfMergerDocument instances that represent the documents to be merged.
Each instance adds the header information contained in its own underlying Rtf document to this object.

Constructor

The constructor has no parameter and only instantiates an object of class RtfMergerHeader.

Methods

public function BuildHeader ( )

Returns the Rtf code for the header (aka Global header) of the output merged file.

public function GetColorTable ( )

Returns the Rtf code for the color table containing all the colors coming from the documents to be merged.

public function GetDocumentInformation ( )

Returns the Rtf code for document information (see the Document Information Properties section for more information).

public function GetFontTable ( )

Returns the Rtf code for the font table containing all the font definitions coming from the documents to be merged.

public function GetGenerator ( )

Returns the Rtf code for the generator entry, which specifies the software that has generated the document.

public function GetListTable ( )

Returns the Rtf code for the list table containing all the list definitions coming from the documents to be merged.

public function GetListOverrideTable ( )

Returns the Rtf code for the list override table containing all the list overrides coming from the documents to be merged.

public function GetStylesheetTable ( )

Returns the Rtf code for the stylesheet table containing all the stylesheets coming from the documents to be merged.

Properties

public $ColorTable = []

An associative array whose keys are the color definitions, and whose values are color indexes.

See the Color tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.
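
A table keyed by color definition could be maintained along the following lines. This is a minimal sketch under stated assumptions : the demo_merge_colors function is hypothetical, indexes are assumed to start at 0, and the real class may number entries differently.

```php
<?php
// Global table: color definition => index in the merged document (illustrative only).
$global_colors = [];

// Registers each color of a document and returns a map of local index => global index.
function demo_merge_colors(array $doc_colors, array &$global_colors): array
{
    $renumbering = [];
    foreach ($doc_colors as $local_index => $definition) {
        if (!isset($global_colors[$definition])) {
            // New color: it gets the next free index in the global table.
            $global_colors[$definition] = count($global_colors);
        }
        $renumbering[$local_index] = $global_colors[$definition];
    }
    return $renumbering;
}

$map1 = demo_merge_colors(['\red0\green0\blue0', '\red255\green0\blue0'], $global_colors);
$map2 = demo_merge_colors(['\red255\green0\blue0', '\red0\green0\blue255'], $global_colors);
```

In this example, the red color of the second document already exists in the global table, so it is mapped to the existing index instead of being added a second time.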

Document information properties

Document information is a special compound tag (\info) that allows you to specify creator's information in an Rtf document. All the properties below can be accessed through the RtfMerger object, without the "Info" prefix :

  • InfoTitle
  • InfoSubject
  • InfoAuthor
  • InfoManager
  • InfoCompany
  • InfoOperator
  • InfoCategory
  • InfoKeywords
  • InfoComment
  • InfoSummary
  • InfoVersion

All those properties default to the value false. When set to a string, they will be written in the \info tag when generating the merged document.

The creation and revision times will also be automatically added to the output document information.

Note that the InfoKeywords property is an array of strings.

See the Document information properties section of Merger process for more information on how these properties can be accessed directly through an RtfMerger object.
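
To give an idea of the Rtf code that such properties translate to, here is a hedged sketch that builds an \info group. The demo_build_info and demo_rtf_escape functions are hypothetical, and joining keywords with a space is an assumption ; only the \title, \author and \keywords control words of the Rtf specification are used.

```php
<?php
// Escapes the characters that have a syntactic meaning in Rtf.
function demo_rtf_escape(string $value): string
{
    return str_replace(['\\', '{', '}'], ['\\\\', '\\{', '\\}'], $value);
}

// Builds an \info group from the properties that are set (false means "not set").
function demo_build_info(array $properties): string
{
    $rtf = '';
    foreach ($properties as $tag => $value) {
        if ($value === false) {
            continue;
        }
        if (is_array($value)) {          // e.g. a list of keywords
            $value = implode(' ', $value);
        }
        $rtf .= '{\\' . $tag . ' ' . demo_rtf_escape($value) . '}';
    }
    return '{\\info' . $rtf . '}';
}

$info = demo_build_info([
    'title'    => 'Merged report',
    'author'   => false,
    'keywords' => ['rtf', 'merge'],
]);
```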

public $FontTable = []

An associative array whose keys are the md5 hash of the "anonymized" version of the font definition, and whose values are associative arrays containing the following entries :

  • def : the complete Rtf code for the font definition, including the font id, which has been renumbered if needed
  • id : the font id, which may have been renumbered if needed.

See the Font tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.
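
The principle of keying the table by the hash of an "anonymized" definition can be sketched as follows. The demo_add_font function is hypothetical, and both the way the id is stripped and the numbering scheme (starting at 0) are assumptions made for the purpose of the example.

```php
<?php
// Global font table: md5 of the anonymized definition => [ 'def' => ..., 'id' => ... ].
$font_table = [];

// Adds a font definition such as '{\f3\froman Times New Roman;}' to the table and
// returns the id it has in the merged document.
function demo_add_font(string $definition, array &$font_table): int
{
    // "Anonymize" : drop the numeric id so that identical fonts hash identically.
    $anonymized = preg_replace_callback(
        '/\\\\f\d+/',
        function (): string { return '\\f'; },
        $definition,
        1
    );
    $key = md5($anonymized);

    if (!isset($font_table[$key])) {
        $id = count($font_table);
        $font_table[$key] = [
            // Store the definition with its (possibly new) id.
            'def' => preg_replace_callback(
                '/\\\\f\d+/',
                function () use ($id): string { return '\\f' . $id; },
                $definition,
                1
            ),
            'id'  => $id,
        ];
    }
    return $font_table[$key]['id'];
}

$a = demo_add_font('{\f1\froman Times New Roman;}', $font_table);
$b = demo_add_font('{\f7\fswiss Arial;}', $font_table);
$c = demo_add_font('{\f2\froman Times New Roman;}', $font_table);   // same font as $a
```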

public $ListTable = []

An associative array whose keys are the md5 hash of the "anonymized" version of the list definition, and whose values are associative arrays containing the following entries :

  • def : the complete Rtf code for the list definition, including the list id, which has been renumbered if needed
  • id : the list id, which may have been renumbered if needed.

See the List tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

public $ListOverrideTable = []

An array containing the list overrides, where the references to the list entries have been renumbered when necessary.

See the List override tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

public $NextShapeId = 1000

Shapes are one of the rare elements contained in the body part of a document that need to be renumbered to avoid conflicts across multiple documents.
The initial value is 1000. It could be anything else, but Microsoft Word seems to like starting at this value when numbering shapes, so it has been chosen to be the initial value of the very first shape encountered in the very first document.

This number is incremented each time a new shape is found in some document.

See the Shapes section of Merger process for more information on how shape numbering is processed across multiple documents.
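
Shape renumbering with a shared counter might look like the sketch below. It assumes that shape ids appear as \shplidN control words and that each occurrence denotes a distinct shape ; the demo_renumber_shapes function is hypothetical and does not reflect how the package actually tracks shapes.

```php
<?php
// Next free shape id, shared by all the documents being merged (1000, as in the text above).
$next_shape_id = 1000;

// Gives every \shplidN occurrence of a document body a fresh id taken from the shared counter.
function demo_renumber_shapes(string $body, int &$next_shape_id): string
{
    return preg_replace_callback(
        '/\\\\shplid\d+/',
        function () use (&$next_shape_id): string {
            return '\\shplid' . $next_shape_id++;
        },
        $body
    );
}

$doc1 = demo_renumber_shapes('{\shp{\*\shpinst\shplid1026}}{\shp{\*\shpinst\shplid1027}}', $next_shape_id);
$doc2 = demo_renumber_shapes('{\shp{\*\shpinst\shplid1026}}', $next_shape_id);
```

Both input documents use the id 1026, but the shared counter guarantees that the ids no longer collide once the bodies are put together.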

public $Settings = []

An associative array whose keys are tags (aka Control Words, in the Microsoft documentation), and whose values are the tag parameters.

See the Global properties section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

public $StylesheetTable = []

An associative array whose keys are the md5 hash of the "anonymized" version of the stylesheet definition, and whose values are associative arrays containing the following entries :

  • def : the complete Rtf code for the stylesheet definition, including the stylesheet id, which has been renumbered if needed
  • id : the stylesheet id, which may have been renumbered if needed.

Note that stylesheets can include tags such as \sbasedonx, \snextx and \slinkx, where x is also a stylesheet index.
Those indexes are also renumbered if necessary.

See the Stylesheet tables section of Merger process for more information on how this table is built and updated when processing new documents to be merged.

RtfToken classes

The NextToken method of the RtfParser class is used to parse Rtf documents and retrieve the next token available from the Rtf stream.

The type of value returned by this method is always an object inheriting from the RtfToken abstract class.

If you have browsed the RtfTools package documentation and source code, you may have noticed that most of the classes use their own, simplified, internal parser. This is the case for example for classes such as RtfBeautifier and RtfTemplater, where the parsing needs are really basic and do not need an elaborate method to analyze Rtf contents.

In some situations, however, you may have more complex needs in terms of parsing. This is the case of the RtfTexter class, which needs to differentiate between pure plain text, and what is to be considered as a parameter of a compound statement ; you will encounter such situations with font definitions for example, which look like this :

{\f1 ... Times New Roman;}
...
{\par ... This is a sample paragraph.}

The string "Times New Roman;" in the above example is not plain text, but rather the display name of the font identified by id #1 (\f1 tag).

The second construct, however, introduces a new paragraph, whose contents are "This is a sample paragraph.". By using the RtfParser class, you will be able to determine whether the text specified before the closing brace is to be interpreted as text or not.

The sections below describe the various kinds of objects returned by the NextToken method of the RtfParser class.

The section related to the RtfToken class shows the methods and properties common to all its derived classes. The sections related to classes inheriting from RtfToken will only show the differences and additions specific to those classes.

All of these classes are instantiated by the NextToken method of the RtfParser class, and are not meant to be instantiated from other places.

Class diagram

The following diagram shows the hierarchy of the various RtfToken classes :

RtfToken class

The RtfToken abstract class provides public properties and methods that are common to every syntactic element that can be found in an Rtf document.

Constructor

The RtfToken class constructor is called by all its derived classes and has the following signature :

public function __construct ( $type, $text, $space_after, $offset, $line, $column )

The parameters are the following :

  • $type : Token type (one of the TOKEN_* constants defined in the RtfDocument class).
  • $text : The whole token text, as found in the Rtf stream, including the optional space that may follow it.
  • $space_after : True if the token is followed by a space (a space following a token is considered as belonging to the token).
  • $offset : Byte offset of the start of the token, in the Rtf input stream.
  • $line : Line number where the token starts, in the Rtf input stream.
  • $column : Column number where the token starts, in the Rtf input stream.

Methods

public function ToRtf ( )

Returns the whole token, as it was found in the Rtf stream.

public function ToText ( )

Returns the whole token, as it was found in the Rtf stream.
This behavior may be changed by derived classes.

public function __tostring ( )

A synonym for ToRtf.

Properties

public $Column

Column number of the start of the Rtf tag in the input document. Column numbers start at 1.

public $Line

Line number of the start of the Rtf tag in the input document. Line numbers start at 1.

public $Offset

Byte offset of the start of the Rtf tag in the input document. Byte offsets start at 0.

public $SpaceAfter

Set to true if the related tag has a space after (spaces after a control word are to be considered as being part of the control word, not as plain text).

public $Text

Contains the Rtf syntactic element, as it has been found in the input Rtf stream.

public $Type

Token type, as described in the TOKEN_* section of the RtfDocument class.

RtfControlSymbolToken class

Implements a control symbol token, such as \~ (unbreakable space) or \- (optional hyphen).

Constructor

The constructor of the RtfControlSymbolToken class has the following signature :

public function __construct ( $char, $offset, $line, $column )

The $char parameter indicates the character following the leading backslash. Other parameters are the same as for the RtfToken class.

Methods

public function ToText ( )

Returns the token text, i.e. the real character expressed by the input Rtf tag. The following substitutions occur :

  • \~ (unbreakable space) : the returned value will be a space.
  • \- (optional hyphen) and \_ (non-breaking hyphen) : the returned value will be the string "-".
  • All other values : the character after the leading backslash will be returned.
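
These substitution rules can be expressed in a few lines. The demo_control_symbol_text function below is a hypothetical standalone sketch of the rules listed above, not the actual method body :

```php
<?php
// Maps the character following the backslash to its plain-text equivalent.
function demo_control_symbol_text(string $char): string
{
    switch ($char) {
        case '~':               // \~ : unbreakable space
            return ' ';
        case '-':               // \- : optional hyphen
        case '_':               // \_ : non-breaking hyphen
            return '-';
        default:                // any other symbol: return the character itself
            return $char;
    }
}
```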

RtfControlWordToken class

Implements a control word, such as \par or \f12.

This class also handles "special" control words, that are preceded by the \* special construct, such as in : \*\panose.

Constructor

The class constructor has the following signature :

public function __construct ( $word, $space, $special, $offset, $line, $column )

The $word parameter holds the control word itself, followed by its optional integer parameter.

The $special parameter is a boolean value that indicates whether the control word was preceded by the \* special construct or not.

Other parameters are the same as for the RtfToken class.

Properties

ControlWord

Control word. For a tag such as \*\pnseclvl1, this property will contain the string "pnseclvl".

Parameter

Holds the optional integer parameter after the control word. For a tag such as \*\pnseclvl1, this property will contain the integer value 1.

If the control word does not contain any parameter, this property will be set to the empty string.

Special

A boolean value that indicates whether the control word is a special one, i.e. preceded by the \* construct.
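
The way a control word is split into the ControlWord and Parameter properties described above can be illustrated with a regular expression. The demo_split_control_word function is a hypothetical sketch ; it assumes that the leading backslash has already been removed and that the parameter may be negative.

```php
<?php
// Splits a control word such as 'pnseclvl1' into its name and optional integer parameter.
// The parameter is the empty string when the control word has none.
function demo_split_control_word(string $word): array
{
    if (!preg_match('/^([a-zA-Z]+)(-?\d+)?$/', $word, $m)) {
        return ['word' => $word, 'parameter' => ''];
    }
    return [
        'word'      => $m[1],
        'parameter' => isset($m[2]) ? (int) $m[2] : '',
    ];
}
```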

RtfDataToken class

The RtfDataToken class is a base abstract class for Rtf compound constructs that end with some data before the last closing brace. Such a construct could be for example a picture, denoted by the \pict control word.

Constructor

The constructor of the RtfDataToken class has the following signature :

public function __construct ( $type, $data, $offset, $line, $column )

The $data parameter holds the data that has been found before the last closing brace.

Other parameters are the same as for the RtfToken class.

RtfBDataToken class

The RtfBDataToken class is intended for what may be the only tag in the Rtf specification whose parameter gives the length of the data immediately following it : the \bin tag.

The following example defines some binary data which is 10 bytes long :

{\bin10 0123456789}

Constructor

The constructor of the RtfBDataToken class has the following signature :

public function __construct ( $data, $offset, $line, $column )

The $data parameter holds the binary data located just after the \bin control word (in the example above, this would be the string "0123456789").

Other parameters are the same as for the RtfToken class.
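
Reading length-prefixed data of this kind could be sketched as follows. The demo_read_bin function is hypothetical and deliberately simplified : it looks for the first \binN tag followed by a single delimiter space and returns the N bytes that come after it.

```php
<?php
// Reads the data of a \binN construct: N gives the number of raw bytes that follow
// the single delimiter space. Returns null when the input does not match.
function demo_read_bin(string $rtf): ?string
{
    if (!preg_match('/\\\\bin(\d+) /', $rtf, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $length = (int) $m[1][0];
    $start  = $m[0][1] + strlen($m[0][0]);   // first byte after the delimiter space
    $data   = (string) substr($rtf, $start, $length);

    // Reject truncated input, where fewer than N bytes are available.
    return strlen($data) === $length ? $data : null;
}

$data = demo_read_bin('{\bin10 0123456789}');
```

Because the length is known in advance, the data can safely contain braces or backslashes without being mistaken for Rtf syntax.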

Properties

RelatedControlWord

Indicates the control word which is related to this data entry (\pict for pictures, \bin for binary data, and any other control word that starts a compound statement containing character data).

RtfPCDataToken class

Holds free-form text data specified within curly braces.

Constructor

The constructor of the RtfPCDataToken class has the following signature :

public function __construct ( $data, $offset, $line, $column )

The $data parameter holds text data located just before the closing brace.

Other parameters are the same as for the RtfToken class.

Methods

ToText

Returns the character data after removing newlines and carriage returns, which are not part of the text.
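
In its simplest form, this kind of cleanup amounts to the following (the demo_pcdata_text function is a hypothetical standalone equivalent, shown for illustration only) :

```php
<?php
// Removes carriage returns and newlines, which are not significant in Rtf text data.
function demo_pcdata_text(string $data): string
{
    return str_replace(["\r", "\n"], '', $data);
}

$text = demo_pcdata_text("This is a sample\r\n paragraph.");
```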

RtfSDataToken class

The RtfSDataToken class holds hexadecimal data that represents an embedded image. This is typically the kind of data found in \pict tags :

{\pict 0ABC2937DF...}
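
Should you need the raw bytes behind such hexadecimal data, a conversion could look like the sketch below. The demo_decode_hex_data function is hypothetical and is not part of the package ; it assumes that whitespace inside the data can be ignored.

```php
<?php
// Converts the hexadecimal payload of a picture into raw bytes.
// Whitespace and line breaks inside the payload are ignored; returns null on malformed data.
function demo_decode_hex_data(string $hex): ?string
{
    $hex = preg_replace('/\s+/', '', $hex);
    if ($hex === '' || strlen($hex) % 2 !== 0 || !ctype_xdigit($hex)) {
        return null;
    }
    return hex2bin($hex);
}

$bytes = demo_decode_hex_data("0ABC\r\n2937DF");
```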

Constructor

The constructor of the RtfSDataToken class has the following signature :

public function __construct ( $data, $offset, $line, $column )

The $data parameter holds text data located just before the closing brace.

Other parameters are the same as for the RtfToken class.

RtfEscapedCharacterToken class

Holds a character representation specified using the \'xy notation, where x and y are hexadecimal digits representing the character code in the Windows Ansi character set.

Constructor

The constructor of the RtfEscapedCharacterToken class has the following signature :

public function __construct ( $hex, $offset, $line, $column )

The $hex parameter specifies the integer code of the character specification that has been found in the input Rtf stream.

Other parameters are the same as for the RtfToken class.

Methods

public function ToText()

Returns the underlying character value, as a string.

Properties

public $Char

Holds the string value of the character.

public $Ord

Holds the integer character code.
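
The relationship between the \'xy notation and the $Ord and $Char properties can be illustrated as follows. The demo_decode_escaped_char function is a hypothetical sketch ; the character it returns is a single byte in the Windows Ansi character set, so an additional conversion step would be needed to obtain, for example, UTF-8 text.

```php
<?php
// Decodes the two hexadecimal digits of a \'xy sequence.
// Returns [ 'ord' => int, 'char' => string ], or null if $hex is not exactly two hex digits.
function demo_decode_escaped_char(string $hex): ?array
{
    if (strlen($hex) !== 2 || !ctype_xdigit($hex)) {
        return null;
    }
    $ord = (int) hexdec($hex);
    return ['ord' => $ord, 'char' => chr($ord)];   // single byte, Windows Ansi code
}

$e_acute = demo_decode_escaped_char('e9');   // as found in the sequence \'e9
```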

RtfEscapedExpressionToken class

The RtfEscapedExpressionToken class is designed to represent escaped characters that may have a special syntactic meaning within an Rtf document, such as \{, \} and \\.

Although such cases could have been covered by the RtfControlSymbolToken class, they have been intentionally made distinct so that more advanced parsers can distinguish between both cases without requiring further testing.

Constructor

The constructor of the RtfEscapedExpressionToken class has the following signature :

public function __construct ( $char, $offset, $line, $column )

The $char parameter specifies the character immediately after the backslash.

Other parameters are the same as for the RtfToken class.

Methods

public function ToText ( )

Returns the character following the backslash, as a string.

Properties

public $Char

Holds the string value of the character.

RtfInvalidToken class

In some cases, the NextToken method of the RtfParser class can return a token having the RtfInvalidToken class, to indicate that something unexpected was found in the input Rtf stream.

Such cases can arise in the following situations :

  • The xy part of a character specification of the form \'xy is not a string of EXACTLY two hexadecimal digits.
  • The next character after a backslash cannot be interpreted as an escaped character, a character expression, a special character or the start of a control word.
  • The next character after a backslash is the end-of-file.

Constructor

The constructor of the RtfInvalidToken class has the following signature :

public function __construct ( $text, $offset, $line, $column )

All parameters have the same meaning as for the RtfToken class.

RtfLeftBraceToken class

The RtfLeftBraceToken class represents an opening brace, which is one of the basic Rtf syntactic elements.

Constructor

The constructor of the RtfLeftBraceToken class has the following signature :

public function __construct ( $space_after, $offset, $line, $column )

All parameters have the same meaning as for the RtfToken class.

RtfNewlineToken class

The RtfNewlineToken class represents a line break that has been encountered in the input Rtf stream. Since line breaks, which are normally represented by newlines or cr+lf's, are not significant, extended parsers relying on the RtfParser class can safely ignore them (note that the current line and column positions in the Rtf input stream will be updated accordingly anyway).

Constructor

The constructor of the RtfNewlineToken class has the following signature :

public function __construct ( $text, $offset, $line, $column )

All parameters have the same meaning as for the RtfToken class.

RtfRightBraceToken class

The RtfRightBraceToken class represents a closing brace, which is one of the basic Rtf syntactic elements.

Constructor

The constructor of the RtfRightBraceToken class has the following signature :

public function __construct ( $space_after, $offset, $line, $column )

All parameters have the same meaning as for the RtfToken class.