ExternalSort

Implements an external sort algorithm that can be used to sort the contents of large files that do not fit into memory. Large files may be needed for flat-file databases, and sorting allows for quick retrieval e.g. with a binary search algorithm.

ExternalSort can be invoked via object-oriented method calls or via the command line. The latter is likely preferable for asynchronous calls like

 exec('php ExternalSort.php YOURFILE 2>&1 &')
because large files take a while to sort.

Performance

ExternalSort implements a similar algorithm as GNU sort, but it is not as efficient and the processing time increases faster with increasing data load.

Measured Examples (user cpu time on IntelĀ® Pentium(R) 4 CPU 1.60GHz, 748.3 MiB RAM):
filesizeGNU sortExternalSort
93MB0m31.382s1m15.933s
185MB1m9.424s3m17.696s
370MB2m25.049s9m9.994s
740MB5m9.603s29m5.773s

Juxtaposing the user CPU cycle values (and adding extrapolated intermediate values):

This clearly shows that GNU sort is finding additional efficiencies in working on my sample files. However, ExternalSort's processing time increases linearly and it should do ok on fast machines (the above tests were run on an old clunker that makes inefficiencies very noticeable).

Raison d'etre

  1. it provides a pure PHP solution that is not dependent on external tools (such as GNU sort).
  2. PHP sort() defaults to sorting by ascii value while GNU sort is by default locale dependent. This means that GNU sort is incompatible with e.g. PHP's strcmp.
    It is possible to make PHP's sort locale-aware, but this also breaks compatibility with strcmp. Likewise, it is possible to make GNU sort use ascii values (by exporting LC_ALL=C before), but fiddling with a global locale variable may not be a good idea on a server.

Known Issue

ExternalSort fails when the bucket size is set to very small values such that line lengths may exceed the bucket size. I think this is a nonissue, because bucket sizes will in practice go well beyond any conceivable line length and for files with really long lines, such as images, sorting does not make sense anyway. At this point, I will therefore not try to handle this issue.

Object-oriented usage

	$ExternalSort = new ExternalSort();
	$ExternalSort->setFile($file);
	$ExternalSort->run();

Commandline usage examples

Getting usage hints:
$ php ExternalSort.php ?
usage: php ExternalSort.php [options] FILE     
options:
   --windowslinebreaks     switches from default unix linebreaks (\\n) 
                           to windows linebreaks (\\r\\n) in the 
                           outputfile
                           default: false = unix linebreaks
   --max=NUMBER            set maximum bucket file size to NUMBER
                           the script uses approximately 
                           bucketFileSize*6 of RAM
                           default: 5000000
    --out=FILENAME         sets the output file
                           default: FILE.sorted
    --verify               verify that the output file is sorted
                           default: false = no verification
    --check                verify that the input file is sorted
                           default: false = no check
    --trim                 trim all peripheral whitespace
                           default: false = no trimming
    --q             	   quiet mode: no output
    --v             	   display (hopefully) pretty/meaningful 
                           progress info
                           default: false = off
                           v: bucket is initialized
                           <: bucket is filled with data
                           |NUMBER: bucket NUMBER is split in half to 
                                    stay within max size
                           >: bucket is sorted and written to output
    ?|-?|help|-help|--help  display this help
Checking an unsorted file (here: its own source file) for sortedness:
$ php ExternalSort.php --check ExternalSort.php
verifying ExternalSort.php
.
ExternalSort.php is not sorted at line 3: 
verification took 0.01 s
Sorting a file with subsequent verification:
$ php ExternalSort.php --verify ExternalSort.php
numbuckets: 1
bucket 0
key:  
verifying ExternalSort.php.sorted
.
ExternalSort.php.sorted is sorted
verification took 0.02 s
wrote ExternalSort.php.sorted
Checking a sorted file for sortedness:
$ php ExternalSort.php --check ExternalSort.php.sorted
verifying ExternalSort.php.sorted
.
ExternalSort.php.sorted is sorted
verification took 0.02 s
Sorting a file with verbose output and specified bucket size:
$ php ExternalSort.php --v --max=2000 ExternalSort.php
showing tracking output
sorting file: ExternalSort.php
maxBucketSize: 2000
size of input file: 0.02 MB
init:  0 s, 0.5 MB peak
1
|0:2022B x 45 lines->1|2
|0:2596B x 72 lines->2|3
|0:3130B x 88 lines->3|4
|0:3288B x 79 lines->4|5
|3:3227B x 99 lines->5|6
|3:2416B x 63 lines->6||2:2257B x 67 lines->7||1:2263B x 90 lines->8|7
|5:2487B x 80 lines->9|8
|0:2102B x 50 lines->10||3:2786B x 59 lines->11||6:2130B x 57 lines->12|9
|4:2177B x 53 lines->13||2:2033B x 46 lines->14|10
|3:2302B x 46 lines->15|11
|5:2074B x 54 lines->16| 0.33 s, 0.57 MB peak
sort: numbuckets: 17
bucket 0
key: 						print "firstline:        >>$firstline<<\n";
bucket 1
key:                            default: false = no trimming
bucket 2
key:                     if ($this->trackingOutput){
bucket 3
key:                 elseif(!empty($option) && !in_array($option, array('--v', '--q'))){
bucket 4
key:                 } 
bucket 5
key:             $s = str_replace("\r\n", "\n", $s); //normalize linebreaks
bucket 6
key:             //$this->externalBuckets->checkBucketSizes($this->maxsize, $this->trackingOutput, $this->tmpdir);
bucket 7
key:             }
bucket 8
key:         $prevline='';
bucket 9
key:         $this->trim=false;
bucket 10
key:         if ( !$commandline){
bucket 11
key:         touch($path);
bucket 12
key:      * Constructor sets up {@link tmpdir} for temporary bucket files.
bucket 13
key:      * flag indicating whether we should output
bucket 14
key:      * via the commandline and the (@link run) method
bucket 15
key:     --verify               verify that the output file is sorted
bucket 16
key:     public function setFile($f){
>>>>>>>>>>>>>>>>> 0.14 s, 0.57 MB peak
sorting took 0.48 s
wrote ExternalSort.php.sorted