Implements an external sort algorithm that can be used to sort the contents of large files that do not fit into memory. Large files may be needed for flat-file databases, and sorting allows for quick retrieval e.g. with a binary search algorithm.
ExternalSort can be invoked via object-oriented method calls or via the command line. The latter is likely preferable for asynchronous calls like
exec('php ExternalSort.php YOURFILE 2>&1 &')because large files take a while to sort.
ExternalSort implements a similar algorithm as GNU sort, but it is not as efficient and the processing time increases faster with increasing data load.
Measured Examples (user cpu time on IntelĀ® Pentium(R) 4 CPU 1.60GHz, 748.3 MiB RAM):filesize | GNU sort | ExternalSort |
---|---|---|
93MB | 0m31.382s | 1m15.933s |
185MB | 1m9.424s | 3m17.696s |
370MB | 2m25.049s | 9m9.994s |
740MB | 5m9.603s | 29m5.773s |
$ExternalSort = new ExternalSort(); $ExternalSort->setFile($file); $ExternalSort->run();
$ php ExternalSort.php ? usage: php ExternalSort.php [options] FILE options: --windowslinebreaks switches from default unix linebreaks (\\n) to windows linebreaks (\\r\\n) in the outputfile default: false = unix linebreaks --max=NUMBER set maximum bucket file size to NUMBER the script uses approximately bucketFileSize*6 of RAM default: 5000000 --out=FILENAME sets the output file default: FILE.sorted --verify verify that the output file is sorted default: false = no verification --check verify that the input file is sorted default: false = no check --trim trim all peripheral whitespace default: false = no trimming --q quiet mode: no output --v display (hopefully) pretty/meaningful progress info default: false = off v: bucket is initialized <: bucket is filled with data |NUMBER: bucket NUMBER is split in half to stay within max size >: bucket is sorted and written to output ?|-?|help|-help|--help display this helpChecking an unsorted file (here: its own source file) for sortedness:
$ php ExternalSort.php --check ExternalSort.php verifying ExternalSort.php . ExternalSort.php is not sorted at line 3: verification took 0.01 sSorting a file with subsequent verification:
$ php ExternalSort.php --verify ExternalSort.php numbuckets: 1 bucket 0 key: verifying ExternalSort.php.sorted . ExternalSort.php.sorted is sorted verification took 0.02 s wrote ExternalSort.php.sortedChecking a sorted file for sortedness:
$ php ExternalSort.php --check ExternalSort.php.sorted verifying ExternalSort.php.sorted . ExternalSort.php.sorted is sorted verification took 0.02 sSorting a file with verbose output and specified bucket size:
$ php ExternalSort.php --v --max=2000 ExternalSort.php showing tracking output sorting file: ExternalSort.php maxBucketSize: 2000 size of input file: 0.02 MB init: 0 s, 0.5 MB peak 1 |0:2022B x 45 lines->1|2 |0:2596B x 72 lines->2|3 |0:3130B x 88 lines->3|4 |0:3288B x 79 lines->4|5 |3:3227B x 99 lines->5|6 |3:2416B x 63 lines->6||2:2257B x 67 lines->7||1:2263B x 90 lines->8|7 |5:2487B x 80 lines->9|8 |0:2102B x 50 lines->10||3:2786B x 59 lines->11||6:2130B x 57 lines->12|9 |4:2177B x 53 lines->13||2:2033B x 46 lines->14|10 |3:2302B x 46 lines->15|11 |5:2074B x 54 lines->16| 0.33 s, 0.57 MB peak sort: numbuckets: 17 bucket 0 key: print "firstline: >>$firstline<<\n"; bucket 1 key: default: false = no trimming bucket 2 key: if ($this->trackingOutput){ bucket 3 key: elseif(!empty($option) && !in_array($option, array('--v', '--q'))){ bucket 4 key: } bucket 5 key: $s = str_replace("\r\n", "\n", $s); //normalize linebreaks bucket 6 key: //$this->externalBuckets->checkBucketSizes($this->maxsize, $this->trackingOutput, $this->tmpdir); bucket 7 key: } bucket 8 key: $prevline=''; bucket 9 key: $this->trim=false; bucket 10 key: if ( !$commandline){ bucket 11 key: touch($path); bucket 12 key: * Constructor sets up {@link tmpdir} for temporary bucket files. bucket 13 key: * flag indicating whether we should output bucket 14 key: * via the commandline and the (@link run) method bucket 15 key: --verify verify that the output file is sorted bucket 16 key: public function setFile($f){ >>>>>>>>>>>>>>>>> 0.14 s, 0.57 MB peak sorting took 0.48 s wrote ExternalSort.php.sorted