I pushed out another beta of Pcompress a couple of days back. This release contains a whole bunch of improvements, including performance optimizations, a smaller memory footprint and new algorithm support. Dedupe performance has increased and the chunking approach has been improved. I added support for libbsc, an excellent new Block-Sorting Compressor developed by Ilya Grebnov. It uses the same Burrows-Wheeler transform technique that is found in Bzip2 but uses different, more advanced encoding algorithms that provide better compression than Bzip2 at a higher speed. It also has built-in parallelism via OpenMP.
I ported the Lempel-Ziv pre-compression code from libbsc into Pcompress, and it improves the compression ratio for all the other compression algorithms, from lz4 to zlib to LZMA. Very interesting stuff. I also added the multithreaded LZMA port from p7zip. All these natively multithreaded algorithm implementations provide a speed boost when the entire file consists of a single chunk or very few chunks. However, they also required new logic to balance the available threads between chunk processing and the algorithms' own internal threads.
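To make the thread-balancing idea concrete, here is a minimal sketch in Python. It is not the logic Pcompress actually uses; the function name `split_threads` and the policy (give each chunk one worker first, then hand leftover threads to the algorithm) are assumptions made purely for illustration.

```python
def split_threads(total_threads, num_chunks):
    """Hypothetical policy, not Pcompress's actual code.

    Returns (chunk_threads, algo_threads_per_chunk):
    - chunk_threads: how many chunks are processed in parallel
    - algo_threads_per_chunk: threads given to a natively
      multithreaded algorithm (e.g. libbsc or LZMA) for each chunk
    """
    if total_threads < 1 or num_chunks < 1:
        raise ValueError("total_threads and num_chunks must be >= 1")

    # Never run more chunk workers than there are chunks or threads.
    chunk_threads = min(total_threads, num_chunks)

    # Spread the remaining capacity evenly inside each chunk worker,
    # always leaving at least one thread for the algorithm itself.
    algo_threads = max(1, total_threads // chunk_threads)
    return chunk_threads, algo_threads


if __name__ == "__main__":
    for chunks in (1, 3, 100):
        print(chunks, split_threads(8, chunks))
```

With a single chunk all threads go to the algorithm, and with many chunks each chunk worker runs the algorithm single-threaded, which matches the case described above where the built-in parallelism matters most for files with one or very few chunks.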
For the next steps, here are some of the things I am looking at:
- Use a stronger digest rather than depending on CRC64, as pointed out in a comment on my previous post. I am considering SHA-512/256 or Skein-512-256.
- Implement verbose progress update, more statistics and more debug info.
- Add a test suite.
- Ability to do faster fixed-block dedupe in addition to the current content-aware chunking.
- Implement global dedupe. Currently deduplication is only at the chunk level. Implementing global dedupe will require an external index with more metadata and will allow incremental addition of archives/files. This is, however, a major feature with lots of scalability and performance considerations.
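The difference between fixed-block and content-aware chunking, and the role of a digest-keyed index, can be sketched roughly as below. This is an illustration under stated assumptions, not Pcompress code: the rolling sum is a simple stand-in for whatever chunking algorithm Pcompress really uses, SHA-256 stands in for the stronger 32-byte digest under consideration, and all function names and parameters are hypothetical.

```python
import hashlib


def fixed_blocks(data, block_size):
    """Split data into equal-sized blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def content_defined_chunks(data, window=16, mask=0x3F,
                           min_size=32, max_size=1024):
    """Cut chunks where a rolling sum of the last `window` bytes hits a pattern.

    The rolling sum is an assumed, simplified boundary function.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        length = i - start + 1
        at_boundary = length >= min_size and (rolling & mask) == mask
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def dedupe(blocks):
    """Store each unique block once, keyed by a 32-byte digest.

    Returns (index, refs): index maps digest -> block, refs lists the
    digests in original order so the data can be rebuilt.
    """
    index, refs = {}, []
    for block in blocks:
        digest = hashlib.sha256(block).digest()  # stand-in strong digest
        index.setdefault(digest, block)
        refs.append(digest)
    return index, refs


if __name__ == "__main__":
    data = bytes(range(256)) * 4
    index, refs = dedupe(fixed_blocks(data, 256))
    print(len(refs), "blocks,", len(index), "unique")
```

Fixed blocks are cheaper to compute because no per-byte boundary test is needed, but inserting a single byte shifts every later block boundary. Content-defined boundaries depend only on nearby bytes, so they tend to realign after an insertion. A global dedupe would need an index like the dictionary here, but persistent and shared across archives.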
Eventually I will also have to clearly document the file format. Till now I have refrained from that since things were in too much flux, but this will have to be done when the format stabilizes. One change that will happen, for example, is using 32 bytes instead of the current 8 bytes for the checksum.