Pcompress uses an in-memory hash table to find similar regions of data, loads the chunk hashes of those regions and matches them to find exact chunk matches for deduplication. Currently this works on a single archive producing a compressed file as the output. The index is not required during decompression and is discarded after use. An archival store on the other hand needs to deal with multiple archives and needs a persistent index to keep deduplicating across archives that get added to the store. So there is an archive server which receives data streams from clients. Clients are responsible for identifying the files to be backed up, rolling them into a archive format like tar and sending the data stream to the server. The archive server holds the aforementioned similarity index.
The similarity index in Pcompress is a hash table, so looking at persistence, one quickly thinks of the numerous NoSQL solutions. Evaluate and benchmark one and use it. Even SQlite can be looked at, as it is embeddable, fast and reliable. Good enough for the use case at hand. After pondering this somewhat, it occurred to me that my approach in Pcompress is a scalable in-memory index. The key thing here is the “in-memory” piece. The design is centered around that and does not use an index that can overflow to disk. This in turn means that I do not require an on-disk index format. I just need to stash the records in a file and load them into an in-memory hash when the archive server daemon is started. So all I need is a simple flat file with a sequence of fixed-length records. When a new KV pair is added to the hash it is first appended to the flat file. This is almost like journaling with the journal being the data store itself. Write I/O remains largely sequential. When keys are deleted it can marked invalid in the flat file. A separate cleanup process (like Postgres VACUUMDB) can be used to eliminate the deleted records. Marking deleted records requires in-place updates which is simple because records are of fixed-length.
Thus I can dispense with a NoSQL library and keep things very simple and fast. This approach is similar to the Sparkey KV store from Spotify. The primary differences being the hash not stored to disk and ability to do deletes. Of course unlike Sparkey I want robustness against data corruption. Firstly I will use sync writes to the flat file. Secondly the record will be inserted into the in-memory hash only after disk write is successful.