After cooling off in the hills for a few days, curiosity got the better of me. I discussed in my previous post on chunk size distribution from the rolling hash algorithm that there is a potential trend that can be inferred. Regions having greater variability produce more smaller chunks while regions of uniformity produce fewer larger ones.
I wanted to do some more analysis to see whether there is indeed a trend, short of looking at actual data samples. A trend or correlation will indicate that the rolling hash fingerprint based chunking algorithm is working nicely with desirable properties. Now the question comes how do we define uniformity or conversely variability of data ? I considered two ways of looking at it:
- Spans or run-lengths of identical bytes values. Fewer longer spans will indicate low variability and vice versa.
- Zones where same pattern repeats. That is uniform repeating patterns.
Now point #2 is not suitable for our case since repeating patterns can generate repeating break-points in the rolling hash giving rise to many small chunks depending on the repeating pattern size. So from the point of view of our rolling hash this is a form of repeating variability rather than uniformity. This leaves point #1. Long spans of identical byte values will have less chance of triggering break-points since we check for a few least significant bits being zero as our break-point. Spans of zero will of course trigger many break-points but we can ignore that for this analysis as our dataset is quite varied.
I used an initial 24GB portion of the earlier 360GB dataset since I will generating a large number of data points. This initial portion also contains a wide variety of textual, binary (Linux and Windows) and media files. I decided to split the data into 1MB segments and chunk them with an average of 4KB chunk size. I also computed the following properties per segment (of course in a single thread):
- Variability Index
- Ratio of Total size of small chunks : Total segment size (in bytes)
- Ratio of Total size of large chunks : Total segment size (in bytes)
Where the following hold:
I somewhat arbitrarily chose the partition value of 16KB keeping in mind average chunk size of 4KB. Eventually all the three values above are ratios and can be directly compared. I also computed the average chunk size per segment and plotted it separately.
Chunk Size Distribution vs Data Variability
Splitting approx 24GB of data in to approx 1MB sized segments produced more than 23400 data items. The chunking algorithm in Pcompress splits segments at a rabin chunk boundary so that every segment gets a set of complete content-defined chunks. This causes a slightly variable segment size.
Since the number of points is very large the lines in the line graph merge together and look like an area graph. Looking at the graphs for a few moments yields the view that there is indeed a correlation between data variability and chunk size distribution. Regions of higher variability do indeed tend to produce smaller chunks and vice versa. Now lets zoom into the middle interesting portion of the Distribution graph to see some more detail (more than 11000 points):
The correlation between the three lines is more clearly visible here but it is not a 100% fit, but then, is anything perfect in life and this universe ? So our requirement of greater data variability producing smaller chunks and vice versa holds in general but does not hold all the time. We can do some quick statistical analysis to get a better idea. First thing to do is to add a couple of polynomial trend lines.
We can clearly see that total size of large chunks is inversely proportional to the data variability with a high degree of probability. This is exactly what we desire. So over a large amount of data localized variations even out and general trends are observed. Finally I also computed Pearson’s correlation coefficients:
So combined with these values and all the previous analysis we can say that clearly there exists a strong correlation of the form which is desirable. Pcompress appears to have a good chunking algorithm unless, of course, someone more experienced than me in this aspect points out shortcomings or mistakes.