Parallel Directory Tree Compare in Python

A friend asked me if there is any way to quickly use multiple threads to compare two directory trees. He was dealing with data migration scenarios with directory trees containing potentially millions of files and he would like to compare the copy with the original. A normal recursive diff would take days.

After trying out a few experiments with shell scripts and some such it finally boiled down to Python and it’s great multiprocessing module to get the work done. The full code is pasted below and is mostly self-explanatory. On my dual-core hyperthreaded laptop it takes half the time as compared to a recursive diff (with 26000 files). However this thing is primarily I/O bound and the tinny laptop’s drive cannot keep up with the I/O demands from multiple threads. A good server will give decent speedup compared to normal diff.

import os, sys
import hashlib
import stat
import multiprocessing
from multiprocessing import Process, JoinableQueue

num_procs = multiprocessing.cpu_count()
num_procs *= 4

def get_hash(fh, sz):
        h = hashlib.sha256()
        while True:
                buf =
                if len(buf) == 0: break
        return h.digest()

def chk(path, lf, dir):
        path1 = os.path.join(dir, path)
        st = None
        st1 = None
                st = os.lstat(path)
                st1 = os.lstat(path1)
                lf.write("Missing: " + path1 + "\n")

        if not stat.S_ISREG(st.st_mode):
        if st.st_size != st1.st_size:
                lf.write("Size differ: " + path1 + "\n")
        if st.st_size == 0: return

        hv = None
        hv1 = None
                fh = open(path, "r")
                hv = get_hash(fh, st.st_size)
                fh = open(path1, "r")
                hv1 = get_hash(fh, st.st_size)
                lf.write("Open error: " + path1 + "\n")

        if hv != hv1:
                lf.write("Digests differ: " + path1 + "\n")

def proc_chk(q, lf, dir):
        while True:
                path1 = q.get()
                if path1 == "done":
                chk(path1, lf, dir)

q = JoinableQueue()
lf = open("/var/tmp/differ_files", "w+")
dir1 = os.path.realpath(sys.argv[1])
dir2 = os.path.realpath(sys.argv[2])

o_cwd = os.getcwd()
cwd = os.getcwd()

for i in range(0, num_procs):
        p = Process(target=proc_chk, args=(q, lf, dir2))

for dirpath, dirnames, filenames in os.walk(".", followlinks=False):
        for f in filenames:
                q.put(os.path.join(dirpath, f))

for i in range(0, num_procs):

It should be called like this: python <dir1> <dir2> . It will generate a list of any differing files in /var/tmp/differ_files.


4 thoughts on “Parallel Directory Tree Compare in Python

  1. kostas

    thank you for this post. it is running for me perfectly. but i have some questions. please help with that. is that the best solution to compare directories. can we be 10000% sure? what about to compare all the common things and take the diffrences with something like dircmp?

    1. moinakg Post author

      The python program is incomplete for the purpose which you are considering. The program only compares regular files. To be complete it needs to do the following:

      • Detect presence or absence of non-file path entries including directories.
      • Compare date/time stamps.
      • Compare Xattrs and ACLs, if present (This can be an optional switch.)
      1. kostas

        thank you for your answer. to be honest , i totally agree with you especially for the first one, on which i am working now. BUT should i do this with hash(this code,improved) or should i choose a totally other way like trees equal match,mismatch,errors, dirs_cmp = filecmp.dircmp(dir1, dir2)?

  2. kostas

    also i would like to ask something else. I took your code for the comparison and also i added statistics for emprty files, total size, numbers of items e.t.c. I have an idea to split them up in 2 dif programmes : statistics & comparison. How can i call one program from the other? I dont know and i cannot find something relate to this. I wanna run only one of them and this one to run the rest. Please give me some links and some info about, if you want, of course.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s