The earlier KDE 4.2.4 that I built for BeleniX had a weird, persistent issue: various DBUS clients would time out after not receiving a response to their messages. It happened erratically, but when it did, the desktop was very slow to come up, applications took a while to open, and the “.xsession-errors” file filled up with DBUS timeout messages. At that time there were many other issues to resolve and this problem got ignored, not least due to its erratic nature.
The problem, however, persisted when I built the latest KDE 4.3.1. Updating to 4.3.1 took much less effort since the build recipes were already in place and 4.3.1 had far fewer bugs than 4.2.4. When testing in VirtualBox I noticed that this time the timeouts were consistent. So I started poking around with dbus-monitor and qdbusviewer and eventually found that clicking on the kded node in qdbusviewer caused a timeout, with the corresponding messages showing up in dbus-monitor. So kded4 was stuck. Next I ran pstack on kded4, and the stack showed that it was stuck on a write to the Gamin file descriptor. I promptly got a stack of the gam_server process and found that one of its threads was also blocked on a write. Using pfiles I saw that both file descriptors pointed to the same Unix domain socket. AHA, so they were deadlocked: each side was blocked writing to the other and neither was reading.
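To make the failure mode concrete, here is a minimal Python sketch of two peers that both write to the same Unix domain socket pair without reading. It is not the kded4 or gam_server code; the names `blocked_writer` and `demo_deadlock` and the payload size are made up for illustration, and a socket timeout is used only so that the demo returns instead of hanging the way the real processes did.

```python
import socket
import threading

# Illustrative size only: far larger than typical socket buffers.
PAYLOAD = b"x" * (8 * 1024 * 1024)


def blocked_writer(sock, name, results):
    """Write without ever reading and record whether the write stalled."""
    try:
        sock.sendall(PAYLOAD)
        results[name] = "sent"
    except socket.timeout:
        # Without the timeout the thread would simply hang here,
        # which is what pstack showed for both processes.
        results[name] = "stuck in write"


def demo_deadlock(timeout=0.5):
    # socketpair() stands in for the socket shared by the two processes.
    client, server = socket.socketpair()
    client.settimeout(timeout)
    server.settimeout(timeout)
    results = {}
    threads = [
        threading.Thread(target=blocked_writer, args=(client, "client", results)),
        threading.Thread(target=blocked_writer, args=(server, "server", results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    client.close()
    server.close()
    return results


if __name__ == "__main__":
    print(demo_deadlock())
```

Once each side's buffer fills up, both writers wait for the other end to read, and since neither ever does, both stay stuck.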
Now, Gamin is a drop-in replacement for FAM (File Alteration Monitor), which monitors files and directories and notifies consumers of changes. Gamin uses inotify on Linux. The OpenSolaris port done by the JDS team uses File Event Notification, which is similar to Linux inotify but is built on the more generic Event Ports framework. KDE uses the KDirWatch class, which in turn communicates with Gamin, and I was using the OpenSolaris Gamin port. It appears that KDirWatch uses a single thread, while Gamin can send back events at any time, even while the consumer is in the middle of a subscribe call that has not yet returned. Indeed, the OpenSolaris port of Gamin sends back events as part of processing a new subscription. KDirWatch has additional calls around its calls to FAMMonitorFile and FAMMonitorDirectory, with comments about avoiding a deadlock, but as I could clearly see, that is not enough. This looks like a design shortcoming to me. Ideally KDirWatch should use one thread to handle asynchronous notifications and issue subscription requests from another thread.
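The two-thread arrangement suggested above can be sketched in Python as follows. This is only an illustration under assumptions, not how KDirWatch or Gamin is actually written: the wire format (fixed-size requests), the sizes, and the names `monitor`, `drain_events` and `demo_reader_thread` are all hypothetical. The toy monitor answers every subscription with a burst of events, and a dedicated reader thread keeps draining them so the subscribing thread never ends up blocked against a blocked peer.

```python
import socket
import threading

REQUEST_SIZE = 256                       # hypothetical fixed-size subscribe request
N_REQUESTS = 2000
EVENTS_PER_REQUEST = b"E" * 4096         # burst of events sent for each subscription


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closed first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf.extend(chunk)
    return bytes(buf)


def monitor(sock):
    """Toy monitor: sends events back while handling each subscription."""
    while recv_exact(sock, REQUEST_SIZE) is not None:
        sock.sendall(EVENTS_PER_REQUEST)
    sock.close()                         # end of stream for the reader thread


def drain_events(sock, counter):
    """Reader thread: consumes events so the monitor's writes can complete."""
    while True:
        chunk = sock.recv(65536)
        if not chunk:
            break
        counter["bytes"] += len(chunk)


def demo_reader_thread():
    client, server = socket.socketpair()
    counter = {"bytes": 0}
    srv = threading.Thread(target=monitor, args=(server,))
    reader = threading.Thread(target=drain_events, args=(client, counter))
    srv.start()
    reader.start()
    # Subscriptions are issued from this thread while the reader drains events.
    for _ in range(N_REQUESTS):
        client.sendall(b"S" * REQUEST_SIZE)
    client.shutdown(socket.SHUT_WR)      # no more subscriptions
    srv.join()
    reader.join()
    client.close()
    return counter["bytes"]


if __name__ == "__main__":
    print(demo_reader_thread())
```

If the reader thread is removed and events are only read after all subscriptions have been sent, the monitor's writes can fill the socket buffer and stall, which in turn stalls the subscriber: the same shape of deadlock as the one described above.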
OK, all good, but what was I to do now? Having come so close to finishing the KDE 4.3.1 build, I was in no mood to sit down and start changing KDirWatch. One alternative was to disable FAM and use polling mode, but that would be horrible. Eventually I modified the Gamin patch for OpenSolaris so that it does not send back some of the events during the subscription flow, and that did the trick for now. The DBUS timeouts are gone. I am not sure what impact this will have on GNOME 2.26, but at least KDE, which is the primary desktop for BeleniX, is working. BTW, KDE 4.3.1 on BeleniX 0.8Alpha is now available. I will write a separate post on that.