Multi-threaded GTK applications – Part 2: java-gnome

I have long cherished a dream to make the Java bindings for GNOME thread safe. I didn’t know if it could be done, but I sure wanted to try.

The workaround you needed in order to make GTK calls from other threads in the old java-gnome was terrible. It’s terrible in all the other major Java widget toolkits too. This is rather surprising, actually: one of Java’s strengths is its support for using threads. Now don’t get me wrong: multiple threads are not a panacea and I know as well as anyone that you can shoot yourself in the foot quite easily when doing threaded programming. But it is easy to get on with writing concurrent programs in Java, and threads are a fundamental feature of the runtime environment. Thus it has always struck me as a big letdown that the GUI toolkits aren’t thread-safe (worse, in fact: with the “GUI use must be single threaded” requirement, they are thread-hostile).

So I’ve been studying this for a while now. Could we do better in java-gnome 4.0?

Java’s synchronization primitives

Anyone who has done Java programming will be aware of the synchronized keyword. You can use it as a qualifier to a method:

public synchronized void doUsefulThings() {
    // code protected by object lock
}

but doing so is just shorthand for:

public void doUsefulThings() {
    synchronized (this) {
        // code protected by object lock
    }
}

The key point is that the object being used as the lock doesn’t have to be this; it can be any object, i.e.

public void doUsefulThings() {
    synchronized (obj) {
        // code protected by object lock
    }
}

Java synchronized blocks have a useful property: when you exit the block the lock is automatically released. This is good to know because if an Exception is thrown from some code called within the block, the lock won’t remain held. [By comparison, if you use lock functions explicitly (such as the J2SE 1.5 Lock interface) then you have to deal manually with the possibility of Exceptions being thrown, by always releasing the lock in the finally clause of a try block as follows:

Lock obj;
...
    obj = new ReentrantLock();
...

public void doUsefulThings() {
    // acquire before the try, so finally never unlocks a lock we failed to take
    obj.lock();
    try {
        // code protected by object lock
    } finally {
        obj.unlock();
    }
}

which accomplishes the same thing, gaining greater flexibility (you can have locks span methods or even classes) at the cost of slightly more cumbersome code.]
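The automatic release is easy to check for yourself. Here is a minimal sketch (the class and method names are mine, made up for illustration) that throws from inside a synchronized block and then asks the VM whether the current thread still holds the monitor:

```java
class MonitorReleaseDemo {
    private static final Object lock = new Object();

    // Throws while holding the monitor; leaving the block releases it.
    static void failInsideMonitor() {
        synchronized (lock) {
            throw new IllegalStateException("simulated failure");
        }
    }

    // Reports whether the calling thread still owns the monitor afterwards.
    static boolean lockHeldAfterFailure() {
        try {
            failInsideMonitor();
        } catch (IllegalStateException e) {
            // expected: the exception propagated out of the synchronized block
        }
        return Thread.holdsLock(lock);
    }

    public static void main(String[] args) {
        System.out.println("held after exception: " + lockHeldAfterFailure());
    }
}
```

Thread.holdsLock() only answers for the calling thread, which is all we need here.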

Now, getting back to creating a Java binding around the GTK and GNOME libraries:

What others wouldn’t dream of considering

When you enable threading in GTK, you have to obey the injunction that all of your GDK and GTK calls must be done within the main GDK lock. As I discussed in my post about GTK thread awareness last week, this is fairly easy to ignore in C code, largely because callbacks to signal handlers are made from the main loop, which runs in a single (“main”) thread and entirely within the GDK lock (threads are only used infrequently in GNOME programs written in C). But if you want to work from other threads, you have to surround any code that uses GTK with gdk_threads_enter() and gdk_threads_leave(), and also do things like protect GIdle calls.

As I said, I want to create java-gnome as a thread-safe API: no more of the ridiculous contortions imposed on the developer just to do GUI calls from a worker thread. A few ways to achieve this come to mind. Before I learned the above it seemed that the only way to do it would be to marshal every single call through some producer-consumer queue setup to convey the call across to the main thread. Yuk! [and more on this later]. Meanwhile, I learned about the actual thread awareness mechanism in GTK, and discovered that GTK calls do not have to be made from the main thread only; they just have to be made from within the main lock. So why not just use that?

Instead of the default implementation (which is just a simplistic GMutex), we:

  1. use gdk_threads_set_lock_functions() to call custom functions to lock and unlock.
  2. have these custom functions take and release a lock on a Java object.

You could certainly do (2) by calling from C back to Java via a JNI invoke and having custom lock() and unlock() methods written in Java. But rather than making a cross-boundary call (which would be a pain to write, is much more complicated, requires you to write your own Lock implementation, and would probably be rather slow), JNI enables you to take the lock on an object directly!

    (*env)->MonitorEnter(env, obj);
    // code protected by object lock
    (*env)->MonitorExit(env, obj);  

which is precisely the C equivalent of doing:

    synchronized (obj) {
        // code protected by object lock
    }

in Java!

(Yes, yes, JNI is ugly to use. Why do you think we’re rewriting java-gnome? So we can generate that layer!)

Anyway, needless to say, the custom lock functions to pass to gdk_threads_set_lock_functions() are pretty straightforward. All we have to do is get a C side reference to the Java object, and then call MonitorEnter() and MonitorExit() in our lock() and unlock() functions, respectively. Too easy.

But why all this trouble to change the GDK lock to a synchronization monitor on a Java object? Simple: now we can use the same lock from both the GDK side and from the Java side.

This is brilliant, because it means we can transparently co-operate with the Java VM’s thread co-ordination mechanisms, rather than fighting against them or, worse, ignoring them. And that suddenly means we can just get on with making the library thread-safe. Here’s why: Java monitors are reentrant. If the thread already holds the lock on an object and it encounters another synchronized block requesting the same lock, it just carries on [nesting the monitors, as you might imagine].
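To make the reentrancy concrete, here is a small sketch (class and method names are hypothetical, not from java-gnome) in which one synchronized method calls another that wants the same lock. If monitors were not reentrant, outer() would deadlock against itself:

```java
class ReentrantMonitorDemo {
    private static final Object lock = new Object();

    // Takes the lock; when called from outer() the thread already owns it.
    static int inner() {
        synchronized (lock) {
            return 1;
        }
    }

    // Takes the lock, then calls inner(), which re-enters the same monitor.
    static int outer() {
        synchronized (lock) {
            return 1 + inner();
        }
    }

    // Once the outermost block exits, the monitor is fully released.
    static boolean heldAfterwards() {
        outer();
        return Thread.holdsLock(lock);
    }

    public static void main(String[] args) {
        System.out.println("monitor entries: " + outer());
        System.out.println("held afterwards: " + heldAfterwards());
    }
}
```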

So the combination of:

  1. Replacing the default GDK lock implementation with something that enters and exits a Java object monitor (thus protecting all internal GTK usage with that lock), plus
  2. A synchronized wrapper around each GTK call (thus protecting all Java side usage of GTK with the same lock), especially Gtk.main() (thus creating the condition that the main loop is running within the lock), plus
  3. The (largely unwritten) fact that the GTK main loop releases the lock by calling gdk_threads_leave() when it isn’t doing anything

means that other Java threads can safely make calls into GTK. Any thread that comes along while the main loop is busy will simply find the lock held by another thread (the “main” thread, as it happens) and will wait.

Point 3 bears closer examination. One of the other interesting details about Java synchronized blocks is that if a call is made to wait() on the lock object within one, the lock is silently released until the thread is notified, and then the thread tries to reacquire the lock so that the next instruction takes place inside the monitor as expected. Given that our wrapper code around the native call to start the main loop is this:

static void main() {
    synchronized (Gdk.lock) {
        gtk_main();
    }
}

we need gtk_main() to behave as if a Java wait() call had been made, releasing the lock so that other threads can run. Interestingly, this is exactly the effect we end up with: although this code sits inside the block that locks the Gdk.lock object, as the GTK main loop cycles it releases the lock via gdk_threads_leave() and then reestablishes it with gdk_threads_enter(). The effect is that the monitor on Gdk.lock is frequently relinquished, which is the behaviour you would expect if a piece of Java code executed wait() within a monitor block. Which is exactly what we need!
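The wait() behaviour that gtk_main() is imitating can be seen in pure Java. The following is only an analogy, under the assumption that a plain Object stands in for Gdk.lock and an "owner" thread stands in for the main loop; the real release happens natively through gdk_threads_leave(), not through wait(). The owner sits lexically inside its synchronized block the whole time, yet the worker still gets in:

```java
import java.util.ArrayList;
import java.util.List;

class WaitReleasesMonitorDemo {
    static List<String> run() {
        final Object lock = new Object();              // stand-in for Gdk.lock
        final List<String> events = new ArrayList<>(); // only touched while holding lock
        final boolean[] workerDone = { false };

        Thread worker = new Thread(() -> {
            synchronized (lock) {                      // blocks until the owner waits
                events.add("worker inside monitor");
                workerDone[0] = true;
                lock.notifyAll();
            }
        }, "worker");

        try {
            synchronized (lock) {
                events.add("owner inside monitor");
                worker.start();                        // worker cannot enter yet
                while (!workerDone[0]) {
                    lock.wait();                       // releases the monitor, reacquires on wake
                }
                events.add("owner resumed inside monitor");
            }
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : run()) {
            System.out.println(event);
        }
    }
}
```

The worker can only record its entry between the owner’s two entries, because the owner holds the monitor at every other moment.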

As for point 2, while no C developer working with GTK would ever have contemplated protecting every single function call, for us wrapping a synchronized (lock) { ... } block around each call is easy (especially given that we’re generating that layer: two extra lines in the code generator, and BAM! Done!). And because the lock is reentrant (recursive, if you will), the normal C reflex to avoid nesting the GDK lock, discussed in my last post, is no longer applicable. The thread making the calls already holds the lock, and just proceeds merrily.
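Here is a rough sketch of the shape such a wrapper takes. This is not the generated java-gnome code: FakeLabel and nativeSetLabel() are hypothetical stand-ins for a widget and its native call, and the Gdk class here is a toy holding nothing but the lock object. Several workers hammer the wrapper at once and no update is lost:

```java
class SynchronizedWrapperDemo {
    /** Toy holder for the single shared lock object. */
    static final class Gdk {
        static final class Lock {
            private Lock() {}
        }
        static final Lock lock = new Lock();
        private Gdk() {}
    }

    /** Hypothetical stand-in for a widget whose real state lives in native code. */
    static final class FakeLabel {
        private String text = "";
        private int updates = 0;

        // Stand-in for the native call; not thread-safe on its own.
        private void nativeSetLabel(String value) {
            text = value;
            updates++;
        }

        // The wrapper: the synchronized block is the "two extra lines".
        void setLabel(String value) {
            synchronized (Gdk.lock) {
                nativeSetLabel(value);
            }
        }

        int updates() {
            synchronized (Gdk.lock) {
                return updates;
            }
        }
    }

    // Starts several workers that each call the wrapper in a tight loop.
    static int hammer(int threads, int callsPerThread) {
        FakeLabel label = new FakeLabel();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    label.setLabel(Integer.toString(i));
                }
            }, "worker-" + t);
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return label.updates();
    }

    public static void main(String[] args) {
        System.out.println("updates recorded: " + hammer(4, 1000));
    }
}
```

Giving the lock its own tiny class rather than using a bare Object costs nothing and makes the lock recognizable wherever the VM prints its type.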

But isn’t that going to be really slow?

No, actually. Sun’s “Hotspot” Java VM is highly optimized for two particular cases:

  • acquiring an uncontested lock, and
  • entering a block if the current thread already holds the lock in question.

You see, threads needing to take locks is very common, and early on in Java it was a bottleneck. So they tuned the hell out of it and both are really fast. More to the point we’ve tried it, and there is no user visible performance impact. Things run really, really nicely.

Remember I dismissed arbitrary function marshaling? There are thousands of functions in GTK alone; to arbitrarily wrap and pass a representation of each native function would mean creating a message object to convey every single call, with another object to wrap each parameter. Ouch. One of the great achievements of java-gnome 4.0 was to reduce the object pressure. And while we might have been able to do it this way and so have only one GUI thread, we would have completely wrecked the type-safe and simple design we have now, which takes advantage of the fact that passing primitive types across the JNI boundary is cheap. (The very reason that primitives in Java are not objects is that sooner or later the bytes they represent will have to be passed across to native code, and so by design they optimized things so that this would be cheap.) In our case, we take advantage of this by simply using JNI to make a function call with arguments on the stack. Too easy. Otherwise you have to do complex bi-directional type lookups (reflection on the Java side and the JNI equivalents in C), and doing this on every method and parameter really would be slow.
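For contrast, this is roughly what the rejected marshaling design looks like, reduced to a single call type. It is a sketch of the general producer-consumer idea, not anything that was written for java-gnome; SetLabelCall and the rest of the names are invented for illustration. Note the message object allocated for every call, and the hand-off to a single consumer thread just to set one string:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

class MarshalledCallDemo {
    /** One heap-allocated message per toolkit call, plus a latch to wait on. */
    static final class SetLabelCall {
        final String text;
        final CountDownLatch done = new CountDownLatch(1);

        SetLabelCall(String text) {
            this.text = text;
        }
    }

    private final BlockingQueue<SetLabelCall> queue = new LinkedBlockingQueue<>();
    private String label = ""; // written only by the single "GUI" thread

    // The single consumer: the only thread allowed to touch the fake widget.
    private void drain() {
        try {
            while (true) {
                SetLabelCall call = queue.take();
                label = call.text;
                call.done.countDown();
            }
        } catch (InterruptedException e) {
            // interrupted: treat it as a request to shut down
        }
    }

    // A worker's view: wrap the call, queue it, and block until it has run.
    static String roundTrip(String text) {
        MarshalledCallDemo demo = new MarshalledCallDemo();
        Thread gui = new Thread(demo::drain, "gui");
        gui.setDaemon(true);
        gui.start();
        try {
            SetLabelCall call = new SetLabelCall(text);
            demo.queue.put(call);
            call.done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            gui.interrupt();
        }
        return demo.label;
    }

    public static void main(String[] args) {
        System.out.println("label is now: " + roundTrip("42"));
    }
}
```

And that is with one hand-written message class. Doing it generically for thousands of functions means a generic argument representation and type lookups on both sides, which is where it really falls apart.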

Against all that, the cost of grabbing a lock is trivial, and the elegance of just simply using Java’s synchronized mechanism is really encouraging. That’s ultimately why I’m excited about this design for giving java-gnome thread safety.

java-gnome thread safety in action

Below is a thread dump from a small sample demo program that I wrote. All it does is fire off a worker thread when a Button is clicked. The worker is just a tight loop to make repeated calls to update a Label with an (incrementing) number. If you’re not a Java developer you won’t be used to reading these, but you will probably get the idea pretty quickly: all the worker threads are waiting on the GTK lock except B, which has the lock and is the thread currently able to update the Label!

(For clarity, I’ve cut out the VM’s system threads)

Button pressed. Launching A.
Button pressed. Launching B.
Button pressed. Launching C.
Button pressed. Launching D.

Full thread dump Java HotSpot(TM) Client VM (1.5.0_08-b03 mixed mode):

"D" prio=1 tid=0x0823a3c8 nid=0x2dba waiting for monitor entry [0xa969a000..0xa969aeb0]
        at org.gnome.gtk.GtkLabel.setLabel(GtkLabel.java:52)
        - waiting to lock <0xaaaf1440> (a org.gnome.gdk.Gdk$Lock)
        at org.gnome.gtk.Label.setLabel(Label.java:67)
        at WorkerTiming.run(WorkerTiming.java:112)
        at java.lang.Thread.run(Thread.java:595)

"B" prio=1 tid=0x0822e800 nid=0x2db8 runnable [0xa9598000..0xa9599030]
        at org.gnome.gtk.GtkLabel.gtk_label_set_label(Native Method)
        at org.gnome.gtk.GtkLabel.setLabel(GtkLabel.java:53)
        - locked <0xaaaf1440> (a org.gnome.gdk.Gdk$Lock)
        at org.gnome.gtk.Label.setLabel(Label.java:67)
        at WorkerTiming.run(WorkerTiming.java:112)
        at java.lang.Thread.run(Thread.java:595)

"C" prio=1 tid=0x0823f0b8 nid=0x2db9 waiting for monitor entry [0xa9517000..0xa9517fb0]
        at org.gnome.gtk.GtkLabel.setLabel(GtkLabel.java:52)
        - waiting to lock <0xaaaf1440> (a org.gnome.gdk.Gdk$Lock)
        at org.gnome.gtk.Label.setLabel(Label.java:67)
        at WorkerTiming.run(WorkerTiming.java:112)
        at java.lang.Thread.run(Thread.java:595)

"A" prio=1 tid=0x08085fc0 nid=0x2db7 waiting for monitor entry [0xa9496000..0xa9497130]
        at org.gnome.gtk.GtkLabel.setLabel(GtkLabel.java:52)
        - waiting to lock <0xaaaf1440> (a org.gnome.gdk.Gdk$Lock)
        at org.gnome.gtk.Label.setLabel(Label.java:67)
        at WorkerTiming.run(WorkerTiming.java:112)
        at java.lang.Thread.run(Thread.java:595)

"main" prio=1 tid=0x0805c880 nid=0x2d95 waiting for monitor entry [0xbfdab000..0xbfdac138]
        at org.gnome.gtk.Gtk.gtk_main(Native Method)
        at org.gnome.gtk.Gtk.main(Gtk.java:124)
        - locked <0xaaaf1440> (a org.gnome.gdk.Gdk$Lock)
        at WorkerTiming.main(WorkerTiming.java:92)

A done.
C done.
D done.
B done.

Note the cunningly selected class name of the lock instance :). Normally people just do Object lock = new Object(); and synchronize on that. But “waiting to lock a java.lang.Object” isn’t as helpful as it might be, and “waiting on the Gdk lock”, while cosmetic, seems a nice touch.

One slightly weird thing is that the VM thread dump for “main” includes a marker on the stack saying where a synchronized block was entered, which does make it look an awful lot like it’s also holding the lock. That wouldn’t make any sense: the JNI side MonitorExit() must have been called and the main loop must not be running, or else another thread couldn’t be running. I worried about this until I read it a bit more closely and realized it is correct: the thread itself says it is waiting to be able to enter a monitor, and the only runnable thread is indeed B. It’s still a bit confusing, though; I think we should see about contributing a fix for that in the VM, something that is now possible thanks to Sun having freed Java.

Conclusion

Thread safety is ultimately about protecting data from being left in an inconsistent state because of a context switch. That GTK imposes the restriction that calls must all be made from within a defined lock is not unreasonable at all, and is a standard approach to thread safety. That we have the ability to supplant that mutex with one that takes care of additional application requirements is a really wonderful feature of GDK.

Code to implement this strategy in java-gnome has been written on a quick temporary branch (go bzr!) so we can kill off this idea if we have to, but so far it has handled every tricky situation we can come up with to throw at it: firing new Windows, cascading signal callbacks, nested main loops, and lock starvation. While throwing a synchronized block around every call at first seems extreme, it doesn’t cost us much, and actually follows the good practice of keeping the locked sections as small as possible.

It still needs more testing, but thanks very much to Srichand Pendyala (who worked with me to figure out that a message queue architecture for passing arbitrary calls to a single GUI thread across the JNI boundary would be a nightmare), Vreixo Formoso and Karsten Bräckelmann (who have traced the code paths of this implementation of thread safety, listened to my arguments, and have reached the point of telling me why it will work and to stop worrying), Roman Kennke (who discussed with me his hypothesis that a recursive mutex would be a sound default for the GDK lock), and as ever to our guardian angel, Owen Taylor (who never ceases to amaze me with the quality of his contributions).

I have little doubt that if we follow this course of action, we will end up finding races in the underlying GTK libraries where the required GDK_THREADS_ENTER()/GDK_THREADS_LEAVE() macros were not used to protect code in idle callbacks and what not. It’s inevitable; the code paths that we are exercising are not as widely used as the more usual “all in the lock all in a single-thread” pattern that most GNOME applications written in C follow. Given the requirements of the GDK threads page, however, this would amount to a bug, and if tripping over such things occasionally (and contributing to rooting them out) is the price of a thread safe java-gnome, then it seems well worth it. Ultimately it will make the code better for everyone, and that’s what open source is about, after all.

AfC