Java Hotspot Compiler - a heavily underappreciated technology
When I had my first contact with Java, probably around Java 1.1 or Java 1.2, it felt clumsy and slow. And this is still the reputation that Java has among many developers: bloated source code and slow performance.
Over the last years, I have worked a lot with Java; it would not have been my personal first choice, but since it is usually the language the students know best, it was the best choice for this project, the data mining framework ELKI.
I have learned a lot about Java since, including debugging and optimizing Java code. ELKI contains a number of tasks that require good number-crunching performance; something where Java in particular had the reputation of being slow.
I must say, this is not entirely fair. Sure, the pure matrix multiplication performance of Java is not up to Fortran (BLAS libraries are usually implemented in Fortran, and many tools such as R or NumPy will use them for the heavy lifting). But there are other tasks than matrix multiplication, too!
There are a number of things where Java could be improved a lot. Some of this will be coming with Java 8, others are still missing. I’d particularly like to see native BLAS support and multi-valued on-stack returns (to allow an intrinsic sincos, for example).
In this post, I want to emphasize that usually the Hotspot compiler does an excellent job.
A few years ago, I would laugh at those who claimed “Java code can even be faster than C code” - after all, the Java VM itself is written in C. Having had a deeper look at what the Hotspot compiler does, I am no longer surprised that, quite often, reasonably good Java code outperforms reasonably good C code.
In fact, I’d love to see a “hotspot” optimizer for C.
So what is it that makes Hotspot so fast? In my opinion, the key ingredient is aggressive inlining. And this is exactly why “reasonably well written” Java code can be faster than C code written at a similar level.
Let me explain this with an example. Assume we want to compute a pairwise distance matrix, but we want the code to support arbitrary distance functions. The code will roughly look like this (not heavily optimized):
for (int i = 0; i < size; i++) {
  for (int j = i + 1; j < size; j++) {
    matrix[i][j] = computeDistance(data[i], data[j]);
  }
}
In C, if you want to be able to choose computeDistance at runtime, you would likely make it a function pointer; in C++, you might use e.g. boost::function or a virtual method. In Java, you would use an interface method instead, i.e. distanceFunction.distance().
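A minimal sketch of what such an interface-based design might look like in Java (the names DistanceFunction and computeMatrix are my own illustration, not ELKI’s actual API):

```java
// Hypothetical interface; ELKI's real distance function API differs.
interface DistanceFunction {
    double distance(double[] a, double[] b);
}

public class PairwiseMatrix {
    static double[][] computeMatrix(double[][] data, DistanceFunction df) {
        int size = data.length;
        double[][] matrix = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = i + 1; j < size; j++) {
                // This is an interface call; Hotspot can inline the
                // concrete implementation it actually observes here.
                matrix[i][j] = df.distance(data[i], data[j]);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        DistanceFunction euclidean = (a, b) -> {
            double sum = 0;
            for (int k = 0; k < a.length; k++) {
                double d = a[k] - b[k];
                sum += d * d;
            }
            return Math.sqrt(sum);
        };
        double[][] data = { { 0, 0 }, { 3, 4 } };
        double[][] m = computeMatrix(data, euclidean);
        System.out.println(m[0][1]); // 5.0
    }
}
```

The caller only ever sees the interface; which concrete distance is used is decided at runtime, which is precisely the situation where Hotspot’s speculative inlining pays off.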
In C, your compiler will most likely emit a jmp *%eax instruction to jump to the method that computes the distance; with virtual methods in C++, it would load the target method from the vtable and then jmp there. Technically, it will likely be a “register-indirect absolute jump”.
Java will, however, try to inline this code at runtime, i.e. it will often insert the
actual distance function used at the location of this call.
Why does this make a difference? CPUs have become quite good at speculative execution, prefetching and caching. Yet, as far as I can tell, it can still pay off to save those jmps, even if only to let the CPU apply these techniques to predict other branches better.
But there is also a second effect: the hotspot compiler will be optimizing the
inlined version of the code, whereas the C compiler has to optimize
the two functions independently (as it cannot know they will be only used in
this combination).
Hotspot can be quite aggressive there. It will even inline and optimize when it is not 100% sure that its assumptions are correct. It will just add simple guards (e.g. some type checks), fall back to the interpreter when these assumptions fail, and then reoptimize again.
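You can watch these inlining decisions yourself with diagnostic JVM flags. Here is a small sketch (the class and interface names are made up for illustration; the exact inlining output depends on your JVM version):

```java
// Run with:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineDemo
// to see Hotspot's inlining decisions for the hot loop below.
public class InlineDemo {
    interface Dist {
        double d(double a, double b);
    }

    static double loop(Dist f) {
        double s = 0;
        // A hot call site: as long as only one Dist implementation has
        // been observed, Hotspot can speculatively inline f.d() here.
        // Loading a second implementation later would trigger
        // deoptimization and recompilation.
        for (int i = 0; i < 1_000_000; i++) {
            s += f.d(i, i + 1);
        }
        return s;
    }

    public static void main(String[] args) {
        Dist absDiff = (a, b) -> Math.abs(a - b);
        System.out.println(loop(absDiff)); // 1000000.0
    }
}
```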
You can see the inlining effect in Java when you use the -Xcomp flag, which tells the Java VM to compile everything at load time. It cannot do as much speculative inlining then, as it does not yet know which method will be called and which classes will be seen. Instead, it has to compile the code using virtual method invocations, just like C++ would - except that in Java, every single method is virtual by default (in C++, you have to mark it explicitly). You will likely see a substantial performance drop when using this flag; it is not recommended. Instead, let Hotspot perform its inlining magic. It will inline a lot by default - in particular tiny methods such as getters and setters.
I’d love to see something similar in C or C++. There are some optimizations that can only be done at runtime, not at compile time. Maybe not even at linking time; but only with runtime type information, and that may also change over time (e.g. the user first computes a distance matrix for Euclidean distance, then for Manhattan distance).
Don’t get me wrong. I’m not saying Java is perfect. There are a lot of common mistakes, such as using the java.util collection classes for primitive types, which comes at a massive memory cost and garbage collection overhead. The first thing to check when optimizing Java applications is memory usage overhead. But all in all, good Java code can indeed perform well, and may even outperform C code, due to the inlining optimization I just discussed; in particular on large projects, where you can no longer fine-tune inlining in C.
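To illustrate the boxing cost just mentioned, here is a minimal sketch contrasting a boxed collection with a primitive array (the per-object sizes in the comments are typical HotSpot figures, not exact guarantees):

```java
import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
    public static void main(String[] args) {
        // Boxed: every element becomes a separate Integer object on the
        // heap (object header plus padding, typically 16 bytes each),
        // in addition to the reference stored in the backing array -
        // and every one of them is work for the garbage collector.
        List<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            boxed.add(i); // autoboxing allocates (or reuses) an Integer
        }

        // Primitive: one flat array, 4 bytes per element, no
        // per-element objects at all.
        int[] primitive = new int[1000];
        for (int i = 0; i < 1000; i++) {
            primitive[i] = i;
        }

        long sumBoxed = 0, sumPrimitive = 0;
        for (int v : boxed) sumBoxed += v; // unboxing on every access
        for (int v : primitive) sumPrimitive += v;
        System.out.println(sumBoxed == sumPrimitive); // true
    }
}
```

Libraries with primitive-type collections avoid exactly this overhead, which is why they are a common first optimization in memory-heavy Java code.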
Sometimes, Hotspot may also fail, which is largely why I have been investigating these issues recently. In ELKI 0.6.0, I am facing a severe performance regression with linear scans (which is actually the simpler code path: not using indexes, just a simple loop as seen above). I had this with 0.5.0 before, but back then I was able to revert to an earlier version that still performed well (even though the code was much more redundant). This time, I would have had to revert a larger refactoring that I wanted to keep, unfortunately.
Because the regression was quite severe - from 600 seconds to 1500-2500 seconds (still clearly better than -Xcomp) - I first assumed I was facing an actual programming bug. Careful inspection, down to the assembler code produced by the Hotspot VM, did not reveal any such error. Then I tried Java 8, and the regression was gone.
So apparently it is not a programming error; Java 7 simply failed at optimizing it remotely as well as it did with the previous ELKI version!
If you are a Java guru interested in tracking down this regression, feel free to contact me - it is all in an open source project, ELKI. I would be happy to have good performance for linear scans on Java 7, too. But I don’t want to waste any more hours on this; instead, I plan to move on to Java 8 anyway for other reasons (lambda expressions, which will greatly reduce the amount of glue code needed). Plus, Java 8 is faster in my benchmarks.