This is a follow-up to the post by Daniel Lemire on a close topic.

Daniel Lemire hat experimented with boxing a primitive array in an interface, and has been trying to measure the cost.

I must admit I was a bit sceptical about his results, because I have seen Java successfully inlining code in various situations.

For an experimental library I occasionally work on, I had been spending quite a bit of time on benchmarking. Previously, I had used Google Caliper for it (I even wrote an evaluation tool for it to produce better statistics). However, Caliper hasn’t seen much updates recently, and there is a very attractive similar tool at openJDK now, too: Java Microbenchmarking Harness (actually it can be used for benchmarking at other scale, too).

Now that I have experience in both, I must say I consider JMH superior, and I have switched over my microbenchmarks to it. One of the nice things is that it doesn’t make this distinction of micro vs. macrobenchmarks, and the runtime of your benchmarks is easier to control.

I largely recreated his task using JMH. The benchmark task is easy: compute the sum of an array; the question is how much the cost is when allowing different data structures than double[].

My results, however, are quite different. And the statistics of JMH indicate the differences may be not significant, and thus indicating that Java manages to inline the code properly.

adapterFor       1000000  thrpt  50  836,898 ± 13,223  ops/s
adapterForL      1000000  thrpt  50  842,464 ± 11,008  ops/s
adapterForR      1000000  thrpt  50  810,343 ±  9,961  ops/s
adapterWhile     1000000  thrpt  50  839,369 ± 11,705  ops/s
adapterWhileL    1000000  thrpt  50  842,531 ±  9,276  ops/s
boxedFor         1000000  thrpt  50  848,081 ±  7,562  ops/s
boxedForL        1000000  thrpt  50  840,156 ± 12,985  ops/s
boxedForR        1000000  thrpt  50  817,666 ±  9,706  ops/s
boxedWhile       1000000  thrpt  50  845,379 ± 12,761  ops/s
boxedWhileL      1000000  thrpt  50  851,212 ±  7,645  ops/s
forSum           1000000  thrpt  50  845,140 ± 12,500  ops/s
forSumL          1000000  thrpt  50  847,134 ±  9,479  ops/s
forSumL2         1000000  thrpt  50  846,306 ± 13,654  ops/s
forSumR          1000000  thrpt  50  831,139 ± 13,519  ops/s
foreachSum       1000000  thrpt  50  843,023 ± 13,397  ops/s
whileSum         1000000  thrpt  50  848,666 ± 10,723  ops/s
whileSumL        1000000  thrpt  50  847,756 ± 11,191  ops/s

The postfix is the iteration type: sum using for loops, with local variable for the length (L), or in reverse order (R); while loops (again with local variable for the length). The prefix is the data layout: the primitive array, the array using a static adapter (which is the approach I have been using in many implementations in cervidae) and using a “boxed” wrapper class around the array (roughly the approach that Daniel Lemire has been investigating. On the primitive array, I also included the foreach loop approach (for(double v:array){).

If you look at the standard deviations, the results are pretty much identical, except for reverse loops. This is not surprising, given the strong inlining capabilities of Java - all of these codes will lead to next to the same CPU code after warmup and hotspot optimization.

I do not have a full explanation of the differences the others have been seeing. There is no “polymorphism” occurring here (at runtime) - there is only a single Array implementation in use; but this was the same with his benchmark.

Here is a visualization of the results (sorted by average):
Result boxplots
As you can see, most results are indiscernible. The measurement standard deviation is higher than the individual differences. If you run the same benchmark again, you will likely get a different ranking.

Note that performance may - drastically - drop once you use multiple adapters or boxing classes in the same hot codepath. Java Hotspot keeps statistics on the classes it sees, and as long as it only sees 1-2 different types, it performs quite aggressive optimizations instead of doing “virtual” method calls.