Java sum-of-array comparisons
This is a follow-up to the post by Daniel Lemire on a close topic.
Daniel Lemire hat experimented with boxing a primitive array in an interface, and has been trying to measure the cost.
I must admit I was a bit sceptical about his results, because I have seen Java successfully inlining code in various situations.
For an experimental library I occasionally work on, I had been spending quite a bit of time on benchmarking. Previously, I had used Google Caliper for it (I even wrote an evaluation tool for it to produce better statistics). However, Caliper hasn’t seen much updates recently, and there is a very attractive similar tool at openJDK now, too: Java Microbenchmarking Harness (actually it can be used for benchmarking at other scale, too).
Now that I have experience in both, I must say I consider JMH superior, and I have switched over my microbenchmarks to it. One of the nice things is that it doesn’t make this distinction of micro vs. macrobenchmarks, and the runtime of your benchmarks is easier to control.
I largely
recreated his task using JMH. The benchmark task is easy: compute the sum of an array; the question is how much the cost is when allowing different data structures than double[]
.
My results, however, are quite different. And the statistics of JMH indicate the differences may be not significant, and thus indicating that Java manages to inline the code properly.
adapterFor 1000000 thrpt 50 836,898 ± 13,223 ops/s
adapterForL 1000000 thrpt 50 842,464 ± 11,008 ops/s
adapterForR 1000000 thrpt 50 810,343 ± 9,961 ops/s
adapterWhile 1000000 thrpt 50 839,369 ± 11,705 ops/s
adapterWhileL 1000000 thrpt 50 842,531 ± 9,276 ops/s
boxedFor 1000000 thrpt 50 848,081 ± 7,562 ops/s
boxedForL 1000000 thrpt 50 840,156 ± 12,985 ops/s
boxedForR 1000000 thrpt 50 817,666 ± 9,706 ops/s
boxedWhile 1000000 thrpt 50 845,379 ± 12,761 ops/s
boxedWhileL 1000000 thrpt 50 851,212 ± 7,645 ops/s
forSum 1000000 thrpt 50 845,140 ± 12,500 ops/s
forSumL 1000000 thrpt 50 847,134 ± 9,479 ops/s
forSumL2 1000000 thrpt 50 846,306 ± 13,654 ops/s
forSumR 1000000 thrpt 50 831,139 ± 13,519 ops/s
foreachSum 1000000 thrpt 50 843,023 ± 13,397 ops/s
whileSum 1000000 thrpt 50 848,666 ± 10,723 ops/s
whileSumL 1000000 thrpt 50 847,756 ± 11,191 ops/s
The postfix is the iteration type: sum using for loops, with local variable for
the length (L), or in reverse order (R); while loops (again with local variable
for the length). The prefix is the data layout: the primitive array, the array
using a static adapter (which is the approach I have been using in many
implementations in cervidae) and using a “boxed” wrapper class around the array
(roughly the approach that Daniel Lemire has been investigating. On the primitive
array, I also included the foreach loop approach (for(double v:array){
).
If you look at the standard deviations, the results are pretty much identical, except for reverse loops. This is not surprising, given the strong inlining capabilities of Java - all of these codes will lead to next to the same CPU code after warmup and hotspot optimization.
I do not have a full explanation of the differences the others have been seeing. There is no “polymorphism” occurring here (at runtime) - there is only a single Array implementation in use; but this was the same with his benchmark.
Here is a visualization of the results (sorted by average):
As you can see, most results are indiscernible. The measurement standard deviation is
higher than the individual differences. If you run the same benchmark again,
you will likely get a different ranking.
Note that performance may - drastically - drop once you use multiple adapters or boxing classes in the same hot codepath. Java Hotspot keeps statistics on the classes it sees, and as long as it only sees 1-2 different types, it performs quite aggressive optimizations instead of doing “virtual” method calls.