
Java 8 Stream Performance Test

When I read Angelika Langer's Java performance tutorial "How fast are Java 8 streams?", I could not believe that for a specific operation they took about 15 times longer than loops. Could stream performance really be that bad? I had to find out!

By coincidence I had recently watched a cool talk about microbenchmarking Java code, and I decided to apply what I learned there. So let's see whether streams really are that slow.

Overview

As usual, I'll start with a dull prologue. It explains why you should be very careful with what I present here, how I produced the numbers, and how you can easily repeat and tweak the benchmark. If you don't care about any of this, skip right ahead to Stream Performance.

But first, two quick pointers: all of the benchmark code is up on GitHub, and this Google spreadsheet contains the resulting data.

Prologue

Disclaimer

This post contains a lot of numbers, and numbers are deceitful. They seem scientific and precise and all that, and they lure us into focusing on their interrelation and interpretation. But we should always pay equal attention to how they came to be!

The numbers I present below were produced on my system with very specific benchmarks, so it is all too easy to over-generalize them! I should also add that I have only two days' worth of experience with non-trivial benchmarking techniques (i.e. ones that are not based on looping and manual System.currentTimeMillis()).

Be very careful about incorporating whatever insights you gain here into your mental performance model. The devil hiding in the details is the JVM itself, and it is a deceitful beast. It is entirely possible that my benchmarks fell victim to optimizations that skewed the numbers.

System

  • CPU: Intel® Core™ i7-4800MQ @ 2.70 GHz
  • RAM: Samsung DDR3 16 GB @ 1.60 GHz (the benchmarks ran entirely in RAM)
  • OS: Ubuntu 15.04, kernel version 3.19.0-26-generic
  • Java: 1.8.0_60
  • JMH: 1.10.5

Benchmark

JMH

The benchmarks were created with the wonderful Java Microbenchmarking Harness (JMH), which is developed and used by the JVM performance team itself. It is thoroughly documented, easy to set up and use, and the explanation via samples is awesome!

If you prefer a casual introduction, you might like Aleksey Shipilëv's talk from Devoxx UK 2013.

Setup

To create somewhat reliable results, benchmarks are run individually and repeatedly. Each benchmark method gets its own run, consisting of several forks, each of which executes a number of warmup iterations before the actual measurement iterations.

I ran separate benchmarks with 50’000, 500’000, 5’000’000, 10’000’000, and 50’000’000 elements. Except for the last one, all of them had two forks, each consisting of five warmup and five measurement iterations, where each iteration was three seconds long. Parts of the last one ran in a single fork with two warmup and three measurement iterations, each 30 seconds long.
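The repository configures all of this via annotations on AbstractIterationBenchmark (see the tweaks below). As a rough sketch only, the two-fork configuration described above could be expressed like this; the benchmark mode and output time unit are my assumptions, not something stated in the article:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// sketch: two forks, five 3-second warmup and five 3-second measurement iterations
@Fork(2)
@Warmup(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public abstract class AbstractIterationBenchmark {
    // the benchmark methods are defined in subclasses
}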

Langer's article says that their arrays are populated with random integers. I compared this with the nicer case in which each int in the array equals its position in it. The deviation between the two scenarios averaged 1.2%, with the largest difference being 5.4%.

Since creating millions of randomized integers takes considerable time, I opted to execute the majority of the benchmarks on the ordered sequences only, so unless otherwise noted, the numbers pertain to this scenario.
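As a sketch of the two fill strategies (the actual code in the repository is driven by the NUMBER_OF_ELEMENTS and CREATE_ELEMENTS_RANDOMLY constants mentioned below and may look different):

int[] intArray = new int[NUMBER_OF_ELEMENTS];
Random random = new Random(); // java.util.Random
for (int i = 0; i < intArray.length; i++)
    // either a random value or simply the element's own index
    intArray[i] = CREATE_ELEMENTS_RANDOMLY ? random.nextInt() : i;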

Code

The benchmark code itself is available on GitHub. To run it, simply go to the command line, build the project, and execute the resulting jar:

mvn clean install
java -jar target/benchmarks.jar

Some easy tweaks:

  • adding a regular expression at the end of the execution call will only benchmark methods whose fully-qualified name matches that expression; e.g. to only run ControlStructuresBenchmark:
java -jar target/benchmarks.jar Control
  • the annotations on AbstractIterationBenchmark govern how often and how long each benchmark is executed
  • the constant NUMBER_OF_ELEMENTS defines the length of the array/list that is being iterated over
  • tweak CREATE_ELEMENTS_RANDOMLY to switch between an array of ordered or of random numbers


Stream Performance

Repeating The Experiment

Let’s start with the case that triggered me to write this post: Finding the maximum value in an array of 500’000 random elements.

int m = Integer.MIN_VALUE;
for (int i = 0; i < intArray.length; i++)
    if (intArray[i] > m)
        m = intArray[i];
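In the benchmark project such a loop is wrapped in a JMH benchmark method; the sketch below is my guess at its shape, the important detail being that the result is returned so the JIT cannot eliminate the computation as dead code:

@Benchmark
public int array_max_for() {
    int m = Integer.MIN_VALUE;
    for (int i = 0; i < intArray.length; i++)
        if (intArray[i] > m)
            m = intArray[i];
    // returning the value feeds JMH's blackhole and keeps the loop alive
    return m;
}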

First thing I noticed: My laptop performs much better than the machine used for the JAX article. This was to be expected as it was described as “outdated hardware (dual core, no dynamic overclocking)” but it made me happy nevertheless since I paid enough for the damn thing. Instead of 0.36 ms it only took 0.130 ms to loop through the array.

More interesting are the results for using a stream to find the maximum:

// article uses 'reduce' to which 'max' delegates
Arrays.stream(intArray).max();

Langer reports a runtime of 5.35 ms for this, which compared to the loop’s 0.36 ms yields the reported slowdown by x15. I consistently measured about 0.560 ms, so I end up with a slowdown of “only” x4.5. Still a lot, though.

Next, the article compares iterating over lists against streaming them.

// for better comparability with looping over the array,
// I do not use a "for each" loop (unlike Langer's article);
// measurements show that this makes things a little faster
int m = Integer.MIN_VALUE;
for (int i = 0; i < intList.size(); i++)
    if (intList.get(i) > m)
        m = intList.get(i);

intList.stream().max(Math::max);

The results are 6.55 ms for the for loop and 8.33 ms for the stream. My measurements are 0.700 ms and 3.272 ms. While this changes their relative performance considerably, it creates the same order:

                     ANGELIKA LANGER              ME
OPERATION            TIME (MS)    SLOWER          TIME (MS)    SLOWER
array_max_for        0.36                         0.123
array_max_stream     5.35         14’861%         0.599        487%
list_max_for         6.55         22%             0.700        17%
list_max_stream      8.33         27%             3.272        467%

I ascribe the marked difference between iterations over arrays and lists to boxing; or rather to the resulting indirection. The primitive array is packed with the values we need but the list is backed by an array of Integers, i.e. references to the desired values which we must first resolve.
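Roughly speaking, and simplifying a lot, the two data structures look like this:

int[] intArray;        // the int values themselves, stored contiguously
List<Integer> intList; // backed by an Integer[], i.e. references that each point
                       // to a separate Integer object holding the actual value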

The considerable difference between Langer's and my series of relative changes (+14’861% +22% +27% vs +487% +17% +467%) underlines her statement that "the performance model of streams is not a trivial one".

Bringing this part to a close, her article makes the following observation:

 We just compare two integers, which after JIT compilation is barely more than one assembly instruction. For this reason, our benchmarks illustrate the cost of element access – which need not necessarily be a typical situation. The performance figures change substantially if the functionality applied to each element in the sequence is cpu intensive. You will find that there is no measurable difference any more between for-loop and sequential stream if the functionality is heavily cpu bound.

So let’s have a look at something other than just integer comparison.

Comparing Operations

I compared the following operations:

max
Finding the maximum value.
sum
Computing the sum of all values; aggregated in an int, ignoring overflows.
arithmetic
To model a less simple numeric operation I combined the values with a handful of bit shifts and multiplications.
string
To model a complex operation that creates new objects I converted the elements to strings and xor’ed them character by character.
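The exact implementations are in the repository; as a rough idea, the arithmetic and string operations could look something like the following sketch. Only arithmeticOperation is a name that actually appears in the benchmark code further down; stringOperation is a hypothetical counterpart.

// a "less simple" numeric operation: a few shifts and multiplications
int arithmeticOperation(int result, int element) {
    return result ^ (element << 3) * (element >>> 2) + element * 31;
}

// a complex, object-creating operation: to string, then xor the characters
int stringOperation(int result, int element) {
    for (char c : Integer.toString(element).toCharArray())
        result ^= c;
    return result;
}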

These were the results (for 500’000 ordered elements; in milliseconds):

          MAX                 SUM                 ARITHMETIC          STRING
          ARRAY     LIST      ARRAY     LIST      ARRAY     LIST      ARRAY      LIST
FOR       0.123     0.700     0.186     0.714     4.405     4.099     49.533     49.943
STREAM    0.559     3.272     1.394     3.584     4.100     7.776     52.236     64.989

This underlines how cheap comparison really is; even addition takes a whopping 50% longer. We can also see how more complex operations bring looping and streaming closer together: the difference drops from almost 400% to 25%. Similarly, the difference between arrays and lists is reduced considerably. Apparently the arithmetic and string operations are CPU bound, so resolving the references had no negative impact.

(Don’t ask me why for the arithmetic operation streaming the array’s elements is faster than looping over them. I have been banging my head against that wall for a while.)

So let’s fix the operation and have a look at the iteration mechanism.

Comparing Iteration Mechanisms

There are at least two important variables in assessing the performance of an iteration mechanism: its overhead and whether it causes boxing, which will hurt performance for memory bound operations. I decided to try and bypass boxing by executing a CPU bound operation. As we have seen above, the arithmetic operation fulfills this on my machine.

Iteration was implemented with straightforward for and for-each loops. For streams I made some additional experiments:

@Benchmark
public int array_stream() {
    // implicitly unboxed
    return Arrays
        .stream(intArray)
        .reduce(0, this::arithmeticOperation);
}

@Benchmark
public int array_stream_boxed() {
    // explicitly boxed
    return Arrays
        .stream(intArray)
        .boxed()
        .reduce(0, this::arithmeticOperation);
}

@Benchmark
public int list_stream_unbox() {
    // naively unboxed
    return intList
        .stream()
        .mapToInt(Integer::intValue)
        .reduce(0, this::arithmeticOperation);
}

@Benchmark
public int list_stream() {
    // implicitly boxed
    return intList
        .stream()
        .reduce(0, this::arithmeticOperation);
}

Here, boxing and unboxing do not refer to how the data is stored (it is unboxed in the array and boxed in the list) but to how the values are processed by the stream.

Note that boxed converts the IntStream, a specialized implementation of Stream that only deals with primitive ints, to a Stream<Integer>, a stream over objects. This should have a negative impact on performance, but the extent depends on how well escape analysis works.

Since the list is generic (i.e. no specialized IntArrayList), it returns a Stream<Integer>. The last benchmark method calls mapToInt, which returns an IntStream. This is a naive attempt to unbox the stream elements.

                    ARITHMETIC
                    ARRAY     LIST
FOR                 4.405     4.099
FOREACH             4.434     4.707
STREAM (UNBOXED)    4.100     4.518
STREAM (BOXED)      7.694     7.776

Well, look at that! Apparently the naive unboxing does work (in this case). I have some vague notions why that might be the case but nothing I am able to express succinctly (or correctly). Ideas, anyone?

(Btw, all this talk about boxing/unboxing and specialized implementations makes me ever more happy that Project Valhalla is advancing so well.)

The more concrete consequence of these tests is that for CPU bound operations, streaming seems to have no considerable performance cost. After fearing a noticeable disadvantage, this is good to hear.

Comparing Number Of Elements

In general the results are pretty stable across runs with a varying sequence length (from 50’000 to 50’000’000). To this end I examined the normalized performance per 1’000’000 elements across those runs.
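For example, array_max_for took 0.123 ms for 500’000 elements, which normalizes to 0.123 ms * (1’000’000 / 500’000) = 0.246 ms per million elements, the value that shows up in the table further below.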

But I was pretty astonished that performance does not automatically improve with longer sequences. My simple mind assumed that this would give the JVM the opportunity to apply more optimizations. Instead there are some notable cases where performance actually dropped:

FROM 500’000 TO 50’000’000 ELEMENTS
METHOD             TIME
array_max_for      +44.3%
array_sum_for      +13.4%
list_max_for       +12.8%

Interesting that these are the simplest iteration mechanisms and operations.

The winners are the more complex iteration mechanisms over simple operations:

FROM 500’000 TO 50’000’000 ELEMENTS
METHOD             TIME
array_sum_stream   -84.9%
list_max_stream    -13.5%
list_sum_stream    -7.0%

This means that the table we have seen above for 500’000 elements looks a little different for 50’000’000 (normalized to 1’000’000 elements; in milliseconds):

                    MAX                 SUM                 ARITHMETIC          STRING
                    ARRAY     LIST      ARRAY     LIST      ARRAY     LIST      ARRAY      LIST
500’000 ELEMENTS
FOR                 0.246     1.400     0.372     1.428     8.810     8.199     99.066     98.650
STREAM              1.118     6.544     2.788     7.168     8.200     15.552    104.472    129.978
50’000’000 ELEMENTS
FOR                 0.355     1.579     0.422     1.522     8.884     8.313     93.949     97.900
STREAM              1.203     3.954     0.421     6.710     8.408     15.723    96.550     117.690

We can see that there is almost no change for the arithmetic and string operations. But things change for the simpler max and sum operations, where more elements brought the field closer together.

Reflection

All in all I’d say that there were no big revelations. We have seen that palpable differences between loops and streams exist only with the simplest operations. It was a bit surprising, though, that the gap is closing when we come into the millions of elements. So there is little need to fear a considerable slowdown when using streams.

There are still some open questions, though. The most notable: What about parallel streams? Then I am curious to find out at which operation complexity I can see the change from iteration dependent (like sum and max) to iteration independent (like arithmetic) performance. I also wonder about the impact of hardware. Sure, it will change the numbers, but will there be qualitative differences as well?

Another takeaway for me is that microbenchmarking is not so hard. Or so I think until someone points out all my errors…