Why can't (or doesn't) the compiler optimize a predictable addition loop into a multiplication?

This is a question that came to mind while reading the brilliant answer by Mysticial to the question: why is it faster to process a sorted array than an unsorted array?

Context for the types involved:

const unsigned arraySize = 32768;
int data[arraySize];
long long sum = 0;

In his answer he explains that the Intel Compiler (ICC) optimizes this:

for (int i = 0; i < 100000; ++i)
    for (int c = 0; c < arraySize; ++c)
        if (data[c] >= 128)
            sum += data[c];

...into something equivalent to this:

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        for (int i = 0; i < 100000; ++i)
            sum += data[c];

The optimizer is recognizing that these are equivalent and is therefore exchanging the loops, moving the branch outside the inner loop. Very clever!

But why doesn't it do this?

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000 * data[c];

Hopefully Mysticial (or anyone else) can give an equally brilliant answer. I've never learned about the optimizations discussed in that other question before, so I'm really grateful for this.


The compiler can't generally transform

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        for (int i = 0; i < 100000; ++i)
            sum += data[c];

into

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000 * data[c];

because the latter could overflow a signed integer where the former doesn't. Even if signed overflow is guaranteed to wrap around (two's complement), the result would change: if data[c] is 30000, the product becomes -1294967296 for a typical 32-bit int with wrap-around, whereas adding 30000 to sum 100000 times increases sum by 3000000000, provided that doesn't overflow sum. The same holds for unsigned quantities, with different numbers: overflow of 100000 * data[c] would typically introduce a reduction modulo 2^32 that must not appear in the final result.
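The numbers above can be checked with a small sketch. The function names are made up for illustration, and the wrap-around is simulated in unsigned arithmetic (assuming the usual 32-bit int) so that the sketch itself never triggers signed overflow:

```cpp
#include <cstdint>

// Contribution of `repeats` additions of `value` to a 64-bit accumulator.
// No 32-bit product is ever formed, so nothing wraps.
std::int64_t repeated_add_increment(std::int32_t value, int repeats) {
    std::int64_t sum = 0;
    for (int i = 0; i < repeats; ++i)
        sum += value;
    return sum;
}

// What a 32-bit multiplication would yield under two's complement wrap-around.
// The product is computed modulo 2^32 in unsigned arithmetic and then mapped
// back into the signed 32-bit range by hand.
std::int64_t wrapped_32bit_product(std::int32_t value, std::int32_t repeats) {
    std::uint32_t bits =
        static_cast<std::uint32_t>(value) * static_cast<std::uint32_t>(repeats);
    std::int64_t result = bits;
    if (result >= (std::int64_t{1} << 31))
        result -= (std::int64_t{1} << 32);
    return result;
}
```

With value 30000 and repeats 100000, the first function returns 3000000000 and the second returns -1294967296, which is the discrepancy described above.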

It could transform it into

for (int c = 0; c < arraySize; ++c)
    if (data[c] >= 128)
        sum += 100000LL * data[c];  // resp. 100000ull

though, if, as usual, long long is sufficiently larger than int.
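As a rough check of that claim, the two forms can be compared directly. This is only a sketch: the function names and the repeats parameter are illustrative, std::vector stands in for the fixed-size array, and it assumes long long is wide enough to hold every product and the running sum:

```cpp
#include <vector>

// The interchanged loop nest: each qualifying element is added `repeats` times.
long long sum_by_repeated_addition(const std::vector<int>& data, int repeats) {
    long long sum = 0;
    for (int value : data)
        if (value >= 128)
            for (int i = 0; i < repeats; ++i)
                sum += value;
    return sum;
}

// The collapsed form: one multiplication, widened to long long first
// so that the product is not computed in int.
long long sum_by_widened_multiply(const std::vector<int>& data, int repeats) {
    long long sum = 0;
    for (int value : data)
        if (value >= 128)
            sum += static_cast<long long>(repeats) * value;
    return sum;
}
```

For inputs where nothing overflows long long, both functions return the same value, including for elements such as 2000000000 whose product with 100000 does not fit in a 32-bit int.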

Why it doesn't do that, I can't tell. I guess it's what Mysticial said: "apparently, it does not run a loop-collapsing pass after loop-interchange".

Note that the loop-interchange itself is not generally valid (for signed integers), since

for (int c = 0; c < arraySize; ++c)
    if (condition(data[c]))
        for (int i = 0; i < 100000; ++i)
            sum += data[c];

can lead to overflow where

for (int i = 0; i < 100000; ++i)
    for (int c = 0; c < arraySize; ++c)
        if (condition(data[c]))
            sum += data[c];

wouldn't. It's kosher here, since the condition ensures all data[c] that are added have the same sign, so if one overflows, both do.
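To see how mixed signs break the equivalence, here is a sketch that assumes a hypothetical 32-bit sum and a condition that holds for every element. It tracks the running total in a 64-bit variable and only reports whether the total ever leaves the 32-bit range, so the sketch itself never overflows. All names are illustrative:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// True if `value` could not be stored in a 32-bit signed accumulator.
bool leaves_int32_range(std::int64_t value) {
    return value > std::numeric_limits<std::int32_t>::max() ||
           value < std::numeric_limits<std::int32_t>::min();
}

// Original order: repetitions outside, array elements inside.
bool original_order_overflows(const std::vector<std::int32_t>& data, int repeats) {
    std::int64_t sum = 0;  // wide stand-in for the hypothetical 32-bit sum
    for (int i = 0; i < repeats; ++i)
        for (std::int32_t value : data) {
            sum += value;
            if (leaves_int32_range(sum)) return true;
        }
    return false;
}

// Interchanged order: array elements outside, repetitions inside.
bool interchanged_order_overflows(const std::vector<std::int32_t>& data, int repeats) {
    std::int64_t sum = 0;  // wide stand-in for the hypothetical 32-bit sum
    for (std::int32_t value : data)
        for (int i = 0; i < repeats; ++i) {
            sum += value;
            if (leaves_int32_range(sum)) return true;
        }
    return false;
}
```

With data {2000000000, -2000000000} and two repetitions, the original order alternates between 2000000000 and 0 and stays in range, while the interchanged order reaches 4000000000 after the second addition. When all elements have the same sign, the two orders agree on whether the range is exceeded.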

I wouldn't be too sure that the compiler took that into account, though (@Mysticial, could you try with a condition like data[c] & 0x80 or so that can be true for positive and negative values?). I have had compilers make invalid optimisations. For example, a couple of years ago, ICC (11.0, iirc) used a signed-32-bit-int-to-double conversion in 1.0/n where n was an unsigned int. The result was about twice as fast as gcc's output, but wrong: a lot of the values were larger than 2^31. Oops.


This answer does not apply to the specific case linked, but it does apply to the question title, and may be interesting to future readers:

Due to finite precision, repeated floating-point addition is not equivalent to multiplication. Consider:

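#include <iostream>

int main() {
    float const step = 1e-15;
    float const init = 1;
    long int const count = 1000000000;

    // Repeated addition: each step is far too small to change a float near 1.
    float result1 = init;
    for( int i = 0; i < count; ++i ) result1 += step;

    // Single multiplication: step * count is large enough to register.
    float result2 = init;
    result2 += step * count;

    // Prints a nonzero difference.
    std::cout << (result1 - result2) << '\n';
}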

Demo: http://ideone.com/7RhfP


A compiler runs various passes, each of which performs some optimization. Usually a given pass optimizes either individual statements or loops. At present there is no pass that optimizes a loop body based on the loop header. Such a pattern is hard to detect and not very common.

The optimization that was done here was loop-invariant code motion, which can be implemented using a number of techniques.
