Why no compiler appears able to optimize this code?

Consider the following C code (assuming 80-bit long double ) (note, I do know of memcmp , this is just an experiment):

enum { sizeOfFloat80=10 }; // NOTE: sizeof(long double) != sizeOfFloat80
_Bool sameBits1(long double x, long double y)
{
    for(int i=0;i<sizeOfFloat80;++i)
        if(((char*)&x)[i]!=((char*)&y)[i])
            return 0;
    return 1;
}

All compilers I checked (gcc, clang, icc on gcc.godbolt.org) generate similar code, here's an example for gcc with options -O3 -std=c11 -fomit-frame-pointer -m32 :

sameBits1:
        movzx   eax, BYTE PTR [esp+16]
        cmp     BYTE PTR [esp+4], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+17]
        cmp     BYTE PTR [esp+5], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+18]
        cmp     BYTE PTR [esp+6], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+19]
        cmp     BYTE PTR [esp+7], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+20]
        cmp     BYTE PTR [esp+8], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+21]
        cmp     BYTE PTR [esp+9], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+22]
        cmp     BYTE PTR [esp+10], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+23]
        cmp     BYTE PTR [esp+11], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+24]
        cmp     BYTE PTR [esp+12], al
        jne     .L11
        movzx   eax, BYTE PTR [esp+25]
        cmp     BYTE PTR [esp+13], al
        sete    al
        ret
.L11:
        xor     eax, eax
        ret

This looks ugly, has branch on every byte and in fact doesn't seem to have been optimized at all (but at least the loop is unrolled). It's easy to see though that this could be optimized to the code equivalent to the following (and in general for larger data to use larger strides):

#include <string.h>
_Bool sameBits2(long double x, long double y)
{
    long long X=0; memcpy(&X,&x,sizeof x);
    long long Y=0; memcpy(&Y,&y,sizeof y);
    short Xhi=0; memcpy(&Xhi,sizeof x+(char*)&x,sizeof Xhi);
    short Yhi=0; memcpy(&Yhi,sizeof y+(char*)&y,sizeof Yhi);
    return X==Y && Xhi==Yhi;
}

And this code now gets much nicer compilation result:

sameBits2:
        sub     esp, 20
        mov     edx, DWORD PTR [esp+36]
        mov     eax, DWORD PTR [esp+40]
        xor     edx, DWORD PTR [esp+24]
        xor     eax, DWORD PTR [esp+28]
        or      edx, eax
        movzx   eax, WORD PTR [esp+48]
        sete    dl
        cmp     WORD PTR [esp+36], ax
        sete    al
        add     esp, 20
        and     eax, edx
        ret

So my question is: why is none of the three compilers able to do this optimization? It it something very uncommon to see in the C code?


Firstly, it is unable to do this optimization because you completely obfuscated the meaning of your code by overloading it with unduly amount of memory reinterpretation. A code like this justly makes the compiler react with "I don't know what on Earth this is, but if that's what you want, that's what you'll get". Why you expect the compiler to even bother to transform on kind of memory reinterpretation into another kind of memory reinterpretation (!) is completely unclear to me.

Secondly, it can probably be made to do it in theory, but it is probably not very high on the list of its priorities. Remember, that code optimization is usually done by a pattern matching algorithm, not by some kind of AI And this is just not one of the patterns it recognizes.

Most of the time your manual attempts to perform low-level optimization of the code will defeat compiler's effort to do the same. If you want to optimize it yourself, then go all the way. Don't expect to be able to start and then hand it over to the compiler to finish the job for you.

Comparison of two long double values x and y can be done very easily: x == y . If you want a bit-to-bit memory comparison, you will probably make the compiler's job easier by just using memcmp in a compiler that inherently knows what memcmp is (built-in, intrinsic function).

链接地址: http://www.djcxy.com/p/286.html

上一篇: 编译器优化:g ++比intel慢

下一篇: 为什么没有编译器能够优化此代码?