Which is faster: Stack allocation or Heap allocation

This question may sound fairly elementary, but this is a debate I had with another developer I work with.

I was taking care to stack-allocate things where I could, instead of heap-allocating them. He was watching over my shoulder as I worked and commented that it wasn't necessary, because they are the same performance-wise.

I was always under the impression that growing the stack was constant time, and that heap allocation's performance depended on the current complexity of the heap, both for allocation (finding a hole of the proper size) and for de-allocation (collapsing holes to reduce fragmentation, since many standard library implementations take time to do this during deletes, if I am not mistaken).

This strikes me as something that would probably be very compiler-dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC and MSVC++, what is the case? Is heap allocation not as high-performing as stack allocation? Is there no difference? Or are the differences so minute that this becomes pointless micro-optimization?


Stack allocation is much faster, since all it really does is move the stack pointer. Using memory pools, you can get comparable performance out of heap allocation, but that comes with some added complexity and its own headaches.
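
To illustrate why pools can close the gap: if every block has the same size, allocation degenerates to popping a pointer off a free list, which is constant time just like bumping the stack pointer. Below is a minimal sketch of such a fixed-size pool; it is my own illustration (the name FixedPool and its interface are hypothetical), not a production allocator.

#include <cassert>
#include <cstddef>
#include <vector>

// Minimal fixed-size pool: every block is block_size bytes, and free blocks
// are threaded onto a singly linked free list stored inside the blocks
// themselves, so allocate/deallocate are each a couple of pointer moves.
class FixedPool {
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : storage_(block_size * block_count)
    {
        assert(block_size >= sizeof(void*));        // each free block must hold a pointer
        assert(block_size % alignof(void*) == 0);   // blocks must stay pointer-aligned
        for (std::size_t i = 0; i < block_count; ++i)
            deallocate(storage_.data() + i * block_size);
    }

    void* allocate()
    {
        if (!free_list_) return nullptr;            // pool exhausted
        void* block = free_list_;
        free_list_ = *static_cast<void**>(block);   // pop the free list
        return block;
    }

    void deallocate(void* block)
    {
        *static_cast<void**>(block) = free_list_;   // push onto the free list
        free_list_ = block;
    }

private:
    std::vector<char> storage_;
    void* free_list_ = nullptr;
};

Because there is no searching for a hole of the right size and no coalescing on free, the per-allocation cost is flat; the trade-off is that the pool serves only one block size and holds its memory until it is destroyed.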

Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.


Stack is much faster. It literally uses only a single instruction on most architectures, in most cases; e.g., on x86:

sub esp, 0x10

(That moves the stack pointer down by 0x10 bytes and thereby "allocates" those bytes for use by a variable.)
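
For instance, several locals typically share that single adjustment. A hypothetical sketch (the exact instruction and offset depend on the compiler and ABI):

void f()
{
    int a;        // all three locals are "allocated" together by one
    int b;        // stack-pointer adjustment in the function prologue,
    char buf[8];  // e.g. something like: sub esp, 0x10
    (void)a; (void)b; (void)buf;  // silence unused-variable warnings
}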

Of course, the stack's size is very, very finite, as you will quickly find out if you overuse stack allocation or try to do recursion :-)
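
For example (a hypothetical illustration; default stack sizes are commonly on the order of 1 MB on Windows and 8 MB on Linux, so the exact limit varies):

#include <vector>

void likely_overflows()
{
    char buffer[64 * 1024 * 1024]; // 64 MB of locals: far beyond a typical stack
    buffer[0] = 1;                 // touching it makes the overflow observable
}

void fine_on_the_heap()
{
    std::vector<char> buffer(64 * 1024 * 1024); // 64 MB on the heap: no problem
    buffer[0] = 1;
}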

Also, there's little reason to optimize the performance of code that doesn't verifiably need it, as demonstrated by profiling. "Premature optimization" often causes more problems than it's worth.

My rule of thumb: if I know at compile time how much data I'm going to need, and it's under a few hundred bytes in size, I stack-allocate it. Otherwise I heap-allocate it.
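
In practice that rule of thumb might look something like this (my own sketch of the idea, not prescriptive):

#include <array>
#include <cstddef>
#include <vector>

void process(std::size_t n)
{
    std::array<char, 256> small{};  // size known at compile time and small:
                                    // lives on the stack

    std::vector<char> big(n);       // size known only at run time (or large):
                                    // elements live on the heap
}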


Honestly, it's trivial to write a program to compare the performance:

#include <ctime>
#include <iostream>

namespace {
    class empty { }; // even empty classes take up 1 byte of space, minimum
}

int main()
{
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000; ++i)
        empty e;
    std::clock_t duration = std::clock() - start;
    std::cout << "stack allocation took " << duration << " clock ticksn";
    start = std::clock();
    for (int i = 0; i < 100000; ++i) {
        empty* e = new empty;
        delete e;
    }
    duration = std::clock() - start;
    std::cout << "heap allocation took " << duration << " clock ticksn";
}

It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.

An optimizing compiler may notice that this code does nothing, and may optimize it all away. It is the optimizer's job to do stuff like that, and fighting the optimizer is a fool's errand.

I would recommend compiling this code with optimization turned off because there is no good way to fool every optimizer currently in use or that will be in use in the future.

Anybody who turns the optimizer on and then complains about fighting it should be subject to public ridicule.

If I cared about nanosecond precision, I wouldn't use std::clock(). If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.

The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:

  1. Add a data member to empty, and access that data member in the loop; but if I only ever read from the data member, the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."

  2. Declare e volatile, but volatile is often compiled incorrectly (PDF). A sketch of this approach appears just after this list.

  3. Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.
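
    For the curious, here is roughly how (1) and (2) might be combined. This is my own illustrative sketch, not part of the benchmark above, and as noted, volatile is no silver bullet:

    #include <cstdio>

    namespace {
        struct not_quite_empty {
            volatile int member; // volatile writes are observable side effects,
                                 // so the optimizer must keep every iteration
        };
    }

    int main()
    {
        for (int i = 0; i < 100000; ++i) {
            not_quite_empty* e = new not_quite_empty;
            e->member = i;   // the write the optimizer cannot discard
            delete e;
        }
        std::puts("done");
    }

    Even this may not survive every optimizer: newer C++ standards explicitly permit compilers to elide matched new/delete pairs, which is exactly the point about fighting the optimizer being a fool's errand.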

    Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course, variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers for stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement.

    On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100,000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.


    Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000, stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.


    In the years since I wrote this, the preference on Stack Overflow has been to post performance from optimized builds. In general, I think this is correct. However, I still think it's silly to ask the compiler to optimize code when you in fact do not want that code optimized. It strikes me as being very similar to paying extra for valet parking, but refusing to hand over the keys. In this particular case, I don't want the optimizer running.

    Using a slightly modified version of the benchmark (to address the valid point that the original program didn't allocate something on the stack each time through the loop) and compiling without optimizations but linking to release libraries (to address the valid point that we don't want to include any slowdown caused by linking to debug libraries):

    #include <cstdio>
    #include <chrono>
    
    namespace {
        void on_stack()
        {
            int i;
        }
    
        void on_heap()
        {
            int* i = new int;
            delete i;
        }
    }
    
    int main()
    {
        auto begin = std::chrono::system_clock::now();
        for (int i = 0; i < 1000000000; ++i)
            on_stack();
        auto end = std::chrono::system_clock::now();
    
        std::printf("on_stack took %f secondsn", std::chrono::duration<double>(end - begin).count());
    
        begin = std::chrono::system_clock::now();
        for (int i = 0; i < 1000000000; ++i)
            on_heap();
        end = std::chrono::system_clock::now();
    
        std::printf("on_heap took %f secondsn", std::chrono::duration<double>(end - begin).count());
        return 0;
    }
    

    displays:

    on_stack took 2.070003 seconds
    on_heap took 57.980081 seconds
    

    on my system when compiled with the command line cl foo.cc /Od /MT /EHsc.

    You may not agree with my approach to getting a non-optimized build. That's fine: feel free to modify the benchmark as much as you want. When I turn on optimization, I get:

    on_stack took 0.000000 seconds
    on_heap took 51.608723 seconds
    

    Not because stack allocation is actually instantaneous, but because any half-decent compiler can notice that on_stack doesn't do anything useful and optimize it away. GCC on my Linux laptop also notices that on_heap doesn't do anything useful, and optimizes it away as well:

    on_stack took 0.000003 seconds
    on_heap took 0.000002 seconds
    