Is Java ArrayList really this much slower than C++ vector?

I do not want to start yet another pointless flamewar about whether Java or C++ is the better language in general. I want to know whether a comparison I did for a specific task is fair and the measured data correct.

We need to decide whether to use Java or C++ for our next project. I'm in the C++ camp but I want to have solid arguments for my case. Our application is a special and has the following needs:

  • The program must run reasonably fast and be reasonably memory efficient. We do not care about the last 20% performance. However, a 10x performance difference is a show stopper.
  • We have lots of arrays. We do not know their size upfront. It is therefore important that arrays can grow at the back in amortized O(1) running time.
  • The elements in the arrays consist of a small number of basic data types. The typical example is a tuple of integers or floats.
  • The arrays can get large. 10^6 elements are standard. We have applications with 10^7 elements and supporting 10^8 would be great.
  • I implemented a toy program in C++ and in Java. First, I present the C++ version:

    #include <iostream>
    #include <vector>
    #include <cstdlib>
    using namespace std;
    
    struct Point{
            float x, y;
    };
    
    int main(int argc, char*argv[]){
            int n = atoi(argv[1]);
    
            vector<Point>arr;
    
            for(int i=0; i<n; ++i){
                    Point p;
                    p.x = i;
                    p.y = i+0.5f;
                    arr.push_back(p);
            }
    
            float dotp = 0;
            for(int i=0; i<n; ++i)
                    dotp += arr[i].x * arr[i].y;
    
            cout << dotp << endl;
    }
    

    Next is the Java version that does the same thing:

    import java.util.*;
    
    class Point{
            public float x, y;
    }
    
    class Main{
            static public void main(String[]args){
                    int n = Integer.parseInt(args[0]);
    
                    ArrayList<Point> arr = new ArrayList<Point>();
    
                    for(int i=0; i<n; ++i){
                            Point p = new Point();
                            p.x = i;
                            p.y = i+0.5f;
                            arr.add(p);
                    }
    
                    float dotp = 0;
                    for(int i=0; i<n; ++i)
                            dotp += arr.get(i).x * arr.get(i).y;
    
                    System.out.println(dotp);
            }
    }
    

    I pass the number of elements using the commandline to the program to prevent the optimizer from executing the program during compilation. The computed value is not useful. The only interesting question is how fast the programs run and how much memory they use. I start with C++:

    $ g++ --version
    g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
    $ g++ -O3 test.cpp -o test
    $ /usr/bin/time ./test 1000000
    3.33381e+17
    0.01user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 10084maxresident)k
    0inputs+0outputs (0major+2348minor)pagefaults 0swaps
    $ /usr/bin/time ./test 10000000
    3.36984e+20
    0.08user 0.01system 0:00.09elapsed 100%CPU (0avgtext+0avgdata 134380maxresident)k
    0inputs+0outputs (0major+4074minor)pagefaults 0swaps
    $ /usr/bin/time ./test 100000000
    2.42876e+23
    0.77user 0.09system 0:00.87elapsed 99%CPU (0avgtext+0avgdata 1050400maxresident)k
    0inputs+0outputs (0major+6540minor)pagefaults 0swaps
    

    The "user" time is how long the program ran. For 10^6 elements, it ran for 0.01 sec, for 10^7 elements 0.08 sec, and for 10^8 elements 0.77 sec. "maxresident" is the amount of physically memory in kilobyte that the kernel gave the program. For 10^6 its 10 MB, for 10^7 its 132 MB, and for 10^8 its 1 GB.

    The memory consumption sounds right. An array with x elements needs sizeof(float)*2*x=8*x bytes of memory. For 10^6 elements that is about 8MB, for 10^7 about 76MB, and for 10^8 about 762 MB.

    Next, I run the Java program:

    $ javac -version
    javac 1.6.0_41
    $ javac Main.java
    $ java -version
    java version "1.7.0_131"
    OpenJDK Runtime Environment (IcedTea 2.6.9) (7u131-2.6.9-0ubuntu0.14.04.2)
    OpenJDK 64-Bit Server VM (build 24.131-b00, mixed mode)
    $ /usr/bin/time java Main 1000000
    3.33381168E17
    0.16user 0.00system 0:00.09elapsed 173%CPU (0avgtext+0avgdata 79828maxresident)k
    0inputs+64outputs (0major+4314minor)pagefaults 0swaps
    $ /usr/bin/time java Main 10000000
    3.3698438E20
    5.23user 0.18system 0:02.07elapsed 261%CPU (0avgtext+0avgdata 424180maxresident)k
    0inputs+64outputs (0major+13508minor)pagefaults 0swaps
    $ /usr/bin/time java Main 100000000
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at Main.main(Main.java:14)
    Command exited with non-zero status 1
    3840.72user 13.06system 17:11.79elapsed 373%CPU (0avgtext+0avgdata 2281416maxresident)k
    0inputs+1408outputs (0major+139893minor)pagefaults 0swaps
    

    For 10^6 elements it needs 0.16 sec and 78 MB. For 10^7 elements it needs 5.23 sec and 414 MB. I tried to run the program for 10^8 elements but Java crashed. It used all cores of my machine (in a sequential program!) and ran for 17 min while occupying 2.2GB. My machine has 8 GB of memory.

    For 10^6 elements C++ is 0.16 / 0.01 = 16 times faster and needs 78/10 = 7.8 times less memory. For 10^7 elements C++ is 5.23/0.08 = 65 times faster and needs 414/132 = 3.1 times less memory. Java did not finish on the test instance with 10^8 elements while the C++ program finished within well below a second.

    For 10^6 Java seems manageable but less than ideal. For 10^7 and 10^8 it is an absolute no-go. I expected a slight performance advantage of C++ over Java but not something this drastic.

    The most likely explanation is that my testing methodology is wrong or that I have a non-obvious performance bottleneck in my Java code. Another explanation would be that the OpenJDK JVM significantly lacks behind JVMs from other vendors.

    Please explain to me why Java performs so bad in this benchmark. How did I unintentionally make Java look worse than it is?

    Thanks


    The JIT not running thus explains a small part of the effect but not the primary slowdown.

    Right, Java is slow on start up because of JIT and it takes some time till it runs with full speed.

    But the performance you describe is catastrophic and has another reason: You wrote

    It used all cores of my machine (in a sequential program!)

    and this must have been the garbage collector. The GC running hard means you're running out of memory. On my machine the times were

      28.689 millis for 1 M pairs
     143.104 millis for 10 M pairs
    3100.856 millis for 100 M pairs
      10.404 millis for 1 M pairs
     113.054 millis for 10 M pairs
    2528.371 millis for 100 M pairs
    

    which is still a pain, but a possibly usable starting point. Observe that the second run is faster as it gets optimized better. Note that this is not how Java benchmarks should be written!


    The reason for the memory consumption is that you have a List of references to objects containing two floats instead of vector of pairs of floats. Every references adds 4 or 8 bytes overhead, every object adds even more. Additionally, there's an indirection on every access.

    If memory matters, then this isn't the right way how to code it in Java. There's surely a better way (I made give it try), but the code may get ugly. Java without value types sucks at such computations. IMHO it excels nearly everywhere else (personal opinion).

    An equivalent C++ code would use vector<Point*> . If you do this, your code gets slower and memory bigger, but still better than Java (object header overhead).


    I rewrote the code so it uses a PointArray storing the two floats next to each other in a single array. Without measuring anything, I claim that the memory consumption is about the same now. The times now are

      31.505 millis for 1 M pairs
     232.658 millis for 10 M pairs
    1870.664 millis for 100 M pairs
      17.536 millis for 1 M pairs
     219.222 millis for 10 M pairs
    1757.475 millis for 100 M pairs
    

    which is still too slow. I guess, it's the bound checking, which can't be switched off in Java (unless you resolve to Unsafe ). Usually, the JIT (just-in-time compiler) can move them out of loops, which makes their cost negligible.

    It may be also my slow (factor 1.5) array resizing (IIRC vector uses a factor of 2). Or just a pity when the array gets resized when you're nearly done. As you're doing about nothing with the array, it may weight a lot.

    In any case, you need at least one good Java programmer when you want to get fast processing of arrays of primitives. It may take them a day or two till you get good performance. A good library could do as well. Using List<Point> is just too inefficient before Java 10 (or 11, or ...).

    链接地址: http://www.djcxy.com/p/96412.html

    上一篇: 正确拉动edxops /论坛的方式

    下一篇: Java ArrayList真的比C ++向量要慢吗?