Performance of C++ vs Virtual Machine languages in high frequency finance

I thought the C/C++ vs C#/Java performance question was well trodden, meaning that I'd read enough evidence to suggest that the VM languages are not necessarily any slower than the "close-to-silicon" languages. Mostly because the JIT compiler can do optimizations that the statically compiled languages cannot.

However, I recently received a CV from a guy who claims that Java-based high frequency trading is always beaten by C++, and that he'd been in a situation where this was the case.

A quick browse on job sites indeed shows that HFT applicants need knowledge of C++, and a look at Wilmott forum shows all the practitioners talking about C++.

Is there any particular reason why this is the case? I would have thought that with modern financial business being somewhat complex, a VM language with type safety, managed memory, and a rich library would be preferred. Productivity is higher that way. Plus, JIT compilers are getting better and better. They can do optimizations as the program is running, so you'd think they'd use that run-time info to beat the performance of the unmanaged program.

Perhaps these guys are writing the critical bits in C++ and calling them from a managed environment (P/Invoke etc.)? Is that possible?

Finally, does anyone have first-hand experience with the central question here: why, in this domain, is unmanaged code so clearly preferred over managed?

As far as I can tell, the HFT guys need to react as fast as possible to incoming market data, but this is not necessarily a hard realtime requirement. You're worse off if you're slow, that's for sure, but you don't need to guarantee a certain speed on each response, you just need a fast average.

EDIT

Right, a couple of good answers thus far, but pretty general (well-trodden ground). Let me specify what kind of program HFT guys would be running.

The main criterion is responsiveness. When an order hits the market, you want to be the first to be able to react to it. If you're late, someone else might take it before you, but each firm has a slightly different strategy, so you might be OK if one iteration is a bit slow.

The program runs all day long, with almost no user intervention. Whatever function is handling each new piece of market data is run dozens (even hundreds) of times a second.

These firms generally have no limit as to how expensive the hardware is.


Firstly, 1 ms is an eternity in HFT. If you think it is not then it would be good to do a bit more reading about the domain. (It is like being 100 miles away from the exchange.) Throughput and latency are deeply intertwined as the formulae in any elementary queuing theory textbook will tell you. The same formulae will show jitter values (frequently dominated by the standard deviation of CPU queue delay if the network fabric is right and you have not configured quite enough cores).
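
For reference, the standard M/M/1 textbook results make that coupling explicit (this is the generic model, not a claim about any particular system): with arrival rate $\lambda$, service rate $\mu$, and utilization $\rho = \lambda/\mu$,

$$ W_q = \frac{\rho}{\mu - \lambda}, \qquad W = \frac{1}{\mu - \lambda}, $$

and the total time in the system is exponentially distributed, so its standard deviation equals its mean $1/(\mu - \lambda)$. Push the arrival rate $\lambda$ toward the service capacity $\mu$ and latency and jitter blow up together.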

One of the problems with HFT arbitrage is that once you decide to capture a spread, there are two legs (or more) to realize the profit. If you fail to hit all legs you can be left with a position that you really don't want (and a subsequent loss) - after all you were arbitraging not investing.

You don't want positions unless your strategy is predicting the (VERY near term!!!) future (and this, believe it or not, is done VERY successfully). If you are 1 ms away from the exchange then some significant fraction of your orders won't be executed and what you wanted will be picked off. Most likely the ones that have executed one leg will end up losers, or at least not profitable.

Whatever your strategy is, for argument's sake let us say it ends up with a 55%/45% win/loss ratio. Even a small change in that ratio can mean a big change in profitability.
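
To put illustrative numbers on that (a toy calculation, assuming wins and losses of equal size $G$):

$$ \mathbb{E}[\text{P\&L per trade}] = p\,G - (1 - p)\,G = (2p - 1)\,G $$

At $p = 0.55$ that is $0.10\,G$ per trade; slip to $p = 0.53$ and it is $0.06\,G$, so a two-point drop in the hit rate wipes out 40% of the profit.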

re: "run dozens (even hundreds)" seems off by orders of magnitude Even looking at 20000 ticks a second seems low, though this might be the average for the entire day for the instrument set that he is looking at.

There is high variability in the rates seen in any given second. To give an example: in some of my testing I look at 7 OTC stocks (CSCO, GOOG, MSFT, EBAY, AAPL, INTC, DELL). In the middle of the day the per-second rates for this stream can range from 0 messages per second (very, very rare) to almost 2000 quotes and trades in a peak second. (You can see why I think the 20000 above is low.)

I build infrastructure and measurement software for this domain, and the numbers we talk about are in the hundreds of thousands and millions per second. I have C++ producer/consumer infrastructure libraries that can push almost 5000000 (5 million) messages/second between producer and consumer (32-bit, 2.4 GHz cores). These are 64-byte messages with new, construct, enqueue, synchronize on the producer side and synchronize, dequeue, touch every byte, run virtual destructor, free on the consumer side. Admittedly that is a simple benchmark with no socket IO (and socket IO can be ugly), as there would be at the end-point pipe stages. It is ALL custom synchronization classes that only synchronize when empty, custom allocators, custom lock-free queues and lists, occasional STL (with custom allocators), but more often custom intrusive collections (of which I have a significant library). More than once I have given a vendor in this arena a quadrupling (or more) in throughput without increased batching at the socket endpoints.
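
For a flavour of what such a queue can look like, here is a minimal sketch of a single-producer/single-consumer ring buffer using C++11 atomics. It is a generic illustration only, not the author's library (which additionally uses custom allocators, intrusive collections, and synchronization classes that only synchronize when the queue is empty):

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer ring buffer. One thread calls
// try_push, one thread calls try_pop; no locks, only acquire/release atomics.
template <typename T, std::size_t Capacity>   // Capacity must be a power of two
class SpscRing {
    static_assert(Capacity != 0 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");
public:
    bool try_push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == Capacity) return false;        // queue is full
        buf_[head & (Capacity - 1)] = v;
        head_.store(head + 1, std::memory_order_release); // publish the slot
        return true;
    }
    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;                   // queue is empty
        out = buf_[tail & (Capacity - 1)];
        tail_.store(tail + 1, std::memory_order_release); // free the slot
        return true;
    }
private:
    alignas(64) std::atomic<std::size_t> head_{0};  // written only by the producer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written only by the consumer
    T buf_[Capacity];
};
```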

I have OrderBook and OrderBook::Universe classes that take less than 2 µs for the new, insert, find, partial fill, find, second fill, erase, delete sequence when averaged over 22000 instruments. The benchmark iterates over all 22000 instruments serially between the insert, first fill, and last fill, so there are no cheap caching tricks involved. Operations on the same book are separated by accesses to 22000 different books. These are very much NOT the caching characteristics of real data: real data is much more localized in time, and consecutive trades frequently hit the same book.
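
The shape of that benchmark, with a purely hypothetical stand-in for the OrderBook interface (the real classes are not shown in this post), would look roughly like this; the only point being illustrated is the access pattern:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// HYPOTHETICAL stand-in for the OrderBook class mentioned above; the method
// names here are made up. The point is the access pattern: each step is run
// across all 22000 books before the next step revisits the same book, so the
// book you come back to has long since fallen out of the caches.
struct OrderBook {
    void insert(long id, double px, int qty) { /* ... */ }
    void fill(long id, int qty)              { /* ... */ }
    void erase(long id)                      { /* ... */ }
};

int main() {
    constexpr int kBooks = 22000;
    std::vector<OrderBook> books(kBooks);

    const auto t0 = std::chrono::steady_clock::now();
    for (int b = 0; b < kBooks; ++b) books[b].insert(b, 100.0, 500);  // new order
    for (int b = 0; b < kBooks; ++b) books[b].fill(b, 200);           // partial fill
    for (int b = 0; b < kBooks; ++b) books[b].fill(b, 300);           // second fill
    for (int b = 0; b < kBooks; ++b) books[b].erase(b);               // erase + delete
    const auto t1 = std::chrono::steady_clock::now();

    std::printf("avg per insert/fill/fill/erase sequence: %.0f ns\n",
                std::chrono::duration<double, std::nano>(t1 - t0).count() / kBooks);
}
```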

All of this work involves careful consideration of the constants and caching characteristics hidden inside the algorithmic costs of the collections used. (Sometimes it seems that the K's in K·O(n), K·O(n·log n), etc. are dismissed a bit too glibly.)
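
A generic example of the kind of constant-factor and cache effect being alluded to (not the author's collections): both lookups below are O(log n), but the flat sorted vector walks one contiguous block that tends to stay in cache, while std::map chases pointers through separately allocated tree nodes, so the constants in front of the log differ considerably in practice.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Same asymptotic complexity, very different constants and cache behaviour.
bool in_map(const std::map<int, int>& m, int key) {
    return m.find(key) != m.end();                  // O(log n), one node hop per level
}

bool in_flat(const std::vector<std::pair<int, int>>& v, int key) {  // kept sorted by key
    auto it = std::lower_bound(v.begin(), v.end(), key,
                               [](const std::pair<int, int>& p, int k) { return p.first < k; });
    return it != v.end() && it->first == key;       // O(log n), contiguous memory
}
```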

I work on the market data infrastructure side of things. It is inconceivable to even think of using Java or a managed environment for this work. And when you can get this kind of performance with C++ (and I think it is quite hard to get million+ messages/second performance out of a managed environment), I can't imagine any of the significant investment banks or hedge funds (for whom a $250000 salary for a top-notch C++ programmer is nothing) not going with C++.

Is anybody out there really getting 2000000+ messages/second of performance out of a managed environment? I know a few people in this arena, and no one has ever bragged about it to me. And I think 2 million per second in a managed environment would carry some bragging rights.

I know of one major player's FIX order decoder doing 12000000 field decodes/sec (3 GHz CPU). It is C++, and the guy who wrote it practically challenged anybody to come up with something in a managed environment that is even half that speed.
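
For readers unfamiliar with FIX: fields are tag=value pairs separated by the SOH byte (0x01), so field decoding boils down to a tight scan over raw bytes. A minimal, generic sketch of such a scanner (not the decoder mentioned above) looks something like this:

```cpp
#include <cstddef>
#include <string_view>

// Generic sketch of a FIX tag=value field scanner. FIX fields look like
// "35=D<SOH>49=SENDER<SOH>...", where <SOH> is the byte 0x01. The whole job
// is a tight scan over raw bytes, which is why a careful C++ implementation
// can reach tens of millions of fields per second.
inline std::size_t decode_fields(const char* msg, std::size_t len,
                                 void (*on_field)(int tag, std::string_view value)) {
    std::size_t fields = 0;
    const char* p   = msg;
    const char* end = msg + len;
    while (p < end) {
        int tag = 0;
        while (p < end && *p != '=')                 // parse the numeric tag
            tag = tag * 10 + (*p++ - '0');
        if (p < end) ++p;                            // skip '='
        const char* v = p;
        while (p < end && *p != '\x01') ++p;         // value runs up to SOH
        on_field(tag, std::string_view(v, static_cast<std::size_t>(p - v)));
        if (p < end) ++p;                            // skip SOH
        ++fields;
    }
    return fields;
}
```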

Technologically it is an interesting area with lots of fun performance challenges. Consider the options market when the underlying security changes: there might be, say, 6 outstanding price points with 3 or 4 different expiration dates. Now for each trade there are probably 10-20 quotes, and those quotes can trigger price changes in the options. So for each trade there might be 100 or 200 changes in options quotes. It is just a ton of data - not a Large Hadron Collider collision-detector-like amount of data, but still a bit of a challenge. It is a bit different than dealing with keystrokes.

Even the debate about FPGAs goes on. Many people take the position that a well-coded parser running on 3 GHz commodity hardware can beat a 500 MHz FPGA. But even if a tiny bit slower (not saying they are), FPGA-based systems tend to have tighter delay distributions. (Read "tend" - this is not a blanket statement.) Of course, if you have a great C++ parser that you push through Cfront and then push that through the FPGA image generator... But that's another debate...


A lot of it comes down to a simple difference between fact and theory. People have advanced theories to explain why Java should be (or at least might be) faster than C++. Most of the arguments have little to do with Java or C++ per se, but with dynamic versus static compilation, with Java and C++ really being little more than examples of the two (though, of course, it's possible to compile Java statically, or C++ dynamically). Most of these people have benchmarks to "prove" their claim. When those benchmarks are examined in any detail, it quickly becomes obvious that in quite a few cases they took rather extreme measures to get the results they wanted (e.g., quite a number enabled optimization when compiling the Java, but specifically disabled optimization when compiling the C++).

Compare this to the Computer Language Benchmarks Game, where pretty much anybody can submit an entry, so all of the code tends to be optimized to a reasonable degree (and, in a few cases, even an unreasonable degree). It seems pretty clear that a fair number of people treat this as essentially a competition, with advocates of each language doing their best to "prove" that their preferred language is best. Since anybody can submit an implementation of any problem, a particularly poor submission has little effect on overall results. In this situation, C and C++ emerge as clear leaders.

Worse, if anything these results probably show Java in better light than is entirely accurate. In particular, somebody who uses C or C++ and really cares about performance can (and often will) use Intel's compiler instead of g++. This will typically give at least a 20% improvement in speed compared to g++.

Edit (in response to a couple of points raised by jalf, but really too long to fit reasonably in comments):

  • Pointers being an optimizer writer's nightmare. This is really overstating things (quite) a bit. Pointers lead to the possibility of aliasing, which prevents certain optimizations under certain circumstances. That said, inlining prevents the ill effects much of the time (i.e., the compiler can detect whether there's aliasing rather than always generating code under the assumption that there could be). Even when the code does have to assume aliasing, caching minimizes the performance hit from doing so (i.e., data in L1 cache is only minutely slower than data in a register). Preventing aliasing would help performance in C++, but not nearly as much as you might think (see the sketch after this list).

  • Allocation being a lot faster with a garbage collector. It's certainly true that the default allocator in many C++ implementations is slower than what most (current) garbage collected allocators provide. This is balanced (to at least a degree) by the fact that allocations in C++ tend to be on the stack, which is also fast, whereas in a GC language nearly all allocations are usually on the heap. Worse, in a managed language you usually allocate space for each object individually whereas in C++ you're normally allocating space for all the objects in a scope together.

  • It's also true that C++ directly supports replacing allocators both globally and on a class-by-class basis, so when/if allocation speed really is a problem it's usually fairly easy to fix.

    Ultimately, jalf is right: both of these points undoubtedly do favor "managed" implementations. The degree of that improvement should be kept in perspective though: they're not enough to let dynamically compiled implementations run faster on much code -- not even benchmarks designed from the beginning to favor them as much as possible.
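
    As a concrete illustration of the aliasing point in the first bullet (a generic example, not taken from any particular codebase):

```cpp
// In scale(), the compiler must assume *factor could alias an element of
// dst, so it conservatively reloads *factor on every iteration. With
// __restrict (a widely supported compiler extension; C99's 'restrict' is not
// part of standard C++), or when inlining lets the compiler prove there is
// no overlap, the load can be hoisted out of the loop.
void scale(float* dst, const float* src, const float* factor, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * *factor;      // *factor re-read each iteration
}

void scale_restrict(float* __restrict dst, const float* __restrict src,
                    const float* __restrict factor, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * *factor;      // *factor can be loaded once
}
```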

    Edit2: I see Jon Harrop has attempted to insert his two (billionths of a) cent's worth. For those who don't know him, Jon's been a notorious troll and spammer for years, and seems to be looking for new ground into which to sow weeds. I'd try to reply to his comment in detail, but (as is typical for him) it consists solely of unqualified, unsupported generalizations containing so little actual content that a meaningful reply is impossible. About all that can be done is to give onlookers fair warning that he's become well known for being dishonest, self-serving, and best ignored.


    A JIT compiler could theoretically perform a lot of optimizations, yes, but how long are you willing to wait? A C++ app can take hours to compile, because compilation happens offline and the user isn't sitting there tapping his fingers and waiting.

    A JIT compiler has to finish within a couple of milliseconds. So which do you think can get away with the most complex optimizations?

    The garbage collector is a factor too. Not because it is slower than manual memory management per se (I believe its amortized cost is pretty good, definitely comparable to manual memory handling), but it's less predictable. It can introduce a stall at pretty much any point, which might not be acceptable in systems that are required to be extremely responsive.

    And of course, the languages lend themselves to different optimizations. C++ allows you to write very tight code, with virtually no memory overhead, where a lot of high-level operations are basically free (class construction, say).

    In C#, on the other hand, you waste a good chunk of memory. And simply instantiating a class carries real overhead, as the base Object has to be initialized even if your actual class is empty.
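
    To make the memory-layout part of that concrete, here is a small, generic C++ illustration (illustrative only; the managed-language contrast is summarized in the comments, and exact sizes depend on the platform):

```cpp
#include <cstdio>
#include <vector>

// A plain C++ struct is exactly its members plus padding: no object header,
// no vtable unless virtuals are declared, and a std::vector of them is one
// contiguous allocation. By contrast, a managed-heap object (Java, or a C#
// class) carries a per-object header, and an array of such objects holds
// references to separately allocated instances.
struct Quote {
    double bid;
    double ask;
    int    size;
};

int main() {
    std::printf("sizeof(Quote) = %zu bytes\n", sizeof(Quote));  // typically 24 on 64-bit
    std::vector<Quote> quotes(1000);  // 1000 quotes in a single contiguous block
    std::printf("payload = %zu bytes\n", quotes.size() * sizeof(Quote));
}
```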

    C++ allows the compiler to strip away unused code aggressively. In C#, most of it has to be there so it can be found with reflection.

    On the other hand, C# code (outside of unsafe blocks) doesn't use raw pointers, which are an optimizing compiler's nightmare. And memory allocations in a managed language are far cheaper than in C++.

    There are advantages either way, so it is naive to expect that you can get a simple "one or the other" answer. Depending on the exact source code, the compiler, the OS, the hardware it's running on, one or the other may be faster. And depending on your needs, raw performance might not be the #1 goal. Perhaps you're more interested in responsiveness, in avoiding unpredictable stalls.

    In general, your typical C++ code will perform similarly to equivalent C# code. Sometimes faster, sometimes slower, but probably not a dramatic difference either way.

    But again, it depends on the exact circumstances. And it depends on how much time you're willing to spend on optimization. If you're willing to spend as much time as it takes, C++ code can usually achieve better performance than C#. It just takes a lot of work.

    And the other reason, of course, is that most companies who use C++ already have a large C++ code base which they don't particularly want to ditch. They need that to keep working, even if they gradually migrate (some) new components to a managed language.
