Performance Analysis via Core
The following code runs in about 1.5ms on my computer (compiled with GHC 8.0.1 and -02):
import Criterion
import Data.Bits
import Data.Int
import Criterion.Main
main :: IO ()
main = defaultMain [bench "mybench" $ nf (mybench 3840) (0 :: Int)]
mybench :: Int -> Int -> Double
{-# INLINE mybench #-}
mybench n = go 0 n
where go s i g | i == 0 = s
| otherwise =
let (w,_) = f 1 0 g
--w = f 1 0 g
f mag v gen | mag >= 18446744073709551616000 = (v,gen)
--f mag v gen | mag >= 18446744073709551616000 = v
| otherwise = v' `seq` f (mag*18446744073709551616 :: Integer) v' gen where
x = -8499970308474009078 :: Int
v' = (v * 18446744073709551616 + (fromIntegral x + 9223372036854775808))
y = fromInteger ((-9223372036854775808) + w `mod` 18446744073709551616)
coef = (fromIntegral (9007199254740991 .&. (y::Int64)) :: Double) / 9007199254740992
z = 2.0 * (-0.5 + coef)
in go (z+s) (i-1) g
However, if I use the commented alternates of w and f , the code runs in ~31μs! This was surprising to me since I changed very little, and because f runs twice for each of 3,840 iterations (ie, the code is barely used).
I went to the core to investigate. Here's the relevant portions of -ddump-simpl from the slow version and fast version.
Unfortunately, I can't see from the core what is making such a huge difference. The primary difference that I see is that in the fast version, GHC has realized that f doesn't need the gen argument. But surely that can't make a 45x/2 orders of magnitude performance difference.
The source code is a bit contrived (several args aren't needed or used), so my main question is about the core: I don't see any differences that would indicate such a drastic performance difference. What am I missing when analyzing the core? As a followup, what could I do at the source level of the first/slow version to get it to perform like the second/fast version?
It looks like in the fast version GHC lifted the computation:
y = fromInteger ((-9223372036854775808) + w `mod` 18446744073709551616)
out of the definition of go . Just look at where modInteger and plusInteger occur in both dumps.
It looks like in the assignment w = f 1 0 g it inlined the definition of f so that it doesn't have to compute w on each call to go . More specifically, f 1 0 g doesn't depend on any the parameters to go - ie. s , i or g , and so it's computation can be lifted out.
Even though g is passed to f in the expression f 1 0 g , it doesn't actually get used.
上一篇: 关于GHC实施的好介绍性文字?
下一篇: 通过Core进行性能分析
