Map performance - use pattern match on size to reduce recursive function calls #1069
base: master
Conversation
Thanks for testing this out. The improvements seem decent, but I've always found this undocumented special case strange and I'm not sure we want to proliferate this pattern. If we really want to avoid visiting `Tip`s, we could instead write something like

```haskell
foldMap :: Monoid m => (a -> m) -> Set a -> m
foldMap _ Tip = mempty
foldMap f (Bin _ x l r) = go x l r
  where
    go x l r = case (l, r) of
      (Bin _ lx ll lr, Bin _ rx rl rr) -> go lx ll lr <> f x <> go rx rl rr
      (Bin _ lx ll lr, Tip)            -> go lx ll lr <> f x
      (Tip, Bin _ rx rl rr)            -> f x <> go rx rl rr
      (Tip, Tip)                       -> f x
```

But I haven't measured the effects of such a change.
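As a quick sanity check (a standalone sketch, not code from this thread), one can verify that the Tip-avoiding traversal produces the same result as the naive fold. The `Set` type and the `fromList` builder here are local stand-ins, not the containers internals:

```haskell
-- Standalone sketch: the Tip-avoiding foldMap agrees with the naive one.
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

foldMapSimple :: Monoid m => (a -> m) -> Set a -> m
foldMapSimple _ Tip = mempty
foldMapSimple f (Bin _ x l r) = foldMapSimple f l <> f x <> foldMapSimple f r

foldMapNoTip :: Monoid m => (a -> m) -> Set a -> m
foldMapNoTip _ Tip = mempty
foldMapNoTip f (Bin _ x0 l0 r0) = go x0 l0 r0
  where
    go x l r = case (l, r) of
      (Bin _ lx ll lr, Bin _ rx rl rr) -> go lx ll lr <> f x <> go rx rl rr
      (Bin _ lx ll lr, Tip)            -> go lx ll lr <> f x
      (Tip, Bin _ rx rl rr)            -> f x <> go rx rl rr
      (Tip, Tip)                       -> f x

-- Right-spine builder; tree shape doesn't matter for the equivalence check,
-- and the size field is unused by these folds, so we leave it at 0.
fromList :: [a] -> Set a
fromList = foldr (\x t -> Bin 0 x Tip t) Tip

main :: IO ()
main =
  print ( foldMapSimple (: []) (fromList "hello")
        , foldMapNoTip  (: []) (fromList "hello") )  -- ("hello","hello")
```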
The version you posted is about 10 to 15% slower than what's there already. See this commit, which adds the change. It's worth asking why it's slower: pattern matching on the strict packed integer should be very fast, since there's no pointer chasing or indirection involved, while pattern matching on the tips does require some indirection. From the comment in the link above, eliding 2/3 of the recursive calls to `Tip` is a big deal, because there are a lot of `Tip`s, effectively about the same number as the number of items in the map. Maybe we should add a GHC-style note about this optimisation. Edit: I've written one.
Can you share more on this? This cannot be universally true; there are trade-offs here. It also disagrees with my benchmarks below.
It does not, thanks to GHC's pointer tagging. There should be little difference, if any, between checking if an unboxed int is 1 and checking the pointer tag. Let's reason about the implementations to see what we can expect.

Analysis

We consider three implementations:
Let's consider two classes of trees, each an extreme case for leaves.
Now we track metrics that might contribute to the cost.
For case 1 we have
For case 2 we have
We can see
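To make one of the counts above concrete, here is a small standalone sketch (the `Set` type mirrors the one under discussion; `perfect` and `visits` are hypothetical helpers, not containers code) that tallies how many `Bin` and `Tip` nodes the simple recursion visits in a perfect tree. It confirms that there is exactly one more `Tip` than there are `Bin`s:

```haskell
-- Sketch: count node visits made by the plain recursion that descends into Tips.
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

-- A perfect tree of depth d, containing 2^d - 1 elements; keys are irrelevant here.
perfect :: Int -> Set ()
perfect 0 = Tip
perfect d = Bin (2 ^ d - 1) () (perfect (d - 1)) (perfect (d - 1))

-- (Bin visits, Tip visits) for the simple fold.
visits :: Set a -> (Int, Int)
visits Tip = (0, 1)
visits (Bin _ _ l r) =
  let (bl, tl) = visits l
      (br, tr) = visits r
  in (1 + bl + br, tl + tr)

main :: IO ()
main = print (visits (perfect 10))  -- (1023, 1024)
```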
We definitely cannot conclude that one implementation is always better.

Benchmarks

I benchmarked the implementations on a simple fold (equivalent to computing the size):

```haskell
simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
```

Timings on my machine with GHC 9.10.1:
Full source code

```haskell
{- cabal:
build-depends: base, deepseq, tasty-bench
-}
{-# LANGUAGE BangPatterns, MagicHash #-}
{-# OPTIONS_GHC -O2 -fproc-alignment=64 #-}
{-# OPTIONS_GHC -ddump-simpl -ddump-cmm -ddump-to-file #-}

import Control.DeepSeq
import System.IO
import Data.Bits
import Test.Tasty.Bench
import GHC.Exts

main :: IO ()
main = do
  defaultMain
    [ bgroup ("depth" <> show d)
      [ (env (pure (make1 d))) $ \t -> bgroup "case1"
        [ bench "simple" $ whnf simple' t
        , bench "notip" $ whnf notip' t
        , bench "bin1" $ whnf bin1' t
        ]
      , (env (pure (make2 d))) $ \t -> bgroup "case2"
        [ bench "simple" $ whnf simple' t
        , bench "notip" $ whnf notip' t
        , bench "bin1" $ whnf bin1' t
        ]
      ]
    | d <- [6, 12, 18]
    ]

---------------

data Set a = Bin !Int !a !(Set a) !(Set a) | Tip deriving Show

instance NFData a => NFData (Set a) where
  rnf = rwhnf -- good enough for Set Int

make1 :: Int -> Set Int
make1 d = go 0 d
  where
    -- d >= 1
    go !off 1 = Bin 1 off Tip Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-1) - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz

make2 :: Int -> Set Int
make2 d = go 0 d
  where
    -- d >= 2
    go !off 2 = Bin 2 (off+1) (Bin 1 off Tip Tip) Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-3) * 3 - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz

---------------

wrap :: (Set a -> Int#) -> Set a -> ()
wrap f s = case f s of _ -> ()

simple' = wrap simple

simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip' = wrap notip

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1' = wrap bin1

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
```

The results don't disagree with the analysis.

I haven't yet thought about what we should do based on this information, but I hope the analysis and benchmarks are reasonable.
You need to add an UNPACK pragma:

```haskell
data Set a = Bin {-# UNPACK #-} !Int !a !(Set a) !(Set a) | Tip deriving Show
```

Edit: removed incorrect numbers.
I'm definitely in agreement with the images; I've considered the same things. With regards to benchmarking, I wrote up different versions of foldMap for `Map`. Without an UNPACK pragma, there's an extra pointer chase. (I think the last line in the table should be 4n-1, n, 3n-1, n.)
It's not necessary, because GHC's `-funbox-small-strict-fields` (on by default) already unpacks the strict `Int` field.
True, fixed!
But there are no foldMap benchmarks in https://github.com/haskell/containers/blob/master/containers-tests/benchmarks/Map.hs? Regarding your numbers, the allocations metric is surprising. There should be zero allocations, which I can confirm when I run my file above.
Oh yeah, I added one for testing; I just imported Data.Monoid and used Sum. Hmm, my numbers above weren't run with all optimisations on.
Checking if the size is 1 takes one comparison and one conditional branch. Checking if both children are `Tip` takes two of each.
Hmm. I'm getting significant differences with different versions of GHC. I'm going to take some time putting together some more numbers (and maybe try to stop doing them on a laptop). For posterity:

```haskell
defaultMain
  [ bench "foldMap" $ nf (foldMap Sum) m
  , bench "foldMapWithKey" $ nf (M.foldMapWithKey (const Sum)) m
  ]
```
So I performed my initial analysis using GHC 8.10.7, where the bin 1 trick was the fastest:

GHC 8.10.7

But, in the 9 series, it looks like it has reversed (and the naïve case has improved too; it's closer to the bin 1 check case):

GHC 9.10.1
This is using the benchmark suite with the following additions:

```haskell
foldMapWithKey :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKey f = go
  where
    go Tip = mempty
    go (Bin 1 k v _ _) = f k v
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKey #-}

foldMapWithKeySimple :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeySimple f = go
  where
    go Tip = mempty
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKeySimple #-}

foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip) -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr) -> f k x <> go rk rx rl rr
      (Tip, Tip) -> f k x
{-# INLINE foldMapWithKeyNoTip #-}
```

These were benchmarked with:

```haskell
defaultMain
  [ bench "foldMapWithKeySimple" $ nf (M.foldMapWithKeySimple (const Sum)) m
  , bench "foldMapWithKeyBin1" $ nf (M.foldMapWithKey (const Sum)) m
  , bench "foldMapWithKeyNoTip" $ nf (M.foldMapWithKeyNoTip (const Sum)) m
  ]
```
I'll redo the filter, map, mapMaybe benchmarks on a later GHC and report back later.
These are my numbers with GHC 9.10. The "simple" cases are those currently in master, which don't use the bin 1 optimisation. What do you think?
|
Which is the "no tip" version?
No Tip is Soumik's alternate traversal above:

```haskell
foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip) -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr) -> f k x <> go rk rx rl rr
      (Tip, Tip) -> f k x
{-# INLINE foldMapWithKeyNoTip #-}
```

Prior to GHC 9.4, the Bin 1 technique is faster for `foldMap`. It seems to me that GHC 9.4's improvements to pointer tagging (tag inference in particular) make a huge improvement in this case.

GHC 9.2.4

GHC 9.4.7

It seems plausible that rethinking more traversals along these lines could work well, but they are more complex, and my attempts have not been fruitful. So I think just taking the 25% improvements with the Bin 1 trick is reasonable. In terms of this PR, all functions which are touched are improvements (and it doesn't touch `foldMap`).
I did some analysis of the maps used in the benchmark suite. Because they use the optimised ascending functions, they produce extremely balanced trees: only 1 node has a single child and 50% match the Bin 1 condition. I'm going to test these metrics against shuffled inputs. But it might be all bullshit (including the initial use of it?). I'm sorry guys, this has been a real roundabout.
More experimentation is good! Don't worry about it.
Sounds good. I think we should update our Set/Map benchmarks to run on randomly structured trees instead of the very specific trees used now.
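As a starting point, here is a hypothetical sketch of what such a measurement could look like: it builds a tree by naive (unbalanced) insertion of pseudo-random keys and counts how many nodes satisfy the size-1 condition. The `insert` and `lcg` helpers are stand-ins so the sketch needs no extra packages; containers' real insert rebalances, so the ratios would differ there.

```haskell
-- Sketch: how often does the Bin 1 case fire in a randomly built tree?
data Set = Bin !Int !Int !Set !Set | Tip

size :: Set -> Int
size Tip = 0
size (Bin sz _ _ _) = sz

-- Naive, unbalanced BST insert that maintains the size field.
insert :: Int -> Set -> Set
insert x Tip = Bin 1 x Tip Tip
insert x t@(Bin _ y l r)
  | x < y     = rebuild y (insert x l) r
  | x > y     = rebuild y l (insert x r)
  | otherwise = t
  where rebuild k l' r' = Bin (1 + size l' + size r') k l' r'

-- Nodes whose size field is 1, i.e. both children are Tip.
bin1Count :: Set -> Int
bin1Count Tip = 0
bin1Count (Bin 1 _ _ _) = 1
bin1Count (Bin _ _ l r) = bin1Count l + bin1Count r

-- Tiny hand-rolled linear congruential generator for key order.
lcg :: Int -> Int
lcg s = (6364136223846793005 * s + 1442695040888963407) `mod` (2 ^ 62)

main :: IO ()
main = do
  let t = foldr insert Tip (take 1000 (iterate lcg 42))
  putStrLn ("nodes: " ++ show (size t) ++ ", size-1 nodes: " ++ show (bin1Count t))
```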
This change incorporates the idea already used for `foldMap` and `traverse`, where we pattern match on the size of "leaf nodes". If the size is 1, we already know both sides are tips, and we don't need to recurse.

Of particular interest is `mapMaybe`, where we can also omit the extra function calls to `link`.

Benchmarks show improvements around 10 to 20% for most operations affected, though `foldl'` seems closer to 30%. Some (very) cursory analysis says that about 1/3 of Bin nodes in a tree should have no children, hence changes up to 30% do seem plausible. Pattern matching on a strict and unboxed `Int` should be very fast.
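The trick described above can be sketched as follows on a strict left fold. This is an illustrative standalone version, not the PR's code: the local `Set` type mirrors containers' internal one, and `foldlBin1'` and `fromAsc` are hypothetical names.

```haskell
{-# LANGUAGE BangPatterns #-}
-- Sketch of the Bin 1 trick: when the size field is 1, both children
-- must be Tip, so we can return without the two recursive calls.
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

foldlBin1' :: (b -> a -> b) -> b -> Set a -> b
foldlBin1' f = go
  where
    go !z Tip           = z
    go !z (Bin 1 x _ _) = f z x                 -- size 1: skip both children
    go !z (Bin _ x l r) = go (f (go z l) x) r   -- in-order left fold

-- Naive right-spine construction, good enough to exercise the fold.
fromAsc :: [a] -> Set a
fromAsc = foldr cons Tip
  where
    cons x t = Bin (1 + size t) x Tip t
    size Tip = 0
    size (Bin s _ _ _) = s

main :: IO ()
main = print (foldlBin1' (+) 0 (fromAsc [1 .. 100 :: Int]))  -- prints 5050
```

The size-1 clause fires once per childless node, which in balanced trees is a large fraction of all nodes, matching the rough 1/3 estimate above.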