Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup after IntMap rewrite #698

Open
8 of 26 tasks
treeowl opened this issue Dec 30, 2019 · 49 comments
Open
8 of 26 tasks

Cleanup after IntMap rewrite #698

treeowl opened this issue Dec 30, 2019 · 49 comments
Labels

Comments

@treeowl
Copy link
Contributor

treeowl commented Dec 30, 2019

This ticket summarizes things that should be done either before or soon after the enormous IntMap replacement in #340.

Before merge

  • Reinstate the export list in Data.IntMap.Internal.

  • Add an export list to Data.IntMap.Merge.Internal.

  • Diff Haddocks to make sure no functions or instances have gone missing.

  • Generate a final benchmark comparison. I prefer to see these interleaved or side by side rather than one whole set after another, but I won't insist on it. This should go in the commit message for the final squashed commit.

  • Try a bit harder to optimize union.

  • See what non-merge functions can be defined in terms of merge tactics and runMissingAll. (More would require a restructuring of the module boundaries.)

  • Improve test coverage, especially of merge tactics and isSubmapOf. (Side question: should we bump up the number of test cases Travis asks for?)

  • Add internal documentation for merges and deletion.

  • Reinstate rewrite rules and the phased INLINE and NOINLINE directives needed to make them work. Many of them (e.g., map fusion rules and map/coerce) are unconditional improvements. Others (relating to list conversions) are likely desirable.

  • Check that GHC can use the new IntMap without (undue) performance degradation. (@sjakobi has offered to trigger the build jobs for this.)

  • See if anything can be done about the object code duplication resulting from the alternating levels. Is there some clever way to make them uniform enough to share code? (Sadly, the answer appears to be no, for reasons that are not entirely obvious. Flipping over to the complement of the key each time we go left seems to hurt lookups significantly, although not enormously.)

After merge

  • See if anything can be done about the source code duplication resulting from the alternating levels. Is there some clever way to make them uniform enough to share code?

  • Rewrite IntSet to match. Hopefully this is fairly mechanical, replacing values by bitmaps. Once this is done, get rid of bit twiddling utilities that are no longer used.

  • Pick up Speed up fromList for IntMap #653 and see how well it works with the new representation.

  • Optimize merge.

  • Resolve the strictness guarantees of fromAscList and friends. See Settle on a consistent strictness for fromList and fromAscList #473 and Rewrite Data.IntMap to be faster and use less memory #340 (comment).

  • Consider using more type wrappers for safety.

  • Evaluate using unboxed sums for intermediates. These are not always a win, but sometimes are.

  • Expand the benchmarks to cover more functions. Check for major regressions that the current benchmarks missed.

  • Consider adding as permanent APIs things that came up in the implementation:

    • runMissingAll :: WhenMissing f a b -> IntMap a -> f (IntMap b), plus maybe a pure variant. (DF: I'm not so convinced yet. What's it for?)
    • traverseMaybeWithKey :: (Key -> a -> f (Maybe b)) -> IntMap a -> f (IntMap b) (DF: yes, we most definitely want this by some name or other.)
    • dropMatched :: WhenMatched f a b c
    • firstMatched :: WhenMatched f a b a (and secondMatched :: WhenMatched f a b b)
    • fold variants that can ignore or be otherwise loose with ordering. (see Cleanup after IntMap rewrite #698 (comment), but more generally)

Uncategorized

  • Check whether the restructuring in Improve fromAscList and friends. #658 is a good idea for the new representation and, if so, reinstate it.

  • Continue to add internal documentation. For example, isSubmapOf and the like are pretty confusing.

  • Shorten long lines when reasonable. We generally aim for 80 characters or fewer. There's flexibility for unusual situations, but this code has some very long lines that could be broken cleanly.

  • Use liftA* in mergeA.

  • Evaluate whether we can get the same performance using foldMapDefault, fmapDefault, and equivalent tricks for keyed versions. That would cut down on source code.

  • Evaluate whether we should implement alter "by hand" as we used to. If not, then we should probably implement it using alterF @Identity.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

@gereeter, @sjakobi, please let me know what should we added and what, in your opinions, should be done before and after the merge.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

We may want to revisit the recent decision to make minimum and maximum traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified. Alternatively, I guess we can probably be very clever and keep track of a minimum from the left (updated only when the next element is smaller) and a minimum from the right (updated when the previous element is not larger), and pick whichever is smallest in the end, preferring the one from the left in a tie.

product should definitely preserve the order of the elements in case the Num instance is a non-commutative ring. sum should probably preserve the order of the elements in case someone does something a bit weird, but I could be convinced to relax that.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

I keep wondering whether making the tree go entirely in order is really the right way. The obvious alternative is for the order to alternate from level to level, so that keys go to (say) the left whenever they're closer to the bound stored in the current node, whether that's a lower bound or an upper bound.

That would (potentially) go along with alternating whether to flip the sign of the bound when storing it, and flipping the sign of the key on each step down. I don't know if that would actually work or not, but it strikes me as plausible.

Edit: the sign flipping would be done using complement, because -minBound = minBound. The idea is that k < max is the same as complement max < complement k.

@sjakobi
Copy link
Member

sjakobi commented Dec 30, 2019

We may want to revisit the recent decision to make minimum and maximum traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified.

My understanding is that the new IntMap doesn't suffer from the weird order reversal for negative keys that the old one had. The new haddocks on IntMap_ explain it quite clearly IMHO.

In consequence I believe that we don't lose anything by keeping minimum and maximum in-order.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

We may want to revisit the recent decision to make minimum and maximum traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified.

My understanding is that the new IntMap doesn't suffer from the weird order reversal for negative keys that the old one had. The new haddocks on IntMap_ explain it quite clearly IMHO.

In consequence I believe that we don't lose anything by keeping minimum and maximum in-order.

We don't get the weird order reversal of children, but we now have values in internal nodes, and their positions in the order depend on their level. To go strictly in order from a node whose bound is a maximum, we have to push the right subtree and the bound's associated value onto the stack before descending the left subtree. I don't know for sure if we can improve matters or not, but we might save a little by dealing with those values "as they come" to the extent we can. That surely applies to minimum and maximum, and I imagine it applies to sum and product too if we're careful....

@gereeter
Copy link

Generate a final benchmark comparison. I prefer to see these interleaved

My latest commits that have benchmark data just used the output of bench-cmp.pl. Is that fine?

Rewrite IntSet to match.

Tricky: I actually did this back when the work was out-of-tree, and although I'm sure with time (and newer GHC) one could press the optimizations more and make different design choices, it was slower than stock IntSet. Probably worth trying again, but there is a large space of choices to try out:

  • When a min or max shows up higher in the tree, should it be repeated in the bitmaps?
  • Should the bitmaps be aligned, representing series of integers with the same bit prefix, or should they start at the min/max for the node to not have waste zeros?
  • Is a singleton node represented as itself with two empty children or as a one-hot bitmap?
  • etc.

Check whether the restructuring in #658 is a good idea

And #653. I was putting it off as non-essential, but especially since fromList is one of only two regressions according to my up-to-date benchmarks (oddly, not insert even though fromList is just a loop over insert), I was keeping it in mind. But really, this can be post-merge.

Continue to add internal documentation. For example, isSubmapOf

The two big documentation blocks I was working to get done before merging were about deletions and merges, and isSubmapOf is basically a merge. Specific applications to specific functions can maybe wait until after merge, but those should probably be pre-merge.

We generally aim for 80 characters or fewer.

Aw, darn. I was wrapping my documentation at 100 characters. That at least is easy to fix. While there are definitely easy-to-wrap lines, I'm not so sure about some of the worst offenders. Maybe make an effort pre-merge but don't worry about it as a blocker?

Evaluate whether we should implement alter "by hand" as we used to. If not, then we should probably implement it using alterF @Identity.

I vaguely remember finding the trivial two-pass algorithm surprisingly faster that a specialized implementation? But it has been a while, and my memory is fuzzy. This will probably be pre-merge just because I'm curious now.

@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.

Jonathan "gereeter" S.

@gereeter, @sjakobi, please let me know what should [be] added

Pre-merge:

  • Make a few more attempts to optimize union. I have a few ideas and experiments to run, but it has the worst regressions of the bunch: on my machine (Linux, GHC 8.8.1) it gets near 50% slower on the worst cases (although it does get faster on most cases).
  • See what non-merge functions can be defined in terms of merge tactics and runMissingAll. As I mentioned in Rewrite Data.IntMap to be faster and use less memory #340 (comment), the merge tactics can't use the main functions since they need access to all parts of the mutual recursion, but it is fairly easy to extract the top-level part of the WhenMissing bundle.
  • Get near-100% test coverage for non-trivial functions. We're almost there: I tested this at @sjakobi's suggestion, and it looked like the main offenders left are merge tactics. (We may also want to see if isSubmapOf can be more thoroughly tested: the standard 100 tests don't get close to hitting all the cases in a function that is designed to short-circuit everywhere.)

Post-merge:

  • Speed up fromList for IntMap #653 or equivalent.
  • Optimize merge. It can be a lot better than mergeA due to being pure (well, the optimizations probably would apply to commutative monads, too). See how different the implementation of mergeA and union is, for example. Post-merge since it is a fairly niche function, works, and isn't used to implement union, difference, and intersection yet.
  • Resolve the strictness guarantees of fromAscList and friends. See Rewrite Data.IntMap to be faster and use less memory #340 (comment).
  • Consider using more type wrappers for more safety. For example, I took the Bound idea from Rewrite Data.IntMap to be faster and use less memory #340 (comment), but didn't implement Carry. As I said, however, I don't think that it helps that much. I'd love to see some sort of linearity analysis instead, but that is way more complicated.
  • Evaluate using unboxed sums for intermediates.
  • Expand the benchmarks to cover more functions. I wouldn't be surprised if there are more regressions that are fairly easy to fix just because I didn't bother optimizing the function since it didn't get benchmarked.
  • Consider adding as permanent APIs things that came up in the implementation:
    • runMissingAll :: WhenMissing f a b -> IntMap a -> f (IntMap b), plus maybe a pure variant
    • traverseMaybeWithKey :: (Key -> a -> f (Maybe b)) -> IntMap a -> f (IntMap b)

@gereeter
Copy link

I don't know for sure if we can improve matters or not, but we might save a little by dealing with those values "as they come" to the extent we can.

That's why I implemented elem, although I suspect the win is greater in a case like that since you can terminate early as opposed to just having less on the stack. It is also kind of the reason why merge can be so much better than mergeA.

@sjakobi
Copy link
Member

sjakobi commented Dec 30, 2019

Additions for "before merge":

  • Check that GHC can use the new IntMap without (undue) performance degradation. (I can trigger the build jobs for this)

Can someone estimate what kind of speedup we can expect from this? If it's, say, < 20%, I'd say it can wait.

  • Continue to add internal documentation. For example, isSubmapOf and the like are pretty confusing.

+0.5 for adding documentation before the merge – just in case @gereeter gets hit by a bus.

  • Shorten long lines when reasonable. We generally aim for 80 characters or fewer. There's flexibility for unusual situations, but this code has some very long lines that could be broken cleanly.

+0.1 for doing this before the merge in order to reduce noise in git blame. Not a blocker IMO.

  • Use liftA* in mergeA.

+0.1 as above

  • Evaluate whether we can get the same performance using foldMapDefault, fmapDefault, and equivalent tricks for keyed versions. That would cut down on source code.

This can easily be delayed IMHO

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

Generate a final benchmark comparison. I prefer to see these interleaved

My latest commits that have benchmark data just used the output of bench-cmp.pl. Is that fine?

Yeah, that's fine.

Rewrite IntSet to match.

Tricky: I actually did this back when the work was out-of-tree, and although I'm sure with time (and newer GHC) one could press the optimizations more and make different design choices, it was slower than stock IntSet. Probably worth trying again, but there is a large space of choices to try out:

  • When a min or max shows up higher in the tree, should it be repeated in the bitmaps?
  • Should the bitmaps be aligned, representing series of integers with the same bit prefix, or should they start at the min/max for the node to not have waste zeros?
  • Is a singleton node represented as itself with two empty children or as a one-hot bitmap?
  • etc.

The most conservative approach, I think, is to match IntMap very directly, storing aligned bitmaps where IntMap stores values and using masked elements for keys. Did you try that version?

Check whether the restructuring in #658 is a good idea

And #653. I was putting it off as non-essential, but especially since fromList is one of only two regressions according to my up-to-date benchmarks (oddly, not insert even though fromList is just a loop over insert), I was keeping it in mind. But really, this can be post-merge.

#653 isn't merged yet because it's not a pure win and I froze up trying to make a decision. Your fromList regression is rather surprising!

Continue to add internal documentation. For example, isSubmapOf

The two big documentation blocks I was working to get done before merging were about deletions and merges, and isSubmapOf is basically a merge. Specific applications to specific functions can maybe wait until after merge, but those should probably be pre-merge.

You've definitely gone a long way toward getting documentation in shape. Thanks.

We generally aim for 80 characters or fewer.

Aw, darn. I was wrapping my documentation at 100 characters. That at least is easy to fix. While there are definitely easy-to-wrap lines, I'm not so sure about some of the worst offenders. Maybe make an effort pre-merge but don't worry about it as a blocker?

Hard-to-wrap lines are definitely a low priority.

Evaluate whether we should implement alter "by hand" as we used to. If not, then we should probably implement it using alterF @Identity.

I vaguely remember finding the trivial two-pass algorithm surprisingly faster that a specialized implementation? But it has been a while, and my memory is fuzzy. This will probably be pre-merge just because I'm curious now.

The generic version will always be better in the "key absent, don't insert" case, because we don't have anything like a fast setjmp/longjmp. In other cases, specialized versions will likely be better.

@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.

Jonathan "gereeter" S.

@gereeter, @sjakobi, please let me know what should [be] added

I'll add those.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

Additions for "before merge":

  • Check that GHC can use the new IntMap without (undue) performance degradation. (I can trigger the build jobs for this)

Thanks.

Can someone estimate what kind of speedup we can expect from this? If it's, say, < 20%, I'd say it can wait.

With the previous representation, the performance improvement was small, but the code was (in my opinion) much easier to understand.

  • Evaluate whether we can get the same performance using foldMapDefault, fmapDefault, and equivalent tricks for keyed versions. That would cut down on source code.

This can easily be delayed IMHO

Of course it can, but it shouldn't take long to check and reducing the enormous quantity of source code would be very nice.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

What if the stored bounds in nodes are all minima, and the passed in ones are all maxima, with complements being taken (of bounds and keys) as necessary to maintain this? Ignoring the top level for a moment, lookup would go like this:

We have the stored minimum, the passed maximum, and the key. We xor the key with each and compare, choosing the left child if closer to the minimum and vice versa. Now in the left child, we would naturally store a new maximum, but instead we store its complement as a new minimum, and pass down the complement of the minimum and of the key. In the right child, we store the new minimum and pass down the key and maximum unchanged.

Would that work, @gereeter? I'd really love to cut down on the code size here, but my intuition for the structure is not yet wonderful.

@gereeter
Copy link

Would that work, @gereeter?

Yes, technically, but I'd expect a large performance hit for needlessly complementing all over the place. And while I have the bias of understanding what is going on now, I think it would be a lot more confusing, too.

The biggest thing to do for cutting down on code size is unifying functions that do similar things (Lazy and Strict, variants of one function, etc.). With judicious INLINE, this doesn't affect things at all. If we need to deduplicate code for different types of nodes, the type class approach seems much more appropriate (with the two-parameter hack to avoid type families), and performance can be kept the same with a bunch of SPECIALIZE pragmas.

(On that note: local functions currently don't have type signatures for the most part because of the lack of ScopedTypeVariables. Is that something we can get away with using? I see that it is gated in Data.Sequence.Internal.)

my intuition for the structure is not yet wonderful.

Anything in particular? Any confusion is an opportunity for better documentation in my mind.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

Would that work, @gereeter?

Yes, technically, but I'd expect a large performance hit for needlessly complementing all over the place.

I'd be surprised. Basic operations like complement are usually extremely cheap; I'd expect their cost to be totally trivial compared to chasing pointers.

And while I have the bias of understanding what is going on now, I think it would be a lot more confusing, too.

That seems a much more likely problem. But the source duplication of the current scheme is also a potential source of errors and confusion. It seems to take some time to figure out what flops and how.

The biggest thing to do for cutting down on code size is unifying functions that do similar things (Lazy and Strict, variants of one function, etc.). With judicious INLINE, this doesn't affect things at all. If we need to deduplicate code for different types of nodes, the type class approach seems much more appropriate (with the two-parameter hack to avoid type families), and performance can be kept the same with a bunch of SPECIALIZE pragmas.

That certainly seems a promising approach to the source duplication, though it does nothing for object code size.

(On that note: local functions currently don't have type signatures for the most part because of the lack of ScopedTypeVariables. Is that something we can get away with using? I see that it is gated in Data.Sequence.Internal.)

Let's keep those gated for now … the current approach to scoped type variables is controversial, to say the least, with both Eisenberg and Kmett preferring a different sort. Again, a macro in containers.h is the way to do it.

my intuition for the structure is not yet wonderful.

Anything in particular? Any confusion is an opportunity for better documentation in my mind.

I think it could be helpful to have a few diagrams demonstrating basic operations, but those can be tough in text. Feel free to include something separate in HTML+SVG, LaTeX, or whatever, as long as it's not too big. But that can certainly wait till after the merge.

@gereeter
Copy link

Let's keep those gated for now

Ah, that could be a problem for reducing duplication. Without type signatures, GHC won't infer maximally polymorphic types for local definitions and we can't use SPECIALIZE pragmas, at least as far as I've figured out.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

You may need to tie it to the type family usage to avoid a nasty NoMonoLocalBinds. And you don't need SPECIALIZE in the fallback.

@gereeter
Copy link

Other random API "omission": isSubmapOfWithKey :: (Key -> a -> b -> Bool) -> IntMap a -> IntMap b -> Bool. I don't see a use for it, but it would match all the other WithKey functions. It is possible to emulate with mergeA and a constant Applicative, but as with a lot of cases like that, you want to short-circuit based on the uppermost values.

@treeowl
Copy link
Contributor Author

treeowl commented Dec 30, 2019

Other random API "omission": isSubmapOfWithKey :: (Key -> a -> b -> Bool) -> IntMap a -> IntMap b -> Bool. I don't see a use for it, but it would match all the other WithKey functions. It is possible to emulate with mergeA and a constant Applicative, but as with a lot of cases like that, you want to short-circuit based on the uppermost values.

Yes, we should consider that, ideally with input from the libraries list. I wouldn't call it a priority myself.

@treeowl
Copy link
Contributor Author

treeowl commented Jan 2, 2020

I'd like to clarify something: the reason I care about object code size is not that it takes up space on disk or in RAM, but that it takes up space in cache and involves more code in each operation. This can have important performance consequences. I played around a bit and got insert and lookup working with the alternative approach. I made a brief unsuccessful attempt at delete; I'll try again when I get a chance. With this approach, using a type family introduces constraints of the form t ~ Comp (Comp t) all over the place, so I used the two-variable approach instead.

-- @P C@ tags a value as representing itself;
-- @C P@ tags a value as representing its complement.
data P
data C

-- A key tagged to indicate whether it represents itself
-- or its complement.
newtype TKey t u = TKey {getKey :: Key}
  deriving (Eq, Ord)

-- Each node stores the minimum value of that subtree. That may actually be
-- a *maximum* when considering what's represented.
newtype IntMap a = IntMap (IntMap_ P C a) deriving (Eq, Show)
data IntMap_ t u a
  = NonEmpty {-# UNPACK #-} !(TKey t u) a !(Node u t a) | Empty deriving (Eq)
data Node t u a = Bin {-# UNPACK #-} !(TKey t u) a !(Node u t a) !(Node t u a) | Tip deriving (Eq)

deriving instance Show a => Show (IntMap_ P C a)
deriving instance Show a => Show (IntMap_ C P a)
deriving instance Show a => Show (Node P C a)
deriving instance Show a => Show (Node C P a)

-- This type is just used for convenience when defining the Show instances for TKey.
data KeyRep = PKey !Key | CKey !Key deriving Show
instance Show (TKey P C) where
  showsPrec p (TKey k) = showsPrec p (PKey k)
instance Show (TKey C P) where
  showsPrec p (TKey k) = showsPrec p (CKey (complement k))

-- | Take the complement of a key
compKey :: TKey t u -> TKey u t
compKey (TKey k) = TKey (complement k)

xor :: TKey t u -> TKey t u -> Word
xor (TKey k) (TKey b) = fromIntegral $ Bits.xor k b

lookup :: forall a. Key -> IntMap a -> Maybe a
lookup k0 (IntMap m0) = start (TKey k0) m0
  where
    start :: TKey t u -> IntMap_ t u a -> Maybe a
    start !k Empty = Nothing
    start k (NonEmpty min minV node)
        | k < min = Nothing
        | k == min = Just minV
        | otherwise = go (compKey k) (xor k min) node

    go :: TKey t u -> Word -> Node t u a -> Maybe a
    go !k !_ Tip = Nothing
    go k xorCache (Bin min minV l r)
        | k > min = if xorCache < xorCacheMin
                    then go k xorCache r
                    else go (compKey k) xorCacheMin l
        | k < min = Nothing
        | otherwise = Just minV
      where xorCacheMin = xor k min

insert :: forall a. Key -> a -> IntMap a -> IntMap a
insert k0 v (IntMap m0) = IntMap $ start (TKey k0) m0
  where
    start :: TKey t u -> IntMap_ t u a -> IntMap_ t u a
    start k Empty = NonEmpty k v Tip
    start k (NonEmpty minK minV root)
      | k == minK = NonEmpty k v root
      | k > minK = NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) root
      | otherwise = NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root

    go :: Word -> TKey t u -> TKey t u -> Node t u a -> Node t u a
    go !xorCache !maxK !k Tip = Bin k v Tip Tip
    go xorCache maxK k (Bin minK minV l r)
      | minK < k =
         if xorCache < xorCacheMin
         then Bin minK minV l (go xorCache maxK k r)
         else Bin minK minV (go xorCacheMin (compKey minK) (compKey k) l) r
      | k < minK =
         if xor minK maxK < xorCacheMin
         then Bin k v Tip (Bin minK minV l r)
         else Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r
      | otherwise = Bin minK v l r
     where
        xorCacheMin :: Word
        xorCacheMin = xor k minK

insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a
insertMaxN !xorcache k v Tip = Bin k v Tip Tip
insertMaxN xorcache k v (Bin minK minV l r)
  | xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip
  | otherwise = Bin minK minV l (insertMaxN xorcache k v r)

The traverseWithKey function I wrote still uses alternating helpers. I think it would be possible to avoid this with an extra conditional; I don't know if that would hurt performance.

traverseWithKey
  :: forall f a b. Applicative f
  => (Key -> a -> f b) -> IntMap a -> f (IntMap b)
traverseWithKey _ (IntMap Empty) = pure empty
traverseWithKey f (IntMap (NonEmpty k v root)) =
  liftA2 (\v' root' -> IntMap (NonEmpty k v' root')) (f (getKey k) v) (goC root)
  where
    goC :: Node C P a -> f (Node C P b)
    goC Tip = pure Tip
    goC (Bin k v l r) = liftA3 (\r' l' v' -> Bin k v' l' r') (goC r) (goP l) (f (complement (getKey k)) v)

    goP :: Node P C a -> f (Node P C b)
    goP Tip = pure Tip
    goP (Bin k v l r) = liftA3 (Bin k) (f (getKey k) v) (goC l) (goP r)

@treeowl
Copy link
Contributor Author

treeowl commented Jan 3, 2020

I can't figure out what LLVM is doing with this code.... I wonder if it actually expands it to yours, suggesting maybe I'm wasting my time. But benchmarking will tell, some day, maybe.

@sjakobi
Copy link
Member

sjakobi commented Jan 12, 2020

@treeowl Could you explain a bit how you can use complement to track minima and maxima?

Also, have you actually tested your code? I'm just stumped how it could possibly work.

@treeowl
Copy link
Contributor Author

treeowl commented Jan 12, 2020

@sjakobi insert and lookup are well tested. I'll upload my validity checker later. I still don't understand how deletion works in @gereeter's code, so I couldn't get that working. I don't use complement to track minima and maxima. It's a trick that means we don't have to track them in all the same cases, and I think we need to do so in fewer overall, but that could be wrong.

The general gist is that we always store the minimum in each node, but that may ultimately represent its complement, depending on how many left branches we've taken (with the initial jump down considered a left branch).

@treeowl
Copy link
Contributor Author

treeowl commented Jan 14, 2020

@sjakobi, here's my draft validity tester:

valid :: (Eq a, Show a) => IntMap a -> Property
valid m0@(IntMap m0_) =
   -- This will catch errors very reliably, but it's coarse-grained
   fromList (toList m0) === m0 .&&.
   -- This catches errors in a fine-grained way, but relies on being implemented
   -- right.
   start m0_
  where
    start :: IntMap_ P C a -> Property
    start Empty = property ()
    start (NonEmpty _minK _minV Tip) = property ()
    start (NonEmpty minK _minV (Bin maxK _maxV l r)) =
      go maxK (compKey minK) r .&&.
      go minK (compKey maxK) l

    go :: TKey t u -> TKey t u -> Node t u a -> Property
    go xmin xmax _
      | xmin >= xmax = counterexample "max not greater" False
    go _ _ Tip = property ()
    go xmin xmax (Bin minK _minV l r) =
      xor xmin minK > xor xmax minK .&&.
      counterexample "not between" (xmin < minK .&&. minK < xmax) .&&.
        go (compKey xmax) (compKey minK) l .&&.
        go minK xmax r

@treeowl
Copy link
Contributor Author

treeowl commented Jan 14, 2020

Thanks to @gereeter's recent cleanups and explanations, I was finally able to adapt delete to the complementy version and get it to pass the tests.

Complementy version of compareMinBound:

{-# INLINE compareMinBound #-}
compareMinBound :: TKey t u -> TKey t u -> BoundOrdering
compareMinBound k min
    | k > min = InBound
    | k < min = OutOfBound
    | otherwise = Matched

data BoundOrdering = InBound | OutOfBound | Matched deriving (Eq)

Complementy deletion:

delete :: forall a. Key -> IntMap a -> IntMap a
delete !_ (IntMap Empty) = IntMap Empty
delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of
    InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root))
    OutOfBound -> m
    Matched -> IntMap (nodeToCompMap root)
  where k = TKey k0

deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a
deleteNode !_ !_ Tip = Tip
deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of
    InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r)
            | otherwise              -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r
    OutOfBound -> n
    Matched -> extractBinR l r
  where xorCacheMin = xor k min

extractBinR :: Node u t a -> Node t u a -> Node t u a
extractBinR Tip r = r
extractBinR (Bin max maxV innerL innerR) r =
    let NE min minV l = deleteMaxNode max maxV innerL innerR
    in Bin (compKey min) minV l r

deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a
deleteMaxNode !min minV Tip Tip = NE min minV Tip
deleteMaxNode !min minV (Bin max maxV l r) Tip = NE (compKey max) maxV (Bin min minV r l)
deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) =
    let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR
    in  NE max maxV (Bin min minV l inner)

nodeToCompMap :: Node t u a -> IntMap_ u t a
nodeToCompMap Tip = Empty
nodeToCompMap (Bin min minV innerL innerR) =
    let NE max maxV r = deleteMaxNode min minV innerL innerR
    in NonEmpty (compKey max) maxV r

data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a)

That's just about half as much code.

@gereeter
Copy link

Quick benchmarks of the complementing code compared to not:

Benchmark     Runtime change  Original runtime
lookup hit     +13.45%        2.16e-04
lookup miss    +15.59%        2.28e-04
insert empty   +22.15%        2.39e-04
delete hit     +35.74%        1.87e-04
delete miss     +4.74%        7.49e-04
fromList        -1.39%        2.96e-04
                              
Minimum         -1.39%        
Average        +14.43%        
Maximum        +35.74%

I just copied your code into a project, added fromList, and pruned the benchmarks that weren't implemented yet. I didn't make any huge effort to remove external interference, and of course things improve over time, so take these numbers with a bucket of salt.

@sjakobi
Copy link
Member

sjakobi commented Jan 15, 2020

Quick benchmarks of the complementing code compared to not:

The "original" is your code for #340, right?

@gereeter
Copy link

Yes.

@treeowl
Copy link
Contributor Author

treeowl commented Jan 15, 2020

Do you think you could upload whatever you used to run that benchmark so I can play with it?

@treeowl
Copy link
Contributor Author

treeowl commented Jan 15, 2020

I find the size of these differences you report rather shocking, especially for deletion hit.

@gereeter
Copy link

src/IntMap.hs
{-# LANGUAGE BangPatterns, RankNTypes, StandaloneDeriving, FlexibleInstances, ScopedTypeVariables #-}

module IntMap where

import Data.Bits (complement)
import qualified Data.Bits as Bits
import qualified Data.List as List
import Control.DeepSeq (NFData(..))

type Key = Int

-- @P C@ tags a value as representing itself;
-- @C P@ tags a value as representing its complement.
data P
data C

-- A key tagged to indicate whether it represents itself
-- or its complement.
newtype TKey t u = TKey {getKey :: Key}
  deriving (Eq, Ord)

-- Each node stores the minimum value of that subtree. That may actually be
-- a *maximum* when considering what's represented.
newtype IntMap a = IntMap (IntMap_ P C a) deriving (Eq, Show)
data IntMap_ t u a
  = NonEmpty {-# UNPACK #-} !(TKey t u) a !(Node u t a) | Empty deriving (Eq)
data Node t u a = Bin {-# UNPACK #-} !(TKey t u) a !(Node u t a) !(Node t u a) | Tip deriving (Eq)
data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a)

deriving instance Show a => Show (IntMap_ P C a)
deriving instance Show a => Show (IntMap_ C P a)
deriving instance Show a => Show (Node P C a)
deriving instance Show a => Show (Node C P a)

instance NFData a => NFData (IntMap a) where
    rnf (IntMap m) = rnf m

instance NFData a => NFData (IntMap_ t u a) where
    rnf Empty = ()
    rnf (NonEmpty _ v root) = rnf v `seq` rnf root

instance NFData a => NFData (Node t u a) where
    rnf Tip = ()
    rnf (Bin _ v l r) = rnf v `seq` rnf l `seq` rnf r

-- This type is just used for convenience when defining the Show instances for TKey.
data KeyRep = PKey !Key | CKey !Key deriving Show
instance Show (TKey P C) where
  showsPrec p (TKey k) = showsPrec p (PKey k)
instance Show (TKey C P) where
  showsPrec p (TKey k) = showsPrec p (CKey (complement k))

-- | Take the complement of a key
compKey :: TKey t u -> TKey u t
compKey (TKey k) = TKey (complement k)

xor :: TKey t u -> TKey t u -> Word
xor (TKey k) (TKey b) = fromIntegral $ Bits.xor k b

empty = IntMap Empty

fromList :: [(Key, a)] -> IntMap a
fromList = List.foldl' (flip (uncurry insert)) (IntMap Empty)

fromAscList = fromList
fromDistinctAscList = fromList

lookup :: forall a. Key -> IntMap a -> Maybe a
lookup k0 (IntMap m0) = start (TKey k0) m0
  where
    start :: TKey t u -> IntMap_ t u a -> Maybe a
    start !k Empty = Nothing
    start k (NonEmpty min minV node)
        | k < min = Nothing
        | k == min = Just minV
        | otherwise = go (compKey k) (xor k min) node

    go :: TKey t u -> Word -> Node t u a -> Maybe a
    go !k !_ Tip = Nothing
    go k xorCache (Bin min minV l r)
        | k > min = if xorCache < xorCacheMin
                    then go k xorCache r
                    else go (compKey k) xorCacheMin l
        | k < min = Nothing
        | otherwise = Just minV
      where xorCacheMin = xor k min

insert :: forall a. Key -> a -> IntMap a -> IntMap a
insert k0 v (IntMap m0) = IntMap $ start (TKey k0) m0
  where
    start :: TKey t u -> IntMap_ t u a -> IntMap_ t u a
    start k Empty = NonEmpty k v Tip
    start k (NonEmpty minK minV root)
      | k == minK = NonEmpty k v root
      | k > minK = NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) root
      | otherwise = NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root

    go :: Word -> TKey t u -> TKey t u -> Node t u a -> Node t u a
    go !xorCache !maxK !k Tip = Bin k v Tip Tip
    go xorCache maxK k (Bin minK minV l r)
      | minK < k =
         if xorCache < xorCacheMin
         then Bin minK minV l (go xorCache maxK k r)
         else Bin minK minV (go xorCacheMin (compKey minK) (compKey k) l) r
      | k < minK =
         if xor minK maxK < xorCacheMin
         then Bin k v Tip (Bin minK minV l r)
         else Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r
      | otherwise = Bin minK v l r
     where
        xorCacheMin :: Word
        xorCacheMin = xor k minK

insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a
insertMaxN !xorcache k v Tip = Bin k v Tip Tip
insertMaxN xorcache k v (Bin minK minV l r)
  | xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip
  | otherwise = Bin minK minV l (insertMaxN xorcache k v r)

{-# INLINE compareMinBound #-}
compareMinBound :: TKey t u -> TKey t u -> BoundOrdering
compareMinBound k min
    | k > min = InBound
    | k < min = OutOfBound
    | otherwise = Matched

data BoundOrdering = InBound | OutOfBound | Matched deriving (Eq)

delete :: forall a. Key -> IntMap a -> IntMap a
delete !_ (IntMap Empty) = IntMap Empty
delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of
    InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root))
    OutOfBound -> m
    Matched -> IntMap (nodeToCompMap root)
  where k = TKey k0

deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a
deleteNode !_ !_ Tip = Tip
deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of
    InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r)
            | otherwise              -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r
    OutOfBound -> n
    Matched -> extractBinR l r
  where xorCacheMin = xor k min

extractBinR :: Node u t a -> Node t u a -> Node t u a
extractBinR Tip r = r
extractBinR (Bin max maxV innerL innerR) r =
    let NE min minV l = deleteMaxNode max maxV innerL innerR
    in Bin (compKey min) minV l r

deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a
deleteMaxNode !min minV Tip Tip = NE min minV Tip
deleteMaxNode !min minV (Bin max maxV l r) Tip = NE (compKey max) maxV (Bin min minV r l)
deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) =
    let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR
    in  NE max maxV (Bin min minV l inner)

nodeToCompMap :: Node t u a -> IntMap_ u t a
nodeToCompMap Tip = Empty
nodeToCompMap (Bin min minV innerL innerR) =
    let NE max maxV r = deleteMaxNode min minV innerL innerR
    in NonEmpty (compKey max) maxV r
app/Main.hs
module Main where

import Control.DeepSeq (rnf)
import Control.Exception (evaluate)
import Gauge (bench, defaultMain, whnf)
import Data.List (foldl')
import qualified IntMap as M
import Data.Maybe (fromMaybe)
import Prelude hiding (lookup)

main = do
    let m = M.fromAscList elems :: M.IntMap Int
    evaluate $ rnf [m]
    evaluate $ rnf missKeys
    defaultMain
        [ bench "lookup hit"  $ whnf (lookup keys) m
        , bench "lookup miss" $ whnf (lookup missKeys) m
        , bench "insert empty" $ whnf (ins elems) M.empty
        , bench "delete hit"  $ whnf (del keys) m
        , bench "delete miss" $ whnf (del missKeys) m
        , bench "fromList" $ whnf M.fromList elems
        , bench "fromAscList" $ whnf M.fromAscList elems
        , bench "fromDistinctAscList" $ whnf M.fromDistinctAscList elems
        ]
  where
    elems = zip keys values
    keys = [1,3..2^13]
    missKeys = [0,2..2^13]
    values = [1,3..2^13]
    sum k v1 v2 = k + v1 + v2
    consPairL xs k v = (k, v) : xs
    consPairR k v xs = (k, v) : xs

add3 :: Int -> Int -> Int -> Int
add3 x y z = x + y + z
{-# INLINE add3 #-}

lookup :: [Int] -> M.IntMap Int -> Int
lookup xs m = foldl' (\n k -> fromMaybe n (M.lookup k m)) 0 xs

ins :: [(Int, Int)] -> M.IntMap Int -> M.IntMap Int
ins xs m = foldl' (\m (k, v) -> M.insert k v m) m xs

data PairS a b = PS !a !b

del :: [Int] -> M.IntMap Int -> M.IntMap Int
del xs m = foldl' (\m k -> M.delete k m) m xs

maybeDel :: Int -> Maybe Int
maybeDel n | n `mod` 3 == 0 = Nothing
           | otherwise      = Just n
Bench.cabal
cabal-version: 1.12

name:           Bench
version:        0.1.0.0
build-type:     Simple

library
  exposed-modules:
      IntMap
  hs-source-dirs:
      src
  ghc-options: -O2
  build-depends:
      base >=4.7 && <5
      , deepseq
  default-language: Haskell2010

executable Bench-exe
  main-is: Main.hs
  hs-source-dirs:
      app
  ghc-options: -O2
  build-depends:
      Bench
    , base >=4.7 && <5
    , deepseq
    , gauge
  default-language: Haskell2010
stack.yaml
resolver: ghc-8.8.1

packages:
- .

extra-deps:
- gauge-0.2.5@sha256:8d60450bdec985c146d5632d4f5a8d60cec27d71ba6787ed1ee64d945d4c7c33,3923
- basement-0.0.11@sha256:af43e2e334e515b52ca309919b135c51b5e9411e6d4c68d0e8950d61eb5f25d1,5711
- vector-0.12.0.3@sha256:1422b0bcf4e7675116ca8d9f473bf239850c58c4518a56010e3bfebeac345ace,7171
- primitive-0.7.0.0@sha256:ee352d97cc390d8513530029a2213a95c179a66c406906b18d03702c1d9ab8e5,3416

@treeowl
Copy link
Contributor Author

treeowl commented Jan 15, 2020

Thanks! That's very helpful. I'm going to play with it a bit later. One thing I want to do is see how the native code generator compares to LLVM for both implementations. A very brief look earlier suggested they were doing somewhat different things with the complements, but I don't understand what LLVM was doing.

@treeowl
Copy link
Contributor Author

treeowl commented Jan 19, 2020

I've improved the complementy code somewhat, but I still can't match yours, particularly for lookup. It may just fundamentally be slower. Oh well. The improvements to insertion and deletion, in case someone wants to swing around some day and try some other magic:

insert :: forall a. Key -> a -> IntMap a -> IntMap a
insert k0 v0 (IntMap m0) = IntMap $ start (TKey k0) v0 m0
  where
    start :: TKey t u -> a -> IntMap_ t u a -> IntMap_ t u a
    start !k v Empty = NonEmpty k v Tip
    start !k v (NonEmpty minK minV root) = case compareMinBound k minK of
      Matched -> NonEmpty k v root
      InBound -> NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) v root
      OutOfBound -> NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root

    go :: Word -> TKey t u -> TKey t u -> a -> Node t u a -> Node t u a
    go !_xorCache !_maxK !k v Tip = Bin k v Tip Tip
    go xorCache maxK k v (Bin minK minV l r) = case compareMinBound k minK of
      InBound
         | xorCache < xorCacheMin
         -> Bin minK minV l (go xorCache maxK k v r)
         | otherwise
         -> Bin minK minV (go xorCacheMin (compKey minK) (compKey k) v l) r
      OutOfBound
         | xor minK maxK < xorCacheMin
         -> Bin k v Tip (Bin minK minV l r)
         | otherwise
         -> Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r
      Matched -> Bin minK v l r
     where
        xorCacheMin :: Word
        xorCacheMin = xor k minK

insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a
insertMaxN xorcache k v Tip = Bin k v Tip Tip
insertMaxN xorcache k v (Bin minK minV l r)
  | xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip
  | otherwise = Bin minK minV l (insertMaxN xorcache k v r)

delete :: forall a. Key -> IntMap a -> IntMap a
delete !_ (IntMap Empty) = IntMap Empty
delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of
    InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root))
    OutOfBound -> m
    Matched -> IntMap (nodeToCompMap root)
  where k = TKey k0

deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a
deleteNode !_ !_ Tip = Tip
deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of
    InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r)
            | otherwise              -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r
    OutOfBound -> n
    Matched -> extractBin l r
  where xorCacheMin = xor k min

extractBin :: Node u t a -> Node t u a -> Node t u a
extractBin Tip r = r
extractBin (Bin max maxV innerL innerR) r =
    let NE min minV l = deleteMaxNode max maxV innerL innerR
    in Bin (compKey min) minV l r

deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a
deleteMaxNode !min minV m1 Tip = case m1 of
  Tip -> NE min minV Tip
  Bin max maxV l r -> NE (compKey max) maxV (Bin min minV r l)
deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) =
    let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR
    in  NE max maxV (Bin min minV l inner)

nodeToCompMap :: Node t u a -> IntMap_ u t a
nodeToCompMap Tip = Empty
nodeToCompMap (Bin min minV innerL innerR) =
    let NE max maxV r = deleteMaxNode min minV innerL innerR
    in NonEmpty (compKey max) maxV r

data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a)

@sjakobi
Copy link
Member

sjakobi commented Jan 19, 2020

I wonder whether the smaller, complement-based code could perform better than the code from #340 in "real" applications as opposed to micro-benchmarks. In micro-benchmarks, I'd suspect that the repeated application of the same functions causes even the huge functions from #340 to stay cached. This is pure speculation though, and I'm not very experienced with low-level performance things.

Unfortunately I haven't found any applications that make heavy use of IntMap and would be easier to benchmark and profile than GHC so far. The Clash compiler relies heavily on IntMap but it also depends on the ghc library which makes building it with a modified containers tricky.

Another reason why I've been slightly worried about #340 is that the increased amount of code ultimately also offers more potential for (performance) bugs.

I'm also not a big fan of the tendency to reduce lines of code via extra function arguments that must be inlined by the compiler for proper performance. IMHO that makes the code less readable and more fickle with regards to performance and compiler changes…

@treeowl
Copy link
Contributor Author

treeowl commented Jan 19, 2020

@sjakobi the extra arguments are indeed unpleasant, though I'm not terribly worried about fragility in most cases. I really don't like having any more source code duplication than necessary. What we'd really like is probably something like

lookup# :: Key -> IntMap a -> (# (##) | a #)
lookup k m = case lookup# k m of
  (# | a #) -> Just a
  _ -> Nothing
{-# INLINE lookup #-}

member k m = case lookup# k m of
  (# | _ #) -> True
  _ -> False

But the compat story is something of a mess and will surely involve double-barreled continuations anyway. We still might want to do it though.

@treeowl
Copy link
Contributor Author

treeowl commented Jan 19, 2020

I just had a radically simple idea that definitely has its own trade-offs but might be worth considering if it can actually be implemented easily.

data IntMap a
  = Bin
     { _minK :: !Int
       _minV :: a
       _lft :: !(IntMap a)
       _maxK :: !Int
       _maxV :: a
       _rght :: !(IntMap a) }
  | Tip
  | Single !Int a

This is quite an enormous Bin node, which is a bit scary, but it seems kind of nice in some ways:

  1. No more mess of alternating.
  2. No more special monkey business at the root.
  3. No more implicit information flowing down.
  4. Even more compact representation.

@jwaldmann
Copy link
Contributor

haven't found any applications that make heavy use of IntMap

I made this a while ago: https://gitlab.imn.htwk-leipzig.de/waldmann/containers-benchmark

@sjakobi
Copy link
Member

sjakobi commented Jan 20, 2020

@treeowl Could you demonstrate a bit how this structure works? How do minimum and maximum keys relate to those in the children?


@jwaldmann That looks very useful! We should try to get that into the containers source tree somehow.

To build this with a custom containers, I added this cabal.project:

packages: ., ../containers/containers

and ran

cabal test -O2

(The benchmark executable is the main testsuite)

Comparing 205e2a8 (the current HEAD commit in #340) against #340's base commit f7e27e6, the new IntMap seems slightly slower: On a quiet machine I'm measuring 3.5s vs. 3.6s (using bench to produce timings).

Profiling seems to put the blame mostly on Data.IntMap.Lazy.{union,difference}With:

image

Interestingly, allocations for unionWith seem to have decreased remarkably!

The profiles are attached: base-O2.prof.txt, gereeter-O2.prof.txt.

@jwaldmann
Copy link
Contributor

@sjakobi
Copy link
Member

sjakobi commented Jan 20, 2020

@jwaldmann This also looks very interesting. It would be very nice if you could make a PR for your benchmark patches!

@treeowl
Copy link
Contributor Author

treeowl commented Jan 20, 2020

It occurs to me that fgl may also be a good place to look for benchmarks.

@sjakobi
Copy link
Member

sjakobi commented Jan 20, 2020

It occurs to me that fgl may also be a good place to look for benchmarks.

Maybe. I don't see any benchmarks in the repo though.

@sjakobi
Copy link
Member

sjakobi commented Jan 20, 2020

It occurs to me that fgl may also be a good place to look for benchmarks.

Maybe. I don't see any benchmarks in the repo though.

rdf relies on fgl and has benchmarks… might be worth a spin…

@jwaldmann
Copy link
Contributor

on using fgl via rdf: yes - but in their current state, rdf benchmarks take ages, because of[500,1000..100000] in https://github.com/TravisWhitaker/rdf/blob/master/bench/Main.hs#L42 .
I replaced this list with map (10^) [3..5] and got

        Tue Jan 21 11:37 2020 Time and Allocation Profiling Report  (Final)

           bench-rdf-prof +RTS -p -s -h -RTS

        total time  =       59.60 secs   (59600 ticks @ 1000 us, 1 processor)
        total alloc = 75,388,764,824 bytes  (excludes profiling overheads)

COST CENTRE         MODULE                      SRC                                                  %time %alloc

encodeIRI           Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(55,1)-(63,25)         15.3   19.5
parseNQuads         Data.RDF.Parser.NQuads      src/Data/RDF/Parser/NQuads.hs:(66,1)-(68,22)          12.3   15.6
option              Data.Attoparsec.Combinator  Data/Attoparsec/Combinator.hs:90:1-25                 11.5   10.1
maybeBuilder        Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(46,1)-(49,58)          9.5    4.6
parseHost           Data.RDF.Internal           src/Data/RDF/Internal.hs:(311,1)-(312,56)              7.5    5.4
encodeEscapedIRI    Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(51,1)-(53,21)          7.0    3.4
parseLiteralBody    Data.RDF.Internal           src/Data/RDF/Internal.hs:(379,1)-(399,26)              6.2   10.5
encodeLiteral       Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(72,1)-(77,28)          4.6    3.5
parseBlankNodeLabel Data.RDF.Internal           src/Data/RDF/Internal.hs:(362,1)-(371,31)              2.8    2.5
parseScheme         Data.RDF.Internal           src/Data/RDF/Internal.hs:(288,1)-(295,33)              2.7    2.1
encodeQuad          Data.RDF.Encoder.NQuads     src/Data/RDF/Encoder/NQuads.hs:(43,1)-(53,49)          2.6    3.4
encodeBlankNode     Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(87,1)-(89,24)          2.2    1.4
nf                  Criterion.Measurement.Types src/Criterion/Measurement/Types.hs:(272,1)-(275,27)    2.1    0.9
parseSubject        Data.RDF.Internal           src/Data/RDF/Internal.hs:(338,1)-(342,68)              1.8    0.7
eitherResult        Data.Attoparsec.Text.Lazy   Data/Attoparsec/Text/Lazy.hs:(99,1)-(101,77)           1.8    4.3
parseEscapedIRI     Data.RDF.Internal           src/Data/RDF/Internal.hs:358:1-54                      1.6    1.4
encodeRDFGraph      Data.RDF.Encoder.NQuads     src/Data/RDF/Encoder/NQuads.hs:(55,1)-(59,50)          1.1    1.4
parse               Data.Attoparsec.Text.Lazy   Data/Attoparsec/Text/Lazy.hs:(79,1)-(88,58)            0.8    1.1

I am not seeing any IntMap (but some HashMap).

@gereeter
Copy link

The profiles are attached:

I may be misunderstanding something about how the profiling works, but I'm very concerned to see boundKey and unbox in there. Is the profiling adding some overhead to those functions and/or stopping them from being inlined? They should literally not exist: boundKey is unwrapping a newtype (i.e. doing nothing in Core), and unbox is only applied in cases where the key came from Node or IntMap_, where the integer was unboxed anyway and there is nothing to do. I've never seen them when looking at Core.

@gereeter
Copy link

stopping them from being inlined

I just realized that neither of those functions has INLINE explicitly written out, and the profiling documentation says it will add cost centers to all functions not marked INLINE.

@treeowl
Copy link
Contributor Author

treeowl commented Jan 21, 2020

The profiles are attached:

I may be misunderstanding something about how the profiling works, but I'm very concerned to see boundKey and unbox in there. Is the profiling adding some overhead to those functions and/or stopping them from being inlined? They should literally not exist: boundKey is unwrapping a newtype (i.e. doing nothing in Core), and unbox is only applied in cases where the key came from Node or IntMap_, where the integer was unboxed anyway and there is nothing to do. I've never seen them when looking at Core.

Overly aggressive cost-center profiling (e.g., -fprof-auto) will definitely hurt performance. -fprof-auto-exported is likely better, but make sure the box/unbox primitives are in a module not compiled like that. The safest bet is to add all the cost centers manually.

@gereeter
Copy link

After slapping some INLINEs and -fprof-auto-exported around (and running into the fact that you can't INLINE field accessors like boundKey), I got:

`master` branch
   total time  =        7.01 secs   (7013 ticks @ 1000 us, 1 processor)
   total alloc = 5,308,097,088 bytes  (excludes profiling overheads)

COST CENTRE         MODULE                   SRC                                               %time %alloc

split               Matchbox.Chain           src/Matchbox/Chain.hs:(34,1)-(36,34)               25.9   17.8
insert              Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(517,1)-(525,33)   10.5    2.3
value               Matchbox.Automaton       src/Matchbox/Automaton.hs:(127,1)-(132,18)          8.5    0.7
unionWithKey        Data.IntMap.Internal     src/Data/IntMap/Internal.hs:(1092,1)-(1093,84)      7.0   15.4
differenceWithKey   Data.IntMap.Internal     src/Data/IntMap/Internal.hs:(1126,1)-(1127,39)      6.4    4.1
delta_insert        Matchbox.Automaton       src/Matchbox/Automaton.hs:(163,1)-(176,25)          6.2   17.1
unions              Matchbox.Semiring        src/Matchbox/Semiring.hs:26:1-25                    5.6    4.1
intersectionWithKey Data.IntMap.Internal     src/Data/IntMap/Internal.hs:(1289,1)-(1290,102)     4.4    2.8
rfc                 Matchbox.Decomp          src/Matchbox/Decomp.hs:(79,1)-(94,52)               4.3    7.1
balanceR            Data.Map.Internal        src/Data/Map/Internal.hs:(4102,1)-(4121,50)         4.2   13.4
map                 Data.IntMap.Internal     src/Data/IntMap/Internal.hs:(2420,1)-(2424,26)      2.9    5.6
unionWith           Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(982,1)-(988,35)    2.0    0.8
filterWithKey       Data.IntMap.Internal     src/Data/IntMap/Internal.hs:(2599,1)-(2603,44)      2.0    1.2
`direct-bounded-intmap` branch
   total time  =        7.09 secs   (7093 ticks @ 1000 us, 1 processor)
   total alloc = 5,092,794,112 bytes  (excludes profiling overheads)

COST CENTRE      MODULE                   SRC                                               %time %alloc

split            Matchbox.Chain           src/Matchbox/Chain.hs:(34,1)-(36,34)               25.1   18.6
insert           Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(517,1)-(525,33)    9.7    2.4
value            Matchbox.Automaton       src/Matchbox/Automaton.hs:(127,1)-(132,18)          8.4    0.8
unionWith        Data.IntMap.Lazy         src/Data/IntMap/Lazy.hs:518:1-36                    7.0    9.2
differenceWith   Data.IntMap.Lazy         src/Data/IntMap/Lazy.hs:667:1-46                    6.4    4.0
delta_insert     Matchbox.Automaton       src/Matchbox/Automaton.hs:(163,1)-(176,25)          6.0   17.8
unions           Matchbox.Semiring        src/Matchbox/Semiring.hs:26:1-25                    5.3    4.3
balanceR         Data.Map.Internal        src/Data/Map/Internal.hs:(4102,1)-(4121,50)         4.6   13.9
rfc              Matchbox.Decomp          src/Matchbox/Decomp.hs:(79,1)-(94,52)               4.5    7.3
intersectionWith Data.IntMap.Lazy         src/Data/IntMap/Lazy.hs:877:1-50                    4.2    3.6
mapLazy          Data.IntMap.Internal     src/Data/IntMap/Internal.hs:1261:1-44               3.0    4.9
unionWith        Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(982,1)-(988,35)    2.0    0.9
filter           Data.IntMap.Internal     src/Data/IntMap/Internal.hs:2072:1-34               2.0    2.7
timesM           Matchbox.Height          src/Matchbox/Height.hs:(68,1)-(80,54)               1.1    0.7
map              Matchbox.Map.Enum        src/Matchbox/Map/Enum.hs:96:1-31                    1.0    0.0
unionDisjointL   Data.IntMap.Internal     src/Data/IntMap/Internal.hs:(1453,1)-(1460,58)      0.4    1.7

Now, I'm not sure I got everything right and I wasn't running on a particularly quiet machine, but at least I didn't see any boundKey, unbox, or xorCaches floating around. Overall, I'm not too surprised by the results:

  • Performance is around the same, possibly a bit slower, particularly in regards to unionWith (percentage is the same, but unionDisjoint also shows up a bit). This is a merge-heavy workload, it seems, and union has the last known majorly regressing cases, hence the "optimize union more" item on the TODO list for this issue.
  • IntMap allocation is way down. This is as it should be, since the structure is more compact. It really makes the GHC test failures that much more confusing, though.

@jwaldmann
Copy link
Contributor

Re: -fprof-auto

I think there are two ways of using profiling info:

  • find out which functions are called (e.g., does rdf use IntMap at all). For that, anything goes?
  • exact measurements, for micro-optimization - but cost centers may prevent inlining, and distort measurements.

Manually adding cost centers seems quite heavy. Statistical profiling would work without instrumentation? But what's the status? https://gitlab.haskell.org/ghc/ghc/wikis/dwarf/status#statistical-profiling

@gereeter
Copy link

Diff Haddocks to make sure no functions or instances have gone missing.

I just finished this myself and patched up the last holes that I found. Could someone else double check?

@sjakobi sjakobi added the IntMap label Jul 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants