Make `union`, `difference`, and `intersect*` run linearly by sorting elements first #203

JordanMartinez · 2020-12-31T19:06:24Z

Fixes #192

JordanMartinez · 2020-12-31T22:16:54Z

I've updated the intersectionBy and difference/differenceBy to work and pass CI. I'm still not sure how union differs from alt/<|>, so I'm going to wait until that question is clarified.

JordanMartinez · 2020-12-31T23:53:21Z

This PR is ready for review

hdgarrood · 2021-01-01T04:04:24Z

I haven't reviewed properly yet but just so you know, sorting based on a comparison function which only compares two elements at a time is always at least O(n log n), which is beyond linear. Not by that much, admittedly, since log n grows very slowly, but it's incorrect to say that it is linear if we're sorting.

JordanMartinez · 2021-01-01T05:25:22Z

Hmm..... Besides the eq to compare change and the lack of STFn, is there anything else that might make this benchmark misleading? Everything is slower by quite a bit.

Assume the following values for each benchmark below

let 
  shortNats = Array.range 0 100
  longNats = Array.range 0 10000
  mod3Eq x y = (x `mod` 3) == (y `mod` 3)
  mod3Cmp x y = compare (x `mod` 3) (y `mod` 3)
  
-- `master` uses `mod3Eq` whereas this PR uses `mod3Cmp`

unionBy

benchUnionBy = do
    log $ "unionBy (" <> show (Array.length shortNats) <> ")"
    benchWith 1000 \_ -> Array.unionBy mod3Eq shortNats shortNats

    log $ "unionBy (" <> show (Array.length longNats) <> ")"
    benchWith 100 \_ -> Array.unionBy mod3Eq longNats longNats

unionBy -- master branch
---------------
unionBy (101)
mean   = 60.73 μs
stddev = 81.23 μs
min    = 34.67 μs
max    = 1.36 ms
unionBy (10001)
mean   = 3.40 ms
stddev = 1.56 ms
min    = 2.98 ms
max    = 18.20 ms

unionBy -- this PR
---------------
unionBy (101)
mean   = 208.75 μs
stddev = 197.98 μs
min    = 162.88 μs
max    = 3.23 ms
unionBy (10001)
mean   = 32.38 ms
stddev = 6.30 ms
min    = 28.67 ms
max    = 73.98 ms

intersectBy

  benchIntersectBy = do
    log $ "intersectBy (" <> show (Array.length shortNats) <> ")"
    benchWith 1000 \_ -> Array.intersectBy mod3Eq shortNats shortNats

    log $ "intersectBy (" <> show (Array.length longNats) <> ")"
    benchWith 100 \_ -> Array.intersectBy mod3Eq longNats longNats

intersectBy -- current master
---------------
intersectBy (101)
mean   = 16.93 μs
stddev = 39.69 μs
min    = 7.69 μs
max    = 789.20 μs
intersectBy (10001)
mean   = 760.82 μs
stddev = 549.92 μs
min    = 626.33 μs
max    = 6.13 ms

intersectBy -- this PR
---------------
intersectBy (101)
mean   = 205.14 μs
stddev = 150.04 μs
min    = 158.94 μs
max    = 2.21 ms
intersectBy (10001)
mean   = 31.81 ms
stddev = 5.34 ms
min    = 28.55 ms
max    = 62.24 ms

difference

  benchDifference = do
    log $ "difference (" <> show (Array.length shortNats) <> ")"
    benchWith 1000 \_ -> Array.difference shortNats shortNats

    log $ "difference (" <> show (Array.length longNats) <> ")"
    benchWith 100 \_ -> Array.difference longNats longNats

difference -- current master
---------------
difference (101)
mean   = 79.57 μs
stddev = 48.39 μs
min    = 63.89 μs
max    = 742.35 μs
difference (10001)
mean   = 492.35 ms
stddev = 24.31 ms
min    = 463.04 ms
max    = 594.56 ms

difference -- this PR
---------------
difference (101)
mean   = 142.91 μs
stddev = 136.28 μs
min    = 106.35 μs
max    = 1.73 ms
difference (10001)
mean   = 21.30 ms
stddev = 3.75 ms
min    = 18.21 ms
max    = 44.52 ms

milesfrain · 2021-01-07T07:45:30Z

src/Data/Array.purs

+-- combineIndex compare [0] [0, 1]
+--   == [t3 true 0 0, t3 false 1 0, t3 false 2 1]


Suggested change

-- combineIndex compare [0] [0, 1]

-- == [t3 true 0 0, t3 false 1 0, t3 false 2 1]

-- combineIndex compare [7] [9, 5]

-- == [t3 true 0 7, t3 false 1 9, t3 false 2 5]

Easier to follow example with values that are distinct from indices

But then there aren't values that exist in both arrays.

Since combineIndex doesn't care about duplicates, I don't think we need the example to involve duplicates.
This would be like saying that [1,2] <> [3,4] == [1,2,3,4] is a bad example for append because it doesn't clarify that duplicates are preserved.

Could compromise with something like [7] [7, 5].

Since combineIndex doesn't care about duplicates, I don't think we need the example to involve duplicates.

🤦‍♂️ You're right! I forgot that this function doesn't need to care unlike the ones I'm working on. I'll make that fix.

milesfrain · 2021-01-07T07:57:07Z

src/Data/Array.purs

+-- ```
+combineIndex :: forall a. Array a -> Array a -> Array (Tuple Boolean (Tuple Int a))
+combineIndex left right = ST.run do
+  out <- STA.new


Are there any benefits to preallocating the array to a known size? Tried to find a clear answer for this in JS. And if there is a benefit that we want to take advantage of, we'd need to expand the Array.ST API.

I've wondered about that myself...

src/Data/Array.purs

milesfrain · 2021-01-07T08:17:11Z

src/Data/Array.purs

+  let rightLen = length right
+  ST.for 0 rightLen \idx -> do
+    let val = unsafePartial $ unsafeIndex right idx
+    void $ STA.push (Tuple false (Tuple (leftLen + idx) val)) out


Are there any performance penalties to using a record instead of a tuple? Luckily, this tuple happens to be type-safe because of the distinct types (Boolean, Int, a).

Not sure. I also tried it out with data Tuple3 a b c = Tuple3 a b c and that sped things up only slightly.

milesfrain · 2021-01-07T08:26:42Z

src/Data/Array.purs

-unionBy :: forall a. (a -> a -> Boolean) -> Array a -> Array a -> Array a
-unionBy eq xs ys = xs <> foldl (flip (deleteBy eq)) (nubByEq eq ys) xs
+unionBy :: forall a. (a -> a -> Ordering) -> Array a -> Array a -> Array a
+unionBy cmp left right = map snd $ ST.run do


It may be more efficient to take advantage of the of the fact that the output is always the first array plus some other stuff appended to it. Here's some non-ST pseudo-ish code describing that option:

union left right = left <> other where other = combineIndex right left -- intentionally putting right first and assuming stable sort # sortBy valueFunc # groupBy valueFunc # map head # filter keepRightOnly

What would be the runtime cost of this approach versus mine?

I made the rather stupid assumption that mine would be n+m because when Harry said the current version was n*m, I thought he was implying we should make the code linear. I didn't actually analyze my code using Big O notation to see how many steps it takes.

I believe all reasonable approaches are O(n log n) (where n is the sum of array sizes). But benchmarking could still reveal a 3x speedup with a different approach, which would still be O(n log n). I'll play around with this and report back with findings. The choice of input data could also significantly change the relative performance of different algorithms.

hdgarrood · 2021-01-08T19:21:49Z

I think these code changes should wait until we have reached a consensus on how these functions should actually work - there are still unknowns to do with things the handling of duplicates. I’m not sure how much it helps to change union to use Ord now if we plan to change its behaviour with respect to duplicates later, as that later change would still be breaking even if the type is the same.

JordanMartinez · 2021-01-08T20:15:37Z

So, will this change be merged before v0.14.0? Or should we discuss it further and include it in v0.15.0?

JordanMartinez · 2021-01-08T21:17:36Z

Minor optimization by only working with indices and copying values only once to the final array.

hdgarrood · 2021-01-09T00:13:57Z

I’m tempted to leave it for 0.15.0 personally.

JordanMartinez · 2021-01-11T15:12:50Z

I think this shouldn't be implemented until we have STFn. At the very least, the performance hit doesn't make it worth it in my mind. If one wanted this behavior, they could implement it locally in their project using this PR as a basis for what to do (and likely they would use FFI to get around the performance issue).

JordanMartinez · 2021-01-11T15:13:18Z

I'm going to mark arrays as done in the core libs tracking issue

Update difference and intersectBy to run linearly by sorting elements

615f2f4

JordanMartinez changed the title ~~Implement differenceBy~~ Make difference and intersectBy run linearly by sorting elements first Dec 31, 2020

JordanMartinez added 3 commits December 31, 2020 14:39

Remove unnecessary head in differenceBy and intersectBy

f1ffa4a

Extract sorting part out of combineIndexSort; rename to combineIndex

d82c947

Implement unionBy

b3be7f4

JordanMartinez mentioned this pull request Dec 31, 2020

Use Ord by default for all set-like operations #192

Open

JordanMartinez changed the title ~~Make difference and intersectBy run linearly by sorting elements first~~ Make union*, difference*, and intersect* run linearly by sorting elements first Dec 31, 2020

Sort array in-place in unionBy, differenceBy, and intersectBy

a9c9514

JordanMartinez mentioned this pull request Jan 1, 2021

Speedup 'nubBy' by doing more things to intermediate array in-place #205

Closed

milesfrain reviewed Jan 7, 2021

View reviewed changes

src/Data/Array.purs Outdated Show resolved Hide resolved

milesfrain reviewed Jan 7, 2021

View reviewed changes

Clarify comment explaining combineIndex

d04167b

JordanMartinez added 2 commits January 8, 2021 12:16

Use example for combineIndex that doesn't include duplicates

72e556d

differenceyBy: Only copy elements once; use indices everywhere else

b1577bc

milesfrain mentioned this pull request Jan 9, 2021

Add benchmarks for nubEq, union, intersect, and difference #206

Closed

JordanMartinez closed this Jan 11, 2021

JordanMartinez mentioned this pull request Apr 14, 2021

added sliding-window functions sliding and slidingWithSizeAndStep #212

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `union`, `difference`, and `intersect*` run linearly by sorting elements first #203

Make `union`, `difference`, and `intersect*` run linearly by sorting elements first #203

JordanMartinez commented Dec 31, 2020 •

edited

Loading

JordanMartinez commented Dec 31, 2020

JordanMartinez commented Dec 31, 2020

hdgarrood commented Jan 1, 2021

JordanMartinez commented Jan 1, 2021

milesfrain Jan 7, 2021

JordanMartinez Jan 7, 2021

milesfrain Jan 7, 2021

JordanMartinez Jan 8, 2021

milesfrain Jan 7, 2021

JordanMartinez Jan 7, 2021

milesfrain Jan 7, 2021

JordanMartinez Jan 7, 2021

milesfrain Jan 7, 2021

JordanMartinez Jan 7, 2021

milesfrain Jan 7, 2021 •

edited

Loading

hdgarrood commented Jan 8, 2021

JordanMartinez commented Jan 8, 2021

JordanMartinez commented Jan 8, 2021

hdgarrood commented Jan 9, 2021

JordanMartinez commented Jan 11, 2021

JordanMartinez commented Jan 11, 2021

		-- combineIndex compare [0] [0, 1]
		-- == [t3 true 0 0, t3 false 1 0, t3 false 2 1]

Make union*, difference*, and intersect* run linearly by sorting elements first #203

Make union*, difference*, and intersect* run linearly by sorting elements first #203

Conversation

JordanMartinez commented Dec 31, 2020 • edited Loading

JordanMartinez commented Dec 31, 2020

JordanMartinez commented Dec 31, 2020

hdgarrood commented Jan 1, 2021

JordanMartinez commented Jan 1, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

milesfrain Jan 7, 2021 • edited Loading

Choose a reason for hiding this comment

hdgarrood commented Jan 8, 2021

JordanMartinez commented Jan 8, 2021

JordanMartinez commented Jan 8, 2021

hdgarrood commented Jan 9, 2021

JordanMartinez commented Jan 11, 2021

JordanMartinez commented Jan 11, 2021

Make `union`, `difference`, and `intersect*` run linearly by sorting elements first #203

Make `union`, `difference`, and `intersect*` run linearly by sorting elements first #203

JordanMartinez commented Dec 31, 2020 •

edited

Loading

milesfrain Jan 7, 2021 •

edited

Loading