Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Ord by default for all set-like operations #192

Open
hdgarrood opened this issue Dec 18, 2020 · 25 comments
Open

Use Ord by default for all set-like operations #192

hdgarrood opened this issue Dec 18, 2020 · 25 comments
Labels
purs-0.15 A reminder to address this issue or merge this PR before we release PureScript v0.15.0 status: blocked This issue or PR is blocked by something and cannot make progress. type: breaking change A change that requires a major version bump.

Comments

@hdgarrood
Copy link
Contributor

We already made nub use Ord by default in #91, but there are a few other unnecessarily quadratic functions in this library still which probably deserve the same treatment. I think these are union, intersect, and difference.

As well as being quadratic, union and difference have a strange-seeming behaviour where duplicates are only preserved in the first argument:

union [0, 0] [1, 1] == [0, 0, 1]

See #56 for more info. We ought to be careful about changing this behaviour but I think we should consider it, since I think it’s very likely to trip people up and we will be making breaking changes to these functions anyway if we do this.

@hdgarrood
Copy link
Contributor Author

If we go ahead with this, the same changes should propagate through to the lists library too.

@milesfrain
Copy link
Contributor

duplicates are only preserved in the first argument

Yes, this seems really strange, even if it does match how things are done in Haskell. I'm curious if any projects rely on this asymmetric duplicate-preserving behavior.

If we match Set behavior and remove all duplicates, then we get simpler code (just append then nub), and we can do property-based testing against Set.

Another option is to thin-out these libraries a bit and recommend using Set instead. Potential downsides are inconvenience, loss of ordering, and worse performance (would be interesting to benchmark). And we'd still need the Ord-free versions, since those aren't Set-compatible.

Perhaps we should open a discourse thread to discuss these proposals, since this affects multiple libraries.

@hdgarrood
Copy link
Contributor Author

The main reason I don't really want to just deprecate these functions and tell people to just use sets instead is that arrays have ordering, whereas sets don't, which means that the set-like Array/List operations can preserve ordering, and this might be useful in some cases.

@JordanMartinez
Copy link
Contributor

This is the last issue in this repo that doesn't currently have a PR opened for it.

@milesfrain
Copy link
Contributor

milesfrain commented Dec 30, 2020

I'm thinking we should table all these container (List, Array, etc.) updates for 0.15.
I think it would be more efficient to do a systematic update of all our container libraries, where we start with a proposal of what we want our final result to be, rather than through a bunch of piecemeal PRs.

@JordanMartinez
Copy link
Contributor

A few questions I have.

First, how is union different from <|>?

Second, in the following examples of difference/\\, what should the output be?

[1, 2, 3, 4, 3, 2] \\ [1]
-- I assume [2, 3, 4, 3, 2]

[1, 2, 3, 4, 3, 2] \\ [2]
[1, 2, 3, 4, 3, 2] \\ [2, 2]
-- It depends on whether each 2 in the right array
-- removes only one or all 2 element(s) in the left array

@milesfrain
Copy link
Contributor

how is union different from <|>?

union does a set-like union of the two arrays. <|> is the same as append/<>.
union keeps any duplicates that already exist in the first array and removes duplicates from the second array. Keeping these duplicates seems like the lesser-of-evils, since you're forced to break one of these rules when applying set operations on non-sets:

a union b == b union a  -- we break this rule
a union empty == a      -- we follow this rule
a union a == a          -- we follow this rule
a union (a <> a) == a   -- we follow this rule

I think difference behaves reasonably for arrays too, but it's still somewhat of an imperfect set-like operation.

Example input/output:

  logShow $ union [1,2,3,3] [2,2,4,4]
  logShow $ [1,2,3,3] <|> [2,2,4,4]
  logShow $ [1,2,3,3] <> [2,2,4,4]
  logShow $ [1, 2, 3, 4, 3, 2] \\ [1]
  logShow $ [1, 2, 3, 4, 3, 2] \\ [2]
  logShow $ [1, 2, 3, 4, 3, 2] \\ [2, 2]
[1,2,3,3,4]
[1,2,3,3,2,2,4,4]
[1,2,3,3,2,2,4,4]
[2,3,4,3,2]
[1,3,4,3,2]
[1,3,4,3]

https://try.ps.ai/?gist=aea5bdaf7669d2fd00c8d80cecc01b38

@milesfrain
Copy link
Contributor

The main reason I don't really want to just deprecate these functions and tell people to just use sets instead is that arrays have ordering, whereas sets don't, which means that the set-like Array/List operations can preserve ordering, and this might be useful in some cases.

I think an ordered set would be even better. http://hackage.haskell.org/package/ordered-containers-0.2.2/docs/Data-Set-Ordered.html
But no objections to keeping the set-like operations for arrays.

@JordanMartinez
Copy link
Contributor

ordered set would be even better.

Isn't better subjective to the problem one is solving?

@milesfrain
Copy link
Contributor

Better if we assume these requirements:

  • maintain ordering
  • not having any weirdness with set-like operations

Of course, there can be other requirements, but I didn't see anything else mentioned.

@JordanMartinez
Copy link
Contributor

a union b == b union a  -- we break this rule
a union empty == a      -- we follow this rule
a union a == a          -- we follow this rule
a union (a <> a) == a   -- we follow this rule

logShow $ union [1,2,3,3] [2,2,4,4]

Thanks @milesfrain. These examples helped! Pretty sure #203 is ready to go now.

@razcore-rad
Copy link

Should other containers than sets behave like sets during operations like union, intersect etc?

I personally would assume that an-order preserving container like Array would behave like this:

 -- union is <>
[ 0, 0, 1, 1 ] `union` [ 2, 2, 1 ] == [ 0, 0, 1, 1 ] <> [ 2, 2, 1 ] == [ 0, 0, 1, 1, 2, 2, 1 ]

-- intersect - occurrences are matched
[ 0, 0, 1, 1, 2, 2 ] `intersect` [ 1, 2, 2 ] == [ 1, 2, 2 ]

-- difference - occurrences are matched like `intersect`
[ 0, 0, 1, 2, 2 ] `difference` [ 1, 1, 2 ] == [ 0, 0, 2 ]

All preserving order so union isn't commutative. But the idea is that since these containers do preserve order, and allow duplicates, unlike the set container, then it makes sense that union isn't commutative. And I'd say that this interpretation of the operations makes sense cause of this. But I'm no expert in math so... there's that too.

@milesfrain
Copy link
Contributor

milesfrain commented Jan 14, 2021

union is <>

If that is the case, then I don't think we should offer union at all.

But there is another option for union that builds on your matching proposal for intersect and difference:

union a b = difference a b <> intersect a b <> difference b a

a = [0, 0, 1, 1, 2, 2]
b = [1, 2, 2, 3]

difference a b == [0, 0, 1]
intersect a b == [1, 2, 2]
difference b a == [3]

union a b == [0, 0, 1, 1, 2, 2, 3]
union b a == [3, 1, 2, 2, 0, 0, 1]
sort(union a b) == sort(union b a)

We could include index tracking and recovery to preserve ordering, so:

union b a == [1, 2, 2, 3, 0, 0, 1]

Edit: I realize my above definition could be simplified to:

union a b = a <> difference b a

Which appears to be pretty close to the existing behavior of:

union a b = a <> nub (difference b a)

I think the nub-free version is better-behaved.


But my number-one preference is to eliminate these set-like functions from Array, which rely on nonintuitive matching, and add an ordered set collection.

@razcore-rad
Copy link

I think there's a strong argument to be made that these functions only make sense for sets so I'm not opposed to removing them for array. I was just offering what I thought too be a better intuition for theirs functionality for this data type. And union as you point out could be improved.

As for ordered set are you talking about this? https://pursuit.purescript.org/packages/purescript-ordered-set/0.4.0

@milesfrain
Copy link
Contributor

As for ordered set are you talking about this? https://pursuit.purescript.org/packages/purescript-ordered-set/0.4.0

Oh, nice! Didn't find that on my initial survey.

@hdgarrood
Copy link
Contributor Author

As far as I'm aware, in all of these discussions, we haven't yet had one person indicate that they're currently using union, intersection, or difference, so I am starting to come around to removing them too.

@natefaubion
Copy link

We use intersection and difference in a few places. I don't think we use union.

@hdgarrood
Copy link
Contributor Author

Ah, fantastic. I'm curious: Would you miss them if we removed them? Would you notice if their behaviour around the handling of duplicates changed? Would it be awkward to refactor that code to use the ordered-set package instead?

@hdgarrood
Copy link
Contributor Author

(Just to be clear, this isn't sarcastic: I'm pleased that we can hear from someone who is actually using these functions)

@natefaubion
Copy link

I think we are assuming there are no duplicates, we just need the ordering and convenience of Arrays, likely. I don't think we should remove them without having a recommended replacement. We have not had any issues with the existing implementation.

@razcore-rad
Copy link

I think the discussion is more about the way they should behave, like if we were to take the concepts from scratch, what would be the "natural" behavior. Not that they can be used the way they are implemented as is, cause sure we can use them as they are, as long as they're consistent in some way we can find usages.

It strikes me that the current behavior is weird, that is removing duplicates from one array and not the other in some operations.

Don't you think that your particular case is better covered by purescript-ordered-set instead?

@hdgarrood
Copy link
Contributor Author

The problem with suggesting ordered-set as an alternative is that a) it’s a third party library and it’s hard to be sure how well maintained it’s going to be, and b) that package provides no O(1) conversion to Array, and converting from Array is necessarily O(n log n); if these functions work for arrays already no conversion is necessary. The obvious response to a) is that we should bring ordered-set into core but that’s not appealing to me either, as a huge part of the reason it’s taken us so long to get 0.14 released is that there is so much stuff in core already.

Since this is such a widely used module, and as long as we can’t identify one ideal way the Array functions should behave, I think my preference would now be to leave the behaviour alone, but change them to use Ord for better performance in the next round of breaking changes.

@JordanMartinez JordanMartinez added purs-0.15 A reminder to address this issue or merge this PR before we release PureScript v0.15.0 type: breaking change A change that requires a major version bump. labels Dec 1, 2021
@JordanMartinez
Copy link
Contributor

Following up with this, what needs to change?

@natefaubion
Copy link

It would require someone rewriting these functions in terms of Ord. Without anyone volunteering to do that, we should defer this.

@JordanMartinez
Copy link
Contributor

Ah, I remember this now. I tried to resolve this issue in #203 but stopped because performance suffered greatly: #203 (comment)

@JordanMartinez JordanMartinez added the status: blocked This issue or PR is blocked by something and cannot make progress. label Mar 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
purs-0.15 A reminder to address this issue or merge this PR before we release PureScript v0.15.0 status: blocked This issue or PR is blocked by something and cannot make progress. type: breaking change A change that requires a major version bump.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants