Optimize traverse #4498
Conversation
StackSafeMonads
Also fix 2.12 errors
different runtime types for the Applicative instance
Can we add some new laws in terms of
Nah.
Can you post the results of the benchmarks? Apologies if I missed them.
val iter = fa.iterator
if (iter.hasNext) {
  val first = iter.next()
  G.map(iter.foldLeft(f(first)) { case (g, a) =>
I think this can be G.void(..) so that it has a chance for an optimized implementation.
Yes, good catch!
G.map(fa.iterator.foldLeft(G.pure(builder)) { case (bldrG, a) =>
  G.flatMap(bldrG) { bldr =>
    G.map(f(a)) {
      case Some(b) => bldr += b
I can't see why we know this is safe or lawful.
We have a mutable data structure that could potentially be in a multithreaded situation with G.
Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.
I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.
We have a mutable data structure that could potentially be in a multithreaded situation with G.
It can't be: every append to the builder happens in flatMap, so by law it must be sequential, not concurrent/parallel.
Also, consider cases like IO where you have a long computation that finally fails, then you recover some part of it to succeed. It feels like this mutable builder could remember things from failed branches.
Hmm. I'm not entirely sure how this applies to traverse. It doesn't have a notion of "recovery". For sure, each individual step that the traverse runs may have a notion of recovery, but that's just it: it will either succeed or it will fail. But there's no way to "recover" an intermediate structure from the traverse itself.
I think using an immutable builder (like for instance just building up a List, Chain or Vector) would be much easier to verify it is lawful.
Won't disagree here. We could do a benchmark to see how much performance we are leaving on the table with that strategy.
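To make the immutable-accumulator strategy concrete, here is a standalone sketch. The Monad trait, Id instance, and traverseDirectly signature below are minimal stand-ins for illustration only, not cats' actual StackSafeMonad machinery:

```scala
// Minimal stand-in for a monad, for illustration only (not cats.Monad).
trait Monad[G[_]] {
  def pure[A](a: A): G[A]
  def flatMap[A, B](ga: G[A])(f: A => G[B]): G[B]
  def map[A, B](ga: G[A])(f: A => B): G[B] = flatMap(ga)(a => pure(a))
}

object TraverseSketch {
  type Id[A] = A
  implicit val idMonad: Monad[Id] = new Monad[Id] {
    def pure[A](a: A): Id[A] = a
    def flatMap[A, B](ga: Id[A])(f: A => Id[B]): Id[B] = f(ga)
  }

  // Immutable accumulator: each step appends to a persistent Vector,
  // so no mutable state is shared between effect steps.
  def traverseDirectly[G[_], A, B](
    fa: IterableOnce[A]
  )(f: A => G[B])(implicit G: Monad[G]): G[Vector[B]] =
    fa.iterator.foldLeft(G.pure(Vector.empty[B])) { (accG, a) =>
      G.flatMap(accG)(acc => G.map(f(a))(acc :+ _))
    }

  def main(args: Array[String]): Unit = {
    val out = traverseDirectly[Id, Int, Int](List(1, 2, 3))(i => i * 2)
    assert(out == Vector(2, 4, 6))
    println(out)
  }
}
```

Because the accumulator is a persistent structure, reusing an intermediate value in a retried or recovered branch cannot observe appends from a failed branch, which is what makes lawfulness easier to argue than with a mutable builder.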
Yes, will do. I had originally asked whether I should add the new ones in a separate PR first, which was why I didn't run them straight away.
@armanbilge @johnynek apologies it's taken me so long to get round to this. I've updated with benchmarks for the mutable builder approach. Are you still worried about lawfulness? Shall I try to do it with immutable data structures?
Force-pushed from b92a558 to b338f18
Is this necessary? The monad laws should ensure that it's safe to use mutable builders. Nonetheless it will be good to confirm the performance delta for using immutable data structures
Force-pushed from b338f18 to 8cc742e
Thanks for working on this. What I don't see is how a Monad such as fs2.Stream[IO, A] would interact here. In that case you can have laziness, concurrency and more than one output per input. Can you give an explanation as to why we shouldn't worry about mutability in such a case?
We can't have concurrency: we are sequencing the effects with flatMap.
private[cats] def traverseDirectly[G[_], A, B](
  fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
  fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (accG, a) =>
I think we can do even better here by using a Vector.newBuilder, we can do even even better by using a dedicated builder for the desired collection (e.g. ArraySeq or List), and we can do even even even better by providing a size hint to the builder.
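For reference, the size-hint idea can be sketched with plain Scala collections (the numbers below are arbitrary; `knownSize` returns -1 when the size cannot be computed in O(1)):

```scala
object SizeHintSketch {
  def main(args: Array[String]): Unit = {
    val source: Iterable[Int] = 1 to 1000
    val b = Vector.newBuilder[Int]
    // Pre-allocate when the source can report its size cheaply,
    // avoiding repeated internal growth of the builder.
    val hint = source.knownSize
    if (hint >= 0) b.sizeHint(hint)
    source.iterator.foreach(b += _)
    val out = b.result()
    assert(out.length == 1000 && out.head == 1 && out.last == 1000)
    println(out.length)
  }
}
```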
I might have misunderstood something, but it looks like that's how things were done before the commit 8cc742e was introduced, which added the immutable data structures. 🙂
You're right :) let's just say this PR has been open a while and I lost track of the old stuff 😇
This is definitely a colloquial understanding of Monads, that they represent a sequential computation, but I don't see the law that allows us to make the claim you want. For instance, in the current PR we are using immutable items inside the monadic type constructor. I think the current writing will work. But if we changed those to be mutable, I don't see a proof that it will work, even though it may. For instance, consider this type:

sealed trait Tree[F[_], A]
case class Item[F[_], A](fa: A) extends Tree[F, A]
case class InsideF[F[_], A](treeF: F[Tree[F, A]]) extends Tree[F, A]
case class Branch[F[_], A](left: Tree[F, A], right: Tree[F, A]) extends Tree[F, A]

object Tree {
  def pure[F[_], A](a: A): Tree[F, A] = Item(a)

  def flatMap[F[_]: Functor, A, B](fa: Tree[F, A])(fn: A => Tree[F, B]): Tree[F, B] =
    fa match {
      case Item(a)          => fn(a)
      case InsideF(fa)      => InsideF(fa.map(flatMap(_)(fn)))
      case Branch(left, right) =>
        Branch(flatMap(left)(fn), flatMap(right)(fn))
    }
}

Now, I haven't checked if this follows the monad laws as we have expressed them, but I think it or something like it could. In this picture, note we are recursively calling flatMap inside the F context and outside of it. So, if F is lazy, we return before we have fully recursed. Couldn't something like this cause a problem?

I think saying "monads are sequential" is a bit too hand-wavy to bolt into the default implementation for every user of cats. I think we need a stronger argument than that.
To see if this is more performant than an immutable vector
if (iter.hasNext) {
  val first = iter.next()
  G.void(iter.foldLeft(f(first)) { case (g, a) =>
    G.flatMap(g) { _ =>
why not G.productR(g, f(a)) here? Is it to avoid calling f when g may have already failed? I think a comment would be helpful.
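The difference can be sketched with Option standing in for a short-circuiting G (names below are mine, for illustration): passing f(a) as a strict argument, the way a productR call site does, evaluates f even after an earlier step has failed, whereas flatMap only runs f once the previous step has succeeded.

```scala
object ShortCircuitSketch {
  var calls = 0
  def f(i: Int): Option[Int] = { calls += 1; if (i == 2) None else Some(i) }

  def main(args: Array[String]): Unit = {
    // flatMap-style: once f(2) fails, None.flatMap never invokes f(3).
    calls = 0
    List(1, 2, 3).foldLeft(Option(0))((g, a) => g.flatMap(_ => f(a)))
    assert(calls == 2)

    // productR-style: f(a) is a strict argument, so it runs for every element.
    calls = 0
    List(1, 2, 3).foldLeft(Option(0)) { (g, a) =>
      val fa = f(a) // evaluated eagerly, like G.productR(g, f(a))
      g.flatMap(_ => fa)
    }
    assert(calls == 3)
    println("ok")
  }
}
```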
There was no good reason 😅 Thanks, I'll change it.
Is it to avoid calling f when g may have already failed?
Maybe this is something we should consider though?
Edit: ahh, I see your comment on f95b087. Fair enough.
fa: IterableOnce[A]
)(f: A => G[Option[B]])(implicit G: StackSafeMonad[G]): G[Vector[B]] = {
  fa.iterator.foldLeft(G.pure(Vector.empty[B])) { case (bldrG, a) =>
    G.flatMap(bldrG) { acc =>
add a comment on why we didn't write:
G.map2(bldrG, f(a)) {
case (acc, Some(b)) => acc :+ b
case (acc, _) => acc
}
Thanks, good point 👍
fa: IterableOnce[A]
)(f: A => G[B])(implicit G: StackSafeMonad[G]): G[List[B]] = {
  G.map(fa.iterator.foldLeft(G.pure(List.empty[B])) { case (accG, a) =>
    G.flatMap(accG) { acc =>
add a comment why not G.map2(accG, f(a))(_ :: _)
Since we know G is lazy (a StackSafeMonad has to be, I think), I'm not sure of the downside here. One answer would be that we don't have to call f if we have already failed for a short-circuiting monad, but we are still iterating the whole list, so we are doing O(N) work either way. Adding the call to allocate the Monad doesn't seem like a big problem, since we have to allocate the function to pass to flatMap in the current case.
By calling map2 we are at least communicating to G what we are doing, and in principle, some monads could optimize this (e.g. a Parser can make a more optimized map2 than flatMap, and it can also be StackSafe since runs lazily only when input is passed to the resulting parser).
@valencik sorry to bother you. Just tagging you in case you missed my previous comment.
I agree, and similarly, I also do not know why
This almost seems like follow-up work to me? I think, if we are fine with the
Yeah I'm more than happy to draw a line under this PR and follow up separately - it's been open far too long 😆 I'll move it out of draft now.
Just to double check I understand, do you mean the implementation of
Yeah, let's switch that to
Currently running benchmarks again on a tweak which uses
Ok, I ran the benchmarks, I think
Feel free to leave it for a day or two. I also ran a test benchmark run with a
Thanks so much again! Presumably that PR is the same as the
How is
Use `Chain` in `traverseDirectly` helpers
Before we get into the weeds, my position remains that we should merge this PR if we are ok with the small regression to

@djspiewak Pinging you as the creator of these benchmarks, what do you think here? We've improved or held stable on all fronts except:

@Benchmark
def traverseVectorError(bh: Blackhole) = {
  val result = vectorT.traverse(vector) { i =>
    Eval.later {
      Blackhole.consumeCPU(Work)
      if (i == length * 0.3) {
        throw Failure
      }
      i * 2
    }
  }
  try {
    bh.consume(result.value)
  } catch {
    case Failure => ()
  }
}

My gut says these improvements are worth it and we should merge.
It was an experiment where I added additional
The commit: efa7ec0
Ah cool, thanks. Yeah I had asked about that before and I think we decided to leave that for potential follow-up.
@valencik sorry, I've lost the plot again 😅 Is there anything else I need to do to this right now?
We need to make a call on whether or not we're ok with the small regression in the traverseError micro benchmarks.
val (leftL, rightL) = fa.splitAt(leftSize)
runHalf(leftSize, leftL)
  .flatMap { left =>
    val right = runHalf(rightSize, rightL)
an interesting benchmark would be to do Eval.defer(runHalf(rightSize, rightL)), which would short circuit for applicatives G that can short circuit.
I wonder how much that would hurt the non-short-circuiting case... this is not for this PR, just something we may come back to in a later PR.
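A toy illustration of the trade-off, using a memoized by-name thunk as a stand-in for Eval.defer and Option for the short-circuiting applicative (all names below are mine): deferring the right half means it is never run when the left half fails, at the cost of an extra allocation per node.

```scala
object DeferSketch {
  // Minimal stand-in for Eval.defer: a memoized by-name thunk.
  final class Deferred[A](thunk: () => A) { lazy val value: A = thunk() }
  object Deferred { def apply[A](a: => A): Deferred[A] = new Deferred(() => a) }

  def main(args: Array[String]): Unit = {
    var rightRan = false
    def runLeft(): Option[Int] = None // simulate the left half failing
    def runRight(): Option[Int] = { rightRan = true; Some(2) }

    val right = Deferred(runRight()) // constructing the node runs nothing
    val combined = runLeft().flatMap(l => right.value.map(l + _))

    assert(combined.isEmpty && !rightRan) // right half never evaluated
    println(rightRan)
  }
}
```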
val rightSize = size - leftSize
runHalf(leftSize, idx)
  .flatMap { left =>
    val right = runHalf(rightSize, idx + leftSize)
same comment about the possibility to short circuit the right with Eval.defer(runHalf(rightSize, idx + leftSize)), but that would add cost to the cases where we do need to evaluate everything, which is probably the most common case.
So my read based on #4498 (comment) is that for a benchmark on Vector with a stack safe monad (Eval) that short circuits 30% of the way through the Vector (by throwing, which is highly suspect to me, because that is an effect that Eval isn't designed to deal with), we are currently 12% slower. I think the benchmark isn't one we should care about, but instead maybe

As to why this benchmark is currently slower: as the benchmark is written, in the new PR we fully iterate over the entire Vector building a chain of map2, which in Eval is just implemented the naive way with flatMap. So, we iterate the entire vector for sure. Then we call .value and evaluate the Eval. That process will start running the Eval and throw when it finally gets to the index that throws.

In the old code (the code that builds the tree up), we build a tree by cutting the length (which we get in an O(1) call) in halves recursively. Importantly, this process does not iterate over the Vector; it is just splitting the index, basically making a binary tree of indexes to call inside an Eval later. Then, using an extra layer of Eval (to use map2Eval), we start evaluating the function calls left to right. So, in this implementation we only access the first 30% of the indices before we hit the call to f that throws.

Put that way, I think what this means is that for Vector, short circuiting via the tree is actually faster than directly using map2. This is possibly because iterating the vector may have to fetch from memory not in cache, and we have to do those fetches in the new code, but in the old code we only fetch an index (possibly incurring some cache miss) when we are sure we will require the function to be evaluated.

I think perhaps the best call is to simply not do this optimization for traverse for Vector and keep the current code for that particular function. Seems that would be a quick change, and then I would be happy to merge.
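The index-splitting idea described above can be sketched standalone, with Option standing in for a short-circuiting G (the function and variable names are mine, not the actual cats implementation): the tree halves index ranges in O(1) per split, and elements past the failing index are never fetched from the Vector.

```scala
object TreeSplitSketch {
  // Recursively split the index range; v(i) is only touched at the leaves.
  def runHalf[A, B](v: Vector[A], start: Int, size: Int)(f: A => Option[B]): Option[Vector[B]] =
    if (size == 0) Some(Vector.empty)
    else if (size == 1) f(v(start)).map(Vector(_))
    else {
      val leftSize = size / 2
      runHalf(v, start, leftSize)(f).flatMap { left =>
        // With a short-circuiting G, a failure on the left means the
        // right half's elements are never accessed.
        runHalf(v, start + leftSize, size - leftSize)(f).map(left ++ _)
      }
    }

  def main(args: Array[String]): Unit = {
    var accessed = 0
    val v = Vector.tabulate(10)(identity)
    val out = runHalf(v, 0, v.length) { i =>
      accessed += 1
      if (i == 3) None else Some(i * 2) // fail ~30% of the way through
    }
    assert(out.isEmpty && accessed == 4) // only indices 0..3 were touched
    println(accessed)
  }
}
```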
btw: I was the person who implemented the tree style traverse in cats (having noticed it by starting with the TreeList datastructure I added to cats-collections: https://github.com/typelevel/cats-collections/blob/master/core/src/main/scala/cats/collections/TreeList.scala).

I now wonder if the implementation of two strategies in traverseViaChain actually is helping... I can't see why at the moment. There is a tree portion, and then a linear portion; the linear portion is currently set at 128 wide. The only motivation I can see for the linear portion is that it may be cheaper to build a List than a Chain, but I don't recall if I benchmarked that.

For any future readers, benchmarking that code with a different width setting (say 32, 64, 128 (current), and 256) might be interesting, along with a version that is a tree all the way down (similar to the implementation of traverse_ in Vector). If it is very close, it would be worth simplifying the code to only use the binary index tree algorithm for sure.
I can absolutely do that. Just for my own benefit, I'm curious to understand the preference for keeping the performance in the exception-throwing case and sacrificing the improvement in the happy path case?
Actually sorry, I misread the plots. I was looking at this PR vs traverseDirectlyVector I think (the second set of metrics), but the relevant set is this PR vs main, which was the first set of metrics.

Okay, yeah, it seems there are a lot of wins. It seems the success case is about as much faster as the error case is slower, and since success should be more common than failure, it seems like we should merge. We can merge as is, IMO. Shall I click merge? Do we need to wait for anyone else?
Thank you! I'm happy to merge 😆 I think @valencik was as well? Also sorry that there are so many versions of the benchmarks - it's gotten quite confusing.
LET'S DO IT!! 🎉 🚀 🐱
This addresses the StackSafeMonad part of #4408. I thought that there were enough changes in this to merit a separate PR for the Defer part.

Open questions (will edit these out of the description later):

- The TraverseFilter laws hard-code Option. How do we test that the new branching implementation is lawful?

The following benchmarks were run on an AWS t3.xlarge:
Baseline (b08196e)
Chain (2d5f4d7)
Vector (702ab8b)