Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Concurrency entity for worker #1405

Closed
wants to merge 13 commits into from

Conversation

shijiesheng
Copy link
Member

@shijiesheng shijiesheng commented Nov 21, 2024

What changed?

  • added worker package for modularity
  • added resizable Permit entity wrapping the underlying implementation; these permits are synchronization primitive for concurrency control over poller and tasks processing goroutines
  • wrapped buffered channel with Permit interface for task concurrency.
  • removed poller autoscaler methods

Why?

add dynamic params component towards worker auto configuration
Screenshot 2024-11-21 at 11 24 19 AM

How did you test it?

unit test
integration test

Potential risks

Copy link

codecov bot commented Nov 21, 2024

Codecov Report

Attention: Patch coverage is 65.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 82.40%. Comparing base (e3802b7) to head (0b11e7d).

Files with missing lines Patch % Lines
internal/worker/channel_permit.go 0.00% 21 Missing ⚠️
Files with missing lines Coverage Δ
internal/internal_poller_autoscaler.go 92.39% <100.00%> (-0.32%) ⬇️
internal/internal_worker_base.go 83.88% <100.00%> (+1.93%) ⬆️
internal/worker/resizable_permit.go 100.00% <100.00%> (ø)
internal/worker/channel_permit.go 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3802b7...0b11e7d. Read the comment docs.

Copy link
Member

@taylanisikdemir taylanisikdemir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look safe as long as the underlying semaphore library is not buggy. It will be the critical component determining concurrency controls of client SDK. I recommend deep diving on that to understand the implementation, check for potential deadlock cases and write comprehensive unit/concurrency tests for the wrapper permit implementation

internal/worker/dynamic_params.go Outdated Show resolved Hide resolved
internal/worker/dynamic_params.go Outdated Show resolved Hide resolved
internal/worker/dynamic_params.go Outdated Show resolved Hide resolved
internal/worker/dynamic_params.go Outdated Show resolved Hide resolved
internal/internal_worker_base.go Outdated Show resolved Hide resolved
internal/internal_worker_base.go Outdated Show resolved Hide resolved
internal/worker/dynamic_params.go Outdated Show resolved Hide resolved
@shijiesheng shijiesheng changed the title add DynamicParams for worker add Concurrency entity for worker and replace implementation for poller request channel Nov 26, 2024
@shijiesheng shijiesheng changed the title add Concurrency entity for worker and replace implementation for poller request channel add Concurrency entity for worker Nov 26, 2024
@@ -74,7 +77,7 @@ func (p *permit) AcquireChan(ctx context.Context, wg *sync.WaitGroup) <-chan str
}
select { // try to send to channel, but don't block if listener is gone
case ch <- struct{}{}:
default:
case <-time.After(10 * time.Millisecond): // wait time is needed to avoid race condition of channel sending
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found a race condition of channel sending. Adding a wait time would be more reliable

Copy link
Member

@Groxx Groxx Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm :\ unfortunately this means that if processing is delayed by 10ms it will block the chan-reader forever. that's not too unlikely with big CPU spikes, and definitely not impossible.

tbh I think that might rule this impl out entirely. though I think it's possible to build a AcquireChan(...) (<-chan struct{}, cancel func()) that doesn't have this issue, and that might be worth doing.

or we might have to embrace the atomic-like behavior around this and add retries to (*baseWorker).runPoller / anything using AcquireChan. that wouldn't be a fatal constraint afaict, though it's not ideal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we expect chan-reader to always consume from the returned ch unless ctx is canceled then we can replace this goroutine implementation with

defer wg.Done()
if err := p.sem.Acquire(ctx, 1); err != nil { 
   return // assuming Acquire only returns err if ctx.Done
}
select {
case ch <- struct{}{}:
case <-ctx.Done():
  p.sem.Release(1)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed through Taylan's suggestion. It should be safe now.

@Groxx
Copy link
Member

Groxx commented Nov 26, 2024

Reading through marusama/semaphore/v2:

  • yep, looks likely correct to me. so 👍 for that library.
  • some cautions worth noting:
    • golang.org/x/sync/semaphore does FIFO acquiring, and:
      • it probably performs MUCH better under heavy load, as marusama wakes up all waiters every time and they all race to acquire
      • it is strictly FIFO, which prevents starvation of >1 acquires, which marusama does not prevent
      • x/sync/semaphore does not resize, so it's not really an option anyway
    • there are very clear uint32 overflow possibilities that it does not check
      • many of these will likely cause an invalid state (e.g. count underflow) and panic, but I suspect it's not guaranteed, and I'm pretty confident it's not guaranteed that it'll panic somewhere ~immediately.

So... I think we're probably fine. For pollers we won't run into any of these in practice.
It's probably worth leaving some comments on our use though, to make sure ^ those are checked before using it elsewhere, and before using it with any un-checked config (e.g. a user sets 2 billion concurrent as "unlimited", and behavior is undefined).

In the meantime we should probably get some integer overflow checks added to it, and a new release. There's no reason to allow those to occur.

Copy link
Member

@Groxx Groxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm gonna have to block this for https://github.com/cadence-workflow/cadence-go-client/pull/1405/files#r1859368368 since that's potentially fatal.

Overall though:

  • library looks fine (more details above)
  • changes look reasonable
    • four-layer-deep bw.concurrency.TaskPermit.AcquireChan accesses are a bit dubious in principle, but I think they make sense here. the added structure/layers seems useful.
    • wrapper as a whole seems good to have, though do we have any plans to do non-1 values? I'd get rid of count if not, but I suppose that depends on the "resource". (but see library comment for risks there, e.g. we definitely cannot use it for "bytes" in many cases)
  • I have not checked the tests in detail but the high level approach seems useful. might need some fine-tuning, but it's reasonably "meaningful" and that's a very solid start.

I'm... not entirely sure what to do to resolve the core "resizable semaphore but we also need chans" issue tbh. We might end up needing / badly-wanting chans, so that may be important. I suspect there's some other library, but I haven't hunted for one or thought too much on what it'd need to be correct 🤔

@shijiesheng
Copy link
Member Author

shijiesheng commented Nov 27, 2024

I think I'm gonna have to block this for https://github.com/cadence-workflow/cadence-go-client/pull/1405/files#r1859368368 since that's potentially fatal.

Overall though:

  • library looks fine (more details above)

  • changes look reasonable

    • four-layer-deep bw.concurrency.TaskPermit.AcquireChan accesses are a bit dubious in principle, but I think they make sense here. the added structure/layers seems useful.
    • wrapper as a whole seems good to have, though do we have any plans to do non-1 values? I'd get rid of count if not, but I suppose that depends on the "resource". (but see library comment for risks there, e.g. we definitely cannot use it for "bytes" in many cases)
  • I have not checked the tests in detail but the high level approach seems useful. might need some fine-tuning, but it's reasonably "meaningful" and that's a very solid start.

I'm... not entirely sure what to do to resolve the core "resizable semaphore but we also need chans" issue tbh. We might end up needing / badly-wanting chans, so that may be important. I suspect there's some other library, but I haven't hunted for one or thought too much on what it'd need to be correct 🤔

Double checked. We don't do non-one values in acquire at any place. I'll remove it in the next PR. Some of the pollerAutoScaler interface methods are redundant and thus can be removed.

Comment on lines 67 to 84
// AcquireChan returns a permit ready channel. Similar to Acquire, but non-blocking.
// Remember to call Release(1) to release the permit after usage
func (p *permit) AcquireChan(ctx context.Context, wg *sync.WaitGroup) <-chan struct{} {
ch := make(chan struct{})
wg.Add(1)
go func() {
defer wg.Done()
if err := p.sem.Acquire(ctx, 1); err != nil {
return
}
select { // try to send to channel, but don't block if listener is gone
case ch <- struct{}{}:
case <-ctx.Done():
p.sem.Release(1)
}
}()
return ch
}
Copy link
Member

@Groxx Groxx Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. this is possible to use correctly, and doesn't leak goroutines beyond ctx.Done() which is good (and possibly good enough), but it still leaves the chan permanently blocking in some cases.

that's correct as long as the AcquireChan(...) caller also listens to the same / a derived ctx.Done() and stops using the chan if that occurs. and ideally also cancels the context when it stops reading in all other branches.

func (bw *baseWorker) runPoller() { does this currently, because bw.shutdownCh closes when bw.limiterContext is canceled and there effectively is no timeout, but it feels kinda error-prone 🤔
and it also feels like there might be an alternative that isn't as risky... though I'm not yet sure what that might be.

Copy link
Member

@Groxx Groxx Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From trying to build some alternatives, and thinking about it some more, two thoughts:

1: "Permanently block if the context is timed out" is I think the correct choice, because the alternative is to close the chan and that can easily lead to infinite loops + might be interpreted as "no limit". So you'd also need a "detect if closed" check everywhere all the time, rather than just adding the case that is needed to know when to stop. So 👍 that's fine here.

2: With resizing being allowed, I'm growing convinced that either "one-use chan + goroutine + release func if not read, per AcquireChan call" or "a background maintenance goroutine" is unavoidable. Limit and count changes have to be synchronized with reads so reads can't be buffered, but we can't tell if a reader is gone forever or just delayed so we can't pair releases up with acquires synchronously, so we need some kind of buffer somewhere else.

So... this might work, though the "correct use requires passing a context that you also wait on in the same select, and also cancel when you stop reading" detail is still striking me as a moderate footgun.
E.g. the core runPoller loop is only safe right now because there is both no timeout and no other branch in the select - if we add a branch, it'll leak goroutines ~forever every time it takes that other branch, and also block shutdown.

But that is something we could document thoroughly and probably not run into. And using murasama/semaphore/v2 does make it a much simpler implementation than doing it by hand.


Mind if I sit on this over the long weekend, and maybe we can grab others / discuss an alternative I made that's a bit more misuse-resistant? With some careful docs I think this is acceptable and looks functionally-correct, but I'm not entirely sure it's worth keeping...

@Groxx
Copy link
Member

Groxx commented Dec 2, 2024

To stick some slack-chatting publicly:

  • I'm game to approve with some detailed docs and probably a change to the main loop to prevent future risks (i.e. also select on the limiterContext + cancellation, or docs so future reviewers catch it)
  • sounds like we might be dropping the attempt in general, not needing a dynamically-resizing Permit. if so this is all sorta moot, but I'm happy to review if not.

Comment on lines -27 to -30
// Acquire X ResourceUnit of resource
Acquire(ResourceUnit) error
// Release X ResourceUnit of resource
Release(ResourceUnit)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these methods are unnecessary after Permit introduction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this removed, this is basically a "periodically update a cache and broadcast" thing, rather than a resource-controller.

which seems fine, just checking that that's what is needed. without anything using the hooks or GetCurrent() I'm kinda struggling to figure out what's needed eventually vs something we should get rid of while we can.

@shijiesheng
Copy link
Member Author

To stick some slack-chatting publicly:

  • I'm game to approve with some detailed docs and probably a change to the main loop to prevent future risks (i.e. also select on the limiterContext + cancellation, or docs so future reviewers catch it)
  • sounds like we might be dropping the attempt in general, not needing a dynamically-resizing Permit. if so this is all sorta moot, but I'm happy to review if not.

correct, I dropped the resizable permit implementation for task permit and still uses channel. So the controversial error-prone async AcquireChan implementation is no longer needed.

The plan is to add guardrail on resources (CPU/memory) for concurrencies so users can increase the max limit safely.

Comment on lines +41 to +48
func (p *channelPermit) Acquire(ctx context.Context) error {
select {
case <-ctx.Done():
return fmt.Errorf("failed to acquire permit before context is done")
case p.channel <- struct{}{}:
return nil
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty strong sign that there need to be more tests: without a GetChan() reader, this will never succeed (and then it'll allow too many things / may block release permanently)

Count() int
Quota() int
Release()
SetQuota(int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems worth splitting this out so you don't have to no-op it. It isn't even needed - this all compiles fine with SetQuota only defined on a non-channel permit.

Comment on lines +171 to +174
concurrency := &worker.ConcurrencyLimit{
PollerPermit: worker.NewResizablePermit(options.pollerCount),
TaskPermit: worker.NewChannelPermit(options.maxConcurrentTask),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently these are two quite different things, but they're sharing a "Permit" type (sharing the name is fine, their semantics are similar)

All they have in common is

Quota() int
Release()

because one is change-able and one is not, one uses chans and one does not.

Is there a need to combine their APIs in Permit? seems like it's just making compile-time guarantees weaker for no benefit.

@shijiesheng shijiesheng closed this Dec 3, 2024
@shijiesheng
Copy link
Member Author

close this PR as it has so many questions and risks

@shijiesheng
Copy link
Member Author

reworked in #1410

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants