-
Notifications
You must be signed in to change notification settings - Fork 132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create separate worker usage data collection and move hardware emit there #1293
base: master
Are you sure you want to change the base?
Conversation
8973aec
to
f02ecae
Compare
|
||
func (w *workerUsageCollector) Start() { | ||
w.wg.Add(1) | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to spawn a goroutine per worker? Why not ensure only 1 running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only the hardware emitting is once per host, all other metrics will be worker-specific. (e.g activity poll response vs. decision poll response)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I see only w.collectHardwareUsage() which will just spawn bunch of data into the same scope. I would suggest separating hardware emitter and worker specific metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's current design, for each type of metrics based on their origin, I will create a separate gorountine for each of them. But they would be contained under a single workerusagecollector so that their result can be collected and sent in one place
…adence-workflow#1270) Enable client side estimated history size exposure via API
case <-ticker.C: | ||
// Given that decision worker and activity worker are running in the same host, we only need to collect | ||
// hardware usage from one of them. | ||
if w.workerType == "DecisionWorker" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might not be future proof and also if customer is running separate processes for decision and activity workers then we will not have the hardware usage of those hosts that only runs activity workers. we should also not create no-op workerUsageCollector
s if only one of them will do the work.
@Groxx what would be your recommendation for host level metric reporting on the client side? I would like to avoid global static variables but this use case probably requires one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tried Sync.Once before, but that would cause issues with unit testing as it will just wait indefinitely for this routine to stop while blocking all other goroutine from closing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can override it in the unit tests
type once interface {
Do(func())
}
var collectHardwareUsageOnce once
in typical startup this would be set to sync.Once
:
collectHardwareUsageOnce = sync.Once{}
in test code you can initialize this to a fake implementation
collectHardwareUsageOnce = myFakeOnce{} // myFakeOnce implements Do(func())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion. I have implemented that in the latest commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see EmitOnce
being used in workerUsageCollector. We should only have one (singleton) instance of workerUsageCollector
which would be lazily created by the first worker instance. Rest of the workers would create a noOpUsageCollector. This lazy initialization logic should be hidden from workers. Worker just calls newWorkerUsageCollector()
and that function should determine whether it's first time or not. Let's discuss offline if more clarification needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our usecase, only the hardware info are once per host collected. Other worker type (decision worker and activity worker) should have different workerUsageCollector as they track different task type behaviors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what type of information are you planning to collect per worker basis in this workerUsageCollector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tasklist backlog/poll response since decision and activity worker have their own pollers and that need to be scaled independently
zap.String(tagPanicStack, st)) | ||
} | ||
}() | ||
defer w.wg.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are a few things problematic about this goroutine closure
- this
wg.Done()
will be called once goroutine forgo w.runHardwareCollector()
is started. It shouldn't be marked as done untilrunHardwareCollector()
terminates so should be moved there - no need for a panic recovery here
- no need for a goroutine to invoke
runHardwareCollector
.
case <-ticker.C: | ||
// Given that decision worker and activity worker are running in the same host, we only need to collect | ||
// hardware usage from one of them. | ||
if w.workerType == "DecisionWorker" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what type of information are you planning to collect per worker basis in this workerUsageCollector
?
|
||
// Optional: This implementation ensures that a specific function is executed only once per instance. | ||
// The mechanism can be overridden by other interfaces that implement the 'Do()' method. | ||
// | ||
// default: nil, that would ensure some functions are executed only once | ||
Sync oncePerHost |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is user visible worker options. we shouldn't expose oncePerHost
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do not exposed that here, we might need to insert that as part of the workerExecutionParameters
and add the Sync.Once as part of the parameter into function "NewWorker" and "newAggregatedWorker". What do you think about that idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it makes sense to add to workerExecutionParameters
as it is not exposed outside. I'd also recommend looking at WithSomeOption
pattern in go https://www.sohamkamani.com/golang/options-pattern/.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files
Continue to review full report in Codecov by Sentry.
|
What changed?
Create a dedicated worker usage collector
Move hardware usage emitting functionality from base worker to the worker usage collector
Why?
We want to create a separate component responsible for collecting worker usage rather than a huge code block in the base worker.
How did you test it?
Tested locally as well as tested in staging env to ensure metrics consistency and no goroutine leak.
Potential risks
Instead of using Sync.Once to ensure the goroutine is run once per host, we move the hardware emitting to decision worker only as the Sync.Once might cause test timeout as it would keep other goroutine wait until the current one returns. So if a host does not have decision worker (impossible at the moment), it's hardware metrics won't be emitted.