Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance: avoid too many gorountines when calc distance #33770

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

CocaineCong
Copy link

the pr is resubmit for this #32776 pr

sorry for my mistake to close the pr #32776 and I can't reopen it again... so I have to resubmit this new pr 😭

I'm not sure if that's the right way to modify it ?

or use channel to control the number of goroutines instead of ants pool, like following:

bufSize := int(leftNum / 3)
ch := make(chan struct{}, bufSize)
for i := int64(0); i < leftNum; i++ {
waitGroup.Add(1)
ch <- struct{}{}
go func(i int64) {
	defer waitGroup.Done()
	CalcFFBatch(dim, left, i, right, metricUpper, &distArray)
	<-ch
    }(i)
}

@sre-ci-robot sre-ci-robot added the size/S Denotes a PR that changes 10-29 lines. label Jun 11, 2024
@mergify mergify bot added the needs-dco DCO is missing in this pull request. label Jun 11, 2024
Copy link
Contributor

mergify bot commented Jun 11, 2024

@CocaineCong Thanks for your contribution. Please submit with DCO, see the contributing guide https://github.com/milvus-io/milvus/blob/master/CONTRIBUTING.md#developer-certificate-of-origin-dco.

@mergify mergify bot added the kind/feature Issues related to feature request from users label Jun 11, 2024
@CocaineCong CocaineCong changed the title feat:avoid too many goroutines by ants pool enhance: avoid too many gorountines when calc distance Jun 11, 2024
@mergify mergify bot added the kind/enhancement Issues or changes related to enhancement label Jun 11, 2024
@mergify mergify bot added dco-passed DCO check passed. and removed needs-dco DCO is missing in this pull request. labels Jun 11, 2024
Copy link
Contributor

@chasingegg chasingegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! It will be nice of you to show some performance numbers under very large batch sizes between old way with the new approach and add it to UT

pkg/util/distance/calc_distance.go Outdated Show resolved Hide resolved
waitGroup.Done()
}
// avoid too many goroutines by ants pool
poolSize := int(leftNum / 3)
pool, err := ants.NewPoolWithFunc(poolSize, calcWorker)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to milvus pkg/util/conc/pool.go, where wraps the ants.pool, you could use the pool by referring to other usages.
Also make this pool as static, not init it every time when we compute distances.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i got it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey, @chasingegg last week I was used sync.Once to make the pool init once. just like that:

var (
	calcPool         *conc.Pool[any]
	calcPoolInitOnce = new(sync.Once)
)

func initCalcPool() {
	calcPool = conc.NewDefaultPool[any]()
}

func GetCalcPool() *conc.Pool[any] {
	calcPoolInitOnce.Do(initCalcPool)
	return calcPool
}

but I find it will going deep into a dead cycle when I run the unit test.

I haven't found the reason for the dead cycle, so I have to make the new pool first in every time.

Copy link
Contributor

@chasingegg chasingegg Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I ask how can I reproduce your dead cycle problem? which unit test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I ask how can I reproduce your dead cycle problem? which unit test?

You can reproduce it by following these steps:

  1. copy the sync.Once code to make the pool init once.
image
  1. replace pool := conc.NewDefaultPool[any]() to pool := GetCalcPool()
image
  1. run the ut where in pkg/util/distance/calc_distance_test.go:155 , and find will go to the dead cycle.
image
  1. I'm guessing this would be a problem caused by me, so I'm still debuging this problem. But I've been so busy these past few days that I've procrastinated a lot. I will find out the reason why this weekend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I will also look into it and hopefully could reproduce the problem

@mergify mergify bot added needs-dco DCO is missing in this pull request. and removed dco-passed DCO check passed. labels Jun 12, 2024
Copy link
Contributor

mergify bot commented Jun 12, 2024

@CocaineCong Thanks for your contribution. Please submit with DCO, see the contributing guide https://github.com/milvus-io/milvus/blob/master/CONTRIBUTING.md#developer-certificate-of-origin-dco.

Copy link
Contributor

mergify bot commented Jun 12, 2024

@CocaineCong E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@sre-ci-robot sre-ci-robot added size/M Denotes a PR that changes 30-99 lines. and removed size/S Denotes a PR that changes 10-29 lines. labels Jun 12, 2024
@CocaineCong CocaineCong force-pushed the less_gorountine branch 2 times, most recently from 6254c14 to efd1413 Compare June 12, 2024 17:01
@mergify mergify bot added dco-passed DCO check passed. and removed needs-dco DCO is missing in this pull request. labels Jun 12, 2024
Copy link
Contributor

mergify bot commented Jun 12, 2024

@CocaineCong E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@chasingegg
Copy link
Contributor

/run-cpu-e2e

@sre-ci-robot sre-ci-robot added size/S Denotes a PR that changes 10-29 lines. and removed size/M Denotes a PR that changes 30-99 lines. labels Jun 17, 2024
@CocaineCong
Copy link
Author

@chasingegg PTAL

@chasingegg
Copy link
Contributor

@chasingegg PTAL

Commented

Copy link
Contributor

mergify bot commented Jul 1, 2024

@CocaineCong ut workflow job failed, comment rerun ut can trigger the job again.

@CocaineCong
Copy link
Author

hey, @chasingegg I found the reason why there is a dead loop.
If we use sync.Once for lazy loading, it will be released after the first call and then block continuously.
Therefore, there are three solutions:

  • one is not to release, but it may cause memory issues.
  • two is initialize a pool every time when it goes in.
  • three is don't make changes, keep the status quo.

@sre-ci-robot sre-ci-robot added size/M Denotes a PR that changes 30-99 lines. and removed size/S Denotes a PR that changes 10-29 lines. labels Jul 12, 2024
@mergify mergify bot added the needs-dco DCO is missing in this pull request. label Jul 12, 2024
Copy link
Contributor

mergify bot commented Jul 12, 2024

@CocaineCong Thanks for your contribution. Please submit with DCO, see the contributing guide https://github.com/milvus-io/milvus/blob/master/CONTRIBUTING.md#developer-certificate-of-origin-dco.

@mergify mergify bot removed the dco-passed DCO check passed. label Jul 12, 2024
@CocaineCong
Copy link
Author

@chasingegg PTAL.

CocaineCong and others added 6 commits July 12, 2024 09:48
Signed-off-by: FanOne <[email protected]>
Signed-off-by: FanOne <[email protected]>
Signed-off-by: FanOne <[email protected]>
Signed-off-by: FanOne <[email protected]>
Signed-off-by: FanOne <[email protected]>
Signed-off-by: FanOne <[email protected]>
Signed-off-by: FanOne <[email protected]>
@mergify mergify bot added dco-passed DCO check passed. and removed needs-dco DCO is missing in this pull request. labels Jul 12, 2024
Copy link
Contributor

mergify bot commented Jul 12, 2024

@CocaineCong E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@sre-ci-robot sre-ci-robot added area/dependency Pull requests that update a dependency file area/test sig/testing labels Jul 13, 2024
Signed-off-by: FanOne <[email protected]>
@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CocaineCong
To complete the pull request process, please assign wxyucs after the PR has been reviewed.
You can assign the PR to them by writing /assign @wxyucs in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/dependency Pull requests that update a dependency file area/test dco-passed DCO check passed. kind/enhancement Issues or changes related to enhancement kind/feature Issues related to feature request from users sig/testing size/M Denotes a PR that changes 30-99 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants