
[Bug]: The number of query count * is larger than the number of inserted entities after long run of continuous major compaction to multiple collections #38526

Open
1 task done
binbinlv opened this issue Dec 17, 2024 · 7 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@binbinlv
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20241215-352e51a8
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The result of query(count(*)) is larger than the number of inserted entities after a long run of continuous major compaction.

The number of inserted entities is 10M per collection:

>>> c = Collection("major_compaction_collection_enable_scalar_clustering_key_1kw")
>>> c.query("count>=0", output_fields=["count(*)"])
data: ["{'count(*)': 10790000}"]
>>>
>>> c = Collection("major_compaction_collection_enable_scalar_clustering_key_1kw_pk")
>>> c.query("count>=0", output_fields=["count(*)"])
data: ["{'count(*)': 10646961}"]
>>>

Expected Behavior

The result of query(count(*)) should remain equal to the number of inserted entities after a long run of continuous major compaction.

Steps To Reproduce

1. insert 10M entities (dim = 128) into each of two collections
2. run continuous major compaction on these two collections for >10h
3. check query(count(*))
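The check in step 3 can be sketched with a minimal helper (illustrative only, not part of the pymilvus API; pymilvus query(count(*)) returns a one-element result holding the count):

```python
EXPECTED_ROWS = 10_000_000  # 10M entities inserted per collection

def count_matches(query_result: list, expected: int = EXPECTED_ROWS) -> bool:
    # query(count(*)) yields one row, e.g. [{'count(*)': 10790000}]
    actual = query_result[0]["count(*)"]
    return actual == expected

# Observed results from this issue: both collections fail the check.
assert not count_matches([{"count(*)": 10_790_000}])
assert not count_matches([{"count(*)": 10_646_961}])
```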

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22major-24-ewyem.*%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Anything else?

collection name:

  1. major_compaction_collection_enable_scalar_clustering_key_1kw
  2. major_compaction_collection_enable_scalar_clustering_key_1kw_pk
@binbinlv binbinlv added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 17, 2024
@binbinlv binbinlv added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 17, 2024
@binbinlv binbinlv added this to the 2.5.0 milestone Dec 17, 2024
@binbinlv binbinlv added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 17, 2024
@binbinlv
Contributor Author

/assign @xiaocai2333

@binbinlv
Contributor Author

/unassign @yanliang567

@binbinlv binbinlv modified the milestones: 2.5.0, 2.4.18 Dec 17, 2024
@binbinlv binbinlv changed the title [Bug]: The number of query count * is larger than the number of inserted entities after long run of continuous major compaction [Bug]: The number of query count * is larger than the number of inserted entities after long run of continuous major compaction to multiple collections Dec 17, 2024
@binbinlv
Contributor Author

binbinlv commented Dec 17, 2024

And the number of executed plans changed from 16 to 15 for the partition-key collection after 10h+ of major compaction:
[screenshot]

@xiaocai2333
Contributor

[screenshot]

When markInputSegmentsDropped fails, the error is not returned: the task is marked as successful, but the input segments remain in place. Since the input segments stay in the compacting state, they block the generation of subsequent clustering compaction tasks. That is why one clustering compaction task was missing.

[2024/12/16 13:32:07.063 +00:00] [WARN] [datacoord/compaction_task_clustering.go:341] ["mark input segments as Dropped failed, skip it and wait retry"] [planID=454647792551654805] [error="fail to update meta in clustering compaction[operation=markInputSegmentsDropped UpdateSegmentsInfo]: context deadline exceeded"]
[2024/12/16 13:32:07.064 +00:00] [INFO] [datacoord/compaction_task_clustering.go:90] ["clustering compaction task state changed"] [triggerID=454647792551654804] [PlanID=454647792551654805] [collectionID=454644445494935234] [lastState=indexing] [currentState=completed] ["elapse seconds"=529]
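The failure mode described above can be sketched in a few lines (a simplified model, not Milvus source code; names and structures are illustrative):

```python
# The drop step swallows its error, so the caller marks the task completed
# while the input segments are still alive and still flagged "compacting".

def mark_input_segments_dropped(segments, meta_update_ok):
    if not meta_update_ok:
        # Error is logged but not propagated to the caller.
        print("WARN: mark input segments as Dropped failed, skip it and wait retry")
        return
    for seg in segments:
        seg["state"] = "Dropped"

def complete_clustering_task(task):
    mark_input_segments_dropped(task["inputs"], task["meta_update_ok"])
    task["state"] = "completed"  # marked successful even when the drop failed

task = {"inputs": [{"id": 1, "state": "compacting"}], "meta_update_ok": False}
complete_clustering_task(task)
# Task is "completed", yet the input segment was never dropped.
```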

Later, due to a timeout when connecting to etcd, the lease renewal failed, causing mixcoord to restart. This reset the compacting state of the input segment, which resulted in both the input and result segments participating in subsequent clustering compaction, leading to data duplication.

[2024/12/16 18:59:13.312 +00:00] [WARN] [retry/retry.go:104] ["retry func failed, reach max retry"] [attempt=3]
[2024/12/16 18:59:13.312 +00:00] [WARN] [sessionutil/session_util.go:579] ["fail to retry keepAliveOnce"] [serverName=rootcoord] [LeaseID=5914514724381164462] [error="etcdserver: requested lease not found"]
[2024/12/16 18:59:13.312 +00:00] [WARN] [sessionutil/session_util.go:908] ["connection lost detected, shuting down"]
[2024/12/16 18:59:13.312 +00:00] [ERROR] [rootcoord/root_coord.go:277] ["Root Coord disconnected from etcd, process will exit"] ["Server Id"=6] [stack="github.com/milvus-io/milvus/internal/rootcoord.(*Core).Register.func1.1\n\t/workspace/source/internal/rootcoord/root_coord.go:277"]
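The resulting double counting can be illustrated with a toy model of segment visibility (assumptions: count(*) sums rows over non-dropped segments, and the restart clears the compacting flag on the never-dropped input segment):

```python
def total_rows(segments):
    # count(*) sums rows over all segments that were not dropped
    return sum(s["rows"] for s in segments if s["state"] != "Dropped")

inserted = 1_000_000
input_seg = {"rows": inserted, "state": "Flushed"}   # should have been Dropped
result_seg = {"rows": inserted, "state": "Flushed"}  # clustering compaction output

# Correct path: the input segment is dropped, count matches insertions.
assert total_rows([{**input_seg, "state": "Dropped"}, result_seg]) == inserted

# Failure path (this issue): the input segment survives the restart and joins
# the next compaction cycle alongside the result segment, so rows are doubled.
assert total_rows([input_seg, result_seg]) == 2 * inserted
```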

PR #38170 will fix it; it needs to be cherry-picked to 2.4.

@binbinlv
Contributor Author

Verified as fixed on the master branch:

milvus: master-20241219-8fcb33c2-amd64

@binbinlv
Contributor Author

The PR for the 2.4 branch has not been merged yet, so this issue stays open until verification on the 2.4 branch is finished as well.

@binbinlv
Contributor Author

Removing the urgent label first.

@binbinlv binbinlv removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Dec 20, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19, 2.4.20 Dec 24, 2024

3 participants