
[Bug]: The number of query count * is larger than the number of inserted entities after long run of continuous major compaction to multiple collections #38526

Open
1 task done
binbinlv opened this issue Dec 17, 2024 · 7 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@binbinlv
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20241215-352e51a8
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The result of query(count(*)) is larger than the number of inserted entities after a long run of continuous major compaction.

The number of inserted entities is 10M per collection:

>>> c = Collection("major_compaction_collection_enable_scalar_clustering_key_1kw")
>>> c.query("count>=0", output_fields=["count(*)"])
data: ["{'count(*)': 10790000}"]
>>>
>>> c = Collection("major_compaction_collection_enable_scalar_clustering_key_1kw_pk")
>>> c.query("count>=0", output_fields=["count(*)"])
data: ["{'count(*)': 10646961}"]
>>>

Expected Behavior

The result of query(count(*)) should remain equal to the number of inserted entities after a long run of continuous major compaction.

Steps To Reproduce

1. insert 10M entities (dim = 128) into each of two collections
2. run continuous major compaction on these two collections for >10h
3. check query(count(*))
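The check in step 3 can be sketched with a minimal helper (illustrative only, not part of the pymilvus API; pymilvus query(count(*)) returns a one-element result holding the count):

```python
EXPECTED_ROWS = 10_000_000  # 10M entities inserted per collection

def count_matches(query_result: list, expected: int = EXPECTED_ROWS) -> bool:
    # query(count(*)) yields one row, e.g. [{'count(*)': 10790000}]
    actual = query_result[0]["count(*)"]
    return actual == expected

# Observed results from this issue: both collections fail the check.
assert not count_matches([{"count(*)": 10_790_000}])
assert not count_matches([{"count(*)": 10_646_961}])
```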

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22major-24-ewyem.*%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Anything else?

collection name:

  1. major_compaction_collection_enable_scalar_clustering_key_1kw
  2. major_compaction_collection_enable_scalar_clustering_key_1kw_pk
@binbinlv binbinlv added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 17, 2024
@binbinlv binbinlv added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Dec 17, 2024
@binbinlv binbinlv added this to the 2.5.0 milestone Dec 17, 2024
@binbinlv binbinlv added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 17, 2024
@binbinlv
Contributor Author

/assign @xiaocai2333

@binbinlv
Contributor Author

/unassign @yanliang567

@binbinlv binbinlv modified the milestones: 2.5.0, 2.4.18 Dec 17, 2024
@binbinlv binbinlv changed the title [Bug]: The number of query count * is larger than the number of inserted entities after long run of continuous major compaction [Bug]: The number of query count * is larger than the number of inserted entities after long run of continuous major compaction to multiple collections Dec 17, 2024
@binbinlv
Contributor Author

binbinlv commented Dec 17, 2024

And the number of executed plans changed from 16 to 15 for the partition-key collection after 10h+ of major compaction:
[screenshot]

@xiaocai2333
Contributor

[screenshot]

When markInputSegmentsDropped fails, the error is not returned: the task is marked as successful, but the input segments remain in place. Since the input segments stay in the compacting state, they block the generation of subsequent clustering compaction tasks. That is why one clustering compaction task was missing.

[2024/12/16 13:32:07.063 +00:00] [WARN] [datacoord/compaction_task_clustering.go:341] ["mark input segments as Dropped failed, skip it and wait retry"] [planID=454647792551654805] [error="fail to update meta in clustering compaction[operation=markInputSegmentsDropped UpdateSegmentsInfo]: context deadline exceeded"]
[2024/12/16 13:32:07.064 +00:00] [INFO] [datacoord/compaction_task_clustering.go:90] ["clustering compaction task state changed"] [triggerID=454647792551654804] [PlanID=454647792551654805] [collectionID=454644445494935234] [lastState=indexing] [currentState=completed] ["elapse seconds"=529]
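The failure mode described above can be sketched in a few lines (a simplified model, not Milvus source code; names and structures are illustrative):

```python
# The drop step swallows its error, so the caller marks the task completed
# while the input segments are still alive and still flagged "compacting".

def mark_input_segments_dropped(segments, meta_update_ok):
    if not meta_update_ok:
        # Error is logged but not propagated to the caller.
        print("WARN: mark input segments as Dropped failed, skip it and wait retry")
        return
    for seg in segments:
        seg["state"] = "Dropped"

def complete_clustering_task(task):
    mark_input_segments_dropped(task["inputs"], task["meta_update_ok"])
    task["state"] = "completed"  # marked successful even when the drop failed

task = {"inputs": [{"id": 1, "state": "compacting"}], "meta_update_ok": False}
complete_clustering_task(task)
# Task is "completed", yet the input segment was never dropped.
```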

Later, due to a timeout when connecting to etcd, the lease renewal failed, causing mixcoord to restart. This reset the compacting state of the input segment, which resulted in both the input and result segments participating in subsequent clustering compaction, leading to data duplication.

[2024/12/16 18:59:13.312 +00:00] [WARN] [retry/retry.go:104] ["retry func failed, reach max retry"] [attempt=3]
[2024/12/16 18:59:13.312 +00:00] [WARN] [sessionutil/session_util.go:579] ["fail to retry keepAliveOnce"] [serverName=rootcoord] [LeaseID=5914514724381164462] [error="etcdserver: requested lease not found"]
[2024/12/16 18:59:13.312 +00:00] [WARN] [sessionutil/session_util.go:908] ["connection lost detected, shuting down"]
[2024/12/16 18:59:13.312 +00:00] [ERROR] [rootcoord/root_coord.go:277] ["Root Coord disconnected from etcd, process will exit"] ["Server Id"=6] [stack="github.com/milvus-io/milvus/internal/rootcoord.(*Core).Register.func1.1\n\t/workspace/source/internal/rootcoord/root_coord.go:277"]
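The resulting double counting can be illustrated with a toy model of segment visibility (assumptions: count(*) sums rows over non-dropped segments, and the restart clears the compacting flag on the never-dropped input segment):

```python
def total_rows(segments):
    # count(*) sums rows over all segments that were not dropped
    return sum(s["rows"] for s in segments if s["state"] != "Dropped")

inserted = 1_000_000
input_seg = {"rows": inserted, "state": "Flushed"}   # should have been Dropped
result_seg = {"rows": inserted, "state": "Flushed"}  # clustering compaction output

# Correct path: the input segment is dropped, count matches insertions.
assert total_rows([{**input_seg, "state": "Dropped"}, result_seg]) == inserted

# Failure path (this issue): the input segment survives the restart and joins
# the next compaction cycle alongside the result segment, so rows are doubled.
assert total_rows([input_seg, result_seg]) == 2 * inserted
```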

PR #38170 will fix it; it needs to be cherry-picked to 2.4.

@binbinlv
Contributor Author

Verified as fixed on the master branch:

milvus: master-20241219-8fcb33c2-amd64

@binbinlv
Contributor Author

The PR for the 2.4 branch has not been merged yet, so this issue stays open until verification on the 2.4 branch is finished as well.

@binbinlv
Contributor Author

Removing the urgent label first.

@binbinlv binbinlv removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Dec 20, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19, 2.4.20 Dec 24, 2024

3 participants