
[Bug]: inserted data is not flushed; data loss occurs after restarting querynode and datanode #38284

Open
qiqiqi998 opened this issue Dec 6, 2024 · 17 comments

Comments

@qiqiqi998

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.17
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.3.4
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After inserting 12 rows of data, I restarted the querynode and datanode, and the data can no longer be found.

python3 -i test.py

import random
import time
import numpy as np
from pymilvus import connections, CollectionSchema, FieldSchema, DataType, Collection, utility
 
# Milvus connection settings
CLUSTER_ENDPOINT = "10.72.8.192"
PORT = "32086"
USER = "root"
PASSWORD = "Milvus"
COLLECTION = "test_collection"
 
# Connect to Milvus
connections.connect("default", host=CLUSTER_ENDPOINT, port=PORT, user=USER, password=PASSWORD)
 
# Create the collection
def init_db():
    if utility.has_collection(COLLECTION):
        print(f"Collection '{COLLECTION}' already exists.")
        return
    fields = [
        # note: max_length applies only to VARCHAR fields, so it is dropped for the INT64 pk
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema(name="random", dtype=DataType.DOUBLE),
        FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=5)
    ]  # define the collection's fields
    schema = CollectionSchema(fields=fields, description="Test collection")
    collection = Collection(name=COLLECTION, schema=schema, consistency_level="Strong")  # create the collection
    index = {
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128},
    }
    collection.create_index("embeddings", index)
    print(f"Collection '{COLLECTION}' created successfully.")
 
# Insert vectors
def insert_vectors(num_vectors):
    collection = Collection(name=COLLECTION)
    batch_size = 10
    for i in range(0, num_vectors, batch_size):
        current_batch_size = min(batch_size, num_vectors - i)
        embeddings = np.random.rand(current_batch_size, 5).tolist()
        entities = [
            [j for j in range(i, i + current_batch_size)],  # pk field
            np.random.rand(current_batch_size).tolist(),    # random field: random doubles
            embeddings                                      # embeddings field: randomly generated vectors
        ]
        collection.insert(entities)
        # collection.flush()  # intentionally skipped: the data stays in a growing segment
        print(f"Inserted a batch of {current_batch_size} vectors.")
 
# Query vectors
def query_vectors(num_ids):
    collection = Collection(name=COLLECTION)
    collection.load()
    ids = list(range(num_ids))
    missing_ids = []
    for i in range(0, num_ids):  # query the data one ID at a time
        results = collection.query(expr=f"pk in {ids[i:i+1]}", output_fields=["pk"])
        if not results:
            missing_ids.append(ids[i])
    if not missing_ids:
        print("Success: all IDs were found")
    else:
        print(f"IDs not found: {missing_ids}")
 
# Delete all vectors
def delete_all():
    collection = Collection(name=COLLECTION)
    collection.delete(expr="pk != -1")
    print(f"All vectors in collection '{COLLECTION}' have been deleted.")
 
# Drop the collection
def drop_all():
    utility.drop_collection(f"{COLLECTION}")
    print(f"Collection '{COLLECTION}' has been dropped.")
 
 
print('''
    COLLECTION = "test_collection"
    init_db()           # initialize the database
    insert_vectors(12)  # insert 12 random vectors
    time.sleep(1)       # wait 1 second
    query_vectors(12)   # query the first 12 vectors
    query_vectors(13)   # query the first 13 vectors
    delete_all()        # delete all vectors
    drop_all()          # drop the collection
''')
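
For reference, a minimal sketch of the flushed variant, reusing COLLECTION and insert_vectors() from the script above (the helper name insert_vectors_flushed is hypothetical); flush() seals the growing segment and persists it, which is exactly the call the repro comments out:

def insert_vectors_flushed(num_vectors):
    # same insertion path as above, but with the flush the repro skips
    insert_vectors(num_vectors)
    collection = Collection(name=COLLECTION)
    collection.flush()  # seal growing segments so the rows survive node restarts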

Expected Behavior

The inserted data can still be queried after the restart.

Steps To Reproduce

1. Insert 10 rows of data.
2. Query.
3. Restart the querynode and datanode (a restart sketch for a Helm deployment follows below).
4. Query again.
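
For a Helm deployment (as the reporter mentions later in the thread), the restarts in step 3 can be done with kubectl; a sketch assuming a release named my-release (adjust the deployment names to your release):

kubectl rollout restart deployment my-release-milvus-querynode
kubectl rollout restart deployment my-release-milvus-datanode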

Milvus Log

No response

Anything else?

No response

@qiqiqi998 qiqiqi998 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 6, 2024
@qiqiqi998
Author

The newly inserted data can be queried.

@yanliang567
Contributor

@zhuwenxing
Could you please help reproduce the issue?

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2024
@xiaofan-luan
Collaborator

this is very common, because Milvus runs approximate nearest neighbor search and there is no guarantee that all data can be searched.
If you want strict matching:

  1. disable the growing segment index
  2. use a FLAT index (sketched below)
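
A minimal sketch of this suggestion, reusing the collection from the repro script (release/drop_index/create_index is one way to swap the index; the parameters are illustrative):

from pymilvus import Collection

# Replace the approximate IVF_FLAT index with FLAT, which performs exact brute-force search.
collection = Collection(name="test_collection")
collection.release()      # the index cannot be changed while the collection is loaded
collection.drop_index()
collection.create_index("embeddings", {"index_type": "FLAT", "metric_type": "L2", "params": {}})
collection.load()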

@qiqiqi998
Author

this is very common, because Milvus runs approximate nearest neighbor search and there is no guarantee that all data can be searched. If you want strict matching:

1. disable the growing segment index

2. use a FLAT index

@xiaofan-luan
I don't think this is normal:

  1. Using FLAT gives the same result.
  2. The result of count is 0 (a count(*) sketch follows below).
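
For anyone verifying the count, Milvus 2.3+ supports a count(*) query from pymilvus; a sketch against the repro collection:

from pymilvus import Collection

collection = Collection("test_collection")
collection.load()
print(collection.query(expr="", output_fields=["count(*)"]))  # e.g. [{'count(*)': 0}] after the restart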

@yanliang567
Contributor

/assign @congqixia
Please help take a look. When I try to flush the collection after reboot, the growing segment becomes dropped. I think this is not correct.
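
A quick way to observe segment states from the client; a sketch using pymilvus' get_query_segment_info, which reports segments as seen by the querynodes:

from pymilvus import Collection, utility

collection = Collection("test_collection")
collection.flush()  # force-seal any growing segments
for seg in utility.get_query_segment_info("test_collection"):
    print(seg)  # each entry includes the segment ID, state, and row count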

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Dec 9, 2024
@yanliang567 yanliang567 added this to the 2.4.18 milestone Dec 9, 2024
@xiaofan-luan
Collaborator

/assign @congqixia please help take a look. When I try to flush the collection after reboot, the growing segment becomes dropped. I think this is not correct.

If all data is deleted, why couldn't a segment become dropped? Maybe it was simply because compaction happened.

@qiqiqi998
Author

/assign @congqixia please help take a look. When I try to flush the collection after reboot, the growing segment becomes dropped. I think this is not correct.

If all data is deleted, why couldn't a segment become dropped? Maybe it was simply because compaction happened.

I did not delete any data; I just restarted the querynode and datanode, and the data in the growing segment was lost.

@congqixia
Contributor

Found the reason why standalone could reproduce this problem

[2024/12/09 07:23:35.278 +00:00] [WARN] [server/rocksmq_impl.go:883] ["RocksMQ: trying to seek to no exist position, reset current id"] [topic=yanliang-2417-rootcoord-dml_5] [group=datanode-4-yanliang-2417-rootcoord-dml_5_454485287722142520v0-true] [msgId=454486637675544581]

RocksMQ failed to seek to the correct checkpoint and used the latest position instead, which means some data may be lost when using RocksMQ.

@qiqiqi998 since we could only reproduce this with a standalone+rocksmq setup, could you please verify whether the actual Milvus instance setup was a cluster or not?

@qiqiqi998
Author

Found the reason why standalone could reproduce this problem

[2024/12/09 07:23:35.278 +00:00] [WARN] [server/rocksmq_impl.go:883] ["RocksMQ: trying to seek to no exist position, reset current id"] [topic=yanliang-2417-rootcoord-dml_5] [group=datanode-4-yanliang-2417-rootcoord-dml_5_454485287722142520v0-true] [msgId=454486637675544581]

RocksMQ failed to seek to the correct checkpoint and used the latest position instead, which means some data may be lost when using RocksMQ.

@qiqiqi998 since we could only reproduce this with a standalone+rocksmq setup, could you please verify whether the actual Milvus instance setup was a cluster or not?

cluster + pulsar, deployed via Helm

@yanliang567
Contributor

@qiqiqi998 could you please collect the Milvus and Pulsar logs from when you restart the querynode and datanode?
Please refer to this doc to export the whole Milvus logs for investigation.

@qiqiqi998
Author

@qiqiqi998 could you please collect the Milvus and Pulsar logs from when you restart the querynode and datanode? Please refer to this doc to export the whole Milvus logs for investigation.

logs.tar.gz

@congqixia
Contributor

@qiqiqi998 the Info-level log looks fine and there are no suspicious entries in the log or screenshot. I was wondering if you could reproduce at DEBUG log level and provide a birdwatcher backup with https://github.com/milvus-io/birdwatcher/releases/tag/v1.0.7

@xiaocai2333
Contributor

It seems this log is from before the restart, because there are insertion messages.

[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=7e58b760e9e4f338b4202d8ca2b5be2d] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":10,\"ExpireTime\":454487823327428613}]"]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=3a5381e65cd5b729003c318fc50cd50f] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487823340535810}]"]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=772df0d79cdca9846e4c37a2cc8da011] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487859503300615}]"]

@yanliang567
Contributor

@qiqiqi998 would you like to set up a call to discuss this issue? If possible, could you share your deployment and how to reproduce this issue via remote screen sharing? Please contact me at [email protected] if that is okay for you.

@qiqiqi998
Author

qiqiqi998 commented Dec 10, 2024

It seems this log is from before the restart, because there are insertion messages.

[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=7e58b760e9e4f338b4202d8ca2b5be2d] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":10,\"ExpireTime\":454487823327428613}]"]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=3a5381e65cd5b729003c318fc50cd50f] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487823340535810}]"]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=772df0d79cdca9846e4c37a2cc8da011] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487859503300615}]"]

You can check the startup time of the querynode and datanode; after they started, I tried query and insert.

@qiqiqi998
Author

qiqiqi998 commented Dec 10, 2024

Data loss occurs when Pulsar uses a 1-partition topic, while non-partitioned topics are normal.

pulsar: broker.conf
allowAutoTopicCreationType=partitioned
defaultNumPartitions=1
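
If the partitioned auto-creation setting is indeed the trigger, the non-partitioned counterpart would look like this (a sketch of the relevant broker.conf key, not a verified fix):

pulsar: broker.conf
allowAutoTopicCreationType=non-partitioned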

@xiaofan-luan
Collaborator

Data loss occurs when Pulsar uses a 1-partition topic, while non-partitioned topics are normal.

pulsar: broker.conf allowAutoTopicCreationType=partitioned defaultNumPartitions=1

I think a 1-partition topic and a non-partitioned topic are the same.
For Milvus, we don't support multiple partitions, because messages in the same topic cannot guarantee order.

@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19 Dec 24, 2024
@yanliang567 yanliang567 added this to the 2.4.20 milestone Dec 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.20, 2.4.21 Jan 6, 2025