
[Bug]: inserted data is not flushed; data loss occurs after restarting querynode and datanode #38284

Open
qiqiqi998 opened this issue Dec 6, 2024 · 17 comments

Comments

@qiqiqi998

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.17
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.3.4
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After inserting 12 rows of data, I restarted the querynode and datanode, and the data can no longer be found.

python3 -i test.py

import random
import time
import numpy as np
from pymilvus import connections, CollectionSchema, FieldSchema, DataType, Collection, utility
 
# Milvus connection settings
CLUSTER_ENDPOINT = "10.72.8.192"
PORT = "32086"
USER = "root"
PASSWORD = "Milvus"
COLLECTION = "test_collection"
 
# Connect to Milvus
connections.connect("default", host=CLUSTER_ENDPOINT, port=PORT, user=USER, password=PASSWORD)
 
# Create the collection
def init_db():
    if utility.has_collection(COLLECTION):
        print(f"Collection '{COLLECTION}' already exists.")
        return
    fields = [
        # note: max_length applies only to VARCHAR fields, so it is dropped for the INT64 pk
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema(name="random", dtype=DataType.DOUBLE),
        FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=5)
    ]  # define the collection's fields
    schema = CollectionSchema(fields=fields, description="Test collection")
    collection = Collection(name=COLLECTION, schema=schema, consistency_level="Strong")  # create the collection
    index = {
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 128},
    }
    collection.create_index("embeddings", index)
    print(f"Collection '{COLLECTION}' created successfully.")
 
# Insert vectors
def insert_vectors(num_vectors):
    collection = Collection(name=COLLECTION)
    batch_size = 10
    for i in range(0, num_vectors, batch_size):
        current_batch_size = min(batch_size, num_vectors - i)
        embeddings = np.random.rand(current_batch_size, 5).tolist()
        entities = [
            [j for j in range(i, i + current_batch_size)],  # pk field
            np.random.rand(current_batch_size).tolist(),    # random field: random doubles
            embeddings                                      # embeddings field: randomly generated vectors
        ]
        collection.insert(entities)
        # collection.flush()  # intentionally skipped: the data stays in a growing segment
        print(f"Inserted a batch of {current_batch_size} vectors.")
 
# Query vectors
def query_vectors(num_ids):
    collection = Collection(name=COLLECTION)
    collection.load()
    ids = list(range(num_ids))
    missing_ids = []
    for i in range(0, num_ids):  # query the data one ID at a time
        results = collection.query(expr=f"pk in {ids[i:i+1]}", output_fields=["pk"])
        if not results:
            missing_ids.append(ids[i])
    if not missing_ids:
        print("Success: all IDs were found")
    else:
        print(f"IDs not found: {missing_ids}")
 
# Delete all vectors
def delete_all():
    collection = Collection(name=COLLECTION)
    collection.delete(expr="pk != -1")
    print(f"All vectors in collection '{COLLECTION}' have been deleted.")
 
# Drop the collection
def drop_all():
    utility.drop_collection(f"{COLLECTION}")
    print(f"Collection '{COLLECTION}' has been dropped.")
 
 
print('''
    COLLECTION = "test_collection"
    init_db()           # initialize the database
    insert_vectors(12)  # insert 12 random vectors
    time.sleep(1)       # wait 1 second
    query_vectors(12)   # query the first 12 vectors
    query_vectors(13)   # query the first 13 vectors
    delete_all()        # delete all vectors
    drop_all()          # drop the collection
''')
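
For reference, a minimal sketch of the flushed variant, reusing COLLECTION and insert_vectors() from the script above (the helper name insert_vectors_flushed is hypothetical); flush() seals the growing segment and persists it, which is exactly the call the repro comments out:

def insert_vectors_flushed(num_vectors):
    # same insertion path as above, but with the flush the repro skips
    insert_vectors(num_vectors)
    collection = Collection(name=COLLECTION)
    collection.flush()  # seal growing segments so the rows survive node restarts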

Expected Behavior

The inserted data can still be queried after the restart.

Steps To Reproduce

1. Insert 10 rows of data.
2. Query.
3. Restart the querynode and datanode (a restart sketch for a Helm deployment follows below).
4. Query again.
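
For a Helm deployment (as the reporter mentions later in the thread), the restarts in step 3 can be done with kubectl; a sketch assuming a release named my-release (adjust the deployment names to your release):

kubectl rollout restart deployment my-release-milvus-querynode
kubectl rollout restart deployment my-release-milvus-datanode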

Milvus Log

No response

Anything else?

No response

@qiqiqi998 qiqiqi998 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 6, 2024
@qiqiqi998
Author

The newly inserted data can be queried.

@yanliang567
Contributor

@zhuwenxing
Could you please help reproduce the issue?

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 7, 2024
@xiaofan-luan
Collaborator

this is very common, because Milvus runs approximate nearest neighbor search and there is no guarantee that all data can be searched.
If you want strict matching:

  1. disable the growing segment index
  2. use a FLAT index (sketched below)
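
A minimal sketch of this suggestion, reusing the collection from the repro script (release/drop_index/create_index is one way to swap the index; the parameters are illustrative):

from pymilvus import Collection

# Replace the approximate IVF_FLAT index with FLAT, which performs exact brute-force search.
collection = Collection(name="test_collection")
collection.release()      # the index cannot be changed while the collection is loaded
collection.drop_index()
collection.create_index("embeddings", {"index_type": "FLAT", "metric_type": "L2", "params": {}})
collection.load()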

@qiqiqi998
Author

this is very common, because Milvus runs approximate nearest neighbor search and there is no guarantee that all data can be searched. If you want strict matching:

1. disable the growing segment index

2. use a FLAT index

@xiaofan-luan
I don't think this is normal:

  1. Using FLAT gives the same result.
  2. The result of count is 0 (a count(*) sketch follows below).
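
For anyone verifying the count, Milvus 2.3+ supports a count(*) query from pymilvus; a sketch against the repro collection:

from pymilvus import Collection

collection = Collection("test_collection")
collection.load()
print(collection.query(expr="", output_fields=["count(*)"]))  # e.g. [{'count(*)': 0}] after the restart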

@yanliang567
Contributor

/assign @congqixia
Please help take a look. When I try to flush the collection after reboot, the growing segment becomes dropped. I think this is not correct.
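
A quick way to observe segment states from the client; a sketch using pymilvus' get_query_segment_info, which reports segments as seen by the querynodes:

from pymilvus import Collection, utility

collection = Collection("test_collection")
collection.flush()  # force-seal any growing segments
for seg in utility.get_query_segment_info("test_collection"):
    print(seg)  # each entry includes the segment ID, state, and row count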

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Dec 9, 2024
@yanliang567 yanliang567 added this to the 2.4.18 milestone Dec 9, 2024
@xiaofan-luan
Collaborator

/assign @congqixia please help take a look. When I try to flush the collection after reboot, the growing segment becomes dropped. I think this is not correct.

If all data is deleted, why couldn't a segment become dropped? Maybe it was simply because compaction happened.

@qiqiqi998
Author

/assign @congqixia please help take a look. When I try to flush the collection after reboot, the growing segment becomes dropped. I think this is not correct.

If all data is deleted, why couldn't a segment become dropped? Maybe it was simply because compaction happened.

I did not delete any data; I just restarted the querynode and datanode, and the data in the growing segment was lost.

@congqixia
Contributor

Found the reason why standalone could reproduce this problem

[2024/12/09 07:23:35.278 +00:00] [WARN] [server/rocksmq_impl.go:883] ["RocksMQ: trying to seek to no exist position, reset current id"] [topic=yanliang-2417-rootcoord-dml_5] [group=datanode-4-yanliang-2417-rootcoord-dml_5_454485287722142520v0-true] [msgId=454486637675544581]

RocksMQ failed to seek to the correct checkpoint and used the latest position instead, which means some data may be lost when using RocksMQ.

@qiqiqi998 since we could only reproduce this with a standalone+rocksmq setup, could you please verify whether the actual Milvus instance setup was a cluster or not?

@qiqiqi998
Author

Found the reason why standalone could reproduce this problem

[2024/12/09 07:23:35.278 +00:00] [WARN] [server/rocksmq_impl.go:883] ["RocksMQ: trying to seek to no exist position, reset current id"] [topic=yanliang-2417-rootcoord-dml_5] [group=datanode-4-yanliang-2417-rootcoord-dml_5_454485287722142520v0-true] [msgId=454486637675544581]

RocksMQ failed to seek to the correct checkpoint and used the latest position instead, which means some data may be lost when using RocksMQ.

@qiqiqi998 since we could only reproduce this with a standalone+rocksmq setup, could you please verify whether the actual Milvus instance setup was a cluster or not?

cluster + pulsar, deployed via Helm

@yanliang567
Contributor

@qiqiqi998 could you please collect the Milvus and Pulsar logs from when you restart the querynode and datanode?
Please refer to this doc to export the whole Milvus logs for investigation.

@qiqiqi998
Author

@qiqiqi998 could you please collect the Milvus and Pulsar logs from when you restart the querynode and datanode? Please refer to this doc to export the whole Milvus logs for investigation.

logs.tar.gz

@congqixia
Contributor

@qiqiqi998 the Info-level log looks fine and there are no suspicious entries in the log or screenshot. I was wondering if you could reproduce at DEBUG log level and provide a birdwatcher backup with https://github.com/milvus-io/birdwatcher/releases/tag/v1.0.7

@xiaocai2333
Contributor

It seems this log is from before the restart, because there are insertion messages.

[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=7e58b760e9e4f338b4202d8ca2b5be2d] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":10,\"ExpireTime\":454487823327428613}]"]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=3a5381e65cd5b729003c318fc50cd50f] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487823340535810}]"]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=772df0d79cdca9846e4c37a2cc8da011] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487859503300615}]"]

@yanliang567
Contributor

@qiqiqi998 would you like to set up a call to discuss this issue? If possible, could you share your deployment and how to reproduce this issue via remote screen sharing? Please contact me at [email protected] if that is okay for you.

@qiqiqi998
Author

qiqiqi998 commented Dec 10, 2024

It seems this log is from before the restart, because there are insertion messages.

[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.957 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=7e58b760e9e4f338b4202d8ca2b5be2d] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":10,\"ExpireTime\":454487823327428613}]"]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:37:34.974 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=3a5381e65cd5b729003c318fc50cd50f] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487823340535810}]"]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/meta.go:1286] ["meta update: add allocation - complete"] [segmentID=454487750778753045]
[2024/12/09 08:39:52.935 +00:00] [INFO] [datacoord/services.go:229] ["success to assign segments"] [traceID=772df0d79cdca9846e4c37a2cc8da011] [collectionID=454487750778552541] [assignments="[{\"SegmentID\":454487750778753045,\"NumOfRows\":2,\"ExpireTime\":454487859503300615}]"]

You can check the startup time of the querynode and datanode; after they started, I tried query and insert.

@qiqiqi998
Author

qiqiqi998 commented Dec 10, 2024

Data loss occurs when Pulsar uses a 1-partition topic, while non-partitioned topics are normal.

pulsar: broker.conf
allowAutoTopicCreationType=partitioned
defaultNumPartitions=1
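
If the partitioned auto-creation setting is indeed the trigger, the non-partitioned counterpart would look like this (a sketch of the relevant broker.conf key, not a verified fix):

pulsar: broker.conf
allowAutoTopicCreationType=non-partitioned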

@xiaofan-luan
Collaborator

Data loss occurs when Pulsar uses a 1-partition topic, while non-partitioned topics are normal.

pulsar: broker.conf allowAutoTopicCreationType=partitioned defaultNumPartitions=1

I think a 1-partition topic and a non-partitioned topic are the same.
For Milvus, we don't support multiple partitions, because messages in the same topic cannot guarantee order.

@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19 Dec 24, 2024
@yanliang567 yanliang567 added this to the 2.4.20 milestone Dec 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.20, 2.4.21 Jan 6, 2025