[Bug]: Failed to export all data #38646

Open
1 task done
minglong-huang opened this issue Dec 23, 2024 · 1 comment
Assignees
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@minglong-huang

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: milvusdb/milvus:v2.4.0
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus v2.5.0
- OS (Ubuntu or CentOS): Ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I installed Milvus with the commands below and have been running it for about a month:
wget https://github.com/milvus-io/milvus/releases/download/v2.4.0/milvus-standalone-docker-compose.yml -O docker-compose.yml
sudo docker-compose up -d
Recently I tried to export the data into GraphRAG format, but the export does not return all of the data.
Here is my code:

from pymilvus import connections, Collection
import pandas as pd
from xinference.client import Client
from tqdm import tqdm

# Embedding model served by Xinference; used here only to count tokens.
client = Client("http://192.0.0.181:9997")
list_models_run = client.list_models()
model_uid = list_models_run['bge-m3']['id']
embedding_client = client.get_model(model_uid)

connections.connect(host='0.0.0.0', port="19530")
collection = Collection(name='temp')
collection.load()  # load() returns None, so there is nothing to assign

batch_size = 10
output_fields = ["id", "text", "file_name", "text_embedding"]
# Note: only the '_default' partition is iterated, while
# collection.num_entities covers the whole collection.
query_iter = collection.query_iterator(
    batch_size=batch_size,
    output_fields=output_fields,
    partition_names=['_default']
)

new_doc_file = []
pbar = tqdm(total=collection.num_entities)
while True:
    docs = query_iter.next()
    if len(docs) == 0:
        # close the iterator
        query_iter.close()
        break
    for doc in docs:
        new_doc = {}
        for k, v in doc.items():
            if k == 'file_name':
                # GraphRAG expects a list of source document ids
                new_doc['document_ids'] = [v]
            elif k == 'text':
                new_doc[k] = v
                n_tokens = embedding_client.create_embedding(v)['usage']['total_tokens']
                new_doc['n_tokens'] = n_tokens
            else:
                new_doc[k] = v
        new_doc['entity_ids'] = []
        new_doc['relationship_ids'] = []

        new_doc_file.append(new_doc)
    # Advance by the rows actually returned; the last batch may be short.
    pbar.update(len(docs))
pbar.close()

df = pd.DataFrame(new_doc_file)
df.to_parquet('/home/nlp/temp/test_1.parquet', engine='pyarrow')

collection.num_entities reports 10104, but only 9753 rows were actually exported.
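
One way to narrow down a gap like this is to compare the exported row count against an exact count(*) query instead of num_entities. As far as I know, num_entities is derived from segment statistics and may still include rows that were deleted but not yet compacted, and it covers every partition rather than only '_default'. The sketch below is not from the original report: count_rows and summarize_export are hypothetical helper names, and it assumes the installed pymilvus supports count(*) through Collection.query, optionally restricted by partition_names.

```python
from typing import Any, Iterable, Optional, Sequence


def count_rows(collection: Any, partition_names: Optional[Sequence[str]] = None) -> int:
    """Return the number of live rows using a count(*) query.

    `collection` is expected to be a loaded pymilvus Collection
    (assumption: count(*) is supported by the installed version).
    """
    res = collection.query(
        expr="",
        output_fields=["count(*)"],
        partition_names=list(partition_names) if partition_names else None,
    )
    return res[0]["count(*)"]


def summarize_export(rows: Iterable[dict], pk: str = "id") -> dict:
    """Count exported rows and how many primary keys repeat among them."""
    total = 0
    seen = set()
    for row in rows:
        total += 1
        seen.add(row[pk])
    return {"total": total, "unique": len(seen), "duplicates": total - len(seen)}
```

With the script above, count_rows(collection), count_rows(collection, ['_default']) and summarize_export(new_doc_file) can be printed next to collection.num_entities. If count(*) for '_default' already equals 9753, the iterator returned everything it should have and the difference comes from deleted rows or other partitions.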

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@minglong-huang minglong-huang added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 23, 2024
@yanliang567
Contributor

Please retry with Milvus 2.5.0, which includes an improvement to query_iterator().
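
Before retrying, it may help to confirm which server version the client is actually talking to. The helpers below are a hypothetical sketch (parse_version and needs_upgrade are not pymilvus functions); they assume the version string contains a dotted major.minor.patch triple such as v2.4.0.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Extract the first major.minor.patch triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def needs_upgrade(server_version: str, minimum: Tuple[int, int, int] = (2, 5, 0)) -> bool:
    """Return True when the server is older than the suggested minimum."""
    return parse_version(server_version) < minimum
```

With a live connection, the string returned by pymilvus utility.get_server_version() can be passed to needs_upgrade.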

/assign @minglong-huang
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 24, 2024