[Bug]: client.describe_index shows zero indexed rows despite data being present and ANN search working #38560

Open
1 task done
Liqs-v2 opened this issue Dec 18, 2024 · 5 comments
Labels: help wanted (Extra attention is needed)

Liqs-v2 commented Dec 18, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- PyMilvus version: 2.5.0
- Milvus-Lite Version: 2.4.10
- Deployment mode(standalone or cluster): Local DB
- MQ type(rocksmq, pulsar or kafka):
- OS(Ubuntu or CentOS): MacOS 15.1.1 and Windows 11 24H2
- CPU/Memory:
  - Mac: M3 Max, 64GB
  - Windows: Ryzen 7 7735U, 16GB
- GPU: Integrated for both
- Others:

Current Behavior

The Problem

I am trying to use Milvus as the DB for a RAG system in a research project related to SWE-bench and AI agents. My current setup is as follows:

  client = MilvusClient('data/task_embeddings.db')
  schema = MilvusClient.create_schema(
      auto_id=False,
  )

  schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, description="primary id")
  schema.add_field(field_name="instance_id", datatype=DataType.VARCHAR, max_length=512,
                   description="instance id name")
  schema.add_field(field_name="trajectory", datatype=DataType.VARCHAR, max_length=65535,
                   description="agent trajectory")
  schema.add_field(field_name="problem_statement", datatype=DataType.VARCHAR, max_length=65535,
                   description="github issue")
  schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=768,
                   description="problem statement embedding vector")

  # Create the collection
  client.create_collection(
      collection_name="swe_bench_lite",
      schema=schema  # the vector dimension (768) comes from the schema
  )
  index_params = MilvusClient.prepare_index_params()
  index_params.add_index(
      field_name="vector",
      metric_type="COSINE",
      index_type="FLAT",
      index_name="vector_index",
  )

  # Create index for first collection
  client.create_index(
      collection_name="swe_bench_lite",
      index_params=index_params,
      sync=True
  )

Then I compute embeddings for the problem statements in SWE-bench and insert the data as follows:

    client.insert(collection_name='swe_bench_lite', data=[dict(row) for row in swe_bench_lite])

When I then run client.describe_index(collection_name='swe_bench_lite', index_name='vector_index'), I get the following result:

{'dim': '768', 'field_name': 'vector', 'index_name': 'vector_index', 'index_type': 'FLAT', 'indexed_rows': 0, 'metric_type': 'COSINE', 'pending_index_rows': 0, 'state': 'Finished', 'total_rows': 0}

This indicates that no rows were indexed. I have verified that the rows were inserted as expected.

Expected Behavior

I expected client.describe_index to report the correct number of indexed rows. Failing that, I expected the search to fail, or at least to emit a warning, when no rows are indexed.
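
Until something like that exists, a user-side sanity check could look roughly like the sketch below. It is not part of PyMilvus: the helper name check_index_coverage is hypothetical, and it assumes that the client supports a count(*) query and that describe_index returns an indexed_rows key, as in the output shown above.

```python
import warnings


def check_index_coverage(client, collection_name, index_name):
    """Compare the stored row count with the index's reported indexed_rows.

    Hypothetical helper. `client` is assumed to behave like a pymilvus
    MilvusClient: query() with output_fields=["count(*)"] is assumed to
    return [{"count(*)": n}], and describe_index() a dict with "indexed_rows".
    """
    count_result = client.query(
        collection_name=collection_name,
        filter="",
        output_fields=["count(*)"],
    )
    stored_rows = int(count_result[0]["count(*)"])

    index_info = client.describe_index(
        collection_name=collection_name,
        index_name=index_name,
    )
    indexed_rows = int(index_info.get("indexed_rows", 0))

    # Emit the warning the report asks for when the two numbers disagree.
    if indexed_rows < stored_rows:
        warnings.warn(
            f"{index_name}: describe_index reports {indexed_rows} indexed rows, "
            f"but {collection_name} holds {stored_rows} rows"
        )
    return stored_rows, indexed_rows
```

With the reproduction below, this would be called as check_index_coverage(client, "foo", "vector_index") right after the insert.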

Steps To Reproduce

from pymilvus import MilvusClient, DataType, FieldSchema, CollectionSchema, Collection

client = MilvusClient('data/demo.db')
schema = MilvusClient.create_schema(
    auto_id=False,
)

index_dimensions = 2
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, description="primary id")
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=index_dimensions,
                 description="embedding vector")

# Create the collection
client.create_collection(
    collection_name="foo",
    dimensions=index_dimensions,
    schema=schema
)

index_params = MilvusClient.prepare_index_params()
index_params.add_index(
    field_name="vector",
    metric_type="COSINE",
    index_type="FLAT",
    index_name="vector_index",
)

# Create index for first collection
client.create_index(
    collection_name="foo",
    index_params=index_params,
    sync=True
)

data = [
    {
        "id": 1,
        "vector": [0.1, 0.2]
    },
    {
        "id": 2,
        "vector": [0.2, 0.3]
    },
    {
        "id": 3,
        "vector": [0.3, 0.4]
    }
]

client.insert(
    collection_name="foo",
    data=data
)

# Check that the data were correctly inserted
client.get(
    collection_name="foo",
    ids=[1, 2, 3])

# Were the data indexed?
client.describe_index(collection_name='foo', index_name='vector_index') 
# OUTPUT: {'dim': '2', 'field_name': 'vector', 'index_name': 'vector_index', 'index_type': 'FLAT', 'indexed_rows': 0, 'metric_type': 'COSINE', 'pending_index_rows': 0, 'state': 'Finished', 'total_rows': 0}

client.search(collection_name='foo', data=[[0.6,0.1]])
# OUTPUT: data: ["[{'id': 3, 'distance': 0.7233555912971497, 'entity': {}}, {'id': 2, 'distance': 0.683941125869751, 'entity': {}}, {'id': 1, 'distance': 0.5881717205047607, 'entity': {}}]"] 


### Milvus Log

_No response_

### Anything else?

_No response_
Liqs-v2 added the kind/bug and needs-triage labels on Dec 18, 2024
@akshatvishu

I think describe_index shows zero indexed rows because FLAT is a brute-force index: it does not require index building and does not provide metrics like “indexed rows.” It is essentially a baseline configuration that scans all vectors directly, which gives 100% recall without any offline indexing step.
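
To illustrate what "scanning all vectors directly" means, here is a minimal pure-Python sketch of a brute-force cosine search over the three rows from the reproduction. It is only an illustration of the idea, not Milvus's actual implementation, and the function names are made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, rows, limit=10):
    """Score every row against the query and return the best matches first."""
    scored = [
        {"id": row["id"], "distance": cosine_similarity(query, row["vector"])}
        for row in rows
    ]
    # For COSINE, a larger value means a closer match.
    scored.sort(key=lambda hit: hit["distance"], reverse=True)
    return scored[:limit]


# Same rows and query vector as in the reproduction steps.
rows = [
    {"id": 1, "vector": [0.1, 0.2]},
    {"id": 2, "vector": [0.2, 0.3]},
    {"id": 3, "vector": [0.3, 0.4]},
]
hits = brute_force_search([0.6, 0.1], rows)
for hit in hits:
    print(hit["id"], round(hit["distance"], 4))
```

This ranks the ids as 3, 2, 1, the same order as the client.search output in the report, without any index structure being built.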

@yanliang567
Contributor

You did not see the inserted rows in describe_index() because the rows have not been flushed to disk yet. Milvus usually flushes automatically within a few seconds; you can also call client.flush() manually. But please do not call flush() manually too often, since it is a heavy I/O operation for the system.
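
A rough sketch of that suggestion, for a one-off check rather than something to run after every insert. The helper name describe_index_after_flush is hypothetical, and the sketch assumes the installed PyMilvus version exposes flush() on MilvusClient; whether the counters then change on Milvus Lite is not confirmed in this thread.

```python
def describe_index_after_flush(client, collection_name, index_name):
    """Flush once, then return the describe_index() result.

    Hypothetical helper. `client` is assumed to be a pymilvus MilvusClient
    (or anything with the same flush/describe_index methods).
    """
    # Flushing is heavy on I/O, so do this for diagnostics only.
    client.flush(collection_name=collection_name)
    return client.describe_index(
        collection_name=collection_name,
        index_name=index_name,
    )
```

For the reproduction above, this would be called as describe_index_after_flush(client, "foo", "vector_index") and the indexed_rows and total_rows values compared with the earlier output.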

/assign @Liqs-v2
/unassign

sre-ci-robot assigned Liqs-v2 and unassigned yanliang567 on Dec 20, 2024
yanliang567 added the help wanted label and removed the kind/bug and needs-triage labels on Dec 20, 2024
@Liqs-v2
Author

Liqs-v2 commented Dec 20, 2024

> I think describe_index shows zero indexed rows because FLAT is a brute-force index: it does not require index building and does not provide metrics like “indexed rows.” It is essentially a baseline configuration that scans all vectors directly, which gives 100% recall without any offline indexing step.

That makes sense to me, but I still think from a UX perspective this is not ideal. To me it seemed like there might be something broken and I lost some degree of trust in the integrity of the results.

@Liqs-v2
Author

Liqs-v2 commented Dec 20, 2024

I suppose a way forward would be to at least document this behaviour? I can do this later.

@yanliang567
Contributor

Please feel free to create a new doc for this
