[Bug]: [null & default] The searched results number is larger than expected when search with expression "field_name == 0" on nullable field with None data without flush #37734

Closed
1 task done
binbinlv opened this issue Nov 16, 2024 · 8 comments
Labels: 2.5-features, kind/bug, triage/accepted


binbinlv commented Nov 16, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20241115-d1596297-amd64
- Deployment mode(standalone or cluster):both
- MQ type(rocksmq, pulsar or kafka):    all
- SDK version(e.g. pymilvus v2.0.0rc2): 2.5.0rc121
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The number of search results is larger than expected when searching with the expression "field_name == 0" on a nullable field that contains None data, without flush.

search_results_check: limit(topK) searched (10) is not equal with expected (1) (func_check.py:346)

Expected Behavior

search_results_check: limit(topK) searched (1) is equal with expected (1) (func_check.py:346)
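
The expected filter semantics can be sketched in plain Python: a None value should never satisfy "field_name == 0". This is only an illustration, not Milvus code. The data layout, the helper names `match_eq` and `match_eq_nulls_as_placeholder`, and the idea that nulls are read as a placeholder 0 are all assumptions made for the example; the issue itself does not state the root cause.

```python
from typing import Optional, Sequence

def match_eq(values: Sequence[Optional[float]], target: float) -> list[int]:
    """Expected semantics: a null value never matches an equality filter."""
    return [i for i, v in enumerate(values) if v is not None and v == target]

def match_eq_nulls_as_placeholder(values: Sequence[Optional[float]],
                                  target: float,
                                  placeholder: float = 0.0) -> list[int]:
    """Hypothetical faulty variant: nulls are read as a placeholder before comparing."""
    return [i for i, v in enumerate(values)
            if (placeholder if v is None else v) == target]

# Hypothetical data: 20 rows, the first half hold float(i), the second half are None.
rows = [float(i) if i < 10 else None for i in range(20)]
limit = 10  # topK used by the search

expected_hits = match_eq(rows, 0)[:limit]
over_matched_hits = match_eq_nulls_as_placeholder(rows, 0)[:limit]

print(len(expected_hits), len(over_matched_hits))
```

Under these assumptions the first count is 1 and the second reaches the limit of 10, which is the same shape as the mismatch reported above. A filter on "== 1" returns a single hit in both variants, because the assumed placeholder never equals 1.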

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L1)
    # @pytest.mark.skip(reason="issue #37547")
    def test_search_none_data_expr_cache(self, is_flush):
        """
        target: test search case with none data to test expr cache
        method: 1. create collection with double datatype as nullable field
                2. search with expr "nullableFid == 0"
                3. drop this collection
                4. create collection with same collection name and same field name but modify the type of nullable field
                   as varchar datatype
                5. search with expr "nullableFid == 0" again
        expected: 1. search successfully with limit(topK) for the first collection
                  2. report error for the second collection with the same name
        """
        # 1. initialize with data
        collection_w, _, _, insert_ids, time_stamp = \
            self.init_collection_general(prefix, True, is_flush=is_flush,
                                         nullable_fields={ct.default_float_field_name: 0.5})[0:5]
        collection_name = collection_w.name
        # 2. generate search data
        vectors = cf.gen_vectors_based_on_vector_type(default_nq, default_dim)
        # 3. search with expr "nullableFid == 0"
        search_exp = f"{ct.default_float_field_name} == 0"
        output_fields = [default_int64_field_name, default_float_field_name]
        collection_w.search(vectors[:default_nq], default_search_field,
                            default_search_params, default_limit,
                            search_exp,
                            output_fields=output_fields,
                            check_task=CheckTasks.check_search_results,
                            check_items={"nq": default_nq,
                                         "ids": insert_ids,
                                         "limit": 1,
                                         "output_fields": output_fields})
        # 4. drop collection
        collection_w.drop()
        # 5. create the same collection name with same field name but varchar field type
        int64_field = cf.gen_int64_field(is_primary=True)
        string_field = cf.gen_string_field(ct.default_float_field_name, nullable=True)
        json_field = cf.gen_json_field()
        float_vector_field = cf.gen_float_vec_field()
        fields = [int64_field, string_field, json_field, float_vector_field]
        schema = cf.gen_collection_schema(fields)
        collection_w = self.init_collection_wrap(name=collection_name, schema=schema)
        int64_values = pd.Series(data=[i for i in range(default_nb)])
        string_values = pd.Series(data=[str(i) for i in range(default_nb)], dtype="string")
        json_values = [{"number": i, "string": str(i), "bool": bool(i),
                        "list": [j for j in range(i, i + ct.default_json_list_length)]} for i in range(default_nb)]
        float_vec_values = cf.gen_vectors(default_nb, default_dim)
        df = pd.DataFrame({
            ct.default_int64_field_name: int64_values,
            ct.default_float_field_name: None,
            ct.default_json_field_name: json_values,
            ct.default_float_vec_field_name: float_vec_values
        })
        collection_w.insert(df)
        collection_w.create_index(ct.default_float_vec_field_name, ct.default_flat_index)
        collection_w.load()
        collection_w.flush()
        collection_w.search(vectors[:default_nq], default_search_field,
                            default_search_params, default_limit,
                            search_exp,
                            output_fields=output_fields,
                            check_task=CheckTasks.err_res,
                            check_items={"err_code": 1100,
                                         "err_msg": "failed to create query plan: cannot parse expression: float == 0, "
                                                    "error: comparisons between VarChar and Int64 are not supported: "
                                                    "invalid parameter"})

Milvus Log

https://grafana-4am.zilliz.cc/explore?orgId=1&left=%7B%22datasource%22:%22Loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%22devops%5C%22,namespace%3D%5C%22chaos-testing%5C%22,pod%3D~%5C%22test-null-master-cjqsw.*%5C%22%7D%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

Anything else?

collection name: search_collection_euwUGGzx

binbinlv added the kind/bug, triage/accepted, and 2.5-features labels Nov 16, 2024
binbinlv added this to the 2.5.0 milestone Nov 16, 2024
@binbinlv
Contributor Author

  1. If the search runs after flush, the result is correct: 1 hit is returned, not 10.
  2. If the search uses "field_name == 1", the result is correct: 1 hit is returned, not 10.

@binbinlv
Contributor Author

This issue was exposed by the same test case while verifying the crash issue #37547, which is now fixed.

@binbinlv
Contributor Author

If there is no "None" data in the collection, this issue does not occur.

@czs007
Collaborator

czs007 commented Nov 18, 2024

@JsDove is helping on this.

sre-ci-robot pushed a commit that referenced this issue Nov 18, 2024
@smellthemoon
Contributor

@binbinlv The PR has been merged; could you help verify it?

@smellthemoon
Contributor

/assign @binbinlv

@binbinlv
Contributor Author

Working on verification.

@binbinlv
Contributor Author

Verified and fixed.
milvus: master-20241119-b6612e02-amd64
pymilvus: 2.5.0rc122
results:

 testcases/test_search.py::TestCollectionSearchNoneAndDefaultData.test_search_none_data_expr_cache[False] ✓
 testcases/test_search.py::TestCollectionSearchNoneAndDefaultData.test_search_none_data_expr_cache[True] ✓
