
loading an Int64 with a schema that says Int32 raises an OverflowError #218

Open
sibbiii opened this issue Jun 17, 2024 · 1 comment
sibbiii commented Jun 17, 2024

Hi,

This might look like a stupid bug report at first glance, but let me explain:

Assume a master service is reading data from MongoDB, where the data is written by other services.
By design, one of the cool things about MongoDB is that it can work schemaless (I know you can enforce a schema).

Several services write data to MongoDB:

```python
collection.insert_one({'data_to_test': 42})
```

and some master service reads this data:

```python
pymongoarrow.api.aggregate_arrow_all(
    collection, [],
    schema=pymongoarrow.api.Schema({'data_to_test': pyarrow.int32()}),
)
```

This works absolutely fine. And even if some service writes a string (or ObjectId, or datetime, or ...) to this field:

```python
collection.insert_one({'data_to_test': 'a string'})
```

the master service just receives a 'null' and all is fine.

But then, one day, the master service completely breaks because one service wrote an Int64 to this field:

```python
collection.insert_one({'data_to_test': 1_000_000_000_000})
```

Now the master service does not get a null; `pymongoarrow.api.aggregate_arrow_all` raises `OverflowError: value too large to convert to int32_t`.
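Some context that may help, not stated in the original report: PyMongo encodes Python ints that do not fit in 32 bits as BSON int64, so the value above reaches pymongoarrow as an Int64 even though nobody asked for that type. A quick sanity check of the ranges (assuming standard two's-complement bounds):

```python
# int32 holds values in [-2**31, 2**31 - 1]; int64 in [-2**63, 2**63 - 1]
INT32_MAX = 2**31 - 1        # 2_147_483_647
INT64_MAX = 2**63 - 1

value = 1_000_000_000_000    # what the other service wrote

print(value > INT32_MAX)     # True: does not fit in int32
print(value <= INT64_MAX)    # True: fits in int64, so BSON stores it as int64
```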

I have now written acceptance tests for all possible combinations of data in MongoDB and reading them with any schema, e.g. there is an int in the database and you read with a schema that says string. All combinations work fine (setting the value to null on type mismatch is fine). The only combination that breaks everything is:

```python
collection.insert_one({'data_to_test': 1_000_000_000_000})
pymongoarrow.api.aggregate_arrow_all(
    collection, [],
    schema=pymongoarrow.api.Schema({'data_to_test': pyarrow.int32()}),
)
```

I consider this a "bug" because I cannot read any Int32 data if there is a single Int64 value in the database. My workaround for now is to always read as Int64 and then downcast, but is this really how it should be?

PS: It's not a showstopper once you know that reading as Int32 is a no-go if the schema is not enforced, but it's kind of surprising that this read raises while all other combinations work fine.


keanamo commented Jun 24, 2024

Hi @sibbiii, I've created a ticket to track this request https://jira.mongodb.org/browse/PYTHON-4519
