[C++][Parquet] Writing non-nullable field with nulls to Parquet generates invalid Parquet file #41667
Comments
Hi folks, has anyone had a chance to look at this issue? This is causing data corruption for us. If it is intended behaviour we can work around it, but it doesn't look like it should be.
Hi, I'd like to try to draw attention to this bug again. It is silently causing data corruption in a way that is easy to trigger accidentally and very hard to detect.
The data in column two is wrong: there should be a null, but instead the column leaves it out and starts repeating itself. This still occurs in pyarrow 18.0.0.
Hi @adhardy, apologies for this being missed. I've posted this to Zulip and will try to find someone to take a look soon.
So, the problem is that the Arrow data is invalid (a non-nullable field has nulls), and the Parquet writer doesn't notice the inconsistency, ending up writing invalid data. This was already reported in #31329. I'm not sure we want to do anything in the Parquet writer to avoid this. However, it would be nice if validation of such incorrect data actually failed: see #31387
From a user perspective, my expectation is that if I set nullable=False in my schema, I expect some sort of failure, much as if I had passed a mismatched type. At the moment, if I pass a None/null to a non-nullable field, my data becomes corrupted and I get no warning about it. I would rather the None were written; at least my data would be "correct". This is perhaps less of a problem in statically typed languages, where something would probably have failed before reaching that point, but it is really easy to do in Python. At present there is really no use in setting nullable=False from the Python API: it does not validate that there are no nulls, and it will just corrupt my data if there are, so I may as well always leave nullable=True.
…t contains nulls A non-nullable column that contains nulls would result in an invalid Parquet file.
…ains nulls (#44921)

### Rationale for this change
A non-nullable column that contains nulls would result in an invalid Parquet file, so we'd rather raise an error when writing. This detection is only implemented for leaf columns. Implementing it for non-leaf columns would be more involved, and also doesn't actually seem necessary.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Raising a clear error when trying to write invalid data to Parquet, instead of letting the Parquet writer silently generate an invalid file.

* GitHub Issue: #41667

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Issue resolved by pull request 44921
I agree with this, and we have not been consistent in enforcing it. With #44921, at least writing a Parquet file with such data will fail explicitly instead of producing a corrupt file.
Describe the bug, including details regarding any error messages, version, and platform.
Platform: macOS 14.5 (23F79)
Version: 15.0.2 and 16.1.0
yields
which is not correct for column 2.
I would expect this to fail on set up of the table, which is what happens if you replace
with
Component(s)
Python