
[C++][Parquet] Writing non-nullable field with nulls to Parquet generates invalid Parquet file #41667

Closed
p-ortmann opened this issue May 15, 2024 · 7 comments

p-ortmann commented May 15, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Platform: macOS 14.5 (23F79)
Versions: pyarrow 15.0.2 and 16.1.0.

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
data = {
    'column1': [1, 2, None],
    'column2': ['a', None, 'c']
}

schema = pa.schema([
    pa.field('column1', pa.int64(), nullable=True),
    pa.field('column2', pa.string(), nullable=False)  # make column2 not nullable
])

table = pa.Table.from_pydict(data, schema=schema) # set up table with data that doesn't match the schema
assert table.schema.equals(schema)

print('table before writing \n')
print(table.to_pandas())
pq.write_table(table, 'output.parquet')

table = pq.read_table('output.parquet')

print('table after writing and reading \n')
print(table.to_pandas())

yields

table before writing 

   column1 column2
0      1.0       a
1      2.0    None
2      NaN       c
table after writing and reading 

   column1 column2
0      1.0       a
1      2.0       c
2      NaN       a

which is not correct for column 2.

I would expect this to fail when the table is set up, which is what happens if you replace

table = pa.Table.from_pydict(data, schema=schema)

with

dataframe = pd.DataFrame(data)
table = pa.Table.from_pandas(dataframe, schema=schema)
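
(In that case pyarrow raises an error during the pandas conversion, since that conversion path does validate the nullable flag against the data.)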

Component(s)

Python


adhardy commented May 27, 2024

Hi folks, has anyone had a chance to look at this issue? This is causing data corruption for us. If it is intended behaviour we can work around it, but it doesn't look like it should be.

p-ortmann changed the title from "Table.from_pydict does not validate nullables" to "Table.from_pydict casues data corruption when nulls passed to a non-nullable field" Dec 2, 2024
p-ortmann changed the title from "Table.from_pydict casues data corruption when nulls passed to a non-nullable field" to "Table.from_pydict causes data corruption when nulls passed to a non-nullable field" Dec 2, 2024

adhardy commented Dec 2, 2024

Hi, I'd like to try to draw attention to this bug again.

This is silently causing data corruption in a way that is easy to trigger accidentally and very hard to detect.

table before writing 

   column1 column2
0      1.0       a
1      2.0    None
2      NaN       c
table after writing and reading 

   column1 column2
0      1.0       a
1      2.0       c
2      NaN       a

The data in column2 is wrong: there should be a null, but instead the column drops it, the remaining values shift up, and the column starts repeating itself.

This still occurs in pyarrow 18.0.0.

assignUser added the "Critical Fix" label (bugfixes for security vulnerabilities, crashes, or invalid data) Dec 2, 2024
amoeba changed the title from "Table.from_pydict causes data corruption when nulls passed to a non-nullable field" to "[Python] Table.from_pydict causes data corruption when nulls passed to a non-nullable field" Dec 2, 2024

amoeba commented Dec 2, 2024

Hi @adhardy, apologies for this being missed. I've posted this to Zulip and will try to find someone to take a look soon.

pitrou changed the title from "[Python] Table.from_pydict causes data corruption when nulls passed to a non-nullable field" to "[C++][Parquet] Writing non-nullable field with nulls to Parquet generates invalid Parquet file" Dec 3, 2024
pitrou removed the "Critical Fix" label Dec 3, 2024

pitrou commented Dec 3, 2024

So, the problem is that the Arrow data is invalid (a non-nullable field has nulls), and the Parquet writer doesn't notice the inconsistency, ending up writing invalid data. This was already reported in #31329.

I'm not sure we want to do anything in the Parquet writer to avoid this. However, it would be nice if validation of such incorrect data actually failed: see #31387.
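
To illustrate the validation gap, a minimal sketch against the pyarrow versions discussed in this thread:

import pyarrow as pa

# A table whose data contradicts its schema: a null in a non-nullable field.
schema = pa.schema([pa.field('column2', pa.string(), nullable=False)])
table = pa.Table.from_pydict({'column2': ['a', None, 'c']}, schema=schema)

# Even full validation does not catch the nullability violation
# (the gap tracked in #31387), so this passes without raising.
table.validate(full=True)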


adhardy commented Dec 3, 2024

From a user's perspective, if I set nullable=False in my schema I expect some sort of failure, much as I would if I passed a mismatched type.

At the moment, if I pass a None/null to a non-nullable field, my data becomes corrupted and I get no warning about it. I would rather the None were written; at least then my data would be "correct".

This is probably less of a problem in statically typed languages, where something would likely have failed before reaching this point, but it is really easy to do in Python.

At present there is really no point in setting nullable=False from the Python API: it does not validate that there are no nulls, and it will simply corrupt my data if there are, so I may as well always leave nullable=True.
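
In the meantime we guard against this ourselves before writing. A minimal sketch (check_non_nullable_columns is our own helper, not a pyarrow API):

import pyarrow as pa

def check_non_nullable_columns(table: pa.Table) -> None:
    # Raise if any non-nullable top-level column actually contains nulls.
    # (This only inspects top-level columns, not nested fields.)
    for field in table.schema:
        null_count = table.column(field.name).null_count
        if not field.nullable and null_count > 0:
            raise ValueError(
                f"column {field.name!r} is non-nullable "
                f"but contains {null_count} null value(s)"
            )

Calling check_non_nullable_columns(table) before pq.write_table(...) turns the silent corruption into an explicit error.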

pitrou added a commit to pitrou/arrow that referenced this issue Dec 3, 2024
…t contains nulls

A non-nullable column that contains nulls would result in an invalid Parquet file.
pitrou added a commit that referenced this issue Dec 4, 2024
…ains nulls (#44921)

### Rationale for this change

A non-nullable column that contains nulls would result in an invalid Parquet file, so we'd rather raise an error when writing.

This detection is only implemented for leaf columns. Implementing it for non-leaf columns would be more involved, and also doesn't actually seem necessary.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Raising a clear error when trying to write invalid data to Parquet, instead of letting the Parquet writer silently generate an invalid file.

* GitHub Issue: #41667

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 19.0.0 milestone Dec 4, 2024

pitrou commented Dec 4, 2024

Issue resolved by pull request #44921.

@pitrou pitrou closed this as completed Dec 4, 2024

pitrou commented Dec 4, 2024

From a user's perspective, if I set nullable=False in my schema I expect some sort of failure, much as I would if I passed a mismatched type.

I agree with this, and we have not been consistent in enforcing it. With #44921, at least writing a Parquet file with such data will fail explicitly instead of producing a corrupt file.
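
Concretely, once 19.0.0 is out, the repro at the top of this issue should fail at write time instead of producing a corrupt file. A rough sketch of the expected behavior (the exact exception type and message may differ):

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field('column2', pa.string(), nullable=False)])
table = pa.Table.from_pydict({'column2': ['a', None, 'c']}, schema=schema)

try:
    pq.write_table(table, 'output.parquet')  # should now raise
except pa.ArrowInvalid as exc:
    print(exc)  # explicit error instead of a silently invalid file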
