
[C++][Parquet] Writing non-nullable field with nulls to Parquet generates invalid Parquet file #41667

Closed
p-ortmann opened this issue May 15, 2024 · 7 comments

p-ortmann commented May 15, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Platform: macOS 14.5 (23F79)
Versions: pyarrow 15.0.2 and 16.1.0.

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq
data = {
    'column1': [1, 2, None],
    'column2': ['a', None, 'c']
}

schema = pa.schema([
    pa.field('column1', pa.int64(), nullable=True),
    pa.field('column2', pa.string(), nullable=False)  # make column2 not nullable
])

table = pa.Table.from_pydict(data, schema=schema) # set up table with data that doesn't match the schema
assert table.schema.equals(schema)

print('table before writing \n')
print(table.to_pandas())
pq.write_table(table, 'output.parquet')

table = pq.read_table('output.parquet')

print('table after writing and reading \n')
print(table.to_pandas())

yields

table before writing 

   column1 column2
0      1.0       a
1      2.0    None
2      NaN       c
table after writing and reading 

   column1 column2
0      1.0       a
1      2.0       c
2      NaN       a

which is not correct for column 2.

I would expect this to fail when the table is set up, which is what happens if you replace

table = pa.Table.from_pydict(data, schema=schema)

with

dataframe = pd.DataFrame(data)
table = pa.Table.from_pandas(dataframe, schema=schema)
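
(In that case pyarrow raises an error during the pandas conversion, since that conversion path does validate the nullable flag against the data.)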

Component(s)

Python


adhardy commented May 27, 2024

Hi folks, has anyone had a chance to look at this issue? This is causing data corruption for us. If it is intended behaviour we can work around it, but it doesn't look like it should be.

p-ortmann changed the title from "Table.from_pydict does not validate nullables" to "Table.from_pydict casues data corruption when nulls passed to a non-nullable field" Dec 2, 2024
p-ortmann changed the title from "Table.from_pydict casues data corruption when nulls passed to a non-nullable field" to "Table.from_pydict causes data corruption when nulls passed to a non-nullable field" Dec 2, 2024

adhardy commented Dec 2, 2024

Hi, I'd like to try to draw attention to this bug again.

This is silently causing data corruption in a way that is easy to trigger accidentally and very hard to detect.

table before writing 

   column1 column2
0      1.0       a
1      2.0    None
2      NaN       c
table after writing and reading 

   column1 column2
0      1.0       a
1      2.0       c
2      NaN       a

The data in column2 is wrong: there should be a null, but instead the column drops it, the remaining values shift up, and the column starts repeating itself.

This still occurs in pyarrow 18.0.0.

assignUser added the "Critical Fix" label (bugfixes for security vulnerabilities, crashes, or invalid data) Dec 2, 2024
amoeba changed the title from "Table.from_pydict causes data corruption when nulls passed to a non-nullable field" to "[Python] Table.from_pydict causes data corruption when nulls passed to a non-nullable field" Dec 2, 2024

amoeba commented Dec 2, 2024

Hi @adhardy, apologies for this being missed. I've posted this to Zulip and will try to find someone to take a look soon.

pitrou changed the title from "[Python] Table.from_pydict causes data corruption when nulls passed to a non-nullable field" to "[C++][Parquet] Writing non-nullable field with nulls to Parquet generates invalid Parquet file" Dec 3, 2024
pitrou removed the "Critical Fix" label Dec 3, 2024

pitrou commented Dec 3, 2024

So, the problem is that the Arrow data is invalid (a non-nullable field has nulls), and the Parquet writer doesn't notice the inconsistency, ending up writing invalid data. This was already reported in #31329.

I'm not sure we want to do anything in the Parquet writer to avoid this. However, it would be nice if validation of such incorrect data actually failed: see #31387.
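
To illustrate the validation gap, a minimal sketch against the pyarrow versions discussed in this thread:

import pyarrow as pa

# A table whose data contradicts its schema: a null in a non-nullable field.
schema = pa.schema([pa.field('column2', pa.string(), nullable=False)])
table = pa.Table.from_pydict({'column2': ['a', None, 'c']}, schema=schema)

# Even full validation does not catch the nullability violation
# (the gap tracked in #31387), so this passes without raising.
table.validate(full=True)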


adhardy commented Dec 3, 2024

From a user's perspective, if I set nullable=False in my schema I expect some sort of failure, much as I would if I passed a mismatched type.

At the moment, if I pass a None/null to a non-nullable field, my data becomes corrupted and I get no warning about it. I would rather the None were written; at least then my data would be "correct".

This is probably less of a problem in statically typed languages, where something would likely have failed before reaching this point, but it is really easy to do in Python.

At present there is really no point in setting nullable=False from the Python API: it does not validate that there are no nulls, and it will simply corrupt my data if there are, so I may as well always leave nullable=True.
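
In the meantime we guard against this ourselves before writing. A minimal sketch (check_non_nullable_columns is our own helper, not a pyarrow API):

import pyarrow as pa

def check_non_nullable_columns(table: pa.Table) -> None:
    # Raise if any non-nullable top-level column actually contains nulls.
    # (This only inspects top-level columns, not nested fields.)
    for field in table.schema:
        null_count = table.column(field.name).null_count
        if not field.nullable and null_count > 0:
            raise ValueError(
                f"column {field.name!r} is non-nullable "
                f"but contains {null_count} null value(s)"
            )

Calling check_non_nullable_columns(table) before pq.write_table(...) turns the silent corruption into an explicit error.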

pitrou added a commit to pitrou/arrow that referenced this issue Dec 3, 2024
…t contains nulls

A non-nullable column that contains nulls would result in an invalid Parquet file.
pitrou added a commit that referenced this issue Dec 4, 2024
…ains nulls (#44921)

### Rationale for this change

A non-nullable column that contains nulls would result in an invalid Parquet file, so we'd rather raise an error when writing.

This detection is only implemented for leaf columns. Implementing it for non-leaf columns would be more involved, and also doesn't actually seem necessary.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Raising a clear error when trying to write invalid data to Parquet, instead of letting the Parquet writer silently generate an invalid file.

* GitHub Issue: #41667

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 19.0.0 milestone Dec 4, 2024

pitrou commented Dec 4, 2024

Issue resolved by pull request #44921.

@pitrou pitrou closed this as completed Dec 4, 2024

pitrou commented Dec 4, 2024

From a user's perspective, if I set nullable=False in my schema I expect some sort of failure, much as I would if I passed a mismatched type.

I agree with this, and we have not been consistent in enforcing it. With #44921, at least writing a Parquet file with such data will fail explicitly instead of producing a corrupt file.
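
Concretely, once 19.0.0 is out, the repro at the top of this issue should fail at write time instead of producing a corrupt file. A rough sketch of the expected behavior (the exact exception type and message may differ):

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([pa.field('column2', pa.string(), nullable=False)])
table = pa.Table.from_pydict({'column2': ['a', None, 'c']}, schema=schema)

try:
    pq.write_table(table, 'output.parquet')  # should now raise
except pa.ArrowInvalid as exc:
    print(exc)  # explicit error instead of a silently invalid file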
