
BUG: Parquet size grows exponentially for categorical data #55776

Closed
2 of 3 tasks
aseganti opened this issue Oct 31, 2023 · 13 comments · Fixed by #58245
Labels: Categorical, Docs, good first issue, IO Parquet

Comments

@aseganti

aseganti commented Oct 31, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import os

import pandas as pd

if __name__ == "__main__":
    for n in [10, 1e2, 1e3, 1e4, 1e5]:
        for n_col in [1, 10, 100, 1000, 10000]:
            # Build a DataFrame with one distinct value per row. Note that the
            # key "{i}" is a plain string, not an f-string, so a single column
            # is created regardless of n_col, which is why the output below
            # repeats for every n_col.
            df = pd.DataFrame([{"{i}": f"{i}_cat" for col in range(n_col)} for i in range(int(n))])
            # Write the first 100 rows with plain string (object) columns.
            df.iloc[0:100].to_parquet("a.parquet")
            # Convert every column to categorical and write the same 100 rows.
            for col in df.columns:
                df[col] = df[col].astype("category")
            df.iloc[0:100].to_parquet("b.parquet")
            # Compare the resulting file sizes in MB.
            a_size_mb = os.stat("a.parquet").st_size / (1024 * 1024)
            b_size_mb = os.stat("b.parquet").st_size / (1024 * 1024)
            print(f"{n} {n_col} {a_size_mb} {b_size_mb} {100 * b_size_mb / a_size_mb:.2f}")

Issue Description

It seems that when saving a data frame that contains a categorical column, the file size can grow exponentially.

This seems to happen because when we save categorical data to Parquet, we save the data plus all the categories defined on the column, even the categories that are no longer present in the data being written.
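
A minimal sketch of the underlying behavior (an illustration, not part of the original script): slicing or filtering a categorical column keeps the full category set, and that full set is what ends up in the file.

import pandas as pd

s = pd.Series([f"{i}_cat" for i in range(100_000)], dtype="category")
subset = s.iloc[:100]
print(len(subset))                 # 100 values
print(len(subset.cat.categories))  # still 100000 categories attached to the dtype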

To reproduce the bug, it is enough to run the script above.

Running it produces this output (columns: n, n_col, size of a.parquet in MB, size of b.parquet in MB, and the size ratio in %):

10 1 0.0015506744384765625 0.001689910888671875 108.98
10 10 0.0015506744384765625 0.001689910888671875 108.98
10 100 0.0015506744384765625 0.001689910888671875 108.98
10 1000 0.0015506744384765625 0.001689910888671875 108.98
10 10000 0.0015506744384765625 0.001689910888671875 108.98
100.0 1 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10 0.0019960403442382812 0.0021104812622070312 105.73
100.0 100 0.0019960403442382812 0.0021104812622070312 105.73
100.0 1000 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10000 0.0019960403442382812 0.0021104812622070312 105.73
1000.0 1 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 100 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 1000 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10000 0.0019960403442382812 0.0053577423095703125 268.42
10000.0 1 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 100 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 1000 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10000 0.0019960403442382812 0.042061805725097656 2107.26
100000.0 1 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 100 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 1000 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10000 0.0019960403442382812 0.43596935272216797 21841.71

Expected Behavior

In my opinion either:

  1. The two files should have (almost) the same size
  2. There should be a warning telling the user that such a difference in size is possible

Installed Versions

INSTALLED VERSIONS
------------------
commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.120+
Version : #1 SMP Wed Aug 30 11:19:59 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.4
pytest : 7.4.3
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader : 0.10.0
bs4 : 4.11.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.22
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

aseganti added the Bug and Needs Triage labels on Oct 31, 2023
@kamilkrukowski

I think you are pointing out that a DataFrame with a category column, where the number of categories is much larger than the number of unique values present, requires much more storage than an equivalent string column (which does not store the unused categories).

That should be expected behavior, and we would not want a save-to-parquet function to unnecessarily 'mutate' the frame and 'delete categories'. If you delete the unused categories yourself, the storage overhead should be negligible, or some reasonable constant due to the Parquet category dictionary lookup.

Is there anything else I'm missing here? Do the other Pandas 'save_to_format' functions delete unused categories? Is there a weird convention used here not used in other 'save_to_format' functions?

@aseganti
Author

I can tell you that as an "end user", I was not expecting the parquet file to be so big. This caused performance issues for us because we did not realize that something like this was possible.

In my opinion, to avoid this happening again, you could consider:

  1. Add a warning when saving a dataframe containing categories that are not present in the data.
  2. Make sure that the categories stay synchronized with the data: if, as the result of a filtering operation, some categories disappear, remove them from the list of categories.

but I don't know enough about pandas internals to understand whether any of this is feasible (a rough sketch of the first idea is shown below).
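
A rough, hypothetical sketch of what a user-side check along the lines of idea 1 could look like (warn_on_unused_categories is an invented helper here, not an existing pandas API):

import warnings

import pandas as pd

def warn_on_unused_categories(df: pd.DataFrame) -> None:
    # Hypothetical helper: warn if a categorical column carries categories
    # that no longer appear in the data about to be written.
    for col in df.select_dtypes(include="category").columns:
        n_unused = len(df[col].cat.categories) - df[col].nunique()
        if n_unused > 0:
            warnings.warn(
                f"Column {col!r} has {n_unused} unused categories; "
                "they will all be written to the Parquet file."
            )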

@rhshadrach
Member

rhshadrach commented Nov 16, 2023

Thanks for the report!

It seems that when saving a data frame that contains a categorical column, the file size can grow exponentially.

What is the meaning of exponentially?

The total universe of categories is part of the data that makes up a categorical column, and can impact the result of various operations (e.g. groupby with observed=False). This takes space to save. If I'm understanding the issue right, that's all we're seeing here. I think this is to be expected and a warning is not necessary.

You can always call remove_unused_categories prior to saving if you prefer.
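
For example, a minimal sketch of pruning right before writing (df here is a placeholder for the frame being saved):

for col in df.select_dtypes(include="category").columns:
    df[col] = df[col].cat.remove_unused_categories()
df.to_parquet("b.parquet")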

rhshadrach added the Categorical, IO Parquet, and Closing Candidate labels and removed the Needs Triage label on Nov 16, 2023
@aseganti
Author

Thanks for the report!

It seems that when saving a data frame that contains a categorical column, the file size can grow exponentially.

What is the meaning of exponentially?

The total universe of categories is part of the data that makes up a categorical column, and can impact the result of various operations (e.g. groupby with observed=False). This takes space to save. If I'm understanding the issue right, that's all we're seeing here. I think this is to be expected and a warning is not necessary.

You can always call remove_unused_categories prior to saving if you prefer.

I did not know about the remove_unused_categories function; I will use it in the future.

Regarding whether the warning is needed or not, I think we have a different understanding of "this is to be expected" :) Maybe because you know the internals of pandas, you were expecting this. I have used pandas for several years as a user, and I was not expecting my Parquet file to shrink from hundreds of MB to a couple of KB just by converting the categorical columns to strings before saving.

What about adding an option like remove_unused_categories to the to_FORMAT methods, so that at least this behavior is explicitly explained in the documentation?

@rhshadrach
Member

Regarding whether the warning is needed or not, I think we have a different understanding of "this is to be expected"

I think users expect to be able to "roundtrip" when possible. That is, saving something to disk and then loading again should give you the same thing. This isn't always possible (e.g. CSV does not store data types), but when it is, we should do it. That is what pandas is doing here.

If there are unimportant parts of an object that don't need to be saved, it is up to the user to prune them to their liking.
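
A small sketch of what roundtripping means here (assuming pyarrow is installed; the file name is a placeholder): the full category set, including unused values, should come back on read.

import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
pd.DataFrame({"col": pd.Series(["a"], dtype=dtype)}).to_parquet("roundtrip.parquet")
back = pd.read_parquet("roundtrip.parquet")
print(list(back["col"].cat.categories))  # expected: ['a', 'b', 'c'], unused 'b' and 'c' included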

What about adding an option like remove_unused_categories to the to_FORMAT methods, so that at least this behavior is explicitly explained in the documentation?

In general (but with exceptions), I am negative on taking df.method_a().method_b() and making it into df.method_b(use_method_a=[True|False]), and I believe that is what you are proposing to do here.

However I think a line in the Notes section of the docstring of to_parquet about unused categories makes sense and would be very welcome!

rhshadrach added the Docs and good first issue labels and removed the Bug and Closing Candidate labels on Nov 18, 2023
@alippai
Contributor

alippai commented Nov 21, 2023

I suspect this issue shares the same root cause: apache/arrow#38818
I understand they are slightly different and the dictionary has to roundtrip correctly; however, there might be some inefficiency.

@rhshadrach
Member

@alippai - are you saying the additional categories are using up more disk space than should be expected? Can you demonstrate this?

@alippai
Contributor

alippai commented Nov 21, 2023

Yes, I’ll need some time, it’s a busy week. I wouldn't expect it to be substantial if compression is used, but let me quantify this later this week or next week.

@knowhere616

Hi, I am a beginner contributor. Is this issue still unfixed, and if so, can I be assigned to it?

@rhshadrach
Member

@knowhere616 - indeed, this is still open, and the action is in the last sentence of #55776 (comment).

Also, in case you have not seen our contributor documentation, take a look here:

https://pandas.pydata.org/docs/dev/development/contributing.html#finding-an-issue-to-contribute-to

@abeltavares
Contributor

Please assign me to the issue.
It will be my first contribution.

@rhshadrach
Member

I linked to a section of the contributor documentation that has instructions for how to get the issue assigned to you. Let me know if anything isn't clear!

@abeltavares
Contributor

take
