
BUG: Parquet size grows exponentially for categorical data #55776

Closed
2 of 3 tasks
aseganti opened this issue Oct 31, 2023 · 13 comments · Fixed by #58245
Labels: Categorical, Docs, good first issue, IO Parquet

Comments

@aseganti

aseganti commented Oct 31, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import os

import pandas as pd

if __name__ == "__main__":
    for n in [10, 1e2, 1e3, 1e4, 1e5]:
        for n_col in [1, 10, 100, 1000, 10000]:
            # Build a DataFrame with one distinct value per row. Note that the
            # key "{i}" is a plain string, not an f-string, so a single column
            # is created regardless of n_col, which is why the output below
            # repeats for every n_col.
            df = pd.DataFrame([{"{i}": f"{i}_cat" for col in range(n_col)} for i in range(int(n))])
            # Write the first 100 rows with plain string (object) columns.
            df.iloc[0:100].to_parquet("a.parquet")
            # Convert every column to categorical and write the same 100 rows.
            for col in df.columns:
                df[col] = df[col].astype("category")
            df.iloc[0:100].to_parquet("b.parquet")
            # Compare the resulting file sizes in MB.
            a_size_mb = os.stat("a.parquet").st_size / (1024 * 1024)
            b_size_mb = os.stat("b.parquet").st_size / (1024 * 1024)
            print(f"{n} {n_col} {a_size_mb} {b_size_mb} {100 * b_size_mb / a_size_mb:.2f}")

Issue Description

It seems that when saving a data frame that contains a categorical column, the file size can grow exponentially.

This seems to happen because when we save categorical data to Parquet, we save the data plus all the categories defined on the column, even the categories that are no longer present in the data being written.
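
A minimal sketch of the underlying behavior (an illustration, not part of the original script): slicing or filtering a categorical column keeps the full category set, and that full set is what ends up in the file.

import pandas as pd

s = pd.Series([f"{i}_cat" for i in range(100_000)], dtype="category")
subset = s.iloc[:100]
print(len(subset))                 # 100 values
print(len(subset.cat.categories))  # still 100000 categories attached to the dtype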

To reproduce the bug, it is enough to run the script above.

Running it produces this output (columns: n, n_col, size of a.parquet in MB, size of b.parquet in MB, and the size ratio in %):

10 1 0.0015506744384765625 0.001689910888671875 108.98
10 10 0.0015506744384765625 0.001689910888671875 108.98
10 100 0.0015506744384765625 0.001689910888671875 108.98
10 1000 0.0015506744384765625 0.001689910888671875 108.98
10 10000 0.0015506744384765625 0.001689910888671875 108.98
100.0 1 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10 0.0019960403442382812 0.0021104812622070312 105.73
100.0 100 0.0019960403442382812 0.0021104812622070312 105.73
100.0 1000 0.0019960403442382812 0.0021104812622070312 105.73
100.0 10000 0.0019960403442382812 0.0021104812622070312 105.73
1000.0 1 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 100 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 1000 0.0019960403442382812 0.0053577423095703125 268.42
1000.0 10000 0.0019960403442382812 0.0053577423095703125 268.42
10000.0 1 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 100 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 1000 0.0019960403442382812 0.042061805725097656 2107.26
10000.0 10000 0.0019960403442382812 0.042061805725097656 2107.26
100000.0 1 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 100 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 1000 0.0019960403442382812 0.43596935272216797 21841.71
100000.0 10000 0.0019960403442382812 0.43596935272216797 21841.71

Expected Behavior

In my opinion either:

  1. The two files should have (almost) the same size
  2. There should be a warning telling the user that such a difference in size is possible

Installed Versions

INSTALLED VERSIONS
------------------
commit : ba1cccd
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.120+
Version : #1 SMP Wed Aug 30 11:19:59 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.0
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.4
pytest : 7.4.3
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader : 0.10.0
bs4 : 4.11.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.22
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

aseganti added the Bug and Needs Triage labels on Oct 31, 2023
@kamilkrukowski

I think you are pointing out that a DataFrame with a category column, where the number of categories is much larger than the number of unique values present, requires much more storage than an equivalent string column (which does not store the unused categories).

That should be expected behavior, and we would not want a save-to-parquet function to unnecessarily 'mutate' the frame and 'delete categories'. If you delete the unused categories yourself, the storage overhead should be negligible, or some reasonable constant due to the Parquet category dictionary lookup.

Is there anything else I'm missing here? Do the other Pandas 'save_to_format' functions delete unused categories? Is there a weird convention used here not used in other 'save_to_format' functions?

@aseganti
Author

I can tell you that as an "end user", I was not expecting the parquet file to be so big. This caused performance issues for us because we did not realize that something like this was possible.

In my opinion, to avoid this happening again, you could consider:

  1. Add a warning when saving a dataframe containing categories that are not present in the data.
  2. Make sure that the categories stay synchronized with the data: if, as the result of a filtering operation, some categories disappear, remove them from the list of categories.

but I don't know enough about pandas internals to understand whether any of this is feasible (a rough sketch of the first idea is shown below).
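
A rough, hypothetical sketch of what a user-side check along the lines of idea 1 could look like (warn_on_unused_categories is an invented helper here, not an existing pandas API):

import warnings

import pandas as pd

def warn_on_unused_categories(df: pd.DataFrame) -> None:
    # Hypothetical helper: warn if a categorical column carries categories
    # that no longer appear in the data about to be written.
    for col in df.select_dtypes(include="category").columns:
        n_unused = len(df[col].cat.categories) - df[col].nunique()
        if n_unused > 0:
            warnings.warn(
                f"Column {col!r} has {n_unused} unused categories; "
                "they will all be written to the Parquet file."
            )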

@rhshadrach
Member

rhshadrach commented Nov 16, 2023

Thanks for the report!

It seems that when saving a data frame that contains a categorical column, the file size can grow exponentially.

What is the meaning of exponentially?

The total universe of categories is part of the data that makes up a categorical column, and can impact the result of various operations (e.g. groupby with observed=False). This takes space to save. If I'm understanding the issue right, that's all we're seeing here. I think this is to be expected and a warning is not necessary.

You can always call remove_unused_categories prior to saving if you prefer.
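
For example, a minimal sketch of pruning right before writing (df here is a placeholder for the frame being saved):

for col in df.select_dtypes(include="category").columns:
    df[col] = df[col].cat.remove_unused_categories()
df.to_parquet("b.parquet")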

rhshadrach added the Categorical, IO Parquet, and Closing Candidate labels and removed the Needs Triage label on Nov 16, 2023
@aseganti
Author

Thanks for the report!

It seems that when saving a data frame that contains a categorical column, the file size can grow exponentially.

What is the meaning of exponentially?

The total universe of categories is part of the data that makes up a categorical column, and can impact the result of various operations (e.g. groupby with observed=False). This takes space to save. If I'm understanding the issue right, that's all we're seeing here. I think this is to be expected and a warning is not necessary.

You can always call remove_unused_categories prior to saving if you prefer.

I did not know about the remove_unused_categories function; I will use it in the future.

Regarding whether the warning is needed or not, I think we have a different understanding of "this is to be expected" :) Maybe because you know the internals of pandas, you were expecting this. I have used pandas for several years as a user, and I was not expecting my Parquet file to shrink from hundreds of MB to a couple of KB just by converting the categorical columns to strings before saving.

What about adding an option like remove_unused_categories to the to_FORMAT methods, so that at least this behavior is explicitly explained in the documentation?

@rhshadrach
Member

Regarding whether the warning is needed or not, I think we have a different understanding of "this is to be expected"

I think users expect to be able to "roundtrip" when possible. That is, saving something to disk and then loading again should give you the same thing. This isn't always possible (e.g. CSV does not store data types), but when it is, we should do it. That is what pandas is doing here.

If there are unimportant parts of an object that don't need to be saved, it is up to the user to prune them to their liking.
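
A small sketch of what roundtripping means here (assuming pyarrow is installed; the file name is a placeholder): the full category set, including unused values, should come back on read.

import pandas as pd

dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
pd.DataFrame({"col": pd.Series(["a"], dtype=dtype)}).to_parquet("roundtrip.parquet")
back = pd.read_parquet("roundtrip.parquet")
print(list(back["col"].cat.categories))  # expected: ['a', 'b', 'c'], unused 'b' and 'c' included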

What about adding an option like remove_unused_categories to the to_FORMAT methods, so that at least this behavior is explicitly explained in the documentation?

In general (but with exceptions), I am negative on taking df.method_a().method_b() and making it into df.method_b(use_method_a=[True|False]), and I believe that is what you are proposing to do here.

However I think a line in the Notes section of the docstring of to_parquet about unused categories makes sense and would be very welcome!

rhshadrach added the Docs and good first issue labels and removed the Bug and Closing Candidate labels on Nov 18, 2023
@alippai
Contributor

alippai commented Nov 21, 2023

I suspect this issue shares the same root cause: apache/arrow#38818
I understand they are slightly different and the dictionary has to roundtrip correctly; however, there might be some inefficiency.

@rhshadrach
Member

@alippai - are you saying the additional categories are using up more disk space than should be expected? Can you demonstrate this?

@alippai
Contributor

alippai commented Nov 21, 2023

Yes, I’ll need some time, it’s a busy week. I wouldn't expect it to be substantial if compression is used, but let me quantify this later this week or next week.

@knowhere616

Hi, I am a beginner contributor. Is this issue still unfixed, and if so, can I be assigned to it?

@rhshadrach
Member

@knowhere616 - indeed, this is still open, and the action is in the last sentence of #55776 (comment).

Also, in case you have not seen our contributor documentation, take a look here:

https://pandas.pydata.org/docs/dev/development/contributing.html#finding-an-issue-to-contribute-to

@abeltavares
Contributor

Please assign me to the issue.
It will be my first contribution.

@rhshadrach
Member

I linked to a section of the contributor documentation that has instructions for how to get the issue assigned to you. Let me know if anything isn't clear!

@abeltavares
Contributor

take
