BUG: Parquet size grows exponentially for categorical data #55776
Comments
I think you are pointing out that a DataFrame with a category column where the number of categories is much larger than the number of unique values requires much more storage than an equivalent string column (which does not store the unused categories). That should be expected behavior, and we would not want a save-to-parquet function to unnecessarily 'mutate' and 'delete categories'. If you delete the unused categories, the memory overhead should be negligible or some reasonable constant due to parquet's category dictionary encoding. Is there anything else I'm missing here? Do the other pandas 'save_to_format' functions delete unused categories? Is there a weird convention used here that is not used in other 'save_to_format' functions?
I can tell you that as an "end user", I was not expecting the parquet file to be so big. This caused performance issues for us because we did not realize that something like this was possible. In my opinion, to avoid this happening again, you could think of:
but I don't know enough about pandas internals to understand whether any of this is feasible.
Thanks for the report!
What is the meaning of exponentially? The total universe of categories is part of the data that makes up a categorical column, and can impact the result of various operations (e.g. groupby with observed=False). You can always call remove_unused_categories prior to saving if you prefer.
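For illustration, a minimal sketch of that workaround; the column name, category count, and file path are made up for the example:

```python
import pandas as pd

# A small amount of data carrying a very large category universe.
df = pd.DataFrame(
    {"col": pd.Categorical(["a", "b"], categories=["a", "b"] + [f"unused_{i}" for i in range(100_000)])}
)

# Drop the categories that never occur before writing; only the observed
# categories end up in the parquet dictionary.
df["col"] = df["col"].cat.remove_unused_categories()
df.to_parquet("trimmed.parquet")
```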
I did not know about remove_unused_categories. Regarding whether the warning is needed or not, I think we have a different understanding of "this is to be expected" :) Maybe since you know the internals of pandas, you were expecting this. I have used pandas for several years as a user, and I was not expecting my parquet file to shrink from hundreds of MB to a couple of KB just by converting the category to a string before saving it. What about adding an option to:
I think users expect to be able to "roundtrip" when possible. That is, saving something to disk and then loading again should give you the same thing. This isn't always possible (e.g. CSV does not store data types), but when it is, we should do it. That is what pandas is doing here. If there are unimportant parts of an object that don't need to be saved, it is up to the user to prune them to their liking.
In general (but with exceptions), I am negative on taking action automatically on the user's behalf here. However, I think a line in the documentation noting this behavior would be welcome.
I suspect this issue shares the same root cause: apache/arrow#38818
@alippai - are you saying the additional categories are using up more disk space than should be expected? Can you demonstrate this?
Yes, I’ll need some time; it’s a busy week. I wouldn't expect it to be substantial if compression is used, but let me quantify this later this week or next week.
Hi, I am a beginner contributor. Is this issue still unfixed, and if so, can I be assigned to it?
@knowhere616 - indeed, this is still open, and the action is in the last sentence of #55776 (comment). Also, in case you have not seen our contributor documentation, take a look here: https://pandas.pydata.org/docs/dev/development/contributing.html#finding-an-issue-to-contribute-to
Please assign me to the issue. |
I linked to a section of the contributor documentation that has instructions for how to get the issue assigned to you. Let me know if anything isn't clear! |
take |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
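A minimal sketch that reproduces the behavior described below, assuming a single categorical column with a large universe of mostly unused categories (the column name, category count, and file paths are illustrative):

```python
import os

import pandas as pd

# Two observed values, but a dtype that declares a million categories.
categories = [f"cat_{i}" for i in range(1_000_000)]
df = pd.DataFrame({"col": pd.Categorical(["cat_0", "cat_1"] * 1_000, categories=categories)})

# Write the categorical column as-is, and again after casting to string.
df.to_parquet("categorical.parquet")
df.astype({"col": str}).to_parquet("string.parquet")

print("categorical:", os.path.getsize("categorical.parquet"), "bytes")
print("string:     ", os.path.getsize("string.parquet"), "bytes")
```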
Issue Description
It seems that when saving a DataFrame with a categorical column, the file size can grow exponentially.
This seems to happen because when we save categorical data to parquet, we save the data plus all the categories defined on the column, even categories that never appear in the data itself.
To reproduce the bug, it is enough to run the script above.
That produces this output:
Expected Behavior
In my opinion either:
Installed Versions
pandas : 2.1.0
numpy : 1.23.5
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 3.0.4
pytest : 7.4.3
hypothesis : None
sphinx : 5.0.2
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader : 0.10.0
bs4 : 4.11.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : 2023.6.0
gcsfs : 2023.6.0
matplotlib : 3.7.1
numba : 0.56.4
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.2
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.3
sqlalchemy : 2.0.22
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.7.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None