BUG: pandas.DataFrame.to_parquet() causing memory leak #55296
Comments
Thanks. Do you get the same memory leak when you use pure pyarrow code?
I did not try to convert the DataFrame using pure pyarrow code. The conversion happens at line 190 in 824a273.
Played around with the code you linked for a bit, and it looks like the leak is caused exactly by line 190 in 824a273, i.e. the conversion of the pandas DataFrame into a PyArrow Table.
Even without writing the resulting table to a file, the leak occurs once the DataFrame is converted. Replacing the to_parquet() call in the Reproducible Example above with just the conversion, the memory usage at the 10000th iteration shows that the amount of memory leaked is basically equivalent.
Thanks for looking into it. If this is the case, it might be worth opening an issue in the Arrow repository about it.
Just did it, @mroeschke, thank you for the support!
Going to close this now, since it looks like it's an Arrow bug. Please feel free to ping if there's anything else we need to do on our end.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
pandas.DataFrame.to_parquet() causes a memory leak when engine='pyarrow' (the default option). Using another engine (e.g. engine='fastparquet') or outputting the same data in another format (e.g. pandas.DataFrame.to_json(); see write_to_json() in the Reproducible Example) avoids the memory leak.
The problem seems to be more pronounced on DataFrames containing nested structs. A sample problematic data schema and a compliant data generator are included in the Reproducible Example.
From the Reproducible Example above: memory usage of the pd.DataFrame.to_parquet() call, vs. the same code but setting engine='fastparquet' in pd.DataFrame.to_parquet(). [memory-usage plots omitted]
Expected Behavior
No memory leaks.
Installed Versions
pandas : 2.1.1
numpy : 1.26.0
pytz : 2021.3
dateutil : 2.8.2
setuptools : 65.6.3
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader : None
bs4 : 4.11.1
bottleneck : None
dataframe-api-compat: None
fastparquet : 0.8.3
fsspec : 2022.10.0
gcsfs : None
matplotlib : 3.7.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2022.10.0
scipy : 1.10.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None