
[Python] pyarrow.Table.from_pandas() causing memory leak #37989

Closed · RizzoV opened this issue Oct 3, 2023 · 10 comments
RizzoV commented Oct 3, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Issue Description

(continuing from pandas-dev/pandas#55296)

pyarrow.Table.from_pandas() causes a memory leak on DataFrames containing nested structs. A sample problematic data schema and a matching data generator are included in the Reproducible Example below.

From the Reproducible Example:

  • 1st pa.Table.from_pandas() call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74     91.9 MiB     91.9 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76     91.9 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
  • 2000th call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    140.1 MiB    140.1 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    140.1 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
  • 10000th call:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74    329.4 MiB    329.4 MiB           1   @profile
    75                                         def convert_df_to_table(df: pd.DataFrame):
    76    329.5 MiB      0.0 MiB           1       table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))

Reproducible Example

import os
import string
import sys
from random import choice, randint
from uuid import uuid4

import pandas as pd
import pyarrow as pa
from memory_profiler import profile

sample_schema = pa.struct(
    [
        ("a", pa.string()),
        (
            "b",
            pa.struct(
                [
                    ("ba", pa.list_(pa.string())),
                    ("bc", pa.string()),
                    ("bd", pa.string()),
                    ("be", pa.list_(pa.string())),
                    (
                        "bf",
                        pa.list_(
                            pa.struct(
                                [
                                    (
                                        "bfa",
                                        pa.struct(
                                            [
                                                ("bfaa", pa.string()),
                                                ("bfab", pa.string()),
                                                ("bfac", pa.string()),
                                                ("bfad", pa.float64()),
                                                ("bfae", pa.string()),
                                            ]
                                        ),
                                    )
                                ]
                            )
                        ),
                    ),
                ]
            ),
        ),
        ("c", pa.int64()),
        ("d", pa.int64()),
        ("e", pa.string()),
        (
            "f",
            pa.struct(
                [
                    ("fa", pa.string()),
                    ("fb", pa.string()),
                    ("fc", pa.string()),
                    ("fd", pa.string()),
                    ("fe", pa.string()),
                    ("ff", pa.string()),
                    ("fg", pa.string()),
                ]
            ),
        ),
        ("g", pa.int64()),
    ]
)


def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


@profile
def convert_df_to_table(df: pd.DataFrame) -> None:
    table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


def generate_random_data():
    return {
        "a": [generate_random_string(128)],
        "b": [
            {
                "ba": [generate_random_string(128) for i in range(50)],
                "bc": generate_random_string(128),
                "bd": generate_random_string(128),
                "be": [generate_random_string(128) for i in range(50)],
                "bf": [
                    {
                        "bfa": {
                            "bfaa": generate_random_string(128),
                            "bfab": generate_random_string(128),
                            "bfac": generate_random_string(128),
                            "bfad": randint(0, 2**32),
                            "bfae": generate_random_string(128),
                        }
                    }
                ],
            }
        ],
        "c": [randint(0, 2**32)],
        "d": [randint(0, 2**32)],
        "e": [generate_random_string(128)],
        "f": [
            {
                "fa": generate_random_string(128),
                "fb": generate_random_string(128),
                "fc": generate_random_string(128),
                "fd": generate_random_string(128),
                "fe": generate_random_string(128),
                "ff": generate_random_string(128),
                "fg": generate_random_string(128),
            }
        ],
        "g": [randint(0, 2**32)],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        convert_df_to_table(df)  # memory leak


if __name__ == "__main__":
    main()
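
To help distinguish allocator retention from a genuine reference leak, a rough diagnostic sketch (reusing `sample_schema` and `generate_random_data` from the example above, and assuming `psutil` is installed) is:

```python
# Diagnostic sketch: compare process RSS with what Arrow's memory pool
# reports. If RSS keeps growing while the pool stays flat, the retained
# memory is not held by Arrow buffers, which points at Python-level
# references being kept alive instead.
import pandas as pd
import psutil
import pyarrow as pa

try:
    pa.jemalloc_set_decay_ms(0)  # only has an effect if the jemalloc pool is available
except pa.ArrowNotImplementedError:
    pass

for i in range(2000):
    df = pd.DataFrame.from_dict(generate_random_data())
    pa.Table.from_pandas(df, schema=pa.schema(sample_schema))
    if (i + 1) % 500 == 0:
        rss_mib = psutil.Process().memory_info().rss / 2**20
        pool_mib = pa.total_allocated_bytes() / 2**20
        print(f"iter {i + 1}: RSS {rss_mib:.1f} MiB, Arrow pool {pool_mib:.1f} MiB")
```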

Installed Versions

INSTALLED VERSIONS
------------------
python              : 3.10.9.final.0
python-bits         : 64
OS                  : Darwin
OS-release          : 22.6.0
Version             : Darwin Kernel Version 22.6.0: Fri Sep 15 13:39:52 PDT 2023; root:xnu-8796.141.3.700.8~1/RELEASE_X86_64
machine             : x86_64
processor           : i386
byteorder           : little
LC_ALL              : None
LANG                : it_IT.UTF-8
LOCALE              : it_IT.UTF-8

pyarrow             : 13.0.0
pandas              : 2.1.1
numpy               : 1.26.0

Component(s)

Python

@jorisvandenbossche (Member)

@RizzoV thanks for the report and nice reproducer!

I can reproduce this running your example with memray:

[memray memory profile plot]

From the memray stats, it looks like the memory being held at the end mostly comes from the lists of strings, so somehow the conversion to Arrow seems to keep those list objects alive (I haven't yet looked at how that is possible, though). The pandas metadata conversion (the JSON dump) also seems to accumulate memory, although that is a bit strange (and I don't see it in the smaller reproducer below).

It seems to happen specifically when a list is nested inside another column (e.g. a struct of list), so I can also reproduce the observation with this simplified example:

import string
from random import choice

import pandas as pd
import pyarrow as pa


sample_schema = pa.struct(
    [
        ( "a", pa.struct([("aa", pa.list_(pa.string()))])),
    ]
)


def generate_random_string(str_length: int) -> str:
    return "".join(
        [choice(string.ascii_lowercase + string.digits) for n in range(str_length)]
    )


def generate_random_data():
    return {
        "a": [{"aa": [generate_random_string(128) for i in range(50)]}],
    }


def main():
    for i in range(10000):
        df = pd.DataFrame.from_dict(generate_random_data())
        # pa.jemalloc_set_decay_ms(0)
        table = pa.Table.from_pandas(df, schema=pa.schema(sample_schema))


if __name__ == "__main__":
    main()
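
One way to capture such a memray profile for this reproducer (a sketch; it assumes memray is installed and uses an arbitrary output file name, which can then be inspected with `memray flamegraph` or `memray stats`):

```python
# Sketch: wrap the reproducer in a memray Tracker so retained allocations
# can be inspected afterwards from the generated capture file.
import memray

with memray.Tracker("from_pandas_leak.bin"):
    main()
```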

@Ashokcs94

@RizzoV / @jorisvandenbossche: is there any solution for the memory leak in to_parquet()? We have also been facing this issue for a long time.

@RizzoV (Author)

RizzoV commented Dec 6, 2023

@Ashokcs94 no solution from my side, sadly; we still have to work around it.

@chunyang (Contributor)

chunyang commented Mar 7, 2024

I believe I found a fix for this in #40412, please take a look :)

jorisvandenbossche pushed a commit that referenced this issue Mar 15, 2024
…m Python list of dicts (#40412)

### Rationale for this change

When creating Arrow arrays using `pa.array` from lists of dicts, memory usage is observed to increase over time despite the created arrays going out of scope. The issue appears to only happen for lists of dicts, as opposed to lists of numpy arrays or other types.

### What changes are included in this PR?

This PR makes two changes to _python_to_arrow.cc_, to ensure that new references created by [`PyDict_Items`](https://docs.python.org/3/c-api/dict.html#c.PyDict_Items) and [`PySequence_GetItem`](https://docs.python.org/3/c-api/sequence.html#c.PySequence_GetItem) are properly reference counted via `OwnedRef`.
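
For illustration only (this snippet is not part of the PR): the effect of the missing decrefs can be observed from Python, because objects referenced through the leaked temporaries never become collectable:

```python
# Illustration sketch (not from the PR): on affected pyarrow versions the
# numpy array nested inside the dict is expected to stay reachable after
# conversion, because the temporary references taken during conversion are
# never released; with this fix the weak reference goes dead.
import gc
import weakref

import numpy as np
import pyarrow as pa

payload = np.zeros(1_000_000, dtype=np.uint8)
ref = weakref.ref(payload)
pa.array([{"data": payload}])
del payload
gc.collect()
print("payload collected:", ref() is None)  # expected: False before the fix, True after
```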

### Are these changes tested?

The change was tested against the following reproduction script:
```python
"""Repro memory increase observed when creating pyarrow arrays."""

# System imports
import logging

# Third-party imports
import numpy as np
import psutil
import pyarrow as pa

LIST_LENGTH = 5 * (2**20)
LOGGER = logging.getLogger(__name__)

def initialize_logging() -> None:
    logging.basicConfig(
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        level=logging.INFO,
    )

def get_rss_in_mib() -> float:
    """Return the Resident Set Size of the current process in MiB."""
    return psutil.Process().memory_info().rss / 1024 / 1024

def main() -> None:
    initialize_logging()

    for idx in range(100):
        data = np.random.randint(256, size=(LIST_LENGTH,), dtype=np.uint8)
        # data = "a" * LIST_LENGTH
        pa.array([{"data": data}])
        if (idx + 1) % 10 == 0:
            LOGGER.info(
                "%d dict arrays created, RSS: %.2f MiB", idx + 1, get_rss_in_mib()
            )

    LOGGER.info("---------")

    for idx in range(100):
        pa.array(
            [
                np.random.randint(256, size=(LIST_LENGTH,), dtype=np.uint8).tobytes(),
            ]
        )
        if (idx + 1) % 10 == 0:
            LOGGER.info(
                "%d non-dict arrays created, RSS: %.2f MiB", idx + 1, get_rss_in_mib()
            )

if __name__ == "__main__":
    main()
```

Prior to this change, the reproduction script produces the following output:
```
2024-03-07 23:14:17,560 - __main__ - INFO - 10 dict arrays created, RSS: 121.05 MiB
2024-03-07 23:14:17,698 - __main__ - INFO - 20 dict arrays created, RSS: 171.07 MiB
2024-03-07 23:14:17,835 - __main__ - INFO - 30 dict arrays created, RSS: 221.09 MiB
2024-03-07 23:14:17,971 - __main__ - INFO - 40 dict arrays created, RSS: 271.11 MiB
2024-03-07 23:14:18,109 - __main__ - INFO - 50 dict arrays created, RSS: 320.86 MiB
2024-03-07 23:14:18,245 - __main__ - INFO - 60 dict arrays created, RSS: 371.65 MiB
2024-03-07 23:14:18,380 - __main__ - INFO - 70 dict arrays created, RSS: 422.18 MiB
2024-03-07 23:14:18,516 - __main__ - INFO - 80 dict arrays created, RSS: 472.20 MiB
2024-03-07 23:14:18,650 - __main__ - INFO - 90 dict arrays created, RSS: 522.21 MiB
2024-03-07 23:14:18,788 - __main__ - INFO - 100 dict arrays created, RSS: 572.23 MiB
2024-03-07 23:14:18,789 - __main__ - INFO - ---------
2024-03-07 23:14:19,001 - __main__ - INFO - 10 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:19,211 - __main__ - INFO - 20 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:19,417 - __main__ - INFO - 30 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:19,623 - __main__ - INFO - 40 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:19,832 - __main__ - INFO - 50 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:20,047 - __main__ - INFO - 60 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:20,253 - __main__ - INFO - 70 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:20,499 - __main__ - INFO - 80 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:20,725 - __main__ - INFO - 90 non-dict arrays created, RSS: 567.61 MiB
2024-03-07 23:14:20,950 - __main__ - INFO - 100 non-dict arrays created, RSS: 567.61 MiB
```

After this change, the output changes to the following. Notice that the Resident Set Size (RSS) no longer increases as more Arrow arrays are created from lists of dicts.
```
2024-03-07 23:14:47,246 - __main__ - INFO - 10 dict arrays created, RSS: 81.73 MiB
2024-03-07 23:14:47,353 - __main__ - INFO - 20 dict arrays created, RSS: 76.53 MiB
2024-03-07 23:14:47,445 - __main__ - INFO - 30 dict arrays created, RSS: 82.20 MiB
2024-03-07 23:14:47,537 - __main__ - INFO - 40 dict arrays created, RSS: 86.59 MiB
2024-03-07 23:14:47,634 - __main__ - INFO - 50 dict arrays created, RSS: 80.28 MiB
2024-03-07 23:14:47,734 - __main__ - INFO - 60 dict arrays created, RSS: 85.44 MiB
2024-03-07 23:14:47,827 - __main__ - INFO - 70 dict arrays created, RSS: 85.44 MiB
2024-03-07 23:14:47,921 - __main__ - INFO - 80 dict arrays created, RSS: 85.44 MiB
2024-03-07 23:14:48,024 - __main__ - INFO - 90 dict arrays created, RSS: 82.94 MiB
2024-03-07 23:14:48,132 - __main__ - INFO - 100 dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,132 - __main__ - INFO - ---------
2024-03-07 23:14:48,229 - __main__ - INFO - 10 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,324 - __main__ - INFO - 20 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,420 - __main__ - INFO - 30 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,516 - __main__ - INFO - 40 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,613 - __main__ - INFO - 50 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,710 - __main__ - INFO - 60 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,806 - __main__ - INFO - 70 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:48,905 - __main__ - INFO - 80 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:49,009 - __main__ - INFO - 90 non-dict arrays created, RSS: 87.84 MiB
2024-03-07 23:14:49,108 - __main__ - INFO - 100 non-dict arrays created, RSS: 87.84 MiB
```

When this change is tested against the reproduction script provided in #37989 (comment), the reported memory increase is no longer observed.

I have not added a unit test, but it may be possible to add one similar to the reproduction scripts used above, provided there's an accurate way to capture process memory usage on all the platforms that Arrow supports, and provided memory usage is not affected by concurrently running tests. If this code could be tested under valgrind, that may be an even better way to go.
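
For illustration, such a test might look roughly like this (a sketch only, not part of the PR; it assumes psutil is available, the RSS threshold is arbitrary, and no memory-heavy tests run concurrently):

```python
# Hypothetical regression test sketch; the threshold would need tuning for CI.
import psutil
import pyarrow as pa


def test_array_from_list_of_dicts_does_not_grow_rss():
    def rss_mib() -> float:
        return psutil.Process().memory_info().rss / 2**20

    data = [{"data": b"x" * (1 << 20)}]  # one dict holding a 1 MiB payload
    for _ in range(10):  # warm-up so allocators reach a steady state
        pa.array(data)
    baseline = rss_mib()
    for _ in range(100):
        pa.array(data)
    # Before the fix, RSS grew roughly linearly with the iteration count.
    assert rss_mib() - baseline < 50
```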

### Are there any user-facing changes?

* GitHub Issue: #37989

Authored-by: Chuck Yang <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
jorisvandenbossche added this to the 16.0.0 milestone Mar 15, 2024
@jorisvandenbossche (Member)

Issue resolved by pull request #40412

galipremsagar pushed a commit to galipremsagar/arrow that referenced this issue Apr 15, 2024
…ay from Python list of dicts (apache#40412)

galipremsagar pushed a commit to galipremsagar/arrow that referenced this issue Apr 15, 2024
…ay from Python list of dicts (apache#40412)

deadlycoconuts added a commit to caraml-dev/merlin that referenced this issue Sep 23, 2024
# Description
It seems like there is a [memory leak](apache/arrow#37989) in older `pyarrow` versions that was only [fixed recently](https://github.com/apache/arrow/commits/apache-arrow-16.0.0?after=6a28035c2b49b432dc63f5ee7524d76b4ed2d762+174) in `pyarrow==16.0.0`. Since we have `pyarrow` pinned to `<11.0.0` within the batch predictor, this PR loosens that restriction and simply allows versions up to the latest release, `pyarrow<=17.0.0`, instead.

# Modifications
- `python/batch-predictor/requirements.txt` - Loosened the pinned `pyarrow` version

@nicksilver

nicksilver commented Nov 4, 2024

Not sure if I should open a new bug report, but I am having the same memory leak issue when writing a list of strings (not dicts) to Parquet from pandas (pyarrow 18.0.0). It seems like a similar issue and probably needs a similar fix.

@DatSplit

DatSplit commented Nov 4, 2024

@nicksilver I suspect I'm having the same, or a similar, issue right now.

@amoeba (Member)

amoeba commented Nov 18, 2024

Hi @nicksilver or @DatSplit, a new issue would be good. What would be most helpful in that issue would be a minimal reproducer like the one in #37989 (comment).

@kr-hansen

@nicksilver or @DatSplit did either of you open a new issue? I see #44295 as a minimal reproducer case, but that was before your posts here. I'm also hitting some type of memory leak related to this. Do we need another open issue, or is that one from October sufficient to dig in on?

@nicksilver

Hi @kr-hansen and @amoeba, I did not open a new issue, although I believe the leak still exists. My workaround was to convert my list of strings to a list of dictionaries to comply with this fix; when I did this the memory leak was gone. I believe that if you used the same minimal reproducer from this issue but saved a list of strings instead of dicts, you would get the leak. I'm probably not the best person to open a new issue (I am not much of a developer); this was just something I stumbled upon while trying to reformat some JSON files. A rough sketch of that workaround is shown below (the column and key names are placeholders):
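
```python
# Workaround sketch: wrap each string in a one-key dict before writing, so the
# conversion goes through the list-of-dicts path fixed in #40412.
# "strings" and "value" are placeholder names.
import pandas as pd

df = pd.DataFrame({"strings": [["a", "b"], ["c"]]})
df["strings"] = df["strings"].apply(lambda items: [{"value": s} for s in items])
df.to_parquet("out.parquet", engine="pyarrow")
```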
