Inefficient conversions while iterating a dataframe #3634

Open
sagiahrac opened this issue Dec 25, 2024 · 2 comments
Labels
p2 Nice to have features performance

Comments

@sagiahrac
Contributor

sagiahrac commented Dec 25, 2024

Describe the bug

Iterating over the following dataframe with daft is over 1000x slower than converting the daft dataframe to pandas and iterating over the pandas dataframe instead.

To Reproduce

import numpy as np
import daft

np.random.seed(0)
n_rows = 1_000
list_size = 100_000
data = {"list": np.random.randint(0, 256, (n_rows, list_size), dtype=np.uint8)}
df = daft.from_pydict(data)

print("Iter with pandas:")
%timeit for row in df.to_pandas().itertuples(index=False): pass

print("Iter with daft:")
%timeit for x in df.iter_rows(): pass
Iter with pandas:
20.8 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Iter with daft:
32.2 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Expected behavior

I would expect iteration alone to be faster than conversion plus iteration.
Using a zero-copy view of the pyarrow array as a NumPy ndarray could resolve this issue, and potentially similar ones for structs:
https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I wonder if it could also work for tensors (using pyarrow.Tensor, or reshaping flattened lists), as these casts are quite slow too.
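The zero-copy path linked above can be sketched in a few lines. This is a minimal illustration with plain pyarrow and NumPy, not a description of daft's internals; it assumes a primitive-typed Arrow array with no nulls, which is the case where Arrow-to-NumPy conversion can avoid a copy:

import numpy as np
import pyarrow as pa

# A flat Arrow buffer viewed as a NumPy array without copying,
# then reshaped; reshape only rewrites metadata, so it is O(1).
flat = pa.array(np.arange(12, dtype=np.uint8))
view = flat.to_numpy(zero_copy_only=True)  # no copy for nulls-free primitives
tensors = view.reshape(3, 4)               # a view over the same buffer
assert np.shares_memory(view, tensors)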

Component(s)

Expressions, Python Runner

Additional context

It appears that cast operations are generally heavy, and at times only one CPU core seems to be utilized during them.

daft v0.4.0
python 3.10.13

@sagiahrac sagiahrac added bug Something isn't working needs triage labels Dec 25, 2024
@sagiahrac
Contributor Author

I'm not sure if this is directly related, but here's another example:

import daft
import tempfile
import numpy as np
import pyarrow.parquet as pq

n_rows = 1000
tensor_shape = (1000, 1000)  # 1MB per tensor

with tempfile.TemporaryDirectory() as tmpdir:
    data = {"tensor": np.random.randint(low=1, high=10,
                                        size=(n_rows, *tensor_shape), dtype=np.uint8)}
    
    daft.from_pydict(data).write_parquet(tmpdir)
    
    print("Reading with pyarrow:")
    %timeit pq.read_table(tmpdir)
    
    print("Reading from daft:")
    %timeit daft.read_parquet(tmpdir).collect()
Reading with pyarrow:
4.33 s ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Reading from daft:
21.1 s ± 9.86 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If tensors are stored as flattened lists (their physical type), reshaping could be an O(1) operation, since it would only require creating a new view of the existing data. That is why comparing the Arrow representation of the column against the daft representation seemed reasonable to me, but I might have missed something; please let me know if it doesn't make sense.
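The O(1) reshape claim above is easy to verify with NumPy alone: reshaping a contiguous flat buffer into a stack of tensors rewrites only the shape/stride metadata and shares memory with the original array. A small sketch (the sizes here are illustrative):

import numpy as np

# 10 tensors of shape (1000, 1000), stored as one flat uint8 buffer.
flat = np.zeros(10 * 1000 * 1000, dtype=np.uint8)
tensors = flat.reshape(10, 1000, 1000)  # metadata-only; no data copy
assert np.shares_memory(flat, tensors)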

@kevinzwang
Member

Hi @sagiahrac, I believe daft.DataFrame.iter_rows is slow for several reasons, including some you've pointed out:

  • we cannot vectorize the conversion, since it is done row by row
  • we have to cross the Python-Rust boundary for every row, which is costly in instructions, memory, and GIL acquisition
  • the daft executor may not be well optimized for this type of row-based operation (by the way @sagiahrac, are you using the local native executor, correct?)

I can't give you a definitive answer as to why your Pandas iteration is faster, but I would guess it is because of at least the last two reasons above. What are you using iter_rows for? Perhaps we can find a way to do it without a Python for loop.
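One columnar alternative to a per-row loop can be sketched with plain pyarrow and NumPy (the table contents here are illustrative, and this is a sketch of the general idea rather than daft's API): convert the column once, then slice views of the result, so Python object creation happens per batch rather than per row.

import numpy as np
import pyarrow as pa

# A toy table of 100 fixed-length lists of 10 ints.
table = pa.table({"list": pa.array([list(range(10))] * 100)})

# Row-by-row: every row materializes Python objects individually.
rows = [row["list"] for row in table.to_pylist()]

# Columnar: one vectorized conversion, then O(1) reshape into rows.
col = np.asarray(table["list"].combine_chunks().flatten()).reshape(100, 10)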

@kevinzwang kevinzwang added performance perf p2 Nice to have features and removed bug Something isn't working perf needs triage labels Jan 6, 2025