Inefficient conversions while iterating a dataframe #3634

Open
sagiahrac opened this issue Dec 25, 2024 · 2 comments
Labels
p2 Nice to have features performance

Comments

@sagiahrac
Contributor

sagiahrac commented Dec 25, 2024

Describe the bug

Iterating over the following dataframe with daft is over 1000x slower than converting the daft dataframe to pandas and iterating over the pandas dataframe instead.

To Reproduce

import numpy as np
import daft

np.random.seed(0)
n_rows = 1_000
list_size = 100_000
data = {"list": np.random.randint(0, 256, (n_rows, list_size), dtype=np.uint8)}
df = daft.from_pydict(data)

print("Iter with pandas:")
%timeit for row in df.to_pandas().itertuples(index=False): pass

print("Iter with daft:")
%timeit for x in df.iter_rows(): pass
Iter with pandas:
20.8 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Iter with daft:
32.2 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Expected behavior

I would expect iteration alone to be faster than conversion plus iteration.
Using a zero-copy view of the pyarrow array as a NumPy ndarray could resolve this issue, and potentially similar ones for structs:
https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I wonder if it could also work for tensors (using pyarrow.Tensor, or reshaping flattened lists), as these casts are quite slow too.
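The zero-copy path linked above can be sketched in a few lines. This is a minimal illustration with plain pyarrow and NumPy, not a description of daft's internals; it assumes a primitive-typed Arrow array with no nulls, which is the case where Arrow-to-NumPy conversion can avoid a copy:

import numpy as np
import pyarrow as pa

# A flat Arrow buffer viewed as a NumPy array without copying,
# then reshaped; reshape only rewrites metadata, so it is O(1).
flat = pa.array(np.arange(12, dtype=np.uint8))
view = flat.to_numpy(zero_copy_only=True)  # no copy for nulls-free primitives
tensors = view.reshape(3, 4)               # a view over the same buffer
assert np.shares_memory(view, tensors)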

Component(s)

Expressions, Python Runner

Additional context

It appears that cast operations are generally heavy, and at times only one CPU core seems to be utilized during them.

daft v0.4.0
python 3.10.13

@sagiahrac sagiahrac added bug Something isn't working needs triage labels Dec 25, 2024
@sagiahrac
Contributor Author

I'm not sure if this is directly related, but here's another example:

import daft
import tempfile
import numpy as np
import pyarrow.parquet as pq

n_rows = 1000
tensor_shape = (1000, 1000)  # 1MB per tensor

with tempfile.TemporaryDirectory() as tmpdir:
    data = {"tensor": np.random.randint(low=1, high=10,
                                        size=(n_rows, *tensor_shape), dtype=np.uint8)}
    
    daft.from_pydict(data).write_parquet(tmpdir)
    
    print("Reading with pyarrow:")
    %timeit pq.read_table(tmpdir)
    
    print("Reading from daft:")
    %timeit daft.read_parquet(tmpdir).collect()
Reading with pyarrow:
4.33 s ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Reading from daft:
21.1 s ± 9.86 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If tensors are stored as flattened lists (their physical type), reshaping could be an O(1) operation, since it would only require creating a new view of the existing data. That is why comparing the Arrow representation of the column against the daft representation seemed reasonable to me, but I might have missed something; please let me know if it doesn't make sense.
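The O(1) reshape claim above is easy to verify with NumPy alone: reshaping a contiguous flat buffer into a stack of tensors rewrites only the shape/stride metadata and shares memory with the original array. A small sketch (the sizes here are illustrative):

import numpy as np

# 10 tensors of shape (1000, 1000), stored as one flat uint8 buffer.
flat = np.zeros(10 * 1000 * 1000, dtype=np.uint8)
tensors = flat.reshape(10, 1000, 1000)  # metadata-only; no data copy
assert np.shares_memory(flat, tensors)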

@kevinzwang
Member

Hi @sagiahrac, I believe daft.DataFrame.iter_rows is slow for several reasons, including some you've pointed out:

  • we cannot vectorize the conversion, since it is done row by row
  • we have to cross the Python-Rust boundary for every row, which is costly in instructions, memory, and GIL acquisition
  • the daft executor may not be well optimized for this type of row-based operation (by the way @sagiahrac, are you using the local native executor, correct?)

I can't give you a definitive answer as to why your Pandas iteration is faster, but I would guess it is because of at least the last two reasons above. What are you using iter_rows for? Perhaps we can find a way to do it without a Python for loop.
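One columnar alternative to a per-row loop can be sketched with plain pyarrow and NumPy (the table contents here are illustrative, and this is a sketch of the general idea rather than daft's API): convert the column once, then slice views of the result, so Python object creation happens per batch rather than per row.

import numpy as np
import pyarrow as pa

# A toy table of 100 fixed-length lists of 10 ints.
table = pa.table({"list": pa.array([list(range(10))] * 100)})

# Row-by-row: every row materializes Python objects individually.
rows = [row["list"] for row in table.to_pylist()]

# Columnar: one vectorized conversion, then O(1) reshape into rows.
col = np.asarray(table["list"].combine_chunks().flatten()).reshape(100, 10)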

@kevinzwang kevinzwang added performance perf p2 Nice to have features and removed bug Something isn't working perf needs triage labels Jan 6, 2025