Describe the bug
Iterating over the following dataframe with daft is over 1000x slower than converting the daft dataframe to pandas and iterating over the pandas dataframe instead.
To Reproduce
```python
import numpy as np
import daft

np.random.seed(0)

n_rows = 1_000
list_size = 100_000
data = {"list": np.random.randint(0, 256, (n_rows, list_size), dtype=np.uint8)}
df = daft.from_pydict(data)

print("Iter with pandas:")
%timeit for row in df.to_pandas().itertuples(index=False): pass
print("Iter with daft:")
%timeit for x in df.iter_rows(): pass
```
```
Iter with pandas:
20.8 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Iter with daft:
32.2 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Expected behavior
I would expect iteration alone to be faster than conversion + iteration.
Using a pyarrow array view as a numpy ndarray could resolve that issue and potentially similar ones for structs: https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy
I wonder if it could also work for tensors (using pyarrow.Tensor or reshaping flattened lists), as these casts are quite slow too.
Component(s)
Expressions, Python Runner
Additional context
It appears that cast operations are typically heavy, yet at times only one CPU core seems to be utilized during these operations.
daft v0.4.0
python 3.10.13
I'm not sure if this is directly related, but here's another example:
```python
import daft
import tempfile
import numpy as np
import pyarrow.parquet as pq

n_rows = 1000
tensor_shape = (1000, 1000)  # 1MB per tensor

with tempfile.TemporaryDirectory() as tmpdir:
    data = {"tensor": np.random.randint(low=1, high=10,
                                        size=(n_rows, *tensor_shape), dtype=np.uint8)}
    daft.from_pydict(data).write_parquet(tmpdir)
    print("Reading with pyarrow:")
    %timeit pq.read_table(tmpdir)
    print("Reading from daft:")
    %timeit daft.read_parquet(tmpdir).collect()
```
```
Reading with pyarrow:
4.33 s ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Reading from daft:
21.1 s ± 9.86 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
If tensors are stored as flattened lists (physical types), reshaping could potentially be an O(1) operation, as it might only require creating a new view of the existing data. This is why the comparison between the arrow representation of the column and the daft representation seemed reasonable to me, but I might have missed something - please let me know if it doesn’t make sense.
Hi @sagiahrac, I believe that daft.DataFrame.iter_rows is slow because of several reasons, including some you've pointed out:
- we cannot vectorize the conversion, since it is done row by row
- we have to cross the Python-Rust boundary for every row, which is costly in instructions, memory, and GIL acquisition
- the daft executor may not be well optimized for this type of row-based operation (BTW @sagiahrac, you are using the local native executor, correct?)
I can't give you a definitive answer as to why your pandas iteration is faster, but I would guess it is because of at least the last two reasons above. What are you using iter_rows for? Perhaps we can find a way to do it without a Python for loop.