
Fix/investigate inconsistent partition size in fetching larger datasets #41

Draft
wants to merge 1 commit into main

Conversation

@phobson (Contributor) commented Dec 1, 2022

Looks into #40

@jrbourbeau (Member) left a comment

Hrm, maybe the failure you were seeing only happens sporadically? You might try bumping CI a few times to see if things consistently pass

@phobson (Contributor, Author) commented Dec 1, 2022

@jrbourbeau I left the xfail mark on the test parameter, so it "successfully failed" here. Sorry for that confusion.

I'm trying to (roughly) bisect where things fall apart. I'll push that as a parameter to the test and remove the xfail (for now).
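
For context, "successfully failed" refers to pytest's xfail mechanism: a parameter marked xfail is expected to fail, so the suite still reports success. A minimal, hypothetical sketch of such a marked parameter (not the actual test in this repo):

import pytest

# Hypothetical sketch of an xfail-marked test parameter; the marked
# case is expected to fail, so CI reports it as "successfully failed".
@pytest.mark.parametrize(
    "partition_size",
    [
        "10 MiB",
        pytest.param("2 MiB", marks=pytest.mark.xfail(reason="oversized batches")),
    ],
)
def test_partition_sizes(partition_size):
    ...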

@phobson (Contributor, Author) commented Dec 2, 2022

@jrbourbeau (cc @hayesgb)

I figured out what's going on here. Since we have so little control over the sizes of the batches that Snowflake returns, some of the individual batches are larger than the requested partition size. So the answer would be to split up those large batches.

Inside PDB during a failing test:

(Pdb) type(batch)
<class 'snowflake.connector.result_batch.ArrowResultBatch'>
(Pdb) print(f"{batch.rowcount=} but {target=}")
batch.rowcount=87977 but target=80565
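
To see why this happens, here's a minimal sketch of the greedy merging at play (hypothetical names, not the actual dask-snowflake code); sizes are in arbitrary byte units:

def greedy_partitions(batch_sizes, target_bytes):
    # Accumulate batches into the current partition until adding the
    # next one would push it past the target, then start a new one.
    partitions, current, current_size = [], [], 0
    for size in batch_sizes:
        if current and current_size + size > target_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        partitions.append(current)
    return partitions

# A batch bigger than the target lands in an oversized partition all
# by itself, because batches are merged but never split:
print(greedy_partitions([1, 2, 6, 1], target_bytes=5))  # [[1, 2], [6], [1]]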

So I see a few possible options here:

  1. Try to split up the large batches. That'll be complex because I think we'll have to materialize (fetch) the results to do so. That means, with the way the code is currently written, we'd have a mix of materialized and non-materialized results. Not ideal, but I'm happy to dive further down the rabbit hole if desired (see the rough sketch after this list).

  2. Accept the fact that we don't control the batch sizes, and that partition sizes will occasionally be wrong when the requested size is smaller than around 5 MiB (based on what I've seen so far). We could emit a warning that some partitions are larger than the user requested and nudge them towards the dask.dataframe.repartition docs.

  3. Something else I haven't thought of, but would be happy to hear about.
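
For option 1, here's a rough sketch of what splitting could look like, assuming we materialize oversized batches with to_pandas() (a hypothetical helper, not part of this PR):

import math

def maybe_split_batch(batch, target_rows):
    # Batches that already fit are passed through untouched, which is
    # exactly the materialized/non-materialized mix described above.
    if batch.rowcount <= target_rows:
        return [batch]  # still lazy; nothing fetched
    df = batch.to_pandas()  # materializes (fetches) the whole batch
    n_pieces = math.ceil(len(df) / target_rows)
    return [
        df.iloc[i * target_rows : (i + 1) * target_rows]
        for i in range(n_pieces)
    ]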

@phobson (Contributor, Author) commented Dec 2, 2022

Potential third option from me: relax the test a bit. Currently the check is:

assert (partition_sizes < 2 * parse_bytes("2 MiB")).all()

So we're already using a fudge factor of 2. Based on what I've noticed while digging into this, we could get away with something between 2.2 and 2.5.
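
For example, with a fudge factor of 2.5 the check would read:

assert (partition_sizes < 2.5 * parse_bytes("2 MiB")).all()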

@jrbourbeau (Member) left a comment

@phobson and I chatted a bit offline about this one and decided to go with adjusting the fudge factor in the test, plus a small additional note in the read_snowflake docstring about how we only attempt to merge Snowflake result batches, not split them.
