
Fix/investigate inconsistent partition size in fetching larger datasets #41

Draft
wants to merge 1 commit into main

Conversation

@phobson (Contributor) commented Dec 1, 2022

Looks into #40

@jrbourbeau (Member) left a comment

Hrm, maybe the failure you were seeing only happens sporadically? You might try bumping CI a few times to see if things consistently pass

@phobson (Contributor, Author) commented Dec 1, 2022

@jrbourbeau I left the xfail mark on the test parameter, so it "successfully failed" here. Sorry for that confusion.

I'm trying to (roughly) bisect where things fall apart. I'll push that as a parameter to the test and remove the xfail (for now).
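
For context, "successfully failed" refers to pytest's xfail mechanism: a parameter marked xfail is expected to fail, so the suite still reports success. A minimal, hypothetical sketch of such a marked parameter (not the actual test in this repo):

import pytest

# Hypothetical sketch of an xfail-marked test parameter; the marked
# case is expected to fail, so CI reports it as "successfully failed".
@pytest.mark.parametrize(
    "partition_size",
    [
        "10 MiB",
        pytest.param("2 MiB", marks=pytest.mark.xfail(reason="oversized batches")),
    ],
)
def test_partition_sizes(partition_size):
    ...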

@phobson (Contributor, Author) commented Dec 2, 2022

@jrbourbeau (cc @hayesgb)

I figured out what's going on here. Since we have so little control over the sizes of the batches that Snowflake returns, some of the individual batches are larger than the requested partition size. So the answer would be to split up those large batches.

Inside PDB during a failing test:

(Pdb) type(batch)
<class 'snowflake.connector.result_batch.ArrowResultBatch'>
(Pdb) print(f"{batch.rowcount=} but {target=}")
batch.rowcount=87977 but target=80565
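
To see why this happens, here's a minimal sketch of the greedy merging at play (hypothetical names, not the actual dask-snowflake code); sizes are in arbitrary byte units:

def greedy_partitions(batch_sizes, target_bytes):
    # Accumulate batches into the current partition until adding the
    # next one would push it past the target, then start a new one.
    partitions, current, current_size = [], [], 0
    for size in batch_sizes:
        if current and current_size + size > target_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        partitions.append(current)
    return partitions

# A batch bigger than the target lands in an oversized partition all
# by itself, because batches are merged but never split:
print(greedy_partitions([1, 2, 6, 1], target_bytes=5))  # [[1, 2], [6], [1]]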

So I see a few possible options here:

  1. Try to split up the large batches. That'll be complex because I think we'll have to materialize (fetch) the results to do so. That means, with the way the code is currently written, we'd have a mix of materialized and non-materialized results. Not ideal, but I'm happy to dive further down the rabbit hole if desired (see the rough sketch after this list).

  2. Accept the fact that we don't control the batch sizes, and that partition sizes will occasionally be wrong when the requested size is smaller than around 5 MiB (based on what I've seen so far). We could emit a warning that some partitions are larger than the user requested and nudge them towards the dask.dataframe.repartition docs.

  3. Something else I haven't thought of, but would be happy to hear about.
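
For option 1, here's a rough sketch of what splitting could look like, assuming we materialize oversized batches with to_pandas() (a hypothetical helper, not part of this PR):

import math

def maybe_split_batch(batch, target_rows):
    # Batches that already fit are passed through untouched, which is
    # exactly the materialized/non-materialized mix described above.
    if batch.rowcount <= target_rows:
        return [batch]  # still lazy; nothing fetched
    df = batch.to_pandas()  # materializes (fetches) the whole batch
    n_pieces = math.ceil(len(df) / target_rows)
    return [
        df.iloc[i * target_rows : (i + 1) * target_rows]
        for i in range(n_pieces)
    ]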

@phobson (Contributor, Author) commented Dec 2, 2022

Potential third option from me: relax the test a bit. Currently the check is:

assert (partition_sizes < 2 * parse_bytes("2 MiB")).all()

So we're already using a fudge factor of 2. Based on what I've noticed while digging into this, we could get away with something between 2.2 and 2.5.
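
For example, with a fudge factor of 2.5 the check would read:

assert (partition_sizes < 2.5 * parse_bytes("2 MiB")).all()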

@jrbourbeau (Member) left a comment

@phobson and I chatted a bit offline about this one and decided to go with adjusting the fudge factor in the test, plus a small additional note in the read_snowflake docstring about how we only attempt to merge Snowflake result batches, not split them.
