Consolidate block fetch requests. #1733

groutr · 2024-10-22T19:18:55Z

Implement the TODO to consolidate consecutive block fetch requests into a single fetch request.

Adapted from the itertools example found here: https://docs.python.org/2.6/library/itertools.html#examples

martindurant · 2024-10-22T19:21:39Z

Would you mind explaining more completely what this does? It also makes that block of code look more complex (to me), so some comments inline would be appreciated.

groutr · 2024-10-22T19:59:04Z

Python's groupby works by splitting based on a change in the value of a key function. The chosen key computes the difference between an ascending range and the block numbers (which are also necessarily sorted). If the that difference changes, we know that there was a "hole" in the block numbers (ie, we already have the block cached).

Assume a file of 5 blocks, [0, 1, 2, 3, 4] where we have already cached block 2. The caller requests a byte range that covers blocks 0, 1, 3.

need = (i for i in block_range if i not in self.blocks)   # lazily evaluated to [0, 1, 3]
groupby(enumerate(need), key=lambda x: x[0]-x[1])
    Enumerating need pairs the needed blocks with an ascending range
    (0, 0), (1, 1), (2, 3)  # Note that the actual block numbers are the second values.
    Applying the key we get values: 0, 0, 1
    Groupby splits a new group on each key value change so we end up with [(0, 0), (1, 1)] and [(2, 3)] groups.
Since we enumerated need, we need to extract the second values from each group to get our blocks, 
so we map itemgetter(1) over the group.
For the first group, we have blocks [0, 1], so we compute the offset for beginning of block 0 and the end of block 1.
One call is made to fetch for both blocks 0 and 1.

The second group is just a single block, 3. One call is made to fetch for that block.

I'll add an explanation via comments.

martindurant · 2024-10-22T20:01:33Z

You might be interested, that something similar happens in https://github.com/fsspec/filesystem_spec/blob/master/fsspec/utils.py#L528 (where we don't have blocks, but we do have byte ranges and possible gaps).

groutr · 2024-10-22T20:39:08Z

@martindurant Thank you for pointing my attention to that function. I'm not sure if the approach here is directly applicable there. I'll have to spend more time reasoning thru it.

groutr · 2024-10-24T17:00:35Z

@martindurant Would you prefer that I instead use merge_offset_ranges in this PR?

martindurant · 2024-10-24T17:03:15Z

Would you prefer that I instead use merge_offset_ranges

Only if it's obvious how. I am not in a good position to make that choice, so long as you have considered the possibility.

groutr · 2024-10-25T06:28:58Z

@martindurant I played around with an implementation and observed that merge_offset_ranges does not sufficiently preserve enough info to adequately update the cache metadata. Once the byte ranges are merged, we lose the info about which blocks those ranges represent and it becomes much harder to update the cache metadata.

groutr added 2 commits October 22, 2024 13:16

Consolidate block fetch requests.

27af35d

Add missing parenthesis

f110c08

groutr added 3 commits October 22, 2024 13:26

Lazy evaluation of needed blocks.

ce10ae9

More readable expression for needed blocks.

af23c7d

Count the blocks already fetched.

2102882

Add explanatory comments.

78cf39b

Remove unused import.

6e549fd

lint

ee9cb51

martindurant merged commit 61b66b6 into fsspec:master Oct 28, 2024
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate block fetch requests. #1733

Consolidate block fetch requests. #1733

groutr commented Oct 22, 2024 •

edited

Loading

martindurant commented Oct 22, 2024

groutr commented Oct 22, 2024

martindurant commented Oct 22, 2024

groutr commented Oct 22, 2024

groutr commented Oct 24, 2024

martindurant commented Oct 24, 2024

groutr commented Oct 25, 2024

Consolidate block fetch requests. #1733

Consolidate block fetch requests. #1733

Conversation

groutr commented Oct 22, 2024 • edited Loading

martindurant commented Oct 22, 2024

groutr commented Oct 22, 2024

martindurant commented Oct 22, 2024

groutr commented Oct 22, 2024

groutr commented Oct 24, 2024

martindurant commented Oct 24, 2024

groutr commented Oct 25, 2024

groutr commented Oct 22, 2024 •

edited

Loading