Sheets parameter for the load processor #138
base: master
Conversation
roll commented on May 27, 2020
- fixes [dataflows] revisit loading multiple sheets with one load step (frictionlessdata/datapackage-pipelines#188)

It's my second take on this issue; the first attempt was #110. The new one adds a `sheets` parameter to the load processor instead.
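A minimal usage sketch of the proposed parameter; the name `sheets` and the regular-expression matching are assumptions drawn from the PR title and the discussion below, not confirmed API:

```python
from dataflows import Flow, load, printer

# Hypothetical: one load step that picks up every sheet whose name
# matches the given pattern (parameter semantics assumed, not confirmed).
Flow(
    load('path/to/excel/file.xlsx', sheets=r'Sheet.*'),
    printer(),
).process()
```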
I will take a look!
Hey @roll - Wouldn't it be better if there was a standard way to open a 'container' kind of file - for example, an Excel file or Google Spreadsheet with multiple sheets, or a ZIP file with multiple files? This implementation basically re-opens the Excel file for each of the sheets, reads a sample, infers a schema - and then checks to see if the sheet name adheres to the provided pattern.

I'm thinking we could have a generic class, similar to:

```python
>>> container = tabulator.Container('path/to/excel/file.xlsx')
>>> # OR
>>> container = tabulator.Container('path/to/archive/file.zip')
>>> for item in container.iter():
...     print(item.name)     # could be sheet name or filename in zipfile
...     print(item.options)  # dict of options to be used in Stream,
...                          # e.g. {"sheet": 1} or {"compression": "zip"} etc.
...     stream = item.stream(**other_options)  # returns a Stream object
```

Then you could do:

```python
FILENAME = 'path/to/excel/file.xlsx'

Flow(*[
    load(FILENAME, headers=1, name='res-%d' % idx, **item.options)
    for idx, item in enumerate(Container(FILENAME).iter())
]).process()
```
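As a rough illustration of the proposed interface, here is a minimal sketch of what such a class could look like. `tabulator` does not ship a `Container` class; the `Item` helper, the file-extension dispatch, and the option names are all assumptions:

```python
import zipfile
from dataclasses import dataclass, field

@dataclass
class Item:
    name: str                                    # sheet name or zip member name
    options: dict = field(default_factory=dict)  # extra options for Stream/load

class Container:
    """Hypothetical generic 'container' opener (not part of tabulator)."""

    def __init__(self, source):
        self.source = source

    def iter(self):
        # Open the workbook once and enumerate its sheets...
        if self.source.endswith('.xlsx'):
            import openpyxl
            workbook = openpyxl.load_workbook(self.source, read_only=True)
            for name in workbook.sheetnames:
                yield Item(name=name, options={'sheet': name})
        # ...or enumerate the members of a zip archive.
        elif self.source.endswith('.zip'):
            with zipfile.ZipFile(self.source) as archive:
                for name in archive.namelist():
                    yield Item(name=name, options={'compression': 'zip'})
```

Opening the source once and yielding per-part options would avoid the re-open/re-infer cost described above.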
@akariv Aside from implementation, what do you think the best API will be for DPP?
To answer the DPP question - off the top of my head:

Internally it will iterate on the different parts (using the container described above). Example: the resource name can be a slug of the filenames/sheetnames of the different parts combined (we can complicate it later).
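A minimal sketch of that idea in dataflows terms, reusing the hypothetical `Container` from above; the `load_batch` name, the slug scheme, and the helper are illustrative, not the actual DPP design:

```python
import re
from dataflows import Flow, load

def slugify(text):
    # Crude slug: lowercase, runs of non-alphanumerics collapsed to dashes.
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')

def load_batch(source):
    # One load step per part, each resource named after source + part name.
    return [
        load(source, name=slugify('%s-%s' % (source, item.name)), **item.options)
        for item in Container(source).iter()
    ]

# Flow(*load_batch('path/to/excel/file.xlsx')).process()
```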
@akariv
Our custom load processor already has something relatively akin to `load_batch` - it can take a comma-separated list of URLs or a regular-expression URL (for local or S3), and then it just generates a load step for each resulting URL. But if your implementation improves load times for multiple sheets within an xlsx file, I am happy to switch over to it. 👍
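For comparison, a sketch of that expansion step, using a glob pattern for the local case instead of a regular expression; the comma-splitting and the `expand_urls` helper are assumptions about the custom processor, not its actual code:

```python
from glob import glob
from dataflows import Flow, load

def expand_urls(spec):
    # A comma-separated list expands to its entries; anything else is
    # treated as a local glob pattern (the S3/regex case is omitted here).
    if ',' in spec:
        return [url.strip() for url in spec.split(',')]
    return sorted(glob(spec))

# One load step per resulting URL:
# Flow(*[load(url, headers=1) for url in expand_urls('data/*.csv')]).process()
```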