
Sheets parameter for the load processor #138

Open · wants to merge 7 commits into master
Conversation

roll (Contributor, Author) commented May 27, 2020

Hi @akariv @cschloer,

It's my second take on this issue. The first attempt was #110

The new one uses tabulator's workbook_cache argument for XLSX, which allows downloading each workbook only once: https://github.com/frictionlessdata/tabulator-py#xlsxlsx-read--write

I think dataflows is a better place for high-level logic such as sheet management and cache control, so my current idea is that tabulator just needs to provide enough low-level tools for dataflows to manage.
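For context, a minimal sketch of how a shared workbook_cache dict can be passed to tabulator so that the same remote workbook is fetched only once across several sheet reads (the URL and sheet names below are placeholders, not part of this PR):

from tabulator import Stream

workbook_cache = {}  # shared across Stream instances
url = 'https://example.com/data.xlsx'  # placeholder URL

for sheet in ['Sheet1', 'Sheet2', 'Sheet3']:  # placeholder sheet names
    # Each Stream reuses the workbook already stored in workbook_cache,
    # so the remote file is downloaded only on the first iteration.
    with Stream(url, sheet=sheet, workbook_cache=workbook_cache) as stream:
        print(sheet, stream.headers)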

cschloer (Contributor) commented Jun 1, 2020

I will take a look!

akariv (Member) commented Jun 13, 2020

Hey @roll -

Wouldn't it be better if there were a standard way to open a 'container' kind of file - for example, an Excel file or Google Spreadsheet with multiple sheets, or a ZIP file with multiple files?
Then you would be able to iterate over them with a consistent API.

This implementation basically re-opens the Excel file for each of the sheets, reads a sample, infers a schema - and then checks whether the sheet name matches the sheets regular expression... we might have solved the re-downloading problem, but it's still very inefficient.

I'm thinking we could have a generic class similar to Stream, which would be Container:

>>> container = tabulator.Container('path/to/excel/file.xlsx')
# OR
>>> container = tabulator.Container('path/to/archive/file.zip')
>>> for item in container.iter():
...     print(item.name)     # could be a sheet name, or a filename inside the zipfile
...     print(item.options)  # dict of options to be used in Stream, e.g. {"sheet": 1} or {"compression": "zip"}
...     stream = item.stream(**other_options)  # returns a Stream object

then you could do:

FILENAME = 'path/to/excel/file.xlsx'
Flow(*[
    load(FILENAME, headers=1, name='res-%d' % idx, **item.options)
    for idx, item in enumerate(Container(FILENAME).iter())
]).process()

roll (Contributor, Author) commented Jun 14, 2020

@akariv
Yes sure, I agree. Once I have time I will do something like this (we already have some container concepts like Package or Storage; maybe we can use them, or maybe a new one). BTW, I'm not sure the currently used technique really adds noticeable overhead, as ZIP files should support proper random access.

Aside from implementation, what do you think the best API for DPP would be? For dataflows it's easy to imagine, but we also need it in a declarative form.

akariv (Member) commented Jun 14, 2020

To answer the DPP question, off the top of my head: we could have a load_batch processor (in dataflows and in DPP) with the following parameter:

  • selection - list of REs / predicate functions to control filename/sheet selection

Internally it will iterate over the different parts (using the tabulator.Collection object! 😄) and generate a bunch of load steps as necessary.

Example selections:

  • ["https://path/to/file.xls", "Expenses.+"]
    Select all expenses sheets from an excel file on the web
  • ["/path/to/file.zip", ".+\.csv"]
    Select all csv files in a zip file
  • ["https://path/to/file.zip", "report.+\.xlsx", "Raw Data"]
    Select the 'Raw Data' sheet in all report excel files in a zip file on the web

The resource name can be a slug of the filenames/sheet names of the different parts combined (we can complicate it later).
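As a rough illustration of the idea (not an actual API: the Container class and the slug helper below are hypothetical, following the proposal in this thread), load_batch could expand a selection into individual load steps roughly like this:

import re

from dataflows import Flow, load
from tabulator import Container  # proposed class from this thread; does not exist yet

def slug(name):
    # naive slug of a sheet/file name, used as the resource name
    return re.sub(r'[^a-z0-9]+', '-', name.lower()).strip('-')

def load_batch(selection, **load_options):
    """Expand a [source, pattern] selection into a list of load steps.

    Sketch only: handles a single nesting level (e.g. sheets of one workbook,
    or files of one archive), not the nested zip-of-excel-files case.
    """
    source, pattern = selection
    steps = []
    for item in Container(source).iter():
        if re.fullmatch(pattern, item.name):
            steps.append(load(source, name=slug(item.name),
                              **item.options, **load_options))
    return steps

# Usage (hypothetical):
# Flow(*load_batch(["https://path/to/file.xls", "Expenses.+"], headers=1)).process()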

roll (Contributor, Author) commented Jun 15, 2020

@akariv
Cool, that sounds good. I doubt I will have enough time to implement tabulator.Collection etc. before the end of the current stage of the BCO-DMO pilot, so I would suggest:

  • agreeing on the load_batch API (@cschloer please take a look and check whether it makes sense for you)
  • I will implement the first version of load_batch, probably using the current state of tabulator
  • we will switch to tabulator.Collection in the next iteration (implementation-wise only, so it will not be a breaking change for users)

WDYT?

cschloer (Contributor) commented:
Our custom load processor already has something relatively akin to "load_batch": it can take a comma-separated list of URLs, or a regular-expression URL (for local or S3 sources), and then it just generates a load step for each resulting URL. But if your implementation improves the load times for multiple sheets within an XLSX file, I am happy to switch over to it. 👍
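For reference, the "generate a load step per URL" approach described here could look roughly like the following sketch (not the actual BCO-DMO processor; the helper names are illustrative, and the regex/S3 matching is simplified to a local glob):

from glob import glob

from dataflows import Flow, load

def expand_urls(url_spec):
    """Turn a comma-separated list of URLs, or a local glob pattern,
    into a list of concrete URLs (regex/S3 matching omitted in this sketch)."""
    if ',' in url_spec:
        return [u.strip() for u in url_spec.split(',')]
    return sorted(glob(url_spec))

def make_load_steps(url_spec, **options):
    # one load step per resulting URL
    return [load(url, name='res-%d' % idx, **options)
            for idx, url in enumerate(expand_urls(url_spec))]

# Flow(*make_load_steps('data/*.csv', headers=1)).process()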

Successfully merging this pull request may close these issues.

[dataflows] revisit loading multiple sheets with one load step