Sheets parameter for the load processor #138
base: master
Conversation
roll commented on May 27, 2020
- fixes [dataflows] revisit loading multiple sheets with one load step (frictionlessdata/datapackage-pipelines#188)

It's my second take on this issue; the first attempt was #110. The new one adds a `sheets` parameter to the load processor instead.
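A minimal usage sketch of the proposed parameter; the name `sheets` and the regular-expression matching are assumptions drawn from the PR title and the discussion below, not confirmed API:

```python
from dataflows import Flow, load, printer

# Hypothetical: one load step that picks up every sheet whose name
# matches the given pattern (parameter semantics assumed, not confirmed).
Flow(
    load('path/to/excel/file.xlsx', sheets=r'Sheet.*'),
    printer(),
).process()
```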
I will take a look!
Hey @roll - Wouldn't it be better if there was a standard way to open a 'container' kind of file - for example, an Excel file or Google Spreadsheet with multiple sheets, or a ZIP file with multiple files? This implementation basically re-opens the Excel file for each of the sheets, reads a sample, infers a schema - and then checks to see if the sheet name adheres to the provided pattern.

I'm thinking we could have a generic class, similar to:

```python
>>> container = tabulator.Container('path/to/excel/file.xlsx')
>>> # OR
>>> container = tabulator.Container('path/to/archive/file.zip')
>>> for item in container.iter():
...     print(item.name)     # could be sheet name or filename in zipfile
...     print(item.options)  # dict of options to be used in Stream,
...                          # e.g. {"sheet": 1} or {"compression": "zip"} etc.
...     stream = item.stream(**other_options)  # returns a Stream object
```

Then you could do:

```python
FILENAME = 'path/to/excel/file.xlsx'

Flow(*[
    load(FILENAME, headers=1, name='res-%d' % idx, **item.options)
    for idx, item in enumerate(Container(FILENAME).iter())
]).process()
```
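As a rough illustration of the proposed interface, here is a minimal sketch of what such a class could look like. `tabulator` does not ship a `Container` class; the `Item` helper, the file-extension dispatch, and the option names are all assumptions:

```python
import zipfile
from dataclasses import dataclass, field

@dataclass
class Item:
    name: str                                    # sheet name or zip member name
    options: dict = field(default_factory=dict)  # extra options for Stream/load

class Container:
    """Hypothetical generic 'container' opener (not part of tabulator)."""

    def __init__(self, source):
        self.source = source

    def iter(self):
        # Open the workbook once and enumerate its sheets...
        if self.source.endswith('.xlsx'):
            import openpyxl
            workbook = openpyxl.load_workbook(self.source, read_only=True)
            for name in workbook.sheetnames:
                yield Item(name=name, options={'sheet': name})
        # ...or enumerate the members of a zip archive.
        elif self.source.endswith('.zip'):
            with zipfile.ZipFile(self.source) as archive:
                for name in archive.namelist():
                    yield Item(name=name, options={'compression': 'zip'})
```

Opening the source once and yielding per-part options would avoid the re-open/re-infer cost described above.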
@akariv Aside from implementation, what do you think the best API will be for DPP?
To answer the DPP question - off the top of my head:

Internally it will iterate on the different parts (using the container described above). Example: the resource name can be a slug of the filenames/sheetnames of the different parts combined (we can complicate it later).
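A minimal sketch of that idea in dataflows terms, reusing the hypothetical `Container` from above; the `load_batch` name, the slug scheme, and the helper are illustrative, not the actual DPP design:

```python
import re
from dataflows import Flow, load

def slugify(text):
    # Crude slug: lowercase, runs of non-alphanumerics collapsed to dashes.
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')

def load_batch(source):
    # One load step per part, each resource named after source + part name.
    return [
        load(source, name=slugify('%s-%s' % (source, item.name)), **item.options)
        for item in Container(source).iter()
    ]

# Flow(*load_batch('path/to/excel/file.xlsx')).process()
```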
@akariv
Our custom load processor already has something relatively akin to `load_batch` - it can take a comma-separated list of URLs or a regular-expression URL (for local or S3), and then it just generates a load step for each resulting URL. But if your implementation improves load times for multiple sheets within an xlsx file, I am happy to switch over to it. 👍
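For comparison, a sketch of that expansion step, using a glob pattern for the local case instead of a regular expression; the comma-splitting and the `expand_urls` helper are assumptions about the custom processor, not its actual code:

```python
from glob import glob
from dataflows import Flow, load

def expand_urls(spec):
    # A comma-separated list expands to its entries; anything else is
    # treated as a local glob pattern (the S3/regex case is omitted here).
    if ',' in spec:
        return [url.strip() for url in spec.split(',')]
    return sorted(glob(spec))

# One load step per resulting URL:
# Flow(*[load(url, headers=1) for url in expand_urls('data/*.csv')]).process()
```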