Support file wildcards in GSProcessing inputs #1107

Open
thvasilo opened this issue Dec 6, 2024 · 0 comments
Labels
good first issue (Good for newcomers), gsprocessing (For issues and PRs related to the GSProcessing library)

Comments

@thvasilo
Contributor

thvasilo commented Dec 6, 2024

Currently, GSProcessing reads files by passing the path the user provides in the config directly to a spark.read.parquet/csv(filepath) call. Spark doesn't support wildcards when used like this, but GConstruct does support filepath wildcards.

To ensure better compatibility between the two, we should support wildcards for S3 paths in GSProcessing as well. One option is to use boto to list all files under the parent path, apply the wildcard rule, and then pass the resulting list of files to the input.

This can happen at config parsing time.
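
A rough sketch of what the boto-based option could look like, to make the idea concrete. Everything here is hypothetical and not existing GSProcessing code: the function names `list_s3_keys` and `expand_s3_wildcard`, the assumption that inputs use the `s3://` scheme, and the choice of `fnmatch`-style patterns (which may not match GConstruct's wildcard semantics exactly). The lister is passed in as a parameter so the matching logic can be exercised without touching S3.

```python
import fnmatch
from typing import Callable, Iterable, List


def list_s3_keys(bucket: str, prefix: str) -> Iterable[str]:
    """Yield every object key under `prefix`, paginating through the listing."""
    import boto3  # imported lazily so the matching logic below works without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def expand_s3_wildcard(
    path: str,
    list_keys: Callable[[str, str], Iterable[str]] = list_s3_keys,
) -> List[str]:
    """Expand a wildcard S3 path into a sorted list of concrete S3 paths.

    Paths that are not s3:// URIs, or that contain no wildcard, are
    returned unchanged as a single-element list.
    """
    if not path.startswith("s3://"):
        return [path]
    bucket, _, key_pattern = path[len("s3://"):].partition("/")

    # Position of the first wildcard character, if any.
    positions = [i for i in (key_pattern.find(c) for c in "*?[") if i >= 0]
    if not positions:
        return [path]

    # List only keys that share the literal part of the pattern.
    prefix = key_pattern[: min(positions)]

    # Note: fnmatch's "*" also matches "/", so a pattern can span "directories".
    matches = sorted(
        key
        for key in list_keys(bucket, prefix)
        if fnmatch.fnmatchcase(key, key_pattern)
    )
    return [f"s3://{bucket}/{key}" for key in matches]
```

At config parsing time, the expanded list could then be handed to the reader, since `spark.read.parquet(*paths)` and `spark.read.csv(paths)` both accept multiple paths. An empty result probably deserves an explicit error at that point rather than being passed on to Spark.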

@thvasilo added the good first issue and gsprocessing labels Dec 6, 2024