Replace anaconda-project downloads #444
Comments
Noting that the ability to download data while preparing the project/environment is useful, for instance, when an example is deployed. Otherwise, if an example requires a large dataset to be downloaded, the first user is going to have to wait a little too long :) No big deal but not great.
cc @jbednar as I am aware this is a topic you're thinking about these days.
A slightly outdated paper (5 years old) by the authors of pooch states that the only alternatives they knew of were fsspec and intake: https://github.com/fatiando/pooch/blob/main/paper/paper.md.
If we don't want to commit to Intake anymore and there's no tool replacing it that meets our needs, then I can imagine we could standardize something around a
Are we not going to be able to have this capability in conda-project? Would be good to discuss that with the conda-project developers and see what the right approach could be. Projects do generally need to have data or they won't be useful...
I don't know, other tools like uv, poetry, pixi don't have that built in. I'm not sure I want to push this feature request, feel free to do so! What I'm also uncomfortable with is the feeling that we're re-creating anaconda-project, and the idea of being locked into a tool (which so far has no users) because of a unique feature.
Data projects yes, but that's not the case for all application projects (e.g. a simple GUI), and usually not for library projects.
Sure, but uv, poetry, and pixi aren't specifically made for data projects like those in this repo, and conda-project is, so here I'm talking about data projects. Plus the number of data projects greatly outweighs the number of application projects. E.g. there are currently 10 million Jupyter Notebooks on GitHub, versus maybe some hundreds of thousands of libraries that get packaged up. So I'm concerned about having a good solution for data projects, whether that solution is in conda-project or via some other tool.
Really? There's no single mention of the
Application projects don't include libraries in my mind, but things like API, CLI, GUI, scripts, etc. I can't tell if there are more of them than data application projects, but yes for sure there are many data projects out there.
I'd also love to have a good solution for data projects. But as someone who got to maintain Examples for a little while, I wouldn't commit to a tool that makes it more difficult to maintain Examples (not well maintained, low adoption, etc.). In which case, I'd rather rely on something custom that can easily be migrated if need be.
Yes, really. :-) The conda-project README says:
The "other files" includes data; what else would that be? Then it links to my "8 Levels of Reproducibility", which was written about data projects, or at least notebooks or dashboards rather than libraries or APIs or CLIs. Then it says:
Which in turn says:
So sure, conda-project needs some better, clearer docs, but I consider it to be coming very clearly from a perspective of "package up some code with all the stuff needed to reproduce a result" rather than something like "I have written a library I want to share with other people who will then import it" or "I have written an end-user application that I want to publish on an app store".
Well, conda-project isn't something that came from heaven; it was written by some co-workers of yours, and so I think you can either (1) contribute to making it be something that meets your needs, (2) write something completely custom, or (3) find something that already meets your needs. I haven't seen (3) show up in this thread or elsewhere, and between 1 and 2 I'd vote for 1, since collaborating on a shared tool that we together make into something valuable seems much better than us developing some custom solution just for our narrow use case, which would mean something with even lower adoption and even worse maintenance.
I've opened an issue to ask about that feature on conda-project: conda-incubator/conda-project#176
What I want more than anything else is that, when we decide to migrate away from anaconda-project (or are forced to when it starts to break, e.g. with a new Python version), we pick a tool that is already widely used.
To be clear, "data project" does not necessarily imply that there is the ability to fetch data; it just means that we are expecting that most projects will somehow work with data. Fetching data is only crucial when datasets are much larger than the rest of the project such that it makes sense to treat them differently. So while I strongly consider conda-project to be about data projects primarily, whether it should have functionality about fetching data is a separate question best discussed at that issue.
anaconda-project has a handy feature that allows declaring a series of files to download (and optionally unzip) when preparing a project (see https://anaconda-project.readthedocs.io/en/latest/user-guide/reference.html#file-downloads). Some day we will need to replace anaconda-project with another tool (e.g. conda-project, pixi), none of which provide this feature at the moment. To prepare for this transition, we'll need to find an alternative way to download data.
Features we use:
filename: data
) for archives to unzip
Potential alternatives:
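For illustration only, here is a minimal sketch of what a custom, tool-independent replacement for the download-and-unzip step might look like, using only the Python standard library. The function name `fetch` and its parameters are hypothetical (they loosely mirror the `url`, `filename`, and `unzip` options of the anaconda-project feature described above); this is not an API of anaconda-project, conda-project, or pooch, and it assumes archives are zip files.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch(url, filename, unzip=False):
    """Download `url` to `filename` (hypothetical helper, not a library API).

    If `unzip` is true, the download is treated as a zip archive and is
    extracted into a directory named `filename` instead.
    """
    target = Path(filename)
    if target.exists():
        # Already prepared on a previous run: skip the download.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)

    # Download to a temporary ".part" file so that an interrupted
    # download never leaves a half-written target behind.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)

    if unzip:
        target.mkdir()
        with zipfile.ZipFile(tmp) as archive:
            archive.extractall(target)
        tmp.unlink()
    else:
        tmp.replace(target)
    return target
```

A project could call such a helper from a setup script run at deployment time, e.g. `fetch("https://example.com/bigdatafile.zip", "data", unzip=True)` (placeholder URL), so that the first user does not have to wait for the download. It does no checksum verification, which a real replacement would probably want.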