Remove pathogen-specific tools from base runtimes #7
Comments
Generally, I would love to have a way to use pathogen-specific Docker images in our workflows! That's been my dream since we added pango-learn to the base image back in the early pandemic. For the specific candidates you mentioned for removal, I can make some specific notes:
I would also recommend removing the pango-learn packages and their binary dependencies, gofasta and minimap2, since all of our pango annotations come from Nextclade now. |
This is the other half of workflows as programs, namely "the artifacts/bundling (keyword: buildpacks) side of things", no? (And yes, we should totally do this if it's at all feasible -- there are a number of times I haven't done something because I know it's going to be such a hassle / burden to make the needed dependency available to our runtimes.) |
Yes, precisely. The whole idea there is that instead of having runtimes and pathogens separately, we have pathogens that are (or contain) their runtimes. We want to avoid having N pathogens and N×M pathogen-runtimes and making the user match them. The implementation examples Victor gave (and things like ncov-ingest's image) are coming at this from what I'd call a more ad-hoc approach, and I do not think we should go down that path as a way to get to custom runtimes per pathogen. That way lies ecosystem fragmentation, and it incurs significant usability costs (to both users and developers, us and others). There are lots of considerations in this work. For example, our runtimes are not small when installed on disk. We're going to want to be able to share a concrete, installed base across pathogens (not just a conceptual base). We'll also want to consider the costs vs. benefits of moving something out of the base runtimes; it will have non-trivial overhead (both conceptual and actual) and we should only do it when it's worth it. I'm not convinced many of the candidates given above meet that threshold? What concretely are we gaining with the removal of each? |
Do you have examples? They would be very helpful both to guide eventual work on this topic and to suggest pain points we might be able to alleviate now with the current base runtimes. |
The one I was reminded of with this week's avian-flu work is nextstrain/avian-flu#80. There have been a bunch of others along the lines of "can't use this pip dependency, not in our runtimes" but I managed to find an alternate solution so it wasn't a dealbreaker. |
Thanks for the discussions! It's clear that there is work to be done on the dependencies/runtimes front. I don't think there is an urgent need to remove anything now, but it's nice to have discussions about this alongside the ongoing workflows as programs work. @tsibley has a good point with
I've updated the issue description to be more open-ended, noting the ad-hoc approach as the only known workaround in the absence of a better solution. |
I'm 99% sure that no one uses pangoLEARN anymore, especially not the deprecated versions we've pinned. Removing it would allow us to no longer wonder whether we need it anymore (and regain some valuable parking lots ;)). I just asked the original users of these tools on Slack whether they still use these tools. |
Dan Lu confirmed on Slack that they're no longer using pangoLEARN. I also realized that I made an issue for this last year. As @joverlee521 pointed out on that issue, we'll need to remove the pangoLEARN bits from ncov, too. We/I can continue further discussion of this on that issue. |
I can picture a bunch of gotchas here if suddenly I have to remember to use different specific non-default Docker images when I run I agree with Tom here
My strong sense is that the gain in reduced Docker image size from stripping these out would not be worth the developer hassle of maintaining multiple images and then the user hassle of knowing which image to use for which workflow. |
At last week's dev discussion on pathogen workflow improvements, we concluded that this is the proper solution. It is a long term solution that needs more thinking and prototyping. Until then, we can live with shared runtimes, even if they may be bloated. The short-sighted wording of this issue ("remove pathogen-specific tools from base runtimes") doesn't need to happen any time soon. Closing as unplanned. |
This applies to docker-base and conda-base.
Context
Our base image has accumulated various pathogen-specific tools over time, some of which significantly contribute to build time and image size. By removing these pathogen-specific tools, we can ensure the base image/environment reflects a continually updated version of Nextstrain tools and their dependencies. Using fauna as an example, more detailed reasoning is in nextstrain/fauna#170.
Candidates
Note
This seems like the right move for Fauna, but I'm not sure how far we want to take it. As we expand the number of core pathogens that rely on runtimes, the common base will only get smaller.
Workarounds
The ad-hoc approach of defining/creating custom runtimes by extending a base (examples: 1, 2, 3) has been used internally to some extent. The process is quite involved and not ideal for users at large.
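To make the ad-hoc approach concrete, here is a minimal sketch (not taken from the linked examples) of what "extending a base" amounts to: a small Dockerfile that starts from the shared image and layers one pathogen-specific dependency on top. The helper name `render_custom_runtime_dockerfile` and the package name `example-pathogen-tool` are hypothetical, and the sketch assumes `pip3` is available in the base image.

```python
def render_custom_runtime_dockerfile(base_image, pip_packages):
    """Return Dockerfile text that extends `base_image` with extra pip packages.

    Hypothetical helper for illustration only; assumes pip3 exists in the base image.
    """
    lines = [f"FROM {base_image}"]
    if pip_packages:
        # Sort for a stable, reproducible RUN line regardless of input order.
        lines.append(
            "RUN pip3 install --no-cache-dir " + " ".join(sorted(pip_packages))
        )
    return "\n".join(lines) + "\n"


# "example-pathogen-tool" is a placeholder, not a real dependency of any workflow.
dockerfile = render_custom_runtime_dockerfile(
    "nextstrain/base", ["example-pathogen-tool"]
)
print(dockerfile)
```

Each pathogen that goes this route ends up with its own image to build, publish, and keep in sync with the base, and users have to know which image belongs to which workflow.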