Where should code and scripts live? #25

nickreich · 2024-07-19T14:06:56Z

nickreich
Jul 19, 2024
Maintainer

As we were discussing setting up the forthcoming SARS-CoV-2 Variant Nowcast Hub yesterday, we ran across some guidelines in the hubverse documentation that did not fully jibe with our understanding of best practices. The question is where to include code/scripts that are used to create target data and/or scores for model output in the hub. The hubverse documentation currently says things like:

Generally, Hub file structure is intended primarily as a storage space for primary data. All other code and outputs related to model output validation, visualizations, reports, ensemble construction, etc., should be placed in repositories other than the primary Hub location.

and

To the extent possible, only the workflow definition files should be stored within the Hub file space, with any additional scripts or functionality residing in an external location

However, should this really be the recommendation, even for scripts that are needed to generate target data that live in the hub? The consensus of people on the call seemed to be that code for generating target data should live in the hub. And perhaps code for generating scores can/should also live in the hub (although we were less sure about this).

This seems at least partially related to ongoing conversations about how to manage automated calculations of score data.

But also just generally, it might be nice to be more specific about

in what cases (if any) code can/should be stored in the hub itself,
if code is allowable in a hub, whether we have a recommended storage location for it. E.g., should there be a separate scripts or code folder, or should the scripts/code live in the folder relevant to the data being generated?

Early on in the hubverse visioning, there were some folks who felt fairly strongly that only data should live in the repos. I was not one of them, but I just wanted to register that there were some strong voices about this viewpoint.

LucieContamin · 2024-07-19T14:31:37Z

LucieContamin
Jul 19, 2024
Maintainer

It's a good question, and I agree that it would be nice to have a clearer recommendation for that.

I might need to think about it more, but I think of two or three situations:
(A) a hub wants to have everything in one repo, which might be especially easy for hubs with small forecast/projection files
(B) a hub might prefer to have separate repositories for different functions, for example: a main hub for information, validation, and storing model output files; an "analysis" repo for the scores, etc.
and of course we can imagine a situation (C) that is a mix of (A) and (B).

So the code/script storage needs might differ between hubs. But, in any case, I don't think we should be restrictive about code, as it's often small files and it should not cause any issues if it is stored in an "expected" folder and its location is documented.
Also, for any hub using GitHub Actions, it might be easier to store the code in the hub.

But to answer your 2 points:

I think for all situations (A), (B), and (C) we should allow teams to store whatever code/scripts they want in the hub itself, with the caveat that if it is not stored in an expected place it might cause issues with Hubverse tools.
I think we should only be strict on where NOT to store the code, for example strongly recommending not to store code in the model-output folder, as it might cause issues with the hubverse functions. For SMH, we do have a code folder, storing code for the validation for example, but we also store scripts in the target-data folder so that the team can easily find them there and replicate the process to generate the target data with the associated data and documentation.

That's my first idea, but I might have missed something.

0 replies

harryhoch · 2024-07-19T14:49:58Z

harryhoch
Jul 19, 2024
Collaborator

I am basically in agreement with @LucieContamin, but I think there's an important broader question of philosophy here.

My perspective comes from some history of the development of Internet protocols, such as the Robustness Principle: "be conservative in what you do, be liberal in what you accept from others". In this view, the Hubverse should insist upon a relatively minimal set of required behaviors while still making suggestions as to what we think will work well. This can be enacted through the phrasing of requirement levels, such as those used in Internet protocol development (RFC 2119), which use phrases like "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", "MAY", and others to distinguish between what is required and what is not.

Thus if we believe that it's best not to include code in the hub file space, we might say that they "SHOULD NOT" do this, suggesting that "there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed..." (from the Internet protocol development requirement levels cited above).

As per Lucie's point, we could certainly say that they "MUST NOT" store code in the model-output folder.

0 replies

bsweger · 2024-07-19T15:42:46Z

bsweger
Jul 19, 2024
Collaborator

As usual, I really like @harryhoch's perspective (am thinking here of the Robustness Principle and prior comments about extensibility). I also agree with @LucieContamin's notes.

When considering a MUST NOT vs SHOULD NOT recommendation about where to store code/scripts, a factor to consider is the multiple functions of a Hub's repo:

A place for hub admins to define how a hub will operate
A user interface for participating modelers
A place for devs to write code for target data, evals, etc. [if we choose to do that]

The third item adds moving parts that potentially introduce confusion to the first two use cases. For example, if the test suite for the target data code suddenly begins failing on the day submissions are due, people will have a bad time.

We could certainly mitigate the likelihood of something like that happening (e.g., very targeted GitHub workflows), but there's a complexity penalty to mixing all of these concerns.

As the person developing the package for the Variant Nowcast target data, I'd prefer not to worry about impacting hub modelers when jumping in with a bug fix or other time-sensitive update.

That's obviously a single perspective about a single hub. But it's worth calling out as a "con" if we decide on a SHOULD NOT recommendation.

1 reply

nickreich Jul 22, 2024
Maintainer Author

This is a really helpful and specific example of how code could "get in the way" of hub operations.

elray1 · 2024-07-24T14:16:39Z

elray1
Jul 24, 2024
Maintainer

Emily suggested that it would be helpful to catalogue the kinds of code/functionality that hub administrators might create. Here's a go at that:

Creating/updating target data
Creating/updating auxiliary data
Computing scores
Creating hub config files (e.g., if round definitions are updated each week, there may be a script that updates tasks.json automatically)
Custom validation functions

Here's my take on where discussion on the devteam call landed (others should chime in if I'm not capturing the discussion correctly!):

We like Harry's ideas about being clear about what must/should not be done (see comment in this discussion thread above)
- we agree that it is fair to say that code must not be stored under model-output
- we don't think we can say code must not, or even should not, be stored in hub repos generally.
In addition to saying what must/should not be done, it could be helpful to make some recommendations for best practices for what should be done (these would likely not be enforceable). There seem to be a few ideas here and I'm not sure we landed on any particular agreement.
- code to do a thing could live in the same folder as its output. e.g., code to create target data could live in the same folder as the target data it creates
- code could all be located centrally in a folder such as code, src, R, or scripts.
- we might suggest that if code is "big", it be moved out of the hub repo.
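
To make the "must not be stored under model-output" rule concrete, here is a minimal sketch of a check that a hub could run, for example in CI. It is an illustration only, not an existing hubverse tool: the function name find_code_files and the set of file extensions treated as "code" are assumptions made up for this example, and Python is used purely for illustration.

```python
from pathlib import Path

# Assumption: these extensions stand in for "code"; a real hub would choose its own list.
CODE_EXTENSIONS = {".py", ".r", ".rmd", ".qmd", ".sh", ".jl", ".ipynb"}


def find_code_files(hub_root, data_dir="model-output"):
    """Return sorted hub-relative paths of code-like files found under data_dir."""
    base = Path(hub_root) / data_dir
    if not base.is_dir():
        # Nothing to check if the data directory does not exist.
        return []
    return sorted(
        path.relative_to(hub_root).as_posix()
        for path in base.rglob("*")
        if path.is_file() and path.suffix.lower() in CODE_EXTENSIONS
    )


if __name__ == "__main__":
    offenders = find_code_files(".")
    for offender in offenders:
        print(f"code file found in a data directory: {offender}")
    # A non-zero exit code lets a CI job treat this as a hard ("must not") failure.
    raise SystemExit(1 if offenders else 0)
```

The same function could be pointed at other data directories (e.g. `find_code_files(".", "target-data")`) if a hub decided to apply a softer "should not" there and only print a warning.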

3 replies

zkamvar Jul 24, 2024
Maintainer

TL;DR: dedicate a central folder for code and disallow code in data folders because it's better for the future

My 0.02 USD on this, coming from an organization where I spent a large amount of time disentangling old and active GitHub repositories with combined code and data.

code to do a thing could live in the same folder as its output. e.g., code to create target data could live in the same folder as the target data it creates

Given established guidelines for reproducible research, I would say that code SHOULD NOT live inside any of the default hubverse data directories.

The advantage of having code for generating targets live inside the targets directory is that the code does not have to worry about project paths. The disadvantages are many: it's much easier to accidentally bork code if it lives with the data (ask me how I know!), and, importantly, not having a standard place for code to live means that each hub will be built in a slightly different way. This seems fine now, but in five years, the hub structure may not be what it is today and migration will be very challenging (again, ask me how I know).

code could all be located centrally in a folder such as code, src, R, or scripts.

we might suggest that if code is "big", it be moved out of the hub repo.

These two points facilitate each other quite well because it becomes relatively straightforward to extract a single folder from the git repository and move it elsewhere (even with git history intact: https://www.pixelite.co.nz/article/extracting-file-folder-from-git-repository-with-full-git-history/).

We could provide guidelines and even code skeletons for the developers/administrators writing code on how to structure their code to be project-centric so that it will generate/validate data no matter where it lives.
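
As a rough sketch of what "project-centric" could look like, a script can locate the hub root by walking up from its own location until it finds a marker directory, and then build every path from that root. The helper name find_hub_root is hypothetical, and using a hub-config directory as the marker is an assumption for this example; Python is used here only for illustration.

```python
from pathlib import Path


def find_hub_root(start, marker="hub-config"):
    """Walk upward from start and return the first directory that contains marker."""
    current = Path(start).resolve()
    if current.is_file():
        current = current.parent
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no directory containing '{marker}' found above {start}")


def target_data_path(script_file, filename):
    """Build a path inside target-data relative to the hub root, not the working directory."""
    return find_hub_root(script_file) / "target-data" / filename
```

A script would call something like `target_data_path(__file__, "example.csv")` (the file name is a placeholder). Because paths are derived from the hub root rather than from the current working directory, the script keeps working if it is moved from, say, target-data/ to src/.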

nickreich Jul 25, 2024
Maintainer Author

I'm on board with making recommendations that

if code is included in the hub repo, it lives in a single folder, maybe suggesting it be named src, per recommendations of the first link in @zkamvar's reply above.
if code has the potential to disrupt/break other CI operations in the hub (e.g. validation of incoming submissions) then it be moved to another repo.

annakrystalli Jul 27, 2024
Maintainer

Adding my .02$ too.

R/ is best reserved for functions only (the rest of the suggestions are great for scripts), following standard R convention, which allows hub administrators to formally manage their functions and use R software engineering tools and practices, e.g. testing, documentation, distribution, etc. This somewhat feeds into the next point.
The validation workflow is generally independent of any code in the hub (it only depends on the hubValidations package and a call to validate_pr()). So there is no way code or other workflows in the hub can affect submission validation unless the code involves custom validation functions that are being called by validate_pr(). Problems with such functions could potentially cause validation issues, although those are also wrapped in try(), so they would return a failed EXEC ERROR for the custom checks instead of crashing the entire validation. That's kind of annoying too, but (as @zkamvar already mentioned) it doesn't really have anything to do with where the code lives (it could well be in an external custom package) and is more about code quality, which is generally improved by following point 1.
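
The try() behaviour described above can be illustrated with a small sketch (written in Python here rather than R). This is not the hubValidations implementation; run_custom_check, run_all_checks, and the result fields are hypothetical names chosen for the example. The idea is only that an exception inside a custom check is converted into a reported execution error instead of stopping the whole validation run.

```python
def run_custom_check(check, *args, **kwargs):
    """Run one custom check and report an execution error instead of raising."""
    name = getattr(check, "__name__", repr(check))
    try:
        passed = bool(check(*args, **kwargs))
    except Exception as err:
        # The check itself is broken: record that, but let the other checks run.
        return {"check": name, "status": "exec_error",
                "message": f"{type(err).__name__}: {err}"}
    return {"check": name, "status": "pass" if passed else "fail", "message": ""}


def run_all_checks(checks, submission):
    """Apply every check to the submission; one broken check cannot stop the rest."""
    return [run_custom_check(check, submission) for check in checks]
```

Whether the check function lives in the hub repo or in an external package makes no difference to this pattern, which matches the point that the failure mode is about code quality rather than code location.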

I would also suggest using/demonstrating/recommending functions + workflow pipelines with e.g. targets for common hub workflows (or perhaps even orderly; @zkamvar might have better insight on that).

nickreich · 2024-08-28T13:19:13Z

nickreich
Aug 28, 2024
Maintainer Author

We added additional documentation here to reflect the discussion above: https://hubverse.io/en/latest/user-guide/hub-structure.html

0 replies
