Add a GitHub action to sync data to S3 #11

bsweger · 2024-02-28T19:47:48Z

Resolves #10

First pass at an "official" version of the cloud sync workflow.

This version uses information from the admin config to get the name of the S3 bucket and is safe to use on hubs that are not yet cloud enabled.

There will be follow-up PRs if we end up using GitHub actions to trigger post-sync data actions (e.g., converting model-output data to parquet).

First pass at an "official" version of the cloud sync workflow. This version uses information from the admin config to get the name of the S3 bucket and is safe to use on hubs that are not yet cloud enabled.

bsweger · 2024-02-28T19:51:31Z

@annakrystalli this action grabs cloud values from admin.json using the names we were discussing here: hubverse-org/schemas#65 (comment)

I'll work on the schema PR next, but it's trivial to update this action if we end up using different names.

annakrystalli · 2024-02-29T14:09:16Z

This looks like a great start. I have two related questions:

I'm not clear from the process how authentication is handled exactly? Is this only set up once and then any change that happens in the repo is just synched without further authentication? If so, is that safe? Asking completely out of curiousity as I can't say I know best practice for these sort of situation.
The hubverse AWS account is hard-coded in the workflow. I'm assuming that's safe. More importantly, how easy would it be for folks who don't want to host on the hubverse account to set up synching to their own AWS account? I think this is an important use case we should be able to support. Totally get that this workflow can be a good first step towards that but I think it's a good time to already think about generalisation as, perhaps this can afffect the name of the action? E.g. perhaps we want the name of this particular action to include hubverse and aws to reflect the specificity involved in it?

bsweger · 2024-02-29T15:52:58Z

@annakrystalli Great questions, thanks!

AWS Authentication
We'll do a deeper dive into the authentication during the demo next week, but at a high-level:

We're operating under the assumption that each cloud-enabled hub will have its own S3 bucket and set of permissions that can write to it. This model makes managing security more straightforward and less risky while also making it easier for a hub admin to move their data.
The GitHub action is using an OpenID Connect (OIDC) provider to authenticate with AWS, which is the recommended auth method as it doesn't require the storage of long-term AWS credentials in GitHub secrets.
Our AWS GitHub OIDC provider grants GitHub temporary access to an AWS IAM role with permission to write to the hub's S3 bucket. The access will only be granted if the auth request comes from the main branch of the hub's designated repo.
Provisioning the AWS infrastructure that creates the bucket/permissions is handled by us (the Hubverse team) as part of a one-time cloud onboarding process.
The OIDC auth process is handled by the configure-aws-credentials action
There's more info in the cloud-infrastructure hub README

AWS Account Number
Yes, we're not considering an AWS account number to be a secret. Agree that we should have a path for people who want to provide their own cloud hosting. To do that using this proposed setup, a hub admin would need to:

change the AWS account number in the GitHub action
update the configure-aws-credentials step to authenticate (for example, an admin might choose to authenticate with longer-term AWS secrets instead of using OIDC)

We could choose to move the hard-coded Hubverse AWS account number into the admin.json schema as a default value. I chose to put it here instead because I didn't love the idea of tightly-coupling our AWS account number with the schema.

That said, if you think that providing the account # in the schema and reading it dynamically in the action will provide a smoother path, happy to make that change here and in the related schema PR!

hub-cloud-upload/hub-cloud-upload.yaml

Based on a conversation in the PR for the cloud admin.json updates, we decided to rename some cloud.host properties. This commit updates the GitHub action to reflect those changes.

annakrystalli · 2024-02-29T19:53:49Z

Thanks for the info!

I totally agree that hard wiring the account number into the schema is not ideal but I don't think we need to. The schema would just need to check that whatever value is given conforms to what is expected for an AWS account. It's something each hub would need to add to their admin.json but perhaps that could be part of onboarding. A bit manual which I don't love but at least is intentional.

Having said I think it's fine to keep it in this particular action but I think the name of the action should reflect that so users know this template is for syncing to a hubverse hosted hub in AWS.

Based on PR feedback, this commit renames the action: 1. to make it clear that it's AWS based 2. to make it clear that it operates on the Hubverse AWS account

bsweger · 2024-02-29T21:26:57Z

I totally agree that hard wiring the account number into the schema is not ideal but I don't think we need to. The schema would just need to check that whatever value is given conforms to what is expected for an AWS account. It's something each hub would need to add to their admin.json but perhaps that could be part of onboarding. A bit manual which I don't love but at least is intentional.

Having said I think it's fine to keep it in this particular action but I think the name of the action should reflect that so users know this template is for syncing to a hubverse hosted hub in AWS.

Got it--I see what you're saying about being intentional with the AWS account number. It's not clear how many people will want to use the cloud or will want to self-host (and where).

Until we get a better sense of that, great suggestion to rename the action to something that indicates the specific use case of sending data to a Hubverse-hosted AWS account (just pushed the commit).

annakrystalli · 2024-03-01T07:13:51Z

Awesome. This looks great. Happy for it to be merged.

annakrystalli

My concerns have been resolved so approving from my end

annakrystalli · 2024-02-29T13:59:17Z

hub-cloud-upload/README.md

+
+**Important**: The repo's write permissions are limited to the `main` branch. Running this action on another branch
+or on a fork will fail.
+


Important: The repo's write permissions are limited to the main branch. Running this action on another branch or on a fork will fail.

This is great (and of course necessary)! Out of curiosity, how to you enforce that restriction in the workflow?

Excellently-timed question--I've been working on some info for next week's dev meeting to talk through setup for cloud-enabled hubs (if you're on Confluence, you can see the WIP here).

In short, each hub gets an AWS bucket and an AWS IAM role that allows writing to the bucket. The role's trust policy states that it can only be used (assumed in AWS parlance) by a request originating from GitHub. That same policy further clarifies that the GitHub request must be associated with a specific repo and branch.

So at a high level:

The GitHub action tells AWS that it wants to "assume" a role. You can see that request in this line of hubverse-aws-upload.yaml.

AWS receives the request and check's the role's trust policy.

If everything checks out, AWS issues a temporary token to GitHub, which allows GitHub to use the role (this is why we don't need to store any AWS creds as GitHub secrets)

For example, this is the trust policy used by the test hubverse-infrastructure repo.

1. { 2. "Version": "2012-10-17", 3. "Statement": [ 4. { 5. "Effect": "Allow", 6. "Principal": { 7. "Federated": "arn:aws:iam::767397675902:oidc-provider/token.actions.githubusercontent.com" 8. }, 9. "Action": "sts:AssumeRoleWithWebIdentity", 10. "Condition": { 11. "StringEquals": { 12. "token.actions.githubusercontent.com:aud": "sts.amazonaws.com", 13. "token.actions.githubusercontent.com:sub": "repo:Infectious-Disease-Modeling-Hubs/hubverse-infrastructure:ref:refs/heads/main" 14. } 15. } 16. } 17. ] 18. }

line 7 says that the GitHub trusted identity provider in our AWS account (which I set up a few weeks ago) is eligible to request the use of this trust policy

line 9 lists the single action that is permitted under this trust policy: assuming the role via a web identity (in this case, the web identity is GitHub)

line 12 states that the AWS resource requested by GitHub must be AWS's STS (secure token service). The AWS aws-actions/configure-aws-credentials action sets this by default, so you won't see that coded explicitly in hubvserse-aws-upload.yaml)

line 13 says that the request from GitHub must originate from the main branch of Infectious-Disease-Modeling-Hubs/hubverse-infrastructure

hub-cloud-upload/hub-cloud-upload.yaml

Per the PR convo, sync optional hub directories (e.g., target-data) as well as the required directories. Additionally, add a note that specific directories can be excluded by removing them from the "hub_directories" list.

bsweger · 2024-03-01T21:10:21Z

I think we're in good shape for the first iteration of this new AWS sync workflow--merging so people can actually give it a spin.

Add a GitHub action to sync data to S3

1deef6e

First pass at an "official" version of the cloud sync workflow. This version uses information from the admin config to get the name of the S3 bucket and is safe to use on hubs that are not yet cloud enabled.

bsweger assigned annakrystalli Feb 28, 2024

bsweger commented Feb 29, 2024

View reviewed changes

hub-cloud-upload/hub-cloud-upload.yaml Outdated Show resolved Hide resolved

Incorporate schema updates

5f46209

Based on a conversation in the PR for the cloud admin.json updates, we decided to rename some cloud.host properties. This commit updates the GitHub action to reflect those changes.

Rename the action for clarity

f79acd1

Based on PR feedback, this commit renames the action: 1. to make it clear that it's AWS based 2. to make it clear that it operates on the Hubverse AWS account

annakrystalli approved these changes Mar 1, 2024

View reviewed changes

Also sync optional hub directories

cd9f7ad

Per the PR convo, sync optional hub directories (e.g., target-data) as well as the required directories. Additionally, add a note that specific directories can be excluded by removing them from the "hub_directories" list.

bsweger merged commit 6800362 into main Mar 1, 2024

bsweger deleted the bsweger/add-cloud-sync-action branch March 1, 2024 21:21

bsweger added this to the hubverse cloud sync milestone Apr 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a GitHub action to sync data to S3 #11

Add a GitHub action to sync data to S3 #11

bsweger commented Feb 28, 2024 •

edited

Loading

bsweger commented Feb 28, 2024

annakrystalli commented Feb 29, 2024

bsweger commented Feb 29, 2024 •

edited

Loading

annakrystalli commented Feb 29, 2024

bsweger commented Feb 29, 2024

annakrystalli commented Mar 1, 2024

annakrystalli left a comment

annakrystalli Feb 29, 2024

bsweger Mar 1, 2024

bsweger commented Mar 1, 2024


		Important: The repo's write permissions are limited to the `main` branch. Running this action on another branch
		or on a fork will fail.

Add a GitHub action to sync data to S3 #11

Add a GitHub action to sync data to S3 #11

Conversation

bsweger commented Feb 28, 2024 • edited Loading

bsweger commented Feb 28, 2024

annakrystalli commented Feb 29, 2024

bsweger commented Feb 29, 2024 • edited Loading

annakrystalli commented Feb 29, 2024

bsweger commented Feb 29, 2024

annakrystalli commented Mar 1, 2024

annakrystalli left a comment

Choose a reason for hiding this comment

annakrystalli Feb 29, 2024

Choose a reason for hiding this comment

bsweger Mar 1, 2024

Choose a reason for hiding this comment

bsweger commented Mar 1, 2024

bsweger commented Feb 28, 2024 •

edited

Loading

bsweger commented Feb 29, 2024 •

edited

Loading