Add a GitHub action to sync data to S3 #11
Conversation
First pass at an "official" version of the cloud sync workflow. This version uses information from the admin config to get the name of the S3 bucket and is safe to use on hubs that are not yet cloud enabled.
@annakrystalli this action grabs cloud values from the admin config. I'll work on the schema PR next, but it's trivial to update this action if we end up using different names.
This looks like a great start. I have two related questions:
@annakrystalli Great questions, thanks!

**AWS Authentication**

**AWS Account Number**

We could choose to move the hard-coded Hubverse AWS account number into the admin config. That said, if you think that providing the account # in the schema and reading it dynamically in the action will provide a smoother path, I'm happy to make that change here and in the related schema PR!
Based on a conversation in the PR for the cloud admin.json updates, we decided to rename some cloud.host properties. This commit updates the GitHub action to reflect those changes.
Thanks for the info! I totally agree that hard-wiring the account number into the schema is not ideal, but I don't think we need to. The schema would just need to check that whatever value is given conforms to what is expected for an AWS account. It's something each hub would need to add to their admin config. Having said that, I think it's fine to keep it in this particular action, but I think the name of the action should reflect that, so users know this template is for syncing to a Hubverse-hosted hub in AWS.
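To illustrate the schema-side check discussed above: an AWS account ID is exactly 12 digits, so "conforms to what is expected for an AWS account" boils down to a simple pattern test. A minimal sketch (the function name is illustrative, not part of the actual schema or action):

```python
import re

# A valid AWS account ID is exactly 12 digits. A JSON-schema check would
# express the same rule as: {"type": "string", "pattern": "^[0-9]{12}$"}
ACCOUNT_ID_PATTERN = re.compile(r"^[0-9]{12}$")

def is_valid_aws_account_id(value: str) -> bool:
    """Return True if `value` looks like a well-formed AWS account ID."""
    return bool(ACCOUNT_ID_PATTERN.match(value))

print(is_valid_aws_account_id("767397675902"))  # 12 digits -> True
print(is_valid_aws_account_id("12345"))         # too short -> False
```

Note this only validates the format; it can't confirm the account actually exists, which is one reason keeping a known-good value in the action is defensible.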
Based on PR feedback, this commit renames the action:
1. to make it clear that it's AWS based
2. to make it clear that it operates on the Hubverse AWS account
Got it--I see what you're saying about being intentional with the AWS account number. It's not clear how many people will want to use the cloud or will want to self-host (and where). Until we get a better sense of that, great suggestion to rename the action to something that indicates the specific use case of sending data to a Hubverse-hosted AWS account (just pushed the commit).
Awesome. This looks great. Happy for it to be merged.
My concerns have been resolved, so I'm approving from my end.
hub-cloud-upload/README.md (Outdated)

> **Important**: The repo's write permissions are limited to the `main` branch. Running this action on another branch or on a fork will fail.
> **Important**: The repo's write permissions are limited to the `main` branch. Running this action on another branch or on a fork will fail.

This is great (and of course necessary)! Out of curiosity, how do you enforce that restriction in the workflow?
Excellently-timed question--I've been working on some info for next week's dev meeting to talk through setup for cloud-enabled hubs (if you're on Confluence, you can see the WIP here).
In short, each hub gets an AWS bucket and an AWS IAM role that allows writing to the bucket. The role's trust policy states that it can only be used (assumed in AWS parlance) by a request originating from GitHub. That same policy further clarifies that the GitHub request must be associated with a specific repo and branch.
So at a high level:
- The GitHub action tells AWS that it wants to "assume" a role. You can see that request in this line of `hubverse-aws-upload.yaml`.
- AWS receives the request and checks the role's trust policy.
- If everything checks out, AWS issues a temporary token to GitHub, which allows GitHub to use the role (this is why we don't need to store any AWS creds as GitHub secrets).
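The flow above corresponds to a short workflow step. Here is a hedged sketch (the role ARN, region, and bucket are placeholders, not the actual Hubverse values) using the `aws-actions/configure-aws-credentials` action, which performs the `AssumeRoleWithWebIdentity` call on the job's behalf:

```yaml
permissions:
  id-token: write   # allow the job to request a GitHub OIDC token
  contents: read

steps:
  - name: Configure AWS credentials via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      # placeholder role ARN -- the real workflow reads the hub's values
      role-to-assume: arn:aws:iam::123456789012:role/example-hub-write-role
      aws-region: us-east-1
  - name: Sync data to S3
    run: aws s3 sync . s3://example-hub-bucket
```

The `id-token: write` permission is what lets the runner obtain the OIDC token that AWS then evaluates against the trust policy.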
For example, this is the trust policy used by the test `hubverse-infrastructure` repo.
```
 1. {
 2.   "Version": "2012-10-17",
 3.   "Statement": [
 4.     {
 5.       "Effect": "Allow",
 6.       "Principal": {
 7.         "Federated": "arn:aws:iam::767397675902:oidc-provider/token.actions.githubusercontent.com"
 8.       },
 9.       "Action": "sts:AssumeRoleWithWebIdentity",
10.       "Condition": {
11.         "StringEquals": {
12.           "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
13.           "token.actions.githubusercontent.com:sub": "repo:Infectious-Disease-Modeling-Hubs/hubverse-infrastructure:ref:refs/heads/main"
14.         }
15.       }
16.     }
17.   ]
18. }
```
- line 7 says that the GitHub trusted identity provider in our AWS account (which I set up a few weeks ago) is eligible to request the use of this trust policy
- line 9 lists the single action that is permitted under this trust policy: assuming the role via a web identity (in this case, the web identity is GitHub)
- line 12 states that the AWS resource requested by GitHub must be AWS's STS (secure token service). The AWS `aws-actions/configure-aws-credentials` action sets this by default, so you won't see it coded explicitly in `hubverse-aws-upload.yaml`
- line 13 says that the request from GitHub must originate from the main branch of `Infectious-Disease-Modeling-Hubs/hubverse-infrastructure`
Per the PR convo, sync optional hub directories (e.g., target-data) as well as the required directories. Additionally, add a note that specific directories can be excluded by removing them from the "hub_directories" list.
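The directory filtering that commit describes can be sketched in a few lines. This is an illustrative reconstruction, not the actual workflow code: it builds one `aws s3 sync` command per entry in a hypothetical `hub_directories` list, so removing an entry from the list excludes that directory from the sync:

```python
# Sketch: build one `aws s3 sync` command per hub directory.
# `hub_directories` and the bucket name are illustrative values,
# not the real workflow configuration.
hub_directories = ["hub-config", "model-metadata", "model-output", "target-data"]
bucket = "example-hub-bucket"

def sync_commands(directories: list[str], bucket: str) -> list[str]:
    """Return one aws s3 sync command per directory; dropping a
    directory from the list excludes it from the cloud sync."""
    return [f"aws s3 sync {d} s3://{bucket}/{d}" for d in directories]

for cmd in sync_commands(hub_directories, bucket):
    print(cmd)
```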
I think we're in good shape for the first iteration of this new AWS sync workflow--merging so people can actually give it a spin.
Resolves #10
First pass at an "official" version of the cloud sync workflow.
This version uses information from the admin config to get the name of the S3 bucket and is safe to use on hubs that are not yet cloud enabled.
There will be follow-up PRs if we end up using GitHub actions to trigger post-sync data actions (e.g., converting model-output data to parquet).