Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a GitHub action to sync data to S3 #11

Merged
merged 4 commits into from
Mar 1, 2024

Conversation

bsweger
Copy link
Contributor

@bsweger bsweger commented Feb 28, 2024

Resolves #10

First pass at an "official" version of the cloud sync workflow.

This version uses information from the admin config to get the name of the S3 bucket and is safe to use on hubs that are not yet cloud enabled.

There will be follow-up PRs if we end up using GitHub actions to trigger post-sync data actions (e.g., converting model-output data to parquet).

First pass at an "official" version of the cloud sync workflow.
This version uses information from the admin config to get the
name of the S3 bucket and is safe to use on hubs that are not
yet cloud enabled.
@bsweger
Copy link
Contributor Author

bsweger commented Feb 28, 2024

@annakrystalli this action grabs cloud values from admin.json using the names we were discussing here: hubverse-org/schemas#65 (comment)

I'll work on the schema PR next, but it's trivial to update this action if we end up using different names.

@annakrystalli
Copy link
Member

This looks like a great start. I have two related questions:

  1. I'm not clear from the process how authentication is handled exactly? Is this only set up once and then any change that happens in the repo is just synched without further authentication? If so, is that safe? Asking completely out of curiousity as I can't say I know best practice for these sort of situation.
  2. The hubverse AWS account is hard-coded in the workflow. I'm assuming that's safe. More importantly, how easy would it be for folks who don't want to host on the hubverse account to set up synching to their own AWS account? I think this is an important use case we should be able to support. Totally get that this workflow can be a good first step towards that but I think it's a good time to already think about generalisation as, perhaps this can afffect the name of the action? E.g. perhaps we want the name of this particular action to include hubverse and aws to reflect the specificity involved in it?

@bsweger
Copy link
Contributor Author

bsweger commented Feb 29, 2024

@annakrystalli Great questions, thanks!

AWS Authentication
We'll do a deeper dive into the authentication during the demo next week, but at a high-level:

  • We're operating under the assumption that each cloud-enabled hub will have its own S3 bucket and set of permissions that can write to it. This model makes managing security more straightforward and less risky while also making it easier for a hub admin to move their data.
  • The GitHub action is using an OpenID Connect (OIDC) provider to authenticate with AWS, which is the recommended auth method as it doesn't require the storage of long-term AWS credentials in GitHub secrets.
  • Our AWS GitHub OIDC provider grants GitHub temporary access to an AWS IAM role with permission to write to the hub's S3 bucket. The access will only be granted if the auth request comes from the main branch of the hub's designated repo.
  • Provisioning the AWS infrastructure that creates the bucket/permissions is handled by us (the Hubverse team) as part of a one-time cloud onboarding process.
  • The OIDC auth process is handled by the configure-aws-credentials action
  • There's more info in the cloud-infrastructure hub README

AWS Account Number
Yes, we're not considering an AWS account number to be a secret. Agree that we should have a path for people who want to provide their own cloud hosting. To do that using this proposed setup, a hub admin would need to:

  • change the AWS account number in the GitHub action
  • update the configure-aws-credentials step to authenticate (for example, an admin might choose to authenticate with longer-term AWS secrets instead of using OIDC)

We could choose to move the hard-coded Hubverse AWS account number into the admin.json schema as a default value. I chose to put it here instead because I didn't love the idea of tightly-coupling our AWS account number with the schema.

That said, if you think that providing the account # in the schema and reading it dynamically in the action will provide a smoother path, happy to make that change here and in the related schema PR!

Based on a conversation in the PR for the cloud admin.json updates,
we decided to rename some cloud.host properties. This commit updates
the GitHub action to reflect those changes.
@annakrystalli
Copy link
Member

Thanks for the info!

I totally agree that hard wiring the account number into the schema is not ideal but I don't think we need to. The schema would just need to check that whatever value is given conforms to what is expected for an AWS account. It's something each hub would need to add to their admin.json but perhaps that could be part of onboarding. A bit manual which I don't love but at least is intentional.

Having said I think it's fine to keep it in this particular action but I think the name of the action should reflect that so users know this template is for syncing to a hubverse hosted hub in AWS.

Based on PR feedback, this commit renames the action:
1. to make it clear that it's AWS based
2. to make it clear that it operates on the Hubverse AWS account
@bsweger
Copy link
Contributor Author

bsweger commented Feb 29, 2024

I totally agree that hard wiring the account number into the schema is not ideal but I don't think we need to. The schema would just need to check that whatever value is given conforms to what is expected for an AWS account. It's something each hub would need to add to their admin.json but perhaps that could be part of onboarding. A bit manual which I don't love but at least is intentional.

Having said I think it's fine to keep it in this particular action but I think the name of the action should reflect that so users know this template is for syncing to a hubverse hosted hub in AWS.

Got it--I see what you're saying about being intentional with the AWS account number. It's not clear how many people will want to use the cloud or will want to self-host (and where).

Until we get a better sense of that, great suggestion to rename the action to something that indicates the specific use case of sending data to a Hubverse-hosted AWS account (just pushed the commit).

@annakrystalli
Copy link
Member

Awesome. This looks great. Happy for it to be merged.

Copy link
Member

@annakrystalli annakrystalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concerns have been resolved so approving from my end


**Important**: The repo's write permissions are limited to the `main` branch. Running this action on another branch
or on a fork will fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: The repo's write permissions are limited to the main branch. Running this action on another branch or on a fork will fail.

This is great (and of course necessary)! Out of curiosity, how to you enforce that restriction in the workflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellently-timed question--I've been working on some info for next week's dev meeting to talk through setup for cloud-enabled hubs (if you're on Confluence, you can see the WIP here).

In short, each hub gets an AWS bucket and an AWS IAM role that allows writing to the bucket. The role's trust policy states that it can only be used (assumed in AWS parlance) by a request originating from GitHub. That same policy further clarifies that the GitHub request must be associated with a specific repo and branch.

So at a high level:

  1. The GitHub action tells AWS that it wants to "assume" a role. You can see that request in this line of hubverse-aws-upload.yaml.
  2. AWS receives the request and check's the role's trust policy.
  3. If everything checks out, AWS issues a temporary token to GitHub, which allows GitHub to use the role (this is why we don't need to store any AWS creds as GitHub secrets)

For example, this is the trust policy used by the test hubverse-infrastructure repo.

1.   {
2.      "Version": "2012-10-17",
3.      "Statement": [
4.          {
5.              "Effect": "Allow",
6.             "Principal": {
7.                  "Federated": "arn:aws:iam::767397675902:oidc-provider/token.actions.githubusercontent.com"
8.              },
9.              "Action": "sts:AssumeRoleWithWebIdentity",
10.             "Condition": {
11.                 "StringEquals": {
12.                     "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
13.                     "token.actions.githubusercontent.com:sub": "repo:Infectious-Disease-Modeling-Hubs/hubverse-infrastructure:ref:refs/heads/main"
14.                 }
15.             }
16.         }
17.     ]
18. }
  • line 7 says that the GitHub trusted identity provider in our AWS account (which I set up a few weeks ago) is eligible to request the use of this trust policy
  • line 9 lists the single action that is permitted under this trust policy: assuming the role via a web identity (in this case, the web identity is GitHub)
  • line 12 states that the AWS resource requested by GitHub must be AWS's STS (secure token service). The AWS aws-actions/configure-aws-credentials action sets this by default, so you won't see that coded explicitly in hubvserse-aws-upload.yaml)
  • line 13 says that the request from GitHub must originate from the main branch of Infectious-Disease-Modeling-Hubs/hubverse-infrastructure

hub-cloud-upload/hub-cloud-upload.yaml Outdated Show resolved Hide resolved
Per the PR convo, sync optional hub directories (e.g., target-data)
as well as the required directories.

Additionally, add a note that specific directories can be excluded
by removing them from the "hub_directories" list.
@bsweger
Copy link
Contributor Author

bsweger commented Mar 1, 2024

I think we're in good shape for the first iteration of this new AWS sync workflow--merging so people can actually give it a spin.

@bsweger bsweger merged commit 6800362 into main Mar 1, 2024
@bsweger bsweger deleted the bsweger/add-cloud-sync-action branch March 1, 2024 21:21
@bsweger bsweger added this to the hubverse cloud sync milestone Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add first pass of action to sync hubverse data to cloud backend
4 participants