
Set up CI #59

Open
vincerubinetti opened this issue Nov 12, 2024 · 12 comments · Fixed by #124

@vincerubinetti
Contributor

vincerubinetti commented Nov 12, 2024

I believe this is what our GitHub Actions CI workflow should eventually look like:

name: Update data

on:
  pull_request:
  workflow_dispatch:

jobs:
  update:
    runs-on: ubuntu-latest

    steps:
      - name: Debug dump
        uses: crazy-max/ghaction-dump-context@v2

      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache-dependency-path: "**/environment.yml"

      - name: Install Python packages
        run: |
          echo Some conda install command

      - name: SSH debug
        if: runner.debug == '1'
        uses: mxschmitt/action-tmate@v3

      - name: Update data
        run: |
          python some-script-1.py
          python some-script-2.py
          python some-script-3.py

      # Optional: might be nice for the site to know when the data was last updated.
      - name: Record data compile time
        run: sed -i "s/PUBLIC_DATA_DATE=.*/PUBLIC_DATA_DATE=$(date -uIseconds)/" site/.env

      - name: Open pull request with updated files
        if: github.event_name == 'workflow_dispatch'
        uses: peter-evans/create-pull-request@v6
        with:
          branch: update-data
          title: Update data

      - name: Commit updated files
        if: github.event_name == 'pull_request'
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: Update data

This will allow you to either manually run the workflow and have it open a PR with the updated data, or open a PR manually and have it run the data update automatically on the PR.

As for the site, I'm thinking that we should use Netlify instead of GitHub Pages for the new website. They are both free and easy to use, but Netlify also gives us live deploy previews of PRs, built in. You can of course also set up your Netlify site to use your custom domain. And we'll have it configured such that it just rebuilds and redeploys the site whenever there are changes in the repo (on main or any PR branch), including the /data folder. As such, there's no need to trigger anything site-related in this gh-actions workflow; it will happen automatically. Yes, this means the site will be rebuilt when ineffectual things like the readme change, but the cost is minimal; the site only takes a few seconds to build.

@hdashnow
Member

That all sounds great.

At some point, I'd like to add a few things to this process if appropriate (unless they belong elsewhere)

  • When a new release has been created:
    • version the json data and link to it, like what I did manually here: https://github.com/dashnowlab/STRchive/releases/tag/v1.2.0.
    • This should also appear at the top of the table and somewhere on the locus pages along with an updated date.
    • Run scripts to update locus definitions (probably should generate a PR and be checked)
  • Update literature for existing loci (scheduled, auto PR)
  • Search for new locus literature (scheduled, auto PR)

@vincerubinetti
Contributor Author

vincerubinetti commented Nov 14, 2024

Not sure what you had planned, but it seemed to me like maybe all the data should be updated at the same time, i.e. as part of the same gh-actions workflow. Unless you want to be able to, say, update the literature independently from the other stuff. That might muddy the waters though, like the literature would be on its own version (if it even has a version) separate from the rest of the data.

version the json data and link to it, like what I did manually here:

For reference, here's the "versioning" workflow I have for Lab Website Template. ncipollo/release-action makes it easy to make tags and releases.

This should also appear at the top of the table and somewhere on the locus pages along with an updated date.

If what I said above is what you decide to do, there'd be one version for all of the data, and that version could be displayed in the header perhaps, with a link to the list of releases / changelog.

Instead of the "record data compile time" step I had in my example .yaml file above, it'd actually be better if you made a GitHub CFF citation file for this repo. You probably should have this anyway. But it will also allow me to conveniently get the version and date of the data to display on the website somewhere.
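For illustration, a minimal CITATION.cff could look something like this (all values below are placeholders, not actual STRchive metadata):

```yaml
# Placeholder CFF file; version and date-released are the fields the
# website could read to display the data version and updated date.
cff-version: 1.2.0
message: "If you use STRchive, please cite it."
title: "STRchive"
version: 1.2.0
date-released: "2024-11-14"
authors:
  - family-names: "Doe"
    given-names: "Jane"
repository-code: "https://github.com/dashnowlab/STRchive"
```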

Run scripts to update locus definitions (probably should generate a PR and be checked)
Update literature for existing loci (scheduled, auto PR)
Search for new locus literature (scheduled, auto PR)

I'd imagine all these scripts would run in sequence in this same workflow, and then the workflow could be triggered by workflow_dispatch (running it with a manual button click in the GitHub web interface), pull_request (when you manually open a PR for whatever reason), and schedule (perhaps weekly or monthly). Then it would open a PR with peter-evans/create-pull-request for you to review and merge. We could have it always open a new PR, or give a specific branch name such that if one is already open (e.g. you haven't gotten around to merging last week's update PR yet), it just updates that one.
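Sketched as workflow YAML (the cron string and branch name are just examples, not settled choices):

```yaml
on:
  workflow_dispatch:    # manual button click in the Actions tab
  pull_request:         # runs when you manually open a PR
  schedule:
    - cron: "0 0 * * 1" # example: weekly, Mondays at 00:00 UTC

# ...and in the PR-opening step, a fixed branch name means re-runs
# update the already-open PR instead of opening a new one:
#   uses: peter-evans/create-pull-request@v6
#   with:
#     branch: update-data
```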

@vincerubinetti
Contributor Author

Just a sanity check here, are all the processing scripts and such actually in this repo? I feel like I've run into cases where I ctrl+f the whole repo, looking for a bit of Python script that generated/processed some JSON, and I can't find it.

Because all of that code will need to be in this repo on the same branch for the CI process, or else we'll need some complicated workarounds.

@laurelhiatt
Contributor

Hey Vince, sorry I didn't see this; I thought CI was confidence interval for the plots lol, which is out of my jurisdiction. We can meet about this tomorrow or this week if you'd like.

@vincerubinetti
Contributor Author

[Screenshot attached: 2024-11-26 at 1:09:09 PM]

@hdashnow
Member

#93 should have all the components needed for automation. See the STRchive/README.md and let me know if more details are needed.
Note that the run-manubot.py script runs so long (12+ hours) that I've never actually run it on the full dataset, only a subset. I think we'll need to deal with this before we can use it in CI. In the meantime, you can run everything else using snakemake --config stages="skip-refs" in a few seconds.

@vincerubinetti
Contributor Author

It will probably always take at least a couple of hours if doing every citation, but I would look into running things in parallel to see if it improves times. The manubot script I gave you just runs one at a time. You can also pass multiple IDs to manubot in one command, though I don't know if it parallelizes that.
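For illustration, here's the general shape of parallelizing per-ID fetches with a thread pool (the `fetch_citation` body is a stub standing in for the actual manubot call; the function names here are mine, not from the repo):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_citation(cite_id):
    # Stub: in the real script this would call out to manubot for one
    # citation ID (hypothetical wiring, not STRchive's actual code).
    return {"id": cite_id}

def fetch_all(cite_ids, max_workers=4):
    # Small worker pool; keeping max_workers low helps stay under the
    # rate limits of citation services like Crossref and PubMed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with cite_ids
        return list(pool.map(fetch_citation, cite_ids))
```

With real network-bound calls, even a handful of workers could cut a multi-hour run substantially, though rate limits are the constraint to watch.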

@hdashnow
Member

hdashnow commented Dec 2, 2024

I thought about doing things in parallel. It could cause rate limit issues if not done carefully. I've got it down to ~2 hours run time now, so let's see how we go and decide if it's worth the effort to find more speed-ups.

@vincerubinetti
Contributor Author

vincerubinetti commented Dec 5, 2024

Here's an updated workflow that is working, except for the Conda activate step, which I don't know enough to debug.

Put this in /.github/workflows/update-data.yaml
name: Update data

on:
  pull_request:
    branches: main
    paths:
      - "data/**"
      - "scripts/**"
  schedule:
    - cron: "0 0 1 * *"
  workflow_dispatch:

jobs:
  update:
    runs-on: ubuntu-latest

    steps:
      - name: Debug dump
        uses: crazy-max/ghaction-dump-context@v2

      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache-dependency-path: "**/environment.yml"

      - name: Set up Conda
        uses: s-weigand/setup-conda@v1

      - name: Activate Conda
        run: |
          conda env create --file scripts/environment.yml
          conda init
          source ~/.bashrc
          conda activate strchive

      - name: Update data (short)
        if: ${{ github.event_name == 'pull_request' }}
        run: snakemake --config stages="skip-refs"

      - name: Update data (full)
        if: ${{ github.event_name == 'schedule' || github.event_name == 'workflow_dispatch' }}
        run: snakemake

      - name: Open pull request with updated files
        if: ${{ !(github.event_name == 'pull_request') }}
        uses: peter-evans/create-pull-request@v7
        with:
          branch: update-data
          title: Update data

      - name: Commit updated files to current pull request
        if: ${{ github.event_name == 'pull_request' }}
        uses: stefanzweifel/git-auto-commit-action@v5
        with:
          commit_message: Update data
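One possible fix for the Conda activation problem, sketched under the assumption that conda-incubator/setup-miniconda works with this repo's scripts/environment.yml (I haven't run this against STRchive): each run: step gets a fresh shell, so activation has to be re-established per step, which a login-shell default handles.

```yaml
jobs:
  update:
    runs-on: ubuntu-latest
    defaults:
      run:
        shell: bash -el {0}  # login shell, so the Conda env stays active in every step
    steps:
      - uses: actions/checkout@v4
      - name: Set up Conda environment
        uses: conda-incubator/setup-miniconda@v3
        with:
          environment-file: scripts/environment.yml
          activate-environment: strchive
      - run: snakemake --config stages="skip-refs"
```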

And here's a working workflow for #96. Take a careful look at the logic that inserts the version string into the file names. I wrote it assuming you would rename the .bed files to start with STRchive-disease-loci. Change it to whatever you want.

Put this in /.github/workflows/make-release.yaml
name: Make release

on:
  push:
    branches:
      - main
    paths:
      - CITATION.cff

jobs:
  release:
    runs-on: ubuntu-latest

    steps:
      - name: Debug dump
        uses: crazy-max/ghaction-dump-context@v2

      - name: Checkout code
        uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - name: Get previous version file
        run: git show HEAD~1:CITATION.cff > CITATION-previous.cff

      - name: Install packages
        run: npm install yaml@v2 semver@v7 glob@v11

      - name: Get version
        id: version
        uses: actions/github-script@v7
        with:
          result-encoding: string
          script: |
            const { readFileSync, renameSync } = require("fs");
            const { valid, eq, lt } = require("semver");
            const { parse } = require("yaml");
            const { globSync } = require("glob");

            // load and parse file contents
            const { version: newVersion } = parse(readFileSync("CITATION.cff").toString());
            const { version: oldVersion } = parse(readFileSync("CITATION-previous.cff").toString());

            console.log(`Old version: ${oldVersion}`);
            console.log(`New version: ${newVersion}`);

            // check version
            if (!valid(newVersion) || lt(newVersion, oldVersion))
              throw Error("Version not valid");
            if (eq(oldVersion, newVersion)) {
              console.log("Version unchanged");
              return "";
            }

            // add version to artifact filenames
            for (const file of globSync("**/STRchive-*.json"))
              renameSync(file, file.replace(".json", `_v${newVersion}.json`));
            for (const file of globSync("**/STRchive-disease-loci*.bed"))
              renameSync(file, file.replace("-loci", `-loci_v${newVersion}_`));

            return newVersion;

      - name: SSH debug
        if: runner.debug == '1'
        uses: mxschmitt/action-tmate@v3

      - name: Release
        uses: softprops/action-gh-release@v2
        if: ${{ steps.version.outputs.result }}
        with:
          tag_name: v${{ steps.version.outputs.result }}
          files: |
            **/STRchive-loci*.json
            **/STRchive-citations*.json
            **/STRchive-disease-loci*.bed

Pro-tip: Add this step somewhere under steps, and the workflow will pause there and allow you to SSH into the runner machine and run any commands you want. I.e. it'll give you a command like ssh [email protected] to run in your terminal. Helpful to put right before a command that keeps failing, then you go in and debug.

- name: SSH debug
  uses: mxschmitt/action-tmate@v3

@hdashnow
Member

I'm out of my depth on this CI stuff... This worked on my fork, but I can't get it working in this repository.
https://github.com/dashnowlab/STRchive/actions/runs/12246554105

@vincerubinetti
Contributor Author

Here's the failing run:

https://github.com/dashnowlab/STRchive/actions/runs/12246554105/job/34162663499

And the relevant logs:

remote: Permission to dashnowlab/STRchive.git denied to github-actions[bot].
fatal: unable to access 'https://github.com/dashnowlab/STRchive/': The requested URL returned error: 403

It's failing with a 403 permission denied error code. So the action runner (i.e. the "github actions bot") doesn't have permissions to commit/push to the repo.

I believe you'll need to allow actions read + write permissions in the repo settings:

[Screenshot attached: 2024-12-10 at 12:33:52 PM]

Hopefully if you change that and re-run the failed workflow, it should work.
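Alternatively, GitHub Actions also lets you grant the token write access per workflow via a top-level permissions key, which avoids loosening the repository-wide default (this is a standard Actions feature, not something specific to this repo):

```yaml
# At the top level of the workflow file:
permissions:
  contents: write        # allow the workflow to commit and push
  pull-requests: write   # allow it to open/update pull requests
```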

vincerubinetti linked a pull request Dec 11, 2024 that will close this issue
@hdashnow
Member

Oops, closed prematurely. Still working on the R bit of this.

hdashnow reopened this Dec 13, 2024