
Automate/simplify process for new sequences #18

Closed
4 tasks done
victorlin opened this issue Apr 14, 2022 · 9 comments


victorlin commented Apr 14, 2022

@j23414, @joverlee521 and I just met to add 3 new Zika sequences for the 2022-04-14 update, following the docs:

  1. fauna/ZIKA.md
  2. zika/README.md

These improvements can be made:

  • Automate check of new sequences
  • Use ViPR API
  • Remove fauna from the process
  • Automate run of full process (similar to the process for ncov)
victorlin added the enhancement (New feature or request) label on Apr 14, 2022

trvrb commented Apr 14, 2022

Awesome! I would imagine a new ingest directory in this repo that would do the fetch from ViPR and push to S3 (dropping Fauna from the process).

victorlin commented

I made a small GitHub Actions workflow that somewhat addresses the first task (automate the check for new sequences), but it's not very useful yet since you still have to open the latest run and click through a few more pages to see the sequence count:

[screenshot: GitHub Actions run page showing the sequence count]

Maybe better to do this work all at once.
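For reference, here's roughly what that check looks like conceptually; in this sketch the source URL is just a placeholder, not the real endpoint:

```yaml
# Hypothetical sketch of the check; the source URL is a placeholder and the
# cron cadence is one of the open questions discussed below.
name: check-new-sequences

on:
  workflow_dispatch:
  schedule:
    - cron: "0 14 * * *"  # daily

jobs:
  count:
    runs-on: ubuntu-latest
    steps:
      - name: Count Zika sequences available upstream
        run: |
          # Placeholder URL: the real check would query the actual source
          # (e.g. the ViPR API) for currently available Zika sequences.
          count="$(curl -sf https://data-source.example.org/zika/sequences.fasta | grep -c '^>')"
          echo "Sequences available upstream: $count"
```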


I imagine there will be some GitHub Actions workflow capable of running the entire process (offloading nextstrain build to AWS/Terra if necessary). Then, two options:

  1. Run on a schedule (daily/weekly?), regardless of new sequences.
  2. Check for new sequences on a schedule (daily?), run only if new sequences detected.

Seems like the source of Zika sequences does not update frequently, so (1) is simpler but might result in redundant compute if the source data hasn't changed. Is this something we should be worrying about?

With (2) there is potential to be more "real-time" (could even check hourly), but not sure if that's a direction to pursue here.


j23414 commented Apr 15, 2022

I vote for choice 2. Choice 1's compute costs would eventually add up, especially if we're using a similar system for X+1 other pathogens. I also lean toward choice 2 for being more streamlined.

I lean toward a daily/weekly check for new sequences; hourly seems too frequent given the timing of the current updates.


victorlin commented Apr 21, 2022

Expanding a bit on the options above:

  1. Run on a schedule (daily/weekly?), regardless of new sequences.
    • This is the simplest approach and effectively what we're doing for all pathogens currently. It can be done locally without any caching.
    • ncov/ncov-ingest is an example of this approach being automated.
  2. Check for new sequences on a schedule (daily?), run only if new sequences detected.
    • This would require some form of caching to compare the latest sequences against what already exists.
      1. At the simplest, maintain a cache of the sequence count and rerun the process from scratch whenever that number changes (see the sketch after this list).
      2. Maintain a database of the sequences. The process here is what's being done now: update/add sequences to the database, then run the zika workflow from the database contents.
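
To make (2.a) concrete, it could be as small as one scheduled workflow; the source URL and the cache object name below are assumptions, not existing files:

```yaml
# Hypothetical sketch of (2.a): cache a single sequence count on S3 and only
# rebuild when it changes. Assumes AWS credentials are configured.
name: rebuild-on-count-change

on:
  schedule:
    - cron: "0 14 * * *"  # daily check

jobs:
  check-and-run:
    runs-on: ubuntu-latest
    steps:
      - id: check
        name: Compare the upstream count against the cached count
        run: |
          # Placeholder source URL and cache object name.
          new="$(curl -sf https://data-source.example.org/zika/sequences.fasta | grep -c '^>')"
          old="$(aws s3 cp s3://nextstrain-data/files/zika/sequence_count.txt - 2>/dev/null || echo 0)"
          echo "changed=$([ "$new" != "$old" ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
          echo "$new" | aws s3 cp - s3://nextstrain-data/files/zika/sequence_count.txt
      - name: Run the full process
        if: steps.check.outputs.changed == 'true'
        run: nextstrain build .  # stand-in for the real build invocation
```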

I don't think maintaining a count cache (2.a) is very practical.

If we want to do (2.b), it would be worthwhile to keep zika in fauna instead of the current plan to move away from it. We just need to automate the current manual steps.


j23414 commented Apr 21, 2022

Ah, without fauna, we can still do (2.b), where the database would be a pair of flat files at s3://nextstrain-data/files/zika/. Open to discussion though! :D

Currently, the Snakefile is already pulling from S3.
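
For concreteness, fetching that flat-file "database" could be a workflow step like this (the exact object names are guesses on my part):

```yaml
# Hypothetical workflow step; the object names under
# s3://nextstrain-data/files/zika/ are guesses, not confirmed files.
- name: Fetch the current flat-file database
  run: |
    aws s3 cp s3://nextstrain-data/files/zika/sequences.fasta.xz .
    aws s3 cp s3://nextstrain-data/files/zika/metadata.tsv.gz .
```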


trvrb commented Apr 21, 2022

I'd think you could do what we do for ncov-ingest and compare hashes of the freshly ingested data against the data that's on S3, only spinning up a new analysis if the hash has changed. Ingest would run every day, but the processing job would only trigger if the S3 files are updated.
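
Roughly like this; the ingest command and object names below are stand-ins for the real steps, and it assumes AWS credentials are configured:

```yaml
# Hypothetical sketch of the hash comparison; ingest.sh and the S3 object
# names are stand-ins for the actual ingest steps and files.
name: ingest-and-compare

on:
  schedule:
    - cron: "0 14 * * *"  # ingest runs every day

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - id: hashes
        name: Compare freshly ingested data against the copy on S3
        run: |
          ./ingest.sh > sequences.fasta  # stand-in for the real fetch from ViPR
          aws s3 cp s3://nextstrain-data/files/zika/sequences.fasta previous.fasta
          new_hash="$(sha256sum sequences.fasta | cut -d' ' -f1)"
          old_hash="$(sha256sum previous.fasta | cut -d' ' -f1)"
          echo "changed=$([ "$new_hash" != "$old_hash" ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
      - name: Update S3 and trigger the processing job
        if: steps.hashes.outputs.changed == 'true'
        run: |
          aws s3 cp sequences.fasta s3://nextstrain-data/files/zika/sequences.fasta
          # ...then kick off the full build (e.g. via a workflow_dispatch event).
```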


victorlin commented Apr 21, 2022

@j23414 @trvrb Good points, I forgot about S3. That makes sense; it would enable conditional processing that's simpler than (2.a) without needing fauna.

victorlin moved this from New to Prioritized in Nextstrain planning (archived) on Apr 27, 2022

j23414 commented May 6, 2022

Expanding Task 3 "Remove fauna from the process" into:

  • migrate relevant files and functions for zika_upload.py
  • collection date processing -> use augur's version
  • location processing
  • migrate relevant files and functions for zika_update.py

joverlee521 commented

Resolved by #52
