
Automate/simplify process for new sequences #18

Closed
4 tasks done
victorlin opened this issue Apr 14, 2022 · 9 comments


victorlin commented Apr 14, 2022

@j23414, @joverlee521 and I just met to add 3 new Zika sequences for the 2022-04-14 update, following the docs:

  1. fauna/ZIKA.md
  2. zika/README.md

These improvements can be made:

  • Automate check of new sequences
  • Use ViPR API
  • Remove fauna from the process
  • Automate run of full process (similar to the process for ncov)
victorlin added the enhancement (New feature or request) label on Apr 14, 2022

trvrb commented Apr 14, 2022

Awesome! I would imagine a new ingest directory in this repo that would do the fetch from ViPR and push to S3 (dropping Fauna from the process).

victorlin commented

I made a small GitHub Actions workflow that somewhat addresses the first task (automate the check for new sequences), but it's not very useful yet since you still have to open the latest run and click through a few more pages to see the sequence count:

[screenshot: GitHub Actions run page showing the sequence count]

Maybe better to do this work all at once.
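For reference, here's roughly what that check looks like conceptually; in this sketch the source URL is just a placeholder, not the real endpoint:

```yaml
# Hypothetical sketch of the check; the source URL is a placeholder and the
# cron cadence is one of the open questions discussed below.
name: check-new-sequences

on:
  workflow_dispatch:
  schedule:
    - cron: "0 14 * * *"  # daily

jobs:
  count:
    runs-on: ubuntu-latest
    steps:
      - name: Count Zika sequences available upstream
        run: |
          # Placeholder URL: the real check would query the actual source
          # (e.g. the ViPR API) for currently available Zika sequences.
          count="$(curl -sf https://data-source.example.org/zika/sequences.fasta | grep -c '^>')"
          echo "Sequences available upstream: $count"
```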


I imagine there will be some GitHub Actions workflow capable of running the entire process (offloading nextstrain build to AWS/Terra if necessary). Then, two options:

  1. Run on a schedule (daily/weekly?), regardless of new sequences.
  2. Check for new sequences on a schedule (daily?), run only if new sequences detected.

Seems like the source of Zika sequences does not update frequently, so (1) is simpler but might result in redundant compute if the source data hasn't changed. Is this something we should be worrying about?

With (2) there is potential to be more "real-time" (could even check hourly), but not sure if that's a direction to pursue here.


j23414 commented Apr 15, 2022

I vote for choice 2. Choice 1's compute costs would eventually add up, especially if we're using a similar system for X+1 other pathogens. I also lean toward choice 2 for being more streamlined.

I lean toward a daily/weekly check for new sequences; hourly seems too frequent given the timing of the current updates.


victorlin commented Apr 21, 2022

Expanding a bit on the options above:

  1. Run on a schedule (daily/weekly?), regardless of new sequences.
    • This is the simplest approach and effectively what we're doing for all pathogens currently. It can be done locally without any caching.
    • ncov/ncov-ingest is an example of this approach being automated.
  2. Check for new sequences on a schedule (daily?), run only if new sequences detected.
    • This would require some form of caching to compare the latest sequences against what already exists.
      1. At the simplest, maintain a cache of the sequence count and rerun the process from scratch whenever that number changes (see the sketch after this list).
      2. Maintain a database of the sequences. The process here is what's being done now: update/add sequences to the database, then run the zika workflow from the database contents.
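
To make (2.a) concrete, it could be as small as one scheduled workflow; the source URL and the cache object name below are assumptions, not existing files:

```yaml
# Hypothetical sketch of (2.a): cache a single sequence count on S3 and only
# rebuild when it changes. Assumes AWS credentials are configured.
name: rebuild-on-count-change

on:
  schedule:
    - cron: "0 14 * * *"  # daily check

jobs:
  check-and-run:
    runs-on: ubuntu-latest
    steps:
      - id: check
        name: Compare the upstream count against the cached count
        run: |
          # Placeholder source URL and cache object name.
          new="$(curl -sf https://data-source.example.org/zika/sequences.fasta | grep -c '^>')"
          old="$(aws s3 cp s3://nextstrain-data/files/zika/sequence_count.txt - 2>/dev/null || echo 0)"
          echo "changed=$([ "$new" != "$old" ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
          echo "$new" | aws s3 cp - s3://nextstrain-data/files/zika/sequence_count.txt
      - name: Run the full process
        if: steps.check.outputs.changed == 'true'
        run: nextstrain build .  # stand-in for the real build invocation
```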

I don't think maintaining a count cache (2.a) is very practical.

If we want to do (2.b), it would be worthwhile to keep zika in fauna instead of the current plan to move away from it. We just need to automate the current manual steps.


j23414 commented Apr 21, 2022

Ah, without fauna, we can still do (2.b), where the database would be a pair of flat files at s3://nextstrain-data/files/zika/. Open to discussion though! :D

Currently, the Snakefile is already pulling from S3.
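
For concreteness, fetching that flat-file "database" could be a workflow step like this (the exact object names are guesses on my part):

```yaml
# Hypothetical workflow step; the object names under
# s3://nextstrain-data/files/zika/ are guesses, not confirmed files.
- name: Fetch the current flat-file database
  run: |
    aws s3 cp s3://nextstrain-data/files/zika/sequences.fasta.xz .
    aws s3 cp s3://nextstrain-data/files/zika/metadata.tsv.gz .
```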


trvrb commented Apr 21, 2022

I'd think you could do what we do for ncov-ingest and compare hashes of the freshly ingested data against the data that's on S3, only spinning up a new analysis if the hash has changed. Ingest would run every day, but the processing job would only trigger if the S3 files are updated.
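
Roughly like this; the ingest command and object names below are stand-ins for the real steps, and it assumes AWS credentials are configured:

```yaml
# Hypothetical sketch of the hash comparison; ingest.sh and the S3 object
# names are stand-ins for the actual ingest steps and files.
name: ingest-and-compare

on:
  schedule:
    - cron: "0 14 * * *"  # ingest runs every day

jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - id: hashes
        name: Compare freshly ingested data against the copy on S3
        run: |
          ./ingest.sh > sequences.fasta  # stand-in for the real fetch from ViPR
          aws s3 cp s3://nextstrain-data/files/zika/sequences.fasta previous.fasta
          new_hash="$(sha256sum sequences.fasta | cut -d' ' -f1)"
          old_hash="$(sha256sum previous.fasta | cut -d' ' -f1)"
          echo "changed=$([ "$new_hash" != "$old_hash" ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
      - name: Update S3 and trigger the processing job
        if: steps.hashes.outputs.changed == 'true'
        run: |
          aws s3 cp sequences.fasta s3://nextstrain-data/files/zika/sequences.fasta
          # ...then kick off the full build (e.g. via a workflow_dispatch event).
```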


victorlin commented Apr 21, 2022

@j23414 @trvrb Good points, I forgot about S3. That makes sense; it would enable conditional processing that's simpler than (2.a) without needing fauna.

victorlin moved this from New to Prioritized in Nextstrain planning (archived) on Apr 27, 2022

j23414 commented May 6, 2022

Expanding Task 3 "Remove fauna from the process" into:

  • migrate relevant files and functions for zika_upload.py
  • collection date processing -> use augur's version
  • location processing
  • migrate relevant files and functions for zika_update.py

joverlee521 commented

Resolved by #52
