meta: sr2silo for Loculus Architecture #45

Open
4 tasks done
gordonkoehn opened this issue Nov 25, 2024 · 4 comments
Assignees
Labels
meta Epic task / Overarching issue

Comments

@gordonkoehn
Collaborator

gordonkoehn commented Nov 25, 2024

Output from the meeting with Alexander and Chaoran on 25.11.2024 – the final architecture of V-Pipe & Loculus:

The difficulty: V-Pipe's input and output data are much larger than the consensus sequences that Loculus was designed for (14kb << 600 Mb). Thus, there are some caveats to how Loculus can handle V-Pipe data. Mainly, big data files are not passed through the Loculus backend itself; only S3 bucket references to them are.

Below is the currently envisioned final setup:

V-Pipe and Loculus:

IMG_4235

  1. The user uploads raw data into an S3 bucket, via the web interface or an AWS S3 uploader, and submits the metadata.
  2. The Loculus backend writes the incoming metadata and S3 raw references into PostgreSQL.
  3. The Loculus backend dispatches the metadata and S3 raw references to the pipeline (V-Pipe) via a GET request.
  4. V-Pipe fetches the raw data from that S3 bucket and processes it.
  5. As V-Pipe's final step, sr2silo runs and converts V-Pipe's output .bam into an .ndjson of merged, paired reads enriched with the metadata.
  6. sr2silo uploads the .ndjson to another S3 bucket and sends a POST request with the metadata and S3 URL to the Loculus backend.
  7. The Loculus backend sends the metadata .ndjson and S3 URLs to SILO preprocessing.
  8. SILO preprocessing enriches the ndjson with the full sequences, fetching the correct reads from the S3 output of step 6.
  9. SILO preprocessing indexes all files in that massive ndjson, as it needs all sequences.
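
The key idea in the steps above is that only metadata and S3 references travel through the backend, never the large files themselves. A minimal sketch of such a message follows; the class name, field names, bucket and key are all hypothetical and not taken from Loculus or sr2silo:

```python
import json
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class S3Reference:
    """Pointer to a large object; the bytes themselves never pass through the backend."""

    bucket: str
    key: str

    @property
    def url(self) -> str:
        return f"s3://{self.bucket}/{self.key}"

    @classmethod
    def from_url(cls, url: str) -> "S3Reference":
        parsed = urlparse(url)
        if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
            raise ValueError(f"not an s3:// URL: {url!r}")
        return cls(bucket=parsed.netloc, key=parsed.path.lstrip("/"))


def build_submission(metadata: dict, raw: S3Reference) -> str:
    """Serialize the small message that goes through the backend (hypothetical field names)."""
    return json.dumps({"metadata": metadata, "raw_data_url": raw.url}, sort_keys=True)


# Hypothetical bucket, key and metadata, for illustration only.
ref = S3Reference(bucket="raw-reads", key="batch-01/sample-A.raw")
message = build_submission({"sample_id": "A"}, ref)
```

The message stays a few hundred bytes regardless of how large the referenced object is, which is what lets the backend and PostgreSQL handle it.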

V-Pipe outputs to SILO:

IMG_4236

This covers steps 5 and 6 above. In the final setup, some wrapper code around V-Pipe will receive a GET request from the Loculus backend, run V-Pipe in response, and answer with a POST to the Loculus backend. All data except the metadata is read from and written to S3.
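
The wrapper described here could look roughly like the sketch below. It is only an illustration of the request, run, report flow: the function name, the field names, and the stand-in pipeline are assumptions, and the real HTTP and S3 calls are replaced by injected callables so the example runs on its own.

```python
from typing import Callable


def handle_run_request(
    request: dict,
    run_pipeline: Callable[[str], str],
    post_result: Callable[[dict], None],
) -> dict:
    """Run the pipeline for one request and report back.

    request      -- metadata plus the S3 URL of the raw data (hypothetical field names)
    run_pipeline -- takes the raw-data S3 URL, returns the S3 URL of the output
    post_result  -- stands in for the POST to the Loculus backend
    """
    output_url = run_pipeline(request["raw_data_url"])
    # Only the metadata and the output reference go back, not the data itself.
    result = {"metadata": request["metadata"], "processed_url": output_url}
    post_result(result)
    return result


# Stand-ins so the sketch runs without V-Pipe, S3 or a backend.
posted = []


def fake_pipeline(raw_url: str) -> str:
    return raw_url.replace("raw-reads", "processed").replace(".raw", ".ndjson")


result = handle_run_request(
    {"metadata": {"sample_id": "A"}, "raw_data_url": "s3://raw-reads/sample-A.raw"},
    run_pipeline=fake_pipeline,
    post_result=posted.append,
)
```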

On the output side of this stands sr2silo, described below. It will be triggered upon completion of V-Pipe, probably as a docker-compose setup, and will take a single .bam and its metadata. It will do:

  1. [temporary] GET request (to be received from Loculus Backend with nothing but an index)

  2. read processing (bam -> sam -> pair & merge -> align & translate, i.e. Nextclade-like)

  3. nextclade-like output to ndjson with Rust code from Fabian

  4. enrich ndjson with metadata per line

  5. upload ndjson to S3

  6. POST to Loculus backend with S3 URLs of processed ndjson and metadata

  7. Later, this will be implemented by the program that prepares inputs and directories for V-Pipe; for now, we'll artificially import a batch of data with sr2silo alone.
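
Step 4 of this list (enriching the ndjson with metadata per line) can be sketched as follows. This is not the sr2silo implementation: the function name, the `metadata` field, and the record layout are assumptions made for illustration.

```python
import io
import json
from typing import IO


def enrich_ndjson(src: IO[str], dst: IO[str], metadata: dict) -> int:
    """Copy NDJSON from src to dst, adding a 'metadata' field to every record.

    Works line by line, so the whole file never has to fit in memory.
    Returns the number of records written.
    """
    count = 0
    for line in src:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        record["metadata"] = metadata  # hypothetical field name
        dst.write(json.dumps(record) + "\n")
        count += 1
    return count


# In-memory example; real inputs would be files of processed reads.
reads = io.StringIO(
    '{"read_id": "r1", "sequence": "ACGT"}\n'
    '{"read_id": "r2", "sequence": "ACGA"}\n'
)
out = io.StringIO()
n = enrich_ndjson(reads, out, {"sample_id": "A", "sampling_date": "2024-11-25"})
enriched = [json.loads(row) for row in out.getvalue().splitlines()]
```

Streaming one record at a time matters here because the ndjson is expected to be very large.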

Sub-Issues:

Open Questions

  • Do users want to download only SNVs? Do users want to download BAMs as well? If so, the above would need modification.
@gordonkoehn gordonkoehn added the meta Epic task / Overarching issue label Nov 25, 2024
@gordonkoehn gordonkoehn self-assigned this Nov 25, 2024
@gordonkoehn
Collaborator Author

Doubt: What was the reason for processing the .bam to .ndjson in one place, as compared to having the Nextclade-like output stored and processed in front of SILO?

So that we only need to store one file, containing both data and metadata, once and in one place. This is special for wastewater.

@gordonkoehn
Collaborator Author

Sync with Alexander:

The silo_input_transformer will probably move into SILO preprocessing later. We would then upload the metadata to the Loculus backend, and the BAMs for nucleotides and amino acids, already read-paired and merged, to some S3 bucket.

For now this should be all in sr2silo to get going.

@gordonkoehn
Collaborator Author

See the discussion here:

@gordonkoehn
Collaborator Author

I shall remove this from the board for cleanness.
