using local blast tools in lieu of entrez #12

Open
btupper opened this issue Mar 20, 2024 · 13 comments

Comments

@btupper

btupper commented Mar 20, 2024

I shared the local NCBI database search idea with Julia Brown at Bigelow. She thinks that we may be able to leverage blastdbcmd to replace some of the entrez functionality we use now.

@egreyavis Perhaps we can set up a time to walk through the examples with an eye toward building some R wrappers.
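
As a starting point for that discussion, here is a minimal sketch of what such an R wrapper might look like. The function names `blastdbcmd_args()` and `fetch_local()` are hypothetical, and the sketch assumes `blastdbcmd` is on the PATH and that a local BLAST database already exists; only the standard `-db`, `-entry` and `-outfmt` options are used.

```r
# Hypothetical helper: build the argument vector for a blastdbcmd lookup.
# `accessions` are sequence identifiers; `db` is the local BLAST database name.
blastdbcmd_args <- function(accessions, db, outfmt = "%f") {
  stopifnot(is.character(accessions), length(accessions) > 0,
            is.character(db), length(db) == 1)
  c("-db", db,
    "-entry", paste(accessions, collapse = ","),  # comma-delimited identifiers
    "-outfmt", shQuote(outfmt))                   # "%f" requests FASTA output
}

# Hypothetical wrapper: run blastdbcmd and return its stdout as a character
# vector of lines. Stops early if the executable cannot be found.
fetch_local <- function(accessions, db, outfmt = "%f", cmd = "blastdbcmd") {
  if (!nzchar(Sys.which(cmd))) stop("blastdbcmd not found on PATH")
  system2(cmd, blastdbcmd_args(accessions, db, outfmt), stdout = TRUE)
}
```

Keeping the argument construction separate from the call makes it possible to check the command line without having BLAST installed.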

@btupper
Author

btupper commented Mar 28, 2024

@btupper
Author

btupper commented Apr 3, 2024

@egreyavis which databases (databasii?) are of interest to the project?

@btupper
Author

btupper commented Apr 5, 2024

So, I downloaded 1, 2 and 6 ('Invertebrate', 'Plant (including fungi and algae)' and 'Other vertebrate') and then ran restez::db_create(). I get this error...

Inspecting 4322 file(s) to add to the database ...
... 'gbinv1.seq.gz' (1/4322)
... 'gbinv10.seq.gz' (2/4322)
... 'gbinv100.seq.gz' (3/4322)
... 'gbinv1000.seq.gz' (4/4322)
... 'gbinv1001.seq.gz' (5/4322)
... 'gbinv1002.seq.gz' (6/4322)
... 'gbinv1003.seq.gz' (7/4322)
... 'gbinv1004.seq.gz' (8/4322)
... 'gbinv1005.seq.gz' (9/4322)
... 'gbinv1006.seq.gz' (10/4322)
... 'gbinv1007.seq.gz' (11/4322)
... 'gbinv1008.seq.gz' (12/4322)
... 'gbinv1009.seq.gz' (13/4322)
... 'gbinv101.seq.gz' (14/4322)
... 'gbinv1010.seq.gz' (15/4322)
... 'gbinv1011.seq.gz' (16/4322)
... 'gbinv1012.seq.gz' (17/4322)
... 'gbinv1013.seq.gz' (18/4322)
... 'gbinv1014.seq.gz' (19/4322)
... 'gbinv1015.seq.gz' (20/4322)
... 'gbinv1016.seq.gz' (21/4322)
... 'gbinv1017.seq.gz' (22/4322)
... 'gbinv1018.seq.gz' (23/4322)
Error in paste0(lines[indexes], collapse = "\n") : 
  result would exceed 2^31-1 bytes
Calls: db_create ... gb_build -> flatfile_read -> lapply -> FUN -> paste0
In addition: There were 23 warnings (use warnings() to see them)

I think this exceeds R's limit on the size of a single character string (2^31-1 bytes). I'll have to investigate. If needed, can we run the process on the 3 databasii separately and then merge the results?

@egreyavis
Contributor

Yes we could run them separately and then merge.

@btupper
Author

btupper commented Apr 5, 2024

OK - I'll set that up and see what happens

@btupper
Author

btupper commented Apr 8, 2024

I'm thinking we should set the max_length argument to avoid that error. (It's not really an error but a limitation on character string length in R. Who knew one might want 2^31 characters in a sequence?) I'm not sure about downstream consequences, but I suspect that it would allow us to proceed. How about max_length = 10^9 (1 billion), which is about half of 2^31?
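
For reference, the arithmetic behind that suggestion can be checked directly in R. This is only a sketch of the numbers involved; the variable names are illustrative and nothing here comes from restez itself.

```r
# Largest size (in bytes) of a single R character string: 2^31 - 1.
# This is the same value as the largest representable R integer.
string_limit <- 2^31 - 1
limit_matches_int_max <- string_limit == .Machine$integer.max

# Proposed cap on sequence length, expressed as a fraction of that limit.
proposed <- 10^9
fraction_of_limit <- proposed / string_limit

limit_matches_int_max          # TRUE
round(fraction_of_limit, 2)    # 0.47, i.e. a bit under half
```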

@egreyavis
Contributor

egreyavis commented Apr 11, 2024 via email

@btupper
Author

btupper commented Apr 11, 2024

I have tried a number of different max_length values (10^6, 10^15, 10^20, etc.) and I still encounter that error on occasion. I kicked off all three (invertebrates, plants and vertebrates) this morning, each using a YAML similar to the following.

name: invertebrate
rootpath: /mnt/storage/data/edna/refdb/restez
download:
  preselection: "1"
  db: nucleotide
  overwrite: TRUE
  max_tries: 3
create:
  db_type: nucleotide
  max_length: 1e6
  min_length: 1

Invertebrates bailed early with that same error. Plants and vertebrates are still running.

@btupper
Author

btupper commented Apr 12, 2024

Good news! The vertebrates database was successfully built.

Bad news! Plants joined invertebrates in failing to build.

@btupper
Author

btupper commented Apr 12, 2024

I have cloned the restez package, and have added error trapping/handling to the bit of code that flops. I'll build the package on "charlie" and give it a whirl. If that resolves the issues (by skipping the big ones) then we are at least unblocked. I'll keep you posted.
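
The general shape of that kind of error trapping is sketched below. This is not the actual change made to restez; `read_files_safely()` and the `read_one` argument are hypothetical stand-ins for whatever reads a single flat file, and the file names in the demo are made up.

```r
# Hypothetical helper: apply `read_one` to each file, skipping any file whose
# read raises an error (for example the 2^31-1 byte string limit).
# Note: a file for which `read_one` returns NULL is also counted as skipped.
read_files_safely <- function(files, read_one) {
  results <- list()
  skipped <- character(0)
  for (f in files) {
    res <- tryCatch(read_one(f), error = function(e) {
      message("Skipping '", f, "': ", conditionMessage(e))
      NULL
    })
    if (is.null(res)) skipped <- c(skipped, f) else results[[f]] <- res
  }
  list(results = results, skipped = skipped)
}

# Demo with a fake reader that fails on one made-up file name.
demo <- read_files_safely(
  c("a.seq.gz", "big.seq.gz"),
  function(f) {
    if (f == "big.seq.gz") stop("result would exceed 2^31-1 bytes")
    nchar(f)
  }
)
demo$skipped   # "big.seq.gz"
```

Returning the skipped file names alongside the results makes it easy to report afterwards which files were left out of the database.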

@egreyavis
Contributor

egreyavis commented Apr 12, 2024 via email

@btupper
Author

btupper commented Apr 23, 2024

Good news! Three databases (databasii?) downloaded and operational...

INFO [2024-04-23 12:19:49] db_name: invertebrate
INFO [2024-04-23 12:19:49] db_ready: TRUE
Checking setup status at  ...
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Restez path ...
... Path '/mnt/storage/data/edna/refdb/restez/invertebrate/restez'
... Does path exist? 'Yes'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Download ...
... Path '/mnt/storage/data/edna/refdb/restez/invertebrate/restez/downloads'
... Does path exist? 'Yes'
... N. files 2100
... Total size 261G
... GenBank division selections 'Invertebrate'
... GenBank Release 259
... Last updated '2024-04-08 10:32:31'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Database ...
... Path '/mnt/storage/data/edna/refdb/restez/invertebrate/restez/sql_db'
... Does path exist? 'Yes'
... Total size 404G
... Does the database have data? 'Yes'
... Number of sequences 1441629
... Min. sequence length 1
... Max. sequence length 1e+06
... Last_updated '2024-04-23 05:58:35'
INFO [2024-04-23 12:19:49] db_name: other_vertebrate
INFO [2024-04-23 12:19:49] db_ready: TRUE
Checking setup status at  ...
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Restez path ...
... Path '/mnt/storage/data/edna/refdb/restez/other_vertebrate/restez'
... Does path exist? 'Yes'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Download ...
... Path '/mnt/storage/data/edna/refdb/restez/other_vertebrate/restez/downloads'
... Does path exist? 'Yes'
... N. files 510
... Total size 62.1G
... GenBank division selections 'Other vertebrate'
... GenBank Release 259
... Last updated '2024-04-08 12:10:53'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Database ...
... Path '/mnt/storage/data/edna/refdb/restez/other_vertebrate/restez/sql_db'
... Does path exist? 'Yes'
... Total size 50.7G
... Does the database have data? 'Yes'
... Number of sequences 832069
... Min. sequence length 1
... Max. sequence length 1e+06
... Last_updated '2024-04-12 00:00:57'
INFO [2024-04-23 12:19:50] db_name: plant_with_fungi_algae
INFO [2024-04-23 12:19:50] db_ready: TRUE
Checking setup status at  ...
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Restez path ...
... Path '/mnt/storage/data/edna/refdb/restez/plant_with_fungi_algae/restez'
... Does path exist? 'Yes'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Download ...
... Path '/mnt/storage/data/edna/refdb/restez/plant_with_fungi_algae/restez/downloads'
... Does path exist? 'Yes'
... N. files 1714
... Total size 337G
... GenBank division selections 'Plant (including fungi and algae)'
... GenBank Release 259
... Last updated '2024-04-08 11:57:06'
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Database ...
... Path '/mnt/storage/data/edna/refdb/restez/plant_with_fungi_algae/restez/sql_db'
... Does path exist? 'Yes'
... Total size 118G
... Does the database have data? 'Yes'
... Number of sequences 1345347
... Min. sequence length 1
... Max. sequence length 1e+06
... Last_updated '2024-04-15 02:54:05'
INFO [2024-04-23 12:19:50] done!

@egreyavis
Contributor

egreyavis commented Apr 28, 2024 via email
