Blast Database to JSON/Solr index #23

Open
averagehat opened this issue Aug 21, 2015 · 6 comments

Comments

@averagehat
Collaborator

We can imagine that someone (say, me) wants to dump their existing blast database (say nr/nt) into something Seqr-compatible.

blastdbcmd can dump FASTA entries like so:

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] >gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 >gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum AX2] >gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK

We have an index command but it doesn't know about the metadata between the |.
@lianyi Do you have anything for this?

@lianyi
Collaborator

lianyi commented Aug 22, 2015

It seems to be a convention to concatenate the db|id pairs with |. Perhaps we can use a regexp to parse out the metadata if needed.
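
A minimal sketch of the regexp idea, assuming the defline layout seen in the example above: each entry looks like gi|<number>|<db>|<accession>|<optional name> followed by a free-text title, and merged entries are separated by " >" (or by a Ctrl-A character in some dumps). The function and field names here are hypothetical, and real deflines may include variants this pattern does not cover.

```python
import re

# One defline entry: gi|<gi>|<db>|<accession>|<optional name> <title>
# Layout inferred from the example in this thread; not an exhaustive grammar.
DEFLINE_ENTRY = re.compile(
    r"gi\|(?P<gi>\d+)"
    r"\|(?P<db>[a-z]+)"
    r"\|(?P<accession>[^|\s]*)"
    r"\|(?P<name>\S*)\s*(?P<title>.*)"
)


def parse_defline(header):
    """Split a merged FASTA header into one metadata dict per entry."""
    entries = []
    # Entries are assumed to be separated by " >" or by Ctrl-A (\x01).
    for chunk in re.split(r"\s>|\x01", header.lstrip(">")):
        match = DEFLINE_ENTRY.match(chunk.strip())
        if match:
            entries.append(match.groupdict())
    return entries
```

Each returned dict carries the keys gi, db, accession, name and title as strings; name is empty for entries such as the ref one, where nothing follows the last | before the title.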

@lewisg-ncbi
Collaborator

Hi Mike,

Yes, it would be very helpful to have such a program. This was one of the "reach" goals for the hackathon and not that difficult to do…

Best,
Lewis


@averagehat
Collaborator Author

One can specify the output format of blastdbcmd, so maybe the thing to do is have it output a TSV file, which Seqr could accept as an alternative to JSON.

I will try this and see how it works out.
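
To make the TSV idea concrete, here is a rough sketch of turning such a file into JSON documents of the kind a Solr update request can take (a JSON array of objects). It assumes each line holds accession, gi, title, length and sequence separated by tabs; that column order, the field names, and the function name are all hypothetical, and Seqr's actual schema may differ.

```python
import json

# Hypothetical column order for the TSV dump; adjust to match the real export.
FIELDS = ("accession", "gi", "title", "length", "sequence")


def tsv_to_docs(lines, fields=FIELDS):
    """Yield one dict per well-formed TSV line, keyed by the given field names."""
    for line in lines:
        row = line.rstrip("\r\n").split("\t")
        if len(row) != len(fields):
            continue  # skip rows that do not match the expected column count
        doc = dict(zip(fields, row))
        if "length" in doc:
            doc["length"] = int(doc["length"])  # store the length as a number
        yield doc


def tsv_to_json(lines, fields=FIELDS):
    """Serialize all documents as a single JSON array."""
    return json.dumps(list(tsv_to_docs(lines, fields)))
```

Because tsv_to_docs is a generator, a large dump can be processed in batches instead of being held in memory at once; tsv_to_json is only convenient for small files.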

@DCGenomics
Contributor

Let me know if y'all want to interface with Tom Madden, head of BLAST.

Cheers!

Ben

@averagehat
Collaborator Author

I discovered that running blastdbcmd with -outfmt options, e.g.

blastdbcmd -db databases/ncbi/blast/nr/nr -entry all -outfmt "%s,%a,%g,%o,%i,%t,%l,%h,%T,%X,%e,%L,%C,%S,%N,%B,%K,%P" -target_only

takes a (prohibitively?) long time to run (and can't be parallelized simply, as far as I know).
Dumping to FASTA format instead is much faster, but it seems you lose some of the metadata.
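
If the faster FASTA dump is the route taken, the records could be streamed into JSON one at a time. The following is a sketch only: the function names and the "defline"/"sequence" keys are hypothetical, and the defline is kept as a raw string so any metadata parsing can happen in a later step.

```python
import json


def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []  # drop the leading ">"
        elif line.strip() and header is not None:
            chunks.append(line.strip())  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_json_lines(lines):
    """Yield one JSON object per record (JSON Lines style)."""
    for header, sequence in read_fasta(lines):
        yield json.dumps({"defline": header, "sequence": sequence})
```

Passing an open file handle as lines keeps memory use flat, which matters for something the size of nr.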

@lewisg-ncbi
Copy link
Collaborator

Mike,

It could be that some of the fields you are requesting are slow while others are fast. BLAST stores data in multiple files.

Best,
Lewis
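
One way to check which fields are the slow ones would be to time blastdbcmd once per format specifier on a small set of entries. This is a sketch under assumptions: the specifier list is copied from the command quoted in this thread, the helper names are hypothetical, the sample identifiers must be supplied by the caller, and timings on a small sample may not reflect the cost of a full "-entry all" dump.

```python
import subprocess
import time

# Format specifiers copied from the command quoted in this thread.
SPECIFIERS = ["%s", "%a", "%g", "%o", "%i", "%t", "%l", "%h", "%T",
              "%X", "%e", "%L", "%C", "%S", "%N", "%B", "%K", "%P"]


def build_command(db, spec, entries):
    """Build a blastdbcmd call that requests a single field for a few entries."""
    return ["blastdbcmd", "-db", db, "-entry", ",".join(entries),
            "-outfmt", spec, "-target_only"]


def time_specifiers(db, entries, specifiers=SPECIFIERS, run=subprocess.run):
    """Return {specifier: elapsed seconds}; `run` can be swapped out for testing."""
    timings = {}
    for spec in specifiers:
        cmd = build_command(db, spec, entries)
        start = time.perf_counter()
        run(cmd, stdout=subprocess.DEVNULL, check=True)
        timings[spec] = time.perf_counter() - start
    return timings
```

Sorting the resulting dict by value would show which specifiers to drop from the -outfmt string first.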

