Blast Database to JSON/Solr index #23

averagehat · 2015-08-21T15:24:54Z

We can imagine that someone (say, me) wants to dump their existing blast database (say nr/nt) into something Seqr-compatible.

blastdbcmd can dump FASTA entries like so:

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] >gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 >gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum AX2] >gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK

We have an index command, but it doesn't know about the metadata fields delimited by | in the header line.
@lianyi Do you have anything for this?


lianyi · 2015-08-22T14:14:25Z

It seems to be a convention to concatenate the db and id fields with |. Perhaps we can use a regexp to parse out the metadata if needed.
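A minimal sketch of the regexp idea, assuming headers follow the gi|number|db|accession|name layout seen in the example above and that merged entries are separated by " >". The function name parse_defline and the output field names are hypothetical, not part of Seqr.

```python
import re

# One entry looks like "gi|66816243|ref|XP_642131.1| description".
# The group names below are illustrative, not a Seqr schema.
ENTRY = re.compile(
    r"gi\|(?P<gi>\d+)"          # numeric GI
    r"\|(?P<db>[^|\s]+)"        # source database tag, e.g. ref, sp, gb
    r"\|(?P<accession>[^|\s]*)" # accession.version
    r"\|(?P<name>\S*)"          # optional locus/name after the last |
    r"\s*(?P<description>.*)"   # free-text description
)


def parse_defline(defline):
    """Split a merged FASTA header into one dict per entry."""
    records = []
    for chunk in defline.lstrip(">").split(" >"):
        match = ENTRY.match(chunk.strip())
        if match:  # silently skip entries that do not fit the assumed layout
            records.append(match.groupdict())
    return records


if __name__ == "__main__":
    header = (
        ">gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 "
        "[Dictyostelium discoideum AX4] "
        ">gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1"
    )
    for record in parse_defline(header):
        print(record)
```

Headers that do not start with gi| are skipped here rather than raising, which may or may not be the right choice for a full nr/nt dump.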

lewisg-ncbi · 2015-08-24T13:43:57Z

Hi Mike,

Yes, it would be very helpful to have such a program. This was one of the "reach" goals for the hackathon and not that difficult to do…

Best,
Lewis


averagehat · 2015-08-26T16:32:21Z

One can specify the output format of blastdbcmd, so maybe the thing to do is have it output a TSV file, which Seqr could accept as an alternative to JSON.

I will try this and see how it works out.
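A rough sketch of what consuming such a TSV might look like, assuming the dump has one record per line with tab-separated columns in the order accession, GI, title, length, taxid, sequence. The column order, the field names, and the function tsv_to_docs are assumptions for illustration; Seqr's actual schema may differ.

```python
import json
import sys

# Assumed column order of the TSV dump; adjust to match the -outfmt string used.
COLUMNS = ["accession", "gi", "title", "length", "taxid", "sequence"]


def tsv_to_docs(lines):
    """Yield one dict per well-formed TSV line, skipping malformed rows."""
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) != len(COLUMNS):
            continue  # wrong number of columns, e.g. a tab inside the title
        doc = dict(zip(COLUMNS, row))
        doc["length"] = int(doc["length"])
        yield doc


if __name__ == "__main__":
    # Read the TSV on stdin and write one JSON document per line on stdout.
    for doc in tsv_to_docs(sys.stdin):
        print(json.dumps(doc))
```

Streaming line by line keeps memory use flat, which matters for a database the size of nr/nt.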

DCGenomics · 2015-08-27T13:16:12Z

Let me know if y'all want to interface with Tom Madden, head of BLAST.

Cheers!

Ben

averagehat · 2015-09-02T21:43:47Z

I discovered that running blastdbcmd with -outfmt options, i.e.

blastdbcmd -db databases/ncbi/blast/nr/nr -entry all -outfmt "%s,%a,%g,%o,%i,%t,%l,%h,%T,%X,%e,%L,%C,%S,%N,%B,%K,%P" -target_only

takes a (prohibitively?) long time to run, and as far as I know it can't be parallelized simply.
Dumping to FASTA format is much faster, but it seems you lose some of the metadata.

lewisg-ncbi · 2015-09-02T21:57:04Z

Mike,

It could be that some of the fields you are requesting are slow while others are fast. BLAST stores its data in multiple files.

Best,
Lewis
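One way to test this would be to time each format specifier separately on a handful of entries. The sketch below assumes blastdbcmd is on the PATH and reuses only the flags from the command above; the helper names (build_command, time_field, rank_fields) and the sample entry IDs are hypothetical.

```python
import subprocess
import time

# Specifiers taken from the -outfmt string in the command above.
FIELDS = ["%s", "%a", "%g", "%o", "%i", "%t", "%l", "%h", "%T",
          "%X", "%e", "%L", "%C", "%S", "%N", "%B", "%K", "%P"]


def build_command(db, entries, field):
    """Build a blastdbcmd call that requests a single output field."""
    return ["blastdbcmd", "-db", db, "-entry", ",".join(entries),
            "-outfmt", field, "-target_only"]


def time_field(db, entries, field):
    """Return wall-clock seconds for dumping one field for the given entries."""
    start = time.perf_counter()
    subprocess.run(build_command(db, entries, field), check=True,
                   stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


def rank_fields(db, entries, fields=FIELDS):
    """Time every field once and return (field, seconds) pairs, slowest first."""
    timings = {field: time_field(db, entries, field) for field in fields}
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Sample IDs taken from the FASTA header earlier in this thread.
    sample = ["XP_642131.1", "P54670.1"]
    for field, seconds in rank_fields("databases/ncbi/blast/nr/nr", sample):
        print(f"{field}\t{seconds:.3f}s")
```

A single run per field is noisy, and disk caching can favor whichever fields are read later, so the ranking should be taken as a rough hint only.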


This was referenced Aug 27, 2015

Index should accept CSV/JSON #26 (Closed)

JSON/CSV file loading #27 (Merged)
