Error when using fastmultigather against rocksdb (Error: No such file or directory (os error 2) - Tested with multiple cases) #381

Closed
tnmquann opened this issue Jul 5, 2024 · 8 comments

Comments


tnmquann commented Jul 5, 2024

Hi @ctb,
Currently I'm using these commands:

Prepare data

cd /mnt/data/tnmquann/benchmarking/12_experiment
# Step 1: sourmash manysketch
sourmash scripts manysketch manysketch.csv -o manysketch.zip -c 20 -p k=31,scaled=1000,abund

# Step 2: unzip manysketch.zip (Note: I used this folder for all the commands below)
unzip manysketch.zip -d manysketch

# Additional: index gtdb-rs207.genomic-reps.dna.k31.zip
cd /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207
sourmash scripts index gtdb-rs207.genomic-reps.dna.k31.zip -o gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 30

# Check indexed database
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb

# Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Starting check
Finished check
...index is ok!

I tried several approaches and got the following results:

Solutions 1 & 2: work perfectly

cd /mnt/data/tnmquann/benchmarking/12_experiment
# Solution 1 - OK
sourmash scripts fastmultigather --cores 20 /mnt/data/tnmquann/benchmarking/12_experiment/manysketch/SOURMASH-MANIFEST.csv /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip
# Solution 2 - OK (use a loop + the parallel package to run this command for each sample)
# Recreate a *.sig.zip for each sample, then use fastgather
sourmash scripts fastgather /mnt/data/tnmquann/benchmarking/12_experiment/zip/trimmed-SRR17380114.sig.zip /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip -c 20 -o trimmed-SRR17380114.csv

Both methods above work perfectly. Only solution 3 below (fastmultigather with rocksdb) fails.

Solution 3: fastmultigather with rocksdb

Currently, this only works when the database is indexed directly in the processing folder.

Solution 3.1: Use the path to the indexed database

cd /mnt/data/tnmquann/benchmarking/12_experiment
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv

# Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against '/mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)

Solution 3.2: Copy indexed database into the processing folder and then run the commands

cd /mnt/data/tnmquann/benchmarking/12_experiment
cp -r /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv

## Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against '/mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)

# Try to re-check the copied rocksdb
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb
## Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Error: No such file or directory (os error 2)

Solution 3.3: Based on @ctb's suggestion

cd /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
# Symlink
ln -s /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.rocksdb .

sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv

## Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.5; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 1000 / moltype: DNA / threshold bp: 50000
gathering all sketches in 'SOURMASH-MANIFEST.csv' against 'gtdb-rs207.genomic-reps.dna.k31.rocksdb' using 20 threads
Error: No such file or directory (os error 2)

# Try to re-check the symlinked rocksdb
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb
## Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Error: No such file or directory (os error 2)

Solution 3.4: Based on @bluegenes's suggestion

cd /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207
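cp gtdb-rs207.genomic-reps.dna.k31.zip /mnt/data/tnmquann/benchmarking/12_experiment/manysketch
# Switch to the processing folder (step implied by the copy above and the relative paths below)
cd /mnt/data/tnmquann/benchmarking/12_experiment/manysketch

# Index database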
sourmash scripts index gtdb-rs207.genomic-reps.dna.k31.zip -o gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 30

# Check indexed database
sourmash scripts check gtdb-rs207.genomic-reps.dna.k31.rocksdb
## Output
== This is sourmash version 4.8.10. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
checking index 'gtdb-rs207.genomic-reps.dna.k31.rocksdb'
Opening DB
Starting check
Finished check
...index is ok!

# Re-run fastmultigather
sourmash scripts fastmultigather SOURMASH-MANIFEST.csv gtdb-rs207.genomic-reps.dna.k31.rocksdb -c 20 -o gather.csv
## Output is OK

I think there's a problem with the RocksDB folder configuration when running the index command.

Collaborator

ctb commented Jul 7, 2024

Thanks @tnmquann for this very detailed issue! Looking into it now.

First, a question that I think is unrelated to the problem you're experiencing: you shouldn't need to unzip manysketch.zip into a directory. You should be able to run

# Solution 1 - OK
sourmash scripts fastmultigather --cores 20 /mnt/data/tnmquann/benchmarking/12_experiment/manysketch.zip /mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip

directly, without the unzip and the use of SOURMASH-MANIFEST.csv. (It's actually kind of cool that running it on SOURMASH-MANIFEST.csv works, incidentally! But it should be unnecessary!)

Collaborator

ctb commented Jul 7, 2024

OK, I can replicate the problem with fastmultigather on my laptop. Not sure why I wasn't running into it before...

In brief,

# within directory `rocks-index`:
sourmash scripts index fake-metag.sig.zip -o fake-metag.rocksdb
ls -1 ../2.fa.sig > query.txt
sourmash scripts fastmultigather query.txt fake-metag.rocksdb
# works fine

# go to another directory
mkdir ../rocks2
cd ../rocks2
ln -s ../rocks-index/fake-metag.rocksdb .

# fails:
sourmash scripts check fake-metag.rocksdb

# fails:
ls -1 ../2.fa.sig > query.txt
sourmash scripts fastmultigather  query.txt fake-metag.rocksdb

Author

tnmquann commented Jul 8, 2024

Hi @ctb
Thanks for your question :D. I decompress the manysketch.zip file for two main reasons:

  1. My earlier experience passing manysketch output directly to fastmultigather was not good (if I remember correctly, I ran into errors in v0.8.1, so I temporarily set this plugin aside). Note: I tried again with a newer version (v0.9.3+) and the error was fixed.
  2. I was benchmarking yacht for my BSc thesis when I discovered that it is built on the sourmash and sourmash_branchwater modules. I combined the results from yacht and sourmash to reduce the chance of false positives, and the results were impressive. However, passing manysketch.zip directly to yacht resulted in the following error:
ValueError: Expected exactly one signature with ksize 31 in /mnt/data/quantnm/benchmarking/12_experiment/sketches/manysketch.zip, found 81. Likely you will need to do something like: sourmash sig merge /mnt/data/quantnm/benchmarking/12_experiment/sketches/manysketch.zip -o <new signature with just one sketch in it>.

So I have another workaround for this problem: I decompress the output of manysketch and try to reconstruct the individual *.sig.zip files (it seems this tool uses the multisearch module to run each sample separately). This workaround works quite well :D. The only problem is that this combination takes a lot of time, which is why I want to use fastmultigather with rocksdb to reduce processing time.
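
The per-sample workaround described above (one fastgather run per reconstructed *.sig.zip) could be scripted roughly as follows. This is a dry-run sketch that only prints the commands, using the flags already shown in this thread. The helper name `fastgather_cmd`, the `DB` variable and the sample path in the loop are illustrative assumptions, not part of sourmash.

```shell
#!/bin/bash
# Dry-run sketch: print one fastgather command per per-sample sketch file.
# fastgather_cmd is a hypothetical helper; its flags mirror those used earlier in this thread.
DB=/mnt/data/tnmquann/database/sourmash/GTDB_R07-RS207/gtdb-rs207.genomic-reps.dna.k31.zip

fastgather_cmd() {
    local query="$1" db="$2" cores="${3:-20}"
    local sample
    # trimmed-SRR17380114.sig.zip -> trimmed-SRR17380114
    sample="$(basename "$query" .sig.zip)"
    printf 'sourmash scripts fastgather %s %s -c %s -o %s.csv\n' \
        "$query" "$db" "$cores" "$sample"
}

# Example loop over per-sample zips (the path is a placeholder).
for q in zip/trimmed-SRR17380114.sig.zip; do
    fastgather_cmd "$q" "$DB"
done
```

Because the function only prints, the generated lines can be inspected first and then piped to `bash` (or to GNU parallel, which runs each input line as a command) once they look right.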

I'd be happy to discuss further if you have any questions.

Collaborator

ctb commented Jul 8, 2024

thanks! no, that all makes sense. And we should talk to the YACHT authors (with whom we are quite friendly ;)) about updating their code!

Collaborator

ctb commented Jul 9, 2024

This was rapidly turning into a heisenbug for me, so I brute-forced it and wrote a script to explore it.

tl;dr RocksDB indexes built from .zip files FAIL when referenced from other directories, while RocksDB indexes built from lists of files work fine!

(@bluegenes may owe me a drink because it was so hard to nail down this problem!)

#! /bin/bash 
set -e
set -x

rm -fr foo1 foo2 foo3

mkdir foo1
cd foo1

ls -1 ../{1,2,3,4,5,6,7,8,9}.fa.sig > list.txt
sourmash sig cat ../{1,2,3,4,5,6,7,8,9}.fa.sig -k 31 -o list.sig.zip
sourmash sig merge -k 31 ../{1,2,3}.fa.sig -o fake-metag.sig.gz

sourmash scripts index list.txt -o foo-from-list.db
sourmash scripts index list.sig.zip -o foo-from-zip.db

sourmash scripts check foo-from-list.db
sourmash scripts check foo-from-zip.db

sourmash scripts fastmultigather fake-metag.sig.gz foo-from-list.db -o out.csv
sourmash scripts fastmultigather fake-metag.sig.gz foo-from-zip.db -o out.csv

### 

cd ../
mkdir foo2
cd foo2

cp ../foo1/fake-metag.sig.gz .
sourmash scripts check ../foo1/foo-from-list.db
sourmash scripts check ../foo1/foo-from-zip.db ## this fails!

Collaborator

ctb commented Jul 9, 2024

A more succinct version

#! /bin/bash 
set -e
set -x

rm -fr foo5 list.txt list.sig.zip

ls -1 {1,2,3,4,5,6,7,8,9}.fa.sig > list.txt
sourmash sig cat {1,2,3,4,5,6,7,8,9}.fa.sig -k 31 -o list.sig.zip

mkdir foo5

sourmash scripts index list.txt -o foo5/foo-from-list.db
sourmash scripts index list.sig.zip -o foo5/foo-from-zip.db

sourmash scripts check foo5/foo-from-list.db
cd foo5
sourmash scripts check foo-from-list.db

cd ../
sourmash scripts check foo5/foo-from-zip.db
cd foo5
sourmash scripts check foo-from-zip.db # this breaks
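
To repeat this kind of test for other index/directory combinations, a small helper that runs the same command from a chosen working directory and reports whether it succeeded may be handy. `try_from` is a hypothetical name. The function is generic shell and assumes nothing about sourmash itself.

```shell
#!/bin/bash
# Run a command from a given working directory and report OK/FAIL.
# The subshell keeps the caller's working directory unchanged.
try_from() {
    local dir="$1"; shift
    if (cd "$dir" && "$@") >/dev/null 2>&1; then
        echo "OK   [$dir] $*"
    else
        echo "FAIL [$dir] $*"
        return 1
    fi
}

# Intended use with the layout from the script above (requires sourmash):
#   try_from .    sourmash scripts check foo5/foo-from-zip.db
#   try_from foo5 sourmash scripts check foo-from-zip.db
```

In a script that uses `set -e`, append `|| true` to calls that are expected to fail so the remaining checks still run.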

Collaborator

ctb commented Jul 9, 2024

OK, it looks like by default the rocksdb does not store the sketches internally, and what is happening is that the path to the zip file containing sketches is being interpreted problematically. 🤿 time.
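
The failure mode described here can be illustrated without sourmash at all. The sketch below is only an analogy: a stand-in "index" directory records a relative path to an external file, and a lookup resolves that path against the current working directory. The file name `source_path` and the function `open_index` are made up for illustration and do not reflect how the RocksDB index actually stores its references.

```shell
#!/bin/bash
# Analogy only: a toy "index" that stores a RELATIVE path to its data file.
work="$(mktemp -d)"
mkdir -p "$work/build" "$work/elsewhere"
echo "sketches" > "$work/build/list.sig.zip"

# "Index" step, as if run inside build/: record the path exactly as given.
mkdir "$work/build/toy.db"
echo "list.sig.zip" > "$work/build/toy.db/source_path"

# "Open" step: resolve the recorded path against the current directory.
open_index() {
    local db="$1" src
    src="$(cat "$db/source_path")"
    if [ -e "$src" ]; then
        echo "ok"
    else
        echo "No such file or directory: $src"
        return 1
    fi
}

# Same directory as the data file: the recorded path resolves.
(cd "$work/build" && open_index toy.db)
# Different directory: the recorded relative path no longer resolves.
(cd "$work/elsewhere" && open_index ../build/toy.db) || true
```

In this toy model, recording an absolute path (or storing the data inside the index) would remove the dependence on the working directory.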

Collaborator

ctb commented Aug 12, 2024

hi @tnmquann this led in some interesting directions 😅

The problem is described in detail in #415, but it is not yet resolved.

See PRs #390 and #416 for better default behavior and improved documentation.

If it's OK, I'm going to close this issue (since we now know what's going on) in favor of #415, which describes it without (yet) fixing it.
