All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project (attempts to) adhere to Semantic Versioning.
- Lots of little bug-fixes for scoring plots to update to new versions of python / matplotlib
- Remove Biopython as a dependency
- Add tqdm as a dependency in pip
- Add skani as a comparison option
- Tiny bugfix related to loading cached secondary clusters
- Tiny bugfix related to loading cached secondary clusters
- Fix pandas pivot bug
- Quick bugfix for last update
- Only run checkM on bugs not in the genomeInfo list
- #179
- Thanks Jon Sanders!
- Add the options "--skip_evaluate" and "--skip_analyze" to dereplicate
- Update parse_stb.py to allow a list of genomes as the .fasta input in --reverse mode
- Set default S_algorithm to fastANI and sa to 0.95
- Run Prodigal in "single" mode
- If using multiround_primary_clustering, yell if not also tertiary_clustering
- Better timing prediction when using fastANI
- clusterAlg is now used to cluster the initial groups when doing multiround_primary_clustering (before it was always "single")
- fixed a bug that happened when you ran skipSecondary with centW > 0 (#120)
- checkM can now be run in groups (added argument variable "checkm_group_size")
- other slight internal restructuring around how checkM works
- a little more info added to the "troubleshooting dRep" section
- Update parse_stb.py to "append" instead of open, which lets it handle way bigger numbers of genomes
- Report how many genomes are being compared in last step of multi-round primary clustering
- Add the "extra_weight_table" flag to allow users to add custom extra scores to their genomes
- Do some automatic checking of genome input (list number of genomes, throw warning if same base name is used twice, throw warning if over 5000 genomes and no --multiround_primary_clustering)
- Make a section in the docs for troubleshooting checkM and reference it when checkM fails
- Remove some assert statements
- Really just a version bump to update bioconda requirements
- Refactoring the test suite and the d_cluster module
- Adding help to the -g option
- Bare-bones support for gzipped genomes (lots of dependencies don't handle them)
- Give a warning when it has errors loading the Mash table
- Completely refactor the test suite to use pytest
- Make plotting only give tracebacks when run in debug mode
- Remove most of the options (just "dereplicate" and "compare" remain)
- Add greedy clustering support! Both multiround_primary_clustering and greedy_secondary_clustering
- Add centrality support; also handle centrality with greedy_secondary_clustering (will be calculated with Mash)
- Add --run_tertiary_clustering. This feature runs a final dRep job within the original dRep folder, and adjusts Cdb and Wdb accordingly (see run_tertiary_clustering in d_evaluate.py)
- Log information about GenomeInformation when loading it
- Numerous improvements to ScaffoldLevel_dRep.py, including ability to process in chunks
- Update parse_stb.py to handle zipped .fasta files
- Add helper scripts ScaffoldLevel_dRep.py and parse_stb.py
- Trying to fix a bug related to pandas categories
- More bug fixes related to FastANI
- Allow loading of cached Ndb.csv
- The bug I tried to fix in 2.5.1 is able to fix itself
- Instead of crashing out, FastANI will report the error and keep going if parsing fails
- FastANI is now an option for secondary comparisons
- You can now feed in a list of genomes via the -g option
- More edits to make goANI work (what a bizarre bug with the output sometimes having different headers?)
- Changed the flag -n_PRESET to --n_PRESET
- Handle the case where a nsimscan run completely fails in goANI mode
- Remove "--force_overwrite" from checkM since it's no longer supported
- Updated warnings and documentation to reflect checkM being in python3 now
- Updated help to link to documentation
- Updated documentation in other little ways
- print tracebacks when plots fail
- fixed a weird bug with plot 5 resulting from genomeInformation having too many columns
- fixed plot 6 failing due to the deprecation of d_cluster.av_ani
- goANI bug resulting when there is no overlap in a filtered nsimscan
- goANI is now added as a secondary clustering algorithm; an open-source alternative to gANI
- Renamed --noQualityFiltering to --ignoreGenomeQuality
- Changed some things around to satisfy pandas deprecation warnings
- Fixed bug where Mash dendrogram labels were scrambled if a big list of genomes was used (thanks brymerr921)
- Fixed typos (thanks AstrobioMike!)
- Added the --set_recursion option for filter to handle dendropy errors
- WorkDirectory now loads databases in a way that makes more sense for large tables
- Some extra caching debug options
- Some commented out memory stuff
- Mash comparisons are now actually multithreaded (thanks mruehlemann)
- Throws an error if run with python2
- RAM optimization with regards to loading Mash table
- Pickle protocol 4 to allow storage of larger clusterings
- Increased debug tools
- use threading instead of multiprocessing. This should significantly help with RAM utilization of large genome lists
- Added some extra debug options
- removed the overwrite option and enabled it by default. It was half-baked and didn't work anyways
- added unit tests for scoring
- taxonomy now works when prodigal was not previously run
- some changes in d_cluster that make gANI work
- changes to the test suite so it doesn't fail if centrifuge isn't installed
- Fixed test_suite to work
- Fixed try / excepts around the plots failing. Looks like a matplotlib issue
- Plot 6 will not crash if Widb is not present
- Put try / excepts around the plots failing
- API documentation
- dereplicate_wf to dereplicate
- compare_wf to compare
- removed adjust (simply removed reference to it in the argument parser; easy to bring back if desired)
- Complete API coverage
- Ability to include genome information from other sources more easily
- More tests
- argument parsing broken up into more groups
- more tests
- complete API coverage
- ANImf, gANI, and ANIn now make folders in the output, so that there aren't too many files in one directory
- mash paste now goes in chunks, so that it will work if you have huge numbers of genomes (getting around OSError: [Errno 7] Argument list too long)
- full API coverage
- ability to make / generate genomeInfo better
- full API coverage
- deleted the whole re-cluster thing
- fixed centrifuge call (shell=False)
- made ANImf the default comparison algorithm
- added a little documentation on what ANImf is
- made ANImf not rerun if .delta.filtered is present
- made bonus call centrifuge using the new calling method
- made the new calling method actually take into account the number of threads
- updated tests to account for default ANImf
- with gANI, fixed a minor bug with the naming of self-comparisons in Ndb.csv
- with ANImf, fixed a major bug that was preventing it from working right at all
- added ANImf comparison method
- test_suite now works
- removed unnecessary import
- output of all external commands is now stored in the log directory
- all external commands are now run through the same method
- mash should now work on larger argument lists
- filtering with regards to strain heterogeneity fixed
- Logging now logs mummer commands run with ANIn option
- Choose can now function without checkM
- Scoring with regards to strain heterogeneity modified
- N50 calculation is now correct
- fixed the pip requirement to properly name scikit-learn (thanks Ben Woodcroft)
- added the blank folder test_backend/ so that the test suite will work
- added links to ISME publication in readme and documentation
- genome input for bonus
- option to do the 'percent' method for determining taxonomy
- testing for taxonomy
- Tdb now includes the columns "best_hit" and "full_tax" (for both methods of determining taxonomy)
- bonus --check_dependencies now exists
- gANI now gives a message when it fails
- pytest is now automatically installed with pip
- the logger now does ' '.join(args) when printing the args to run dRep
- documentation now correctly says "ANIcalculator"
- makes sure ANIcalculator and checkM work when loading them (even if you can find them in the system path)
- having the user add their own Chdb now works
- in bonus, changed an erroneous > to >= when calling centrifuge
- changed the way the test_suite compares dataframes
- nc option now works with gANI (controller makes it a float)
- test_backend is now a thing automatically
- coverage values at the threshold are now accepted (< instead of <=)
- setup now automatically installs scikit-bio (needed for MDS plot)
- the loop in which genome lengths are calculated in d_cluster is changed to prevent errors when running large numbers of genomes
- default coverage method is now "larger" (tests are updated to reflect this)
- gANI now properly computes the coverage using the "larger" method
- gANI can now tell when it's not installed
- documentation about dependencies changed; versions added as well
- changed the way ani averaging is done; substantial speed increase when working with very large secondary clusters
- added the dRep version to the log file
- prodigal now properly threads for gANI
- Plot 3 (MDS) now prints onto a square grid
- Testing suite now launches all tests automatically
- The final analyze plot can now be made!
- pyenv message
- log typos
- prodigal now multithreads the correct amount
- compare_wf double-printing issue
- --SkipMash now works
- default checkM method is now lineage_wf in dereplicate_wf, filter, and choose
- pyenv now references non-anaconda in the documentation and error messages
- default min length is now 50,000
- more thorough testing (though still not nearly enough)
- log now contains the exact command run
- Mash sketch size argument actually works now
- Fixed analyze to produce non-ugly plots
- Can now change the method of calculating coverage with ANIn
- Added TODO
- Bug that caused the program to crash when looking for Mash
- Lots of typos (in documentation and program help)
- Fixed some typos in documentation
- Added bioRxiv link in documentation
- Edited matplotlib settings to make text that can be edited in Illustrator
- Small bug in telling the user when checkM/prodigal is not in system path
- Small bug preventing SkipSecondary from working
- Test for SkipSecondary
- Time estimates now take into account threading
- Fixed the "choose" operation erroneously taking the "adjust" arguments as well
- Removed the 'mauve' option as a secondary algorithm
- Fixed a lot of erroneous logging
- Removed a bunch of commented-out methods
- README updated
- Documentation
- A bug where '-h' wasn't working properly
- Some messages in cluster that should be going to log
- Basic functional testing
- This here Changelog
- Probably a lot of other stuff that I don't remember (due to the previous lack of Changelog)
- Moved the argument parsing to a "controller" class
- The backend of the way the log runs. It's better now
- This is the name of the release generated before implementation of a changelog. The development version number for this release was 0.2.3
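
For readers unfamiliar with the statistic behind the "N50 calculation is now correct" entry above: N50 is conventionally the contig length at which the contigs of that length or longer cover at least half of the total assembly. The snippet below is a minimal sketch of that standard definition only. It is not dRep's actual code, and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Hypothetical helper illustrating the standard definition;
    not taken from the dRep code base.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")

    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating length
    # until at least half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Total is 54; the three longest contigs (10 + 9 + 8 = 27) reach half.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Comparing `running * 2` against `total` keeps the arithmetic in integers, which avoids rounding problems when the total length is odd.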