MRG: Improve documentation and refactor, esp the categories code (#33)
* add some Python tests; refactor; add docs

* upd README

* remove redundant code

* update CI to run python tests

* fix labels problem, update plots

* refactor xtick/ytick labels code

* bump version to 0.3.5

* clean up category coloring code names

* nicer cleanup
ctb authored Jun 8, 2024
1 parent bc9a475 commit 3cfa25b
Showing 10 changed files with 256 additions and 94 deletions.
6 changes: 5 additions & 1 deletion .github/workflows/build-test.yml
@@ -45,6 +45,10 @@ jobs:
shell: bash -l {0}
run: pip install .

- name: run python tests
shell: bash -l {0}
run: make test

- name: build examples
shell: bash -l {0}
run: make cleanrun
run: make cleanall
2 changes: 1 addition & 1 deletion Makefile
@@ -14,7 +14,7 @@ install:
examples:
cd examples && make

cleanrun:
cleanall:
cd examples && make cleanall

dist:
85 changes: 75 additions & 10 deletions README.md
@@ -4,7 +4,13 @@
sequence analysis and comparisons.

`betterplot` is a sourmash plugin that provides improved plotting/viz
and cluster examination for sourmash-based sketch comparisons.
and cluster examination for sourmash-based sketch comparisons. It
includes better similarity matrix plotting, MDS plots, and
clustermaps, as well as support for coloring samples based on
categories. It also includes support for sparse comparison output
formats produced by the fast multithreaded `manysearch` and `pairwise`
functions in the
[branchwater plugin for sourmash](https://github.com/sourmash-bio/sourmash_plugin_branchwater).

## Why does this plugin exist?

@@ -40,7 +46,58 @@ pip install sourmash_plugin_betterplot

## Usage

See the examples below.
See the examples below for sample command lines and output,
and use command-line help (`-h/--help`) to see available options.

### Labels on plots: the `labels-to` CSV file

The `labels-to` CSV file taken by most (all?) of the comparison matrix
plotting functions (e.g. `plot2`, `plot3`, `mds`) uses the same format
written by
[`sourmash compare ... --labels-to <file>`](https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-compare-compare-many-signatures)
and loaded by `sourmash plot --labels-from <file>`. The format is
mostly self-explanatory, but a few points are worth noting:

* the `sort_order` column specifies the order of the columns with respect
to the samples in the distance matrix. This is there to support arbitrary
re-arranging and processing of the CSV file.
* the `label` column is the name that will be displayed on the plot, and is
also used for the default "categories" CSV matching (see below). You can edit this
by hand (spreadsheet, text editor) or programmatically.
* as a side note, the `labels.txt` file output by `sourmash compare`
is entirely ignored ;).
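
As an illustration of how the `sort_order` and `label` columns relate, here is a minimal Python sketch that reads a labels file and returns the labels in matrix order. The file contents and the `load_labels` helper are hypothetical, and treating `sort_order` as a zero-based position in the matrix is an assumption rather than something taken from the plugin code.

```python
import csv
import io

# Hypothetical labels-to content; real files may carry more columns,
# but only `sort_order` and `label` are used here.
LABELS_CSV = """sort_order,label
2,sample C
0,sample A
1,sample B
"""


def load_labels(fp):
    """Return the labels ordered by `sort_order` (assumed matrix order)."""
    rows = list(csv.DictReader(fp))
    # Rows may have been re-arranged by hand, so sort before using them.
    rows.sort(key=lambda row: int(row["sort_order"]))
    return [row["label"] for row in rows]


labels = load_labels(io.StringIO(LABELS_CSV))
print(labels)
```

Because the order comes from the column rather than from the row position, the rows can be shuffled or filtered in a spreadsheet without losing track of which label belongs to which matrix column.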

### Categories on plots: the "categories" CSV file

One of the nice features of the betterplot functions is the ability to
provide categories that color the plots. This is critical for some
plots - for example, the `mds` and `mds2` plots don't make much sense
without colors! - and nice for other plots, like `plot3` and
`clustermap1`, where you can color columns/rows by category.

To make use of this feature, you need to provide a "categories" CSV
file (typically `-C/--categories-csv`). This file is reasonably flexible
in format: it must contain at least two columns, one of which must be named
`category`, and it can contain additional columns.

The simplest possible categories CSV format is shown in
[10sketches-categories.csv](examples/10sketches-categories.csv), and
it contains two columns, `label` and `category`. When this file is
loaded, `label` is matched to the name of each point/row/column, and
that point is then assigned that category.
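
The label-to-category matching can be sketched in a few lines of Python. This is an illustration only, not the plugin's implementation: the CSV contents, the `load_categories` helper, and its `match_column` parameter are all hypothetical.

```python
import csv
import io

# Hypothetical two-column categories CSV in the simplest format.
CATEGORIES_CSV = """label,category
sample A,gut
sample B,soil
sample C,gut
"""


def load_categories(fp, match_column="label"):
    """Map each value in `match_column` to its `category`."""
    reader = csv.DictReader(fp)
    if "category" not in (reader.fieldnames or []):
        raise ValueError("categories CSV needs a 'category' column")
    return {row[match_column]: row["category"] for row in reader}


label_to_category = load_categories(io.StringIO(CATEGORIES_CSV))

# Names of the points/rows/columns, in plot order (hypothetical).
plot_labels = ["sample A", "sample B", "sample C"]
categories = [label_to_category[name] for name in plot_labels]
print(categories)
```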

Additional flexibility is provided by the column matching.

Some restrictions of / observations on the current implementation:
* if a categories CSV is provided, every point must have an
associated category. It should be possible to have more categories than
points - checkme, @CTB!
* there is currently no way to specify a specific color for a
category; they get assigned at random.
* it is entirely OK to edit the labels file (see above) and just add
a `category` column. This won't be picked up by the
code automatically - you'll need to specify the same file via `-C` -
but it works fine!
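
To check a categories file against the first restriction before plotting, something like the following Python sketch could be used. The `check_categories` helper is hypothetical; treating extra entries in the categories CSV as harmless is an assumption here, since the note above leaves that point unconfirmed.

```python
def check_categories(plot_labels, label_to_category):
    """Raise if any plotted label lacks a category; return unused labels."""
    missing = [name for name in plot_labels if name not in label_to_category]
    if missing:
        raise ValueError(f"no category for: {', '.join(missing)}")
    # Entries that match no plotted point are reported, not rejected
    # (assumption: extra rows are allowed).
    return sorted(set(label_to_category) - set(plot_labels))


unused = check_categories(
    ["sample A", "sample B"],
    {"sample A": "gut", "sample B": "soil", "sample Z": "gut"},
)
print(unused)
```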

## Examples

@@ -49,7 +106,7 @@ of the repository after installing the plugin.

### `plot2` - basic 3 sketches example

Compare 3 sketches, and cluster.
Compare 3 sketches with `sourmash compare`, and cluster.

This command:
```
@@ -66,7 +123,7 @@ produces this plot:

### `plot2` - 3 sketches example with a cut line: plot2 --cut-point 1.2

Compare 3 sketches, cluster, and show a cut point.
Compare 3 sketches with `sourmash compare`, cluster, and show a cut point.

This command:
```
@@ -84,8 +141,9 @@ produces this plot:

### `plot2` - dendrogram of 10 sketches with a cut line + cluster extraction

Compare 10 sketches, cluster, and use a cut point to extract
multiple clusters. Use `--dendrogram-only` to plot just the dendrogram.
Compare 10 sketches with `sourmash compare`, cluster, and use a cut
point to extract multiple clusters. Use `--dendrogram-only` to plot
just the dendrogram.

This command:
```
@@ -106,7 +164,7 @@ as well as a set of 6 clusters to `10sketches.cmp.*.csv`.

### `mds` - multidimensional scaling (MDS) plot of 10-sketch comparison

Use MDS to display a comparison.
Use MDS to display a comparison generated by `sourmash compare`.

These commands:
```
@@ -147,6 +205,9 @@ produces this plot:

### `pairwise_to_compare` - convert `pairwise` output to `sourmash compare` output and plot

Convert the sparse comparison CSV (created using the
[branchwater plugin's `pairwise` command](https://github.com/sourmash-bio/sourmash_plugin_branchwater)) into a `sourmash compare`-style similarity matrix.
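
Conceptually, the conversion fills a dense, symmetric matrix from a list of pairs. The Python sketch below illustrates that idea only. The column names `query_name`, `match_name`, and `jaccard`, the sample rows, and the `pairwise_to_matrix` helper are assumptions, not the plugin's actual code.

```python
import csv
import io

# Hypothetical sparse pairwise output: one row per reported pair.
PAIRWISE_CSV = """query_name,match_name,jaccard
A,B,0.5
A,C,0.1
"""


def pairwise_to_matrix(fp, value_column="jaccard"):
    """Build a dense symmetric similarity matrix from sparse pair rows."""
    rows = list(csv.DictReader(fp))
    names = sorted(
        {r["query_name"] for r in rows} | {r["match_name"] for r in rows}
    )
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    # Pairs that were never reported stay at 0.0; self-similarity is 1.0.
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0
    for r in rows:
        i, j = index[r["query_name"]], index[r["match_name"]]
        matrix[i][j] = matrix[j][i] = float(r[value_column])
    return names, matrix


names, matrix = pairwise_to_matrix(io.StringIO(PAIRWISE_CSV))
print(names)
print(matrix)
```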

These commands:
```
# build pairwise
@@ -171,8 +232,8 @@ produce this plot:

### `plot3` - seaborn clustermap with color categories

The
[`seaborn` clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html)
Plot a `sourmash compare` similarity matrix using the
[`seaborn` clustermap](https://seaborn.pydata.org/generated/seaborn.clustermap.html), which
offers some nice visualization options.

These commands:
@@ -191,6 +252,10 @@ produce this plot:

### `clustermap1` - seaborn clustermap for non-symmetric matrices

Plot the sparse comparison CSV (created using the
[branchwater plugin's `manysearch` command](https://github.com/sourmash-bio/sourmash_plugin_branchwater)) using seaborn's clustermap. Supports separate
category coloring on rows and columns.

These commands:
```
sourmash sig cat sketches/{2,47,48,49,51,52,53,59,60}.sig.zip \
@@ -245,4 +310,4 @@ followed by `twine upload dist/...`.

---

CTB May 2024
CTB June 2024
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -3,7 +3,7 @@ name = "sourmash_plugin_betterplot"
description = "sourmash plugin for improved plotting/viz and cluster examination."
readme = "README.md"
requires-python = ">=3.10"
version = "0.3.4"
version = "0.3.5"

dependencies = ["sourmash>=4.8.8,<5",
"matplotlib", "numpy", "scipy", "scikit-learn", "seaborn"]