Introducing ShortSeq192, the successor to ShortSeq128 #10

Merged on Mar 24, 2024 (11 commits)
8 changes: 4 additions & 4 deletions README.md
@@ -5,8 +5,8 @@ ShortSeqs are compact and efficient Python objects that hold short sequences whi
| Sequence Length | PyUnicode Size | PyBytes Size | ShortSeq Size | % Reduced |
|-----------------|----------------|----------------|-----------------:|------------|
| 0-32 nt | 56-88 bytes | 40-72 bytes | 32 bytes (fixed) | **43-64%** |
-| 33-64 nt        | 88-120 bytes   | 72-104 bytes   | 40 bytes (fixed) | **55-67%** |
-| 65-1024 nt      | 120-1080 bytes | 104-1064 bytes | 56-288 bytes     | **53-73%** |
+| 33-96 nt        | 88-152 bytes   | 72-136 bytes   | 48 bytes (fixed) | **55-68%** |
+| 97-1024 nt      | 152-1080 bytes | 136-1064 bytes | 64-288 bytes     | **57-73%** |

<sup>* Object sizes were measured on Python 3.10 using `asizeof()` from the `pympler` package. % Reduced is PyUnicode vs. ShortSeq</sup>
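The PyUnicode and PyBytes columns, and the % Reduced figure, can be approximated with the standard library alone. The following is a minimal sketch, not the project's benchmark: it uses `sys.getsizeof()` in place of `pympler`'s `asizeof()`, and it assumes sizes are rounded up to a multiple of 8 bytes. The helper names `aligned_size` and `percent_reduced` are hypothetical, and absolute sizes vary between Python versions.

```python
import sys


def aligned_size(obj, align=8):
    """Size reported by sys.getsizeof(), rounded up to a multiple of `align`.

    The 8-byte rounding is an assumption made to mimic the table's values.
    """
    size = sys.getsizeof(obj)
    return -(-size // align) * align  # ceiling division, then scale back up


def percent_reduced(baseline, compact):
    """Percentage saved when `compact` bytes replace `baseline` bytes."""
    return 100.0 * (baseline - compact) / baseline


# Sizes of str and bytes objects at a few sequence lengths.
for n in (0, 32, 96):
    seq = "A" * n
    print(n, aligned_size(seq), aligned_size(seq.encode()))

# First table row: PyUnicode 56-88 bytes vs. a fixed 32-byte ShortSeq.
print(round(percent_reduced(56, 32)), round(percent_reduced(88, 32)))  # 43 64
```

Each additional nucleotide adds one byte to an ASCII `str` or a `bytes` object, which is why both columns grow linearly with sequence length.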

@@ -74,7 +74,7 @@ However, AMD processors [prior to Zen 3](https://en.wikipedia.org/wiki/X86_Bit_m
<img src="doc/plots/mem_by_length.svg" alt="Memory usage by length"/>
</p>

-Note that the measurement of Gzip Bytes is the _length_ in bytes of the compressed sequence at maximum compression (level 9), which is much smaller than the actual PyBytes object that `gzip.compress()` returns. This footprint is therefore unattainable when using Python's gzip module, and instead serves as a theoretical lower bound for the memory footprint of a compressed sequence.
+Note that the measurement of Gzip Bytes is the _length_ in bytes of the compressed sequence at maximum compression (level 9), which is much smaller than the actual PyBytes object returned by `gzip.compress()`. This footprint is therefore unattainable when using Python's gzip module, and instead serves as a theoretical lower bound for the memory footprint of a compressed sequence.
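The gap between the compressed length and the size of the object that holds it can be seen with a short experiment. This is a sketch rather than the benchmark's code: the sequence is an arbitrary example, and `sys.getsizeof()` stands in for whatever measurement the benchmark uses.

```python
import gzip
import sys

seq = b"ATGCATGCATGCATGCATGCATGCATGCATGC"  # arbitrary 32 nt example

# Maximum compression, as in the Gzip Bytes measurement.
compressed = gzip.compress(seq, compresslevel=9)

payload_len = len(compressed)            # length of the compressed data only
object_size = sys.getsizeof(compressed)  # the bytes object, header included

print(payload_len, object_size)
```

The bytes object always carries a fixed header on top of its payload, so `object_size` exceeds `payload_len` for any input.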

[View source: MemoryBenchmarks.test_mem_by_length()](shortseq/tests/benchmark.py#L44)
</br></br>
@@ -92,7 +92,7 @@ Edit distance calculation is extremely efficient for ShortSeqs and can be perfor
<img src="doc/plots/from_bytes_time.svg" alt="Construction time from PyBytes input"/>
</p>

-ShortSeq construction involves encoding the sequence string into a compressed binary representation, which is an O(n) operation, whereas `x.decode()` and `np.char.asarray()` are O(1) because they essentially copy the object's internal buffer.
+This benchmark is very generous towards the competition because no attempt is made to check for non-nucleotide characters, whereas ShortSeq uses a bloom filter to do so for all inputs. ShortSeq construction involves encoding the sequence string into a compressed binary representation, which is an O(n) operation, whereas `x.decode()` and `np.char.asarray()` are O(1) because they essentially copy the object's internal buffer.
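To illustrate why encoding is O(n), here is a minimal pure-Python sketch of packing a sequence at 2 bits per nucleotide. It is not ShortSeq's implementation: the 2-bit codes, the function names `pack_2bit` and `unpack_2bit`, and the use of a plain dictionary lookup for validation (instead of the bloom filter mentioned above) are all assumptions made for illustration.

```python
# Hypothetical 2-bit codes; ShortSeq's real encoding may differ.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index i holds the base whose code is i


def pack_2bit(seq):
    """Pack a nucleotide string into an int, 2 bits per base.

    Every character is visited once, so the cost is O(n) in the
    sequence length. Invalid characters raise ValueError.
    """
    packed = 0
    for i, base in enumerate(seq):
        try:
            code = CODES[base]
        except KeyError:
            raise ValueError(
                f"non-nucleotide character {base!r} at position {i}"
            ) from None
        packed |= code << (2 * i)  # base i occupies bits 2i and 2i+1
    return packed, len(seq)


def unpack_2bit(packed, length):
    """Recover the nucleotide string from its packed form."""
    return "".join(BASES[(packed >> (2 * i)) & 0b11] for i in range(length))


packed, length = pack_2bit("ACGT")
print(packed, unpack_2bit(packed, length))  # 228 ACGT
```

The length is returned alongside the packed integer because trailing `A` bases encode as zero bits and would otherwise be lost on decoding.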

[View source: TimeBenchmarks.test_construction_from_bytes()](shortseq/tests/benchmark.py#L84)
</br></br>