Home
Cassiopee is a Ruby module that searches for a string in another string (or file), either as an exact match or within an allowed distance. An index can optionally be saved for later searches.
Exact and approximate search are used in many fields, including bioinformatics, where they are used to find patterns in DNA/RNA sequences. The software works on both small and large sequences.
The code is open source.
The gem is available on RubyGems.org. It can also be built from cassiopee.gemspec.
Two methods are available: DIRECT (the default) and SUFFIX. DIRECT parses the string at every position and returns the results. SUFFIX saves all the suffixes it finds, then checks for a match within those suffixes. SUFFIX is RAM-intensive for large sequences, but it speeds up the process when several searches are made in the same context, because it avoids reparsing the whole sequence.
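The difference between the two methods can be sketched in plain Ruby. This is only an illustration of the idea, not Cassiopee's implementation: the function names direct_search, build_suffix_table and suffix_search are hypothetical, and the suffix table here is a naive array of strings.

```ruby
# Illustrative sketch only; none of these names come from Cassiopee.

# DIRECT-style: test the pattern at every start position of the text.
def direct_search(text, pattern)
  last_start = text.length - pattern.length
  (0..last_start).select { |pos| text[pos, pattern.length] == pattern }
end

# SUFFIX-style: build a table of all suffixes once...
def build_suffix_table(text)
  (0...text.length).map { |pos| [text[pos..-1], pos] }
end

# ...then reuse the table for every search in the same context.
def suffix_search(table, pattern)
  table.select { |suffix, _pos| suffix.start_with?(pattern) }
       .map { |_suffix, pos| pos }
end

text  = "acgtacgtta"
table = build_suffix_table(text)

p direct_search(text, "acg")   # prints [0, 4]
p suffix_search(table, "acg")  # prints [0, 4]
p suffix_search(table, "gt")   # prints [2, 6]
```

Storing every suffix costs memory that grows quickly with the sequence length, which is the trade-off described above: the table is expensive to hold but is built only once.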
It is possible to define a filter on the start position of matches (min and max). This limits the matches to a window in the indexed string. Setting max to 0 means no upper limit. If store is not used, the filter also speeds up the search.
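A minimal sketch of such a position window, in plain Ruby. The function windowed_search and its min/max keywords are hypothetical, and treating max as an inclusive bound is an assumption; only the "max of 0 means no max" rule comes from the description above.

```ruby
# Illustrative sketch only; windowed_search is not a Cassiopee method.
# Assumption: min and max are inclusive start positions.
def windowed_search(text, pattern, min: 0, max: 0)
  last_start = text.length - pattern.length
  # max == 0 disables the upper bound, as described above.
  upper = max.zero? ? last_start : [max, last_start].min
  return [] if upper < min

  # Only positions inside the window are scanned at all.
  (min..upper).select { |pos| text[pos, pattern.length] == pattern }
end

text = "acgtacgtta"
p windowed_search(text, "acg")                  # prints [0, 4]
p windowed_search(text, "acg", min: 1)          # prints [4]
p windowed_search(text, "acg", max: 3)          # prints [0]
p windowed_search(text, "acg", min: 1, max: 3)  # prints []
```

Because positions outside the window are never visited, a scan of this kind does less work when the window is narrow.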
Comments is an array of characters that mark comment lines. Any line starting with one of those characters is skipped and not indexed.
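The effect of the comment characters can be sketched as follows. The helper strip_comment_lines is hypothetical, and joining the remaining lines into a single sequence is an assumption made for the example.

```ruby
# Illustrative sketch only; strip_comment_lines is not a Cassiopee method.
# Drops every line whose first character is in `comments`,
# then joins the remaining lines into one sequence (an assumption).
def strip_comment_lines(raw, comments)
  raw.each_line
     .map(&:chomp)
     .reject { |line| comments.include?(line[0]) }
     .join
end

fasta = ">seq1 sample\nacgt\n#note\nttga\n"
p strip_comment_lines(fasta, [">", "#"])  # prints "acgtttga"
```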
The optimal methods (length or cost) remove some matches from the final result. This is a post-processing step.
For length, only the longest match is kept for a given start position.
For cost, only the match with the lowest cost (Hamming or Levenshtein) is kept for a given start position.
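This post-processing step can be sketched as a group-and-pick over the matches. The Match struct and the keep_optimal function are hypothetical names chosen for the example; they are not Cassiopee's data structures.

```ruby
# Illustrative sketch only; Match and keep_optimal are not Cassiopee names.
Match = Struct.new(:position, :length, :cost)

# Keeps a single match per start position:
#   :length keeps the longest match, :cost keeps the cheapest one.
def keep_optimal(matches, by)
  matches.group_by(&:position).map do |_position, group|
    case by
    when :length then group.max_by(&:length)
    when :cost   then group.min_by(&:cost)
    else raise ArgumentError, "unknown criterion: #{by}"
    end
  end
end

matches = [
  Match.new(0, 4, 1),
  Match.new(0, 5, 2),
  Match.new(7, 4, 0)
]

p keep_optimal(matches, :length).map(&:to_a)  # prints [[0, 5, 2], [7, 4, 0]]
p keep_optimal(matches, :cost).map(&:to_a)    # prints [[0, 4, 1], [7, 4, 0]]
```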
It is possible to define an alphabet ambiguity, i.e. to associate multiple character values with a single one. This is common in bioinformatics, for DNA sequences for example.
An ambiguity file looks like this:
b=c,g,t
r=a,g
When the file is loaded with loadAmbiguityFile, the useAmbiguity variable is set, and searches (exact or approximate) use this alphabet transformation. This has an impact on performance, mainly for exact search.
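To make the file format concrete, here is a small sketch that parses such content and compares two strings under the resulting alphabet. The functions parse_ambiguity and ambiguous_match? are hypothetical and are not how loadAmbiguityFile works internally; in particular, the example assumes the ambiguous characters appear in the pattern.

```ruby
# Illustrative sketch only; these helpers are not part of Cassiopee.

# Parses lines such as "b=c,g,t" into { "b" => ["c", "g", "t"] }.
def parse_ambiguity(content)
  content.each_line.with_object({}) do |line, table|
    key, values = line.strip.split("=", 2)
    next if key.nil? || key.empty? || values.nil?

    table[key] = values.split(",").map(&:strip)
  end
end

# A pattern character matches a candidate character if both are equal,
# or if the candidate is one of the values the pattern character stands for.
def ambiguous_match?(pattern, candidate, table)
  return false unless pattern.length == candidate.length

  pattern.each_char.zip(candidate.each_char).all? do |p_char, c_char|
    p_char == c_char || table.fetch(p_char, []).include?(c_char)
  end
end

table = parse_ambiguity("b=c,g,t\nr=a,g\n")
p table                                  # prints {"b"=>["c", "g", "t"], "r"=>["a", "g"]} (spacing varies by Ruby version)
p ambiguous_match?("arb", "agt", table)  # prints true
p ambiguous_match?("arb", "aca", table)  # prints false
```

Each character comparison now involves a table lookup, which is one way to see why ambiguity costs more than a plain exact comparison.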
The CrawlerCache class adds (very) basic cache management. If useCache is set in Crawler, the result is saved in a file. If the next request is identical or falls within the same scope (positions, errors), the cached results (or a subset of them) are returned instead of reparsing.
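The "same scope" idea can be sketched as follows for the error count only. This is not the CrawlerCache code: the class TinySearchCache and the CachedMatch struct are hypothetical, the cache is kept in memory rather than in a file, and the position scope is left out for brevity.

```ruby
# Illustrative sketch only; not the CrawlerCache implementation.
CachedMatch = Struct.new(:position, :cost)

class TinySearchCache
  def initialize
    @pattern = nil
    @max_errors = nil
    @matches = []
  end

  # Remembers the results of the last search.
  def store(pattern, max_errors, matches)
    @pattern = pattern
    @max_errors = max_errors
    @matches = matches
  end

  # Returns the cached matches usable for this request, or nil on a miss.
  # A request with the same pattern and fewer (or equal) allowed errors
  # is answered with a subset of the cached results.
  def lookup(pattern, max_errors)
    return nil if @max_errors.nil?
    return nil unless @pattern == pattern && max_errors <= @max_errors

    @matches.select { |match| match.cost <= max_errors }
  end
end

cache = TinySearchCache.new
cache.store("acgt", 2, [CachedMatch.new(0, 0), CachedMatch.new(5, 2)])

p cache.lookup("acgt", 1).map(&:position)  # prints [0]
p cache.lookup("acgt", 3)                  # prints nil (wider scope, must reparse)
p cache.lookup("tttt", 1)                  # prints nil (different pattern)
```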