BDT-COFFEE integrates consistency information into T-Coffee (TC) through a Cassandra database. This information is generated beforehand with the MapReduce processing paradigm (PPCAS), which enables large datasets to be processed and improves the performance and scalability of the original algorithm.
Compiling T-Coffee and PPCAS requires the following tools installed on your system: make and gcc-c++. In addition, T-Coffee needs cpp-driver, boost, sparse-map and jemalloc, a custom memory allocator that allows allocations and deallocations to be done in any order (the Makefile paths for these libraries are set to $(HOME)).
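As a rough sketch of setting up these build dependencies, assuming a Debian/Ubuntu-like system (the package names and build steps are assumptions; adjust them to your distribution and to the paths your Makefiles expect):
$ sudo apt-get install make g++ cmake libboost-all-dev libuv1-dev libssl-dev
$ cd $HOME
$ git clone https://github.com/datastax/cpp-driver.git
$ git clone https://github.com/Tessil/sparse-map.git
$ git clone https://github.com/jemalloc/jemalloc.git
$ (cd cpp-driver && mkdir build && cd build && cmake .. && make)
$ (cd jemalloc && ./autogen.sh && make)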
Execution requires a Hadoop, Spark and Cassandra infrastructure with the corresponding environment variables and paths correctly set. A Python installation with NumPy is also needed.
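As an illustrative sketch (the install locations below are assumptions; adapt them to your cluster), the environment could be prepared along these lines:
$ export HADOOP_HOME=/opt/hadoop
$ export SPARK_HOME=/opt/spark
$ export CASSANDRA_HOME=/opt/cassandra
$ export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin:$CASSANDRA_HOME/bin
$ pip install numpy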
Download the git repository on your computer.
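For example (the URL below is a placeholder for the actual repository location):
$ git clone <repository-url>
$ cd <repository-folder>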
Make sure you have installed the required dependencies listed above. Once they are in place, move into the PPCASv2 folder and run:
$ make
The shared library will be automatically generated.
Then move into the project root folder and run:
$ cd src
$ make
The binary will be automatically generated in that folder.
The src folder includes a run script which executes PPCAS and BDT-Coffee with the required and optional parameters described below.
Required parameters:
$ -i [sequence_file]
$ -m [master_spark_ip]
$ -s [seed_cassandra]
Optional parameters:
$ -c (chunk, default: no chunk)
$ -p (number of partitions, default: number of sequences)
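Putting these together, a typical invocation has the following form (square brackets indicate values you supply; the last two flags are optional):
$ ./run.sh -i [sequence_file] -m [master_spark_ip] -s [seed_cassandra] -c [chunk] -p [partitions]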
The examples folder contains input sequences:
BB11001.tfa: a small dataset from BAliBASE.
rrm_*: sequences from the HomFam dataset, where * is the number of sequences.
Calculate the alignment for the rrm family with 100 sequences (optimized with chunk=10 subgrouping):
$ ./run.sh -i ../examples/rrm_100 -m 192.168.101.51 -s 192.168.101.51 -c 10
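Similarly, a smaller test with the BAliBASE example and an explicit number of partitions could look like this (the IP addresses are placeholders for your own Spark master and Cassandra seed, and the partition count is only an illustrative value):
$ ./run.sh -i ../examples/BB11001.tfa -m <master_spark_ip> -s <seed_cassandra> -p 8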