Scalable Consistency in T-Coffee through Apache Spark and Cassandra database - JCB2018

BDT-Coffee

BDT-Coffee integrates consistency information into T-Coffee (TC) through a Cassandra database. This information is generated beforehand by PPCAS, which uses the MapReduce processing paradigm. The goal is to process large datasets while improving the performance and scalability of the original algorithm.

Prerequisites

Compiling T-Coffee and PPCAS requires make and gcc-c++ on your system. In addition, T-Coffee needs cpp-driver, boost, sparse-map and jemalloc, a custom memory allocator that allows allocations and deallocations to be done in any order. The makefile expects these libraries under $(HOME).

Execution requires a Hadoop, Spark and Cassandra infrastructure, with the corresponding environment variables and paths set correctly. A Python installation with NumPy is also needed.
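Before compiling or running, it can help to confirm that the prerequisites are visible from the current shell. The following is a minimal sketch, not part of the repository. The variable names HADOOP_HOME, SPARK_HOME and CASSANDRA_HOME and the command names g++ and python are assumptions, so substitute whatever your installation actually uses.

```shell
#!/bin/sh
# Sketch: report which prerequisites look missing from the current shell.
# HADOOP_HOME, SPARK_HOME and CASSANDRA_HOME are assumed names, not taken
# from the BDT-Coffee sources; adjust them to your installation.

# Print a status line; return 0 if the named variable is set and non-empty.
check_var() {
    eval "value=\${$1:-}"
    if [ -n "$value" ]; then
        echo "ok: $1=$value"
    else
        echo "missing: $1 is not set"
        return 1
    fi
}

# Print a status line; return 0 if the named command is found on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 found"
    else
        echo "missing: $1 not on PATH"
        return 1
    fi
}

for v in HADOOP_HOME SPARK_HOME CASSANDRA_HOME; do
    check_var "$v" || true
done

for c in make g++ python; do
    check_cmd "$c" || true
done

# NumPy must be importable by the Python interpreter that will be used.
if python -c 'import numpy' >/dev/null 2>&1; then
    echo "ok: numpy importable"
else
    echo "missing: numpy (or python) not available"
fi
```

The script only reports and never aborts, so every missing item shows up in a single run.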

Compile

Clone the git repository to your computer.

Make sure the dependencies listed above are installed. Then change to the PPCASv2 folder and run the following command:

$ make

The shared library will be automatically generated.

Next, return to the project root folder and run the following commands:

$ cd src
$ make

The binary will be automatically generated in the src folder.

Usage

The src folder includes a script named run.sh, which executes PPCAS and BDT-Coffee with the required and optional parameters listed here.

Required parameters:

$ -i [sequence_file]
$ -m [master_spark_ip]
$ -s [seed_cassandra]

Optional parameters:

$ -c (chunk, default: no chunk)
$ -p (number of partitions, default: number of sequences)
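The parameters above can be checked before launching a job. The sketch below is a hypothetical wrapper (the function name build_run_cmd is illustrative and not part of BDT-Coffee). It assumes that -c and -p each take a value, as -c does in the README example. It only validates the arguments and prints the run.sh command line; it does not start Spark or contact Cassandra.

```shell
#!/bin/sh
# Sketch of a hypothetical dry-run helper: validate the documented flags,
# then print the run.sh command that would be executed.
# build_run_cmd is an illustrative name, not part of BDT-Coffee.

build_run_cmd() {
    input=""; master=""; seed=""; chunk=""; parts=""
    OPTIND=1
    while getopts "i:m:s:c:p:" opt; do
        case "$opt" in
            i) input="$OPTARG" ;;
            m) master="$OPTARG" ;;
            s) seed="$OPTARG" ;;
            c) chunk="$OPTARG" ;;
            p) parts="$OPTARG" ;;
            *) return 2 ;;
        esac
    done

    # -i, -m and -s are the required parameters.
    if [ -z "$input" ] || [ -z "$master" ] || [ -z "$seed" ]; then
        echo "usage: -i sequence_file -m master_spark_ip -s seed_cassandra [-c chunk] [-p partitions]" >&2
        return 1
    fi

    cmd="./run.sh -i $input -m $master -s $seed"
    # Optional flags are forwarded only when given, so run.sh keeps its defaults.
    if [ -n "$chunk" ]; then cmd="$cmd -c $chunk"; fi
    if [ -n "$parts" ]; then cmd="$cmd -p $parts"; fi
    echo "$cmd"
}

# Example with an illustrative partition count of 4.
build_run_cmd -i ../examples/BB11001.tfa -m 192.168.101.51 -s 192.168.101.51 -p 4
# prints: ./run.sh -i ../examples/BB11001.tfa -m 192.168.101.51 -s 192.168.101.51 -p 4
```

Omitting any of -i, -m or -s makes the helper print a usage line to standard error and return a non-zero status.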

Example

The examples folder contains input sequences:

BB11001.tfa: a small dataset from BAliBASE.

rrm_*: sequences obtained from the HomFam dataset, where * is the number of sequences in the file.

Calculate the alignment for the rrm family with 100 sequences, optimized by subgrouping with a chunk of 10 (-c 10):

$ ./run.sh -i ../examples/rrm_100 -m 192.168.101.51 -s 192.168.101.51 -c 10
