FleXgeo is a software package for protein conformational ensemble analysis based on a differential geometry representation of protein backbones. The package consists of a binary of the core program, which calculates the differential geometry descriptors, and a set of Python scripts for analyzing the results (more details below). FleXgeo is still a prototype and more features will be added in the future. Currently, there are ready-to-use scripts to:
- cluster protein conformations, via a global clustering solution per residue based on its curvature and torsion distribution.
- quantify protein residue flexibility, via the computation of dMax.
- compare protein conformations to a reference structure, via the computation of Euclidean distances in the curvature and torsion space.
FleXgeo code was written by PhD. Antonio Marinho da Silva Neto and PhD. Rinaldo Wander Montalvao.
- 1 . Download FleXgeo from GitHub
$ git clone https://github.com/AMarinhoSN/FleXgeo.git
- 2 . Install requirements
You will need Python 3 to run the provided scripts, and you can install the required Python libraries using pip3. To install the libraries directly into your default Python environment:
$ pip3 install cython
$ pip3 install -r /path/to/FleXgeo/requirements.txt
Or you can create a virtual environment for FleXgeo:
$ pip3 install virtualenv
$ cd virtual_envs_location
$ virtualenv flexgeo_env
$ source flexgeo_env/bin/activate
$ cd FLEXGEO_LOCATION
$ pip3 install cython
$ pip3 install -r requirements.txt
You will also need a Lua interpreter to compute dMax. On Linux (Debian/Ubuntu), you can install it with:
$ sudo apt-get install luajit
FleXgeo is applied in two stages: the calculation of the differential geometry descriptors and the analysis of the results. A website with tutorials will be provided in the future, but for now check the quick and dirty guide:
- 1 . Calculate Differential Geometry
$ /path/to/FleXgeo/bin/FleXgeo_[MY_OS] -pdb=ensemble.pdb [options]
FleXgeo accepts the following arguments:
OPTIONS | DESCRIPTION | DEFAULT |
---|---|---|
-pdb=[filename.pdb] | Set input .pdb filename | User must provide |
-ncpus=[int] | Set the number of CPUs FleXgeo will use | All CPUs available |
-isSingle | Indicate that the input pdb is a single-conformation pdb | FALSE |
-outprfx=[prefix] | Set the output files prefix to be used | 'Diffgeo_' |
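If you prefer to launch FleXgeo from a Python script, the options above can be assembled into an argument list. The sketch below is only an illustration: `build_flexgeo_command` is a hypothetical helper, not part of FleXgeo, and it uses only the flags listed in the table.

```python
def build_flexgeo_command(binary, pdb, ncpus=None, is_single=False, outprfx=None):
    """Assemble the argument list for one FleXgeo run (hypothetical helper)."""
    cmd = [binary, f"-pdb={pdb}"]          # -pdb is the only mandatory option
    if ncpus is not None:
        cmd.append(f"-ncpus={ncpus}")      # omitted: FleXgeo uses all CPUs available
    if is_single:
        cmd.append("-isSingle")            # flag without a value
    if outprfx is not None:
        cmd.append(f"-outprfx={outprfx}")  # omitted: default prefix is used
    return cmd


cmd = build_flexgeo_command(
    "/path/to/FleXgeo/bin/FleXgeo_[MY_OS]", "ensemble.pdb", ncpus=4
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the placeholder path points to the real binary for your OS.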
- 2 . Analyses
- Plot xgeo data.
$ python3 /path/to/PlotFXgeoData.py DiffGeo_xgeo.csv
usage: PlotFXgeoData.py [-h] in_xgeo
This scripts generate plots for FleXgeo results.
positional arguments:
  in_xgeo     'xgeo.csv' FleXgeo data file.
optional arguments:
  -h, --help  show this help message and exit
This script will output a set of plots as '.png' files and an 'XgeoObj.p' pickle file, which can be used to load the data in other scripts. The plots generated are:
- 2D line plots of the curvature and torsion values per residue for each conformation
- Violin plots of the distribution of curvature and torsion values observed per residue
- Calculate distance between all conformations and a reference conformation in the ensemble.
$ python3.5 /path/to/FleXgeo/CalcEnsDistFromRef.py -in=XgeoObj.p
This script will output a distance matrix plot and a '.csv' file with the computed distances.
- Calculate distance between all conformations and an external reference conformation
$ python3.5 /path/to/FleXgeo/CalcEnsDistFromRef.py -in=XgeoObj.p -ext_ref=/path/to/ref_xgeo.csv
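To make the idea of a distance in curvature and torsion space concrete, here is a minimal sketch. It assumes each conformation is available as a list of (curvature, torsion) pairs, one per residue, and that the distance is taken residue by residue. The data layout and the function name are illustrative assumptions and do not reproduce the internals of CalcEnsDistFromRef.py.

```python
import math


def distances_to_reference(ensemble, reference):
    """Per-residue Euclidean distance in (curvature, torsion) space.

    ensemble:  list of conformations, each a list of (curvature, torsion) pairs
    reference: one conformation in the same layout (assumed layout)
    Returns one row per conformation and one column per residue.
    """
    matrix = []
    for conf in ensemble:
        if len(conf) != len(reference):
            raise ValueError("conformation and reference differ in residue count")
        row = [
            math.hypot(k - k_ref, t - t_ref)
            for (k, t), (k_ref, t_ref) in zip(conf, reference)
        ]
        matrix.append(row)
    return matrix


# Toy ensemble: two conformations, two residues (made-up values).
ensemble = [
    [(0.30, 0.10), (0.50, -0.20)],
    [(0.60, 0.50), (0.50, -0.20)],
]
dist = distances_to_reference(ensemble, reference=ensemble[0])
for row in dist:
    print([round(d, 3) for d in row])
```

An external reference works the same way: pass its list of (curvature, torsion) pairs as `reference` instead of a member of the ensemble.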
- Calculate the maximum Euclidean distance observed per residue (dMax)
- Eliminate extreme-bin outliers using a percentage threshold [default = 1% of the total conformations in the ensemble]
$ luajit /path/to/FleXgeo/HistProc.lua
NOTE: This Lua script uses the "Diffgeo_stat.lua" file generated by FleXgeo as input and outputs a "DiffGeoSpec.ssv" file. This ".ssv" file contains the length of the interval to be considered in each dimension for dMax.
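The outlier-trimming step can be pictured with the following sketch. It is written in Python for illustration only and is not a translation of HistProc.lua. The trimming rule (drop bins at either extreme while they hold less than the threshold fraction of all conformations) and the final combination of the two interval lengths into a diagonal are assumptions made here, and all histogram values are made up.

```python
import math


def trimmed_interval_length(bin_edges, counts, threshold=0.01):
    """Length of the value range left after dropping sparse bins at both ends.

    bin_edges has len(counts) + 1 entries. A bin at either extreme is dropped
    while its count is below threshold * total count (assumed rule).
    """
    min_count = threshold * sum(counts)
    lo, hi = 0, len(counts) - 1
    while lo <= hi and counts[lo] < min_count:
        lo += 1
    while hi >= lo and counts[hi] < min_count:
        hi -= 1
    if lo > hi:  # nothing survived the trimming
        return 0.0
    return bin_edges[hi + 1] - bin_edges[lo]


# Made-up histograms of one residue over 500 conformations.
curv_len = trimmed_interval_length(
    [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], [1, 120, 300, 78, 1]
)
tors_len = trimmed_interval_length(
    [-1.0, -0.5, 0.0, 0.5, 1.0], [2, 200, 296, 2]
)

# Assumption: combine the two interval lengths as a diagonal in the
# curvature-torsion plane to get a single flexibility value.
d_max = math.hypot(curv_len, tors_len)
print(round(curv_len, 3), round(tors_len, 3), round(d_max, 3))
```

With the 1% threshold, the single-count bins at the edges are discarded, so isolated outlier conformations do not inflate the interval.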
- Compute dMax for each residue
$ python3.5 /path/to/FleXgeo/ComputeResDMax.py
This script uses "DiffGeoSpec.ssv" as input to compute dMax and outputs:
1) "maxdt.csv": the dMax values per residue
2) "aa_list.csv": the most flexible residues detected (more details below)
3) "z-score.csv": the dMax z-scores used for the detection of the most flexible residues
In addition to computing dMax, this script also runs a Haar wavelet transform on the z-scores of the residues' dMax in order to automatically identify the most flexible residues.
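For readers unfamiliar with the two ingredients named above, the sketch below computes z-scores of a dMax profile and applies one level of an orthonormal Haar transform. It only illustrates the building blocks: the function names and dMax values are made up, and the rule ComputeResDMax.py uses to pick the most flexible residues from the transform is not reproduced here.

```python
import math
import statistics


def z_scores(values):
    """Standardize values using the population mean and standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]  # local averages
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]  # local changes
    return approx, detail


# Made-up dMax profile with a flexible stretch at residues 5 and 6.
d_max = [0.2, 0.2, 0.2, 0.2, 1.0, 1.0, 0.2, 0.2]
z = z_scores(d_max)
approx, detail = haar_step(z)
print([round(a, 3) for a in approx])
print([round(d, 3) for d in detail])
```

In this toy profile the third approximation coefficient stands out, which is the kind of localized signal a wavelet-based detection can pick up.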
- Clustering conformations
- Run HDBSCAN
$ python3.5 /path/to/FleXgeo/GetResClusters.py DiffGeo_xgeo.csv
USAGE: GetResClusters.py [-h] [-res RES] [-out_path OUT_PATH] [-min_pcluster MIN_PCLUSTER] in_csv
positional arguments:
  in_csv                      csv file with FleXgeo data
optional arguments:
  -h, --help                  show this help message and exit
  -res RES                    specify residue to be clustered. (default: 'ALL')
  -out_path OUT_PATH          specify dir to write output files. (default: working dir)
  -min_pcluster MIN_PCLUSTER  set the minimum conformations percentage a cluster must have (default: .05)
This script will output:
1. 2D scatter plots of the curvature and torsion values, colored according to the clustering solution
2. a '.clst' file, which contains the conformation indexes of each cluster found
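The sketch below shows how a percentage threshold such as `-min_pcluster` could translate into a minimum cluster size, and how a list of cluster labels maps to conformation indexes per cluster. It assumes the threshold is a fraction of the ensemble size, that labels follow the HDBSCAN convention of marking noise with -1, and that a cluster needs at least two members. Both helper names are hypothetical, and the actual '.clst' file layout is not reproduced.

```python
import math
from collections import defaultdict


def min_cluster_size(n_conformations, min_pcluster=0.05):
    """Smallest cluster accepted, as a count of conformations (assumed rule)."""
    return max(2, math.ceil(min_pcluster * n_conformations))


def group_by_label(labels):
    """Map each cluster label to the indexes of its conformations.

    labels[i] is the cluster of conformation i; -1 marks noise and is skipped.
    """
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        if label >= 0:
            clusters[label].append(index)
    return dict(clusters)


# Made-up labels for one residue over eight conformations.
labels = [0, 0, 1, -1, 1, 0, 1, -1]
clusters = group_by_label(labels)
print(min_cluster_size(200))
print(clusters)
```

The labels themselves would come from running a clustering algorithm on the (curvature, torsion) points of one residue. This sketch starts after that step.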
- Write cluster pdbs
$ python3.5 /path/to/FleXgeo/WriteClustersPDB.py cluster.clstr source.pdb res
USAGE: WriteClustersPDB.py [-h] [-out_path OUT_PATH] in_clstr src_pdb res
Write pdb files of cluster from '.clst'.
positional arguments:
  in_clstr            clstr file.
  src_pdb             source pdb file.
  res                 specify residue to write clusters pdbs.
optional arguments:
  -h, --help          show this help message and exit
  -out_path OUT_PATH  specify dir to write output files. (default: working dir)
- Working with normalized values
In our experience, working with FleXgeo raw data usually leads to the same conclusions as using normalized values. However, it is possible that different analysis scenarios require different normalization procedures. If you need normalized FleXgeo data for your analyses, you can use "Diffgeo_NORM.csv" (values rescaled to [0,1]) instead of "Diffgeo_xgeo.csv" as input for the scripts. If you need values normalized while keeping the same mean observed in the original dataset, you can compute them using:
$ python3.5 /path/to/FleXgeo/NormByMean.py Diffgeo_xgeo.csv
usage: NormByMean.py [-h] [-out_dir OUT_DIR] [-out_sfx OUT_SFX] in_csv
Generate normalized values of FleXgeo descriptors by keeping the same mean values observed in the original data.
positional arguments:
  in_csv            csv file with FleXgeo data
optional arguments:
  -h, --help        show this help message and exit
  -out_dir OUT_DIR  specify dir to write output files. (default: working dir)
  -out_sfx OUT_SFX  Suffix of csv output (default: DiffGeo_NORM_mean)
The NormByMean.py script generates a new ".csv" file (default output name: DiffGeo_NORM_mean.csv) with the normalized values, which you can use instead of Diffgeo_xgeo.csv.
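As a reminder of what rescaling to [0,1] means, here is a minimal min-max sketch. Whether FleXgeo rescales each descriptor globally or per residue is not stated here, so the function simply rescales whatever list of values it is given. The function name and values are made up, and the mean-preserving procedure of NormByMean.py is not reproduced.

```python
def rescale_unit_interval(values):
    """Min-max rescaling: the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up curvature values.
curvature = [2.0, 4.0, 6.0]
normalized = rescale_unit_interval(curvature)
print(normalized)
```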
FleXgeo outputs five .csv files and one .lua file:
1. **Diffgeo_xgeo.csv**: The calculated differential geometry descriptors of the input ensemble.
2. Diffgeo_NORM.csv: Same values as "Diffgeo_xgeo.csv", but normalized to [0,1].
3. Diffgeo_MEAN.csv: Mean values of FleXgeo descriptors per residue.
4. Diffgeo_STD.csv: Standard deviation of FleXgeo descriptors per residue.
5. Diffgeo_VAR.csv: Variance of FleXgeo descriptors per residue.
6. DiffgeoStat.lua: The histogram bin positions and values of each descriptor for each residue.
Marinho da Silva Neto, Antonio, et al. ‘A Superposition Free Method for Protein Conformational Ensemble Analyses and Local Clustering Based on a Differential Geometry Representation of Backbone’. Proteins: Structure, Function, and Bioinformatics, Dec. 2018. Crossref, doi:10.1002/prot.25652.
Unfortunately, we use Numerical Recipes in our code and we are not allowed to distribute the source code. We plan to rewrite those parts of the code in the future, but this is not among our top priorities right now. If you have trouble running a binary file on your machine, feel free to contact Antonio at [email protected] and we can try to provide a specific binary for you.