forked from kr-colab/FILET
-
Notifications
You must be signed in to change notification settings - Fork 0
Software for detecting introgression using supervised machine learning
License
dschride/FILET
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
This directory contains all of the code necessary to run FILET (Finding Introgressed Loci using Extra-Trees). Briefly, this tool uses an Extra-Trees classifier (Geurts et al. 2005: http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf) to classify genomic windows from a as one of three different modes of evolution based on population genomic data from a phased two-species sample: 1) Introgression from species 1 to species 2 2) Introgression from species 2 to species 1 3) No introgression The manuscript describing FILET will be posted on biorxiv shortly, at which point I will add the link to this README. Meanwhile, you can check out this talk for a brief overview of the method and results from an application to Drosophila simulans and D. sechellia: https://www.youtube.com/watch?v=t6zy8ZsRKt4 In order to run FILET, the user will need several python packages, including scikit-learn (http://scikit-learn.org/stable/), scipy (http://www.scipy.org/), and numpy (http://www.numpy.org/). The simplest way to get all of this up and running is to install Anaconda (https://store.continuum.io/cshop/anaconda/), which contains these and many more python packages and is very easy to install. You will also need GCC (https://gcc.gnu.org/) and GSL (https://www.gnu.org/software/gsl/). The FILET pipeline should then work on any Linux system (and probably OS X too, though this has not yet been tested). Once you have cloned this repository into the desired directory, cd into this directory and then run: make all This will compile the C tools included in this pipeline, and you should then be ready to rock. However, because FILET requires training prior to performing classifications, running the software is a multi-step process. Below, we outline each step, and in examplePipeline.sh we describe the whole process in greater detail, with a complete example run. Here are the steps of the FILET pipeline: - Simulate training data: Statistics summarizing patterns of variation within each simulated genomic window must be calculated from training data. - Calculate feature vectors from these training data, and combine into a matrix of labeled training examples. - Train the FILET classifier. - Calculate summary statistics from genomic windows we wish to classify. - Classify genomic data (with known or inferred haplotypic phase). Not all of these steps are mandatory: if you have generated your own feature vectors from training data, then provided these data are in the proper format (see example in featureVectorsToClassify/) you can proceed to the training step. Similarly, if you have calculated summary statistics from genomic data, then once a classifier has been trained you can skip straight to the data classification step as well (again, provided these statistics are listed in the proper format; see example in dataToClassify/). Note that this pipeline also assumes that you have already simulated training data (with ms-style output), which can be done using msmove (https://github.com/geneva/msmove) or the simulator of your choice. The requirements for simulated data and their file names are described in the comments above Step 4 of examplePipeline.sh. Note: the full results from our scan in D. simulans and D. sechellia will be added to a directory called simSechResults later today. (I.e. once my dev server comes back up...)
About
Software for detecting introgression using supervised machine learning
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- C 81.4%
- Scheme 10.6%
- Shell 3.6%
- Python 3.5%
- Other 0.9%