
IMPC statistical pipeline documentation

Contents

  • IMPC statistical pipeline documentation
  • IMPC statistical pipeline
  • Preprocessing the raw data
  • Packaging the raw data for parallelisation
  • Executing the IMPC statistical pipeline
  • Preparation before running the stats pipeline
  • Executing the pipeline
  • Postprocessing the IMPC-SP
  • Frequently asked questions

IMPC statistical pipeline

Working with the IMPC data is an exciting experience for any data scientist. However, the high-throughput nature of the IMPC pipelines produces so many data points that running the statistical analysis methods becomes a substantial computational task. In this manual, we describe the step-by-step execution of the IMPC statistical pipeline. To follow this manual, the following software must be installed on your machine:

  1. Unix/Linux operating system
  2. Unix/Linux terminal
  3. IBM LSF platform https://en.wikipedia.org/wiki/Platform_LSF
  4. R software https://cran.r-project.org/

Preprocessing the raw data

The input data to the IMPC statistical pipeline (IMPC-SP) is in the form of comma-separated values (CSV), tab-separated values (TSV), Rdata (see the R data.frame format) or Parquet files. The latter must be flat (no nested structure is allowed in the Parquet files). The CSV or TSV files can reside on a remote server, but Parquet files must be available locally on disk. The entire IMPC-SP requires 300GB to 1.5TB of disk space, depending on the number of analyses included in the StatPackets. This document assumes an LSF cluster as the computing driver for the stats pipeline; however, the IMPC-SP can also be run on a single-core machine, which potentially takes a significant amount of time (an estimated 1.5 months).

The diagram below shows the optimal steps to run the data preparation pipeline as fast as possible.

(Diagram: optimal steps for the data preparation pipeline)

The whole IMPC-SP requires R with the packages and dependencies listed below:

  1. DRrequiredAgeing (available from GitHub)
  2. OpenStats
  3. SmoothWin
  4. base64enc
  5. RJSONIO
  6. jsonlite
  7. DBI
  8. foreach
  9. doParallel
  10. parallel
  11. nlme
  12. plyr
  13. rlist
  14. pingr
  15. robustbase
  16. abind
  17. stringi
  18. RPostgreSQL
  19. data.table
  20. Tmisc
  21. devtools
  22. miniparquet


The driver packages are DRrequiredAgeing, OpenStats and SmoothWin; they need to be updated every time the stats pipeline runs, to make sure the latest version of each package is used in the analysis pipeline.
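
For a first-time setup, the dependencies can be installed before running the update script below. The following is a minimal sketch only: it assumes the listed packages come from CRAN (OpenStats from Bioconductor) and uses the GitHub repository layout linked in the FAQ section; the UpdatePackagesFromGithub.R route described next remains the recommended way to keep the driver packages current.

    # Minimal installation sketch (assumed package sources; adjust to your environment)
    install.packages(c(
      "SmoothWin", "base64enc", "RJSONIO", "jsonlite", "DBI", "foreach",
      "doParallel", "nlme", "plyr", "rlist", "pingr", "robustbase", "abind",
      "stringi", "RPostgreSQL", "data.table", "Tmisc", "devtools", "miniparquet"
    ))
    # OpenStats is assumed here to be installed from Bioconductor
    install.packages("BiocManager")
    BiocManager::install("OpenStats")
    # DRrequiredAgeing is installed from GitHub; the subdir follows the repository
    # layout linked in the FAQ section of this page
    devtools::install_github(
      "mpi2/impc_stats_pipeline",
      subdir = "Late adults stats pipeline/DRrequiredAgeing/DRrequiredAgeingPackage"
    )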

  • One can update the driver packages by running the commands below from the terminal:
  1. R -e "file.copy(file.path(DRrequiredAgeing:::local(), 'StatsPipeline/jobs/UpdatePackagesFromGithub.R') , to = file.path(getwd(), 'UpdatePackagesFromGithub.R'))"
  2. Rscript UpdatePackagesFromGithub.R

Having the packages updated, the first step is to read the input files. CSV, TSV and Rdata files can be read directly by the pipeline (skip ahead to _Packaging the raw data for parallelisation_). Parquet files require an extra step to be converted into R data frames; to this end, the parquet files need to be available locally on disk. The conversion is divided into four steps, two for creating jobs and two for executing them:

  1. Read the parquet files and create a list of jobs for the LSF cluster to process the data.
  2. Process the data and create scattered Rdata files.
  3. Create a set of jobs to merge the scattered Rdata files into one single file per IMPC procedure.
  4. Run the merging step.

The scripts for the 4 steps above are available from the R package DRrequiredAgeing.

Copy the contents of the scripts directory into a path on your machine.

  • Path to the scripts: run the following command in the terminal to print the full path
    • R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/0-ETL')"
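
A minimal shell sketch of the copy step, assuming a Unix shell and that DRrequiredAgeing is already installed:

    # Capture the scripts directory printed by R, then copy its contents into the working directory
    ETL_DIR=$(Rscript -e "cat(file.path(DRrequiredAgeing:::local(),'StatsPipeline/0-ETL'))")
    cp -r "$ETL_DIR"/. .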

There are 4 scripts in the directory that you just copied. Run the commands below to get the data frames ready; a condensed sketch of the whole sequence follows this list:

  1. Rscript Step1MakePar2RdataJobs.R "FULL PATH TO THE PARQUET FILES" + trailing /
  2. chmod 775 jobs_step2_Parquet2Rdata.bch
  • ./jobs_step2_Parquet2Rdata.bch
    1. This should take 10+ minutes on the LSF cluster, depending on the available resources
    2. Do not go to step 3 before this step has finished
    3. The output is a directory named _ProcedureScatterRdata_ filled with many small Rdata files
  3. Rscript Step3MergeRdataFilesJobs.R "FULL PATH TO THE ProcedureScatterRdata DIRECTORY" + trailing /
  4. chmod 775 jobs_step4_MergeRdatas.bch
  • ./jobs_step4_MergeRdatas.bch
    1. This should take 1+ hour, depending on the available resources on the LSF cluster
    2. The output of this step is a directory named _Rdata_ filled with per-procedure data files.
  5. Each step above produces log files. If no error is found in the log files, you can safely remove the _ProcedureScatterRdata_ directory by running
  • rm -rf ProcedureScatterRdata
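
As referenced above, here is a condensed sketch of the whole sequence, assuming the Parquet files live under /data/parquet/ (a placeholder path) and that each batch of cluster jobs is allowed to finish before the next step starts:

    # Step 1: create the LSF jobs that convert Parquet files to scattered Rdata files
    Rscript Step1MakePar2RdataJobs.R /data/parquet/
    # Step 2: submit the conversion jobs and wait for them to finish
    chmod 775 jobs_step2_Parquet2Rdata.bch
    ./jobs_step2_Parquet2Rdata.bch
    # Step 3: create the jobs that merge the scattered Rdata files, one file per IMPC procedure
    Rscript Step3MergeRdataFilesJobs.R "$(pwd)"/ProcedureScatterRdata/
    # Step 4: submit the merging jobs
    chmod 775 jobs_step4_MergeRdatas.bch
    ./jobs_step4_MergeRdatas.bch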

Packaging the raw data for parallelisation

The previous step leaves us with bulky data files, which are inefficient to parallelise on an LSF cluster. In the next step, we break the raw data into small packages that can be processed independently in parallel. This step is fully automatic and only requires an initialisation process. The output of this step is a set of LSF job files, XXXX.bch, that need to be concatenated into a single file or can be used individually for each IMPC procedure. The script for this step is available from the path printed by the command below:

  • R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs')"

The script is named _InputDataGenerator.R_. You can customize the output .bch files for the amount of memory required for each LSF job by tweaking the memory/CPU/etc. parameters in this script.

To run _InputDataGenerator.R_, follow the steps below:

  1. Create a list of LSF jobs for each raw data file from the previous section by running the command below in your terminal:
  • R -e "DRrequiredAgeing:::jobCreator('FULL PATH TO THE Rdata DIRECTORY OR RAW DATA FILES')"
    1. This command creates a job file, DataGenerationJobList.bch, and an empty directory, DataGeneratingLog, that stores the log files.
    2. This command is similar to running ls on the directory; it creates an LSF job for each input entry (data file)
  • Run the output script by:
    1. chmod 775 DataGenerationJobList.bch
    2. ./DataGenerationJobList.bch
    3. This normally takes 18+ hours, depending on the available resources on the LSF cluster.

You can check the log files in the DataGeneratingLog directory for any error; if there is no error shown in the logs, the preprocessing step is marked as successful. Here is the command to search the files for errors:

  • grep "exit" * -lR

  • As the log directory can grow quickly, we suggest compressing the whole directory to save space on the disk. You can run the command below to compress and remove the log file directory

    • zip -rm DataGeneratingLog.zip DataGeneratingLog/
  • One important note for this step is to adjust the LSF job configurations, such as the memory limit (in _InputDataGenerator.R_). Slightly overestimating the memory required for the LSF jobs prevents unwanted termination of the jobs; an illustrative submission line is sketched below.
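
For illustration, each line of the generated .bch files is an LSF submission, and the resource-related flags are the ones worth tuning. The line below is a hypothetical example only; the actual command, script name and flag values are written by _InputDataGenerator.R_ and must match your cluster configuration:

    # Hypothetical LSF submission: ~8 GB memory limit and reservation, per-job log files
    bsub -M 8000 -R "rusage[mem=8000]" \
         -o DataGeneratingLog/job_%J.out -e DataGeneratingLog/job_%J.err \
         Rscript InputDataGenerator.R "FULL PATH TO ONE INPUT Rdata FILE"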

Executing the IMPC statistical pipeline

The output of the previous steps is a set of directories for individual IMPC procedures, each containing an XXX.bch file. The next step is to append these XXX.bch files into a single file that we call AllJobs.bch. You can use tools such as find to locate the XXX.bch files and cat to append them. An example merge command is shown below, followed by a find-based variant:

  • cat *.bch >> AllJobs.bch
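
If the XXX.bch files are spread across per-procedure subdirectories, a find-based sketch such as the one below collects them in one pass (assuming all job files end in .bch and AllJobs.bch does not yet exist):

    # Concatenate every per-procedure .bch file under the current directory into AllJobs.bch
    find . -mindepth 2 -type f -name "*.bch" -exec cat {} + > AllJobs.bch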

Preparation before running the stats pipeline

Some preparation steps, listed below, are recommended before running the stats pipeline:

  • Updating the list of levels (for categorical data) from IMPReSS. To this end run the command below in your terminal:
    • R -e "DRrequiredAgeing:::updateImpress(updateImpressFileInThePackage = TRUE,updateOptionalParametersList = TRUE,updateTheSkipList = TRUE)"
      • This command updates the _required for analysis_ parameters as well as adding the metadata parameters to the skip list of the statistical pipeline. The skip list is available in the DRrequiredAgeing package directory. Run the command below to retrieve the full path:
        • R -e "DRrequiredAgeing:::local()"

Executing the pipeline

The IMPC-SP requires a driver script, function.R, written in R to perform the analysis on the data. The script is available from

  • R -e "file.path(DRrequiredAgeing:::local(),'StatsPipeline/jobs')"

Put the function.R script and the AllJobs.bch in the same directory and execute AllJobs.bch to start the IMPC-SP; a minimal launch sketch is shown below. The following notes help in better understanding the IMPC-SP.
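
A minimal launch sketch, mirroring the pattern used for the earlier .bch files:

    # Make the combined job file executable and submit all statistical analysis jobs
    chmod 775 AllJobs.bch
    ./AllJobs.bch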

  • You can set some parameters in function.R, such as activating soft windowing. Here is the typical function.R and the parameters within (a compact version of the full call is sketched after this list):

    • mainAgeing(
    • file = suppressWarnings(tail(UnzipAndfilePath(file), 1)),
      • The input file (csv,tsv, Rdata)
    • subdir = 'Results_DR12V1OpenStats',
      • Name of the output directory
    • concurrentControlSelect = FALSE,
      • Whether concurrent control selection is applied to the controls
    • seed = 123456,
      • Random number generator seed (used for windowing only)

    • messages = FALSE,
      • Write error messages from the softwindowing pipeline to the output file
    • utdelim = '\t',
      • The StatPacket delimiter
    • Windowing parameters:

    • activeWindowing = FALSE,
      • Activate SoftWindowing
    • check = 2,
      • The check type in SoftWindowing. See check argument in the SmoothWin package
    • storeplot = FALSE,
      • Store softwindowing plots in a file that accompanies the statpacket
    • plotWindowing = FALSE,
      • Set to TRUE to plot the SoftWindowing output
    • debug = FALSE,
      • Show more details of the process
    • MMOptimise = c(1,1,1,1,1,1),
      • See MM_Optimise in the OpenStats package
    • FERRrep = 1500,
      • Total iterations in the Fisher's exact test framework (Monte Carlo iterations)
    • activateMulticore = FALSE,
      • Activate multi core processing
    • coreRatio = 1,
      • The core proportion (1=100% cores)
    • MultiCoreErrorHandling = 'stop',
      • Error handling for the multi-core processing; with 'stop', the whole process fails if it encounters an error
    • inorder = FALSE,
      • See inorder in foreach package
    • verbose = TRUE,
      • see verbose in foreach package
    • OverwriteExistingFiles = FALSE,
      • Overwrite the StatPacket if it already exists
    • storeRawData = TRUE,
      • Store the raw data with the statpackets
    • outlierDetection = FALSE,
      • Activate outlier detection strategy
    • compressRawData = TRUE,
      • zip the output raw data
    • writeOutputToDB = FALSE,
      • Write the StatPackets to an SQLite database in a db directory inside the results directory. Note that there can be up to 10k individual SQLite databases in the db directory; an extra step is required to merge all of them.
    • onlyFillNotExisitingResults = FALSE
      • Only run the statistical analyses if the StatPacket does not exist
    • )
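
As referenced above, the same call is shown compactly below. This is a sketch for orientation only: the library call and the way the file argument is obtained are assumptions, so consult the shipped function.R for the authoritative version.

    # Sketch of the core call in function.R, using the parameter values documented above
    library(DRrequiredAgeing)
    mainAgeing(
      file = suppressWarnings(tail(UnzipAndfilePath(file), 1)),
      subdir = 'Results_DR12V1OpenStats',
      concurrentControlSelect = FALSE,
      seed = 123456,
      messages = FALSE,
      utdelim = '\t',
      activeWindowing = FALSE,
      check = 2,
      storeplot = FALSE,
      plotWindowing = FALSE,
      debug = FALSE,
      MMOptimise = c(1, 1, 1, 1, 1, 1),
      FERRrep = 1500,
      activateMulticore = FALSE,
      coreRatio = 1,
      MultiCoreErrorHandling = 'stop',
      inorder = FALSE,
      verbose = TRUE,
      OverwriteExistingFiles = FALSE,
      storeRawData = TRUE,
      outlierDetection = FALSE,
      compressRawData = TRUE,
      writeOutputToDB = FALSE,
      onlyFillNotExisitingResults = FALSE
    )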
  • It is highly recommended to remove the log files prior to running/re-running the IMPC-SP. To do this, navigate to the AllJobs.bch directory and run the commands below in your terminal:

    • find ./*/*_RawData/ClusterErr/ -name "*ClusterErr" -type f | xargs rm
    • find ./*/*_RawData/ClusterOut/ -name "*ClusterOut" -type f | xargs rm
  • Results: the IMPC-SP writes its output to the directory named by the subdir argument in the function.R script. The StatPackets are located at the deepest level of the following directory structure (a hypothetical example path is shown after the notes below):

    • Centre/procedure_group/parameter_stable_id/colony_id/zygosity/metadata_group
      • Note 1: all special characters in the path above are replaced by an underscore (_)
      • Note 2: depending on the input data, there can be more than one StatPacket in a path
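
For orientation only, a hypothetical StatPacket location might look like the path below; every component is made up, and the real values come from your input data and the subdir argument:

    Results_DR12V1OpenStats/JAX/IMPC_CBC/IMPC_CBC_010_001/JR12345/homozygote/<metadata_group>/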

Postprocessing the IMPC-SP

The IMPC-SP requires some QC checks and random validation to ensure the output results are reliable and there was no failure in the process. Here we list some typical checks of the pipeline outputs:

  • Uniting the log files: log files are the best place to track down any error and/or failure in the process. The issue here is that the log files are scattered across the directories. To address this, we copy all log files to a single directory and run a check there. To copy the log files, navigate to the AllJobs.bch directory and run the commands below:
    • find ./*/*_RawData/ClusterOut/ -name "*ClusterOut" -type f | xargs cp --backup=numbered -t ~/XXXXXX
    • find ./*/*_RawData/ClusterErr/ -name "*ClusterErr" -type f | xargs cp --backup=numbered -t ~/XXXXXX
      • Here XXXXXX is a directory that you have created for the log files
  • Searching for errors in the log file: You can search for any failure in the log files by running the command below:
    • grep "exit" * -lR
      • Any errors found must be investigated manually
  • Random checking of the results: a random spot check of the results is recommended for the IMPC-SP

Frequently asked questions

Here we list some of the frequently asked questions.

  • Where is the IMPC-SP on GitHub?
    • The pipeline code is hosted at https://github.com/mpi2/impc_stats_pipeline
  • Where can I find function.R?
    • This file is located in the extdata directory of the DRrequiredAgeing package; see [here](https://github.com/mpi2/impc_stats_pipeline/tree/master/Late%20adults%20stats%20pipeline/DRrequiredAgeing/DRrequiredAgeingPackage/inst/extdata)
  • How long does the IMPC-SP normally take?
    • This depends on the LSF cluster and the available resources. Taking the EMBL-EBI LSF cluster as a reference, the whole process takes 2 to 4 days.
  • How to ask for help?