Analysis pipeline for the precisionFDA Brain Cancer Predictive Modeling and Biomarker Discovery challenge using msaenet.
It ranked 2nd by predictive performance.
See the video recording and slides of our presentation at the 9th Annual Health Informatics & Data Science Virtual Symposium at Georgetown University.
Team: Nan Xiao, Soner Koc, Kaushik Ghose from Seven Bridges.
This solution combines the following components:
- Feature selection with the multi-step adaptive SCAD-net method (Xiao and Xu, 2015).
- A relaxed version of the "Stability Selection" procedure (Meinshausen and Bühlmann, 2010) to aggregate the features selected by 100 perturbed models and keep only the consistently selected ones.
- Gradient boosting decision tree (GBDT) models for predictive modeling with the selected genomic features and all four clinical features. The tree models include xgboost (Chen and Guestrin, 2016), lightgbm (Ke et al., 2017), catboost (Prokhorenkova et al., 2018), and a two-layer stacking tree model (Wolpert, 1992). After the challenge ended, we created the R package stackgbm to implement this stacking approach.
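The aggregation step can be illustrated with a minimal base R sketch. It is not the code from run.R: the selector below is a hypothetical stand-in (top-k marginal correlation) for the SCAD-net fit, the perturbation is assumed to be random subsampling, and the data, the `frac` value, and the 0.6 threshold are made up for illustration.

```r
# Hypothetical stand-in selector: keeps the k features with the largest
# absolute marginal correlation with y. The actual pipeline uses the
# multi-step adaptive SCAD-net from msaenet at this point.
select_features <- function(x, y, k = 5) {
  score <- abs(drop(cor(x, y)))
  sort(order(score, decreasing = TRUE)[seq_len(k)])
}

# Refit the selector on random subsamples (an assumed perturbation scheme)
# and record the fraction of fits in which each feature was selected.
selection_frequency <- function(x, y, n_models = 100, frac = 0.8, k = 5) {
  counts <- integer(ncol(x))
  for (b in seq_len(n_models)) {
    idx <- sample(nrow(x), size = floor(frac * nrow(x)))
    sel <- select_features(x[idx, , drop = FALSE], y[idx], k = k)
    counts[sel] <- counts[sel] + 1L
  }
  counts / n_models
}

# Simulated data: only the first three features carry signal.
set.seed(42)
n <- 200
p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3] + rnorm(n)

freq <- selection_frequency(x, y)

# Keep features selected in at least 60% of the fits (assumed threshold).
stable <- which(freq >= 0.6)
```

Lowering the threshold keeps more features at the cost of admitting less consistently selected ones, which is the sense in which such a procedure can be "relaxed".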
Most of the R packages this pipeline depends on can be installed from CRAN. Two exceptions:
- lightgbm: install from source. For macOS, it is advised to compile with a Homebrew gcc toolchain instead of the default LLVM toolchain.
- catboost: install the latest compiled binary package from their GitHub releases.
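Before running the pipeline, it may help to check which packages are already available. The snippet below is a small sketch that only covers the packages named in this README; the full dependency list used by run.R may be longer.

```r
# Packages named in this README (assumption: run.R may need more).
required <- c("msaenet", "xgboost", "lightgbm", "catboost")

# requireNamespace() returns TRUE if a package can be loaded,
# without attaching it to the search path.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]

if (length(not_installed) > 0) {
  message("Not installed yet: ", paste(not_installed, collapse = ", "))
} else {
  message("All listed packages are available.")
}
```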
Open run.R and follow the steps. Note that some steps can take a few hours to run even though they are fully parallelized.