Adaptive and automatic gradient tree boosting computations
aGTBoost is a lightning-fast gradient boosting library designed to avoid manual tuning and cross-validation by taking an information-theoretic approach. This makes the algorithm adaptive to the dataset at hand; it is completely automatic, with minimal worries of overfitting. Consequently, the speed-ups relative to state-of-the-art implementations are in the thousands, while the mathematical and technical knowledge required of the user is minimized.
Note: Currently for academic purposes: implementing and testing new innovations w.r.t. information-theoretic choices of GTB complexity. See the research to-do list below.
R: Finally on CRAN! Install the stable version with
install.packages("agtboost")
or install the development version from GitHub
devtools::install_github("Blunde1/agtboost/R-package")
Users experiencing errors after warnings during installation may be helped by running the following command prior to installation:
Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true")
agtboost essentially has two functions: a training function, gbt.train, and a prediction function, predict.
The code below shows how to train an aGTBoost model using a design matrix x and a response vector y; write ?gbt.train in the console for detailed documentation.
library(agtboost)
# -- Load data --
data(caravan.train, package = "agtboost")
data(caravan.test, package = "agtboost")
train <- caravan.train
test <- caravan.test
# -- Model building --
mod <- gbt.train(train$y, train$x, loss_function = "logloss", verbose=10)
# -- Predictions --
prob <- predict(mod, test$x) # Score after logistic transformation: Probabilities
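Since the returned scores are probabilities, held-out metrics can be computed directly. The following is a minimal sketch in base R (no extra packages), assuming test$y is coded 0/1 as in the caravan data:
# -- Test log-loss from the predicted probabilities (sketch, base R only) --
eps <- 1e-15                                        # guard against log(0)
p <- pmin(pmax(prob, eps), 1 - eps)
-mean(test$y * log(p) + (1 - test$y) * log(1 - p))  # average binomial log-loss on the test set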
agtboost also contains functions for model inspection and validation.
- Feature importance: gbt.importance generates a typical feature importance plot. Techniques like inserting noise features are redundant, due to computations w.r.t. approximate generalization (test) loss.
- Convergence: gbt.convergence computes the loss over the path of boosting iterations; check visually for convergence on test loss (see the sketch after the code below).
- Model validation: gbt.ksval transforms observations to standard uniformly distributed random variables, if the model is specified correctly. It performs a formal Kolmogorov-Smirnov test and plots the transformed observations for visual inspection.
# -- Feature importance --
gbt.importance(feature_names=colnames(caravan.train$x), object=mod)
# -- Model validation --
gbt.ksval(object=mod, y=caravan.test$y, x=caravan.test$x)
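gbt.convergence is not shown in the snippet above. Here is a minimal sketch of its use, assuming it takes the fitted model together with a held-out response and design matrix and returns the loss per boosting iteration (check ?gbt.convergence for the exact interface):
# -- Convergence check (sketch) --
loss_path <- gbt.convergence(object=mod, y=caravan.test$y, x=caravan.test$x)
plot(loss_path, type="l", xlab="boosting iteration", ylab="loss")
which.min(loss_path)  # iteration with the smallest held-out loss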
The functions gbt.ksval and gbt.importance produce, respectively, a model-validation plot and a feature-importance plot.
Furthermore, an aGTBoost model (see the example code)
- is highly robust to dimensions: Comparisons to (penalized) linear regression in (very) high dimensions (see the sketch after this list),
- has minimal worries of overfitting: Stock market classification,
- and can train further given previous models: Boosting from a regularized linear model.
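As a sketch of the first point above, one can append pure noise columns to a simulated regression problem and check that gbt.importance assigns them essentially zero importance. The simulated data and names such as x_noise are made up for illustration only:
# -- Sketch: robustness to noise features (simulated data, illustrative only) --
set.seed(1)
n <- 1000
x_signal <- matrix(rnorm(n * 2), ncol = 2)       # two informative features
x_noise  <- matrix(rnorm(n * 50), ncol = 50)     # fifty pure-noise features
x_sim <- cbind(x_signal, x_noise)
colnames(x_sim) <- c("s1", "s2", paste0("noise", 1:50))
y_sim <- drop(x_signal %*% c(2, -1)) + rnorm(n)  # linear signal plus noise
mod_sim <- gbt.train(y_sim, x_sim, loss_function = "mse")
gbt.importance(feature_names = colnames(x_sim), object = mod_sim)  # noise columns should get ~zero importance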
Dependencies:
- My research
- Eigen for linear algebra
- Rcpp for the R-package
Scheduled updates (research to-do list):
- Adaptive and automatic deterministic frequentist gradient tree boosting.
- Information criterion for fast histogram algorithm (non-exact search) (Fall 2020, planned)
- Adaptive L2-penalized gradient tree boosting. (Fall 2020, planned)
- Automatic stochastic gradient tree boosting. (Fall 2020/Spring 2021, planned)
- Optimal stochastic gradient tree boosting.
References:
- An information criterion for automatic gradient tree boosting
- agtboost: Adaptive and Automatic Gradient Tree Boosting Computations
Any help on the following subjects is especially welcome:
- Utilizing sparsity (possibly Eigen sparsity).
- Parallelization (CPU and/or GPU).
- Distribution (Python, Java, Scala, ...).
- Good ideas and coding best practices in general.
Please note that the priority is to work on and push the above-mentioned scheduled updates. Patience is a virtue. :)