- This repo is the official implementation of A Geometric Perspective towards Neural Calibration via Sensitivity Decomposition (tian21gsd). The paper was accepted at NeurIPS 2021 as a spotlight paper.
- We reimplemented Exploring Covariate and Concept Shift for Out-of-Distribution Detection (tian21explore) and include it in the code base as well. The paper was accepted at the NeurIPS 2021 workshop on Distribution Shift.
- For a brief introduction to these two papers, please visit the project page.
- Create and activate the conda environment:

```bash
conda env create -f requirements.yaml
conda activate gsd
```
- Datasets will be automatically downloaded into the `./datasets` directory the first time.
- We provide support for CIFAR10 and CIFAR100. Please change `name` in the configuration file accordingly (default: CIFAR10).

```yaml
data:
  name: cifar10
```
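For example, to switch to CIFAR100 (assuming the same lowercase naming used for the testing options below):

```yaml
data:
  name: cifar100
```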
- Three sample training configuration files are provided.
- To train a vanilla model:

```bash
python train.py --config ./configs/train/resnet_vanilla.yaml
```

- To train the GSD model proposed in tian21gsd:

```bash
python train.py --config ./configs/train/resnet_gsd.yaml
```

- To train the Geometric ODIN model proposed in tian21explore:

```bash
python train.py --config ./configs/train/resnet_geo_odin.yaml
```
- We provide support for evaluation on CIFAR10, CIFAR100, CIFAR10C, CIFAR100C, and SVHN. We consider both out-of-distribution (OOD) detection and confidence calibration. Models trained on different datasets use different evaluation datasets.

| Training | OOD detection: Near OOD | OOD detection: Far OOD | OOD detection: Special | Calibration: ID | Calibration: OOD |
|---|---|---|---|---|---|
| CIFAR10 | CIFAR10C | CIFAR100, SVHN | CIFAR100 Splits | CIFAR10 | CIFAR10C |
| CIFAR100 | CIFAR100C | CIFAR10, SVHN | - | CIFAR100 | CIFAR100C |
- The `eval.py` file optionally calibrates a model. It 1) evaluates calibration performance and 2) saves several scores for OOD detection evaluation later.
- Run the following command to evaluate on a test set:

```bash
python eval.py --config ./configs/eval/resnet_{model}.yaml
```
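For instance, to evaluate the GSD model (assuming the eval configs follow the same naming scheme as the training configs):

```bash
python eval.py --config ./configs/eval/resnet_gsd.yaml
```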
- To specify a calibration method, set the `calibration` attribute to one of the supported options (use `'none'` to skip calibration). Note that a vanilla model can be calibrated using three supported methods: temperature scaling, matrix scaling, and Dirichlet scaling. GSD and Geometric ODIN use alpha-beta scaling.

```yaml
testing:
  calibration: temperature # ['temperature','dirichlet','matrix','alpha-beta','none']
```
- To select a testing dataset, modify the `dataset` attribute. Note that the calibration dataset (specified under `data: name`) can be different from the testing dataset.

```yaml
testing:
  dataset: cifar10 # testing dataset: one of [cifar10, cifar100, cifar10c, cifar100c, svhn]
```
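Putting the two attributes together, a configuration that calibrates on CIFAR10 and tests on CIFAR10C might look like the following (the top-level layout is assumed from the snippets above):

```yaml
data:
  name: cifar10           # calibration dataset
testing:
  calibration: alpha-beta # GSD / Geometric ODIN use alpha-beta scaling
  dataset: cifar10c       # testing dataset
```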
- Calibration benchmark
  - Results will be saved under `./runs/test/{data_name}/{arch}/{calibration}/{test_dataset}_calibration.txt`.
  - We use Expected Calibration Error (ECE), Negative Log-Likelihood (NLL), and Brier score for calibration evaluation; a sketch of the ECE computation follows this list.
  - We recommend a 5-fold evaluation for the in-distribution (ID) calibration benchmark because CIFAR10/100 does not have a val/test split. Note that `evalx.py` does not save OOD scores.

    ```bash
    python evalx.py --config ./configs/train/resnet_{model}.yaml
    ```

  - (Optional) To use the proposed exponential mapping (tian21gsd) for calibration, set the `exponential_map` attribute to 0.1.
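For reference, a minimal sketch of the standard binned ECE metric (equal-width binning and 15 bins are common defaults, not necessarily the exact settings used by `eval.py`):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width binned ECE. `confidences` are max softmax probabilities,
    `correct` is a 0/1 array marking whether each prediction was right."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # gap between average confidence and accuracy, weighted by bin mass
            ece += in_bin.mean() * abs(confidences[in_bin].mean() - correct[in_bin].mean())
    return ece
```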
- Out-of-Distribution (OOD) benchmark
  - OOD evaluation requires running `eval.py` twice to extract OOD scores from both the ID and OOD datasets.
  - Results will be saved under `./runs/test/{data_name}/{arch}/{calibration}/{test_dataset}_scores.csv`. For example, to evaluate the OOD detection performance of a vanilla model (ID: CIFAR10 vs. OOD: CIFAR10C), you need to run `eval.py` twice, with CIFAR10 and then CIFAR10C as the testing dataset. Upon completion, you will see two files, `cifar10_scores.csv` and `cifar10c_scores.csv`, in the same folder.
  - After the evaluation results are saved, to calculate OOD detection performance, run `calculate_ood.py` and specify the conditions of the model: training set, testing set, model name, and calibration method. The flags help the script locate the csv files saved in the previous step.

    ```bash
    python utils/calculate_ood.py --train cifar10 --test cifar10c --model resnet_vanilla --calibration none
    ```

  - We use AUROC and TNR@TPR95 as evaluation metrics; a sketch of how they can be computed from the saved score files follows this list.
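A minimal sketch, assuming each `*_scores.csv` exposes a single score column (the `score` column name and the `resnet` arch directory are assumptions; check the files produced by your run):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

# Paths follow ./runs/test/{data_name}/{arch}/{calibration}/{test_dataset}_scores.csv
id_scores = pd.read_csv("./runs/test/cifar10/resnet/none/cifar10_scores.csv")["score"].to_numpy()
ood_scores = pd.read_csv("./runs/test/cifar10/resnet/none/cifar10c_scores.csv")["score"].to_numpy()

# Treat ID as the positive class; higher score should mean "more in-distribution".
labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
scores = np.concatenate([id_scores, ood_scores])

auroc = roc_auc_score(labels, scores)

# TNR@TPR95: true negative rate at the threshold where 95% of ID samples are accepted.
fpr, tpr, _ = roc_curve(labels, scores)
tnr_at_tpr95 = 1.0 - fpr[np.searchsorted(tpr, 0.95)]

print(f"AUROC: {auroc:.4f}  TNR@TPR95: {tnr_at_tpr95:.4f}")
```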
- Confidence calibration performance of models trained on CIFAR10

| Model | Accuracy: CIFAR10 | Accuracy: CIFAR10C | ECE: CIFAR10 | ECE: CIFAR10C | NLL: CIFAR10 | NLL: CIFAR10C |
|---|---|---|---|---|---|---|
| Vanilla | 96.25 | 69.43 | 0.0151 | 0.1433 | 0.1529 | 1.0885 |
| Temperature Scaling | 96.02 | 71.54 | 0.0028 | 0.0995 | 0.1352 | 0.8699 |
| Dirichlet Scaling | 95.93 | 71.15 | 0.0049 | 0.1135 | 0.1305 | 0.9527 |
| GSD (tian21gsd) | 96.23 | 71.7 | 0.0057 | 0.0439 | 0.1431 | 0.7921 |
| Geometric ODIN (tian21explore) | 95.92 | 70.18 | 0.0016 | 0.0454 | 0.1309 | 0.8138 |
- Out-of-Distribution Detection Performance (AUROC) of models trained on CIFAR10

| Model | Score function | CIFAR100 | CIFAR10C | SVHN |
|---|---|---|---|---|
| Vanilla | MSP | 88.33 | 71.49 | 91.88 |
| Vanilla | Energy | 88.11 | 71.94 | 92.88 |
| GSD (tian21gsd) | U | 92.68 | 77.68 | 99.29 |
| Geometric ODIN (tian21explore) | U | 92.53 | 78.77 | 99.60 |
- Pretrained models