COResets and Data Subset selection
Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements and costs by an order of magnitude, using coresets and data selection.
- In this README
- What is CORDS?
- Highlights
- Starting with CORDS
- Applications
- Speedups achieved using CORDS
- Tutorials
- Documentation
- Mailing List
- Acknowledgment
- Team
- Resources
- Publications
CORDS is a COReset and Data Selection library for making machine learning time, energy, cost, and compute efficient. CORDS is built on top of PyTorch. Today, deep learning systems are extremely compute-intensive, with significant turnaround times, energy inefficiencies, higher costs, and resource requirements [7, 8]. CORDS is an effort to make deep learning more energy-, cost-, resource-, and time-efficient without sacrificing accuracy. The following are the goals CORDS tries to achieve:
- Data efficiency
- Reducing end-to-end training time
- Reducing energy requirements
- Faster hyper-parameter tuning
- Reducing resource (GPU) requirements and costs
The primary purpose of CORDS is to select suitable representative data subsets from massive datasets, and it does so iteratively. CORDS uses recent advances in data subset selection, particularly ideas from coresets and submodularity, to select such subsets. CORDS implements several state-of-the-art data subset/coreset selection algorithms for efficient supervised learning (SL) and semi-supervised learning (SSL).
Some of the algorithms currently implemented with CORDS include:
For Efficient and Robust Supervised Learning:
- GLISTER
- GradMatch
- CRAIG
- SubmodularSelection (Facility Location, Feature Based Functions, Coverage, Diversity)
- RandomSelection
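To make the submodular option in the list above concrete, here is a minimal sketch of greedy facility-location selection in plain Python. It is an illustration of the general technique, not the CORDS or submodlib implementation: the function names, the distance-based similarity, and the toy data are all assumptions made for this example.

```python
from typing import List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity that decays with squared Euclidean distance (illustrative choice)."""
    dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + dist_sq)


def facility_location_greedy(points: List[Sequence[float]], budget: int) -> List[int]:
    """Greedily pick indices that maximize f(S) = sum_i max_{j in S} sim(i, j)."""
    n = len(points)
    sim = [[similarity(points[i], points[j]) for j in range(n)] for i in range(n)]
    best_cover = [0.0] * n  # best similarity of each point to the selected set so far
    selected: List[int] = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much candidate j improves the coverage of every point.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two tight clusters (indices 0-2 and 3-5): a budget of 2 should cover both.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
subset = facility_location_greedy(data, budget=2)
print(subset)
```

Because the objective rewards covering every point with a nearby representative, the second pick comes from the cluster that the first pick left uncovered, which is the diversity behavior one wants from a representative subset.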
For Efficient and Robust Semi-supervised Learning:
- RETRIEVE
We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:
- Reproducibility of SOTA in Data Selection and Coresets: Enable easy reproducibility of the SOTA algorithms described above. We are also working to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
- Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets, including CIFAR-10, CIFAR-100, MNIST, SVHN, and ImageNet.
- Ease of Use: One of the main goals of CORDS is to be easy to use and easy to extend. Feel free to contribute to CORDS!
- Modular design: The data selection algorithms are directly incorporated into data loaders, allowing one to use their own training loop for varied utility scenarios.
- A broad range of use cases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating several additional use cases like Auto-ML, object detection, speech recognition, semi-supervised learning, etc.
- 3x to 5x speedups, cost reduction, and energy reductions in the training of deep models in supervised learning
- 3x+ speedups, cost/energy reduction for deep model training in semi-supervised learning
- 3x to 30x speedups and cost/energy reduction for Hyper-parameter tuning using subset selection with SOTA schedulers (Hyperband and ASHA) and algorithms (TPE, Random)
To install the latest version of the CORDS package using PyPI:
pip install cords
To install using the source:
git clone https://github.com/decile-team/cords.git
cd cords
pip install -r requirements/requirements.txt
To better understand CORDS's functionality, we have provided example Jupyter notebooks and Python code in the examples folder, which can be easily executed using Google Colab. We also provide simple SL, SSL, and HPO training loops that run experiments using a provided configuration file. To run these loops, see the following code examples:
Create a subset selection based data loader at train time and use it with your own training loop.
Because the subset selection strategies are integrated directly into the subset data loaders, using a strategy only requires creating its respective data loader.
The example below shows how the subset selection process reduces to calling a data loader in the supervised learning setting:
from dotmap import DotMap
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader

# Assumed to be defined beforehand: model, criterion_nored (a loss with no
# reduction), trainloader, valloader, logger, and num_epochs.

# Pass on necessary arguments for GLISTERDataLoader
dss_args = dict(model=model,
                loss=criterion_nored,
                eta=0.01,
                num_classes=10,
                num_epochs=300,
                device='cuda',
                fraction=0.1,
                select_every=20,
                kappa=0,
                linear_layer=False,
                selection_type='SL',
                greedy='Stochastic')
dss_args = DotMap(dss_args)

# Create GLISTER subset selection dataloader
dataloader = GLISTERDataLoader(trainloader,
                               valloader,
                               dss_args,
                               logger,
                               batch_size=20,
                               shuffle=True,
                               pin_memory=False)

for epoch in range(num_epochs):
    for _, (inputs, targets, weights) in enumerate(dataloader):
        """
        Standard PyTorch training loop using weighted loss.
        Our training loop differs from the standard PyTorch training loop in that, along with
        data samples and their associated target labels, we also have additional sample weight
        information from the subset data loader, which can be used to calculate the weighted
        loss for gradient descent. We can calculate the weighted loss by using default PyTorch
        loss functions with no reduction.
        """
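The docstring above describes a weighted loss built from per-sample (unreduced) losses and the sample weights yielded by the subset data loader. A minimal sketch of that arithmetic follows, written in plain Python so it runs without PyTorch. The helper names `cross_entropy_per_sample` and `weighted_loss` are hypothetical, and normalizing the weights to sum to 1 is an assumption rather than something this README specifies.

```python
import math
from typing import List, Sequence


def cross_entropy_per_sample(logits: Sequence[Sequence[float]],
                             targets: Sequence[int]) -> List[float]:
    """Unreduced cross-entropy: one loss value per sample."""
    losses = []
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[target])
    return losses


def weighted_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-sample losses (weights normalized to sum to 1)."""
    total = sum(weights)
    return sum(loss_i * w_i for loss_i, w_i in zip(losses, weights)) / total


# Toy batch of three samples with two classes; the weights are made up.
logits = [[2.0, 0.5], [0.2, 1.5], [1.0, 1.0]]
targets = [0, 1, 0]
weights = [3.0, 1.0, 1.0]

per_sample = cross_entropy_per_sample(logits, targets)
weighted = weighted_loss(per_sample, weights)
uniform = weighted_loss(per_sample, [1.0, 1.0, 1.0])  # reduces to the plain mean
print(weighted, uniform)
```

In a PyTorch loop, the per-sample losses would come from a loss constructed with `reduction='none'`, such as `torch.nn.CrossEntropyLoss(reduction='none')`, and the weighted sum would be backpropagated in place of the usual mean.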
In our current version, we deployed subset selection data loaders in supervised learning and semi-supervised learning settings.
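The `fraction`, `select_every`, and `num_epochs` arguments in the GLISTER example above together determine how much data each run processes. The sketch below works through that arithmetic. The `training_budget` helper is hypothetical, the 50,000-sample dataset size is an assumption (it matches the CIFAR-10 training split), and the count of re-selections assumes one selection at the start and then one every `select_every` epochs. It does not model the cost of the selection step itself, so it is an upper bound on the saving, not a measured speedup.

```python
import math


def training_budget(num_samples: int, fraction: float,
                    num_epochs: int, select_every: int) -> dict:
    """Count the work implied by a subset-selection schedule."""
    subset_size = round(fraction * num_samples)
    num_selections = math.ceil(num_epochs / select_every)
    return {
        "subset_size": subset_size,
        "num_selections": num_selections,
        "samples_processed": subset_size * num_epochs,
        "samples_processed_full": num_samples * num_epochs,
    }


# fraction, num_epochs, and select_every are taken from the GLISTER example;
# num_samples is an assumed dataset size.
budget = training_budget(num_samples=50_000, fraction=0.1,
                         num_epochs=300, select_every=20)
ratio = budget["samples_processed_full"] / budget["samples_processed"]
print(budget)
print(f"training passes reduced by about {ratio:.0f}x, before selection overhead")
```

The gap between this idealized ratio and the end-to-end speedups reported for CORDS is where the selection overhead shows up, which is why `select_every` is a trade-off between subset freshness and cost.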
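# Default supervised learning (SL) training loop, driven by a configuration file
from train_sl import TrainClassifier
from cords.utils.config_utils import load_config_data

# GLISTER on CIFAR-10; the /content prefix is the Google Colab working directory
config_file = '/content/cords/configs/SL/config_glister_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()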
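# Default semi-supervised learning (SSL) training loop, driven by a configuration file
from train_ssl import TrainClassifier
from cords.utils.config_utils import load_config_data

# RETRIEVE configuration for CIFAR-10; the /content prefix is the Google Colab working directory
config_file = '/content/cords/configs/SSL/config_retrieve-warm_vat_cifar10.py'
cfg = load_config_data(config_file)
clf = TrainClassifier(cfg)
clf.train()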
You can use the default configurations that we have provided in the configs folder, or you can make a custom configuration. For making your custom configuration file for training, please refer to CORDS Configuration File Documentation.
The subset selection strategies for efficient supervised learning in CORDS allow one to train models faster. This faster, subset-based training can in turn be used to evaluate configurations more quickly during hyper-parameter tuning. A detailed pipeline figure of efficient hyper-parameter tuning using subset-based training for faster configuration evaluations can be seen below:
Any existing data subset selection strategy in CORDS can be combined with existing hyper-parameter search and scheduling algorithms. We currently use the Ray Tune library for hyper-parameter tuning and search algorithms.
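The idea of evaluating each configuration on a data subset can be sketched without any tuning library. The toy example below is an illustration under stated assumptions, not the CORDS or Ray Tune pipeline: it uses a small grid of learning rates in place of a real search algorithm, a uniformly random subset in place of a CORDS-selected one, and a one-parameter regression model. All names (`make_data`, `train_and_score`) are hypothetical.

```python
import random
from typing import List, Tuple

Sample = Tuple[float, float]


def make_data(n: int, rng: random.Random) -> List[Sample]:
    """Toy regression data: y = 3x + Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
    return data


def train_and_score(lr: float, train: List[Sample], val: List[Sample],
                    epochs: int = 5) -> Tuple[float, int]:
    """Fit y = w * x with SGD; return (validation MSE, number of gradient steps)."""
    w, steps = 0.0, 0
    for _ in range(epochs):
        for x, y in train:
            w -= lr * 2.0 * (w * x - y) * x  # gradient of the squared error
            steps += 1
    mse = sum((w * x - y) ** 2 for x, y in val) / len(val)
    return mse, steps


rng = random.Random(0)
train_set, val_set = make_data(2000, rng), make_data(200, rng)

# Random 10% subset standing in for a subset chosen by a selection strategy.
fraction = 0.1
subset = rng.sample(train_set, int(fraction * len(train_set)))

# Every candidate configuration is evaluated on the subset only.
candidates = [1e-4, 1e-3, 1e-2, 1e-1]
results = {lr: train_and_score(lr, subset, val_set) for lr in candidates}
best_lr = min(results, key=lambda lr: results[lr][0])

# Cost of one evaluation on the subset versus on the full training set.
subset_steps = results[best_lr][1]
_, full_steps = train_and_score(best_lr, train_set, val_set)
print(f"best lr: {best_lr}, steps per evaluation: {subset_steps} (subset) vs {full_steps} (full)")
```

Each configuration costs a tenth of the gradient steps here, and the saving multiplies across every configuration the search explores, which is why subset-based evaluation pairs well with schedulers that try many configurations.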
Please refer to the following tutorial notebook for the usage of CORDS subset selection strategies for efficient hyper-parameter optimization.
To achieve significant speedups, one can use the subset selection data loaders from CORDS while keeping the training algorithm the same. The speedups one can achieve using the subset selection data loaders from CORDS are shown below:
We have added example Python code and tutorial notebooks under the examples folder. See this link
The documentation for the latest version of CORDS can always be found here.
We value and encourage contributions from the open-source community to enhance the CORDS library. Here are some guidelines for contributing:
- Report issues: If you come across any bugs or have suggestions for improvements, please raise an issue on our GitHub repository. Provide detailed information about the problem or feature request, including steps to reproduce the issue if applicable.
- Feature requests: If you have ideas for new features or enhancements, feel free to submit a feature request on GitHub. Clearly describe the proposed functionality and how it aligns with the goals of the CORDS library.
- Code contributions: We welcome code contributions to improve CORDS. If you plan to contribute code, please follow these steps:
  - Fork the CORDS repository on GitHub.
  - Create a new branch for your work based on the develop branch.
  - Make your changes and ensure they are well-documented and tested.
  - Submit a pull request, providing a clear explanation of the changes made and their purpose.
- Code style: When contributing code, please adhere to the existing code style and formatting conventions used in the CORDS library. Consistency in code style helps maintain readability and makes it easier to review and merge contributions.
- Testing: Ensure that your code changes pass the existing tests.
To receive updates about CORDS and to be a part of the community, join the Decile_CORDS_Dev group.
https://groups.google.com/forum/#!forum/Decile_CORDS_Dev/join
This library takes inspiration from, builds upon, and uses pieces of code from several open source codebases. These include Teppei Suzuki's consistency based SSL repository and Richard Liaw's Tune repository. Also, CORDS uses submodlib for submodular optimization.
CORDS is created and maintained by Krishnateja Killamsetty, Dheeraj N Bhat, Rishabh Iyer, and Ganesh Ramakrishnan. We look forward to making CORDS more community-driven. Please use it and contribute to it for your efficient learning research, and feel free to use it for your commercial projects. We will add the major contributors here.
[1]: Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer, “AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning”. arXiv [cs.LG], 2022. arXiv:2203.08212.
[2]: Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, and Rishabh Iyer, “RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning”. To Appear in Neural Information Processing Systems, NeurIPS 2021.
[3]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, Rishabh Iyer. “GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training”. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 139:5464–5474. Proceedings of Machine Learning Research. PMLR, 2021.
[4]: Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer. “GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning”. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, 8110–8118. AAAI Press, 2021.
[5]: Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. “Coresets for Data-efficient Training of Machine Learning Models”. In International Conference on Machine Learning (ICML), July 2020
[6]: Vishal Kaushal, Rishabh Iyer, Suraj Kothiwade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan, “Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision”. 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019 Hawaii, USA
[7]: Schwartz, Roy, et al. "Green AI." arXiv preprint arXiv:1907.10597 (2019).
[8]: Strubell, Emma, Ananya Ganesh, and Andrew McCallum. “Energy and policy considerations for deep learning in NLP.” In ACL 2019.
[9]: Kai Wei, Rishabh Iyer, Jeff Bilmes, “Submodularity in Data Subset Selection and Active Learning”. International Conference on Machine Learning (ICML) 2015
[10]: Wei, Kai, et al. Submodular subset selection for large-scale speech training data. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.