Test #36 (Draft)
Wants to merge 38 commits into base: master

Changes from all commits (38 commits):
2c74919  test (Jan 30, 2022)
90f7506  test (Jan 30, 2022)
b8af0d7  Update README.md (giovannimonea, Jan 31, 2022)
f0d28ab  Update README.md (giovannimonea, Jan 31, 2022)
c08d998  Update .cirrus.yml (giovannimonea, Feb 1, 2022)
21e274f  Update test_base.py (giovannimonea, Feb 1, 2022)
1fbce2b  Modified readme (giovannimonea, Feb 1, 2022)
8c39ea6  Update README.md (nefagi-01, Feb 2, 2022)
4ce8702  Added first implementation for web_app (giovannimonea, Feb 2, 2022)
abb5c37  Removed text from html (giovannimonea, Feb 2, 2022)
eaf5187  Small change (giovannimonea, Feb 2, 2022)
a48b06e  Create README.rst (nefagi-01, Feb 2, 2022)
5f05af4  Delete README.md (nefagi-01, Feb 2, 2022)
ca99cbb  Update README.rst (nefagi-01, Feb 2, 2022)
a73bb6d  Update README.rst (nefagi-01, Feb 2, 2022)
8c16253  Update README.rst (nefagi-01, Feb 2, 2022)
4405c77  Update README.rst (nefagi-01, Feb 2, 2022)
06731f2  Update README.rst (nefagi-01, Feb 2, 2022)
fcc8120  Update README.rst (nefagi-01, Feb 2, 2022)
49bedc4  Update README.rst (nefagi-01, Feb 2, 2022)
8e329a3  Update README.rst (nefagi-01, Feb 2, 2022)
7d5b772  Update README.rst (nefagi-01, Feb 2, 2022)
bcdafa3  Update README.rst (nefagi-01, Feb 2, 2022)
dd5ff31  Update README.rst (nefagi-01, Feb 2, 2022)
88c7892  Update README.rst (nefagi-01, Feb 2, 2022)
535675f  Update README.rst (nefagi-01, Feb 2, 2022)
46cb66f  Merge branch 'branch_to_pull' of https://github.com/epfl-iglobalhealt… (giovannimonea, Feb 2, 2022)
a16bbb2  web app initial commit (Feb 3, 2022)
7ae5e6c  Updated template for input (giovannimonea, Feb 3, 2022)
5539948  web app commit (Feb 3, 2022)
c238ead  web app commit (Feb 3, 2022)
f296d79  Implemented use of target column in form (giovannimonea, Feb 7, 2022)
b11f365  Solved bug with datasets with categorical features (giovannimonea, Feb 7, 2022)
4ea8cb5  Deleted trash directory (giovannimonea, Feb 7, 2022)
49187bc  Update README.rst (giovannimonea, Feb 7, 2022)
4829bae  Cirrus CI badge (giovannimonea, Feb 7, 2022)
acca95f  Update README.rst (giovannimonea, Feb 9, 2022)
dbb5eb3  Update README.rst (giovannimonea, Feb 9, 2022)

4 changes: 2 additions & 2 deletions .cirrus.yml
@@ -2,7 +2,7 @@ unittest_task:
container:
image: python:slim
install_dependencies_script: |
- pip3 install --upgrade pip
+ python3 -m pip install --upgrade pip
pip3 install unittest_xml_reporting
pip3 install geocoder
pip3 install pandas
@@ -19,4 +19,4 @@ unittest_task:
upload_results_artifacts:
path: ./*.xml
format: junit
- type: text/xml
+ type: text/xml
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
- base_repository/hardware/dump.json
+ base_repository/hardware_data/dump.json
prediction_feature/datasets/creditcard.csv
prediction_feature/datasets/mnist/mnist_test.csv
prediction_feature/datasets/mnist/mnist_train.csv
109 changes: 0 additions & 109 deletions README.md

This file was deleted.

170 changes: 170 additions & 0 deletions README.rst
@@ -0,0 +1,170 @@
|Cirrus CI|

.. |Cirrus CI| image:: https://api.cirrus-ci.com/github/epfl-iglobalhealth/cumulator.svg
:target: https://cirrus-ci.com/github/epfl-iglobalhealth/cumulator

=========
CUMULATOR
=========

A tool to quantify and report the carbon footprint of machine learning computations and communication in academia and healthcare

Aim
___
Raise awareness about the carbon footprint of machine learning methods and encourage further optimization and the rational use of AI-powered tools.
This work advocates for sustainable AI and the rational use of IT systems.

Key Carbon Indicators
_____________________
* **One hour of GPU load is equivalent to 112 gCO2eq**
* **1 GB of data traffic through a data center is equivalent to 31 gCO2eq**
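
These figures follow from the default assumptions listed further below: 250 W of GPU load for one hour is 0.25 kWh, and 0.25 kWh × 447 gCO2eq/kWh gives roughly 112 gCO2eq; likewise, 1 GB is about 1e6 kB, and 1e6 kB × 6.894e-8 kWh/kB × 447 gCO2eq/kWh gives roughly 31 gCO2eq.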

Prerequisites
_______________
The tool works on Linux, Windows and macOS.

**Required Libraries**

- ``geocoder`` (https://geocoder.readthedocs.io)
- ``geopy`` (https://geopy.readthedocs.io/en/stable/)
- ``GPUtil`` (https://pypi.org/project/GPUtil/)
- ``cpuinfo`` (https://pypi.org/project/py-cpuinfo/)

Install and use
_______________

Free software: MIT license

``pip install cumulator`` <- installs CUMULATOR

``from cumulator import base`` <- imports the script

``cumulator = base.Cumulator()`` <- creates a Cumulator instance

**Measure cost of computations.**

- First option: activate and deactivate the chronometer with ``cumulator.on()`` and ``cumulator.off()`` around your ML computations (typically within each iteration). Each duration is automatically recorded in ``cumulator.time_list`` and summed by ``cumulator.cumulated_time()``. Then retrieve the carbon footprint of all computations with ``cumulator.computation_costs()``. A sketch of this pattern follows the example below.
- Second option: automatically track the computation cost of a generic function with ``cumulator.run(function, *args, **kwargs)`` and then call ``cumulator.computation_costs()`` as before. An example is reported below:

::

    from cumulator import base
    from sklearn import datasets
    from sklearn.linear_model import LinearRegression

    cumulator = base.Cumulator()
    model = LinearRegression()
    diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

    # without output and with keyword arguments
    cumulator.run(model.fit, X=diabetes_X, y=diabetes_y)

    # with output and without keyword arguments
    y = cumulator.run(model.predict, diabetes_X)

    # show results
    cumulator.computation_costs()
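
For the first option, a minimal sketch of the same workflow driven by the chronometer directly (the model and data are only illustrative):

::

    from cumulator import base
    from sklearn import datasets
    from sklearn.linear_model import LinearRegression

    cumulator = base.Cumulator()
    diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
    model = LinearRegression()

    # start the chronometer, run the computation, stop the chronometer;
    # the elapsed time is appended to cumulator.time_list
    cumulator.on()
    model.fit(diabetes_X, diabetes_y)
    cumulator.off()

    # carbon footprint of all recorded computations
    cumulator.computation_costs()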



**Measure cost of communications.**

- Each time your model sends a data file to another node of the network, record the size of the transferred file (in kilobytes) with ``cumulator.data_transferred(file_size)``. The amount of data transferred is automatically recorded in ``cumulator.file_size_list`` and accumulated in ``cumulator.cumulated_data_traffic``. Then retrieve the carbon footprint of all communications with ``cumulator.communication_costs()``. A short sketch is shown below.
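
A short sketch of recording a single transfer (the 2500 kB file size is only an illustration):

::

    from cumulator import base

    cumulator = base.Cumulator()

    # a 2500 kB file is sent to another node of the network;
    # the size is appended to cumulator.file_size_list
    cumulator.data_transferred(2500)

    # carbon footprint of all recorded communications
    cumulator.communication_costs()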

**Display your total carbon footprint**

- Display the carbon footprint of your recorded actions with ``cumulator.display_carbon_footprint()``:

::

    ########
    Overall carbon footprint: 1.02e-08 gCO2eq
    ########
    Carbon footprint due to computations: 1.02e-08 gCO2eq
    Carbon footprint due to communications: 0.00e+00 gCO2eq
    This carbon footprint is equivalent to 1.68e-13 incandescent lamps switched to leds.


- You can also return the total carbon footprint as a number using ``cumulator.total_carbon_footprint()``.

**Default assumptions: geo-localization and CPU/GPU detection (can be manually modified for better estimates):**

Cumulator tries to detect the CPU and the GPU in use and sets the corresponding computation cost value. If detection fails, the default value is used.
It is also possible to modify the default value manually.

``self.hardware_load = 250 / 3.6e6`` <- computation costs: power consumption of a typical GPU in watts, converted to kWh/s

``self.one_byte_model = 6.894E-8`` <- communication costs: average energy impact of traffic in a typical data center, in kWh/kB

Cumulator tries to set the carbon intensity value based on the user's geographical position. If detection fails, the default value is used.
It is also possible to modify the default value manually, as in the sketch below.

``self.carbon_intensity = 447`` <- conversion to carbon footprint: average carbon intensity value in gCO2eq/kWh in the EU in 2014

``self.n_gpu = 1`` <- number of GPUs used in parallel
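
A brief sketch of overriding these attributes on an instance (the values are purely illustrative and assume the attributes above are plain instance attributes):

::

    from cumulator import base

    cumulator = base.Cumulator()

    # assume a 300 W accelerator and two GPUs running in parallel
    cumulator.hardware_load = 300 / 3.6e6   # kWh/s
    cumulator.n_gpu = 2

    # assume a local grid carbon intensity of 50 gCO2eq/kWh
    cumulator.carbon_intensity = 50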

**Prediction consumption and F1-Score on classification tasks**

- ``cumulator.predict_consumptions_f1(dataset, target)``: Cumulator can estimate both the consumption and the F1-score of several classification algorithms (Linear, Decision Tree, Random Forest, Neural Network) on a given dataset. The goal is to let users choose the algorithm that gives the best score at the lowest possible consumption.

An example is reported below:

::

    from cumulator.base import Cumulator
    from sklearn.datasets import load_iris
    import pandas as pd
    import numpy as np

    cumulator = Cumulator()
    iris = load_iris()
    data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=iris['feature_names'] + ['target'])
    cumulator.predict_consumptions_f1(data1, 'target')

**Important**:
The model used to predict consumption and F1-score has been trained on datasets with up to:

- 1000 features
- 20 classes
- 100000 instances
- 80000 missing values.

Therefore, before using this feature, please check whether your dataset exceeds these limits.

More information about the prediction feature and the detection of the user's position and CPU/GPU is available at https://github.com/epfl-iglobalhealth/CS433-2021-ecoML.

Project Structure
_________________

::

    src/
    ├── cumulator
    │   ├── base.py             <- implementation of the Cumulator class
    │   ├── prediction_feature  <- implementation of the prediction feature
    │   └── bonus.py            <- Impact Statement Protocol

Cite
____

::

    @article{cumulator,
      title={A tool to quantify and report the carbon footprint of machine learning computations and communication in academia and healthcare},
      author={Tristan Trebaol and Mary-Anne Hartley and Martin Jaggi and Hossein Shokri Ghadikolaei},
      journal={Infoscience EPFL: record 278189},
      year={2020}
    }

ChangeLog
_________
* 18.06.2020: 0.0.6 update README.rst
* 11.06.2020: 0.0.5 add number of processors (0.0.4 failed)
* 08.06.2020: 0.0.3 added bonus.py carbon impact statement
* 07.06.2020: 0.0.2 added communication costs and cleaned src/
* 21.05.2020: 0.0.1 deployment on PyPI and integration with Alg-E

Links
_____
* Material: https://drive.google.com/drive/u/1/folders/1Cm7XmSjXo9cdexejbLpbV0TxJkthlAGR
* GitHub: https://github.com/epfl-iglobalhealth/cumulator
* PyPI: https://pypi.org/project/cumulator/
* Prediction Feature, geo-localization, CPU/GPU detection: https://github.com/epfl-iglobalhealth/CS433-2021-ecoML


Binary file removed base_repository/__pycache__/base.cpython-39.pyc
Binary file not shown.
27 changes: 13 additions & 14 deletions base_repository/base.py
@@ -2,10 +2,6 @@
This is the base class of cumulator.
'''

- import os
-
- parentdir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
- os.sys.path.insert(0, parentdir)
import json
import time as t
import geocoder
@@ -17,13 +13,16 @@
import os
import re

- from prediction_feature.prediction_helper import get_predictions, compute_features
- from prediction_feature.visualization_helper import scatterplot
+ from base_repository.prediction_feature.prediction_helper import get_predictions, compute_features
+ from base_repository.prediction_feature.visualization_helper import scatterplot

+ parentdir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+ os.sys.path.insert(0, parentdir)

- country_dataset_path = 'countries/country_dataset_adjusted.csv'
- gpu_dataset_path = 'hardware/gpu.csv'
- metrics_dataset_path = 'metrics/CO2_metrics.json'
- cpu_dataset_path = 'hardware/cpu.csv'
+ country_dataset_path = 'countries_data/country_dataset_adjusted.csv'
+ gpu_dataset_path = 'hardware_data/gpu.csv'
+ metrics_dataset_path = 'metrics_conversion_data/CO2_metrics.json'
+ cpu_dataset_path = 'hardware_data/cpu.csv'
regexp_cpu = '(Core|Ryzen).* (i\d-\d{3,5}.?|\d \d{3,5}.?)'


@@ -63,9 +62,9 @@ def set_hardware(self, hardware):
elif hardware == "cpu":
# search_cpu will try to detect the cpu on the device and set the corresponding TDP value as TDP value of Cumulator
self.detect_cpu()
- # in case of wrong value of hardware let default TDP
+ # in case of wrong value of hardware_data let default TDP
else:
- print(f'hardware expected to be "cpu" or "gpu". TDP set to default value {self.TDP}')
+ print(f'hardware_data expected to be "cpu" or "gpu". TDP set to default value {self.TDP}')

# function for trying to detect gpu and set corresponding TDP value as TDP value of cumulator
def detect_gpu(self):
@@ -196,7 +195,7 @@ def display_carbon_footprint(self):
"{:.2e}".format(self.computation_costs()))
print('Carbon footprint due to communications: %s gCO2eq' %
"{:.2e}".format(self.communication_costs()))
- # loading metrics dataset
+ # loading metrics_conversion_data dataset
dirname = os.path.dirname(__file__)
relative_metric_dataset_path = os.path.join(dirname, metrics_dataset_path)

@@ -205,7 +204,7 @@ def display_carbon_footprint(self):
# computing equivalent of gCO2eq
for metric in metrics:
metric['equivalent'] = float(metric['eq_factor']) * (self.total_carbon_footprint())
- # select random equivalent metrics and print
+ # select random equivalent metrics_conversion_data and print
metric = metrics[random.randint(0, len(metrics) - 1)]
print('This carbon footprint is equivalent to {:0.2e} {}.'.format(metric['equivalent'],
metric['measure'].lower()))