Energy consumption of code small language models serving with runtime engines and execution providers
- Actionable guidelines for practitioners
- Measuring the impact of deep learning serving configurations on energy and performance
- An analysis of deep learning serving configurations
A serving configuration is a duplet of a runtime engine and an execution provider: <[Runtime engine], [Execution provider]>
- Runtime engines
- Default Torch (TORCH)
- ONNX Runtime engine (ONNX)
- OpenVINO runtime (OV)
- Torch JIT (JIT)
- Execution providers
- CPU Execution Provider (CPU)
- CUDA Execution Provider (CUDA)
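For illustration, a minimal sketch of loading one checkpoint under each runtime engine; the transformers/optimum calls are the public APIs, but the checkpoint name and provider choice are examples, not the exact loading code in app/:

# Minimal sketch: one checkpoint under each runtime engine.
# Assumes transformers, optimum[onnxruntime], and optimum[openvino];
# "bigcode/tiny_starcoder_py" is an example checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
from optimum.intel import OVModelForCausalLM

checkpoint = "bigcode/tiny_starcoder_py"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# TORCH: default eager PyTorch.
torch_model = AutoModelForCausalLM.from_pretrained(checkpoint)

# JIT: TorchScript obtained by tracing the eager model.
example = tokenizer("def add(a, b):", return_tensors="pt")
jit_model = torch.jit.trace(torch_model, (example["input_ids"],), strict=False)

# ONNX: ONNX Runtime; the provider argument selects the execution
# provider (CPUExecutionProvider or CUDAExecutionProvider).
onnx_model = ORTModelForCausalLM.from_pretrained(
    checkpoint, export=True, provider="CPUExecutionProvider"
)

# OV: OpenVINO runtime (CPU).
ov_model = OVModelForCausalLM.from_pretrained(checkpoint, export=True)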
The repository is structured as follows:
- app | API, schemas
- dataset | Input dataset generation
- experiments | Notebooks and scripts to process the profiler datasets
- manuals | Self-contained manuals related to the serving infrastructure
- model_selection | Model selection scripts and metadata
- scripts | Environment scripts and bash scripts for automated experiments
- testing | Scripts to send requests to the server
- requirements.txt | The dependencies of our implementation
- runall_update.sh | Bash script to start the server and run the experiments
- code_slm_selection.csv | Selection of the code SLMs used
Input dataset generation:
- Needed: HumanEval dataset
- Output: New input dataset
- Files: dataset/*
Model selection:
- Needed: Selection criteria
- Output: Selected models
- Files: code_slm_selection.csv
Serving infrastructure:
- Needed: Development of the serving infrastructure, selected models
- Output: Serving infrastructure
- Files: app/
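A hypothetical sketch of the shape of such an API; the real routes and schemas are defined in app/, and the /complete route and CompletionRequest schema below are assumptions:

# Hypothetical sketch of the serving API's shape; the real routes and
# schemas live in app/api_code.py.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):  # hypothetical schema
    prompt: str

@app.post("/complete")  # hypothetical route
def complete(request: CompletionRequest):
    # In the real app, the loaded model and runtime engine produce the
    # completion; echoing the prompt keeps this sketch self-contained.
    return {"completion": request.prompt}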
Experiment execution:
- Needed: Deployed serving infrastructure
- Output: Results (profiler datasets)
- Files: testing/
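A hypothetical client sketch; the endpoint and payload mirror the API sketch above, not the actual request logic in testing/:

# Hypothetical client sketch; the real request logic is in testing/.
import requests

response = requests.post(
    "http://localhost:8000/complete",  # hypothetical route
    json={"prompt": "def fibonacci(n):"},
)
print(response.json())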
- Edit experiment parameters (time, files, ...):
  - Server settings: app/models_code_load.py (model classes, MAX_LENGTH tokens)
  - Experiment settings: testing/utils.py (experiment settings, Python script, input dataset)
  - repeat.sh: repeat n experiment runs, or just execute runall
  - runall_update.sh (experiment settings, bash script):
    - runs the server
    - runs the experiments for each runtime engine
    - server settings
- Run server and experiments (runall.sh):
nohup ./repeat.sh > repeat.out 2>&1 &
Or:
nohup ./runall.sh > results/runall.out 2>&1 &
- Obtain the results in results/*
Data analysis:
- Needed: Profiler datasets
- Output: Research output, data analysis, and support files to answer the RQs:
  - figures
  - tables
  - statistical results
- Files: experiments/
Files in experiments/:
- visualize_{profiler}: Visualization of the raw data obtained from the profilers.
- 01_get_info_{profiler}: Preprocessing of the raw data obtained from the profilers (script).
- 02_get_time_marks: Get the time marks of the inferences done during the experiment (script).
- 03_analysis_{execution_provider}: Process data for analysis (notebook).
- 04_aggregation: Aggregate data (notebook).
- 05_aggregated_plots: Box plots (notebook).
- 06_tests: Obtain the statistical results of the statistical tests used (script).
- 07_tests_merge: Merge test results, organized by dependent variable (notebook).
- 08_analysis: Analyze results (notebook).
- 09_result_tables: Create the paper tables (notebook).
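As an illustration of this pipeline, a sketch that joins per-inference time marks with power samples to estimate energy per inference; the file names and column names are assumptions, not the actual profiler schema:

# Hypothetical sketch of energy aggregation; file and column names are
# assumptions, not the actual profiler output schema.
import pandas as pd

power = pd.read_csv("results/power_samples.csv")  # hypothetical file
marks = pd.read_csv("results/time_marks.csv")     # hypothetical file

records = []
for _, mark in marks.iterrows():
    window = power[
        (power["timestamp"] >= mark["start"])
        & (power["timestamp"] <= mark["end"])
    ]
    # Energy (J) ~= mean power (W) * inference duration (s).
    records.append({
        "inference_id": mark["inference_id"],
        "energy_j": window["power_w"].mean() * (mark["end"] - mark["start"]),
    })

energy = pd.DataFrame(records)
print(energy["energy_j"].describe())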
Selected code SLMs:
- codeparrot-small
- tiny_starcoder
- pythia-410m
- bloomz-560m
- starcoderbase-1b
- bloomz-1b1
- tinyllama
- pythia-1.4b
- codegemma-2b
- phi2
- stablecode-3b
- stablecode-3b-completion
Dataset: testing/inputs.txt
Run server:
uvicorn app.api_code:app --host 0.0.0.0 --port 8000 --reload --reload-dir app
Make inferences:
python3 testing/main.py -i torch -r 5 | tee -a results/out_torch.log
python3 testing/main.py -i onnx -r 5 | tee -a results/out_onnx.log
python3 testing/main.py -i ov -r 5 | tee -a results/out_ov.log
python3 testing/main.py -i torchscript -r 5 | tee -a results/out_torchscript.log
Results are saved in results/
- API creation. Guide to create an API to deploy ML models.
- Add pretrained model. Guide to add pretrained ML models (from HuggingFace, HDF5 format, or pickle format) and make inferences through an API.
- Deploy ML models in a cloud provider (General). Guide to deploy ML models using an API in a cloud provider.
- See more
- https://madewithml.com, API
- https://github.com/se4ai2122-cs-uniba/SE4AI2021Course_FastAPI-demo, API
- https://github.com/MLOps-essi-upc
Please use the following BibTeX entry:
@article{duran2024serving,
title={Identifying architectural design decisions for achieving green ML serving},
author={Dur{\'a}n, Francisco and Martinez, Matias and Lago, Patricia and Mart{\'\i}nez-Fern{\'a}ndez, Silverio},
journal={arXiv preprint arXiv:},
year={2024}
}