PurdueDualityLab/ptm-quantify-esem-2024

What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims

Datasets

Data from Hugging Face, PeaTMOSS, PTMTorrent, and Ecosyste.ms were used in this paper. Some of the data was augmented for this paper.

Where feasible, the data used has been included here.

| Dataset | Source | Modified for this paper? | Copy included in this repo | Filename |
| --- | --- | --- | --- | --- |
| PeaTMOSS | Link | Yes | Yes | peatmoss_data.json |
| HF Model Metadata | Link | No | Yes | N/A |
| PTMTorrent | Link | No | Yes | ptmtorrent_data.csv |
| Ecosyste.ms | Link | No | No | N/A |

Ecosyste.ms

Because of the size of the Ecosyste.ms Packages dataset, a copy of the data could not be included in this repository. The snapshots used for the "Turnover of top packages over time" metric can be found here. The snapshots from 2023-08-08, 2023-10-22, 2023-11-09, and 2024-03-01 were used. When loaded into Postgres, the tables were named ecosystems_{year}_{month}.
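
The naming pattern above can be sketched as a small helper that maps a snapshot date to its Postgres table name. This is only an illustration: the function name `snapshot_table_name` is hypothetical, and zero-padding the month is an assumption, since the README gives only the pattern ecosystems_{year}_{month}.

```python
from datetime import date


def snapshot_table_name(snapshot: date) -> str:
    """Build a table name of the form ecosystems_{year}_{month}."""
    # Assumption: the month is zero-padded; the README only states the pattern.
    return f"ecosystems_{snapshot.year}_{snapshot.month:02d}"


# A few of the snapshot dates listed in this section.
snapshots = [date(2024, 3, 1), date(2023, 10, 22), date(2023, 11, 9)]
table_names = [snapshot_table_name(s) for s in snapshots]
print(table_names)
```

If the tables were actually created without zero-padding, drop the `:02d` format specifier.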

PeaTMOSS

The PeaTMOSS dataset was augmented for this paper: the model_to_base_model table was extended with data from this dataset. Only links between models already captured in PeaTMOSS were added. The file PeaTMOSS_DIST.db.zip does not contain the GitHub metadata, as that information was not used in the study. As a result, it is much more compact than the full dataset.
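
The rule that only links between models already present in PeaTMOSS were added could look roughly like the sketch below. It uses an in-memory SQLite database with a simplified, hypothetical schema: the `model` table, its `name` column, and the `model_id` / `base_model_id` columns are assumptions for illustration and may not match the real PeaTMOSS schema. Only the table name model_to_base_model comes from this README.

```python
import sqlite3


def add_known_links(conn, candidate_links):
    """Insert (model, base model) pairs only when both models already exist."""
    added = 0
    for model_name, base_name in candidate_links:
        rows = conn.execute(
            "SELECT name, id FROM model WHERE name IN (?, ?)",
            (model_name, base_name),
        ).fetchall()
        ids = dict(rows)
        # Skip any link whose endpoint is missing from the dataset.
        if model_name in ids and base_name in ids:
            conn.execute(
                "INSERT INTO model_to_base_model (model_id, base_model_id) "
                "VALUES (?, ?)",
                (ids[model_name], ids[base_name]),
            )
            added += 1
    return added


# Hypothetical, simplified schema and data for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE model_to_base_model (model_id INTEGER, base_model_id INTEGER);
    INSERT INTO model (name) VALUES ('org/base'), ('org/finetuned');
""")
candidates = [("org/finetuned", "org/base"), ("other/unknown", "org/base")]
added = add_known_links(conn, candidates)
links = conn.execute(
    "SELECT model_id, base_model_id FROM model_to_base_model"
).fetchall()
print(added, links)
```

The second candidate is dropped because `other/unknown` is not in the `model` table, which mirrors the restriction described above.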

Metrics

The datasets and files used for each metric are present here:

| Claim | Metric | Datasets used | Corresponding figure | Files used |
| --- | --- | --- | --- | --- |
| C1: Transformers library increases accessibility | Preservation rate of libraries to descendants | PeaTMOSS | Figure 5 | proportion_DirectDescendantsToLibrary.py |
| C2: Popularity impacts PTM selection | Turnover of top packages over time | PeaTMOSS, PTMTorrent, HF Model Metadata [1][2], Ecosyste.ms | Figure 6 | turnover.ipynb |
| C2: Popularity impacts PTM selection | Number of descendants of top packages | PeaTMOSS | Figure 7 | number_DirectDescendantsToParentModels.py, number_DirectDescendantsToDownloads.py, descendents.py |
| C3: Documentation quality influences selection | Documentation quality score | PeaTMOSS, HF Model Metadata [1][2] | Figure 8 | model_cards.py |
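
To give a sense of what a "turnover of top packages" computation might involve, the sketch below compares the top-k packages in two snapshots. The definition used here (the fraction of the earlier top-k that no longer appears in the later top-k) is an assumption for illustration. The function name `top_k_turnover` and the download counts are hypothetical, and turnover.ipynb may define the metric differently.

```python
def top_k_turnover(earlier, later, k):
    """Fraction of the earlier top-k packages missing from the later top-k.

    `earlier` and `later` map package names to a popularity count
    (for example, downloads) at two snapshot dates.
    """
    top_earlier = set(sorted(earlier, key=earlier.get, reverse=True)[:k])
    top_later = set(sorted(later, key=later.get, reverse=True)[:k])
    if not top_earlier:
        return 0.0
    return len(top_earlier - top_later) / len(top_earlier)


# Made-up counts for two snapshots of the same four packages.
earlier = {"a": 100, "b": 90, "c": 80, "d": 10}
later = {"a": 120, "b": 50, "c": 95, "d": 110}
turnover = top_k_turnover(earlier, later, k=3)
print(turnover)
```

Here package `b` falls out of the top three and `d` replaces it, so one of the three earlier top packages has turned over.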