What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims
This paper uses data from Hugging Face, PeaTMOSS, PTMTorrent, and Ecosyste.ms. Some of the data was augmented for this paper.
Where feasible, the data used has been included here.
Dataset | Source | Modified for this paper? | Copy Included in this Repo | Filename |
---|---|---|---|---|
PeaTMOSS | Link | Yes | Yes | peatmoss_data.json |
HF Model Metadata | Link | No | Yes | N/A |
PTMTorrent | Link | No | Yes | ptmtorrent_data.csv |
Ecosyste.ms | Link | No | No | N/A |
The Packages Dataset from Ecosyste.ms is too large to include a copy in this repository. The snapshots used for the metric "Turnover of top packages over time" can be found here. The snapshots from 2023-08-08, 2023-10-22, 2023-11-09, and 2024-03-01 were used. When loaded into Postgres, they were named as ecosystems_{year}_{month}.
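The naming pattern above can be sketched as a small helper. This is only an illustration: the function name `snapshot_table_name` is hypothetical, and the zero-padded month is an assumption, since the pattern ecosystems_{year}_{month} does not say how the month is formatted.

```python
from datetime import date


def snapshot_table_name(snapshot: date) -> str:
    """Build a name following the ecosystems_{year}_{month} pattern.

    Assumption: the month is zero-padded to two digits. Adjust the
    format spec if the local Postgres names differ.
    """
    return f"ecosystems_{snapshot.year}_{snapshot.month:02d}"


# Example with one of the snapshot dates listed above.
print(snapshot_table_name(date(2024, 3, 1)))  # ecosystems_2024_03
```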
The PeaTMOSS dataset was augmented for this paper: the table model_to_base_model was extended with data from this dataset. Only links between models that were already captured in PeaTMOSS were added. The file PeaTMOSS_DIST.db.zip omits the GitHub metadata, which was not used in the study, so it is much more compact than the full dataset.
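To check what the augmented table contains, a minimal sketch along the following lines may help. It assumes the archive unpacks to a SQLite database file (the path `PeaTMOSS_DIST.db` in the usage comment is an assumption), and the helper name `describe_table` is hypothetical. It makes no assumption about the table's column names; it only reports what it finds.

```python
import sqlite3
from contextlib import closing


def describe_table(db_path: str, table: str = "model_to_base_model"):
    """Return (column_names, row_count) for a table in a SQLite database."""
    # Table names cannot be bound as SQL parameters, so validate the
    # name before interpolating it into the statements below.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    with closing(sqlite3.connect(db_path)) as conn:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        (row_count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return columns, row_count


# Usage (assumed path, after unzipping PeaTMOSS_DIST.db.zip):
# columns, row_count = describe_table("PeaTMOSS_DIST.db")
# print(columns, row_count)
```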
The datasets and files used for each metric are listed below:
Claim | Metric | Datasets Used | Corresponding Figure | Files Used |
---|---|---|---|---|
C1: Transformers library increases accessibility | Preservation rate of libraries to descendants | PeaTMOSS | Figure 5 | proportion_DirectDescendantsToLibrary.py |
C2: Popularity impacts PTM selection | Turnover of top packages over time | PeaTMOSS, PTMTorrent, HF Model Metadata [1][2], Ecosyste.ms | Figure 6 | turnover.ipynb |
C2: Popularity impacts PTM selection | Number of descendants of top packages | PeaTMOSS | Figure 7 | number_DirectDescendantsToParentModels.py, number_DirectDescendantsToDownloads.py, descendents.py |
C3: Documentation quality influences selection | Documentation quality score | PeaTMOSS, HF Model Metadata [1][2] | Figure 8 | model_cards.py |