Add support for Software Heritage Identifiers (SWHID) as source of repository #219
Comments
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
Hi! This definitely sounds interesting. Thanks for linking to the background documents. For the benefit of anyone short on time, could you perhaps say a bit about how mature SWHID is? E.g. is it quite new or well established, who's supporting it long term, how much uptake has it had in the community (and which communities), etc.? Is this something you want to work on, or are you just drawing this to our attention to stimulate some discussion? If you want some general comments it'd be worth posting on the Jupyter community forum https://discourse.jupyter.org/ where a much wider audience hangs out.
Hi @manics, The Software Heritage project is now pretty well established, and the adoption of SWHIDs for the long-term identification of source code artifacts is getting wider. We currently have the HAL repository, which automatically stores source code coming with papers deposited there (see this one for example), and recently eLife and IPOL have also started using the Software Heritage Archive as a backend for long-term preservation and identification of the software used in their papers, and we are working on getting more communities involved. We are also working on having SWHIDs used for software citation in academic papers. For now, eLife and IPOL are starting to use them, and JTCAM now recommends their usage. @rdicosmo even wrote a biblatex style for SWHID! There is also this paper by the RDA/Force11 Software Source Code Identification Working Group that may be interesting in this regard. As for whether we do it on our side or not, that's something I need to discuss with other SWH team members. One possibility might be to use the Sloan Foundation grant to finance this work (not completely sure we can). I expect it should not be a very long task, but as we all know, it's always more complicated than expected.
@douardda Thanks for the update! Before putting in a grant proposal let's make sure there's a consensus from the maintainers here that it should be added.
We don't have a formal process for accepting new content providers in repo2docker, though since you mention it I think it's something we should consider. We've got our monthly JupyterHub team meeting tomorrow:
I tried it out yesterday to get a feeling for it. I went to https://www.softwareheritage.org/ and scrolled down to the search box. Typed "binder-examples" which took me to https://archive.softwareheritage.org/browse/search/?q=binder-examples&with_visit=true&with_content=true. I selected the first

Then I looked at the API docs to find out how we could automate this. https://archive.softwareheritage.org/api/1/ has a list of all the endpoints. I think https://archive.softwareheritage.org/api/1/vault/directory/doc/ followed by a call to https://archive.softwareheritage.org/api/1/vault/directory/raw/doc/ would be what we need.

Another example I found is https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/binder-examples/r which is the archived version of https://github.com/binder-examples/r. However it looks like the last time it was archived was in 2018. What is the process for getting things archived? Is there a crawler that constantly checks for new things? Do people submit requests for archiving?

One thing I was wondering is who uses SWHIDs right now and how. It would be good to talk to people who use it to retrieve files to learn more about how they use it and how they expect things to work. Overall it reminds me of archive.org but for software.
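For reference, the two vault calls mentioned above could be wired up roughly like this. It is only a minimal sketch: the URL layout (`vault/directory/<dir_id>/` to request a bundle, with a trailing `raw/` to download it) is an assumption based on the two doc pages linked above, and `vault_directory_urls` is a hypothetical helper, not something that exists in repo2docker. It only builds the URLs and makes no network request.

```python
from urllib.parse import urljoin

# Root of the public API, as linked above.
API_ROOT = "https://archive.softwareheritage.org/api/1/"


def vault_directory_urls(dir_id: str) -> dict:
    """Build the vault URLs for one directory hash.

    Assumed layout (not verified against the live API):
      cook: <root>vault/directory/<dir_id>/
      raw:  <root>vault/directory/<dir_id>/raw/
    """
    dir_id = dir_id.strip().lower()
    # Directory identifiers are 40 hexadecimal characters (a SHA-1 style hash).
    if len(dir_id) != 40 or any(c not in "0123456789abcdef" for c in dir_id):
        raise ValueError(f"not a 40-character hex directory id: {dir_id!r}")
    cook = urljoin(API_ROOT, f"vault/directory/{dir_id}/")
    return {"cook": cook, "raw": urljoin(cook, "raw/")}
```

The idea would be to request the "cook" URL first and fetch the "raw" URL once the bundle is ready.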
(FTR, as we discussed these points in the last monthly JupyterHub team meeting.) The idea is indeed to use the SWH public API to let a user give a SWHID as the source of a repository (the same way one can currently use a DOI).
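To make that concrete, recognising a SWHID on input could look something like the sketch below. The core syntax (`swh:1:<type>:<40 hex digits>`, with optional `;key=value` qualifiers) follows the published SWHID format; the names `parse_swhid` and `Swhid` are hypothetical and only here for illustration.

```python
import re
from typing import NamedTuple

# Core SWHID: scheme, version 1, object type, 40 hex digits, optional qualifiers.
SWHID_RE = re.compile(
    r"^swh:1:(?P<type>cnt|dir|rev|rel|snp):(?P<hash>[0-9a-f]{40})"
    r"(?P<qualifiers>(?:;[^;]+)*)$"
)


class Swhid(NamedTuple):
    object_type: str   # cnt, dir, rev, rel or snp
    object_id: str     # 40-character hex hash
    qualifiers: dict   # e.g. {"origin": "https://..."}


def parse_swhid(text: str) -> Swhid:
    """Parse a SWHID string, raising ValueError if it is not well formed."""
    match = SWHID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid SWHID: {text!r}")
    qualifiers = {}
    for item in filter(None, match.group("qualifiers").split(";")):
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = item.partition("=")
        qualifiers[key] = value
    return Swhid(match.group("type"), match.group("hash"), qualifiers)
```

A content provider could call something like this to decide whether the input is a SWHID at all, then branch on `object_type` (for example, only accepting `dir` and `rev`).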
The vault may not be the ideal way of retrieving the directory needed to build the binder execution environment, because it's an asynchronous service. As for delays in archival: yes, they happen. We do our best to keep the lag as small as possible, but we cannot guarantee that a git revision pushed to GitHub will be ingested into the SWH Archive within a given amount of time. To me, the typical use case is more something like this: a user finds the SWHID of a piece of code (in a Jupyter notebook), for example in a scientific paper, and wants to try this notebook.
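To illustrate what "asynchronous" implies for a caller, here is a rough polling sketch. It assumes the vault reports progress through a `status` field whose values include `"done"` and `"failed"`; those names, and the `wait_for_bundle` helper itself, are assumptions for illustration. The actual HTTP call is passed in as `fetch_status`, so nothing here depends on a particular client library.

```python
import time
from typing import Callable


def wait_for_bundle(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the bundle is ready, has failed, or we give up.

    fetch_status is any callable returning the decoded status response
    (assumed to be a dict with a "status" key).
    """
    for attempt in range(max_attempts):
        info = fetch_status()
        status = info.get("status")
        if status == "done":
            return info
        if status == "failed":
            raise RuntimeError("the vault reported that cooking failed")
        # Any other status (still queued or in progress): wait and retry,
        # but do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"bundle not ready after {max_attempts} attempts")
```

A build would block in this loop for an unknown amount of time, which is the main drawback compared with fetching the directory contents synchronously.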
I wrote a quick PR to add support for SWHID in repo2docker, see jupyterhub/repo2docker#988
Add support for SWHID as provider of jupyter notebooks
The Software Heritage project aims at collecting, preserving, and sharing all software that is publicly available in source code form (see the Software Heritage Mission).
To be able to do so, each software source code artifact must be identified by an intrinsic persistent identifier, the SWHID (see also this document).
As a result, as soon as a Jupyter notebook has been harvested and stored in the Software Heritage Archive (be it by the regular scraping process of SWH or because it is the result of a software deposit on an open archive repository like HAL), it would be possible to use binder to run the notebook directly, even if the original source for this code has disappeared.