Add support for Software Heritage Identifiers (SWHID) as source of repository #219

douardda · 2020-11-17T14:33:15Z

Add support for SWHID as provider of jupyter notebooks

The Software Heritage project aims at collecting, preserving, and sharing all software that is publicly available in source code form (see the Software Heritage Mission).

To be able to do so, each software source code artifact must be identified by an intrinsic persistent identifier, the SWHID (see also this document).

As a result, as soon as a Jupyter notebook has been harvested and stored in the Software Heritage Archive (be it by the regular scraping process of SWH or because it is the result of a software deposit on an open archive repository like HAL), Binder could be used to run that notebook directly, even if the original source for this code has disappeared.

welcome · 2020-11-17T14:33:17Z

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.

You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

manics · 2020-11-17T18:54:40Z

Hi! This definitely sounds interesting. Thanks for linking to the background documents. For the benefit of anyone short on time could you perhaps say a bit about how mature SWHID is? E.g. Is it quite new, well established, who's supporting it long term, how much uptake has it had in the community (and which communities), etc?

Is this something you want to work on, or are you just drawing this to our attention to stimulate some discussion? If you want some general comments it'd be worth posting on the Jupyter community forum https://discourse.jupyter.org/ where a much wider audience hangs out.

douardda · 2020-11-18T09:50:16Z

Hi @manics,

The Software Heritage project is now pretty well established, and adoption of SWHIDs for long-term identification of source code artifacts is getting wider. The HAL repository automatically stores the source code coming with papers deposited there (see this one for example), and more recently eLife and IPOL have also started using the Software Heritage Archive as a backend for long-term preservation and identification of the software used in their papers. We are working on getting more communities involved. We are also working on having SWHIDs used for software citation in academic papers: eLife and IPOL are starting to use them, and JTCAM now recommends their usage. @rdicosmo even wrote a biblatex style for SWHID!

There is also this paper by the RDA/Force11 Software Source Code Identification Working Group that may be interesting in this regard.

About doing it on our side or not, it's something I need to discuss with other SWH team members. One possibility might be to use the Sloan Foundation grant to finance this work (not completely sure we can). I expect it should not be a very long task, but as we all know, it's always more complicated than expected.

manics · 2020-11-18T10:47:26Z

@douardda Thanks for the update! Before putting in a grant proposal let's make sure there's a consensus from the maintainers here that it should be added.

douardda · 2020-11-18T13:35:25Z

Yes indeed. For the record, I believe there was a discussion on a similar subject a few years ago between @rdicosmo and @minrk (and maybe others).

Now, what is the proper way of reaching such a consensus? Should I create a discussion on discourse?

manics · 2020-11-18T14:09:11Z

We don't have a formal process for accepting new content providers in repo2docker, though since you mention it I think it's something we should consider.

We've got our monthly JupyterHub team meeting tomorrow:
jupyterhub/team-compass#346
I'll add this issue to the agenda, just so everyone's aware of it. If you're free, you're more than welcome to join the meeting and say a few words about this; just add it to the agenda.

betatim · 2020-11-19T07:21:32Z

I tried it out yesterday to get a feeling for it. I went to https://www.softwareheritage.org/ and scrolled down to the search box. Typed "binder-examples" which took me to https://archive.softwareheritage.org/browse/search/?q=binder-examples&with_visit=true&with_content=true. I selected the first binder-examples/requirements result and ended up here https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/binder-examples/requirements. There is a download button there which would get me that version. I clicked the download button but that showed me a message "Archive cooking service is currently experiencing issues. Please try again later.".

Then I looked at the API docs to find out how we could automate this. https://archive.softwareheritage.org/api/1/ has a list of all the endpoints. I think https://archive.softwareheritage.org/api/1/vault/directory/doc/ followed by a call to https://archive.softwareheritage.org/api/1/vault/directory/raw/doc/ would be what we need.
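To make the two-step vault idea concrete, here is a minimal sketch of "ask for the directory to be cooked, then download it". It is only an illustration under assumptions: the URL patterns are inferred from the two doc pages linked above, the `status` values (`done`, `failed`) are assumed from those docs, and `post`/`get` are hypothetical caller-supplied helpers that perform the HTTP request and return the decoded JSON.

```python
import time

API = "https://archive.softwareheritage.org/api/1"


def vault_urls(dir_id):
    """Return (cook/status URL, raw download URL) for a directory hash.

    The URL layout is an assumption based on the vault doc pages.
    """
    base = f"{API}/vault/directory/{dir_id}/"
    return base, base + "raw/"


def cooked_tarball_url(dir_id, post, get, poll_seconds=5, max_polls=60):
    """Request cooking of a directory and wait until it can be downloaded.

    `post` and `get` are hypothetical callables: they take a URL and return
    the JSON response as a dict (e.g. thin wrappers around an HTTP client).
    """
    status_url, raw_url = vault_urls(dir_id)
    status = post(status_url).get("status")  # start (or reuse) a cooking task
    polls = 0
    while status not in ("done", "failed"):
        if polls >= max_polls:
            raise TimeoutError(f"vault still cooking directory {dir_id}")
        time.sleep(poll_seconds)
        status = get(status_url).get("status")  # poll the same URL for progress
        polls += 1
    if status == "failed":
        raise RuntimeError(f"vault could not cook directory {dir_id}")
    return raw_url
```

The polling loop is what makes this flow awkward for an interactive build: the caller has to wait for the archive to be prepared before anything can be downloaded.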

Another example I found is https://archive.softwareheritage.org/browse/origin/directory/?origin_url=https://github.com/binder-examples/r which is the archived version of https://github.com/binder-examples/r. However it looks like the last time it was archived was in 2018. What is the process for getting things archived? Is there a crawler that constantly checks for new things? Do people submit requests for archiving?

One thing I was wondering is who uses SWHIDs right now and how. It would be good to talk to people who use it to retrieve files to learn more about how they use it and how they expect things to work.

Overall it reminds me of archive.org but for software.

douardda · 2020-11-20T10:02:46Z

(FTR as we discussed these points in the last monthly JupyterHub team meeting)

The idea is indeed to use the SWH public API to let a user use a SWHID as source of repository (the same way one can currently use a DOI).
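As a rough sketch of what "accepting a SWHID as a source" could mean in practice, the first step is simply recognising that the user-supplied string is a SWHID rather than a URL or a DOI. The function name and the returned dict layout below are hypothetical and not taken from repo2docker; the pattern follows the published `swh:1:<type>:<40 hex digits>` core identifier format with optional `;`-separated qualifiers.

```python
import re

# Core SWHID: scheme, version 1, object type, 40-char hex hash,
# optionally followed by ";"-separated qualifiers (origin=..., path=..., ...).
SWHID_RE = re.compile(
    r"^swh:1:(?P<type>cnt|dir|rev|rel|snp):(?P<hash>[0-9a-f]{40})"
    r"(?:;(?P<qualifiers>.+))?$"
)


def detect_swhid(source):
    """Return a small spec dict if `source` looks like a SWHID, else None.

    Hypothetical helper for illustration only.
    """
    match = SWHID_RE.match(source.strip())
    if match is None:
        return None
    return {
        "type": match.group("type"),
        "hash": match.group("hash"),
        "qualifiers": match.group("qualifiers"),  # raw string, or None
    }
```

Anything that does not match returns `None`, so other kinds of sources (Git URLs, DOIs) would be left for their own handlers.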

> Then I looked at the API docs to find out how we could automate this. https://archive.softwareheritage.org/api/1/ has a list of all the endpoints. I think https://archive.softwareheritage.org/api/1/vault/directory/doc/ followed by a call to https://archive.softwareheritage.org/api/1/vault/directory/raw/doc/ would be what we need.

The vault may not be the ideal way of retrieving the directory needed to build the binder execution environment because it's an asynchronous service.
In this case, I think it would be better to use the API to list the content of the directory the given SWHID refers to: either directly if the SWHID is a reference to a directory (swh:1:dir:), or via the directory linked to the revision if it's a revision (swh:1:rev:). (For other SWHID types, it will depend on other aspects, like whether there is enough context to get an unambiguous directory that can be retrieved from that SWHID.)
Then retrieve the directory content using API calls.
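The approach described above could be sketched roughly as follows. This is not the code from any PR, just an illustration under assumptions: the endpoint paths (`/revision/<id>/`, `/directory/<id>/`, `/content/sha1_git:<id>/raw/`) and the response fields (`directory`, `name`, `type`, `target`) are my reading of the public API docs, and `get_json` is a hypothetical caller-supplied function that fetches a URL and returns the decoded JSON.

```python
API = "https://archive.softwareheritage.org/api/1"


def resolve_directory(object_type, object_id, get_json):
    """Map a dir or rev SWHID to the hash of the directory to fetch."""
    if object_type == "dir":
        return object_id
    if object_type == "rev":
        # A revision is assumed to expose the hash of its root directory.
        return get_json(f"{API}/revision/{object_id}/")["directory"]
    raise ValueError(f"cannot derive a directory from a {object_type!r} SWHID")


def walk_directory(dir_id, get_json, prefix=""):
    """Yield (relative path, raw content URL) for every file, recursively."""
    for entry in get_json(f"{API}/directory/{dir_id}/"):
        path = prefix + entry["name"]
        if entry["type"] == "dir":
            yield from walk_directory(entry["target"], get_json, path + "/")
        elif entry["type"] == "file":
            yield path, f"{API}/content/sha1_git:{entry['target']}/raw/"
        # Other entry types (e.g. submodule-like revisions) are skipped here.
```

Unlike the vault, every call here is synchronous, at the cost of one request per directory plus one per file.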

About delays in the archival, yes they happen. We do our best to keep the lag as small as possible, but we cannot guarantee a git revision pushed on github will be gathered in the SWH Archive in a given amount of time.

The typical use case to me is more something like this: a user finds the SWHID of a piece of code (a Jupyter notebook), for example in a scientific paper, and wants to try this notebook.

douardda · 2020-11-26T14:23:51Z

I wrote a quick PR to add support for SWHID in repo2docker, see jupyterhub/repo2docker#988

douardda added the enhancement label Nov 17, 2020
