What do researchers search for when looking for code repositories? #1

Open
lukecoy opened this issue Sep 17, 2015 · 14 comments

@lukecoy
Contributor

lukecoy commented Sep 17, 2015

From README

This project’s objective is to create an open source web dashboard capable of searching multiple code hosting services for the benefit of the research community

Here are a couple of questions to start the discussion about what would make a Software Discovery Dashboard most useful for researchers:

  • What are valuable search criteria when attempting to discover code repositories?
  • What kind of information would researchers find necessary (or just helpful) in search results?
@lukecoy changed the title from "Researcher Feedback" to "What do researchers search for when looking for code repositories?" on Sep 17, 2015
@versae

versae commented Sep 18, 2015

It would be great if the Software Discovery Dashboard included options to search for reference implementations of published papers by looking up the authors' names, DOIs, or paper titles.

License and language would also be interesting.
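To make the suggestion concrete, here is a minimal sketch of what such a lookup could look like. Everything in it is an assumption for illustration: the `RepoRecord` fields, the `search` function, and the sample entries are made up and do not describe the dashboard's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RepoRecord:
    """Hypothetical index entry linking a repository to the paper it implements."""
    url: str
    paper_title: str
    paper_doi: str
    authors: List[str]
    license: str
    language: str


def search(records, doi=None, author=None, title=None, license=None, language=None):
    """Return the records that match every criterion that was supplied.

    DOI, license and language are compared exactly (case-insensitive);
    author and title are matched as case-insensitive substrings.
    """
    def matches(r):
        return (
            (doi is None or r.paper_doi.lower() == doi.lower())
            and (author is None or any(author.lower() in a.lower() for a in r.authors))
            and (title is None or title.lower() in r.paper_title.lower())
            and (license is None or r.license.lower() == license.lower())
            and (language is None or r.language.lower() == language.lower())
        )
    return [r for r in records if matches(r)]


# Made-up sample data (placeholder URLs and DOIs), for illustration only.
RECORDS = [
    RepoRecord("https://example.org/repo-a", "A Method for X", "10.0000/example.1",
               ["A. Author", "B. Writer"], "MIT", "Python"),
    RepoRecord("https://example.org/repo-b", "Another Method for Y", "10.0000/example.2",
               ["C. Coder"], "GPL-3.0", "R"),
]

print([r.url for r in search(RECORDS, doi="10.0000/example.1")])
print([r.url for r in search(RECORDS, author="coder", language="R")])
```

Treating every criterion as optional lets one function serve "find by DOI" and "find R code by this author" alike. A real service would likely need fuzzier author and title matching.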

@pdurbin

pdurbin commented Sep 18, 2015

@acabunoc asked me to repeat what I said in #sciencelab: https://dataverse.harvard.edu has a fair amount of R code, and Stata too.

@okdistribute

What are valuable search criteria when attempting to discover code repositories?

  • I've heard that searching for tables by 'data type' is useful, but doesn't really exist. This would require better practice around schema creation and publishing alongside raw data, though.

What kind of information would researchers find necessary (or just helpful) in search results?

  • file size, estimated download time, # of rows/files, keywords. Inspired by opendatacache.com by @talos

@yarikoptic

I usually run "apt-cache search" first to find software at my fingertips. If it is not there, I google it. In the neuroscience/neuroimaging domain there are also NIF (http://www.neuinfo.org/) and NITRC (http://nitrc.org), which collate/host various related software projects. Google at times leads me there ;)

As for software implementing some publication/method: we have plans (but not enough people yet) to add centralized reporting to duecredit (https://github.com/duecredit/duecredit/), so that later you would be able to find software implementing some referenced publication.

@pdurbin

pdurbin commented Sep 18, 2015

@arfon is thinking about the related area of software citation: https://twitter.com/arfon/status/628504262121816064

@mbjones

mbjones commented Sep 18, 2015

Re-usable packages from CRAN, PyPI, etc. are one thing. The actual scripts researchers write and use in analysis are another. People are now archiving analytical code in R, Matlab, and other languages into various data repositories such as the KNB and FigShare as part of their archived data packages. Here's an example of such a package with R code, which has very minimal metadata about the software.

For this type of code in the KNB (and DataONE) it would be useful to be able to search for software used in analyses based on a classification of the type of analysis that was done, who created it, which papers it was used in, etc. Some (idiosyncratic) example queries researchers might want include:

  • What software was used to produce the results from the paper with identifier {DOI}?
  • Which derived products (data sets, figures, etc) were created using {analysis type}?
    • example analysis types: MCMC, logistic regression, ANOVA
  • What software was used by researcher {name or ORCID}?
  • What software can process data from {format} to {format}?
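The four example queries above can be sketched against a simple provenance table. This is only an illustration under stated assumptions: the `AnalysisRun` record, its fields, the helper functions, and all sample values (file names, DOIs, ORCID-style identifiers) are hypothetical and not taken from the KNB or DataONE metadata model.

```python
from typing import NamedTuple, Optional


class AnalysisRun(NamedTuple):
    """Hypothetical provenance record: one use of a piece of software."""
    software: str
    researcher: str            # name or ORCID
    paper_doi: Optional[str]   # None if the run is not tied to a paper
    analysis_type: str         # e.g. "MCMC", "logistic regression", "ANOVA"
    input_format: str
    output_format: str
    product: str               # derived data set, figure, etc.


# Made-up records with placeholder identifiers, for illustration only.
RUNS = [
    AnalysisRun("fit_model.R", "0000-0000-0000-0001", "10.0000/example.1",
                "MCMC", "csv", "rds", "posterior_samples.rds"),
    AnalysisRun("plot_results.R", "0000-0000-0000-0001", "10.0000/example.1",
                "ANOVA", "rds", "png", "figure2.png"),
    AnalysisRun("convert.py", "0000-0000-0000-0002", None,
                "format conversion", "netcdf", "csv", "stations.csv"),
]


def software_for_paper(runs, doi):
    """What software was used to produce the results from the paper {DOI}?"""
    return sorted({r.software for r in runs if r.paper_doi == doi})


def products_for_analysis(runs, analysis_type):
    """Which derived products were created using {analysis type}?"""
    wanted = analysis_type.lower()
    return sorted({r.product for r in runs if r.analysis_type.lower() == wanted})


def software_by_researcher(runs, who):
    """What software was used by researcher {name or ORCID}?"""
    return sorted({r.software for r in runs if r.researcher == who})


def converters(runs, src, dst):
    """What software can process data from {format} to {format}?"""
    return sorted({r.software for r in runs
                   if r.input_format == src and r.output_format == dst})


print(software_for_paper(RUNS, "10.0000/example.1"))
print(converters(RUNS, "netcdf", "csv"))
```

The point of the sketch is that all four questions reduce to filters over the same record, so the hard part is capturing that metadata when the code is archived, not querying it.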

@schae234

For us (computational biologists) at least, most of the time it's method driven. We want to answer such-and-such a question and have heard that method X is a good one, or that method Y overcomes difficulties that method X does not. The starting point is then the literature, and we just hope that the code is available somewhere online.

I imagine a useful dashboard for computational biologists might contain topics broken down by methods and then by implementation. For example:

  • GWAS
    • Mixed Models
    • Plotting
    • ...
  • NGS
    • aligners
    • RNASeq
    • ...
  • read mapping
    ...
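One way to picture the breakdown above is as a nested mapping from topic to method to implementations. The sketch below is an assumption, not a proposed schema: the topic and method names are copied from the outline, while `register`, `browse`, and the example URL are hypothetical.

```python
# Hypothetical two-level taxonomy: topic -> method/category -> implementations.
# Implementation lists start empty because the outline above names none.
TAXONOMY = {
    "GWAS": {"Mixed Models": [], "Plotting": []},
    "NGS": {"aligners": [], "RNASeq": []},
    "read mapping": {},
}


def register(taxonomy, topic, category, repo_url):
    """File a repository under topic/category, creating either level if missing."""
    taxonomy.setdefault(topic, {}).setdefault(category, []).append(repo_url)


def browse(taxonomy):
    """Yield (topic, category, repo_url) rows; empty categories yield repo_url=None."""
    for topic, categories in taxonomy.items():
        for category, repos in categories.items():
            if not repos:
                yield topic, category, None
            for url in repos:
                yield topic, category, url


# Placeholder URL, for illustration only.
register(TAXONOMY, "GWAS", "Mixed Models", "https://example.org/gwas-tool")

for row in browse(TAXONOMY):
    print(row)
```

Keeping empty categories visible when browsing would also show where no implementation has been found yet.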

@zmughal

zmughal commented Sep 19, 2015

You might also want to take a look at this idea from the Scholar Ninja project, http://juretriglav.si/discovery-of-scientific-software/, which recommends scientific software while you browse GitHub by extracting software citations from papers.

@blahah

blahah commented Sep 21, 2015

I have three routes to finding relevant software:

  1. To do a particular kind of analysis, I go in search of the right tool. In this case, I read the literature first. Then I read blogs, forums, BioStar, and search Twitter. And I ask people whose depth of knowledge I respect.
  2. Something comes to my attention passively (via a mention on Twitter, someone starring a repo on GitHub, it reaching the front page of Hacker News, etc.)
  3. Doing something non-scientific, or not specifically scientific. In this case I actually search for packages or code, usually on RubyGems or npm, sometimes on GitHub, or sometimes on Google by combining keywords about the language with keywords about the functionality I want.

Actually very rarely will I search for scientific code, because unless it is some sort of general utility or plumbing, I care first about whether the underlying method is good, then about whether it is implemented well.

There are many sites that attempt to categorise or provide search of scientific software, but mostly they are much harder to use than Google.

@schae234

@blahah, we are mainly driven by method too; you succinctly summarized our approach in your post. Out of curiosity, what is your main 'branch' of research? We are mainly genetics and systems biology. I'm wondering whether workflows differ much between disciplines. Do the physical sciences have organizational approaches the biological sciences don't?

@blahah

blahah commented Sep 21, 2015

@schae234 computational biology / genomics here, so we overlap considerably I would think.

@npch

npch commented Sep 21, 2015

Some initial thoughts:

  • Does it work with files of format XXX?
  • Does it implement important-algorithm-in-my-field XXX?
  • Does it work on platform XXX? (Where XXX is increasingly R, Galaxy, etc.)
  • What's the license on the code?
  • When was it last updated? (For some value of "freshness")
  • Is there an associated paper showing off scientific results produced using the software?
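As a rough sketch, the questions above could be answered per candidate by a small checklist function. All names here (`Candidate`, `checklist`, the field names, the 365-day default for "freshness", and the sample tool) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical metadata a dashboard might hold for one piece of software."""
    name: str
    formats: List[str]
    algorithms: List[str]
    platforms: List[str]
    license: str
    last_updated: date
    paper_doi: Optional[str] = None


def checklist(c, fmt, algorithm, platform, today, max_age_days=365):
    """Answer each question in the list above for a single candidate."""
    return {
        "supports_format": fmt in c.formats,
        "implements_algorithm": algorithm in c.algorithms,
        "runs_on_platform": platform in c.platforms,
        "license": c.license,
        # "Freshness" is taken here to mean: updated within max_age_days of today.
        "is_fresh": (today - c.last_updated).days <= max_age_days,
        "has_paper": c.paper_doi is not None,
    }


# Made-up candidate, for illustration only.
tool = Candidate("example-tool", ["fastq", "bam"], ["hypothetical-aligner"],
                 ["R", "Galaxy"], "MIT", date(2015, 6, 1))

report = checklist(tool, "bam", "hypothetical-aligner", "Galaxy",
                   today=date(2015, 9, 21))
print(report)
```

Passing `today` explicitly keeps the freshness check reproducible, and `max_age_days` makes "some value of freshness" something the searcher can choose.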

Also, I had more general thoughts about this area in the following two articles:

@amb8805
Contributor

amb8805 commented Sep 24, 2015

Thanks, everyone, for the input. It gives us more context and a better idea of how to approach the problem. Please watch for new issues as we learn more and could use more informed input.

@blahah what kind of overhead is there with the existing research software search services that makes them hard to use? Could you give an example?

@bunnybooboo

Comparative dashboard suggesting similar tools. Information surrounding licensing, open/proprietary/free, update activity, github repo, programming language, API, gallery of examples/use cases, data footprint, minimum spec, ratings, frameworks that also incorporate this tool, automation possibilities.

@varzaman closed this as completed on Oct 6, 2015
@varzaman reopened this on Oct 6, 2015
@varzaman modified the milestone: Sprint 1 on Oct 6, 2015