aaronsteers changed the title from "Pull list of _all_ public repositories using the repositories endpoint" to "Pull index of all public repositories using the repositories endpoint" on Apr 24, 2021
I'd like to pull a list of all public repositories using the repositories endpoint described here: https://docs.github.com/en/rest/reference/repos#list-public-repositories
API info
I've tested the endpoint with some success. The incremental replication key is 'since', which accepts not a datetime but an incrementing integer identity column (the repository ID) that is assigned to each repository as it is created. Apparently, this key can be used to paginate through all public repos on GitHub.
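A minimal sketch of how paginating on 'since' might look, assuming each page is a list of records with an integer "id" field and that an empty page means the end of the index. The names iter_public_repos and fetch_page are hypothetical; in a real tap, fetch_page would wrap a GET to https://api.github.com/repositories?since=<id>, and here it is injected so the loop can be exercised without network access.

```python
from typing import Callable, Dict, Iterator, List, Optional

Repo = Dict[str, object]


def iter_public_repos(
    fetch_page: Callable[[int], List[Repo]],
    since: int = 0,
    max_pages: Optional[int] = None,
) -> Iterator[Repo]:
    """Yield repo records page by page, advancing 'since' to the highest ID seen.

    fetch_page is a hypothetical callable that returns the repos whose ID is
    greater than the given 'since' value (an assumption for this sketch).
    """
    pages = 0
    while max_pages is None or pages < max_pages:
        batch = fetch_page(since)
        if not batch:
            # An empty page is treated as the end of the index.
            return
        for repo in batch:
            yield repo
        # The highest ID in the batch becomes the next 'since' value; it is
        # also what a tap could store as its bookmark between runs.
        since = max(int(r["id"]) for r in batch)
        pages += 1


def fake_fetch(since: int) -> List[Repo]:
    """Stand-in for the API: seven repos, served three per page."""
    ids = [i for i in range(1, 8) if i > since][:3]
    return [{"id": i, "name": f"repo-{i}"} for i in ids]


if __name__ == "__main__":
    for repo in iter_public_repos(fake_fetch):
        print(repo["id"], repo["name"])
```

Because the key is an ID rather than a timestamp, a resumed run would only pick up repos created after the stored bookmark, not changes to repos already seen.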
Use case
The use case here would be to collect repo IDs that we could then use to collect follow-up metrics - specifically for repos that match the naming conventions for Singer plugins and forks: tap-<something>, target-<something>, pipelinewise-tap-<something>, etc. Once we collect the repo names and IDs, we would collect and aggregate additional GitHub metrics on usage, commits, etc.
More info here: https://gitlab.com/meltano/singerhub/-/issues/3 and https://gitlab.com/meltano/singerhub/-/issues/11
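As a rough illustration of the name filter, here is a sketch that matches the three conventions listed above. The pattern and the function name is_singer_plugin are hypothetical; allowing a pipelinewise- prefix on targets as well as taps, and matching case-insensitively, are assumptions that the "etc." above does not spell out.

```python
import re

# Optional "pipelinewise-" prefix, then "tap-" or "target-", then at least
# one more character for the <something> part.
SINGER_NAME = re.compile(r"^(?:pipelinewise-)?(?:tap|target)-.+$")


def is_singer_plugin(repo_name: str) -> bool:
    """Return True if the bare repo name (no owner prefix) fits the conventions."""
    return bool(SINGER_NAME.match(repo_name.lower()))


if __name__ == "__main__":
    for name in ["tap-github", "target-postgres", "pipelinewise-tap-mysql", "laptop-stand"]:
        print(name, is_singer_plugin(name))
```

Anchoring the pattern at the start of the name keeps out repos that merely contain "tap-" somewhere in the middle.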
This is part of our initiative to make taps more discoverable and to help Singer/Stitch/Meltano community members more quickly locate and evaluate taps and targets from the large (and growing) list available.
New vs existing tap
I know the paradigm we have here in this tap expects a set of specific repos to extract, and this application would break with that paradigm. If it is preferable to spin this off as a separate tap, I would understand that argument and in that case would likely try to spin off a fork specifically for the purpose of parsing the github index (maybe tap-github-index?).
Expected volume of data
The volume of data is large, but not prohibitively so.
Looks like approximately 48 million public repos according to a quick github search: https://github.com/search?q=is:public
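For a back-of-the-envelope feel of what that volume means for a full crawl, here is a small sketch. The page size of 100 records and the budget of 5,000 requests per hour are assumptions on my part, not figures measured in my tests, and estimate_crawl_hours is a hypothetical helper.

```python
import math


def estimate_crawl_hours(total_repos: int, page_size: int, requests_per_hour: int) -> float:
    """Hours needed to page through the whole index at a fixed request budget."""
    requests_needed = math.ceil(total_repos / page_size)
    return requests_needed / requests_per_hour


if __name__ == "__main__":
    # 48 million repos is the search-based estimate above; the other two
    # values are assumptions for illustration.
    hours = estimate_crawl_hours(48_000_000, page_size=100, requests_per_hour=5_000)
    print(f"{hours:.0f} hours")
```

Under those assumptions the initial full sync is a matter of days rather than weeks, and later incremental runs would only need to cover repos created since the last bookmark.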
This is up from approximately 28 million a year ago.
Sample record: