- Prepare a Python environment and install the package.
- For development, use:

  ```bash
  pip install -r requirements.dev.txt
  ```
- Copy `sample.env` to `dev.env`, fill it with the required variables, and source it:

  ```bash
  export GITHUB_TOKEN="<your GitHub token>"
  export LANGUAGES_DIR="<prepared directory with languages>"
  export MIRROR_CRAWL_INTERVAL_SECONDS=1
  export MIRROR_CRAWL_MIN_RATE_LIMIT=500
  export MIRROR_CRAWL_BATCH_SIZE="<how often to save data>"
  export MIRROR_CRAWL_DIR="<where to save crawled data>"
  ```
- To avoid being blocked by GitHub, keep a rate-limit watcher running (a Python alternative is sketched after this list):

  ```bash
  watch -d -n 5 'curl https://api.github.com/rate_limit -s -H "Authorization: Bearer $GITHUB_TOKEN" -H "Accept: application/vnd.github.v3+json"'
  ```
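If you prefer polling from Python, here is a minimal sketch of the same check; it assumes the `requests` package is installed and `GITHUB_TOKEN` is exported as above:

```python
import os
import time

import requests

# Poll the GitHub rate-limit endpoint every 5 seconds,
# mirroring the `watch ... curl` command above.
headers = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github.v3+json",
}
while True:
    core = requests.get(
        "https://api.github.com/rate_limit", headers=headers
    ).json()["resources"]["core"]
    print(f"remaining: {core['remaining']}/{core['limit']}, resets at {core['reset']}")
    time.sleep(5)
```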
List the available commands:

```bash
python -m mirror.cli --help
```

```text
clone     Clone repos from search api to output dir.
commits   Read repos json file and upload all commits for that repos one.
crawl     Processes arguments as parsed from the command line and uses.
nextid    Prints ID of most recent repository crawled and written to the.
sample    Writes repositories sampled from a crawl directory to an output.
search    Crawl via search api.
validate  Prints ID of most recent repository crawled and written to the.
```
Run the `crawl` command to extract the metadata of all repositories and save it in a `.json` file:
```bash
python -m mirror.cli crawl \
    --crawldir "$MIRROR_CRAWL_DIR" \
    --interval "$MIRROR_CRAWL_INTERVAL_SECONDS" \
    --min-rate-limit "$MIRROR_CRAWL_MIN_RATE_LIMIT" \
    --batch-size "$MIRROR_CRAWL_BATCH_SIZE"
```
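The exact layout of the saved batches is defined by the tool; as a hedged illustration only, assuming each batch file is a JSON array of repository objects shaped like the GitHub REST API payload, you could inspect the first batch like this:

```python
import json
import os
from pathlib import Path

# Assumption: each batch file in $MIRROR_CRAWL_DIR is a JSON array of
# repository objects shaped like the GitHub REST API repository payload.
crawl_dir = Path(os.environ["MIRROR_CRAWL_DIR"])
for batch_file in sorted(crawl_dir.glob("*.json")):
    repos = json.loads(batch_file.read_text())
    for repo in repos[:3]:
        print(repo["id"], repo["full_name"], repo.get("language"))
    break  # inspect only the first batch
```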
If you only need a small pool of repositories for analysis, you can set more precise criteria via the `search` command:
```bash
python -m mirror.cli search --crawldir "$MIRROR_CRAWL_DIR/search" -l "python" -st ">500" -r 5
```
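Under the hood this corresponds to a GitHub search API query; the sketch below shows the equivalent raw request (the mapping of `-l`/`-st` to the standard `language:`/`stars:` search qualifiers is an assumption, not taken from the tool's source):

```python
import os

import requests

# Search for Python repositories with more than 500 stars,
# mirroring `-l "python" -st ">500"` above.
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "language:python stars:>500", "per_page": 5},
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github.v3+json",
    },
)
for item in resp.json()["items"]:
    print(item["full_name"], item["stargazers_count"])
```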
The `clone` command runs the standard `git clone` over the search results and clones the repositories to the local machine:
```bash
python -m mirror.cli clone --crawldir "$LANGUAGES_DIR" -s ">500" -ls "python"
```
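Conceptually, the command boils down to one `git clone` per search result into a per-language folder; a minimal sketch of that idea (the `repos` list below is a hypothetical stand-in for actual search results):

```python
import os
import subprocess
from pathlib import Path

# Hypothetical stand-in for search results: (language, clone URL) pairs.
repos = [
    ("python", "https://github.com/pallets/flask.git"),
]
languages_dir = Path(os.environ["LANGUAGES_DIR"])
for language, url in repos:
    name = url.rstrip("/").removesuffix(".git").rsplit("/", 1)[-1]
    dest = languages_dir / language / name
    dest.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)
```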
Structure of the `$LANGUAGES_DIR` directory:

```text
$LANGUAGES_DIR
├── language 1
│   ├── repo 1
│   ├── repo 2
│   └── ...
├── language 2
│   ├── repo 1
│   ├── repo 2
│   └── ...
└── ...
```
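With that layout, iterating over the cloned corpus is a two-level directory walk:

```python
import os
from pathlib import Path

# Walk $LANGUAGES_DIR/<language>/<repo> as laid out above.
languages_dir = Path(os.environ["LANGUAGES_DIR"])
for language_dir in sorted(p for p in languages_dir.iterdir() if p.is_dir()):
    for repo_dir in sorted(p for p in language_dir.iterdir() if p.is_dir()):
        print(language_dir.name, repo_dir.name)
```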
There is also a way to download popular repositories with Python code; see the example in `ex_clone.py`.
The `commits` command extracts all commits from each repository and saves a `.json` file with the commits of each repository:

```bash
python -m mirror.cli commits \
    --crawldir "$MIRROR_CRAWL_DIR" \
    --repos-file "<path to crawled repos .json file>"
```
The `mirror.github.utils` helper creates a `.csv` file with a flattened JSON structure:

```bash
python -m mirror.github.utils \
    --json-files-folder "$MIRROR_CRAWL_DIR" \
    --output-csv "$MIRROR_CRAWL_DIR/output.csv" \
    --command commits
```
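As an illustration of the flattening step only (not the utility's actual implementation), here is a sketch using `pandas.json_normalize`, assuming each `.json` file holds a list of commit objects:

```python
import json
import os
from pathlib import Path

import pandas as pd

# Assumption: each .json file under the folder is a list of commit objects.
folder = Path(os.environ["MIRROR_CRAWL_DIR"])
records = []
for json_file in folder.glob("*.json"):
    records.extend(json.loads(json_file.read_text()))

# json_normalize flattens nested keys into dotted column names.
df = pd.json_normalize(records)
df.to_csv(folder / "output.csv", index=False)
```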