GitHub is one of the biggest open source communities, and a large proportion of programmers have a GitHub account. As a result, we think GitHub users are a good representation of programmers in general.
For our project, we will gather as much GitHub user information as possible, including:
- name
- location
- repositories
We will also get data on the economy, population, and high-tech industry from:
- OECD database
- U.S. Bureau of Labor Statistics
We would like to retrieve the data into a local database and clean it.
The analysis will include two parts.
- GitHub user's biography: We are trying to understand where users are, what institutions they work for, and how they introduce themselves and their repositories.
- GitHub user's repositories: We are looking at what languages they use and the characteristics of their repositories.
We are trying to predict the popularity of a user's repository based on the given user and that user's other repositories.
We would also like to predict a user's identity from information about their repositories.
We use regressions and simple machine learning techniques for the predictions above.
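As a rough illustration of the regression idea, the sketch below fits a linear model in base R. Everything specific here is an assumption and not taken from the project: popularity is measured as a star count, the predictors are the owner's follower count and the mean stars of the owner's other repositories, and the numbers are made-up toy values used only to make the example runnable.

```r
# Made-up toy data, for illustration only: one row per repository.
repos <- data.frame(
  stars            = c(3, 10, 25, 40, 8, 60, 15, 5),
  owner_followers  = c(5, 20, 50, 90, 12, 150, 30, 8),
  other_repo_stars = c(1, 4, 12, 18, 5, 30, 6, 2)  # mean stars of the owner's other repos
)

# log1p() keeps zero counts valid and dampens large counts (a modelling assumption).
fit <- lm(log1p(stars) ~ log1p(owner_followers) + log1p(other_repo_stars),
          data = repos)

# Predict the popularity of a hypothetical new repository,
# then map the prediction back to the star scale with expm1().
new_repo <- data.frame(owner_followers = 40, other_repo_stars = 10)
predicted_stars <- expm1(predict(fit, newdata = new_repo))
```

With real data, `repos` would be built from the cleaned database, and the fit would need to be checked on held-out repositories before trusting any prediction.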
Tools we use:
- GitHub API
- R Language
- MongoDB
We will try our best to keep our code in R. Since some data might not be available through the GitHub API, we might use third-party GitHub data, such as GHTorrent and GHArchive. However, we would rather fetch the data ourselves so that the process of getting unbiased data is completely clear to us.
Data_Collector.rmd tries to fetch all users by traversing the graph of users.
In the `set up` block, I start populating the database with my own GitHub account.
I store myself and my friends (followers and followings) in the unfetched collection.
Then I go through the unfetched collection and add each person's friends to the unfetched collection.
After processing a person, I move that person to the user collection.
Consequently, people in user are those who have been processed, while those in unfetched still need to be processed.
Each pass through the unfetched collection moves more people into the user collection.
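The traversal described above can be sketched in base R with in-memory vectors standing in for the two MongoDB collections. This is only a sketch under stated assumptions: `friends_of`, `get_friends`, `crawl_pass`, and the logins are hypothetical names invented for the example, and the toy graph replaces the real GitHub API calls made in Data_Collector.rmd.

```r
# Toy follow graph standing in for GitHub API responses (hypothetical logins).
friends_of <- list(
  alice = c("bob", "carol"),
  bob   = c("alice", "dave"),
  carol = c("alice"),
  dave  = c("bob", "erin"),
  erin  = c("dave")
)

# Hypothetical helper: the real script would ask the GitHub API
# for a user's followers and followings instead.
get_friends <- function(login) {
  found <- friends_of[[login]]
  if (is.null(found)) character(0) else found
}

# One pass over the people currently in unfetched: queue each person's
# unseen friends, then move that person from unfetched to user.
crawl_pass <- function(state) {
  batch <- state$unfetched
  for (login in batch) {
    known <- c(state$user, state$unfetched)
    new_people <- setdiff(get_friends(login), known)
    state$unfetched <- c(setdiff(state$unfetched, login), new_people)
    state$user <- c(state$user, login)
  }
  state
}

# Set up: seed with one account and its friends, then repeat passes.
state <- list(user = character(0),
              unfetched = c("alice", get_friends("alice")))
while (length(state$unfetched) > 0) {
  state <- crawl_pass(state)
}
```

On the toy graph the loop ends once everyone reachable from the seed is in `user`. On the real GitHub graph the unfetched set keeps growing for a long time, which is why the fetch block is run repeatedly instead of to completion.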
- Replace the GitHub API key in the Rmd file with your own key.
- Set up a MongoDB instance and note its URL and port. Then connect to your MongoDB.
- Go to the code block named `set up`. Replace the API URL of the author with that of whichever GitHub user you like. If you decide not to change it, the data might be biased, since I am Chinese and my followers and followings are also Chinese. If you do not run `start fetch` multiple times, you might be restricted to a Chinese GitHub community.
- Run the rest of the code.
- Run the last block, named `start fetch`, multiple times to populate the database.
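The configuration steps above might look roughly like the following in R. This is a sketch, not the code from the Rmd file: the helper names, the `GITHUB_TOKEN` environment variable, the `github` database name, and the default host and port are all assumptions, and it presumes the mongolite package is used to talk to MongoDB.

```r
# Build a MongoDB connection URL from host and port (defaults are assumptions).
mongo_url <- function(host = "localhost", port = 27017) {
  sprintf("mongodb://%s:%d", host, as.integer(port))
}

# Read the API key from an environment variable (hypothetical name)
# so the key does not have to be written into the file.
github_token <- Sys.getenv("GITHUB_TOKEN", unset = "")

# API URL of the seed user that the crawl starts from.
seed_url <- function(login) {
  paste0("https://api.github.com/users/", login)
}

# Open the two collections described above. Requires the mongolite
# package and a running MongoDB server, so it is defined but not called here.
connect_collections <- function(url = mongo_url(), db = "github") {
  list(
    user      = mongolite::mongo(collection = "user", db = db, url = url),
    unfetched = mongolite::mongo(collection = "unfetched", db = db, url = url)
  )
}
```

Calling `connect_collections(mongo_url("localhost", 27017))` would then return handles for both collections, and `seed_url("your-login")` gives the URL to place in the `set up` block.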