GitHub User Analysis

Members: Xin Guan | Ziqian Ge

GitHub is one of the largest open source communities, and a large proportion of programmers have a GitHub account. As a result, we treat GitHub users as a reasonable sample of programmers in general.

Data sources

For our project, we will gather as much information about GitHub users as possible, including:

name
location
email
repositories
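
As a rough illustration of where these fields come from, the sketch below reads one public profile from the GitHub REST API endpoint `/users/{login}` and keeps only the fields listed above. It is written in Python for brevity and is not the project's R code. The helper names `extract_profile` and `fetch_profile` are made up for this example, and the sample payload is hand-written, not real data.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

# Fields of interest. "repositories" is represented here by the public
# repository count and the URL of the user's repository listing.
PROFILE_FIELDS = ("login", "name", "location", "email", "public_repos", "repos_url")


def extract_profile(user_json):
    """Keep only the fields of interest; missing keys become None."""
    return {field: user_json.get(field) for field in PROFILE_FIELDS}


def fetch_profile(login):
    """Download one public profile (unauthenticated, so heavily rate-limited)."""
    request = urllib.request.Request(
        f"{API_ROOT}/users/{login}",
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_profile(json.load(response))


# Offline example with a hand-written payload, so nothing is downloaded here.
sample = {"login": "example-user", "name": "Example User", "location": None,
          "email": None, "public_repos": 3, "followers": 10}
profile = extract_profile(sample)
```

Many users leave `name`, `location`, or `email` empty, so any later cleaning step should expect missing values in these fields.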

We will also collect economic, population, and high-tech industry data from:

OECD database
U.S. Bureau of Labor Statistics

We will retrieve the data into a local database and clean it.

Analysis

The analysis has two parts.

GitHub users' biographies:

We want to understand where users are located, which institutions they work for, and how they describe themselves and their repositories.

GitHub users' repositories:

We want to look at which languages they use and the characteristics of their repositories.
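
A minimal sketch of the language part of this analysis is shown below, in Python rather than the project's R. It assumes each repository record carries the `language` and `stargazers_count` fields that the GitHub REST API returns for a repository. The function names and the sample records are invented for illustration.

```python
from collections import Counter


def language_counts(repos):
    """Count repositories per primary language, skipping repos with none."""
    return Counter(r["language"] for r in repos if r.get("language"))


def stars_by_language(repos):
    """Sum stars per primary language, skipping repos with no language."""
    totals = Counter()
    for r in repos:
        if r.get("language"):
            totals[r["language"]] += r.get("stargazers_count", 0)
    return totals


# Invented sample records; "language" is None when GitHub detects no language.
repos = [
    {"name": "a", "language": "R", "stargazers_count": 4},
    {"name": "b", "language": "Python", "stargazers_count": 1},
    {"name": "c", "language": "R", "stargazers_count": 0},
    {"name": "d", "language": None, "stargazers_count": 2},
]

counts = language_counts(repos)
stars = stars_by_language(repos)
```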

Prediction

We are trying to predict the popularity of a user's repository based on the given user and that user's other repositories.

We would also like to predict a user's identity from the information available about their repositories.

We use regression and simple machine learning techniques for the predictions above.
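
To make the regression idea concrete, here is a small one-predictor least-squares sketch in Python. It is only an illustration and not the model in Prediction.Rmd. The choice of predictor (mean stars of a user's other repositories) is an assumption made for this example, and the numbers are toy values chosen to lie on a straight line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new predictor value."""
    slope, intercept = model
    return slope * x + intercept


# Toy values, not project data.
# x: mean stars of the user's other repositories (assumed predictor)
# y: stars of the repository being predicted
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

model = fit_line(xs, ys)
estimate = predict(model, 4.0)
```

A real model would likely need more predictors and a transformation of the star counts, since repository popularity tends to be heavily skewed.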

Tools and Approaches

Tools we use:

GitHub API
R Language
MongoDB

We will try our best to keep our code in R. Since some data might not be available through the GitHub API, we might use third-party GitHub data such as GHTorrent and GHArchive. However, we would rather fetch the data ourselves so that the process of obtaining unbiased data is fully transparent to us.

How to use Data_collector.Rmd?

Data_collector.Rmd fetches users by traversing the graph of GitHub users. In the set up block, I seed the database with my own GitHub account: I store myself and my friends (followers and followings) in the unfetched collection. Then I go through the unfetched collection and add each person's friends to the unfetched collection. After processing a person, I move that person to the user collection. Consequently, people in user have been processed, while those in unfetched still need to be processed. Each pass through the unfetched collection moves more people into the user collection.

Replace the GitHub API key in the Rmd file with your own key.
Set up a MongoDB instance and note its URL and port. Then connect to your MongoDB instance.
Go to the code block named set up. Replace the API URL of the author with that of whichever GitHub user you like. If you keep the default, the data might be biased, since I am Chinese and my followers and followings are also Chinese. If you do not run the start fetch block multiple times, the data might be restricted to a Chinese GitHub community.
Run the rest of the code.
Run the last block, named start fetch, multiple times to populate the database.
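
As one possible alternative to pasting the key into the file, the Python sketch below reads it from an environment variable and attaches it to a GitHub API request. It is not how the Rmd file handles the key. The variable name `GITHUB_TOKEN`, the helper `build_request`, and the placeholder account `octocat` are all assumptions made for this example, and the request is only built, not sent.

```python
import os
import urllib.request


def build_request(url, token=None):
    """Build a GitHub API request, adding the token only when one is given."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


# Assumed variable name; the result is None when the variable is unset.
token = os.environ.get("GITHUB_TOKEN")

# Placeholder seed account; swap in the user you want to start from.
request = build_request("https://api.github.com/users/octocat/followers", token)
```

Authenticated requests get a much higher rate limit than anonymous ones, which matters when the fetch block is run many times.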

