About

DS Mining is a project that mines data from public Data Science repositories to identify patterns and behaviors. The project is developed in Python 3.8, uses SQLAlchemy to work with a SQLite database, and uses pytest as its testing framework.

Overview

The project consists of four steps that go from data collection to analysis.

Project Workflow

[Figure: project workflow diagram]

Corpus

The corpus of the project comes from a search of GitHub's GraphQL API for the terms "Data Science", "Ciência de Dados", "Ciencia de los Datos" and "Science des Données". After the collection, we established the following requirements that a repository must meet to be analyzed:

  • At least 1 language, 1 commit and 1 contributor
  • Is not a course project

Repositories that did not meet the requirements were discarded in step 2, the filtering.
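
The requirements above can be read as a single predicate over a repository record. The following is a minimal sketch, not the project's actual filter: the field names (`languages`, `commits`, `contributors`, `is_course_project`) are hypothetical, and how a course project is detected is not described here, so it is assumed to be a precomputed flag.

```python
from typing import Mapping


def meets_requirements(repo: Mapping) -> bool:
    """Return True when a repository record passes the corpus requirements.

    All keys are hypothetical names chosen for illustration only.
    """
    has_minimums = (
        repo.get("languages", 0) >= 1
        and repo.get("commits", 0) >= 1
        and repo.get("contributors", 0) >= 1
    )
    # The course-project check is assumed to be a precomputed boolean flag.
    return has_minimums and not repo.get("is_course_project", False)


candidates = [
    {"name": "a", "languages": 2, "commits": 10, "contributors": 1},
    {"name": "b", "languages": 1, "commits": 0, "contributors": 1},
    {"name": "c", "languages": 1, "commits": 5, "contributors": 2,
     "is_course_project": True},
]
selected = [repo["name"] for repo in candidates if meets_requirements(repo)]
```

Only repository `a` passes in this example: `b` has no commits and `c` is flagged as a course project.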

Scripts Description

Main Scripts

| Script | Description | Input Table | Output Table |
| --- | --- | --- | --- |
| s1_collect.py | Queries projects' metadata from the GitHub API | None | Queries |
| s2_filter.ipynb | Filters and selects repositories for further extraction | Queries | Repositories |
| s3_extract.py | Extracts data from selected repositories | Repositories | Commits, Notebooks, Cells, Python Files, Requirement Files and other tables derived from them |
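
As an illustration of the collection step, a search against GitHub's GraphQL API can be expressed as a query plus variables posted to `https://api.github.com/graphql`. This is a sketch under assumptions, not the contents of s1_collect.py: the selected fields, the page size, the quoting of the search term, and the helper name `build_payload` are all illustrative choices.

```python
import json

# Search terms taken from the Corpus section.
SEARCH_TERMS = [
    "Data Science",
    "Ciência de Dados",
    "Ciencia de los Datos",
    "Science des Données",
]

# The selected fields are an assumption; the real script may request more.
SEARCH_QUERY = """
query ($q: String!, $first: Int!, $after: String) {
  search(query: $q, type: REPOSITORY, first: $first, after: $after) {
    repositoryCount
    pageInfo { endCursor hasNextPage }
    nodes { ... on Repository { nameWithOwner } }
  }
}
"""


def build_payload(term, first=100, after=None):
    """Build the JSON body for one page of search results (hypothetical helper)."""
    return {
        "query": SEARCH_QUERY,
        # Quoting the term asks the search for the exact phrase.
        "variables": {"q": f'"{term}"', "first": first, "after": after},
    }


payloads = [build_payload(term) for term in SEARCH_TERMS]
body = json.dumps(payloads[0])  # what would be sent in the POST request
```

To fetch further pages, the `endCursor` from one response would be passed as `after` in the next request while `hasNextPage` is true.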

Extraction Scripts

| Script | Description | Input Table | Output Table |
| --- | --- | --- | --- |
| e1_download.py | Downloads selected repositories from GitHub | Repositories | Repositories |
| e2_notebooks_and_cells.py | Extracts Notebooks and Cells from repositories | Repositories | Notebooks, Cells |
| e3_python_files.py | Extracts Python Files from repositories | Repositories | Python Files |
| e4_requirement_files.py | Extracts Requirement Files from repositories | Repositories | Requirement Files |
| e5_markdown_cells.py | Extracts features from markdown cells | Cells with type "markdown" | Cell Markdown Features |
| e6_code_cells.py | Extracts features from code cells | Cells with type "code" | Cell Modules, Cell Data IOs |
| e7_python_features.py | Extracts features from Python files | Python Files | Python Modules, Python Data IOs |
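
To give an idea of what extracting modules from code involves, the sketch below walks a Python Abstract Syntax Tree and collects imported module names. It is an illustration using the standard `ast` module, not the logic of e6_code_cells.py or e7_python_features.py; the function name `imported_modules` is hypothetical.

```python
import ast


def imported_modules(source):
    """Return the sorted names of modules imported in a piece of Python source."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import pandas as pd" contributes "pandas".
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # "from sklearn.model_selection import x" contributes the module path.
            # Relative imports without a module name ("from . import x") are skipped.
            modules.add(node.module)
    return sorted(modules)


cell_source = (
    "import pandas as pd\n"
    "from sklearn.model_selection import train_test_split\n"
    "df = pd.read_csv('data.csv')\n"
)
modules = imported_modules(cell_source)
```

Note that `ast.parse` only accepts code that is valid for the running interpreter, so notebook cells containing IPython magics or syntax from another Python version would raise `SyntaxError` and need separate handling.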

Aggregation Scripts

| Script | Description | Input Table | Output Table |
| --- | --- | --- | --- |
| ag1_notebook_aggregate.ipynb | Aggregates some of the data related to Notebooks and their Cells for easier analysis | Cell Markdown Features, Cell Modules, Cell Data IOs | Notebook Markdowns, Modules, Data IOs |
| ag2_python_aggregate.ipynb | Aggregates some of the data related to Python Files for easier analysis | Python Modules, Python Data IOs | Modules, Data IOs |
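
Conceptually, the aggregation rolls cell-level rows up to one entry per notebook. The sketch below shows that idea on plain tuples; it is an assumption about the general shape of the step, not the code in the aggregation notebooks, and the `(notebook_id, module)` row layout is hypothetical.

```python
from collections import defaultdict


def aggregate_modules(cell_rows):
    """Group (notebook_id, module) rows into one sorted module list per notebook."""
    per_notebook = defaultdict(set)
    for notebook_id, module in cell_rows:
        # A set removes modules imported in more than one cell.
        per_notebook[notebook_id].add(module)
    return {nb: sorted(mods) for nb, mods in per_notebook.items()}


rows = [(1, "pandas"), (1, "numpy"), (1, "pandas"), (2, "sklearn")]
by_notebook = aggregate_modules(rows)
```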

Analysis Notebooks

After we extract all the data from selected repositories, we use Jupyter Notebooks to analyze the data and generate conclusions and graphic outputs.

| Notebook | Description |
| --- | --- |
| a1_collected.ipynb | Analyzes collected repositories' features |
| a2_filtered.ipynb | Analyzes language-related features |
| a3_selected.ipynb | Analyzes selected repositories' features |
| a4_modules.ipynb | Analyzes modules extracted |
| a5_code_and_data.ipynb | Analyzes code features and data inputs/outputs |

Results

The resulting database is available here (~28GB).

Installation

The project primarily uses Python 3.8 as its interpreter, but it also uses other Python versions (2.7 and 3.5) when extracting features from the Abstract Syntax Trees of code written for those versions. To manage these interpreters we use Conda; instructions for installing it on Linux can be found here.

After downloading and installing Conda, you might need to add export PATH="$HOME/anaconda3/bin:$PATH" to your .bashrc file. Then run conda init to initialize Conda.

Requirements

We also use several Python modules, which are listed in requirements.txt. Follow the instructions below to set up the Conda environments and install the modules in each of them.

Install the nltk stopwords, which are used in the Cell Markdowns extraction, by running python -c "import nltk; nltk.download('stopwords')"

Conda 2.7

conda create -n dsm27 python=2.7 -y
conda activate dsm27
pip install --upgrade pip
pip install -r requirements.txt
pip install astunparse

Conda 3.5

conda create -n dsm35 python=3.5 -y
conda activate dsm35
pip install --upgrade pip
pip install -r requirements.txt

Conda 3.8

conda create -n dsm38 python=3.8 -y
conda activate dsm38
pip install --upgrade pip
pip install -r requirements.txt
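
Once the three environments exist, a script can be run under a specific interpreter with `conda run -n <env>`. The sketch below only builds such a command line; how the project actually switches interpreters is not described here, so the helper `conda_command` and the script name `extract_ast.py` are hypothetical.

```python
# Environment names as created in the steps above.
ENVIRONMENTS = {"2.7": "dsm27", "3.5": "dsm35", "3.8": "dsm38"}


def conda_command(python_version, script, *args):
    """Build an argument list that runs `script` in the matching Conda environment."""
    env = ENVIRONMENTS[python_version]  # raises KeyError for unsupported versions
    return ["conda", "run", "-n", env, "python", script, *args]


# "extract_ast.py" is a placeholder script name used for illustration.
command = conda_command("2.7", "extract_ast.py", "legacy_module.py")
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where Conda and the environments are installed.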

Running

To run the project, run scripts s1, s2 and s3, then the aggregation notebooks ag1 and ag2, and then each analysis notebook. To run individual tests, call pytest file.py or pytest directory/.

Tests

Run the tests by using python -m pytest tests

References