An end-to-end framework that helps companies predict software engineering trends and helps developers learn more about a Docker image.
Our goal is to provide different companies with a dynamic dataset through which meaningful inferences can be made.
Our aim is to gather data from Docker Hub and analyse the trends. Docker Hub is a cloud-based repository in which Docker users and partners create, test, store and distribute container images.
This project was developed as part of coursework for Data-X at Berkeley.
Link to supporting presentation
We use Conda to manage the environment and packages.
We use the following packages (among many others):
- Python 3.6 or above
- Pandas
- Matplotlib
- Plotly
- Seaborn
- boto3
To fetch new .json files from the AWS S3 bucket:
cd data/
aws s3 sync s3://docker-recent recent-data
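Once the sync has finished, the downloaded files can be inspected from Python. The following is only a minimal sketch and is not part of the repository: the helper name `load_recent_json` is hypothetical, and it assumes the files sit directly inside `data/recent-data`, end in `.json`, and each hold a single JSON document. The actual schema of the files is not described here, so the sketch does not interpret their contents.

```python
import json
from pathlib import Path


def load_recent_json(directory="data/recent-data"):
    """Load every .json file in `directory` into a dict keyed by file name.

    Hypothetical helper: assumes a flat directory and one JSON document per file.
    """
    records = {}
    # Sort the paths so the files are read in a stable, repeatable order.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records[path.name] = json.load(handle)
    return records
```

Calling `load_recent_json()` from the repository root would return a dictionary mapping each file name to its parsed content, which can then be passed to Pandas for analysis.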
Start by downloading or cloning this repository.
git clone https://github.com/cshubhamrao/docker-hub-data.git
cd docker-hub-data
Create the conda environment from the environment.yml file:
conda env create -f environment.yml
Now activate the environment and start JupyterLab:
conda activate docker-hub
jupyter lab
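With the environment active, a quick check can confirm that the interpreter and the packages listed above are available. This is a sketch under assumptions rather than a script shipped with the project: the function names `missing_packages` and `python_is_supported` are hypothetical, and the import names are assumed to match the package names in lowercase.

```python
import importlib.util
import sys

# Import names assumed for the packages listed in this README.
REQUIRED_PACKAGES = ["pandas", "matplotlib", "plotly", "seaborn", "boto3"]


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 6)):
    """Return True when the running interpreter is at least version `minimum`."""
    return sys.version_info >= minimum
```

Running `missing_packages()` in a notebook cell should return an empty list if the environment was created successfully; any names it returns point to packages that still need to be installed.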
- Data - This folder contains all the data-related files and folders that are generated or stored for later use. It is also where the plots generated by analytics.ipynb and other scripts are saved.
- Misc - Contains all the miscellaneous scripts that are required for this project.
- Scripts - This is the main folder. It contains all the scripts we used to scrape the data, clean it, select the data required for analysis, and finally analyse the data and derive inferences from it.