Static Jupyter Notebook : link
Interactive Binder : link
In this project, we perform some simple data analysis on a dataset of international football results and fit a linear regression model using scikit-learn.
Dataset : Kaggle link Github link
The dataset contains records of more than 42,000 international football games. The available information includes the participating teams, the goals scored by each team, the date, and the venue, among other fields.
Dependencies:
numpy : 1.20.2
plotly : 4.14.3
matplotlib : 3.4.2
pandas : 1.2.3
chart_studio : 1.1.0
scipy : 1.6.3
scikit-learn : 0.24.2
sklearn : 0.0
threadpoolctl: 2.1.0
joblib : 1.0.1
NOTE: If you are using the Binder link, it will automatically recreate the environment and install these dependencies.
Also, the last two cells (which use the watermark library) might not work in Binder. They are not part of the data analysis; they only report the dependency versions used in the project. If you do wish to run them, add
%pip install watermark
in a cell before them to install the library.
On the basis of the available data, we try to answer the following questions:
- In which tournament were the most games played?
- Which 10 teams have the most wins?
- Which teams have the best winning percentage (among those that have played at least 100 games)?
- How many goals has each team scored in total, and which teams score the most goals per game?
- Is there any correlation between winning percentage and goals per game?
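As a rough illustration of how these questions can be approached with pandas, here is a minimal sketch on a tiny made-up sample. The column names (`home_team`, `away_team`, `home_score`, `away_score`, `tournament`), the sample rows, and the `MIN_GAMES` constant are assumptions for illustration and may not match the notebook.

```python
import pandas as pd

# Made-up sample rows; the column names are assumed, not taken from the notebook.
games = pd.DataFrame({
    "home_team": ["A", "B", "A", "C"],
    "away_team": ["B", "C", "C", "A"],
    "home_score": [2, 1, 0, 3],
    "away_score": [1, 1, 2, 0],
    "tournament": ["Friendly", "Friendly", "Cup", "Cup"],
})

# Number of games played in each tournament.
games_per_tournament = games["tournament"].value_counts()

# Reshape to one row per (team, game) so home and away games count equally.
cols = ["team", "goals_for", "goals_against"]
home = games.rename(columns={
    "home_team": "team", "home_score": "goals_for", "away_score": "goals_against",
})[cols]
away = games.rename(columns={
    "away_team": "team", "away_score": "goals_for", "home_score": "goals_against",
})[cols]
per_team = pd.concat([home, away], ignore_index=True)
per_team["win"] = per_team["goals_for"] > per_team["goals_against"]

# Aggregate games played, wins, and total goals for each team.
stats = per_team.groupby("team").agg(
    games=("win", "count"),
    wins=("win", "sum"),
    goals=("goals_for", "sum"),
)
stats["win_pct"] = 100 * stats["wins"] / stats["games"]
stats["goals_per_game"] = stats["goals"] / stats["games"]

# Hypothetical threshold: the project uses 100 games; 1 keeps the toy sample non-empty.
MIN_GAMES = 1
eligible = stats[stats["games"] >= MIN_GAMES]

top_by_wins = stats.sort_values("wins", ascending=False).head(10)
top_by_win_pct = eligible.sort_values("win_pct", ascending=False)
top_by_goals_per_game = eligible.sort_values("goals_per_game", ascending=False)

# Pearson correlation between winning percentage and goals per game.
correlation = eligible["win_pct"].corr(eligible["goals_per_game"])
```

Reshaping to one row per team per game is one possible design choice; it avoids computing home and away statistics separately and then merging them.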
Finally, we fit a simple linear regression model on winning percentage and goals scored per game, and measure how accurate the model's predictions are.
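The regression step could look roughly like the sketch below. It uses synthetic data rather than the real dataset, and it assumes goals per game is the predictor and winning percentage is the target; the train/test split and the R² score are likewise assumptions about how the fit might be evaluated, not a description of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real dataset): a noisy linear relationship.
rng = np.random.default_rng(0)
goals_per_game = rng.uniform(0.5, 3.0, size=200)
win_pct = 20 * goals_per_game + 5 + rng.normal(0, 3, size=200)

# scikit-learn expects a 2-D feature matrix, hence the reshape.
X = goals_per_game.reshape(-1, 1)
y = win_pct

# Hold out 20% of the rows to evaluate predictions on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 is one common way to summarize how well a regression fits.
r2 = r2_score(y_test, y_pred)
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, R^2={r2:.3f}")
```

For a regression model, "accuracy" is usually reported through a metric such as R² or mean squared error rather than a classification-style accuracy.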