This project explores the Iris dataset, which contains 150 samples with measurements of sepal length, sepal width, petal length, and petal width for three species of iris flowers: setosa, versicolor, and virginica.
Data Exploration and Visualization:
- The project starts by loading the Iris dataset using Pandas and checking for null values.
- Basic statistics like mean, min, max, and quartiles are calculated and printed.
- The number of occurrences of each species (setosa, versicolor, virginica) is counted and printed.
- A pie chart is created to visualize the distribution of the three species in the dataset.
- Box plots and violin plots are used to identify outliers and visualize the distribution of features across different species.
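The exploration steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the data is loaded from scikit-learn's bundled copy (`load_iris`) rather than the project's own file, uses plain Matplotlib for the charts, and picks petal length as the example feature for the box and violin plots. The variable names are made up for this sketch.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled copy instead of the project's file.
bunch = load_iris(as_frame=True)
df = bunch.frame.copy()

# Replace the integer target (0, 1, 2) with readable species names.
names = [str(n) for n in bunch.target_names]
df["species"] = df.pop("target").map(dict(enumerate(names)))

# Null check, basic statistics, and per-species counts.
null_counts = df.isnull().sum()
stats = df.describe()  # count, mean, std, min, quartiles, max
species_counts = df["species"].value_counts()
print(null_counts, stats, species_counts, sep="\n\n")

fig, (ax_pie, ax_box, ax_violin) = plt.subplots(1, 3, figsize=(15, 4))

# Pie chart of the species distribution.
ax_pie.pie(
    species_counts.to_numpy(),
    labels=list(species_counts.index),
    autopct="%1.1f%%",
)
ax_pie.set_title("Species distribution")

# Box and violin plots of one feature, grouped by species.
feature = "petal length (cm)"  # illustrative choice of feature
groups = [df.loc[df["species"] == n, feature].to_numpy() for n in names]
ax_box.boxplot(groups)
ax_violin.violinplot(groups, showmedians=True)
for ax, kind in ((ax_box, "Box plot"), (ax_violin, "Violin plot")):
    ax.set_xticks([1, 2, 3])
    ax.set_xticklabels(names)
    ax.set_ylabel(feature)
    ax.set_title(f"{kind} by species")
fig.tight_layout()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view the figure. Points drawn beyond the box plot whiskers are the candidates for outliers.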
Correlation Analysis:
- A heatmap is created to visualize the correlation matrix of features in the Iris dataset.
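One possible way to produce such a heatmap is sketched below. It assumes Pearson correlation (the pandas default) and draws the matrix with Matplotlib's `imshow`; the project may well have used a different plotting library, and the data is again loaded from scikit-learn as a stand-in.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy stands in for the project's data.
features = load_iris(as_frame=True).data

# Pairwise Pearson correlation between the four numeric features.
corr = features.corr()

fig, ax = plt.subplots(figsize=(6, 5))
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(image, ax=ax, label="Pearson correlation")

ticks = list(range(len(corr.columns)))
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(ticks)
ax.set_yticklabels(corr.index)

# Write each coefficient into its cell.
for i in ticks:
    for j in ticks:
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.tight_layout()
```

Fixing the color scale to the range -1 to 1 keeps the colors comparable across datasets, since a cell's shade then depends only on its coefficient.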
Scatter Plot:
- A scatter plot is generated to show the relationship between sepal length and sepal width.
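A scatter plot along these lines might look like the sketch below. Coloring the points by species is an addition made here for readability and is not stated in the description above; the data source is again scikit-learn's bundled copy.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy stands in for the project's data.
bunch = load_iris(as_frame=True)
features, target = bunch.data, bunch.target

x_col, y_col = "sepal length (cm)", "sepal width (cm)"

fig, ax = plt.subplots(figsize=(6, 5))
# One scatter call per species so each gets its own color and legend entry.
for code, name in enumerate(bunch.target_names):
    mask = target == code
    ax.scatter(
        features.loc[mask, x_col],
        features.loc[mask, y_col],
        label=str(name),
        alpha=0.8,
    )
ax.set_xlabel(x_col)
ax.set_ylabel(y_col)
ax.set_title("Sepal length vs. sepal width")
ax.legend(title="Species")
fig.tight_layout()
```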
Machine Learning:
- The project then applies machine learning algorithms to classify the species.
- The dataset is split into training and testing sets.
- Logistic Regression, Support Vector Machines, K-Nearest Neighbors, Naive Bayes, and Decision Tree classifiers are implemented and evaluated.
- Accuracy scores and confusion matrices are printed for each algorithm.
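The training and evaluation loop described above could be sketched with scikit-learn as follows. The split ratio, random seed, stratification, and all hyperparameters are assumptions chosen for illustration, since the description does not specify them; Gaussian Naive Bayes is likewise assumed as the Naive Bayes variant.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumed split: 80% train / 20% test, stratified by species.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Mostly default hyperparameters; max_iter is raised so the solver converges.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

accuracies = {}
matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, predictions)
    # Rows are true species, columns are predicted species.
    matrices[name] = confusion_matrix(y_test, predictions)
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
    print(matrices[name])
```

The exact scores depend on the split and seed, so they will not necessarily match the project's reported numbers.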
Summary of Results:
- A summary table is created, showing the accuracy scores of different machine learning models on the Iris dataset.
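Such a summary table might be assembled with pandas as in the sketch below. The helper name `accuracy_table` is hypothetical, and the split settings and model hyperparameters are assumptions made for this illustration rather than the project's actual configuration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def accuracy_table(models, X_train, X_test, y_train, y_test) -> pd.DataFrame:
    """Fit each model and return test accuracies, best first (hypothetical helper)."""
    rows = [
        # score() on a scikit-learn classifier returns mean accuracy.
        {"Model": name, "Accuracy": model.fit(X_train, y_train).score(X_test, y_test)}
        for name, model in models.items()
    ]
    table = pd.DataFrame(rows).sort_values("Accuracy", ascending=False)
    return table.reset_index(drop=True)


X, y = load_iris(return_X_y=True)
# Assumed split: 80% train / 20% test, stratified by species.
split = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_test, y_train, y_test = split

table = accuracy_table(
    {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine": SVC(),
        "K-Nearest Neighbors": KNeighborsClassifier(),
        "Naive Bayes": GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
    },
    X_train, X_test, y_train, y_test,
)
print(table.to_string(index=False))
```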
In short, this project explores the Iris dataset, visualizes its features, checks for correlations between them, and trains several classifiers to predict the species of an iris flower from its four measurements. The results suggest that all five classifiers (Logistic Regression, Support Vector Machines, K-Nearest Neighbors, Naive Bayes, and Decision Tree) perform well on this dataset.