This repository contains a comparative analysis of machine learning algorithms for text classification. The algorithms evaluated include Support Vector Machines (SVM), Random Forest, and Naive Bayes. The analysis uses a dataset of research paper abstracts and full texts related to various types of cancer.
The dataset comprises research paper abstracts and full texts covering different types of cancer, which provides a diverse set of texts for classification.
The following machine learning algorithms were evaluated in this study:
- Support Vector Machines (SVM)
- Random Forest
- Naive Bayes
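
As a rough illustration of how the three algorithms listed above can be compared on the same text features, here is a minimal sketch using scikit-learn. The choice of scikit-learn, the specific estimator classes (`LinearSVC`, `RandomForestClassifier`, `MultinomialNB`), their parameters, and the toy texts and labels are all assumptions made for this example; they are not taken from the project's code or dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real abstracts and full texts.
train_texts = [
    "lung tumor detected in chest scan",
    "smoking linked to lung carcinoma",
    "breast tissue biopsy shows carcinoma",
    "mammogram screening for breast tumor",
]
train_labels = ["lung", "lung", "breast", "breast"]
test_texts = ["chest scan shows lung tumor", "breast biopsy after mammogram"]
test_labels = ["lung", "breast"]

# One estimator per algorithm; the settings here are illustrative assumptions.
models = {
    "SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": MultinomialNB(),
}

results = {}
for name, classifier in models.items():
    # Each pipeline vectorizes the raw text with TF-IDF, then fits the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(train_texts, train_labels)
    # score() returns mean accuracy on the held-out texts.
    results[name] = pipeline.score(test_texts, test_labels)

for name, accuracy in results.items():
    print(f"{name}: accuracy = {accuracy:.2f}")
```

Wrapping the vectorizer and classifier in a single pipeline keeps the feature extraction identical across the three models, so differences in the scores reflect the classifiers rather than the preprocessing.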
The text was preprocessed using the following steps:
- Tokenization: Splitting the text into individual words or tokens.
- Stemming: Reducing words to their root form.
- TF-IDF Vectorization: Converting the text into numerical vectors using Term Frequency-Inverse Document Frequency (TF-IDF) weighting.
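
The three steps above can be sketched in plain Python as follows. This is a simplified illustration, not the project's implementation: the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemmer such as Porter's, the TF-IDF formula used is the basic unsmoothed variant (term frequency times `log(N / df)`), and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed suffix list for a toy stemmer; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tfidf(docs):
    """Return (vocabulary, vectors) using tf = count / doc length, idf = log(N / df)."""
    tokenized = [[stem(t) for t in tokenize(doc)] for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[term] / total) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vocab, vectors


# Hypothetical two-document corpus.
vocab, vectors = tfidf(["Tumor cells growing", "Tumor treated"])
print(vocab)
print(vectors)
```

Note that a term appearing in every document (here `tumor`) receives a weight of zero under this unsmoothed formula, which is why many libraries add smoothing to the inverse document frequency.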
- Accuracy
- Precision
- Recall
- F1 Score
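
For reference, these four metrics can be computed from true and predicted labels as in the sketch below. The function name `classification_metrics`, the example labels, and the choice to compute precision, recall, and F1 for a single class treated as positive are assumptions for illustration; the source does not say how the project averages these metrics across cancer types.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy plus precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, not results from the project.
y_true = ["lung", "lung", "lung", "breast", "breast"]
y_pred = ["lung", "lung", "breast", "lung", "breast"]
metrics = classification_metrics(y_true, y_pred, positive="lung")
print(metrics)
```

In a multiclass setting such as this one, the per-class values are typically combined by macro or weighted averaging over all classes.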
The results compare the performance of each algorithm on these evaluation metrics. Detailed findings are provided in the results section of the project.