Ted Talks Clustering

Clustering over +4000 Ted Talks using the most common clustering algorithms and a comparaison between tf-idf and Word Embeddings. Data from: https://www.kaggle.com/datasets/miguelcorraljr/ted-ultimate-dataset

Comparison of clustering formed using tf-idf and word embeddings using the most commons clustering algorithms like KMeans, Gaussian Mixture Models and Agglomerative Clustering.
Tuning of the hyperparameters of all models.
Comparaison of the results using multiple clustering metrics (DBI, Silhoutte and Calinski).
Bonus experiment using only the most relevant tf-idf words and partly solving the curse of dimensionality.
Bonus experiment using word embeddings from Microsoft MiniLM-L12-H384.
Final analysis using wordclouds and n-grams to identify the topics.
Found insights about which algorithms and metrics work best for document clustering and why.
I used cuML, Spark (PySpark) and sentence-transformers.

Wordclouds generated from KMeans with Bert Embeddings:

Wordclouds generated from Gaussian Mixture Models with Bert Embeddings:

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
Clusters_GMM.png		Clusters_GMM.png
Clusters_KMeans.png		Clusters_KMeans.png
README.md		README.md
Trabajo_Final_Aprendizaje_No_Supervisado- Santiago González Silot.pdf		Trabajo_Final_Aprendizaje_No_Supervisado- Santiago González Silot.pdf
Trabajo_Final_Clustering.ipynb		Trabajo_Final_Clustering.ipynb

Provide feedback