Project for the 2020-2021 NTUA ECE class "Advanced Databases". Team members: Skourtsidis Giorgos, Fivos Kalogiannis
-
Students had to write queries using both of PySpark's interfaces: the RDD API and SparkSQL. The SQL queries had to be run on both CSV and Parquet files, and the differences between all the results had to be compared.
-
In part B, we had to implement two distributed join algorithms (repartition join and broadcast join) and compare their results. We also had to experiment with how Spark's query optimizer handles joins.
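
To illustrate the idea behind the two join algorithms, here is a minimal single-process sketch in plain Python. It is not the project's PySpark code: the function names, the `num_partitions` parameter, and the toy input are all hypothetical, and lists of `(key, value)` pairs stand in for distributed datasets. A repartition join routes records from both inputs to partitions by key hash and joins inside each partition, while a broadcast join builds a hash table from the small input and probes it with each record of the large one, so the large input never needs to be shuffled.

```python
from collections import defaultdict


def repartition_join(left, right, num_partitions=4):
    """Reduce-side join sketch: route both inputs by key hash, join per partition."""
    # "Map" phase: send every record to the partition chosen by its key hash,
    # remembering which input it came from (slot 0 = left, slot 1 = right).
    partitions = [defaultdict(lambda: ([], [])) for _ in range(num_partitions)]
    for key, value in left:
        partitions[hash(key) % num_partitions][key][0].append(value)
    for key, value in right:
        partitions[hash(key) % num_partitions][key][1].append(value)

    # "Reduce" phase: inside each partition, pair up left and right values per key.
    joined = []
    for partition in partitions:
        for key, (left_values, right_values) in partition.items():
            joined.extend((key, (l, r)) for l in left_values for r in right_values)
    return joined


def broadcast_join(small, large):
    """Map-side join sketch: hash the small input once, probe it with the large one."""
    # Build the lookup table that would be shipped to every worker.
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)

    # Each record of the large input is joined locally, with no shuffle.
    return [(key, (s, l)) for key, l in large for s in table.get(key, [])]


# Hypothetical toy inputs: (key, value) pairs.
small_input = [(1, "a"), (1, "b"), (2, "c")]
large_input = [(1, 4.5), (2, 3.0), (3, 5.0), (1, 2.0)]

by_repartition = sorted(repartition_join(small_input, large_input))
by_broadcast = sorted(broadcast_join(small_input, large_input))
```

Both functions produce the same inner-join result for the same inputs; they differ in how much data would have to move between workers. In Spark itself, the optimizer's choice of a broadcast join for small tables is governed by the `spark.sql.autoBroadcastJoinThreshold` setting.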