A bottom up clustering approach in which each object starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. This implementation is based on the paper "An efficient agglomerative clustering algorithm using a heap" and works as follows:
- Begin with N objects where each object is a cluster on tis own. The clusters are uniquely identified by a cluster id (e.g. 1 to N)
- Calculate pairwise distances between the clusters and insert them into a heap
- Retreive from the heap the nearest pair of clusters, which is the first element in the heap (complexity O(1)).
- Merge the two nearest clusters (e.g. a and b) into a new cluster (e.g. K) and insert the cluster K to the set of all clusters. Remove a and b from the set of all clusters. This step reduces the number of all clusters by one. Calculate the pairwise distances between cluster K and all the remaining clusters.
- Execute steps 3 and 4, N-1 times until you end up with a big single cluster.
The complexity for calculating the pairwise distances between the new cluster K and all remaining clusters (Nr) is O(Nr). Each pairwise distance is inserted into the heap, which has a complexity of O(log(n)). Since step 3 and 4 are executed N-1 times, the overall time complexity is (N-1) * Nr * log(n) or more simply N2 * log(n).
- C++11
- CMake 3.5
- STL Containers (no libraries)
- The project can work for any number of dimensions
- A sample input file is provided that contains 3D points to be clustered (0.txt)
- The nearest pair of clusters is chosen using the centroid approach (i.e. the clusters whose centroids are closer are merged at each step)
- The distance metric used is Euclidean Distance
git clone https://github.com/gevago01/Agglomerative-Clustering.git
cd Agglomerative-Clustering
mkdir build
cd build
cmake ..
make
./Agglomerative
Double checked with python scipy hierarchical clustering