Finding Duplicate Images

This notebook finds all the duplicate images in a given set of images. Large image databases often contain multiple copies of the same image under different IDs or names. Removing the duplicates from such a database requires a more efficient approach than comparing every image with every other image.

To find the duplicates, images are first passed through a feature extractor (a deep learning network trained on a classification task, with its last classification layer removed). The resulting feature vectors (image vectors) are then converted to hash codes for efficient search using Locality Sensitive Hashing (LSH).

The annoy package (its AnnoyIndex class) is used to perform LSH. The hash code of an image is found by drawing multiple random hyperplanes in the space of the image vectors; the code records which side of each hyperplane the image vector lies on. This relies on the intuition that similar objects lie close together in the high-dimensional space, so they tend to fall on the same side of most hyperplanes.
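The random-hyperplane idea described above can be sketched in plain Python. This is only an illustration of the hashing step, not the notebook's code and not how annoy is implemented internally; the function names `make_hyperplanes` and `hash_code`, the number of planes, and the seed are all assumptions made for the example.

```python
import random


def make_hyperplanes(num_planes, dim, seed=0):
    """Draw `num_planes` random hyperplanes through the origin.

    Each hyperplane is represented by its normal vector, sampled from a
    standard Gaussian. Names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def hash_code(vector, hyperplanes):
    """Return a bit string with one bit per hyperplane.

    A bit is "1" if the vector lies on the positive side of the hyperplane
    (non-negative dot product with the normal), otherwise "0".
    """
    bits = []
    for normal in hyperplanes:
        dot = sum(n * v for n, v in zip(normal, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


if __name__ == "__main__":
    planes = make_hyperplanes(num_planes=8, dim=4)
    image_vector = [1.0, 2.0, -0.5, 3.0]  # stand-in for a real feature vector
    print(hash_code(image_vector, planes))
```

Using more hyperplanes gives longer codes and smaller buckets, which speeds up the search but makes it more likely that two near-duplicates end up with different codes.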

Once the hash codes for all the images have been found, a given image is tested for similarity only against images that have the same hash code. This makes the search for duplicates much quicker.
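A minimal sketch of this bucketed comparison is shown below, assuming the feature vectors and hash codes are already available as dictionaries keyed by image name. The helper names, the use of cosine similarity, and the 0.99 threshold are assumptions for illustration; the notebook may use a different similarity measure or cutoff.

```python
import math
from collections import defaultdict
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(vectors, codes, threshold=0.99):
    """Compare images only within the same hash bucket.

    vectors:   dict mapping image name -> feature vector
    codes:     dict mapping image name -> hash code
    threshold: assumed similarity cutoff for calling two images duplicates
    """
    # Group image names by their hash code.
    buckets = defaultdict(list)
    for name, code in codes.items():
        buckets[code].append(name)

    # Pairwise comparison is done only inside each bucket.
    pairs = []
    for names in buckets.values():
        for a, b in combinations(sorted(names), 2):
            if cosine_similarity(vectors[a], vectors[b]) >= threshold:
                pairs.append((a, b))
    return sorted(pairs)
```

Images whose codes differ are never compared, which is where the speed-up comes from. It also means a true duplicate that lands in a different bucket is missed, so the number of hyperplanes trades speed against recall.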

Packages used

keras, tensorflow, numpy, cv2, matplotlib, annoy (AnnoyIndex)

Dataset

Any image dataset containing duplicate images would suffice for this notebook, although a dataset in which all the duplicates have already been identified allows the model's results to be checked.

The link for the dataset used in this notebook is here. Only the images are required for this notebook.

Results

Here are a few examples of the duplicate images found:
