Get-Duplicate-Files

PowerShell module and script to get duplicate files via their content's hash.

When monitor.ps1 is run, a background job is started which first scans for all files in the provided directory, then enumerates through those files and attempts to match duplicates based on their file hashes.
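
As a rough sketch of the general idea (not the actual implementation inside monitor.ps1), files can be grouped by their SHA-256 content hash and any group with more than one member reported as possible duplicates. The variables below are illustrative placeholders, not the script's own parameters:

# Sketch only: group files by content hash and report possible duplicates.
$searchLocation = "C:\Testing"
$depth = 6
Get-ChildItem -Path $searchLocation -File -Recurse -Depth $depth -ErrorAction SilentlyContinue |
    ForEach-Object { Get-FileHash -LiteralPath $_.FullName -Algorithm SHA256 -ErrorAction SilentlyContinue } |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object {
        # Every file after the first in a group is a possible duplicate of the first
        $original = $_.Group[0].Path
        $_.Group | Select-Object -Skip 1 | ForEach-Object {
            "$($_.Path) is possibly a duplicate of $original"
        }
    }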

The hashing work runs in a background job, whilst the main thread keeps tabs on its progress. While monitor.ps1 is running, output similar to the below will appear on screen:

PS C:\Users\Administrator\Documents\GitHub\Dev\Get-Duplicate-Files> .\monitor.ps1 -SEARCH_LOCATION C:\ -LOG_LOCATION C:\Testing\ -SEARCH_DEPTH 6 -FILE_EXTENSIONS "TXT"
09:10:2019:13:20:30 # CPU Usage: 35% # RAM Usage: 58% # 14.46/49.51GB C: free space # Scanning C:\ at depth 6
09:10:2019:13:20:41 # CPU Usage: 54% # RAM Usage: 58% # 14.46/49.51GB C: free space # 483\1415 files processed
09:10:2019:13:20:52 # CPU Usage: 42% # RAM Usage: 58% # 14.46/49.51GB C: free space # Processing file 1374\1415
09:10:2019:13:21:03 # CPU Usage: 26% # RAM Usage: 58% # 14.46/49.51GB C: free space # Creating reports
PS C:\Users\Administrator\Documents\GitHub\Dev\Get-Duplicate-Files>
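
A hedged sketch of how such a status line could be assembled and printed while the background job runs. The counter and CIM calls shown are one way to obtain the figures; the exact implementation in monitor.ps1 may differ, and the scriptblock and values are placeholders:

# Sketch only: poll a background job and print a status line every N seconds.
$MONITORING_FREQUENCY = 10                                    # illustrative default
$job = Start-Job -ScriptBlock { <# scanning/hashing work would go here #> Start-Sleep 30 }
while ($job.State -eq "Running") {
    $cpu   = (Get-CimInstance Win32_Processor | Measure-Object -Property LoadPercentage -Average).Average
    $os    = Get-CimInstance Win32_OperatingSystem
    $ram   = [math]::Round(100 - ($os.FreePhysicalMemory / $os.TotalVisibleMemorySize * 100))
    $disk  = Get-PSDrive -Name C
    $stamp = Get-Date -Format "dd:MM:yyyy:HH:mm:ss"
    "{0} # CPU Usage: {1}% # RAM Usage: {2}% # {3:N2}/{4:N2}GB C: free space" -f `
        $stamp, [math]::Round($cpu), $ram, ($disk.Free / 1GB), (($disk.Free + $disk.Used) / 1GB)
    Start-Sleep -Seconds $MONITORING_FREQUENCY
}
Remove-Job $job -Force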

In addition, each file scanned writes an entry to a log file in the specified directory, similar to the below:

09:10:2019:12:17:53 # File: C:\backups\HelloWorld.txt # Owner: BUILTIN\Administrators # Hash: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855# Same hash: no duplication found
09:10:2019:12:17:53 # File: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt # Owner: BUILTIN\Administrators # Hash: 4BFCDFC58DB942D403558D4188B02638A4A4908E6597308A3CB89839212EA6C6# Same hash: no duplication found
09:10:2019:12:17:54 # File: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\LICENSE.txt # Owner: BUILTIN\Administrators # Hash: 5BA21FBB0964F936AD7D15362D1ED6D4931CC8C8F9FF2D4D91190E109BE74431# Same hash: no duplication found
09:10:2019:12:17:54 # File: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\top_level.txt # Owner: BUILTIN\Administrators # Hash: CEEBAE7B8927A3227E5303CF5E0F1F7B34BB542AD7250AC03FBCDE36EC2F1508# Same hash: no duplication found
09:10:2019:12:17:54 # File: C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt # Owner: BUILTIN\Administrators # Hash: 4BFCDFC58DB942D403558D4188B02638A4A4908E6597308A3CB89839212EA6C6# Possible Duplicate: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt
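
A rough sketch of how one such log line could be assembled. The field names mirror the sample entries above, but the variables and file paths are illustrative, not taken from the script:

# Sketch only: build a log line with timestamp, file, owner and hash.
$file      = "C:\backups\HelloWorld.txt"                   # illustrative path
$timestamp = Get-Date -Format "dd:MM:yyyy:HH:mm:ss"
$owner     = (Get-Acl -Path $file).Owner
$hash      = (Get-FileHash -LiteralPath $file -Algorithm SHA256).Hash
$line      = "$timestamp # File: $file # Owner: $owner # Hash: $hash # Same hash: no duplication found"
Add-Content -Path "C:\Testing\log-1234.txt" -Value $line   # log path is illustrative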

Finally, two reports are produced in the specified log directory. The txt report will be similar to the below:

##################################################
#     Duplicate File Locator report for C:\ at depth *
#     3213 possible duplicates found
#     File extensions searched: txt png
#     Report generated 10/09/2019 12:19:45
##################################################
1 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt is possibly a duplicate of C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt
2 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\LICENSE.txt is possibly a duplicate of C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\LICENSE.txt
3 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\top_level.txt is possibly a duplicate of C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\top_level.txt
4 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pylint\test\functional\assignment_from_no_return_py3.txt is possibly a duplicate of C:\backups\HelloWorld.txt
5 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pylint\test\functional\control_pragmas.txt is possibly a duplicate of Possible Duplicate: C:\backups\HelloWorld.txt

The HTML report is a simple HTML page with all duplicates enclosed in a table element.
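
As an illustration of how such a table could be generated (not necessarily how monitor.ps1 does it), PowerShell's built-in ConvertTo-Html can render duplicate pairs as an HTML table. The data and output path below are placeholders:

# Sketch only: turn duplicate pairs into a simple HTML table report.
$duplicates = @(
    [pscustomobject]@{ Duplicate = "C:\Example\copy.txt"; Original = "C:\Example\file.txt" }
)   # illustrative data
$duplicates |
    ConvertTo-Html -Property Duplicate, Original -Title "Duplicate File Locator report" |
    Set-Content -Path "C:\Testing\report-1234.html"   # illustrative path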

All files are created in the given directory under "YEAR/MONTH/DAY" with names similar to the below, where RANDOM is a randomly generated int between 1 and 10000 for that run:

  • log-RANDOM.txt
  • report-RANDOM.txt
  • report-RANDOM.html
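
A minimal sketch of how that directory structure and the file names could be built. The variable names are illustrative, not taken from the script:

# Sketch only: build LOG_LOCATION\YEAR\MONTH\DAY and the three output file names.
$logLocation = "C:\Testing"                                   # illustrative
$now = Get-Date
$dayFolder = Join-Path $logLocation ("{0}\{1}\{2}" -f $now.Year, $now.Month, $now.Day)
New-Item -Path $dayFolder -ItemType Directory -Force | Out-Null
$random     = Get-Random -Minimum 1 -Maximum 10001            # int between 1 and 10000
$logFile    = Join-Path $dayFolder "log-$random.txt"
$txtReport  = Join-Path $dayFolder "report-$random.txt"
$htmlReport = Join-Path $dayFolder "report-$random.html"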

Note that the output and log file may be easily imported into Excel: copy the data into a single column, then use Data > Text to Columns with "#" as the delimiter.

monitor.ps1

SYNOPSIS

Intended to find duplicate files under a specified directory.

DESCRIPTION

Takes file extensions, a search location and a search depth, and attempts to find duplicate files with those extensions under the specified location. The search is recursive to the depth specified. E.G. 0 means just the top level, 1 includes sub-folders, 2 includes sub-folders within sub-folders, etc.

A log file is produced under LOG_LOCATION, and the status of the search is output to the console every MONITORING_FREQUENCY seconds.

PARAMETER FILE_EXTENSIONS

Accepts a string array of file extensions to be included in the search.

Default is *

Example: "txt","png"

PARAMETER SEARCH_LOCATION

Specifies target location for the search. Expected input is a string.

The string is expected to start with a drive letter. E.G.

C:\Testing

The following will crash the script:

\\localhost\c$\Testing
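
The known issues list notes that the drive lookup relies on SEARCH_LOCATION beginning with a drive letter, which is why a UNC path fails. A minimal sketch of that kind of lookup, with illustrative variables only:

# Sketch only: derive the drive letter from a path and report its free space.
$searchLocation = "C:\Testing"                      # illustrative
$driveLetter = $searchLocation.Substring(0, 1)      # fails for UNC paths like \\server\share
$drive = Get-PSDrive -Name $driveLetter
"{0:N2}/{1:N2}GB {2}: free space" -f ($drive.Free / 1GB), (($drive.Free + $drive.Used) / 1GB), $driveLetter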

PARAMETER SEARCH_DEPTH

Expects a string specifying search depth. This is converted to an INT and then used to determine which sublevels of directories are searched.

E.G. 1 means that C:\TargetLocation\Sub Dir\ would be searched but C:\TargetLocation\Sub Dir\Sub Dir\ would not be.

0 searches just the top-level directory

Default is *, meaning all sub directories are searched regardless of level

NOTE: Test parameters state that * should be default, hence the string datatype
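
A minimal sketch of how FILE_EXTENSIONS and SEARCH_DEPTH could translate into a file enumeration call, assuming Get-ChildItem's -Depth parameter; the parameter handling below is illustrative, not the script's own:

# Sketch only: enumerate files honouring the extension and depth parameters.
$FILE_EXTENSIONS = "txt", "png"        # illustrative values
$SEARCH_DEPTH    = "1"
$SEARCH_LOCATION = "C:\TargetLocation"
$include = $FILE_EXTENSIONS | ForEach-Object { "*.$_" }
if ($SEARCH_DEPTH -eq "*") {
    # * means unlimited recursion
    $files = Get-ChildItem -Path $SEARCH_LOCATION -File -Recurse -Include $include
} else {
    $files = Get-ChildItem -Path $SEARCH_LOCATION -File -Recurse -Depth ([int]$SEARCH_DEPTH) -Include $include
}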

PARAMETER LOG_LOCATION

Specifies location for a log to be saved to. Expects directory path:

C:\Hello\

Within this path a structure will be created in the format Year\Month\Day.

Then within this structure will be the following files:

  • log-RANDOM.txt
  • report-RANDOM.txt
  • report-RANDOM.html

PARAMETER MONITORING_FREQUENCY

Expects an int specifying the number of seconds between monitoring update outputs.

Default value is 10.

Usage

Clone repo

git clone https://github.com/Sudoblark/Get-Duplicate-Files.git

Change into dir

cd Get-Duplicate-Files

Run with the required params SEARCH_LOCATION and LOG_LOCATION, either from cmd:

C:\Get-Duplicate-Files>powershell.exe ".\monitor.ps1 -SEARCH_LOCATION C:\Testing -LOG_LOCATION C:\Testing"

Or PowerShell:

PS C:\Get-Duplicate-Files> .\monitor.ps1 -SEARCH_LOCATION C:\Testing -LOG_LOCATION C:\Testing

Monitor output will appear in the console and log files will be created under LOG_LOCATION as per monitoring messages.

Further ideas / Known issues

  • CSS/JS for html report to allow for report branding and a filterable table so that the report may be given to customers/SMEs for that particular storage location
  • Sometimes the "Duplicate file" string retains "Possible Duplicate:" in reports
  • Restructure text report into more readable format
  • Expand job to feed into another automated process to allow cleanup of duplicated files and/or front-end for human-verification
  • Expand information gathered in log and report files
  • If no files with the specified extension(s) are found, reports are still produced but no log file is; the log file is only written during duplicate hash checking, so with no files to process none is created
  • No error handling for expected errors inside the background job
  • Could do more validation on params and add help messages
  • Comment blocks for all functions
  • Param block for monitor.ps1 script
  • For drive lookup to work the SEARCH_LOCATION must begin with the drive letter, e.g. C:\
  • Script only works on Windows. Would like to change to work on both Unix and Windows OS
  • Update monitoring section for more accuracy
  • Take out functions inside $DuplicationCheck scriptblock and store these separately
