A PowerShell module and script to find duplicate files.
When monitor.ps1 is run, a background job is started which first scans for all files in the provided directory. It then enumerates those files and attempts to match duplicates based on their file hashes.
While the job runs, the main thread keeps tabs on its progress. While monitor.ps1 is running, output similar to the below will appear on screen:
PS C:\Users\Administrator\Documents\GitHub\Dev\Get-Duplicate-Files> .\monitor.ps1 -SEARCH_LOCATION C:\ -LOG_LOCATION C:\Testing\ -SEARCH_DEPTH 6 -FILE_EXTENSIONS "TXT"
09:10:2019:13:20:30 # CPU Usage: 35% # RAM Usage: 58% # 14.46/49.51GB C: free space # Scanning C:\ at depth 6
09:10:2019:13:20:41 # CPU Usage: 54% # RAM Usage: 58% # 14.46/49.51GB C: free space # 483\1415 files processed
09:10:2019:13:20:52 # CPU Usage: 42% # RAM Usage: 58% # 14.46/49.51GB C: free space # Processing file 1374\1415
09:10:2019:13:21:03 # CPU Usage: 26% # RAM Usage: 58% # 14.46/49.51GB C: free space # Creating reports
PS C:\Users\Administrator\Documents\GitHub\Dev\Get-Duplicate-Files>
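Conceptually, the job/monitor split looks like the sketch below. This is illustrative only, not the script's actual implementation: it assumes the script's $DuplicationCheck script block (mentioned under known issues) receives the search parameters, and the status line is simplified.

$job = Start-Job -ScriptBlock $DuplicationCheck -ArgumentList $SEARCH_LOCATION, $SEARCH_DEPTH
while ($job.State -eq 'Running') {
    # Print a timestamped status line, then wait MONITORING_FREQUENCY seconds
    Write-Host ("{0} # Job state: {1}" -f (Get-Date -Format 'dd:MM:yyyy:HH:mm:ss'), $job.State)
    Start-Sleep -Seconds 10
}
Receive-Job -Job $job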
In addition, each file scanned writes an entry to a log file in the specified directory, similar to the below:
09:10:2019:12:17:53 # File: C:\backups\HelloWorld.txt # Owner: BUILTIN\Administrators # Hash: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855# Same hash: no duplication found
09:10:2019:12:17:53 # File: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt # Owner: BUILTIN\Administrators # Hash: 4BFCDFC58DB942D403558D4188B02638A4A4908E6597308A3CB89839212EA6C6# Same hash: no duplication found
09:10:2019:12:17:54 # File: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\LICENSE.txt # Owner: BUILTIN\Administrators # Hash: 5BA21FBB0964F936AD7D15362D1ED6D4931CC8C8F9FF2D4D91190E109BE74431# Same hash: no duplication found
09:10:2019:12:17:54 # File: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\top_level.txt # Owner: BUILTIN\Administrators # Hash: CEEBAE7B8927A3227E5303CF5E0F1F7B34BB542AD7250AC03FBCDE36EC2F1508# Same hash: no duplication found
09:10:2019:12:17:54 # File: C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt # Owner: BUILTIN\Administrators # Hash: 4BFCDFC58DB942D403558D4188B02638A4A4908E6597308A3CB89839212EA6C6# Possible Duplicate: C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt
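The duplicate matching itself boils down to grouping files by hash. A minimal sketch using built-in cmdlets (not the script's exact code; the path, extension and algorithm here are examples, though the 64-character hashes above are consistent with SHA256):

Get-ChildItem -Path C:\ -File -Recurse -Include *.txt |
    Get-FileHash -Algorithm SHA256 |
    Group-Object -Property Hash |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group.Path }    # each group lists paths sharing one hash

The owner shown in each log line can be obtained with (Get-Acl -Path $file).Owner.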
Finally, two reports are produced in the specified log directory. The TXT report will be similar to the below:
##################################################
# Duplicate File Locator report for C:\ at depth *
# 3213 possible duplicates found
# File extensions searched: txt png
# Report generated 10/09/2019 12:19:45
##################################################
1 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt is possibly a duplicate of C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\entry_points.txt
2 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\LICENSE.txt is possibly a duplicate of C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\LICENSE.txt
3 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pip-19.0.3.dist-info\top_level.txt is possibly a duplicate of C:\DevItems\PythonEnv\SNOWAPI_UT\Lib\site-packages\pip-19.0.3.dist-info\top_level.txt
4 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pylint\test\functional\assignment_from_no_return_py3.txt is possibly a duplicate of C:\backups\HelloWorld.txt
5 -- C:\DevItems\Test\temp\.venv\Lib\site-packages\pylint\test\functional\control_pragmas.txt is possibly a duplicate of Possible
Duplicate: C:\backups\HelloWorld.txt
The HTML report, meanwhile, is a simple HTML page with all duplicates enclosed in a table element.
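A table like that can be produced with the built-in ConvertTo-Html cmdlet. The sketch below is illustrative, with $duplicates, File and DuplicateOf as hypothetical names and report-1234.html as an example filename:

$duplicates |
    Select-Object -Property File, DuplicateOf |
    ConvertTo-Html -Title 'Duplicate File Locator report' |
    Out-File -FilePath report-1234.html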
All files are created in the given directory under "YEAR/MONTH/DAY" with names similar to the below, where RANDOM is a randomly generated int between 1 and 10000 for that run:
- log-RANDOM.txt
- report-RANDOM.txt
- report-RANDOM.html
Note that the output and log files may be easily imported into Excel: paste the data into a single column, then use "Text to Columns" with "#" as the delimiter.
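The same split can be done directly in PowerShell; in the example below log-1234.txt and the header names are illustrative:

Import-Csv -Path .\log-1234.txt -Delimiter '#' -Header Time, File, Owner, Hash, Result |
    Format-Table -AutoSize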
SYNOPSIS
Intended to find duplicate files under a specified directory.
DESCRIPTION
Takes a file extension, search location and search depth, and attempts to find duplicate files with this extension under the specified location. The search is recursive to the depth specified, e.g. 0 means just the top level, 1 includes immediate sub-folders, 2 includes sub-folders of sub-folders, and so on.
A logfile is produced under LOG_LOCATION and the status of the search is output to the console every MONITORING_FREQUENCY seconds.
PARAMETER FILE_EXTENSIONS
Accepts a string array of file extensions to be included in the search.
Default is *
Example: "txt","png"
PARAMETER SEARCH_LOCATION
Specifies target location for the search. Expected input is a string.
String is expected to start with a drive letter, e.g.
C:\Testing
The following will crash the script:
\\localhost\c$\Testing
PARAMETER SEARCH_DEPTH
Expects a string specifying search depth. This is converted to an INT and then used to determine what sublevel of directories to search.
E.g. 1 means that C:\TargetLocation\Sub Dir\ would be searched but C:\TargetLocation\Sub Dir\Sub Dir\ would not be searched.
0 searches just the top-level directory
Default is *, meaning all sub directories are searched regardless of level
NOTE: Test parameters state that * should be default, hence the string datatype
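For illustration, PowerShell 5.0+ can express this depth limit natively with Get-ChildItem; whether monitor.ps1 uses -Depth or its own recursion is not shown here:

# -Depth 1 searches C:\TargetLocation and its immediate sub-folders only
Get-ChildItem -Path C:\TargetLocation -File -Recurse -Depth 1 -Include *.txt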
PARAMETER LOG_LOCATION
Specifies location for a log to be saved to. Expects directory path:
C:\Hello\
Within this path a structure will be created in the format of:
-- Year
--- Month
---- Day
Then within this structure will be the following files:
- log-RANDOM.txt
- report-RANDOM.txt
- report-RANDOM.html
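A minimal sketch of how such a structure and the random names could be built (illustrative, assuming the C:\Hello example path from above):

$now = Get-Date
$dayDir = Join-Path -Path 'C:\Hello' -ChildPath ('{0}\{1}\{2}' -f $now.Year, $now.Month, $now.Day)
New-Item -ItemType Directory -Path $dayDir -Force | Out-Null
$random  = Get-Random -Minimum 1 -Maximum 10001   # 1..10000 inclusive
$logFile = Join-Path -Path $dayDir -ChildPath "log-$random.txt"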
PARAMETER MONITORING_FREQUENCY
Expects an int specifying the number of seconds between monitoring update outputs.
Default value is 10.
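For reference, the CPU/RAM/disk figures in the monitoring line can be gathered with standard Windows cmdlets. This is an assumption about the approach, not the script's exact code, and Get-Counter paths are locale-dependent:

$cpu  = (Get-Counter '\Processor(_Total)\% Processor Time').CounterSamples.CookedValue
$os   = Get-CimInstance -ClassName Win32_OperatingSystem
$ram  = 100 - [math]::Round($os.FreePhysicalMemory / $os.TotalVisibleMemorySize * 100)
$disk = Get-PSDrive -Name C
'CPU Usage: {0:N0}% # RAM Usage: {1}% # {2:N2}/{3:N2}GB C: free space' -f $cpu, $ram, ($disk.Free / 1GB), (($disk.Free + $disk.Used) / 1GB)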
Clone repo
git clone https://github.com/Sudoblark/Get-Duplicate-Files.git
Change into dir
cd Get-Duplicate-Files
Run with the required params SEARCH_LOCATION and LOG_LOCATION, either from cmd:
C:\Get-Duplicate-Files>powershell.exe ".\monitor.ps1 -SEARCH_LOCATION C:\Testing -LOG_LOCATION C:\Testing"
Or PowerShell:
PS C:\Get-Duplicate-Files> .\monitor.ps1 -SEARCH_LOCATION C:\Testing -LOG_LOCATION C:\Testing
Monitor output will appear in the console and log files will be created under LOG_LOCATION as per monitoring messages.
Known issues and future improvements:
- CSS/JS for the html report to allow for report branding and a filterable table, so that the report may be given to customers/SMEs for that particular storage location
- Sometimes the "Duplicate file" string retains "Possible Duplicate:" in reports
- Restructure text report into more readable format
- Expand job to feed into another automated process to allow cleanup of duplicated files and/or front-end for human-verification
- Expand information gathered in log and report files
- If no files with the specified extension(s) are found, reports are still produced but no log file is; the log is written during duplicate hash checking, so with no files to check/process nothing is logged
- No error handling for expected errors inside the background job
- Could do more validation on params and add help messages
- Comment blocks for all functions
- Param block for monitor.ps1 script
- For drive lookup to work the SEARCH_LOCATION must begin with the drive letter, e.g. C:\
- Script only works on Windows; would like to change it to work on both Unix and Windows
- Update monitoring section for more accuracy
- Take out functions inside $DuplicationCheck scriptblock and store these separately