A plagiarism detection system that measures the similarity rate between a selected document and a set of documents provided as text files.
The program prints two distinct outputs for each document comparison:
- Similarity rate between the main and each compared document.
- The five most similar sentences in each document.
Primary objectives of the project, ordered by importance:
- Running speed (efficiency) of the algorithm.
- Similarity detection ability.
- Readability of the code.
We generated a .jar file because executing the Main class from the terminal was a bit problematic due to its dependencies on other classes. By default, the program finds the main document and the comparison folder if they are in the same directory as the .jar file. Move the .jar file to a directory that contains “main_doc.txt”, the document to be checked, and a “Documents” folder containing the text documents to compare against.
`java -jar termproject.jar`
The jar file can also be executed with the main document and the documents folder given as arguments.
`java -jar termproject.jar -main="main_doc.txt" -compare="Documents/"`
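A minimal sketch of how the two flags and their defaults could be handled. The class name `CliArgsSketch` and its methods are hypothetical and are not taken from the project's source; only the flag names and the default file and folder names come from the commands above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the project's actual parser.
final class CliArgsSketch {
    static final String DEFAULT_MAIN = "main_doc.txt";
    static final String DEFAULT_COMPARE = "Documents/";

    /** Returns a map with keys "main" and "compare", falling back to the defaults. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("main", DEFAULT_MAIN);
        options.put("compare", DEFAULT_COMPARE);
        for (String arg : args) {
            if (arg.startsWith("-main=")) {
                options.put("main", stripQuotes(arg.substring("-main=".length())));
            } else if (arg.startsWith("-compare=")) {
                options.put("compare", stripQuotes(arg.substring("-compare=".length())));
            }
        }
        return options;
    }

    // Most shells remove the surrounding double quotes before Java sees the
    // argument; this only handles the case where they are passed literally.
    private static String stripQuotes(String value) {
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            return value.substring(1, value.length() - 1);
        }
        return value;
    }
}
```

Calling `CliArgsSketch.parse(args)` with no arguments yields the default names, which matches the default behaviour described above.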
Each phase of the program, without going into much detail:
- Receive the main file and the documents folder as arguments.
- Read all files into Strings.
- Compare the content of the main file with each document.
- Print the document similarity rate and the five most similar sentences.
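The phases above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the class `PipelineSketch`, the sentence splitting on `.`, `!` and `?`, and the word-overlap (Jaccard) score are all placeholders chosen for brevity. The project itself uses Rabin-Karp matching, and this naive scoring is quadratic in the number of sentences.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the read / compare / report phases.
final class PipelineSketch {

    /** Reads every regular file in a folder into a String, keyed by file name. */
    static Map<String, String> readAll(Path folder) throws IOException {
        Map<String, String> contents = new LinkedHashMap<>();
        List<Path> files;
        try (Stream<Path> stream = Files.list(folder)) {
            files = stream.sorted().collect(Collectors.toList());
        }
        for (Path file : files) {
            if (Files.isRegularFile(file)) {
                contents.put(file.getFileName().toString(),
                        new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            }
        }
        return contents;
    }

    /** Splits text into trimmed, non-empty sentences (assumed delimiters: . ! ?). */
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.split("[.!?]+")) {
            String sentence = part.trim();
            if (!sentence.isEmpty()) result.add(sentence);
        }
        return result;
    }

    /** Placeholder score: Jaccard overlap of the lower-cased word sets. */
    static double sentenceScore(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> common = new HashSet<>(wordsA);
        common.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    /** Best score of one main-document sentence against all sentences of the other document. */
    static double bestScore(String mainSentence, List<String> otherSentences) {
        double best = 0.0;
        for (String other : otherSentences) best = Math.max(best, sentenceScore(mainSentence, other));
        return best;
    }

    /** Similarity rate: average best score over the main document's sentences (0.0 to 1.0). */
    static double similarityRate(String mainText, String otherText) {
        List<String> mainSentences = sentences(mainText);
        List<String> otherSentences = sentences(otherText);
        if (mainSentences.isEmpty()) return 0.0;
        double total = 0.0;
        for (String sentence : mainSentences) total += bestScore(sentence, otherSentences);
        return total / mainSentences.size();
    }

    /** The k main-document sentences that score highest against the other document. */
    static List<String> topSentences(String mainText, String otherText, int k) {
        List<String> otherSentences = sentences(otherText);
        List<String> ranked = new ArrayList<>(sentences(mainText));
        ranked.sort(Comparator.comparingDouble((String s) -> bestScore(s, otherSentences)).reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }
}
```

A driver would call `readAll` on the documents folder, then print `similarityRate(mainText, text)` and `topSentences(mainText, text, 5)` for each entry.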
Our first design was a naïve brute-force solution with O(m*n) complexity. After implementing the Rabin-Karp string searching algorithm, we achieved O(n+m) average-case complexity with an average execution time of 501 ms. The average execution time was measured with a single main document and three comparison documents.
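For readers unfamiliar with Rabin-Karp, here is a minimal sketch of the rolling-hash search. It is a textbook version, not the project's implementation; the class name `RabinKarpSketch` and the constants `BASE` and `MOD` are assumptions made for illustration. Matching hashes are verified character by character, so a hash collision cannot produce a false match.

```java
// Textbook Rabin-Karp substring search; illustrative only.
final class RabinKarpSketch {
    private static final int BASE = 256;             // assumed radix for the polynomial hash
    private static final long MOD = 1_000_000_007L;  // assumed large prime modulus

    /** Returns the index of the first occurrence of pattern in text, or -1 if absent. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // BASE^(m-1) mod MOD, used to remove the leading character when sliding.
        long highPow = 1;
        for (int i = 1; i < m; i++) highPow = (highPow * BASE) % MOD;

        // Hash the pattern and the first window of the text.
        long patternHash = 0;
        long windowHash = 0;
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }

        for (int i = 0; ; i++) {
            // Compare characters only when the hashes agree.
            if (patternHash == windowHash && text.regionMatches(i, pattern, 0, m)) return i;
            if (i + m >= n) return -1;
            // Slide the window: drop text[i], append text[i + m], in O(1).
            windowHash = (windowHash - text.charAt(i) * highPow % MOD + MOD) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
        }
    }
}
```

Each window hash is derived from the previous one in constant time, which is where the O(n+m) average case comes from. With many hash collisions the verification step can degrade the search to O(m*n) in the worst case.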
MIT License