Large changes interrupted by single words #16
Comments
@shiftypenguin Thank you for your feedback. I think I understand your situation, but I'm not sure.
After further investigation, it doesn't seem to be doing exactly what I thought it was doing. I'm not sure exactly what it's doing, but I can change something completely unrelated to the oddly diffed text, and the output fixes itself. Here are two files that produce the (poorly) described behavior. origin-excerpt.txt And here, I've removed some seemingly unrelated text, and it seems to produce saner results.
Thank you for the example files. Now I understand the situation better. Even when the --word option is given, DocDiff examines the input text like this
I guess that's why the output gets harder for a human to read when lines … Hmm. I have no good idea for solving this easily for now.
Would it be possible to process it as one long string and not differentiate between lines?
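To make the "one long string" idea concrete, here is a minimal sketch in Python. It is not DocDiff's code; it uses the standard library's `difflib` as a stand-in, and the function name `word_diff_ignoring_lines` is made up for illustration. It assumes that splitting on any whitespace is an acceptable way to get words.

```python
import difflib


def word_diff_ignoring_lines(old_text, new_text):
    """Diff two texts word by word, treating line breaks as plain whitespace."""
    # str.split() with no argument splits on any whitespace, newlines included,
    # so the position of line breaks has no effect on the comparison.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of 'equal', 'replace', 'delete', 'insert'
        result.append((tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return result
```

With this approach, re-wrapping a paragraph would not show up as a change at all, at the cost of losing the original line structure in the output.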
Or, what you could do is get the edit distance between corresponding paragraphs. When the ratio of deleted or replaced words to the original paragraph's word count is too high (say 50% is the cutoff), check the origin paragraph against each paragraph in the target file, starting from the one where it first failed to meet the ratio, and see if any of them pass. If none do, the paragraph is deleted. If one does, the target paragraphs that failed the ratio test before it are added lines. It's two in the morning where I am, so what I said might be confusing. I might draw up a flowchart in the morning. I also can't read Ruby, so I apologize if this is already how you've done this. I also can't check this against the example I gave you. I'll have to do that in the morning.
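The paragraph-matching idea above could be sketched roughly as follows, in Python rather than Ruby. This is only one reading of the description, not how DocDiff works: `changed_ratio` and `align_paragraphs` are hypothetical names, the inputs are assumed to be lists of paragraph strings, and `difflib` stands in for a real edit-distance routine.

```python
import difflib


def changed_ratio(origin_words, target_words):
    """Fraction of the origin paragraph's words that were deleted or replaced."""
    if not origin_words:
        return 0.0
    matcher = difflib.SequenceMatcher(None, origin_words, target_words, autojunk=False)
    lost = sum(i2 - i1
               for tag, i1, i2, _, _ in matcher.get_opcodes()
               if tag in ("delete", "replace"))
    return lost / len(origin_words)


def align_paragraphs(origin, target, cutoff=0.5):
    """Classify paragraphs as ('match', o, t), ('deleted', o, None) or ('added', None, t)."""
    result = []
    j = 0  # index of the next target paragraph that has not been consumed yet
    for para in origin:
        words = para.split()
        # Scan forward from the current target paragraph for the first one
        # whose changed ratio stays at or below the cutoff.
        hit = next((k for k in range(j, len(target))
                    if changed_ratio(words, target[k].split()) <= cutoff), None)
        if hit is None:
            # No target paragraph is close enough: treat the origin paragraph as deleted.
            result.append(("deleted", para, None))
            continue
        # Target paragraphs skipped during the scan are treated as added.
        result.extend(("added", None, t) for t in target[j:hit])
        result.append(("match", para, target[hit]))
        j = hit + 1
    # Anything left over in the target was added at the end.
    result.extend(("added", None, t) for t in target[j:])
    return result
```

Each `('match', o, t)` pair could then be handed to the word-level diff, so that an inserted or removed paragraph no longer drags unrelated text into the comparison. The 50% cutoff is the value suggested in the comment and would likely need tuning.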
When using the word resolution (possibly character resolution too; I haven't tested it), I often find large changed paragraphs split by single unchanged words, which makes the changes harder to discern. I was thinking of merging the changes surrounding the single word and including that same word in both the deletion and the addition. Whether to merge could be determined by the size of the surrounding changes relative to the isolated word.
For instance, a single word surrounded on both sides by single-word replacements wouldn't get merged into the changes, but a single word surrounded by five-word changes on either side would be merged into them, creating just one change rather than two divided by an unchanged word.
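As a rough illustration of the merging rule, here is a small Python sketch built on the standard library's `difflib`, not on DocDiff itself. The function name `merge_isolated_words` and the `factor` parameter are assumptions for this example; it also assumes that both neighbouring changes must be at least `factor` times as long as the unchanged run, which is one possible reading of "on either side".

```python
import difflib


def merge_isolated_words(old_words, new_words, factor=5):
    """Return (tag, old_slice, new_slice) tuples, where tag is 'equal' or 'change'.

    An unchanged run sitting between two changes is absorbed into them when
    both changes are at least `factor` times as long as the run.
    """
    ops = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False).get_opcodes()
    # Collapse replace/delete/insert into a single 'change' tag.
    spans = [["equal" if tag == "equal" else "change", i1, i2, j1, j2]
             for tag, i1, i2, j1, j2 in ops]

    def change_size(span):
        # Size of a change: the larger of the words removed and the words added.
        return max(span[2] - span[1], span[4] - span[3])

    out = []
    k = 0
    while k < len(spans):
        span = spans[k]
        is_island = (span[0] == "equal" and out
                     and out[-1][0] == "change" and k + 1 < len(spans))
        if is_island:
            before, after = out[-1], spans[k + 1]
            run = span[2] - span[1]
            if min(change_size(before), change_size(after)) >= factor * run:
                # Extend the previous change over the unchanged run and the next change.
                before[2], before[4] = after[2], after[4]
                k += 2
                continue
        out.append(span)
        k += 1
    return [(tag, old_words[i1:i2], new_words[j1:j2]) for tag, i1, i2, j1, j2 in out]
```

With `factor=5`, a lone unchanged word between two five-word replacements ends up inside one combined change (and so appears in both the deleted and the added text), while the same word between two single-word replacements stays separate.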