Large changes interrupted by single words #16

Open
shiftypenguin opened this issue Mar 10, 2016 · 5 comments

Comments

@shiftypenguin

When using the word resolution (possibly character too, I haven't tested it), I often find large changed paragraphs interrupted by single unchanged words, which makes the changes harder to discern. I was thinking of merging the changes surrounding the single word and including that word in both the deletion and the addition. Whether to merge could be determined by the size of the surrounding changes relative to the isolated word.

For instance, a single word surrounded on both sides by single-word replacements wouldn't get merged into the changes, but a single word surrounded by five-word changes on either side would be merged into them, creating just one change rather than two divided by an unchanged word.
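
A minimal sketch of the merging rule described above, written in Python with the standard `difflib` module rather than in DocDiff's own Ruby. The function name `merge_isolated_words`, the helper `size`, and the `ratio` threshold of 0.5 are assumptions made for illustration; nothing here is part of DocDiff.

```python
from difflib import SequenceMatcher


def size(hunk):
    # Size of a change: the longer of its deleted and added sides.
    return max(len(hunk[1]), len(hunk[2]))


def merge_isolated_words(old_words, new_words, ratio=0.5):
    """Word diff that absorbs a short unchanged run into its neighbours.

    An unchanged run is merged when its length is at most `ratio` times
    the size of the smaller neighbouring change (hypothetical rule).
    """
    sm = SequenceMatcher(None, old_words, new_words, autojunk=False)
    hunks = [[tag, old_words[i1:i2], new_words[j1:j2]]
             for tag, i1, i2, j1, j2 in sm.get_opcodes()]
    out = []
    k = 0
    while k < len(hunks):
        h = hunks[k]
        nxt = hunks[k + 1] if k + 1 < len(hunks) else None
        if (h[0] == "equal" and out and out[-1][0] != "equal"
                and nxt is not None and nxt[0] != "equal"
                and len(h[1]) <= ratio * min(size(out[-1]), size(nxt))):
            # Fold the unchanged words and the following change into the
            # previous change; the words then appear on both sides.
            out[-1][0] = "replace"
            out[-1][1] = out[-1][1] + h[1] + nxt[1]
            out[-1][2] = out[-1][2] + h[2] + nxt[2]
            k += 2
        else:
            out.append(h)
            k += 1
    return out


old = "a1 a2 a3 a4 a5 the b1 b2 b3 b4 b5".split()
new = "x1 x2 x3 x4 x5 the y1 y2 y3 y4 y5".split()
print(len(merge_isolated_words(old, new)))  # 1: one merged change
print(len(merge_isolated_words("a the b".split(), "x the y".split())))  # 3: left alone
```

With this threshold, an isolated word between two five-word changes is merged, while an isolated word between two single-word replacements is kept as its own unchanged hunk.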

@hisashim
Owner

@shiftypenguin Thank you for your feedback, and thank you for
using DocDiff. Sorry for the long delay in replying.

I think I understand your situation, but I'm not sure.
Could you give me a concrete example of input and output text that
shows the problem? I'd appreciate it a lot.

@shiftypenguin
Author

After further investigation, it doesn't seem to be doing exactly what I thought it was doing. I'm not sure exactly what it's doing, but I can change something that's completely unrelated to the weirdly differenced text, and it'll fix itself.

Here are two files that produce the (poorly) described behavior.

origin-excerpt.txt
edited-linefix-excerpt.txt

And here, I've removed some seemingly unrelated text, and it seems to produce more sane results.

origin-excerpt-good.txt
edited-linefix-excerpt-good.txt

@hisashim
Owner

Thank you for the example files. Now I understand the situation better.

Even when the --word option is given, DocDiff processes the input text
in two steps, mainly for efficiency:

  1. First, compare the texts line by line.
  2. Then, compare the lines that differ (roughly, paragraphs) word by word.

I guess that's why the output gets harder for humans to read when the
lines (paragraphs) themselves differ, as in origin-excerpt.txt and
edited-linefix-excerpt.txt.
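
To make the two steps concrete, here is a rough Python sketch of a line-first, word-second comparison using the standard `difflib` module. It is not DocDiff's actual Ruby code; the names `opcodes` and `two_pass_diff` are hypothetical, and the real implementation may differ in how it tokenizes and pairs lines.

```python
from difflib import SequenceMatcher


def opcodes(a, b):
    # Edit operations that turn sequence a into sequence b.
    return SequenceMatcher(None, a, b, autojunk=False).get_opcodes()


def two_pass_diff(old_text, new_text):
    """Return (tag, old_words, new_words) hunks from a two-pass comparison."""
    old_lines, new_lines = old_text.splitlines(), new_text.splitlines()
    result = []
    # Pass 1: compare whole lines.
    for tag, i1, i2, j1, j2 in opcodes(old_lines, new_lines):
        old_words = " ".join(old_lines[i1:i2]).split()
        new_words = " ".join(new_lines[j1:j2]).split()
        if tag != "replace":
            result.append((tag, old_words, new_words))
            continue
        # Pass 2: only blocks of differing lines are compared word by word.
        for wtag, a1, a2, b1, b2 in opcodes(old_words, new_words):
            result.append((wtag, old_words[a1:a2], new_words[b1:b2]))
    return result


for hunk in two_pass_diff("same line\nthe quick brown fox\n",
                          "same line\nthe slow brown fox\n"):
    print(hunk)
```

Because the word-level pass only sees whatever block of lines the first pass paired up, a change in where lines break can alter that pairing and therefore the word-level output.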

Hmm. I don't have a good idea for solving this easily right now.

@shiftypenguin
Author

Would it be possible to process it as one long string and not differentiate between lines?
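
As an illustration of this idea, the Python sketch below splits each text into words across line boundaries and diffs the two word lists directly. The function name `whole_text_word_diff` is hypothetical, and this is only a sketch of the suggestion, not something DocDiff is known to do.

```python
from difflib import SequenceMatcher


def whole_text_word_diff(old_text, new_text):
    """Diff two texts as flat word lists, ignoring where the lines break."""
    old_words, new_words = old_text.split(), new_text.split()
    sm = SequenceMatcher(None, old_words, new_words, autojunk=False)
    return [(tag, old_words[i1:i2], new_words[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]


old = "the quick brown\nfox jumps over\nthe lazy dog"
new = "the quick brown fox\njumps over the sleepy dog"
print(whole_text_word_diff(old, new))
# [('replace', ['lazy'], ['sleepy'])]
```

The re-wrapped line breaks no longer matter here, but the comparison now runs over every word in both files, which is the cost the line-first approach is meant to avoid.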

@shiftypenguin
Author

Or you could compute the edit distance between corresponding paragraphs. When the ratio of deleted or replaced words to the original paragraph's word count is too high (say 50% is the cutoff), check the origin paragraph against each paragraph in the target file, starting from the one where it first failed the ratio test, and see if any of them pass. If none do, the origin paragraph was deleted. If one does, the target paragraphs it failed against before that match are added lines.
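
One possible reading of this proposal, sketched in Python with the standard `difflib` module. The names `changed_ratio` and `align_paragraphs`, the result tags, and the greedy forward scan are assumptions made for illustration, not an existing DocDiff feature.

```python
from difflib import SequenceMatcher


def changed_ratio(old_par, new_par):
    """Fraction of the old paragraph's words that were deleted or replaced."""
    old_words, new_words = old_par.split(), new_par.split()
    if not old_words:
        return 0.0 if not new_words else 1.0
    sm = SequenceMatcher(None, old_words, new_words, autojunk=False)
    changed = sum(i2 - i1 for tag, i1, i2, _, _ in sm.get_opcodes()
                  if tag in ("delete", "replace"))
    return changed / len(old_words)


def align_paragraphs(old_pars, new_pars, cutoff=0.5):
    """Greedily pair each old paragraph with the next similar new one."""
    result = []
    j = 0  # first target paragraph not yet accounted for
    for old in old_pars:
        match = next((k for k in range(j, len(new_pars))
                      if changed_ratio(old, new_pars[k]) <= cutoff), None)
        if match is None:
            # No remaining target paragraph passes: treat as deleted.
            result.append(("deleted", old, None))
            continue
        # Target paragraphs skipped before the match count as added.
        result.extend(("added", None, new_pars[k]) for k in range(j, match))
        result.append(("matched", old, new_pars[match]))
        j = match + 1
    result.extend(("added", None, p) for p in new_pars[j:])
    return result


old_pars = ["alpha beta gamma delta", "one two three four"]
new_pars = ["brand new paragraph here", "alpha beta gamma epsilon",
            "one two three four"]
for row in align_paragraphs(old_pars, new_pars):
    print(row)
```

Each "matched" pair could then be handed to a word-level diff. Since the scan is greedy, an early loose match can hide a better one later in the file, so the cutoff would need tuning on real inputs.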

It's two in the morning where I am, so what I said might be confusing. I might draw up a flowchart in the morning. I also can't read Ruby, so I apologize if this is already how you've done this. I also can't check this against the example I gave you. I'll have to do that in the morning.


2 participants