Large changes interrupted by single words #16

shiftypenguin · 2016-03-10T01:17:22Z

When using word resolution (possibly character resolution too; I haven't tested it), I often find large changes split apart by single unchanged words, which makes the changes harder to discern. I was thinking the changes surrounding the single word could be merged into one, with that word appearing in both the deletion and the addition. Whether to merge could be determined by the size of the surrounding changes relative to the isolated word.

For instance, a single word surrounded on both sides by single-word replacements wouldn't get merged into the changes, but a single word surrounded by five-word changes on either side would be merged into them, creating just one change rather than two divided by an unchanged word.
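
The merging rule described above could be sketched roughly as follows. This is only an illustration in Python using the standard `difflib` module, not DocDiff's code; the names `word_diff` and `merge_isolated` and the thresholds `max_gap` and `min_neighbor` are made up for the example.

```python
from difflib import SequenceMatcher


def word_diff(old, new):
    """Return a list of (tag, old_words, new_words); tag is 'equal' or 'change'."""
    a, b = old.split(), new.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Collapse replace/delete/insert into a single 'change' tag.
        ops.append(("equal" if tag == "equal" else "change", a[i1:i2], b[j1:j2]))
    return ops


def change_size(op):
    """Size of a change: the larger of its deleted and added word counts."""
    return max(len(op[1]), len(op[2]))


def merge_isolated(ops, max_gap=1, min_neighbor=5):
    """Absorb short unchanged runs that sit between two large changes.

    max_gap and min_neighbor are hypothetical thresholds: an unchanged run of
    at most max_gap words is merged only if both neighbouring changes span at
    least min_neighbor words.
    """
    out = []
    i = 0
    while i < len(ops):
        op = ops[i]
        if (op[0] == "equal" and len(op[1]) <= max_gap
                and out and out[-1][0] == "change"
                and i + 1 < len(ops) and ops[i + 1][0] == "change"
                and change_size(out[-1]) >= min_neighbor
                and change_size(ops[i + 1]) >= min_neighbor):
            prev, nxt = out.pop(), ops[i + 1]
            # The unchanged word goes into both the deletion and the addition.
            out.append(("change",
                        prev[1] + op[1] + nxt[1],
                        prev[2] + op[2] + nxt[2]))
            i += 2
        else:
            out.append(op)
            i += 1
    return out
```

A caller would run `merge_isolated(word_diff(old_text, new_text))`. Note that a merged change counts as large afterwards, so a chain of large changes separated by single words collapses into one change.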

hisashim · 2016-03-16T17:12:22Z

@shiftypenguin Thank you for your feedback, and thank you for
using DocDiff. Sorry for the long delay in replying.

I think I understand your situation, but I'm not entirely sure.
Could you give me a concrete example of input and output text to
describe the problem? I'd appreciate it a lot.

shiftypenguin · 2016-03-18T04:47:52Z

After further investigation, it doesn't seem to be doing exactly what I thought it was doing. I'm not sure exactly what it is doing, but if I change something completely unrelated to the weirdly diffed text, the output fixes itself.

Here are two files that produce the (poorly) described behavior.

origin-excerpt.txt
edited-linefix-excerpt.txt

And here, I've removed some seemingly unrelated text, and it seems to produce more sane results.

origin-excerpt-good.txt
edited-linefix-excerpt-good.txt

hisashim · 2016-03-21T12:12:25Z

Thank you for the example files. Now I understand the situation better.

Even when the --word option is given, DocDiff examines the input text
in two steps, mainly for efficiency:

First, compare the texts line by line.
Then, compare the differing lines (≈ paragraphs) word by word.
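
For readers who want to see the shape of that two-step comparison, here is a minimal sketch in Python using the standard `difflib` module. It is not DocDiff's Ruby implementation; the function name `two_stage_diff` and the output format are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def two_stage_diff(old_text, new_text):
    """Compare line by line first, then word by word inside differing lines.

    Returns (tag, old_items, new_items) tuples. Items are whole lines for
    unchanged regions and single words inside changed regions.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    result = []
    line_matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in line_matcher.get_opcodes():
        if tag == "equal":
            # Step 1: identical lines are accepted without a word-level look.
            result.append(("equal", old_lines[i1:i2], new_lines[j1:j2]))
            continue
        # Step 2: only the differing lines are compared word by word.
        old_words = " ".join(old_lines[i1:i2]).split()
        new_words = " ".join(new_lines[j1:j2]).split()
        word_matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
        for wtag, a1, a2, b1, b2 in word_matcher.get_opcodes():
            result.append((wtag, old_words[a1:a2], new_words[b1:b2]))
    return result
```

Because the word-level pass only ever sees the lines that step 1 paired up, a misaligned line pairing in step 1 leads to a noisy word diff in step 2.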

I guess that's why the output gets harder for humans to read when lines
(paragraphs) differ, as in origin-excerpt.txt and
edited-linefix-excerpt.txt.

Hmm. I don't have a good idea for solving this easily right now.

shiftypenguin · 2016-03-21T21:00:51Z

Would it be possible to process it as one long string and not differentiate between lines?
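
As a rough illustration of the one-long-string idea, the sketch below flattens each text into a single word list before diffing, so line breaks play no role. It is Python with the standard `difflib` module, not DocDiff code, and `whole_text_word_diff` is a made-up name.

```python
from difflib import SequenceMatcher


def whole_text_word_diff(old_text, new_text):
    """Diff two texts word by word, ignoring where the line breaks fall."""
    # str.split() with no arguments treats newlines like any other whitespace.
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    return [(tag, old_words[i1:i2], new_words[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]
```

With this approach, moving a line break by itself produces no change at all. The trade-off is cost: the matcher now works on every word of both files at once rather than on a few differing lines.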

shiftypenguin · 2016-03-24T06:11:08Z

Or you could compute the edit distance between corresponding paragraphs. When the ratio of deleted or replaced words to the original paragraph's word count is too high (say 50% is the cutoff), check the origin paragraph against each paragraph in the target file, starting from where it first failed the ratio test, and see if any of them pass. If none do, the paragraph was deleted. If one does, the target paragraphs that failed the ratio test before it are added lines.

It's two in the morning where I am, so what I said might be confusing. I might draw up a flowchart in the morning. I also can't read Ruby, so I apologize if this is already how you've done it. I also can't check this against the example I gave you right now; I'll have to do that in the morning.
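
In place of a flowchart, the paragraph-matching idea above might look something like this in Python. It is an untested-against-DocDiff sketch under stated assumptions: `changed_ratio` and `align_paragraphs` are hypothetical names, the 50% `cutoff` comes from the suggestion above, and the "edit distance" is approximated by counting the words that `difflib` fails to match.

```python
from difflib import SequenceMatcher


def changed_ratio(old_par, new_par):
    """Fraction of the old paragraph's words that were deleted or replaced."""
    old_words, new_words = old_par.split(), new_par.split()
    if not old_words:
        return 0.0 if not new_words else 1.0
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(old_words)


def align_paragraphs(old_pars, new_pars, cutoff=0.5):
    """Pair each origin paragraph with the first target paragraph that passes.

    Returns ('matched', old, new), ('deleted', old, None) and
    ('added', None, new) tuples.
    """
    result = []
    j = 0  # index of the first target paragraph not yet consumed
    for old in old_pars:
        match = None
        for k in range(j, len(new_pars)):
            if changed_ratio(old, new_pars[k]) <= cutoff:
                match = k
                break
        if match is None:
            # No target paragraph is similar enough: treat as deleted.
            result.append(("deleted", old, None))
            continue
        # Target paragraphs skipped on the way to the match are additions.
        for added in new_pars[j:match]:
            result.append(("added", None, added))
        result.append(("matched", old, new_pars[match]))
        j = match + 1
    # Whatever remains in the target was never matched: additions.
    for added in new_pars[j:]:
        result.append(("added", None, added))
    return result
```

Only the `matched` pairs would then go on to a word-by-word comparison. The forward search is greedy, so in the worst case it compares every origin paragraph against every remaining target paragraph.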
