Detect common substrings #1582

BeritJanssen · 2022-11-30T14:51:43Z

BeritJanssen
Nov 30, 2022
Maintainer

For some corpora, it may be interesting to see whether documents have overlap in their (verbatim) text, which could be used to detect quotes or plagiarism.

lukavdplas · 2022-12-02T11:43:02Z

lukavdplas
Dec 2, 2022
Maintainer

I like this! Some thoughts:

A modest implementation could be to let users find other occurrences of specific strings. They could select a segment they find interesting and easily look up where else it occurs.

This is really just an interface for a normal match phrase query, but could work as a more natural exploration of cited / copied text.
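As a rough illustration of how little the interface would need to do, here is a minimal sketch that turns a user's selection into a match phrase request body. It assumes an Elasticsearch-style query DSL; the function name `build_phrase_query` and the field name `content` are hypothetical and would depend on the corpus.

```python
def build_phrase_query(selected_text: str, field: str = "content") -> dict:
    """Build a request body that matches the selected text as an exact phrase.

    `field` is a hypothetical name for the corpus' main text field.
    """
    # Collapse whitespace so a selection that spans line breaks still matches.
    phrase = " ".join(selected_text.split())
    return {"query": {"match_phrase": {field: phrase}}}


# Example: a selection copied from the document view, including a line break.
body = build_phrase_query("to be,\n   or not to be")
```

The resulting `body` could be sent with whatever search client the application already uses.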

A more involved implementation could allow users to see which passages in a document are found elsewhere, for example in a kind of "heatmap":

This could then be combined with the selection feature described above. (The picture here uses sentences as a unit, but you could also base this on, say, 4-grams, which may be more informative.)

You could generate such a heat map live: run a match phrase search for every substring you want to check (e.g. every sentence) and note the number of results.
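A minimal sketch of that loop, under some assumptions: the sentence splitter is deliberately naive, and `count_hits` is a hypothetical callable standing in for "run a match phrase search and return the number of results". Here it is replaced by a plain substring check over a small in-memory list so the example runs without a search backend.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def heat_map(text: str, count_hits: Callable[[str], int]) -> list[tuple[str, int]]:
    """Pair every sentence in `text` with the number of hits found for it."""
    return [(sentence, count_hits(sentence)) for sentence in split_sentences(text)]


# Stand-in for the search backend: count documents containing the sentence verbatim.
corpus = [
    "It was a dark night. The rain fell.",
    "He wrote: It was a dark night.",
]


def count_in_corpus(sentence: str) -> int:
    return sum(sentence in document for document in corpus)


scores = heat_map("It was a dark night. Nobody came.", count_in_corpus)
```

Note that with a real index the document being viewed would normally match its own sentences, so the counts would need to exclude it. One request per sentence may also be slow for long documents.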

Another use case would be to select a document and find documents that have many similar passages. I'm sure this could be done with some clever search requests as well.
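To make the idea concrete, here is a small in-memory sketch that ranks documents by the number of word 4-grams they share with the selected document. This is not the search-request approach mentioned above, only a stand-in that shows what "many similar passages" could mean; the function names and the lowercase/whitespace tokenisation are assumptions.

```python
def word_ngrams(text: str, n: int = 4) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def rank_by_shared_ngrams(
    selected: str, others: dict[str, str], n: int = 4
) -> list[tuple[str, int]]:
    """Rank other documents by how many n-grams they share with `selected`."""
    target = word_ngrams(selected, n)
    scores = {
        doc_id: len(target & word_ngrams(text, n)) for doc_id, text in others.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Drop documents without any overlap.
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


others = {
    "a": "she said the quick brown fox jumps high",
    "b": "nothing in common here at all",
}
ranking = rank_by_shared_ngrams("the quick brown fox jumps over the lazy dog", others)
```

Comparing every pair of documents this way would not scale to a large corpus, which is why doing it through the search index is the more realistic route.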

oktaal · 2022-12-07T14:25:37Z

oktaal
Dec 7, 2022
Maintainer

Some overlap with #958
