Detect common substrings #1582
Replies: 2 comments
-
I like this! Some thoughts:

A modest implementation could let users find other occurrences of specific strings. They could select a segment they deem interesting and easily find where else it occurs. This is really just an interface for a normal match phrase query, but it could work as a more natural way to explore cited or copied text.

A more involved implementation could let users see which passages in a document are found elsewhere, for example in a kind of "heatmap", which could then be combined with the selection feature above. (Sentences are the obvious unit, but you could also base this on, say, 4-grams, which may be more informative.) You could generate such a heat map live: run a match phrase search for every substring you want to check (e.g. every sentence) and note the number of results.

Another use case would be to select a document and find other documents that have many similar passages. I'm sure this could be done with some clever search requests as well.
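To make the heat map idea concrete, here is a minimal sketch under stated assumptions: the corpus is a plain in-memory dict, and a simple substring test stands in for the match phrase query that a real search backend would run. The names `split_sentences`, `sentence_heatmap`, and the sample `corpus` are hypothetical and only for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter; a real system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_heatmap(doc_id, corpus):
    """Return (sentence, hits) pairs for the document `doc_id`.

    `hits` is the number of *other* documents containing the sentence
    verbatim. The substring test is an assumption standing in for one
    match phrase query per sentence.
    """
    heat = []
    for sentence in split_sentences(corpus[doc_id]):
        hits = sum(
            1
            for other_id, other_text in corpus.items()
            if other_id != doc_id and sentence in other_text
        )
        heat.append((sentence, hits))
    return heat


# Hypothetical sample data.
corpus = {
    "a": "The quick brown fox jumps. It was a cold day. Nothing else happened.",
    "b": "Someone wrote: The quick brown fox jumps. Then they left.",
    "c": "It was a cold day. The quick brown fox jumps. The end.",
}

for sentence, hits in sentence_heatmap("a", corpus):
    print(hits, sentence)
```

The hit count per sentence is what you would map to a colour intensity in the interface. Against a search index this is one query per sentence, so longer documents may need batching or caching.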
-
Some overlap with #958
-
For some corpora, it may be interesting to see whether documents have overlap in their (verbatim) text, which could be used to detect quotes or plagiarism.
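As a rough illustration of detecting verbatim overlap between documents, the sketch below counts shared word 4-grams and ranks the other documents by that count. It is only one possible approach, not a proposed implementation: the in-memory dict, the names `word_ngrams` and `rank_by_overlap`, and the sample texts are all assumptions made for the example.

```python
import re


def word_ngrams(text, n=4):
    """Return the set of word n-grams (as tuples) in `text`, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def rank_by_overlap(doc_id, corpus, n=4):
    """Rank all other documents by the number of n-grams shared with `doc_id`."""
    target = word_ngrams(corpus[doc_id], n)
    scores = []
    for other_id, text in corpus.items():
        if other_id == doc_id:
            continue
        shared = len(target & word_ngrams(text, n))
        scores.append((other_id, shared))
    # Highest overlap first; ties keep their original order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data.
corpus = {
    "report": "the committee found that the budget was exceeded by a wide margin",
    "blog": "as noted, the committee found that the budget was exceeded last year",
    "memo": "lunch is served at noon in the main hall",
}

print(rank_by_overlap("report", corpus))
```

A high count of shared 4-grams suggests a quoted or copied passage, while a count of zero means no run of four consecutive words is shared. Comparing every pair of documents this way does not scale to a large corpus, so in practice the n-grams would probably need to be indexed.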