English detector fails when checking czech text #23

computerphysicslab · 2020-09-02T12:10:31Z

The English detector fails when checking Czech text:

package main

import (
	"fmt"

	"github.com/chrisport/go-lang-detector/langdet"
	"github.com/chrisport/go-lang-detector/langdet/langdetdef"
)

// isEnglishDetector is created lazily on the first call to isEnglish.
var isEnglishDetector langdet.Detector

// isEnglish reports whether the detector's closest match for text is "english".
func isEnglish(text string) bool {
	if len(isEnglishDetector.Languages) == 0 {
		fmt.Println("* Init English detector ...")
		isEnglishDetector = langdetdef.NewWithDefaultLanguages()
	}

	return isEnglishDetector.GetClosestLanguage(text) == "english"
}

func main() {
	// English, Czech and Polish samples; only the first should print true.
	fmt.Println(isEnglish("do not care about quantity"))
	fmt.Println(isEnglish("V jeho jednomyslném schválení však brání dlouhodobý nesouhlas dvojice zmíněných států. „Slyším tak často z Polska a Maďarska, že nemají problém s právním státem, až bych skoro čekala, že to dokážou tím, že pro to zvednou ruku,“ prohlásila. (ČTK)*"))
	fmt.Println(isEnglish("Jesteśmy przekonani, że właśnie taki rodzaj dziennikarstwa najlepiej pomaga rozumieć to, co dzieje się dookoła nas i stanowi najbardziej wartościowy wkład w rozwój demokracji oraz wartości obywatelskich"))
}

OUTPUT:

* Init English detector ...
true
true
true


chrisport · 2020-11-21T10:08:56Z

That's interesting. The confidence for English, German and French on your snippets is quite high, which means these languages share similarities; to distinguish them, you would need to set a higher minimum confidence.
If you print the confidences using GetLanguages rather than GetClosestLanguage, you will see this:

[{english 90} {french 75} {german 50} {turkish 44} {hebrew 19} {arabic 1} {russian 0} {CJK 0}]
[{english 80} {german 80} {french 79} {turkish 79} {hebrew 73} {arabic 69} {russian 68} {CJK 0}]
[{english 76} {german 76} {french 73} {turkish 72} {hebrew 67} {arabic 62} {russian 61} {CJK 0}]

So you have 2 possibilities here:

Option 1
Increase the minimum confidence to, let's say, 85. It will then correctly return:

english
undefined
undefined

This works well if your detector only needs to detect English and you don't care much about Czech.
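As a rough illustration of why the higher threshold changes the answer, here is a minimal sketch that applies a minimum confidence to the top entries of the three result lines above. The `result` type and `closestAbove` function are hypothetical stand-ins written for this example, not the library's own types or API.

```go
package main

import "fmt"

// result mirrors the {name confidence} pairs printed in this thread.
// It is a local stand-in for illustration, not the library's type.
type result struct {
	Name       string
	Confidence int
}

// closestAbove returns the name of the highest-confidence result, or
// "undefined" when that confidence is below minConfidence.
func closestAbove(results []result, minConfidence int) string {
	best := result{Name: "undefined", Confidence: -1}
	for _, r := range results {
		if r.Confidence > best.Confidence {
			best = r
		}
	}
	if best.Confidence < minConfidence {
		return "undefined"
	}
	return best.Name
}

func main() {
	// Top three entries copied from each of the result lines above.
	snippets := [][]result{
		{{"english", 90}, {"french", 75}, {"german", 50}},
		{{"english", 80}, {"german", 80}, {"french", 79}},
		{{"english", 76}, {"german", 76}, {"french", 73}},
	}
	for _, s := range snippets {
		fmt.Println(closestAbove(s, 85))
	}
}
```

With a threshold of 85 only the first snippet clears the bar, which matches the english / undefined / undefined result listed under Option 1.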

Option 2
Add the Czech language to the detector. I did so by copying a random Wikipedia article into a text file and analysing it with the library. See also the README on how to do that. The result will be:

english
czech
czech

Confidence levels:

[{english 90} {french 75} {Czech 55} {german 50} {turkish 44} {hebrew 19} {arabic 1} {russian 0} {CJK 0}]
[{Czech 94} {english 80} {german 80} {french 79} {turkish 79} {hebrew 73} {arabic 69} {russian 68} {CJK 0}]
[{Czech 76} {english 76} {german 76} {french 73} {turkish 72} {hebrew 67} {arabic 62} {russian 61} {CJK 0}]
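Note that in the third line Czech, english and german all score 76, so the reported winner depends on ordering rather than on a real margin. A caller who wants to treat such cases as ambiguous could check for a tie at the top. The sketch below is only one way to do that; `result` and `topWithTie` are hypothetical names introduced for this example and are not part of the library.

```go
package main

import (
	"fmt"
	"sort"
)

// result mirrors the {name confidence} pairs printed in this thread.
// It is a local stand-in for illustration, not the library's type.
type result struct {
	Name       string
	Confidence int
}

// topWithTie returns the best-scoring name and whether the two best
// results share the same confidence. The input slice is not modified.
func topWithTie(results []result) (string, bool) {
	if len(results) == 0 {
		return "undefined", false
	}
	sorted := append([]result(nil), results...)
	// Stable sort keeps the original order among equal confidences.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Confidence > sorted[j].Confidence
	})
	tied := len(sorted) > 1 && sorted[0].Confidence == sorted[1].Confidence
	return sorted[0].Name, tied
}

func main() {
	// Top three entries of the second and third lines above.
	second := []result{{"Czech", 94}, {"english", 80}, {"german", 80}}
	third := []result{{"Czech", 76}, {"english", 76}, {"german", 76}}

	name, tied := topWithTie(second)
	fmt.Println(name, tied)
	name, tied = topWithTie(third)
	fmt.Println(name, tied)
}
```

For the second line the top result is unambiguous; for the third the tie flag is set, which a caller could map to "undefined" instead of trusting the first entry.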

I hope this was helpful. Please let me know if I can support you with your specific use case.

zolastro · 2020-12-01T08:54:27Z

Maybe this has something to do with #24

chrisport self-assigned this Nov 21, 2020

chrisport added the "help wanted" and "question" labels Nov 21, 2020
