English detector fails when checking czech text #23

Open
computerphysicslab opened this issue Sep 2, 2020 · 4 comments

The English detector fails when checking Czech text:

package main

import (
	"fmt"

	"github.com/chrisport/go-lang-detector/langdet"
	"github.com/chrisport/go-lang-detector/langdet/langdetdef"
)

var isEnglishDetector langdet.Detector

func isEnglish(text string) bool {
	// Lazily initialise the detector on first use.
	if len(isEnglishDetector.Languages) == 0 {
		fmt.Println("* Init English detector ...")
		isEnglishDetector = langdetdef.NewWithDefaultLanguages()
	}

	if isEnglishDetector.GetClosestLanguage(text) == "english" {
		return true
	}

	return false
}

func main() {
	fmt.Println(isEnglish("do not care about quantity"))
	fmt.Println(isEnglish("V jeho jednomyslném schválení však brání dlouhodobý nesouhlas dvojice zmíněných států. „Slyším tak často z Polska a Maďarska, že nemají problém s právním státem, až bych skoro čekala, že to dokážou tím, že pro to zvednou ruku,“ prohlásila. (ČTK)*"))
	fmt.Println(isEnglish("Jesteśmy przekonani, że właśnie taki rodzaj dziennikarstwa najlepiej pomaga rozumieć to, co dzieje się dookoła nas i stanowi najbardziej wartościowy wkład w rozwój demokracji oraz wartości obywatelskich"))
}

OUTPUT:

* Init English detector ...
true
true
true

chrisport (Owner) commented Nov 21, 2020

That's interesting. The confidence for English, German and French is quite high for your snippets, which means these languages share similarities, and to distinguish them you would need to set a higher minimum confidence.
If you print the confidences using GetLanguages rather than GetClosestLanguage, you will see this:

[{english 90} {french 75} {german 50} {turkish 44} {hebrew 19} {arabic 1} {russian 0} {CJK 0}]
[{english 80} {german 80} {french 79} {turkish 79} {hebrew 73} {arabic 69} {russian 68} {CJK 0}]
[{english 76} {german 76} {french 73} {turkish 72} {hebrew 67} {arabic 62} {russian 61} {CJK 0}]

So you have two options here:

Option 1
Increase the minimum confidence to, let's say, 85. It will then correctly return:

english
undefined
undefined

This works well if your detector only needs to detect English and you don't care much about Czech.
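
To illustrate why a threshold of 85 gives that result, here is a minimal standalone sketch of the thresholding idea. It does not use the library: the `result` type and the `closest` helper are hypothetical names made up for this example, and the numbers are the top entries copied from the three result lists quoted above.

package main

import "fmt"

// result mirrors the {name confidence} pairs printed above.
// Hypothetical type for this sketch, not the library's own.
type result struct {
	Name       string
	Confidence int
}

// closest returns the name of the highest-confidence result,
// or "undefined" when that confidence is below minConfidence.
func closest(results []result, minConfidence int) string {
	if len(results) == 0 {
		return "undefined"
	}
	best := results[0]
	for _, r := range results[1:] {
		if r.Confidence > best.Confidence {
			best = r
		}
	}
	if best.Confidence < minConfidence {
		return "undefined"
	}
	return best.Name
}

func main() {
	// Top three entries of each result list from the comment above.
	snippets := [][]result{
		{{"english", 90}, {"french", 75}, {"german", 50}},
		{{"english", 80}, {"german", 80}, {"french", 79}},
		{{"english", 76}, {"german", 76}, {"french", 73}},
	}
	for _, s := range snippets {
		fmt.Println(closest(s, 85))
	}
}

Only the first snippet clears 85, so the other two fall back to "undefined".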

Option 2
Add the Czech language to the detector. I did so by copying a random Wikipedia article into a text file and analysing it with the library; see the README for how to do that. The result will be:

english
czech
czech

Confidence levels:

[{english 90} {french 75} {Czech 55} {german 50} {turkish 44} {hebrew 19} {arabic 1} {russian 0} {CJK 0}]
[{Czech 94} {english 80} {german 80} {french 79} {turkish 79} {hebrew 73} {arabic 69} {russian 68} {CJK 0}]
[{Czech 76} {english 76} {german 76} {french 73} {turkish 72} {hebrew 67} {arabic 62} {russian 61} {CJK 0}]
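
Note that in the third list Czech, English and German all sit at 76, so the winner there is decided by ordering alone. If that matters for your use case, one possible approach is to treat a result as inconclusive when the top two confidences are too close. The sketch below is only an illustration of that idea and is not part of the library: the `score` type, the `ambiguous` helper and the `minMargin` parameter are hypothetical, and the numbers are copied from the lists above.

package main

import (
	"fmt"
	"sort"
)

// score mirrors the {name confidence} pairs printed above.
// Hypothetical type for this sketch, not the library's own.
type score struct {
	Name       string
	Confidence int
}

// ambiguous reports whether the two highest confidences
// differ by less than minMargin.
func ambiguous(scores []score, minMargin int) bool {
	if len(scores) < 2 {
		return false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]score(nil), scores...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Confidence > sorted[j].Confidence
	})
	return sorted[0].Confidence-sorted[1].Confidence < minMargin
}

func main() {
	second := []score{{"Czech", 94}, {"english", 80}, {"german", 80}}
	third := []score{{"Czech", 76}, {"english", 76}, {"german", 76}}
	fmt.Println(ambiguous(second, 10)) // margin 14: false
	fmt.Println(ambiguous(third, 10))  // margin 0: true
}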

I hope this was helpful. Please let me know if I can support you with your specific use case.

zolastro commented Dec 1, 2020

Maybe this has something to do with #24
