Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TÜRKİYE a valid spelling? #347

Open
cjyetman opened this issue Sep 9, 2023 · 10 comments
Open

TÜRKİYE a valid spelling? #347

cjyetman opened this issue Sep 9, 2023 · 10 comments

Comments

@cjyetman
Copy link
Collaborator

cjyetman commented Sep 9, 2023

The list-one.xls that we use from https://www.six-group.com/en/products-services/financial-information/data-standards.html to get the ISO4217 codes in get_iso4217.R (or better, will use once the URL is updated in an PR #348) uses
TÜRKİYE (note the İ). Is this a valid spelling we should support?

@NilsEnevoldsen
Copy link
Contributor

Do we not support that? We definitely should. It's just the capital form of "i" in Turkish.

@NilsEnevoldsen
Copy link
Contributor

Does our regex engine have an equivalent of CultureInvariant, or is that a dotnet-exclusive nicety?

@vincentarelbundock
Copy link
Owner

Does our regex engine have an equivalent of CultureInvariant, or is that a dotnet-exclusive nicety?

Oh, that's neat. I had to google it because I had never heard of this flag. Also funny that the C# docs use exactly our problematic case as their main example.

Maybe I missed something, but I don't think grep(perl=TRUE) supports this out of the box.

Screenshot 2023-09-09 145903

@cjyetman
Copy link
Collaborator Author

We currently rely on ignore.case = TRUE in our grepl() command here:

countrycode/R/countrycode.R

Lines 280 to 281 in 848e9ce

matchidx <- sapply(dict[[origin]], grepl, x = choices,
perl = TRUE, ignore.case = TRUE)

Should we consider casting all strings-to-match with tolower() before using grepl()? Probably a small performance hit, but maybe worth it. If we ensure that all of our regexes are lowercase, we could then maybe remove the ignore.case = TRUE, which would probably give a small performance boost.

turkey <- "TÜRKİYE"

countrycode::countrycode(turkey, "country.name", "country.name")
#> Warning: Some values were not matched unambiguously: TÜRKİYE
#> [1] NA

tolower(turkey)
#> [1] "türkiye"
countrycode::countrycode(tolower(turkey), "country.name", "country.name")
#> [1] "Turkey"

turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]
grepl(x = turkey, pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] FALSE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = FALSE)
#> [1] TRUE

@vincentarelbundock
Copy link
Owner

FWIW, I don't have a view and not a real good sense of whether this can cause problems (probably not?).

@cjyetman
Copy link
Collaborator Author

It's not the most precise solution, but might capture similar issues with other extended characters, while seemingly not changing the logic at all.

A more precise solution could be to add İ to the regexes for Turkey...

turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]

grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE

turkey.en.regex
#> [1] "turkey|t(ü|u)rkiye"
turkey.en.regex_mod <- "turkey|t(ü|u)rk(i|İ)ye"

grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE

side note: I don't know why we use () (capture groups) instead of []

turkey.en.regex_mod <- "turkey|t[ü|u]rk[i|İ]ye"
grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE

@vincentarelbundock
Copy link
Owner

Let's just go with the more precise option, then, as there are fewer unknowns.

What's the benefit of ()? We don't want to capture anything...

@cjyetman
Copy link
Collaborator Author

Let's just go with the more precise option, then, as there are fewer unknowns.

👍🏻

What's the benefit of ()? We don't want to capture anything...

I have no idea. Honestly, I'm somewhat surprised that it works as expected using (). My understanding is that "character class" ([]) is the proper way of matching any one of a list of characters. At some point, a bunch of capture groups (()) got introduced into the regexes here.

@vincentarelbundock
Copy link
Owner

weird. I can see it for lookaheads and such... oh well!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants