`TÜRKİYE` a valid spelling? #347

cjyetman · 2023-09-09T10:47:20Z

The list-one.xls that we use from https://www.six-group.com/en/products-services/financial-information/data-standards.html to get the ISO4217 codes in get_iso4217.R (or better, will use once the URL is updated in an PR #348) uses
TÜRKİYE (note the İ). Is this a valid spelling we should support?

The text was updated successfully, but these errors were encountered:

vincentarelbundock · 2023-09-09T14:34:46Z

Probably relevant: https://en.wikipedia.org/wiki/%C4%B0#:~:text=%C4%B0%2C%20or%20i%2C%20called%20dotted,Kazakh%2C%20Tatar%2C%20and%20Turkish.

NilsEnevoldsen · 2023-09-09T14:35:41Z

Do we not support that? We definitely should. It's just the capital form of "i" in Turkish.

NilsEnevoldsen · 2023-09-09T14:48:12Z

Does our regex engine have an equivalent of CultureInvariant, or is that a dotnet-exclusive nicety?

vincentarelbundock · 2023-09-09T19:01:51Z

Does our regex engine have an equivalent of CultureInvariant, or is that a dotnet-exclusive nicety?

Oh, that's neat. I had to google it because I had never heard of this flag. Also funny that the C# docs use exactly our problematic case as their main example.

Maybe I missed something, but I don't think grep(perl=TRUE) supports this out of the box.

cjyetman · 2023-09-10T15:14:27Z

We currently rely on ignore.case = TRUE in our grepl() command here:

countrycode/R/countrycode.R

Lines 280 to 281 in 848e9ce

    
           matchidx <- sapply(dict[[origin]], grepl, x = choices, 
        
                              perl = TRUE, ignore.case = TRUE)

Should we consider casting all strings-to-match with tolower() before using grepl()? Probably a small performance hit, but maybe worth it. If we ensure that all of our regexes are lowercase, we could then maybe remove the ignore.case = TRUE, which would probably give a small performance boost.

turkey <- "TÜRKİYE"

countrycode::countrycode(turkey, "country.name", "country.name")
#> Warning: Some values were not matched unambiguously: TÜRKİYE
#> [1] NA

tolower(turkey)
#> [1] "türkiye"
countrycode::countrycode(tolower(turkey), "country.name", "country.name")
#> [1] "Turkey"

turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]
grepl(x = turkey, pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] FALSE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE
grepl(x = tolower(turkey), pattern = turkey.en.regex, perl = TRUE, ignore.case = FALSE)
#> [1] TRUE

vincentarelbundock · 2023-09-14T13:51:55Z

FWIW, I don't have a view and not a real good sense of whether this can cause problems (probably not?).

cjyetman · 2023-09-14T14:04:41Z

It's not the most precise solution, but might capture similar issues with other extended characters, while seemingly not changing the logic at all.

A more precise solution could be to add İ to the regexes for Turkey...

turkey.en.regex <- countrycode::codelist$country.name.en.regex[match("TUR", countrycode::codelist$iso3c)]

grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex, perl = TRUE, ignore.case = TRUE)
#> [1]  TRUE  TRUE  TRUE FALSE

turkey.en.regex
#> [1] "turkey|t(ü|u)rkiye"
turkey.en.regex_mod <- "turkey|t(ü|u)rk(i|İ)ye"

grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE

side note: I don't know why we use () (capture groups) instead of []

turkey.en.regex_mod <- "turkey|t[ü|u]rk[i|İ]ye"
grepl(x = c("Turkey", "Turkiye", "Türkiye", "TÜRKİYE"), pattern = turkey.en.regex_mod, perl = TRUE, ignore.case = TRUE)
#> [1] TRUE TRUE TRUE TRUE

vincentarelbundock · 2023-09-14T14:13:30Z

Let's just go with the more precise option, then, as there are fewer unknowns.

What's the benefit of ()? We don't want to capture anything...

cjyetman · 2023-09-14T14:18:47Z

Let's just go with the more precise option, then, as there are fewer unknowns.

👍🏻

What's the benefit of ()? We don't want to capture anything...

I have no idea. Honestly, I'm somewhat surprised that it works as expected using (). My understanding is that "character class" ([]) is the proper way of matching any one of a list of characters. At some point, a bunch of capture groups (()) got introduced into the regexes here.

vincentarelbundock · 2023-09-14T14:21:41Z

weird. I can see it for lookaheads and such... oh well!

cjyetman mentioned this issue Sep 9, 2023

update ISO4217 scraper #348

Draft

cjyetman mentioned this issue Oct 1, 2024

update GENC scraper #345

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`TÜRKİYE` a valid spelling? #347

`TÜRKİYE` a valid spelling? #347

cjyetman commented Sep 9, 2023 •

edited

Loading

vincentarelbundock commented Sep 9, 2023

NilsEnevoldsen commented Sep 9, 2023

NilsEnevoldsen commented Sep 9, 2023

vincentarelbundock commented Sep 9, 2023

cjyetman commented Sep 10, 2023

vincentarelbundock commented Sep 14, 2023

cjyetman commented Sep 14, 2023

vincentarelbundock commented Sep 14, 2023

cjyetman commented Sep 14, 2023

vincentarelbundock commented Sep 14, 2023

TÜRKİYE a valid spelling? #347

TÜRKİYE a valid spelling? #347

Comments

cjyetman commented Sep 9, 2023 • edited Loading

vincentarelbundock commented Sep 9, 2023

NilsEnevoldsen commented Sep 9, 2023

NilsEnevoldsen commented Sep 9, 2023

vincentarelbundock commented Sep 9, 2023

cjyetman commented Sep 10, 2023

vincentarelbundock commented Sep 14, 2023

cjyetman commented Sep 14, 2023

vincentarelbundock commented Sep 14, 2023

cjyetman commented Sep 14, 2023

vincentarelbundock commented Sep 14, 2023

`TÜRKİYE` a valid spelling? #347

`TÜRKİYE` a valid spelling? #347

cjyetman commented Sep 9, 2023 •

edited

Loading