Regex: Northern Ireland #170

Closed
embruna opened this issue Jan 12, 2018 · 5 comments
Labels
match issue, regex (requires a regex fix)

Comments

embruna commented Jan 12, 2018

When abbreviated, Northern Ireland converts to IRL instead of NA.

library(countrycode)

test <- as.data.frame(c("Northern Ireland", "N Ireland", "N. Ireland", "Ireland"))
names(test) <- "country"
test

# convert the country names to ISO3 codes; the abbreviated forms should yield NA
output_iso3c <- countrycode(test$country, "country.name", "iso3c", warn = TRUE)
output_iso3c <- cbind(test, output_iso3c)
output_iso3c

           country output_iso3c
1 Northern Ireland         <NA>
2        N Ireland          IRL
3       N. Ireland          IRL
4          Ireland          IRL

embruna commented Jan 12, 2018

This is related to the issue raised in #167, where it was abbreviated as "Ireland (Northern)".

PS Just wanted to say thanks for countrycode - it's really excellent.

cjyetman added the regex (requires a regex fix) and match issue labels Oct 27, 2018
@cjyetman
Collaborator

potential fix for excluding variations of Northern Ireland

test <- c("Ireland", "Northern Ireland", "North Ireland", "N Ireland", "N. Ireland", 
          "Ireland (Northern)", "Ireland (North)", "Ireland (N)", "Ireland (N.)",
          "Ireland, Northern", "Ireland, North", "Ireland, N.", "Ireland, N", 
          "in the north of Ireland")
test <- setNames(test, test)
# only the first and last should be true

# existing
irl_regex <- "^(?!.*north).*\\bireland"
sapply(test, function(x) grepl(irl_regex, x, perl = TRUE, ignore.case = TRUE))
# Ireland               Northern Ireland        North Ireland 
# TRUE                  FALSE                   FALSE 
# N Ireland             N. Ireland              Ireland (Northern) 
# TRUE                  TRUE                    FALSE 
# Ireland (North)       Ireland (N)             Ireland (N.) 
# FALSE                 TRUE                    TRUE 
# Ireland, Northern     Ireland, North          Ireland, N. 
# FALSE                 FALSE                   TRUE 
# Ireland, N            in the north of Ireland 
# TRUE                  FALSE 

# proposed
irl_regex <- "(?<![\\bnorthern|\\bnorth|\\bn|\\bn\\.]\\s)(?!.*[,\\(\\s][north|n\\)])ireland"
sapply(test, function(x) grepl(irl_regex, x, perl = TRUE, ignore.case = TRUE))
# Ireland                Northern Ireland           North Ireland 
# TRUE                   FALSE                      FALSE 
# N Ireland              N. Ireland                 Ireland (Northern) 
# FALSE                  FALSE                      FALSE 
# Ireland (North)        Ireland (N)                Ireland (N.) 
# FALSE                  FALSE                      FALSE 
# Ireland, Northern      Ireland, North             Ireland, N. 
# FALSE                  FALSE                      FALSE 
# Ireland, N             in the north of Ireland 
# FALSE                  TRUE 

@cjyetman
Collaborator

this fails on "population Ireland" though ☹️

test <- c("Ireland", "Northern Ireland", "North Ireland", "N Ireland", "N. Ireland", 
          "Ireland (Northern)", "Ireland (North)", "Ireland (N)", "Ireland (N.)",
          "Ireland, Northern", "Ireland, North", "Ireland, N.", "Ireland, N", 
          "in the north of Ireland", "population Ireland")
test <- setNames(test, test)

# proposed
irl_regex <- "(?<![\\bnorthern|\\bnorth|\\bn|\\bn\\.]\\s)(?!.*[,\\(\\s][north|n\\)])ireland"
sapply(test, function(x) grepl(irl_regex, x, perl = TRUE, ignore.case = TRUE))
#           Ireland        Northern Ireland           North Ireland 
#              TRUE                   FALSE                   FALSE 
#         N Ireland              N. Ireland      Ireland (Northern) 
#             FALSE                   FALSE                   FALSE 
#   Ireland (North)             Ireland (N)            Ireland (N.) 
#             FALSE                   FALSE                   FALSE 
# Ireland, Northern          Ireland, North             Ireland, N. 
#             FALSE                   FALSE                   FALSE 
#        Ireland, N in the north of Ireland      population Ireland 
#             FALSE                    TRUE                   FALSE 
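
A likely explanation for the "population Ireland" failure: inside square brackets, PCRE treats `northern|north|n` as a set of single characters rather than as alternatives, so the lookbehind rejects any word ending in one of the letters n, o, r, t, h, or e (or `|` or `.`) followed by whitespace. The probe strings below are hypothetical inputs chosen only to show that effect; the pattern itself is the proposed one, unchanged.

```r
# The proposed pattern from this thread, unchanged.
irl_regex <- "(?<![\\bnorthern|\\bnorth|\\bn|\\bn\\.]\\s)(?!.*[,\\(\\s][north|n\\)])ireland"

# Hypothetical probe inputs: the first two end the preceding word with a
# letter that is inside the bracket expression ("n", "t"); the third ends
# it with "u", which is not.
probe <- c("population Ireland", "east Ireland", "Peru Ireland")

matched <- grepl(irl_regex, probe, perl = TRUE, ignore.case = TRUE)
setNames(matched, probe)
# "population Ireland" and "east Ireland" are rejected; "Peru Ireland" matches
```

Alternation needs a group such as `(?:northern|north|n)`, and PCRE lookbehinds also require each alternative to have a fixed length, which makes a lookbehind awkward for this job.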

@vincentarelbundock
Owner

Why not be super aggressive and just ban "N" before or after? Over time, I've become more convinced that we can't support sentences, so that would be OK, no?
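
One way to read this suggestion is to extend the existing negative lookahead so that a standalone "N", "North", or "Northern" anywhere in the string blocks the match. The pattern below is a hypothetical sketch under that assumption, not the fix adopted in countrycode; it reuses the test strings from earlier in the thread.

```r
# Hypothetical pattern (an assumption, not taken from countrycode): reject any
# string containing a standalone "n", "north", or "northern", then require
# "ireland" at the start of a word.
irl_regex <- "^(?!.*\\bn(orth(ern)?)?\\b).*\\bireland"

test <- c("Ireland", "Northern Ireland", "North Ireland", "N Ireland", "N. Ireland",
          "Ireland (Northern)", "Ireland (North)", "Ireland (N)", "Ireland (N.)",
          "Ireland, Northern", "Ireland, North", "Ireland, N.", "Ireland, N",
          "in the north of Ireland", "population Ireland")

matched <- grepl(irl_regex, test, perl = TRUE, ignore.case = TRUE)
test[matched]
# matches only "Ireland" and "population Ireland"
```

The trade-off is the one described above: "in the north of Ireland" is rejected as well, which is acceptable only if sentence-like inputs are out of scope.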

vincentarelbundock changed the title from "N Ireland converts to IRL" to "Regex: Northern Ireland" May 14, 2020
@vincentarelbundock
Owner

#313
