Regex: Northern Ireland #170
Comments
This is related to the issue raised in #167, where it was abbreviated as "Ireland (Northern)". PS: Just wanted to say thanks for countrycode - it's really excellent.
Potential fix for excluding variations of Northern Ireland:
test <- c("Ireland", "Northern Ireland", "North Ireland", "N Ireland", "N. Ireland",
"Ireland (Northern)", "Ireland (North)", "Ireland (N)", "Ireland (N.)",
"Ireland, Northern", "Ireland, North", "Ireland, N.", "Ireland, N",
"in the north of Ireland")
test <- setNames(test, test)
# only the first and last should be true
# existing
irl_regex <- "^(?!.*north).*\\bireland"
sapply(test, function(x) grepl(irl_regex, x, perl = TRUE, ignore.case = TRUE))
# Ireland                  Northern Ireland         North Ireland
# TRUE                     FALSE                    FALSE
# N Ireland                N. Ireland               Ireland (Northern)
# TRUE                     TRUE                     FALSE
# Ireland (North)          Ireland (N)              Ireland (N.)
# FALSE                    TRUE                     TRUE
# Ireland, Northern        Ireland, North           Ireland, N.
# FALSE                    FALSE                    TRUE
# Ireland, N               in the north of Ireland
# TRUE                     FALSE
# proposed
irl_regex <- "(?<![\\bnorthern|\\bnorth|\\bn|\\bn\\.]\\s)(?!.*[,\\(\\s][north|n\\)])ireland"
sapply(test, function(x) grepl(irl_regex, x, perl = TRUE, ignore.case = TRUE))
# Ireland                  Northern Ireland         North Ireland
# TRUE                     FALSE                    FALSE
# N Ireland                N. Ireland               Ireland (Northern)
# FALSE                    FALSE                    FALSE
# Ireland (North)          Ireland (N)              Ireland (N.)
# FALSE                    FALSE                    FALSE
# Ireland, Northern        Ireland, North           Ireland, N.
# FALSE                    FALSE                    FALSE
# Ireland, N               in the north of Ireland
# FALSE                    TRUE
This fails on "population Ireland", though:
test <- c("Ireland", "Northern Ireland", "North Ireland", "N Ireland", "N. Ireland",
"Ireland (Northern)", "Ireland (North)", "Ireland (N)", "Ireland (N.)",
"Ireland, Northern", "Ireland, North", "Ireland, N.", "Ireland, N",
"in the north of Ireland", "population Ireland")
test <- setNames(test, test)
# proposed
irl_regex <- "(?<![\\bnorthern|\\bnorth|\\bn|\\bn\\.]\\s)(?!.*[,\\(\\s][north|n\\)])ireland"
sapply(test, function(x) grepl(irl_regex, x, perl = TRUE, ignore.case = TRUE))
# Ireland                  Northern Ireland         North Ireland
# TRUE                     FALSE                    FALSE
# N Ireland                N. Ireland               Ireland (Northern)
# FALSE                    FALSE                    FALSE
# Ireland (North)          Ireland (N)              Ireland (N.)
# FALSE                    FALSE                    FALSE
# Ireland, Northern        Ireland, North           Ireland, N.
# FALSE                    FALSE                    FALSE
# Ireland, N               in the north of Ireland  population Ireland
# FALSE                    TRUE                     FALSE
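A likely reason for the "population Ireland" failure: in the proposed pattern the square brackets form character classes, so `[\\bnorthern|\\bnorth|\\bn|\\bn\\.]` matches any single one of those characters rather than one of the listed words. "population " ends in "n" plus a space, so the lookbehind rejects it. The following is only a sketch of one alternative that uses real alternation groups and two lookaheads; the name `irl_regex_alt` is hypothetical and this is not the pattern used by countrycode.

```r
# Test strings from the thread; only "Ireland", "in the north of Ireland"
# and "population Ireland" should match.
test <- c("Ireland", "Northern Ireland", "North Ireland", "N Ireland", "N. Ireland",
          "Ireland (Northern)", "Ireland (North)", "Ireland (N)", "Ireland (N.)",
          "Ireland, Northern", "Ireland, North", "Ireland, N.", "Ireland, N",
          "in the north of Ireland", "population Ireland")

# Hypothetical alternative pattern (an assumption, not the package's regex).
irl_regex_alt <- paste0(
  # reject a northern prefix directly before "ireland": "N. Ireland", "North Ireland"
  "^(?!.*\\b(?:northern|north|n)\\b\\.?\\s+ireland)",
  # reject a northern suffix directly after "ireland": "Ireland (N)", "Ireland, North"
  "(?!.*\\bireland\\b\\W+(?:northern|north|n)\\b)",
  # otherwise require the word "ireland" somewhere in the string
  ".*\\bireland\\b"
)

res_alt <- grepl(irl_regex_alt, test, perl = TRUE, ignore.case = TRUE)
setNames(res_alt, test)
```

Because the prefix check only looks at the token immediately before "ireland", "in the north of Ireland" and "population Ireland" both still match.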
Why not be super aggressive and just ban "N" before or after "Ireland"? Over time, I've become more convinced that we can't support sentences, so that would be OK, no?
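One possible reading of the aggressive approach, sketched under the assumption that "N" means any standalone token "n", "north" or "northern" anywhere in the string. The name `irl_regex_strict` is hypothetical and this is not the pattern used by countrycode.

```r
# Same test strings as in the comments above.
test <- c("Ireland", "Northern Ireland", "North Ireland", "N Ireland", "N. Ireland",
          "Ireland (Northern)", "Ireland (North)", "Ireland (N)", "Ireland (N.)",
          "Ireland, Northern", "Ireland, North", "Ireland, N.", "Ireland, N",
          "in the north of Ireland", "population Ireland")

# Hypothetical strict pattern: refuse the match if a standalone
# "n", "north" or "northern" token appears anywhere, before or after "ireland".
irl_regex_strict <- "^(?!.*\\bn(?:orth(?:ern)?)?\\b).*\\bireland\\b"

res_strict <- grepl(irl_regex_strict, test, perl = TRUE, ignore.case = TRUE)
setNames(res_strict, test)
```

With this sketch "population Ireland" matches, while the sentence "in the north of Ireland" is rejected, which is the trade-off the comment accepts by giving up on sentences.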
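When abbreviated (e.g. "N Ireland" or "N. Ireland"), Northern Ireland converts to IRL instead of NA.
`library(countrycode)
# Full and abbreviated spellings of Northern Ireland, plus plain "Ireland"
test <- data.frame(country = c("Northern Ireland", "N Ireland", "N. Ireland", "Ireland"))
test
# Convert country names to ISO3 codes; warn = TRUE reports unmatched values
output_iso3c <- countrycode(test$country, "country.name", "iso3c", warn = TRUE)
output_iso3c <- cbind(test, output_iso3c)
output_iso3c
`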