Alternative names to specify language #1

Open
sebastian-meckovski opened this issue Oct 5, 2024 · 4 comments
Comments

@sebastian-meckovski

sebastian-meckovski commented Oct 5, 2024

Hi.

The script returns a reliable dataset and would be useful for my project.

Would it be possible to reformat the dataset so that each alternative name states which language it is in? So instead of this:

FR,France,,,Paris,"Baariis,Bahliz,Baris,Lungsod ng Paris,Lutece ..... "

To return something like

FR,France,,,Paris,"af: Baariis, za: Bahliz, tr: Baris, ..... "

Because without this I don't know how to use these alternative names.
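
For illustration, the requested layout could be produced along these lines. This is only a sketch: the helper name `format_alternate_names` is hypothetical, and the language codes are copied from the example row above, not checked against any dataset.

```python
import csv
import io


def format_alternate_names(pairs):
    """Turn (language, name) pairs into a string like 'af: Baariis, za: Bahliz'."""
    return ", ".join(f"{lang}: {name}" for lang, name in pairs)


# Sample values taken from the example row in this issue (not verified data).
pairs = [("af", "Baariis"), ("za", "Bahliz"), ("tr", "Baris")]
row = ["FR", "France", "", "", "Paris", format_alternate_names(pairs)]

buffer = io.StringIO()
# The csv module quotes the last field because it contains commas.
csv.writer(buffer).writerow(row)
line = buffer.getvalue().strip()
print(line)
```

Keeping the language prefix inside a single quoted field leaves the existing column layout unchanged, so current consumers of the CSV would only need to split that one field.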

@joelacus
Owner

Hi! Sorry for the late response. This is a good point. The current source just lists the alternative names. I'll see if I can find a source for the alternative place names with the language they belong to and update the script.

@sebastian-meckovski
Author

I have written a script that may solve this.
https://github.com/sebastian-meckovski/geo-data-generator/blob/master/countries_data.py

It does many things. For example, it drops the administrative area names of each location unless there are two or more cities with the same name in the same country.

But most importantly, it creates a join between these two datasets:

global_cities_url = 'http://download.geonames.org/export/dump/allCountries.zip'
alternate_names_url = 'http://download.geonames.org/export/dump/alternateNamesV2.zip'

Then, given a comma-separated string of language codes, it keeps only the alternate names in those languages. Feel free to use this as an example.
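
A join like this could be sketched roughly as follows. This is not the linked script, just a minimal standard-library illustration. The column positions follow the GeoNames readme (tab-separated, no header row), while the function name `join_alternate_names` and the sample rows, including their ids, are made up.

```python
from collections import defaultdict


def join_alternate_names(place_lines, alt_lines, languages):
    """Map geonameid -> (place name, [(language, alternate name), ...])."""
    wanted = {code.strip() for code in languages.split(",") if code.strip()}

    names_by_place = defaultdict(list)
    for line in alt_lines:
        fields = line.rstrip("\n").split("\t")
        # alternateNamesV2: 0 alternateNameId, 1 geonameid, 2 isolanguage, 3 alternate name
        if fields[2] in wanted:
            names_by_place[fields[1]].append((fields[2], fields[3]))

    joined = {}
    for line in place_lines:
        fields = line.rstrip("\n").split("\t")
        # allCountries: 0 geonameid, 1 name
        joined[fields[0]] = (fields[1], names_by_place.get(fields[0], []))
    return joined


# Hand-written sample rows with made-up ids, not a real GeoNames extract.
places = ["100\tParis"]
alternates = [
    "1\t100\tfr\tParis",
    "2\t100\tit\tParigi",
    "3\t100\tfi\tPariisi",
    "4\t100\t\tLutece",  # rows with an empty language code are skipped
]
result = join_alternate_names(places, alternates, "it, fi")
print(result)
```

Building the lookup from the alternate names first means the much larger places file only needs a single pass.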

@joelacus
Owner

joelacus commented Dec 2, 2024

Ah nice, that's cool. I'm glad you figured something out.

I found the alternateNamesV2 dataset as well and have managed to implement it into the script, but now I'm trying to optimise it somehow as 18 million lines is a lot to process...
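
One possible way to reduce the cost, sketched here under assumptions, is to stream the file straight out of the zip and discard unwanted languages line by line instead of loading everything first. The function name `iter_alternate_names` is hypothetical, and the member name inside the real archive is assumed to be passed in by the caller; the demo below uses a tiny in-memory zip with made-up rows.

```python
import io
import zipfile


def iter_alternate_names(zip_source, member, languages):
    """Yield (geonameid, language, name) tuples without loading the whole file."""
    wanted = set(languages)
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # Decode lazily so only one line is held in memory at a time.
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                fields = line.rstrip("\n").split("\t")
                # Columns: 0 alternateNameId, 1 geonameid, 2 isolanguage, 3 alternate name
                if len(fields) > 3 and fields[2] in wanted:
                    yield fields[1], fields[2], fields[3]


# Demo with a small in-memory archive (made-up rows, hypothetical member name).
demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("sample.txt", "1\t100\tfr\tParis\n2\t100\tit\tParigi\n3\t100\t\tLutece\n")
demo_zip.seek(0)

for record in iter_alternate_names(demo_zip, "sample.txt", ["it"]):
    print(record)
```

Filtering while reading keeps memory use proportional to the rows that are kept, not to the full file.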

@sebastian-meckovski
Author

Yeah, the alternate names dataset is huge, and loading it into RAM alone takes around a minute or two. There's another problem: many places have a few alternate names in the same language, so the script also needs to take that into account. I only needed one alternate name per language, so my script selects only one.
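
Picking a single name per place and language might look something like the sketch below. The selection rule is an assumption, not necessarily what the linked script does: it keeps the first name seen and replaces it only if a later row is flagged as preferred (the `isPreferredName` column in alternateNamesV2). The function name and the sample rows are made up.

```python
def pick_one_per_language(rows):
    """Keep one name per (geonameid, language), favouring preferred names.

    Each row is (geonameid, language, name, is_preferred).
    """
    chosen = {}
    for geoname_id, lang, name, is_preferred in rows:
        key = (geoname_id, lang)
        current = chosen.get(key)
        # Take the first name seen; upgrade only from non-preferred to preferred.
        if current is None or (is_preferred and not current[1]):
            chosen[key] = (name, is_preferred)
    return {key: name for key, (name, _) in chosen.items()}


# Made-up sample rows for illustration only.
rows = [
    ("100", "en", "Lutece", False),
    ("100", "en", "Paris", True),
    ("100", "it", "Parigi", False),
    ("100", "en", "Paname", True),
]
selected = pick_one_per_language(rows)
print(selected)
```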


2 participants