Can't import utf8 file with unicode text #7

Open
kintopp opened this issue Apr 30, 2012 · 7 comments

Comments

@kintopp

kintopp commented Apr 30, 2012

Omeka 1.5.1 and CsvImport v.1.3.3. MySQL database with utf8 collation. Fresh Omeka install.

If I import the bundled tests/test.csv file in the plugin, everything works correctly. Modifying this sample data to include umlauts also works. Modifying the data to include Greek or Japanese text results in the test file not being imported, i.e. I'm returned to the import dialogue without having had an opportunity to match fields. When the Japanese or Greek text is replaced with Roman text again, the file is imported properly by the plugin once more. The test CSV file opens as UTF-8 in BBEdit and was saved again as such.

@zerocrates
Member

I have a sneaking suspicion this might be related to PHP's locale setting. Having the locale on your server set to one that doesn't use UTF-8 encoding may be what's causing this, since the CSV-reading functionality we use is locale-sensitive (sometimes).

If this is the case, it's slightly tricky to fix on our end: we can try to set a locale, but that requires giving a language/region in addition to an encoding. We could assume en_US, but that's not going to work everywhere.

@kintopp
Author

kintopp commented Apr 30, 2012

Can you show me where to look for this? I'm not a developer... I'm using OS X (and testing under XAMPP), and running locale returns:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

Looking in the XAMPP configuration overview, I see this under PHP Variables, which might also be relevant:

_SERVER["HTTP_ACCEPT_LANGUAGE"] en,de;q=0.5

@zerocrates
Member

Those look like the "right" locales (though they may not end up being the ones PHP uses). As a test, calling echo setlocale(LC_ALL, '0'); should tell you what PHP thinks its current locale setting is.

@zerocrates
Member

People seem to be reporting better luck using Firefox when uploading their CSV files. I'm not really sure how that could be affecting this, but several people have reported success with Firefox after failure from other browsers.

@ghost ghost assigned willynilly Feb 7, 2013
@willynilly
Contributor

I was not able to reproduce this bug in Chrome with the latest master. I tested Japanese, Chinese, Greek, and Vietnamese text in a UTF-8 file on my Mac.

@symac

symac commented Oct 18, 2015

@kintopp I know this is an old issue, but I had the same problem today. After several attempts, I think the issue is with _validateSource in application/libraries/Omeka/File/Ingest/Url.php. The URLs I have for files, which contain diacritics, do not validate via Zend_Uri::factory.

I was able to load the file by changing the URL from:
http://geobib.fr/tmp/CPA/012-Vue_générale_prise_de_la_Petite-Perrière.jpg
to:
http://geobib.fr/tmp/CPA/012-Vue_g%C3%A9n%C3%A9rale_prise_de_la_Petite-Perri%C3%A8re.jpg

This now works for me, so I thought it was worth leaving this comment in case somebody else encounters the same issue.

@zerocrates @willynilly pinging you in case this makes sense to you and you can think of a fix. I am leaving this file on my server for a few weeks, so don't hesitate to use it if you want to try to replicate the problem.

@zerocrates
Member

I think this problem with UTF-8 in URLs is different from the usual problem that this issue represents, which is more about the encoding of the whole file and the import not even progressing to the mapping screen.

I'd have to think a little about whether we can handle this automatically... I wouldn't want to just always urlencode the URL before ingesting it, because that could mess up URLs that are already encoded. At a minimum we can document that URL encoding should be used. Maybe there's also some way to make Zend's validator accept these URLs.
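
For anyone preparing a CSV by hand in the meantime, the encode-without-double-encoding idea can be sketched as follows. This is a Python illustration only, not Omeka or Zend code; the helper name encode_non_ascii is hypothetical, and it assumes the URL text is UTF-8 and that any '%' already present belongs to an existing escape.

```python
from urllib.parse import quote

# ASCII characters that may legitimately appear in a URL and should be left
# untouched. '%' is included so that existing %XX escapes are not re-encoded.
URL_SAFE = "%/:?#[]@!$&'()*+,;=~-._"


def encode_non_ascii(url: str) -> str:
    """Percent-encode non-ASCII characters (as UTF-8) and leave the rest alone.

    Hypothetical helper for illustration. ASCII characters outside URL_SAFE,
    such as spaces, are also escaped. A stray literal '%' that is not part of
    an escape is NOT fixed, which is the trade-off of treating '%' as safe.
    """
    return quote(url, safe=URL_SAFE)


if __name__ == "__main__":
    raw = "http://geobib.fr/tmp/CPA/012-Vue_générale_prise_de_la_Petite-Perrière.jpg"
    once = encode_non_ascii(raw)
    twice = encode_non_ascii(once)
    print(once)
    print(once == twice)  # encoding an already-encoded URL changes nothing
```

Because '%' is treated as safe, running the helper on a URL that is already encoded leaves it unchanged, which addresses the double-encoding concern. The cost is that a URL containing a genuine unescaped percent sign would pass through uncorrected.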
