-
-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accept and resolve BCP47 language names (WIP) #1641
base: master
Are you sure you want to change the base?
Accept and resolve BCP47 language names (WIP) #1641
Conversation
core/languages.lua
Outdated
|
||
-- BEGIN OMIKHLEIA HACKLANG | ||
-- Disabled for now, see further below. | ||
-- local cldr = require("cldr") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below, if ICU actually does what we need here that's fine, otherwise cldr-lua should be extended to provide the necessary functionality.
core/languages.lua
Outdated
if res then | ||
return res, langbcp47 | ||
end | ||
langbcp47 = langbcp47:match("^(.+)-.*$") -- split at dash (-) and remove last part. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this matching/parsing bit needs to be in cldr-lua.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wherever it eventually ends, the utility is also useful for other packages. E.g. I wish I had it for my "smartquotes" package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ICU bits should probably get moved from the font code to some base language support module, and yes should be setup in a way that every package can use them. This bit which isn't really ICU still looks like it goes in CLDR to me and could also return information about what the segments mean.
I think there are two important next steps for SILE. One is having the tooling worked out to build Rust functions to both crates & Lua rocks (so SILE can still be used as a library and we have a two way bridge to implement anything however works best). That is kind of held up waiting on mlua v0.10 because it is going to make that effort much more ergonomic. The other one is this. Handling everything with full BCP-47 locales internally and falling back gracefully to shorter ISO language codes where necessary has became what I see as the next big push I need to make. I don't know whether we can/should do this separately from the other item above, but I'm going to be looking into it. |
I've started disentangling the hyphenation patterns in #2102 (and see also #2104 for what could be the next logical step here - I've kept those in a branch for now). One implication is that the (So as to have a homogeneous implementation there, while currently some return tables, some affect globals, etc.) |
Thanks for the pointers on what end of the stick to grab first, I'll go start digging. (I'm also expecting a son to be born in the next few days so don't be surprised if I drop off the radar for a while.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review comments are kind of rambling, I'm just poking around. I rebased your commit as best I could on master. I think it's broken now because I didn't rewire up the bits that changed, just figured out what hunks to pull out of the merge conflict. I'll keep going on it trying to sort out the parts.
core/font.lua
Outdated
-- BEGIN OMIKHLEIA HACKLANG | ||
-- BAD STUFF WARNING: This SILE.languageSupport.languages[] is broken, (nearly always returning nil), | ||
-- see comment in languages.lua | ||
-- ON THE OTHER HAND, setting font.script doesn't seem to be used in any sensible way... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script parameter isn't used for all languages or fonts, but some locale and font combinations do make use of it and the shaper then needs this info to correctly shape them. We're passing this on to HarfBuzz if it is set, that's all we need to do with it.
core/languages.lua
Outdated
-- BEGIN OMIKHLEIA HACKLANG | ||
-- Either done too soon or plain wrong (a BCP47 language can e.g. contain a script for instance, sr-Latn | ||
-- and I don't see that in https://github.com/alerque/cldr-lua/blob/master/cldr/data/locales.lua | ||
-- I don't really know why we would want that CLDR filtering here BTW... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From memory (obviously I'll be testing this here) one of the reasons for this is because using icu.canonicalize_language()
gets us the canonical language code, but it does not validate that what we got could even be recognized, and just passes through garbage if you git it garbage. If my recollection is correct here, we should be doing both. First trying to parse a canonical name, then checking if that is a known language in the database and hence we can reasonably be expected to know anything about it.
261c295
to
e9999ac
Compare
Some "food for thought" regarding #1631
I fancied checking what adding
\language[main=en-GB]
1 to the SILE manual would imply...Regard this a (kind of working but ugly) draft attempt.
It raises some interesting questions for the future:
font.script
, remainders of no-longer properly initializedSILE.languageSupport.languages
, etc.)How to move forward certainly requires a longer and tougher discussion.
Footnotes
Or anything else you'd like. I played with "sr-Latn", which currently resolves to "sr" (= sr-Cyrl actually), no real interest obviously, except to see chapter titles in Cyrillic and hyphenations not working for the English text :) ↩