Localization for as many languages as possible #1474

Draft
wants to merge 12 commits into
base: master
from

Conversation

Omikhleia
Member

@Omikhleia Omikhleia commented Jul 10, 2022

I was a bit annoyed since my first attempts with SILE 0.10 that in English, with the standard book class, I would get "Chapter 1" whereas in French I would still get only "1". Now that we have that fluent thing with i18n files, I decided to give it a try. After all, seeing what was done for Norwegian and Esperanto, it ought to be a matter of "just" providing a few translation strings... Er... Wait, we just have a few lone languages supported? And how are we expecting this to go on, language after language, and at the same time be sort of future proof? (What about "parts", "list of figures", "list of tables", etc. -- which some of us may already have in their classes)...

This PR therefore:

  1. Attempts to provide i18n files for as many supported languages as possible

    • Translations come from several sources
      • Most of them are (a subset) from LaTeX babel (https://github.com/latex3/babel). The latter is LPPL-licensed, which could have been an issue, but I do think:
        • We could be in the section 6 exceptions of the LPPL (as long as we don't complain about the translations)
        • Mere language strings cannot be protected anyway by any kind of license (and that's the only thing we use here, in a different form and without any other code)
      • Some internal resources I had collected from an earlier work with DITA sources + some personal tweaking
        for languages I kind of know: Except in very rare cases, I eventually ended up following Babel.
        (E.g. an exception to this is the ToC header, which resolved to "Contents" in Babel for en, rather than our "Table of Contents" - I kept the latter; the other cases where I departed from my own sources were mainly a matter of taste, e.g. "table des figures" vs. "table des illustrations")
  2. Handle a few complex cases

    • Languages that have specific rules for chapter/part headers, such as Japanese, but also Hungarian (and a few other countries). In the latter case, to be sure the Babel rules were OK, I had a look on Google Books to confirm that this was indeed used (e.g. there are Hungarian books with indeed "1. fejezet", "II. rész", etc.)
    • I am not sure what el expects to be, for now the localization file is a link to el-monoton (corresponding to modern Greek)
    • I split Norwegian differently (nb and no have the same localization, but nn has its own... Our recent Norwegian friend only provided the translations for Bokmål, but Nynorsk is slightly different in written form... I followed Babel once again, but also checked a few differences on https://ordbokene.no to be sure)
  3. Get rid of the awkward hello { $name } patterns...

    • Honestly, we are not going to ask people to translate such a useless thing, are we?
    • However, I kept them for en and tr as they are used as examples in the Manual, but honestly, I'd want them to go away.
  4. Get rid of book-chapter-title-pre and change book:chapter:post. These were mixing translation issue and (vertical) spacing, this was IMHO pretty bad and the translation part shall all be done in the fluent templates... Currently, the only language using a \medskip instead of a \par is Japanese... Honestly again, unless there's a real good reason, I'd expect this to go away too...

  5. Change the toc key. Anyway, the Manual ("c08-language") mentioned toc-heading which did not exist... It was actually toc-title, I changed that to tableofcontents anyway, just because.

  6. One annoying thing for languages without available localization is that a \tableofcontents would yield nothing at all in the first output (no header, but also no message at first SILE run...). So even if the proposed pattern file is mostly commented out in these cases, some of the keys preferably have to be present to avoid that ugly case (i.e. using the English wording if nothing else is currently available). That's true too for some bibliography patterns (it's still better than nothing to at least get the name parameter...)
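The fallback idea in item 6 can be sketched roughly as follows. This is only an illustration in Python, not SILE's actual lookup code: the `MESSAGES` table and the `lookup` helper are hypothetical stand-ins for the per-language `.ftl` files.

```python
# Hypothetical in-memory stand-in for the per-language .ftl files.
MESSAGES = {
    "en": {"tableofcontents": "Table of Contents", "chapter": "Chapter"},
    "fr": {"tableofcontents": "Table des matières"},
}


def lookup(lang, key, fallback="en"):
    """Return the message for `key` in `lang`, else the fallback wording."""
    for candidate in (lang, fallback):
        value = MESSAGES.get(candidate, {}).get(key)
        if value is not None:
            return value
    # Neither the requested language nor the fallback knows this key.
    raise KeyError(f"no message {key!r} for {lang!r} or {fallback!r}")
```

With this sketch, `lookup("fr", "chapter")` falls back to the English wording instead of yielding nothing, which is the behavior the commented-out pattern files were meant to approximate.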

Ah ah, now we can play...

  • Tests are ok (two of them needed basic fixes)
  • But the manual no longer compiles...
[80] 
Error detected:
	./lua_modules/share/lua/5.3/fluent/messages.lua:322: attempt to index a nil value
make[2]: *** [Makefile:1452: documentation/sile.pdf] Error 1

I started investigating this, and I am finding some other issues... (E.g. that fluent thing seems to leak beyond its expected scope, with possible code smell at one point -- I will go on looking before opening a dedicated report, but it's kind of orthogonal to this very PR).


@alerque
Member

alerque commented Jul 15, 2022

I was a bit annoyed since my first attempts with SILE 0.10 that in English, with the standard book class, I would get "Chapter 1" whereas in French I would still get only "1".

Yes, fair criticism. There is neither linguistic nor typographical justification for that difference.

And how are we expecting this to go on, language after language, and at the same time be sort of future proof? (What about "parts", "list of figures", "list of tables", etc. -- which some of us may already have in their classes)...

I'm of two minds about including keys we don't yet use, but I'll stew on that while I work on other stuff going in this release.

   * Translations come from several sources

Fantastic work, I meant to look into pulling together sources myself...

The latter is LPPL-licensed, which could have been an issue, but I do think: [...] Mere language strings cannot be protected anyway by any kind of license (and that's the only thing we use here, in a different form and without any other code)

I think this is correct. If it's not, we actually have bigger issues with some of our hyphenation patterns, which I think were lifted from TeX.

   * I am not sure what `el` expects to be, for now the localization file is a link to `el-monoton` (corresponding to modern Greek)

I think that's right, as far as I know grc should be used for ancient Greek. The caveat is I've had trouble getting fonts to recognize that, but we'll deal with that when we work on CLDR fallbacks.

3. Get rid of the awkward `hello { $name }` patterns...

I may object to this. The whole point of this string was having something in each language that is not an actual localization used in SILE output, for testing purposes, so we can test that the right localization is being used. Having a substitution is important for this.

Don't worry about fixing it, I'll look into it.

5. Change the toc key.

I think I want to scope FTL terms by package, so tableofcontents is not a good key, that's the namespace. We need a key after that, tableofcontents-header is a namespace plus a key.

I'll look into fixing this and perhaps others that didn't get scoped by package.

@Omikhleia
Member Author

I think I want to scope FTL terms by package, so tableofcontents is not a good key, that's the namespace. We need a key after that, tableofcontents-header is a namespace plus a key.

Yup, possibly. That's what I actually tried for chapters etc. (with the parametrized chapter-template, but that could be changed to chapter-numbering header or such). My take on it was to have the base translation terms (chapter = "Chapter") distinct from the assembling rules used by class/packages ("Chapter N", or whatever order a language uses). In the same vein, I do think tableofcontents is actually a good standalone key for "Table of Contents" (= whether used as ToC header or not, that's the exact translation), though we might have an additional indirection too for the one used as a ToC header. Not sure it's really needed here, though, in that very case. But by calling it tableofcontents-header or so, we would be conflating the translation with (one of) its potential uses, and that's not so good.
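To make the distinction concrete, here is a rough Python sketch of base terms kept separate from per-language assembling rules. The `TERMS` and `TEMPLATES` tables and the `heading` helper are hypothetical names for illustration only; they are not the keys or code used in this PR.

```python
# Base translation terms: the bare word in each language.
TERMS = {
    "en": {"chapter": "Chapter", "part": "Part"},
    "hu": {"chapter": "fejezet", "part": "rész"},
}

# Assembling rules: how each language combines the term with a number.
TEMPLATES = {
    "en": {"chapter": "{term} {number}", "part": "{term} {number}"},
    "hu": {"chapter": "{number}. {term}", "part": "{number}. {term}"},
}


def heading(lang, kind, number):
    """Assemble a heading from the base term and the language's template."""
    template = TEMPLATES[lang][kind]
    return template.format(term=TERMS[lang][kind], number=number)
```

Here `heading("en", "chapter", 1)` gives "Chapter 1" while `heading("hu", "chapter", 1)` gives "1. fejezet", so the word order lives in the template rather than in the calling class or package.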

I'm of two minds about including keys we don't yet use, but I'll stew on that while I work on other stuff going in this release.

I am not even hiding where I am heading to for a revised/augmented book class 🤣

... Actually the month names are one of the things I didn't import from Babel.... Though this could be useful for BibTeX (but I only worked on that package after...). Yet, there are other complexities there anyway (e.g. issue numbers such as "no. 5" are not a translation issue only and would need a command hook, e.g. "n° 5" in French (with a superscript o) and similarly in some other languages.)

As for the "Hello World", we can use any of the useful strings instead of it, or instantiate it on need for demonstration purpose. It avoids having to guess it for 70+ languages... It's also very idiomatic and inherently difficult - e.g. for French I wouldn't be that at ease to find a translation... ("Bonjour Monde" is agrammatical, an article would be needed, but for languages with genders, that's quickly messy and the pattern cannot be general any longer).

@alerque
Member

alerque commented Jul 16, 2022

My take on it was to have the base translation terms (chapter = "Chapter") distinct from the assembling rules used by class/packages ("Chapter N", or whatever order a language uses). In the same vein, I do think tableofcontents is actually a good standalone key for "Table of Contents" (= whether used as ToC header or not, that's the exact translation), though we might have an additional indirection too for the one used as a ToC header. Not sure it's really needed here, though, in that very case. But by calling it tableofcontents-header or so, we would be conflating the translation with (one of) its potential uses, and that's not so good.

On the contrary, keying off of the potential use is exactly what we want. In fact all the keys should be tightly scoped to their intended use. This is where Fluent stands head and shoulders above other localization systems: it is not just a key/value store that leaves the intended use of strings up to the programmer to sort out, it enables the translator to know enough context about the actual usage to provide a natural and accurate translation without the programmer having to understand the complexities of target languages.

Assembling translations out of smaller building blocks is possible and in some cases a good idea, but again the way Fluent handles this is putting the translator in charge. The smaller building block translations should be private to the FTL file and not exposed to the programmer (in our case SILE & documents).

Using the ToC as our case in point, at the moment the only context in which SILE uses this is the header, but let's say it also spat out a CLI message saying what it was doing. You could use a term like this:

-tableofcontents = Table of Contents

tableofcontents-header = { -tableofcontents }

cli-generating-toc = Now generating { -tableofcontents }

This is somewhat contrived and clumsy, but the point is to demonstrate how a term would be used. SILE would not be able to access the term, only the public messages. Each language could use or not use terms as they saw fit as an implementation detail.
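To illustrate the private-term / public-message split, here is a toy resolver in Python. It is a sketch under strong assumptions: `parse` and `format_message` are hypothetical helpers that only handle single-line entries and `{ -term }` references, and they are not the API of any real Fluent implementation.

```python
import re

FTL = """\
-tableofcontents = Table of Contents
tableofcontents-header = { -tableofcontents }
cli-generating-toc = Now generating { -tableofcontents }
"""

# Matches a term reference such as "{ -tableofcontents }".
TERM_REF = re.compile(r"\{\s*(-[\w-]+)\s*\}")


def parse(source):
    """Toy parser: one 'key = value' entry per line, nothing else."""
    entries = {}
    for line in source.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            entries[key.strip()] = value.strip()
    return entries


def format_message(entries, key):
    """Resolve a public message; terms (leading '-') are not reachable."""
    if key.startswith("-"):
        raise KeyError(f"{key} is a private term, not a public message")
    return TERM_REF.sub(lambda m: entries[m.group(1)], entries[key])
```

The caller can ask for `tableofcontents-header` or `cli-generating-toc`, but a request for `-tableofcontents` is refused, which mirrors the idea that terms are an implementation detail of each language file.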

I am not even hiding where I am heading to for a revised/augmented book class 🤣

I kind of figured ;-)

... Actually the month names are one of the things I didn't import from Babel.... Though this could be useful for BibTeX (but I only worked on that package after...). Yet, there are other complexities there anyway (e.g. issue numbers such as "no. 5" are not a translation issue only and would need a command hook, e.g. "n° 5" in French (with a superscript o) and similarly in some other languages.)

Yes, we're going to have to think about how to handle SILE command hooks in messages. Should we just post process the message as SIL format input?

no-number = N\super{o} { $num }

I haven't thought through what the syntax & processing impacts of that would be but certainly that already hits a syntax conflict with braces that would not be easy to work around while staying compatible with other Fluent tooling and not being hideously ugly/unwieldy for translators.

Assuming XML format on post processing might be easier, e.g.:

no-number = N<super>o</super> { $num }
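As a rough idea of what the XML variant could look like in post-processing, here is a Python sketch that substitutes the variable and then turns each XML element into a SIL-style `\command{...}`. The `to_sil` and `_render` helpers are hypothetical, the placeable substitution is a naive string replace rather than real Fluent formatting, and nothing here reflects how SILE would actually implement it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_sil(message, **variables):
    """Fill '{ $name }' placeables, then map XML elements to \\tag{...}."""
    for name, value in variables.items():
        # Escape values so that '<' or '&' in data cannot break the XML parse.
        message = message.replace("{ $" + name + " }", escape(str(value)))
    root = ET.fromstring(f"<msg>{message}</msg>")
    return _render(root, top=True)


def _render(node, top=False):
    """Serialize a node: text, then each child followed by its tail text."""
    inner = (node.text or "") + "".join(
        _render(child) + (child.tail or "") for child in node
    )
    # The synthetic <msg> wrapper is dropped; real elements become commands.
    return inner if top else f"\\{node.tag}{{{inner}}}"
```

Calling `to_sil("N<super>o</super> { $num }", num=5)` yields `N\super{o} 5`, so the translator never has to write SIL braces next to Fluent braces.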

As for the "Hello World", [...] It's also very idiomatic and inherently difficult - e.g. for French I wouldn't be that at ease to find a translation... ("Bonjour Monde" is agrammatical, an article would be needed, but for languages with genders, that's quickly messy and the pattern cannot be general any longer).

Yes, that's exactly why it's a useful demo term! I don't think we need to have it for every language, but it would make a good demonstration case for how French can implement it one way and a language with grammatical genders for names could implement it a different way. Having this tech demo string to play with in tests & docs be something other than a string actually used in outputs gives us the flexibility to play with it and update docs and demos without it ever being a breaking change for anyone. If we used some existing key for that we would be limited to how that key was actually used in practice.

@alerque
Member

alerque commented Jul 19, 2022

@Omikhleia Did you generate these with a script that I could perhaps use to re-generate them with some tweaks? I have a few bulk edits I want to make (like using Fluent terms and package namespaces for keys) but it might be easier with access to the original script rather than writing one from scratch.

@Omikhleia
Member Author

@alerque

@Omikhleia Did you generate these with a script (...)

Lovingly but manually crafted
...well, I don't think sed/grep/awk commands count as scripts, and even... they are no longer in my history :( - but there was quite an amount of manual tweaking and re-ordering.

@alerque
Member

alerque commented Jul 19, 2022

Roger that, and no problem. I just thought it was worth asking.

By the way, when I do jobs like that, something I usually do is commit the automatic stuff first, with the command I used to generate it in the commit message so I could redo that step later if needed, then follow up with commits for the manual bits. That has served me well sometimes when I want to come back later and tweak the automatic stuff and re-apply the manual tweaks on top of the new base.

Omikhleia and others added 3 commits July 19, 2022 22:52
This reverts commit 17d3ce6 and bits of
previous commits.
We *want* to fail with an error if localized strings are requested for
non-languages, so providing generic English stand-ins seems like
a counter-productive move.

Additionally for anybody starting a new localization it would be better
to start with a language similar to their target because implementation
details might differ. With simple key/value lookups it might seem like
any language could be a template for any other, but Fluent is much more
flexible than that.
@alerque
Member

alerque commented Jul 23, 2022

Just a heads up my current thinking on this is that I really don't want to include keys for things we don't currently use. I also want to stick to Fluent norms of only exposing the full contextual translations and making any partial translation such as the keywords used here private terms. I plan on keeping these commits around to cherry pick from as we add classes/packages/features that use these keys, but will cherry pick just the ones we use so far for the next release.

@Omikhleia
Member Author

I really don't want to include keys for things we don't currently use.

Why not just make these keys private, and move on?

@Omikhleia
Member Author

Anyhow, are you going to fix, first of all, the fluent scope-leaking stuff?

While lovingly hand crafted, they don't reflect the structure of how
we're going to use keys as a namespace and they are getting in the way
of automatically reprocessing these files.

Also we don't really want to include commented keys for untranslated
terms because that is one more thing to maintain and get out of sync
without necessarily providing a benefit. We want people to reference
similar languages when translating.

```console
sed -i -e '/^#/d' *.ftl
sed -i -e '/^$/d' *.ftl
```
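For anyone redoing this step without sed, the two commands above can be approximated in Python as sketched below. The `strip_comments_and_blanks` and `clean_directory` helpers are hypothetical names, and like the sed patterns they only drop lines that start with `#` in column one and lines that are completely empty.

```python
from pathlib import Path


def strip_comments_and_blanks(text):
    """Drop lines starting with '#' and empty lines, like the sed calls."""
    kept = [
        line for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    return "\n".join(kept) + ("\n" if kept else "")


def clean_directory(directory):
    """Rewrite every .ftl file in `directory` in place."""
    for path in Path(directory).glob("*.ftl"):
        original = path.read_text(encoding="utf-8")
        path.write_text(strip_comments_and_blanks(original), encoding="utf-8")
```

As with `sed -i`, `clean_directory` modifies files in place, so it is best run on a clean git worktree.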
```console
sed -i -e '/Rerun SILE/d' *.ftl
git checkout -- i18n/en.ftl
```
In the future on an as-needed basis these could be converted to terms,
but so far there is no use case, and doing this should be a translator's
implementation choice not something SILE suggests.

Prepend something to all files, then:

```vim
silent normal! 0gg/^appendix
df=xd$nf{v%p
silent normal! 0gg/^part
df=xd$nf{v%p
silent normal! 0gg/^chapter
df=xd$nf{v%p
```
Then touch up eu, hr, lt
@alerque alerque mentioned this pull request Aug 4, 2022
@alerque alerque removed this from the v0.14.0 milestone Aug 4, 2022
@alerque
Member

alerque commented Aug 4, 2022

Refactoring the package system with modules attached to classes fixed most of the scope leak issues with Fluent. There may be more, but the ones I know of are fixed. Do note that the global fluent instance just uses whatever locale it was last set to, so it must be poked with the current document language every time it is used. I do expect to wrap this in some better abstractions as we sort out setting scope in general (see #1327 and others).
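The "sticky locale" behavior and one possible wrapper can be sketched in Python as follows. This is purely illustrative: the `Bundle` class, the `locale` context manager and the `localized` helper are hypothetical stand-ins, not the fluent-lua API or SILE's code.

```python
from contextlib import contextmanager


class Bundle:
    """Toy stand-in for a global Fluent instance with a sticky locale."""

    def __init__(self, messages):
        self.messages = messages
        self.locale = "en"

    def set_locale(self, locale):
        self.locale = locale

    def get_message(self, key):
        # Uses whatever locale was set last, regardless of who set it.
        return self.messages[self.locale][key]


fluent = Bundle({"en": {"chapter": "Chapter"}, "fr": {"chapter": "Chapitre"}})


@contextmanager
def locale(bundle, lang):
    """Set the locale for the duration of a block, then restore it."""
    previous = bundle.locale
    bundle.set_locale(lang)
    try:
        yield bundle
    finally:
        bundle.set_locale(previous)


def localized(bundle, lang, key):
    """Look up `key` with the locale poked explicitly for this one call."""
    with locale(bundle, lang) as scoped:
        return scoped.get_message(key)
```

The point of the wrapper is that every lookup states its language explicitly and leaves the global state as it found it, so one caller cannot silently change the language seen by the next.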


I'm leaving this PR open because there is still lots of good stuff in here to be mined. For the next release I'm only including localization of strings we use internally, but I'm not ruling out preloading more; we just need to work out the key naming a bit.

Just for the record, I have a partially refactored version of the branch in this PR in my fork here that (with the commands noted in some of the commits here) makes it easy to fetch strings from a git worktree of this PR into another branch.

@alerque alerque marked this pull request as draft August 16, 2022 10:33
@Omikhleia
Member Author

This old PR never came to fruition as-is and now has conflicts.

Due to inactivity, and as part of backlog cleaning, I would have been tempted to close it, but the topic was interesting.

But we did address the important parts of it.

On one hand:

I'm leaving this PR open because there is still lots of good stuff in here to be mined. For the next release I'm only including localization of strings we use internally, but I'm not ruling out preloading more (...)

However, it never happened in 2+ years and the conflicts and recent changes won't help...

Moreover, on the other hand, anyone is free, as I did back then, to "mine" Babel and other existing resources and propose something to SILE in today's state of the art.

So I'm adding the "pending closure" as a label - As a last chance for contributors to eventually say their word.
After some period of time, this inactive PR will be closed without further notice.

@Omikhleia Omikhleia added the pending closure Backlog cleaning (inactive non bug-issues) label Oct 22, 2024
@alerque
Member

alerque commented Oct 29, 2024

I have some local notes on things I still hope to mine out of this, let's keep it around until I finish. The merge conflicts aren't worth fixing en masse, but cherry-picking out of here is still useful; my doing so is pending some other language handling changes.

@alerque alerque removed the pending closure Backlog cleaning (inactive non bug-issues) label Oct 29, 2024
@alerque alerque self-assigned this Oct 29, 2024
Labels
None yet
Projects
Status: In Progress
2 participants