Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split resolveMessage() into format() & formatToParts() #22

Merged
merged 6 commits into from
Aug 18, 2023
Merged

Conversation

eemeli
Copy link
Member

@eemeli eemeli commented Jul 22, 2023

This is a pretty deep rewrite of the proposed API. I started by drafting this as a sequence of smaller steps, but this is the smallest first step that I could separate out; further PRs will add support for markup expressions, bidi isolation, and additional default formatters. Much of this has been informed by offline conversations with @littledan in particular, as well as recent progress and conversations in the Unicode MFWG.

Fundamentally, this set of changes reconsiders how the API deals with objects. These need to be address by Intl.MessageFormat in three different places:

  1. On input, we need to support values that contain e.g. a number and a currency or unit specifier. Matching the style of Intl.NumberFormat, this means a value combined with a basket of options.
  2. During processing, the inputs and output of a custom message function need to be defined.
  3. On output, the formatted-part representation of a message needs to support non-string values.

Previously with the resolveMessage() approach, a single definition was shared for all three use cases. Technically, this works, but not without at least three significant drawbacks:

  1. It forces every user to consider values that are resolved but not yet formatted. They're not getting a ready output, but something that may provide them with an output. The mental load here is pretty heavy.
  2. It's different from the current behaviour of other Intl formatters.
  3. The relatively common use case of formatting a message to a string requires the construction of a bunch of intermediate JS objects that are then immediately dismissed, which is pretty inefficient.

The alternative approach proposed here matches existing Intl formatters much more closely, with format() providing a string and formatToParts() an array of static message parts, rather than values with functions attached to them. The match isn't perfect, but it's pretty close. This change does sacrifice some of the expressibility and re-use that resolveMessage() enabled, but it should be easier to use and reason about.

Compared to the previous, this is how object values work here:

  1. On input, the definition and requirements are much looser, and separately defined for each function.
  2. For custom message functions, the updated MessageValue is very similar to its previous incarnation, but in most use cases, these objects won't ever need to be reified.
  3. On output, the formatted parts do not contain any methods, only property values. Their structure is still a bit more varied and complex than the { type: string, value: string } used by other Intl formatters. This is unfortunately unavoidable, as not all the values formatted by Intl.MessageFormat are representable as strings, and some (such as numbers) may contain sub-parts, rather than a single string value.

Rather than looking at the diff, this PR may be easier to review by looking at its resulting shape on the updates branch. I've asked for reviews from a number of people for this, as I figured you might be interested parties in this significant a change. I don't really expect everyone to reply.

Copy link
Contributor

@stasm stasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the direction of this change. I recall being a bit confused by the resolveMessage() API. (In retrospect, I should have filed an issue about it.) This PR addresses most of the reasons for my confusion and makes the API similar to what I imagined in my implementation experiment for MessageFormat 2 — but more complete with errors and resolution fallback. Some more early feedback below.

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
it resolves instead to the result of calling the `string` function
with the value as input and no options.

Otherwise, un-annotated values resolve to the following shape:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to suggest that format and formatToParts take a dictionary whose values are either unknown or instances of MessageValue. This allows two interesting use-cases:

  1. Pre-set certain formatting options on input values.
  2. Pass instances of custom types which extend MessageValue.

Then, you'd first check if an un-annotated value is an instance of MessageValue. If so, carry on. If not, resolve to MessageUnknownValue. This way, you don't have to try to resolve every expression to a number and string first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to suggest that format and formatToParts take a dictionary whose values are either unknown or instances of MessageValue. This allows two interesting use-cases:

  1. Pre-set certain formatting options on input values.

This is possible without making MessageValue a fully-fledged class. To see what I mean, take a look at how the proposed behaviour of the :number formatter constructs options.

  1. Pass instances of custom types which extend MessageValue.

What would be the benefit of allowing that to happen without a custom annotation on the expression? As proposed, it's possible to return custom MessageValue extensions from a user-defined function, but it does require an annotation on the MF2 expression.

Then, you'd first check if an un-annotated value is an instance of MessageValue. If so, carry on. If not, resolve to MessageUnknownValue.

How would you propose that we "check if an un-annotated value is an instance of MessageValue"? As currently proposed, that's just an interface description for plain objects with some expected properties.

This way, you don't have to try to resolve every expression to a number and string first.

The number/string/unknown resolution that I think you're referring to here is what's applied with an unannotated expression, like {$foo}. It's not applied when resolving e.g. {$foo :custom}, which instead calls the custom function that's defined in the constructor options, which is expected to return a MessageValue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the benefit of allowing that to happen without a custom annotation on the expression? As proposed, it's possible to return custom MessageValue extensions from a user-defined function, but it does require an annotation on the MF2 expression.

"Require" isn't the best word here, because there's nothing we can do on the syntax or data model validation level to prevent such uses. So we're left with runtime errors or MessageUnknownValues. Instead, I suggest we allow passing instances of MessageValues as input variables, so that:

  1. they produce known MessageParts when used in a placeholder without an annotation,
  2. they can be used as arguments and options to custom functions.

For example, it may be useful to be able to pass as input variables data that's not supposed to be formatted at all and instead be positioned in the correct spot inside the translation: see the WrappedValue example from my implementation.

How would you propose that we "check if an un-annotated value is an instance of MessageValue"? As currently proposed, that's just an interface description for plain objects with some expected properties.

It should be a base class instead, and custom function authors should be able to subclass it for the purposes of their custom logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #24 to continue the discussion about whether MessageValue should be subclassable, and if it should be allowed to be passed as input variables.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
Co-authored-by: Stanisław Małolepszy <[email protected]>
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
#### Default Functions

Two commonly used message functions `number` and `string` are provided as a starting point,
and as handlers for placeholders without an annotation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
and as handlers for placeholders without an annotation,
and as handlers for expressions without an annotation,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is deliberate. Selectors without an annotation do not get a default handler.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@dminor dminor removed their request for review July 26, 2023 13:45
@dminor
Copy link
Collaborator

dminor commented Jul 26, 2023

The reviews from Tim and Stas are quite insightful and I don't think there's much more I can add here.

Copy link
Member

@ryzokuken ryzokuken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, thanks. I know its not from your PR but the onError method would certainly make things tricky by making users employ two different error handling mechanisms. Could this behavior be replaced using regular JS error handling instead? Allowing people to catch a subclassed "MessageFormatError" instead?

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
type: 'fallback';
value: undefined;
locale: 'und';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to return something like null that makes it obvious that it's not a useful value? Otherwise someone might write code that extracts this and passes this elsewhere without checking for "und" (I suppose they could still do that with null but atleast the error would be more obvious then).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we ought to make und into a working reference for the root locale, but that's rather from this proposal. I need to organise my thoughts on this some more and raise an issue about this in TG2.

README.md Outdated Show resolved Hide resolved
and as MF2 considers all literals to be strings,
the JSON string representation should be supported for numerical and boolean values
when used as an input or as an option value.
Each function will need to parse these separately from their string representations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably link to the grammar in ECMA-404 for this

README.md Outdated Show resolved Hide resolved
@eemeli
Copy link
Member Author

eemeli commented Jul 31, 2023

From @ryzokuken:
[...] I know its not from your PR but the onError method would certainly make things tricky by making users employ two different error handling mechanisms. Could this behavior be replaced using regular JS error handling instead? Allowing people to catch a subclassed "MessageFormatError" instead?

This is actually mandated by the MF2 spec, which currently includes in its Error Handling section:

In all cases, when encountering an error, a message formatter MUST provide some representation of the message. An informative error or errors MUST also be separately provided.

To be fair, I did write that part of the MF2 spec, but did so rather intentionally and with the support of the MFWG.

Intl.MessageFormat is different from the existing Intl formatters in that it relies on data (a message) provided by a user, and for any given message, the data is often not directly controlled by a developer, but provided by a translator. In other words, this interface ends up relying on inputs which will fail more often than existing formatters, and will do so without being caught in automated tests.

From experience, it turns out to be really important for localization not to fail catastrophically, because that will lead to broken user experiences because not all formatting calls will be wrapped in a try...catch. Hence the default behaviour of warning on error, rather than throwing.

Copy link
Contributor

@stasm stasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said in my previous review, the direction of this PR is good, but I have reservations about certain design choices here still. If you prefer to file new issues about these, I'll be happy to discuss them further outside this PR. (They will definitely be easier to discuss against this PR's spec than the current main.)

A summary of my feedback:

  • There's more work needed around the MessageValue type and its use, both internal and potentially external; I think it should be a subclassable type available to users and authors of custom functions.

  • The wording of the spec would benefit from defining abstract operations that convert unknown values to known interfaces.

  • MessageValue should not have a locale property. Instead, the formatting methods should take locale as argument.

it resolves instead to the result of calling the `string` function
with the value as input and no options.

Otherwise, un-annotated values resolve to the following shape:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the benefit of allowing that to happen without a custom annotation on the expression? As proposed, it's possible to return custom MessageValue extensions from a user-defined function, but it does require an annotation on the MF2 expression.

"Require" isn't the best word here, because there's nothing we can do on the syntax or data model validation level to prevent such uses. So we're left with runtime errors or MessageUnknownValues. Instead, I suggest we allow passing instances of MessageValues as input variables, so that:

  1. they produce known MessageParts when used in a placeholder without an annotation,
  2. they can be used as arguments and options to custom functions.

For example, it may be useful to be able to pass as input variables data that's not supposed to be formatted at all and instead be positioned in the correct spot inside the translation: see the WrappedValue example from my implementation.

How would you propose that we "check if an un-annotated value is an instance of MessageValue"? As currently proposed, that's just an interface description for plain objects with some expected properties.

It should be a base class instead, and custom function authors should be able to subclass it for the purposes of their custom logic.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
localeContext?: LocaleContext;
source?: string;
meta?: Record<string, string>;
locale: string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MessageValue should not have a locale property. That's what Parts are for.

Consider {$count :number}. The number stored in $count may be decorated with formatting options like minimumFractionDigits etc., but it's not in any particular locale just yet. It's the locale of the message (or the placeholder if we allow overriding it) that decides how $count will be formatted.

There's nothing inherently language-specific about the mathematical value of 5 :) Same goes for many string literals, like user names or Wi-Fi network names.

I understand that we need the locale somewhere for toString and toParts to work. I think we should just pass it as argument: toString(locale). Perhaps also call it toLocaleString() for consistency with other JS APIs.

Alternatively, we can consider passing in a larger structure: toString(fmtctx). Such "formatting context" may be useful if we want to allow implementations to cache formatters.

(A formatting context should be passed into formatting methods rather than be referenced by a property of MessageValue to help avoid memory leaks and facilitate IPC use-cases.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the locale here to support messages which embed parts that are not in the message's locale. Consider for example:

The number 1,234.5 is formatted in French as "1 234,5".

When formatting that to parts, the last two parts end up as:

[
  ...,
  { type: 'number', locale: 'fr', parts: [{ type: 'integer', value: '1' }, ...] },
  { type: 'literal', value: '".' }
]

To get there, the fr locale needs to be available in the corresponding MessageValue. It does not need to be available in $count, because :number defines its own expected input shape, and for that the locale is optional.

Fundamentally, MessageValue is a resolved value that contains everything needed to format it, or to do something else with it like feed it into some custom function. That includes its locale.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #25 to continue this, because I still think locale is special and should not be part of MessageValue.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@eemeli
Copy link
Member Author

eemeli commented Aug 12, 2023

@stasm I applied some changes in response to your comments, in particular wrt. the MessageFunction types. What I had before was not really workable, and misleading.

As it's currently structured, MessageValue is not intended as an input data type, but as a primarily internal "resolved value" shape. It is a superset of the interfaces required by value+options input, so that e.g. the resolved value of a :number expression can be used as input for another :number expression, but most of the time it doesn't show up in the user's JS code at all.

I think the value gained by making it a real class is marginal, and not worth the cost, but I'm open to being persuaded otherwise. Would it be possible for you to open a separate issue discussing the possibility of doing so as a further step, so that we don't block this PR on it?

@stasm
Copy link
Contributor

stasm commented Aug 18, 2023

Thanks for the updates. I filed the following issues to continue the discussion without blocking this PR:

It's going to be easier to discuss them against this PR rather than the current state of main, so I'm in favor of merging it.

Comment on lines 64 to +66
const mf = new Intl.MessageFormat(source, ['en']);
const notifications = mf.resolveMessage({ count: 1 });
notifications.toString(); // 'You have one new notification'
const notifications = mf.format({ count: 1 });
// 'You have 1 new notification'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The relatively common use case of formatting a message to a string requires the construction of a bunch of intermediate JS objects that are then immediately dismissed, which is pretty inefficient.

Looks like this PR still needs a lot of intermediate JS objects.

I thought it would be something like this:

const mf = new Intl.MessageFormat('en');
const notifications = mf.format(source, { count: 1 });
const another_string = mf.format(source2, { interop: 1 });

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That approach would leave the greatest part of the work to the format() calls, though, and that's quite different from how existing Intl formatters work.

Consider for instance how unit formatting works:

const nf = new Intl.NumberFormat('en-GB', { style: 'unit', unit: 'kilometer', unitDisplay: 'long' });
nf.format(42); // "42 kilometres"

There, everything except for the final runtime values is included in the constructor. Now, let's say that we wanted that distance to be embedded in a longer message. With the currently proposed API that would look like this:

const src = `
  let $dist = {$dist :number style=unit unit=kilometer unitDisplay=long}
  {Your destination is in {$dist}.}`;
const mf = new Intl.MessageFormat(src, 'en-GB');
mf.format({ dist: 42 }); // "Your destination is in 42 kilometres."

In other words, when getting either a NumberFormat or MessageFormat instance, we should be able to tell what sort of a thing it's formatting. If we leave the source to be a format() argument for MessageFormat, we won't be able to really say anything at all about the MessageFormat instance, or whether the message source is even syntactically valid.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jack-Works As moving the source argument to the format() and formatToParts() methods could be considered as a next step from this PR, please open a separate issue to consider that. I would be happy to continue there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I agree, but does that mean we need to create a new object for every message we have?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's what it means.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants