TeX2MathML converter implementation guidelines #39
Collecting macros that each create a tiny tree for one of the MathML Core elements is certainly quite doable. My fear is that, to make the list useful, you'd also have to collect the macros needed to generate the different meaningful values of each MathML Core attribute, and then macros to generate some of the idiomatic expression trees. For example, would such a list care whether the script in ( \ldots )^2 attaches to the full parenthetical base (LaTeXML with grammar enabled):
<msup>
<mrow>
<mo stretchy="false">(</mo>
<mi mathvariant="normal">…</mi>
<mo stretchy="false">)</mo>
</mrow>
<mn>2</mn>
</msup>
vs. attaching to the closing fence (via MathJax):
<mo stretchy="false">(</mo>
<mo>…</mo>
<msup>
<mo stretchy="false">)</mo>
<mn>2</mn>
</msup>
vs. not attaching at all (LaTeXML with grammar disabled via
<mo>(</mo>
<mi mathvariant="normal">…</mi>
<mo>)</mo>
<msup>
<mi/>
<mn>2</mn>
</msup>
The elements are about the same, but the trees are markedly different. Well, there's also apparent debate whether the ellipsis is an mi or an mo. I fear that a useful list will have to spend a lot more words on tree structure than on the individual leaf elements. |
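To make the "same elements, different trees" point concrete, here is a rough Python sketch that parses the three fragments quoted above and compares them. Wrapping the second and third fragments in an mrow so they parse as a single tree is my assumption, and the helper names (tag_counts, shape, script_base) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Script attached to the full parenthetical base.
FULL_BASE = (
    '<msup><mrow><mo stretchy="false">(</mo>'
    '<mi mathvariant="normal">…</mi>'
    '<mo stretchy="false">)</mo></mrow><mn>2</mn></msup>'
)
# Script attached to the closing fence (wrapped in <mrow> so it parses).
CLOSING_FENCE = (
    '<mrow><mo stretchy="false">(</mo><mo>…</mo>'
    '<msup><mo stretchy="false">)</mo><mn>2</mn></msup></mrow>'
)
# Script attached to an empty <mi/> (wrapped in <mrow> so it parses).
EMPTY_BASE = (
    '<mrow><mo>(</mo><mi mathvariant="normal">…</mi><mo>)</mo>'
    '<msup><mi/><mn>2</mn></msup></mrow>'
)


def tag_counts(xml_text):
    """Count how often each element name occurs, ignoring structure."""
    return Counter(el.tag for el in ET.fromstring(xml_text).iter())


def shape(el):
    """Nested tuple of element names: the tree without attributes or text."""
    return (el.tag, tuple(shape(child) for child in el))


def script_base(xml_text):
    """Element name of the base (first child) of the first msup found."""
    root = ET.fromstring(xml_text)
    msup = root if root.tag == "msup" else root.find(".//msup")
    return msup[0].tag


for name, xml_text in [("full base", FULL_BASE),
                       ("closing fence", CLOSING_FENCE),
                       ("empty base", EMPTY_BASE)]:
    print(name, dict(tag_counts(xml_text)), "base of msup:", script_base(xml_text))
```

A list organized per element would treat these three outputs as near-equivalent, while a comparison of shape() results keeps them apart, which is why the tree structure probably needs to be part of any test expectation.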
In order to get the intent attributes, there will need to be TeX macros
corresponding to the intents.
For example, suppose we want to convey this (true but mostly useless)
fact:
If x squared is strictly between 0 and 100 then the absolute value of x
is in the open interval from 0 to 10.
Possible TeX markup to capture that meaning (although it might not be
pronounced as written above) is:
If $0 < x^2 < 100$ then $\abs{x} \in \oointerval{0}{10}$.
The "in" macro is standard.
The macros "abs" and "oointerval" need to be defined. (And maybe
have different names.) Depending on how the macro is defined,
the open-open (i.e., open at both ends) interval could be written
as \oointerval{0, 10} , making it a function of one parameter but
requiring the author to type the comma.
My point is that the absolute value and the open interval have to
be macros, because this version requires guessing the intents:
If $0 < x^2 < 100$ then $|x| \in (0, 10)$.
It will be good to have a discussion about how we think actual
authors will write material that captures the intent.
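As a minimal sketch of what such definitions might look like on the TeX side (the macro names come from the example above; the bodies are only one possible presentation and are my assumption, and they carry no intent information by themselves, since a converter would still have to map each macro name to an intent):

```latex
\documentclass{article}
\usepackage{amsmath} % provides \lvert and \rvert

% Hypothetical semantic macros; a converter could key intents off these names.
\newcommand{\abs}[1]{\lvert #1 \rvert}   % absolute value of #1
\newcommand{\oointerval}[2]{(#1, #2)}    % open interval from #1 to #2

\begin{document}
If $0 < x^2 < 100$ then $\abs{x} \in \oointerval{0}{10}$.
\end{document}
```

The printed result is the same as the version with bare delimiters; the difference is only that the source now says which meaning was intended.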
|
@davidfarmer I think even in a controlled environment like Wikipedia, with a very restricted set of commands it is hard to predict what people will actually write. Often there is a lot of formatting included. For example, for your interval example, the actual code in Wikipedia looks like this
As you can see, due to the absence of a native TeX or MathML-based solution to annotate intent, people came up with custom templates such as closed-open, etc. However, those templates are hard for authors to discover. For example, the closed-open template is used only 45 times within English Wikipedia. On the other hand, I think the effort people spend writing and rewriting Wikipedia articles is much higher than the effort to write a paper once and upload it to arXiv. Feel free to look at the statistics of the interval example. Just to quote one of many impressive numbers: the average time between edits is 8.2 days. |
@dginev I think it would be useful to write this, at least for everyone implementing tex2mathml converters, which could be 20 people or even more. I am afraid I might have co-authored papers that are read by fewer people ;-) Maybe we can just start something and see how it goes. Can you recommend an authoring tool? |
I think for this type of collaborative writing, HackMD may be my current default choice. They have highlighting of TeX and XML snippets (similarly to GitHub issues), and also have native MathML rendering (which I had asked of them some time back, |
@physikerwelt made what I think is a key point: "having users change the MathML code either directly or via a WYSIWYG editor is not a solution"
What is needed, especially in an environment like Wikipedia, is either
a) A human-readable, human-writable source format which automatically
b) An editing program which lets a person create the content, and which all
My hope is to support option a). The key point @physikerwelt made is a warning
Both options can coexist if the editing program of b) can output the source
I like option a) because it provides an archival format which can adapt to future |
@davidfarmer exactly. Just to link it back to the Wikimedia terminology: a) corresponds to wikitext and b) corresponds to VisualEditor.
I would like to mention that the development of the VisualEditor was extremely challenging due to the constraint that the wikitext output should still remain human-readable and editable. This constraint did not just add a bit more effort; it put the project into a whole new class of problems and increased the effort by several orders of magnitude. |
A quick update on that. @Hyper-Node and I have now converted the LaTeX (subset) parser to PHP and now have an AST of the LaTeX representation. We are now looking for ideas on how to generate the MathML output from that AST. @Hyper-Node is going to look into the MathJax source code to come up with a high-level design for generating the MathML output from that tree. I will generate the lists mentioned before. |
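For the AST-to-MathML step, one simple starting point is a recursive walk that maps each node kind to one MathML element. The Python sketch below only illustrates that idea; it is neither the PHP parser mentioned above nor MathJax's design. The tuple-based AST, the node kinds, and the names LEAF_TAGS and to_mathml are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical AST node kinds:
#   ("ident", text), ("num", text), ("op", text)  -> token elements
#   ("sup", base, exponent)                        -> <msup>
#   ("row", [children])                            -> <mrow>
LEAF_TAGS = {"ident": "mi", "num": "mn", "op": "mo"}


def to_mathml(node):
    """Recursively convert one AST node into an ElementTree element."""
    kind = node[0]
    if kind in LEAF_TAGS:
        el = ET.Element(LEAF_TAGS[kind])
        el.text = node[1]
        return el
    if kind == "sup":
        el = ET.Element("msup")
        el.append(to_mathml(node[1]))  # base
        el.append(to_mathml(node[2]))  # superscript
        return el
    if kind == "row":
        el = ET.Element("mrow")
        for child in node[1]:
            el.append(to_mathml(child))
        return el
    raise ValueError(f"unknown node kind: {kind!r}")


# AST for 0 < x^2 < 100, as a parser might produce it.
ast = ("row", [
    ("num", "0"), ("op", "<"),
    ("sup", ("ident", "x"), ("num", "2")),
    ("op", "<"), ("num", "100"),
])
mathml = ET.tostring(to_mathml(ast), encoding="unicode")
print(mathml)
```

Here the decision about what a script attaches to is already made in the AST, so the tree-shape questions raised earlier in the thread would have to be settled in the parser rather than in this serialization step.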
Would people like this to be the focus of the MathML Full meeting this week?
Neil
|
Move this to mathml-docs. |
TLDR: Can we create a list of LaTeX commands that generate all elements described by the core spec?
The goal of the Wikimedia community group math is to improve the display of mathematical expressions in Wikipedia. Indeed, using browser-based MathML rendering to deliver high-quality formulae is desirable. The new MathML core specification seems promising as it appears to be detailed enough to implement and evaluate MathML rendering engines based on the spec. Therefore, there are good reasons to be optimistic. Once the spec is final and the rendering engines have been implemented, reasonable MathML markup will lead to appealing rendering results that the community will appreciate.
However, the de facto standard in 2022 for authoring and rendering mathematical formulae is the TeX family of formats. Therefore, I suggest a deeper investigation of the conversion process from TeX-like input formats to MathML. We need conversion tools that generate the intended MathML 4 output from TeX-like input as a prerequisite for our new MathML 4 standard to become a success story. In 2018, we evaluated several TeX2MathML conversion tools, including those listed on our tool page. At that point, we created a manual gold standard dataset for Presentation and Content MathML. However, the gold standard dataset's quality might not be optimal, as it was influenced by LaTeXML. In particular, we used LaTeXML to generate the initial version of the MathML output and fixed only the problems we happened to spot in that output.
Therefore, I suggest creating a non-normative document describing how to convert TeX expressions to the corresponding MathML core expression. While this task is open-ended, I recommend stopping after all elements described in the MathML core spec have at least one corresponding LaTeX input.
Once that is completed, and if we still have enthusiasm, we could extend the exercise from core to intent. Here one could stop, for example, after having touched all symbols with the planned custom style tag annotations and their corresponding Content MathML representations.
Disclaimer: I am currently considering implementing a texvc to MathML converter in PHP. For a TDD workflow, it would therefore be good to be able to generate meaningful test cases.
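To illustrate the stopping criterion proposed above (every element in the core spec has at least one corresponding LaTeX input), here is a rough Python sketch of a coverage check over a set of test cases. The element set below is only a subset I am assuming for illustration and should be checked against the spec; the TeX/MathML pairs and the names CORE_ELEMENTS, CASES and covered are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed subset of MathML Core element names; verify against the spec.
CORE_ELEMENTS = {
    "mi", "mn", "mo", "mtext", "mrow", "mfrac", "msqrt", "mroot",
    "msub", "msup", "msubsup", "munder", "mover", "munderover",
    "mtable", "mtr", "mtd",
}

# Hypothetical test cases: TeX input -> MathML a converter is expected to emit.
CASES = {
    r"x^2": "<msup><mi>x</mi><mn>2</mn></msup>",
    r"\frac{1}{2}": "<mfrac><mn>1</mn><mn>2</mn></mfrac>",
    r"\sqrt{x}": "<msqrt><mi>x</mi></msqrt>",
}


def covered(cases):
    """Collect every element name that appears in any expected output."""
    tags = set()
    for mathml in cases.values():
        tags.update(el.tag for el in ET.fromstring(mathml).iter())
    return tags


# Elements that still need at least one TeX input.
missing = sorted(CORE_ELEMENTS - covered(CASES))
print("still uncovered:", missing)
```

The same table of pairs could double as fixtures in a TDD workflow: each entry is one assertion that the converter's output for the TeX key matches the expected MathML.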