Serialization to HTML & .docx? #197

photocyte · 2023-05-02T16:38:40Z

Mentioned during office hours on 2023-05-02
Related to, but ultimately a simpler lift / subset of: #195
Related to: #158

I'd like to get feedback from folks on whether this proposal is a worthwhile goal.

Introduction

In short: I found a way to convert HTML with specific markup to .docx with comments overlaying particular words (using Pandoc).

See the attached HTML file for an example source HTML. This is the appropriate pandoc command:
pandoc example_original.html --reference-doc='example_ref.docx' -t docx -o example.docx

(Ignore --reference-doc='example_ref.docx'; that is just to apply fonts / paper sizes from an existing template, e.g. 8.5"x11" vs A4 paper size.)
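The attached example file is not reproduced in this thread. As a rough sketch of the kind of markup involved: Pandoc's docx writer recognizes spans with the classes `comment-start` and `comment-end`, so a word can be wrapped as below. The helper name `wrap_with_comment`, the default author, and the example IRI are all made up for illustration, and the real markup in example_original.html may differ.

```python
from html import escape


def wrap_with_comment(text, comment, comment_id, author="LabOP"):
    """Wrap `text` in a pair of comment-start / comment-end spans.

    Assumption: the start span carries the comment body plus `id` and
    `author` attributes, and the end span repeats the same `id`.
    """
    cid = escape(str(comment_id))
    start = (
        f'<span class="comment-start" id="{cid}" '
        f'author="{escape(author)}">{escape(comment)}</span>'
    )
    end = f'<span class="comment-end" id="{cid}"></span>'
    return f"{start}{escape(text)}{end}"


# Hypothetical example: annotate the word "PBS" with an ontology IRI.
print(wrap_with_comment("PBS", "http://example.org/PBS", 0))
```

The start and end spans share one id on purpose, since that is how the two halves of a comment range are paired.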

So, the idea is that LabOP could “serialize” to an HTML+RDFa format, from there into an HTML+docx-convertible-comments format (as exemplified by example_original.html), and finally be converted to .docx, where the comments carry the context needed for back-translation from the .docx into LabOP RDF. (This potential for back-translation is why I am calling it a serialization, rather than a specialization.)
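To give a feel for the back-translation side, here is a minimal sketch of pulling the comments back out of a .docx using only the Python standard library. It relies on a .docx being a zip archive whose comments live in `word/comments.xml`. The function name `read_docx_comments` is hypothetical, and what the comment text should contain (an IRI, a chunk of RDF, ...) is left open.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements in word/comments.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_comments(docx_file):
    """Return {comment id: comment text} for a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        if "word/comments.xml" not in archive.namelist():
            return {}  # the document has no comments part
        root = ET.fromstring(archive.read("word/comments.xml"))
    comments = {}
    for node in root.iter(f"{{{W_NS}}}comment"):
        comment_id = node.get(f"{{{W_NS}}}id")
        # Concatenate every text run inside the comment.
        text = "".join(t.text or "" for t in node.iter(f"{{{W_NS}}}t"))
        comments[comment_id] = text
    return comments
```

This only recovers the comment bodies. Mapping each comment back to the words it overlays would also require reading the comment range markers in `word/document.xml`, which is not shown here.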

example_original.html.zip

Ordered lists can be conserved between HTML and .docx

A key point is that Pandoc will convert an "ordered list" in HTML into an analogous ordered list object in .docx. This brings some user conveniences, like formatting, and easily adding or removing steps with automatic re-numbering of the ordered list.

In contrast, Markdown has some competing interpretations of how to handle ordered lists. It does not handle "lettered" ordered lists (i.e. a,b,c...) by default. See:
https://pandoc.org/MANUAL.html#extension-fancy_lists

If you try to make multi-level ordered lists in Markdown, you will see that the major Markdown renderers interpret them in mutually incompatible ways.
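For comparison, a multi-level ordered list is unambiguous in HTML. Below is a small sketch that renders nested steps as `<ol>` elements, numbered at the top level and lettered below it via the standard `type` attribute. The function name `render_steps` and the sample steps are made up for illustration, and I am assuming Pandoc carries the `type` attribute through to the .docx list style.

```python
from html import escape


def render_steps(steps, level=0):
    """Render nested steps as HTML ordered lists.

    `steps` is a list of (text, substeps) pairs. The top level is
    numbered (1, 2, 3...) and deeper levels are lettered (a, b, c...).
    """
    list_type = "1" if level == 0 else "a"
    items = []
    for text, substeps in steps:
        inner = render_steps(substeps, level + 1) if substeps else ""
        items.append(f"<li>{escape(text)}{inner}</li>")
    return f'<ol type="{list_type}">' + "".join(items) + "</ol>"


# Hypothetical protocol steps.
protocol = [("Thaw cells", [("On ice", [])]), ("Spin down", [])]
print(render_steps(protocol))
```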

HTML is a nice format in and of itself to view protocols

In example_original.html, I show the use of the <details> and <summary> tags, to make little dropdown menus for unfolding the text. It's a nice feature that is "free" in HTML5. No dynamic javascripty stuff required.

One can also put in meaningless but HTML5-supported checkboxes, to make a simple checklist for a user.
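A minimal sketch of what such a step could look like when generated, combining a checkbox with a `<details>` / `<summary>` dropdown. The function name `render_checklist_step` and the sample text are hypothetical, and the layout in example_original.html may well be different.

```python
from html import escape


def render_checklist_step(title, body):
    """Render one step as a checkbox label plus a collapsible details block."""
    return (
        "<div>"
        f'<label><input type="checkbox"> {escape(title)}</label>'
        "<details>"
        "<summary>Details</summary>"
        f"<p>{escape(body)}</p>"
        "</details>"
        "</div>"
    )


# Hypothetical step text.
print(render_checklist_step("Thaw cells", "Keep the tube on ice for 10 min."))
```

The checkbox sits outside the `<summary>` so that ticking it does not also fold or unfold the step.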

Both HTML and .docx can embed files

Another key point is that there are ways, in both HTML and .docx, to embed files:

In HTML, you can use <a> tags where the href target is a base64-encoded file in the data URI scheme. (shown in the image above)
In .docx, you can use the OLEObject to embed arbitrary files (I don't know exactly how this works; this knowledge is more just from poking around in the .docx XML). Notably, such embedded files are also supported / downloadable / viewable when the .docx is viewed in Google Docs.
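The HTML half of this can be sketched in a few lines: base64-encode the file and place it in the href as a data URI. The function name `embed_file_link`, the file name, and the MIME type below are placeholders. The .docx half (OLEObject) is not sketched, since I don't know its details.

```python
import base64
from html import escape


def embed_file_link(filename, data, mime_type="application/octet-stream"):
    """Return an <a> tag whose href is a base64 data URI holding `data`."""
    payload = base64.b64encode(data).decode("ascii")
    href = f"data:{mime_type};base64,{payload}"
    # The `download` attribute suggests a file name when the link is clicked.
    return f'<a href="{href}" download="{escape(filename)}">{escape(filename)}</a>'


# Hypothetical example: embed a tiny Turtle file.
print(embed_file_link("protocol.ttl", b"hello", "text/turtle"))
```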

The idea is: a LabOP protocol serialized into .docx could package the necessary context (the ontologies used, the original RDF).

I don't believe Pandoc would convert a data URI embedded in HTML to an embedded file in .docx. So presumably, carrying a file that is embedded via data URIs in the HTML+RDFa or the HTML+docx-convertible-comments over into the .docx would need some custom .docx XML editing.

Microsoft Word has user conveniences for preserving & duplicating the comments

A last key point is that when copy-pasting text in Microsoft Word, the comments follow the text. This presumably makes for an easy way for a totally oblivious user to (*minorly) edit a LabOP protocol serialized to .docx, and still have it back-translate into LabOP RDF.

*=Making the user's editing of the .docx, and its back-translation into the LabOP RDF representation, robust is a much larger question, and is not being proposed here. The "MVP" solution is simply to have the back-translator throw an error if the user edits the LabOP .docx to the point where it cannot be trivially interpreted back into LabOP RDF.
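As a sketch of that "MVP" behavior, under the assumption that the serializer records which comment ids it wrote and the back-translator can collect the comments it finds into a dict: the check below simply refuses to continue if any expected comment is gone or blank. The names `BackTranslationError` and `check_back_translatable` are hypothetical, and a real check would likely need to look at more than this.

```python
class BackTranslationError(ValueError):
    """Raised when an edited document can no longer be mapped back to RDF."""


def check_back_translatable(expected_ids, found_comments):
    """Fail loudly if any expected comment is missing or empty.

    `expected_ids` are the comment ids written at serialization time;
    `found_comments` maps comment id -> comment text read from the .docx.
    """
    missing = sorted(set(expected_ids) - set(found_comments))
    empty = sorted(
        cid for cid in set(expected_ids)
        if cid in found_comments and not found_comments[cid].strip()
    )
    if missing or empty:
        raise BackTranslationError(
            f"missing comments: {missing}; empty comments: {empty}"
        )
    return True
```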

Markdown is a lift for users, both to view, and edit

I think .docx is preferable to Markdown, as someone can add their own formatting (bolding, changing fonts and font sizes, printing it out exactly as they want) without needing to understand Markdown at all. (Understanding Markdown is a lift for most laboratory users.)

Markdown is meant for authoring for the web, whereas .docx is meant for authoring for paper. Unfortunately, in the laboratory science world, where paper lab notebooks & post-it notes of protocols still reign supreme, I believe LabOP needs to treat the paper workflow with first-class support.


photocyte · 2023-05-02T16:48:20Z

This was mentioned in the 2023-05-02 office hours. My notes:

"Tammi "IntentParser" for SD2 in Google Docs. In brief, this was an add-on for Google Docs that would look up potentially relevant ontology terms from human-written, non-ontology-linked text. It also stashed the ontology / machine-readable representation in .docx-style comments."

danbryce · 2023-05-03T14:02:07Z

Here is a link to the article describing the Intent Parser: https://pubs.acs.org/doi/abs/10.1021/acssynbio.1c00285

