diff --git a/README.md b/README.md index 2d3cb4b..411620d 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # The Wordinator -Version 0.8.0 +Version 0.9.1 Generate high-quality Microsoft Word DOCX files using a simplified XML format (simple word processing XML). @@ -26,12 +26,24 @@ If you need to go from Word documents back to XML, you may find the DITA for Pub ## Release Notes +* 0.9.1 + + * Out-of-the-box DITA HTML5 transform. + * Handle unnamespaced HTML5 + * Added some useful documentation + * Added command-line help + +* 0.9.0 + +Working for XHTML input. DOCX pretty complete + * 0.8.0 Use final version of POI 4.0.0 * 0.7.0 -** Improved performance by only reading template doc once + + * Improved performance by only reading template doc once ## Word feature support @@ -48,7 +60,171 @@ The Wordinator supports generation of documents with the following Word features ## Getting Started -TBD +The Wordinator is packaged as a runnable Java JAR file. It also requires an XSLT transform and a Word DOTX template in addition to your input file. + +To try it you can use the basic XHTML- or HTML5-to-DOCX transform that is included in the Wordinator materials. For production use you will need to create your own transform that expresses the details of mapping from your XML or HTML to your styles. This can be pretty easy to implement though--you shouldn't normally need any significant XSLT knowledge. + +### Installation + +Unzip the release package into a convenient location. The release includes the Wordinator JAR file and base XSLT transforms, along with a generic Word template (as a convenience). + +You need to be able to run the `java` command using Java 8 or newer. + +If you have Ant installed you can run the Wordinator using the `build.xml` script in the root of the distribution package (`src/main/ant/build.xml` in the project source). 
+ +### Running the Wordinator With Ant + +The `build.xml` file in the distribution provides two targets: `html2docx` and `ditahtml2docx`. The default target is `ditahtml2docx`. + +If you just run the `ant` command from the Wordinator distribution directory it will run the `ditahtml2docx` target against the sample HTML file included in the distribution: + +``` +c:\projects\wordinator> ant +Buildfile: /Users/ekimber/workspace/wordinator/dist/wordinator/build.xml + +init: + +ditahtml2docx: + [java] + 2019-03-07 22:14:54,322 [INFO ] Input document or directory='/Users/ekimber/workspace/wordinator/dist/wordinator/html/sample_web_page.html' + [java] + 2019-03-07 22:14:54,324 [INFO ] Output directory ='/Users/ekimber/workspace/wordinator/dist/wordinator/out' + [java] + 2019-03-07 22:14:54,324 [INFO ] DOTX template ='/Users/ekimber/workspace/wordinator/dist/wordinator/docx/Test_Template.dotx' + [java] + 2019-03-07 22:14:54,324 [INFO ] XSLT template ='/Users/ekimber/workspace/wordinator/dist/wordinator/xsl/ditahtml2docx/ditahtml2docx.xsl' + [java] + 2019-03-07 22:14:54,325 [INFO ] Chunk level ='root' +... + [java] + 2019-03-07 22:14:55,759 [INFO ] Generating DOCX file "/Users/ekimber/workspace/wordinator/dist/wordinator/out/sample_web_page.docx" + [java] + 2019-03-07 22:14:56,249 [INFO ] Transform applied. + +BUILD SUCCESSFUL +Total time: 4 seconds +``` + +Edit the `build.xml` file to see the properties you can set to specify your own values for the command-line parameters. + +You can create a file named `build.properties` in the same directory as the `build.xml` file to set properties statically or you can specify them using `-D` parameters to the `ant` command: + +``` +c:\projects\wordinator> ant -Dditahtml2docx.dotx=myTemplate.dotx +``` + +### Running the Wordinator From OxygenXML + +You can set up an Oxygen Ant transformation scenario and apply it against HTML files to generate DOCX files from them. + +To set up a transformation scenario follow these steps: + +1. 
Open an HTML file in OxygenXML +1. Open the Configure Transformation Scenarios dialog +1. Select "New" and then "Ant transformation" +1. Give the scenario a meaningful title, e.g. "DITA HTML to DOCX" +1. In the "Build file" field put the path and name of the `build.xml` file. Take the defaults for the other fields in this tab. +1. Switch to the "Parameters" tab and add the following parameters: + * input.html: `${cfd}/${cfne}` + * output.dir: `${cfd}/out` + * ditahtml.dotx: Path to your DOTX file + * ditahtml.xsl: Path to your XSLT (if you have one, otherwise omit) +1. Switch to the "Output" tab and set the Output field to `${cfd}/out/${cfn}.docx`. Make sure that "Open in system application" is selected. + +You can omit any of the parameters that you have set using a `build.properties` file. + +You should now be able to run the scenario against any HTML file and have the resulting DOCX file open in Microsoft Word. + +### Running the Wordinator From The Command Line + +1. Open a command window and navigate to the directory you unzipped the Wordinator package into: + +``` +cd c:\projects\wordinator +``` + +2. Run this command: +``` +java -jar wordinator.jar -i html/sample_web_page.html -o out -x xsl/html2docx/html2docx.xsl -t docx/Test_Template.dotx +``` + +You should see a lot of messages, ending with this: +``` ++ 2019-03-07 16:58:33,873 [INFO ] Generating DOCX file "/Users/ekimber/workspace/wordinator/dist/wordinator/out/sample_web_page.docx" ++ 2019-03-07 16:58:34,406 [INFO ] Transform applied. +``` + +3. Open the file `out\sample_web_page.docx` in Microsoft Word + +It's not a very pretty test but it demonstrates that the tool is working. + +### Wordinator Commandline Options + +* -i The input XML file or directory +* -o The output directory +* -t The DOTX Word template +* -x The XSLT transform to apply to the input file to generate SWPX files. + +If the `-i` parameter is a directory then it looks for `*.swpx` files and generates a DOCX file for each one. 
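The directory behavior described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the Wordinator's actual implementation: the class name `SwpxInputs` and the methods `collectInputs` and `docxFor` are hypothetical, and the output naming (same base name with a `.docx` extension) is inferred from the sample log output.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the Wordinator code base.
public class SwpxInputs {

    /**
     * Returns the files to process for a given -i value:
     * every *.swpx file when the input is a directory (sorted by path),
     * otherwise just the input file itself.
     */
    public static List<Path> collectInputs(Path input) throws IOException {
        List<Path> inputs = new ArrayList<>();
        if (Files.isDirectory(input)) {
            // The glob is matched against file names in the directory only (no recursion).
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(input, "*.swpx")) {
                for (Path p : stream) {
                    inputs.add(p);
                }
            }
            Collections.sort(inputs);
        } else {
            inputs.add(input);
        }
        return inputs;
    }

    /**
     * Derives the DOCX path for an input file: same base name, ".docx" extension,
     * placed in the output directory (assumed naming convention).
     */
    public static Path docxFor(Path input, Path outDir) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return outDir.resolve(base + ".docx");
    }
}
```

With a directory containing `a.swpx`, `b.swpx`, and `notes.txt`, `collectInputs` would return only the two `.swpx` paths, and `docxFor` would map each one to a file of the same base name in the `-o` directory.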
+ +### Adapting Wordinator To Your Needs + +The base HTML-to-DOCX transform is very basic and is not intended to be used as is. + +To create good results for your content you will need the following: + +* A Word template (DOTX) that defines the named styles you need to achieve your in-Word styling requirements. For many documents the built-in Word styles will suffice. You may also have existing templates that you need to map to. The important thing for the mapping to Word is the style names: the mapping from your input XML to Word is in terms of named paragraph, character, and table styles. +* A custom XSLT style sheet that implements the mapping from your input XML to Simple Word Processing XML that is then the input to the DOCX generation phase. At a minimum you need to provide the mapping from element type names and @class values to paragraph and character style names. This can be done with a relatively simple XSLT module that overrides the base HTML-to-DOCX transform. +* The XML from which you will generate the Word documents. This can be any XML but the Wordinator-provided transforms are set up for XHTML and HTML5, so if you are either authoring in HTML5 or you can generate XHTML or HTML5 from your XML then the transform is relatively simple. For example, the provided ditahtml2docx transform handles the HTML5 produced by the DITA Open Toolkit. + +### Java Integration + +The release package uses a jar that contains all the dependency jars required by the Wordinator. + +However, if you want to include the Wordinator in a larger application where the dependencies should be managed as separate JAR files, you can build the JAR from the project source. + +The Wordinator project is a Maven project. + +### SimpleWP XML (SWPX) + +The Simple Word Processing XML format is the direct input to the DOCX generation phase of the Wordinator. + +It is essentially a simplification of Word's internal XML format. 
+ +The SWPX format is defined in the simplewpml.rng file in the `doctypes/simplewpml` directory. The RNG file includes documentation on the SWPX elements and attributes and how to use them. + +The XSLT file `xsl/html2docx/baseProcessing.xsl` does most of the work of generating SWPX from HTML and it also serves to demonstrate how to generate SWPX if you want to implement direct generation from some other XML format. + +### Customizing the HTML-to-SWPX Transforms + +The module `xsl/html2docx/get-style-name.xsl` implements the default mapping from HTML elements to style names. It uses a variable that is a map from @class attribute values to Word style names: + +``` + + + + + +``` + +Each `` element maps a @class name (`key="'p1'"`) to a style name (`select="'Paragraph 1'"`). + +You can override this variable in a custom XSLT to add your own mapping. + +Note that the values of the @key and @select attributes are XSLT string literals: `'p1'` and `'Paragraph 1'`. Note the straight single quotes (`'`) around the strings. If you forget those your results will be strange. + +The map variable is used like so: + +``` + + + + + + + + + +``` + +Here, the @class attribute of the element that matches the template is tokenized on blank spaces and then the first value is used to look up an entry in the `$classToStyleNameMap` variable. + +*TBD: More guidance on customizing the mapping. 
Would also be easy to implement using a JSON file to define the mapping as a separate configuration file.* ## Managing Word Styles @@ -98,7 +274,7 @@ Maven dependency: org.wordinator wordinator - 0.8.0 + 0.9.1 ``` diff --git a/build.xml b/build.xml new file mode 100644 index 0000000..ef06bd9 --- /dev/null +++ b/build.xml @@ -0,0 +1,62 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/pom.xml b/pom.xml index 1526656..7894f24 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ org.wordinator wordinator jar - 0.9.0 + 0.9.1 wordinator.org http://wordinator.org diff --git a/src/main/ant/build.xml b/src/main/ant/build.xml new file mode 100644 index 0000000..47109a5 --- /dev/null +++ b/src/main/ant/build.xml @@ -0,0 +1,58 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/src/main/doctypes/simplewpml/test/simplewpml-test-01.xml b/src/main/doctypes/simplewpml/test/simplewpml-test-01.xml index 528e8dd..954291a 100644 --- a/src/main/doctypes/simplewpml/test/simplewpml-test-01.xml +++ b/src/main/doctypes/simplewpml/test/simplewpml-test-01.xml @@ -1,5 +1,6 @@ +

diff --git a/src/main/java/org/wordinator/xml2docx/MakeDocx.java b/src/main/java/org/wordinator/xml2docx/MakeDocx.java index dc859b8..3f00743 100644 --- a/src/main/java/org/wordinator/xml2docx/MakeDocx.java +++ b/src/main/java/org/wordinator/xml2docx/MakeDocx.java @@ -15,6 +15,7 @@ import org.apache.commons.cli.CommandLine; import org.apache.commons.cli.CommandLineParser; import org.apache.commons.cli.DefaultParser; +import org.apache.commons.cli.HelpFormatter; import org.apache.commons.cli.Option; import org.apache.commons.cli.Options; import org.apache.commons.cli.ParseException; @@ -56,10 +57,25 @@ public class MakeDocx public static void main( String[] args ) throws ParseException { - Options options = buildOptions(); - - handleCommandLine(options, args, log); + boolean GOOD_OPTIONS = false; + Options options = null; + try { + options = buildOptions(); + GOOD_OPTIONS = true; + } catch (Exception e) { + // + } + try { + handleCommandLine(options, args, log); + } catch (ParseException e) { + GOOD_OPTIONS = false; + } + + if (!GOOD_OPTIONS) { + HelpFormatter formatter = new HelpFormatter(); + formatter.printHelp( "wordinator", options, true ); + } } /** diff --git a/src/main/xsl/ditahtml2docx/ditahtml2docx.xsl b/src/main/xsl/ditahtml2docx/ditahtml2docx.xsl new file mode 100644 index 0000000..8ab856c --- /dev/null +++ b/src/main/xsl/ditahtml2docx/ditahtml2docx.xsl @@ -0,0 +1,78 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + [INFO] inputdoc={document-uri(.)} + + [INFO] outdir={$outdir} + + + + [DEBUG] Applying templates to root element: {namespace-uri(/*)}:{name(/*)} + + + + + + + + \ No newline at end of file diff --git a/src/main/xsl/ditahtml2docx/get-style-name.xsl b/src/main/xsl/ditahtml2docx/get-style-name.xsl new file mode 100644 index 0000000..1d601b5 --- /dev/null +++ b/src/main/xsl/ditahtml2docx/get-style-name.xsl @@ -0,0 +1,133 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + [DEBUG] get-style-name: 
section/header. Returning "{$result}". + + + + + + + + + + + + + [DEBUG] get-style-name: {name(.)}. Returning "{$result}". + + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/src/main/xsl/html2docx/baseProcessing.xsl b/src/main/xsl/html2docx/baseProcessing.xsl index 6ab2bce..113d4a7 100644 --- a/src/main/xsl/html2docx/baseProcessing.xsl +++ b/src/main/xsl/html2docx/baseProcessing.xsl @@ -17,7 +17,7 @@ - + @@ -31,14 +31,16 @@ - - + @@ -85,20 +87,20 @@ - + + [DEBUG] make-body-content: Handling {name(..)}/{name(.)} - + - + @@ -110,7 +112,7 @@ - + @@ -123,7 +125,7 @@ - + @@ -132,12 +134,12 @@ - + - + @@ -149,12 +151,12 @@ - + - + @@ -167,12 +169,12 @@ - + - + @@ -181,7 +183,7 @@ - + @@ -195,7 +197,7 @@ - + @@ -210,7 +212,7 @@ ================================== --> - + @@ -233,7 +235,13 @@ xhtml:h3/text() | xhtml:h4/text() | xhtml:h5/text() | - xhtml:h6/text() + xhtml:h6/text() | + h1/text() | + h2/text() | + h3/text() | + h4/text() | + h5/text() | + h6/text() "> @@ -253,7 +261,7 @@ Figures and images ========================== --> - + @@ -261,7 +269,7 @@ - + @@ -286,7 +294,7 @@ - + @@ -296,7 +304,7 @@ - + @@ -326,7 +334,7 @@ - + @@ -476,10 +484,10 @@ ========================== --> + xhtml:p | p | + xhtml:dt | dt | + xhtml:dd[empty(xhtml:p)] | dd[empty(p)] | + xhtml:pre | pre"> @@ -499,7 +507,7 @@ @@ -528,8 +536,20 @@ xhtml:i/text() | xhtml:b/text() | xhtml:u/text() | - xhtml:tt/text() - "> + xhtml:tt/text() | + span/text() | + dfn/text() | + a//text() | + pre//text() | + li//text() | + dt//text() | + dd//text() | + code/text() | + i/text() | + b/text() | + u/text() | + tt/text() + "> @@ -590,7 +610,7 @@ Lists ========================== --> - + @@ -598,12 +618,12 @@ - + - + @@ -614,14 +634,14 @@ FIXME: Provide a more general way of handing a mix of text nodes and block nodes. 
--> - + - + - + @@ -631,7 +651,7 @@ Inline things ========================== --> - + + [DEBUG] xhtml:br: {name(../..)}/{name(..)}/{name(.)} @@ -640,7 +660,7 @@ - + @@ -653,7 +673,7 @@ Fallback templates ========================== --> - + @@ -737,7 +757,7 @@ modes ============================== --> - + @@ -848,7 +868,9 @@ - [WARN] wp:p found in wp:p: - + diff --git a/src/main/xsl/html2docx/set-format-attributes.xsl b/src/main/xsl/html2docx/set-format-attributes.xsl index 68a4709..e72fcdb 100644 --- a/src/main/xsl/html2docx/set-format-attributes.xsl +++ b/src/main/xsl/html2docx/set-format-attributes.xsl @@ -65,8 +65,8 @@ @@ -76,8 +76,8 @@ @@ -91,7 +91,7 @@ @@ -103,7 +103,7 @@ - + @@ -137,7 +137,7 @@ - + diff --git a/version.properties b/version.properties new file mode 100644 index 0000000..d12cae7 --- /dev/null +++ b/version.properties @@ -0,0 +1 @@ +version=0.9.1 \ No newline at end of file