Merged develop for release 0.9.1
contrext committed Mar 8, 2019
1 parent 93d620d commit 0419a1c
Showing 12 changed files with 610 additions and 63 deletions.
184 changes: 180 additions & 4 deletions README.md
@@ -1,6 +1,6 @@
# The Wordinator

Version 0.8.0
Version 0.9.1

Generate high-quality Microsoft Word DOCX files using a simplified XML format (simple word processing XML).

@@ -26,12 +26,24 @@ If you need to go from Word documents back to XML, you may find the DITA for Pub

## Release Notes

* 0.9.1

* Out-of-the-box DITA HTML5 transform.
* Handle unnamespaced HTML5
* Added some useful documentation
* Added command-line help

* 0.9.0

Working for XHTML input. DOCX pretty complete

* 0.8.0

Use final version of POI 4.0.0

* 0.7.0
** Improved performance by only reading template doc once

* Improved performance by only reading template doc once

## Word feature support

@@ -48,7 +60,171 @@ The Wordinator supports generation of documents with the following Word features

## Getting Started

TBD
The Wordinator is packaged as a runnable Java JAR file. It also requires an XSLT transform and a Word DOTX template in addition to your input file.

To try it you can use the basic XHTML- or HTML5-to-DOCX transform that is included in the Wordinator materials. For production use you will need to create your own transform that expresses the details of mapping from your XML or HTML to your styles. This can be pretty easy to implement though--you shouldn't normally need any significant XSLT knowledge.

### Installation

Unzip the release package into a convenient location. The release includes the Wordinator JAR file and base XSLT transforms, along with a generic Word template (as a convenience).

You need to be able to run the `java` command using Java 8 or newer.

If you have Ant installed you can run the Wordinator using the `build.xml` script in the root of the distribution package (`src/main/ant/build.xml` in the project source).

### Running the Wordinator With Ant

The `build.xml` file in the distribution provides two targets: `html2docx` and `ditahtml2docx`. The default target is `ditahtml2docx`.

If you just run the `ant` command from the Wordinator distribution directory it will run the `ditahtml2docx` target against the sample HTML file included in the distribution:

```
c:\projects\wordinator> ant
Buildfile: /Users/ekimber/workspace/wordinator/dist/wordinator/build.xml
init:
ditahtml2docx:
[java] + 2019-03-07 22:14:54,322 [INFO ] Input document or directory='/Users/ekimber/workspace/wordinator/dist/wordinator/html/sample_web_page.html'
[java] + 2019-03-07 22:14:54,324 [INFO ] Output directory ='/Users/ekimber/workspace/wordinator/dist/wordinator/out'
[java] + 2019-03-07 22:14:54,324 [INFO ] DOTX template ='/Users/ekimber/workspace/wordinator/dist/wordinator/docx/Test_Template.dotx'
[java] + 2019-03-07 22:14:54,324 [INFO ] XSLT template ='/Users/ekimber/workspace/wordinator/dist/wordinator/xsl/ditahtml2docx/ditahtml2docx.xsl'
[java] + 2019-03-07 22:14:54,325 [INFO ] Chunk level ='root'
...
[java] + 2019-03-07 22:14:55,759 [INFO ] Generating DOCX file "/Users/ekimber/workspace/wordinator/dist/wordinator/out/sample_web_page.docx"
[java] + 2019-03-07 22:14:56,249 [INFO ] Transform applied.
BUILD SUCCESSFUL
Total time: 4 seconds
```

Edit the `build.xml` file to see the properties you can set to specify your own values for the command-line parameters.

You can create a file named `build.properties` in the same directory as the `build.xml` file to set properties statically or you can specify them using `-D` parameters to the `ant` command:

```
c:\projects\wordinator> ant -Dditahtml2docx.dotx=myTemplate.dotx
```

### Running the Wordinator From OxygenXML

You can set up an Oxygen Ant transformation scenario and apply it against HTML files to generate DOCX files from them.

To set up a transformation scenario follow these steps:

1. Open an HTML file in OxygenXML
1. Open the Configure Transformation Scenarios dialog
1. Select "New" and then "Ant transformation"
1. Give the scenario a meaningful title, e.g. "DITA HTML to DOCX"
1. In the "Build file" field put the path and name of the `build.xml` file. Take the defaults for the other fields in this tab.
1. Switch to the "Parameters" tab and add the following parameters:
    * input.html: `${cfd}/${cfne}`
    * output.dir: `${cfd}/out`
    * ditahtml2docx.dotx: Path to your DOTX file
    * ditahtml2docx.xsl: Path to your XSLT (if you have one, otherwise omit)
1. Switch to the "Output" tab and set the Output field to `${cfd}/out/${cfn}.docx`. Make sure that "Open in system application" is selected.

You can omit any of the parameters that you have set using a `build.properties` file.

You should now be able to run the scenario against any HTML file and have the resulting DOCX file open in Microsoft Word.

### Running the Wordinator From The Command Line

1. Open a command window and navigate to the directory you unzipped the Wordinator package into:

```
cd c:\projects\wordinator
```

2. Run this command:
```
java -jar wordinator.jar -i html/sample_web_page.html -o out -x xsl/html2docx/html2docx.xsl -t docx/Test_Template.dotx
```

You should see a lot of messages, ending with this:
```
+ 2019-03-07 16:58:33,873 [INFO ] Generating DOCX file "/Users/ekimber/workspace/wordinator/dist/wordinator/out/sample_web_page.docx"
+ 2019-03-07 16:58:34,406 [INFO ] Transform applied.
```

3. Open the file `out\sample_web_page.docx` in Microsoft Word

It's not a very pretty test but it demonstrates that the tool is working.

### Wordinator Command-Line Options

* -i The input XML file or directory
* -o The output directory
* -t The DOTX Word template
* -x The XSLT transform to apply to the input file to generate SWPX files.

If the `-i` parameter is a directory then the Wordinator looks for `*.swpx` files in it and generates a DOCX file for each one.

### Adapting Wordinator To Your Needs

The base HTML-to-DOCX transform is very basic and is not intended to be used as is.

To create good results for your content you will need the following:

* A Word template (DOTX) that defines the named styles you need to achieve your in-Word styling requirements. For many documents the built-in Word styles will suffice. You may also have existing templates that you need to map to. The important thing for the mapping to Word is the style names: the mapping from your input XML to Word is in terms of named paragraph, character, and table styles.
* A custom XSLT style sheet that implements the mapping from your input XML to Simple Word Processing XML, which is then the input to the DOCX generation phase. At a minimum you need to provide the mapping from element type names and @class values to paragraph and character style names. This can be done with a relatively simple XSLT module that overrides the base HTML-to-DOCX transform.
* The XML from which you will generate the Word documents. This can be any XML but the Wordinator-provided transforms are set up for XHTML and HTML5, so if you are either authoring in HTML5 or you can generate XHTML or HTML5 from your XML then the transform is relatively simple. For example, the provided ditahtml2docx transform handles the HTML5 produced by the DITA Open Toolkit.

### Java Integration

The release package uses a jar that contains all the dependency jars required by the Wordinator.

However, if you want to include the Wordinator in a larger application where the dependencies should be managed as separate JAR files, you can build the JAR from the project source.

The Wordinator project is a Maven project.
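
If you only need to trigger a conversion from another Java program, one option that avoids depending on any internal classes is to launch the packaged JAR as a subprocess. The following is a minimal sketch under that assumption: `WordinatorLauncher` and its methods are hypothetical names, not part of the Wordinator, and the only things it relies on are the `-i`, `-o`, `-x`, and `-t` options described above. Whether the exit code reliably signals a failed conversion is also an assumption you should verify.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Wordinator API. It only shells out to
// the packaged JAR using the documented -i, -o, -x and -t options.
class WordinatorLauncher {

    /** Builds the argument list for "java -jar wordinator.jar ...". */
    static List<String> buildCommand(Path jar, Path input, Path outputDir,
                                     Path xslt, Path dotx) {
        return Arrays.asList(
            "java", "-jar", jar.toString(),
            "-i", input.toString(),      // input file or directory
            "-o", outputDir.toString(),  // output directory
            "-x", xslt.toString(),       // XSLT transform
            "-t", dotx.toString());      // DOTX Word template
    }

    /** Starts the process, forwards its log output, and returns the exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
            .inheritIO()  // show the tool's log messages on this console
            .start();
        return process.waitFor();
    }
}
```

A caller would pass the result of `buildCommand(...)` to `run(...)`, using the same paths as in the command-line example (for instance `wordinator.jar`, `html/sample_web_page.html`, `out`, `xsl/html2docx/html2docx.xsl`, and `docx/Test_Template.dotx`).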

### SimpleWP XML (SWPX)

The Simple Word Processing XML format is the direct input to the DOCX generation phase of the Wordinator.

It is essentially a simplification of Word's internal XML format.

The SWPX format is defined in the simplewpml.rng file in the `doctypes/simplewpml` directory. The RNG file includes documentation on the SWPX elements and attributes and how to use them.

The XSLT file `xsl/html2docx/baseProcessing.xsl` does most of the work of generating SWPX from HTML and it also serves to demonstrate how to generate SWPX if you want to implement direct generation from some other XML format.

### Customizing the HTML-to-SWPX Transforms

The module `xsl/html2docx/get-style-name.xsl` implements the default mapping from HTML elements to style names. It uses a variable that is a map from @class attribute values to Word style names:

```
<xsl:variable name="classToStyleNameMap" as="map(xs:string, xs:string)">
<xsl:map>
<xsl:map-entry key="'p1'" select="'Paragraph 1'"/>
</xsl:map>
</xsl:variable>
```

Each `<xsl:map-entry>` element maps a @class name (`key="'p1'"`) to a style name (`select="'Paragraph 1'"`).

You can override this variable in a custom XSLT to add your own mapping.

Note that the values of the @key and @select attributes are XSLT string literals: `'p1'` and `'Paragraph 1'`. Note the straight single quotes (`'`) around the strings. If you forget those your results will be strange.

The map variable is used like so:

```
<xsl:template mode="get-style-name" match="xhtml:span[@class] | xhtml:p[@class]" as="xs:string?">
<xsl:param name="doDebug" as="xs:boolean" tunnel="yes" select="false()"/>
<xsl:variable name="tokens" as="xs:string*" select="tokenize(@class, ' ')"/>
<xsl:variable name="key" select="$tokens[1]"/>
<xsl:variable name="styleName" as="xs:string?"
select="map:get($classToStyleNameMap, $key)"
/>
<xsl:sequence select="if (exists($styleName)) then $styleName else ()"/>
</xsl:template>
```

Here, the @class attribute of the element that matches the template is tokenized on blank spaces and then the first value is used to look up an entry in the `$classToStyleNameMap` variable.
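
For readers more comfortable with Java than XSLT, the same lookup can be restated as a small sketch. This is only an illustration of the logic in the template above, not code from the Wordinator; the class and method names are hypothetical. It makes one consequence of the template easy to see: only the first token of the @class value is consulted, so `class="p1 note"` maps to a style but `class="note p1"` does not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative restatement of the XSLT lookup; names are hypothetical.
class StyleNameLookup {

    // Plays the role of $classToStyleNameMap.
    private final Map<String, String> classToStyleName = new HashMap<>();

    StyleNameLookup() {
        classToStyleName.put("p1", "Paragraph 1");
    }

    /** Returns the style mapped to the first space-separated token, if any. */
    Optional<String> styleNameFor(String classValue) {
        if (classValue == null) {
            return Optional.empty();
        }
        String[] tokens = classValue.split(" ");
        if (tokens.length == 0) {
            return Optional.empty();  // e.g. a value consisting only of spaces
        }
        // Only the first token is used, mirroring $tokens[1] in the template.
        return Optional.ofNullable(classToStyleName.get(tokens[0]));
    }
}
```

An empty result corresponds to the template returning the empty sequence, which leaves the choice of style to whatever other processing applies.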

*TBD: More guidance on customizing the mapping. Would also be easy to implement using a JSON file to define the mapping as a separate configuration file.*

## Managing Word Styles

@@ -98,7 +274,7 @@ Maven dependency:
<dependency>
<groupId>org.wordinator</groupId>
<artifactId>wordinator</artifactId>
<version>0.8.0</version>
<version>0.9.1</version>
</dependency>
```

62 changes: 62 additions & 0 deletions build.xml
@@ -0,0 +1,62 @@
<?xml version="1.0" encoding="UTF-8"?>
<project basedir="." name="wordinator" default="package-release">

<!-- This Ant script just manages creating the release package.
It's a shortcut in advance of creating a proper Maven
version of this.
It assumes you've already run mvn install to generate
the jar files.
-->

<property file="version.properties"/>
<property name="target.dir" value="${basedir}/target"/>
<property name="src.dir" value="${basedir}/src/main"/>
<property name="resources.dir" value="${basedir}/src/test/resources"/>
<property name="dist.dir" value="${basedir}/dist"/>
<property name="package.name" value="wordinator"/>
<property name="package.dir" value="${dist.dir}/${package.name}"/>

<target name="init">
<tstamp/>
<buildnumber/>
</target>

<target name="dist" depends="init" >
<delete dir="${dist.dir}" failonerror="false"/>
<mkdir dir="${package.dir}"/>
<copy todir="${package.dir}">
<fileset dir="${target.dir}">
<include name="wordinator-*-jar-with-dependencies.jar"/>
</fileset>
<regexpmapper from="^(.+)\-[\d]+\..+\.jar$" to="\1.jar"/>
</copy>
<copy todir="${package.dir}">
<fileset dir="${resources.dir}">
<include name="docx/**/*"/>
<include name="html/**/*"/>
<exclude name="**/out/**"/>
</fileset>
<fileset dir="${src.dir}">
<include name="xsl/**/*"/>
<include name="doctypes/**/*"/>
</fileset>
<fileset dir="${src.dir}/ant">
<include name="**/*"/>
</fileset>
<fileset dir="${basedir}">
<include name="README.md"/>
<include name="LICENSE"/>
</fileset>
</copy>
</target>

<target name="package-release" depends="dist" description="Create a release package">
<zip file="${dist.dir}/${package.name}_${version}.zip">
<fileset dir="${package.dir}">
<include name="**/*"/>
</fileset>
</zip>
</target>

</project>
2 changes: 1 addition & 1 deletion pom.xml
@@ -5,7 +5,7 @@
<groupId>org.wordinator</groupId>
<artifactId>wordinator</artifactId>
<packaging>jar</packaging>
<version>0.9.0</version>
<version>0.9.1</version>
<organization>
<name>wordinator.org</name>
<url>http://wordinator.org</url>
58 changes: 58 additions & 0 deletions src/main/ant/build.xml
@@ -0,0 +1,58 @@
<?xml version="1.0" encoding="UTF-8"?>
<project basedir="." name="run-wordinator" default="ditahtml2docx">
<!-- =========================================================
Ant script to run the Wordinator Java command-line
tool.
You can create a file named "build.properties" in the
same directory to set any of the properties used
to construct the command line or pass them
in using -D{propertyname}={value}:
ant -Dditahtml2docx.dotx=myTemplate.dotx
========================================================= -->

<property file="build.properties"/>

<property name="xsl.dir" value="${basedir}/xsl"/>
<property name="docx.dir" value="${basedir}/docx"/>
<property name="ditahtml2docx.xsl" value="${xsl.dir}/ditahtml2docx/ditahtml2docx.xsl"/>
<property name="html2docx.xsl" value="${xsl.dir}/html2docx/html2docx.xsl"/>
<property name="ditahtml2docx.dotx" value="${docx.dir}/Test_Template.dotx"/>
<property name="html2docx.dotx" value="${docx.dir}/Test_Template.dotx"/>
<property name="output.dir" value="${basedir}/out"/>
<property name="input.html" value="${basedir}/html/sample_web_page.html"/>
<property name="webinator.jar" value="${basedir}/wordinator.jar"/>

<target name="init">
<tstamp/>
<buildnumber/>
</target>

<target name="ditahtml2docx" depends="init"
description="Run Wordinator against XHTML or HTML5 generated by the DITA Open Toolkit">

<java jar="${webinator.jar}" failonerror="true" fork="true">
<arg line="-i ${input.html}"/>
<arg line="-o ${output.dir}"/>
<arg line="-x ${ditahtml2docx.xsl}"/>
<arg line="-t ${ditahtml2docx.dotx}"/>
</java>

</target>

<target name="html2docx" depends="init"
description="Run Wordinator against arbitrary XHTML or HTML5"
>

<java jar="${webinator.jar}" failonerror="true" fork="true">
<arg line="-i ${input.html}"/>
<arg line="-o ${output.dir}"/>
<arg line="-x ${html2docx.xsl}"/>
<arg line="-t ${html2docx.dotx}"/>
</java>

</target>

</project>
1 change: 1 addition & 0 deletions src/main/doctypes/simplewpml/test/simplewpml-test-01.xml
@@ -1,5 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="../simplewpml.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-stylesheet type="text/css" href="../css/simplewpml.css" title="Simple WP View" alternate="no"?>
<document xmlns="urn:ns:wordinator:simplewpml">
<body>
<p style="Head1">
22 changes: 19 additions & 3 deletions src/main/java/org/wordinator/xml2docx/MakeDocx.java
@@ -15,6 +15,7 @@
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
@@ -56,10 +57,25 @@ public class MakeDocx

public static void main( String[] args ) throws ParseException
{
Options options = buildOptions();

handleCommandLine(options, args, log);
boolean GOOD_OPTIONS = false;
Options options = null;
try {
options = buildOptions();
GOOD_OPTIONS = true;
} catch (Exception e) {
//
}

try {
handleCommandLine(options, args, log);
} catch (ParseException e) {
GOOD_OPTIONS = false;
}

if (!GOOD_OPTIONS) {
HelpFormatter formatter = new HelpFormatter();
formatter.printHelp( "wordinator", options, true );
}
}

/**