Modify README and add README_zh
Mingzefei committed Sep 21, 2024
1 parent 428bab1 commit 00a3f63
Showing 2 changed files with 264 additions and 30 deletions.
151 changes: 121 additions & 30 deletions README.md
@@ -1,43 +1,134 @@
# README
# LaTeX to Word Conversion Tool

There are two types of people in the world: those who use LaTeX and those who don't. The latter often ask the former for Word versions of their files. Therefore, the following command line is created:
[Chinese version](README_zh.md)

```bash
pandoc input.tex -o output.docx \
--filter pandoc-crossref \
--reference-doc=my_temp.docx \
--number-sections \
-M autoEqnLabels -M tableEqns \
-M reference-section-title=Reference \
--bibliography=my_ref.bib \
--citeproc --csl ieee.csl
```
This project provides a Python script that uses Pandoc and Pandoc-Crossref to automatically convert LaTeX files into Word documents in a specified format.
Note that there is currently no perfect way to convert LaTeX to Word. The Word documents produced by this project are adequate for general review and editing, although about 5% of the content (such as author information) may need manual correction after conversion.

## Features

- Supports the conversion of formulas;
- Supports automatic numbering and cross-referencing of figures, tables, formulas, and references;
- Supports figures with multiple subfigures;
- Largely supports producing Word output in a specified format.

## Quick Start

Ensure that Pandoc, Pandoc-Crossref, and other dependencies are correctly installed, as detailed in [Installing Dependencies](#installing-dependencies). Execute the following command in the terminal:

```shell
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile>
```

Replace `<...>` in the command with the appropriate file paths or folder names.

## Installing Dependencies

You need to install Pandoc, Pandoc-Crossref, and related Python libraries.

### Pandoc

Install Pandoc by referring to the [Pandoc Official Documentation](https://github.com/jgm/pandoc/blob/main/INSTALL.md). It is recommended to download the latest installation package from [Pandoc Releases](https://github.com/jgm/pandoc/releases).

### Pandoc-Crossref

Install Pandoc-Crossref as detailed in the [Pandoc-Crossref Official Documentation](https://github.com/lierdakil/pandoc-crossref). Ensure you download the version of Pandoc-Crossref that matches your Pandoc installation and configure the path appropriately.
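
Before running a conversion, it can help to confirm that both tools are actually reachable. The following is a minimal sketch, not part of this project: `find_tools` and `first_version_line` are hypothetical helper names, and the sketch assumes both executables are expected on `PATH` and accept a `--version` flag.

```python
import shutil
import subprocess


def find_tools(names=("pandoc", "pandoc-crossref")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def first_version_line(executable):
    """Return the first line printed by `<executable> --version` (empty if none)."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    for name, path in find_tools().items():
        if path is None:
            print(f"{name}: not found on PATH")
        else:
            print(f"{name}: {path} ({first_version_line(path)})")
```

Comparing the two reported versions by eye is usually enough to spot a Pandoc-Crossref build that does not match the installed Pandoc.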

### Related Python Libraries

Install Python dependencies:

```shell
pip install -e .
```

## Usage and Examples

The tool can be used from the command line or from a script. Make sure all required dependencies are installed first.

### Command Line Usage

Execute the following command in the terminal:

```shell
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile>
```

The files used by this command can be found in this repository.
The following content will partially solve the above predicament, enabling you to comprehend and execute this command smoothly.
Parameter explanations:
- `--input_texfile`: Specifies the path of the LaTeX file to convert.
- `--multifig_dir`: Specifies the directory for temporarily storing generated multi-figures.
- `--output_docxfile`: Specifies the path of the output Word document.
- `--reference_docfile`: Specifies a Word reference document to ensure consistency in document styling.
- `--bibfile`: Specifies the BibTeX file for document citations.
- `--cslfile`: Specifies the Citation Style Language file to control the formatting of references.
- `--debug`: Enables debug mode to output more run-time information, helpful for troubleshooting.
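
The flags listed above could be declared with `argparse` roughly as follows. This is only an illustrative sketch and is not taken from `tex2docx.py`: the function name `build_parser` is hypothetical, and treating every path option as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser for the documented flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Convert a LaTeX file to a Word document."
    )
    # Assumption: every path option is mandatory; the real script may differ.
    parser.add_argument("--input_texfile", required=True,
                        help="LaTeX file to convert")
    parser.add_argument("--multifig_dir", required=True,
                        help="directory for temporary multi-figure files")
    parser.add_argument("--output_docxfile", required=True,
                        help="output Word document")
    parser.add_argument("--reference_docfile", required=True,
                        help="Word reference document that defines the styles")
    parser.add_argument("--bibfile", required=True,
                        help="BibTeX file for citations")
    parser.add_argument("--cslfile", required=True,
                        help="Citation Style Language file")
    parser.add_argument("--debug", action="store_true",
                        help="print extra run-time information")
    return parser


# Parse an explicit list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args([
    "--input_texfile", "main.tex",
    "--multifig_dir", "multifigs",
    "--output_docxfile", "main.docx",
    "--reference_docfile", "my_temp.docx",
    "--bibfile", "ref.bib",
    "--cslfile", "ieee.csl",
])
config = vars(args)  # plain dict keyed by option name
```

The resulting `config` dictionary has the same keys as the dictionary used in the Script Usage section.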

For example, using the `tests/en` test case, execute the following command in the repository directory:

## Installation
```shell
python ./src/tex2docx.py --input_texfile ./tests/en/main.tex --multifig_dir ./tests/en/multifigs --output_docxfile ./tests/en/main_cli.docx --reference_docfile ./my_temp.docx --bibfile ./tests/ref.bib --cslfile ./ieee.csl
```
You will find the converted `main_cli.docx` file in the `tests/en` directory.

### Script Usage

Create a script `my_convert.py` containing the following code and run it:

```python
# my_convert.py
from tex2docx import LatexToWordConverter

config = {
    'input_texfile': '<your_texfile>',
    'output_docxfile': '<your_docxfile>',
    'multifig_dir': '<dir_saving_temporary_figs>',
    'reference_docfile': '<your_reference_docfile>',
    'cslfile': '<your_cslfile>',
    'bibfile': '<your_bibfile>',
    'debug': False
}

converter = LatexToWordConverter(**config)
converter.convert()
```
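
When the placeholders are replaced by real paths, a quick check that the input files exist can catch typos before the converter runs. The sketch below is a hypothetical helper and not part of the `tex2docx` API. It assumes that the four keys listed in `INPUT_KEYS` must point to existing files, while `output_docxfile` and `multifig_dir` may be created later.

```python
from pathlib import Path

# Assumption: these four config entries must refer to files that already exist.
INPUT_KEYS = ("input_texfile", "reference_docfile", "bibfile", "cslfile")


def missing_inputs(config):
    """Return the config keys whose value is not an existing file."""
    return [key for key in INPUT_KEYS if not Path(config[key]).is_file()]
```

Calling `missing_inputs(config)` before constructing the converter returns an empty list when every input file is present.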

1. pandoc: Refer to the [official documentation](https://github.com/jgm/pandoc/blob/main/INSTALL.md) for instruction and installation. It is recommended to download the latest deb installation package from [Releases · jgm/pandoc (github.com)](https://github.com/jgm/pandoc/releases) and use `sudo dpkg -i /path/to/the/deb/file` to install it.
2. pandoc-crossref: Refer to the [official documentation](https://github.com/lierdakil/pandoc-crossref) for instruction and installation. **NOTE: Download the version that matches your pandoc version and move the executable file `pandoc-crossref` to `/usr/bin`, or specify the specific file when using the above command.**
Examples can be found in `tests/test_tex2docx.py`.

## Usage
## Implementation Principles and References

1. `--filter pandoc-crossref` processes cross-references.
2. `--reference-doc=my_temp.docx` styles the converted `output.docx` according to `my_temp.docx`. The repository [Mingzefei/latex2word](https://github.com/Mingzefei/latex2word) contains two template files, `TIE-temp.docx` and `my_temp.docx`. The former is the Word template for TIE journal submissions (two columns), and the latter is a Word template adjusted by the author (single column, large font, suitable for annotations).
3. `--number-sections` adds numbers before (sub)section titles.
4. `-M autoEqnLabels` and `-M tableEqns` configure the numbering of equations, tables, and so on.
5. `-M reference-section-title=Reference` adds the section title "Reference" to the reference list.
6. `--bibliography=my_ref.bib` generates the reference list from `my_ref.bib`.
7. `--citeproc --csl ieee.csl` formats the references in the `ieee` style.
The core of this project is the use of Pandoc and Pandoc-Crossref tools to convert LaTeX to Word, configured as follows:

### Running Test
```shell
pandoc texfile -o docxfile \
--lua-filter resolve_equation_labels.lua \
--filter pandoc-crossref \
--reference-doc=temp.docx \
--number-sections \
-M autoEqnLabels \
-M tableEqns \
-M reference-section-title=Reference \
--bibliography=ref.bib \
--citeproc --csl ieee.csl
```
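
From Python, the same invocation can be assembled as an argument list and passed to `subprocess.run`. This is a sketch under stated assumptions rather than the code in `tex2docx.py`: the function names `build_pandoc_command` and `run_pandoc` are hypothetical, and only the options shown in the command above are used.

```python
import subprocess


def build_pandoc_command(texfile, docxfile, reference_doc, bibfile, cslfile,
                         lua_filter="resolve_equation_labels.lua"):
    """Return the pandoc call shown above as an argument list."""
    return [
        "pandoc", texfile, "-o", docxfile,
        # Filters are listed in the same order as in the shell command.
        "--lua-filter", lua_filter,
        "--filter", "pandoc-crossref",
        f"--reference-doc={reference_doc}",
        "--number-sections",
        "-M", "autoEqnLabels",
        "-M", "tableEqns",
        "-M", "reference-section-title=Reference",
        f"--bibliography={bibfile}",
        "--citeproc",
        "--csl", cslfile,
    ]


def run_pandoc(command):
    """Run pandoc and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when file paths contain spaces.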

Go to `./test` and run `bash ./run.sh`.
However, when applied directly to LaTeX files that contain figures with multiple subfigures, this method may fail to import the images correctly or may number references incorrectly. To address this, the project extracts the multi-subfigure code from the LaTeX file, compiles it, and uses the `convert` (ImageMagick) and `pdftocairo` (Poppler) command-line tools to turn each such figure into a single large PNG file. These PNG files then replace the original figure code in the LaTeX document, so that multi-subfigure figures are imported smoothly. For implementation details, see `tex2docx.py`.
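
The extraction step described above could look roughly like this. It is a simplified illustration, not the logic in `tex2docx.py`: `find_multifigures` is a hypothetical name, the regular expression assumes `figure` environments are not nested, and subfigures are assumed to be written with `\begin{subfigure}`, `\subfigure`, or `\subfloat`.

```python
import re

# Non-greedy match of one figure (or figure*) environment, across line breaks.
FIGURE_ENV = re.compile(r"\\begin\{figure\*?\}.*?\\end\{figure\*?\}", re.DOTALL)

# Assumption: these markers identify a figure that contains subfigures.
SUBFIGURE_MARKERS = ("\\begin{subfigure}", "\\subfigure", "\\subfloat")


def find_multifigures(tex_source):
    """Return the source of every figure environment that contains subfigures."""
    figures = FIGURE_ENV.findall(tex_source)
    return [
        fig for fig in figures
        if any(marker in fig for marker in SUBFIGURE_MARKERS)
    ]
```

Each returned snippet would then be compiled on its own and replaced in the document by a single `\includegraphics` of the resulting PNG.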

## Outstanding Issues

1. An error may be reported when opening the converted docx document. This is likely due to the complexity of the tex file being converted. Try reducing the number of images or avoiding tikz, for example.
2. Subfigures are poorly supported, particularly their numbering. If the tex file does not use subfigures, add the `-t docx+native_numbering` option to improve the numbering of images and tables.
3. References to equations appear in the form `[<label>]`. See [Equation numbering in MS Word · Issue](https://github.com/lierdakil/pandoc-crossref/issues/221) for more information. No solution has been found yet, but Word's global find-and-replace can be used to fix them.
4. The image size set in tex has no effect in the converted docx file. A method for setting the caption style for images has not been found yet.
1. Always reference subfigures as `\ref{<figure_lab>}(a)` rather than `\ref{<subfigure_lab>}` (direct subfigure references will be supported in a future update);
2. The formatting of figure captions and author information in the exported Word document must be adjusted manually.

## Other

There are two kinds of people in the world, those who use LaTeX and those who do not. The latter often request Word versions of documents from the former. Thus, the following command was born:

```bash
pandoc input.tex -o output.docx \
--filter pandoc-crossref \
--reference-doc=my_temp.docx \
--number-sections \
-M autoEqnLabels -M tableEqns \
-M reference-section-title=Reference \
--bibliography=my_ref.bib \
--citeproc --csl ieee.csl
```
143 changes: 143 additions & 0 deletions README_zh.md
@@ -0,0 +1,143 @@
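# LaTeX to Word Conversion Tool

This project provides a Python script that uses Pandoc and Pandoc-Crossref to automatically convert LaTeX files into Word files in a specified format.
Note that there is still no perfect way to convert LaTeX to Word. The Word files generated by this project are adequate for general review and revision; about 5% of the content (such as author information) may need manual correction after conversion.

## Features

- Supports the conversion of formulas;
- Supports automatic numbering and cross-referencing of figures, tables, formulas, and references;
- Supports figures with multiple subfigures;
- Largely supports producing Word output in a specified format.

## Quick Start

Make sure Pandoc, Pandoc-Crossref, and the other dependencies are installed correctly; see [Installing Dependencies](#installing-dependencies). Then run the following command in a terminal:

```shell
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile>
```

Replace each `<...>` in the command with the corresponding file path or folder name.

## Installing Dependencies

You need to install Pandoc, Pandoc-Crossref, and the related Python libraries.

### Pandoc

Install Pandoc as described in the [Pandoc Official Documentation](https://github.com/jgm/pandoc/blob/main/INSTALL.md). Downloading the latest installation package from [Pandoc Releases](https://github.com/jgm/pandoc/releases) is recommended.

### Pandoc-Crossref

Install Pandoc-Crossref as described in the [Pandoc-Crossref Official Documentation](https://github.com/lierdakil/pandoc-crossref). Make sure to download the Pandoc-Crossref version that matches your Pandoc version, and configure the path appropriately.

### Related Python Libraries

Install the Python dependencies: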

```shell
pip install -e .
```

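## Usage and Examples

The tool can be used from the command line or from a script. Make sure the required dependencies are installed first.

### Command Line Usage

Run the following command in a terminal: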

```shell
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile>
```

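Parameter explanations:
- `--input_texfile`: Specifies the path of the LaTeX file to convert.
- `--multifig_dir`: Specifies the directory for temporarily storing the generated multi-figure files.
- `--output_docxfile`: Specifies the path of the output Word file.
- `--reference_docfile`: Specifies the reference document that defines the Word output format, which helps keep the document styling consistent.
- `--bibfile`: Specifies the BibTeX file used for citations in the document.
- `--cslfile`: Specifies the Citation Style Language file that controls the formatting of references.
- `--debug`: Enables debug mode to output more run-time information, which helps with troubleshooting.

Using the `tests/en` test case as an example, run the following command in the repository directory: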

```shell
python ./src/tex2docx.py --input_texfile ./tests/en/main.tex --multifig_dir ./tests/en/multifigs --output_docxfile ./tests/en/main_cli.docx --reference_docfile ./my_temp.docx --bibfile ./tests/ref.bib --cslfile ./ieee.csl
```
The converted `main_cli.docx` file will then be in the `tests/en` directory.

### Script Usage

Create a script `my_convert.py` containing the following code and run it:

```python
# my_convert.py
from tex2docx import LatexToWordConverter

config = {
    'input_texfile': '<your_texfile>',
    'output_docxfile': '<your_docxfile>',
    'multifig_dir': '<dir_saving_temporary_figs>',
    'reference_docfile': '<your_reference_docfile>',
    'cslfile': '<your_cslfile>',
    'bibfile': '<your_bibfile>',
    'debug': False
}

converter = LatexToWordConverter(**config)
converter.convert()
```

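Examples can be found in `tests/test_tex2docx.py`.

## Implementation Principles and References

The core of this project is the use of Pandoc and Pandoc-Crossref to convert LaTeX to Word, configured as follows: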

```shell
pandoc texfile -o docxfile \
--lua-filter resolve_equation_labels.lua \
--filter pandoc-crossref \
--reference-doc=temp.docx \
--number-sections \
-M autoEqnLabels \
-M tableEqns \
-M reference-section-title=Reference \
--bibliography=ref.bib \
--citeproc --csl ieee.csl
```

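In this command:
1. `--lua-filter resolve_equation_labels.lua` handles equation numbering and equation cross-references, inspired by Constantin Ahlmann-Eltze's [script](https://gist.githubusercontent.com/const-ae/752ad85c43d92b72865453ea3a77e2dd/raw/28c1815979e5d03cd9ab3638f9befd354797a72b/resolve_equation_labels.lua);
2. `--filter pandoc-crossref` handles cross-references other than equations;
3. `--reference-doc=my_temp.docx` generates the Word file using the styles in `my_temp.docx`. The repository [Mingzefei/latex2word](https://github.com/Mingzefei/latex2word) provides two template files, `TIE-temp.docx` and `my_temp.docx`; the former is the Word submission template for the TIE journal (two columns), and the latter is a Word template adjusted by the author (single column, convenient for annotations);
4. `--number-sections` adds numbers before (sub)section titles;
5. `-M autoEqnLabels` and `-M tableEqns` configure the numbering of equations, tables, and so on;
6. `-M reference-section-title=Reference` adds the section title "Reference" to the reference list;
7. `--bibliography=ref.bib` generates the reference list from `ref.bib`;
8. `--citeproc --csl ieee.csl` formats the references in the `ieee` style.

However, when applied directly to LaTeX files that contain figures with multiple subfigures, this method may fail to import the images correctly or may number references incorrectly. To address this, the project extracts the multi-subfigure code from the LaTeX file, compiles it, and uses the `convert` (ImageMagick) and `pdftocairo` (Poppler) command-line tools to turn each such figure into a single large PNG file. These PNG files then replace the corresponding figure code in the original LaTeX document, so that multi-subfigure figures are imported smoothly. For the implementation, see `tex2docx.py`.

## Outstanding Issues

1. Always reference subfigures as `\ref{<figure_lab>}(a)` rather than `\ref{<subfigure_lab>}` (direct subfigure references will be supported later);
2. The figure caption format and author information in the exported Word file must be adjusted manually.

## Other

There are two kinds of people in the world: those who use LaTeX and those who do not. The latter often ask the former for Word versions of their files. Hence the following one-line command: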

```bash
pandoc input.tex -o output.docx \
--filter pandoc-crossref \
--reference-doc=my_temp.docx \
--number-sections \
-M autoEqnLabels -M tableEqns \
-M reference-section-title=Reference \
--bibliography=my_ref.bib \
--citeproc --csl ieee.csl
```
