Commit
Showing 2 changed files with 264 additions and 30 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,134 @@ | ||
# README | ||
# LaTeX to Word Conversion Tool | ||
|
||
There are two types of people in the world: those who use LaTeX and those who don't. The latter often ask the former for Word versions of their files. Hence the following command: | ||
[中文版本](README_zh.md) | ||
|
||
```bash | ||
pandoc input.tex -o output.docx \ | ||
--filter pandoc-crossref \ | ||
--reference-doc=my_temp.docx \ | ||
--number-sections \ | ||
-M autoEqnLabels -M tableEqns \ | ||
-M reference-section-title=Reference \ | ||
--bibliography=my_ref.bib \ | ||
--citeproc --csl ieee.csl | ||
This project provides a Python script that uses Pandoc and Pandoc-Crossref to automatically convert LaTeX files into Word documents in a specified format. | ||
Note that there is currently no perfect way to convert LaTeX to Word. The Word documents produced by this project are adequate for general review and editing, although about 5% of the content (such as author information) may need manual correction after conversion. | ||
|
||
## Features | ||
|
||
- Supports conversion of formulas; | ||
- Supports automatic numbering and cross-referencing of figures, tables, formulas, and references; | ||
- Supports figures composed of multiple subfigures; | ||
- Largely supports producing Word output in a specified format. | ||
|
||
## Quick Start | ||
|
||
Ensure that Pandoc, Pandoc-Crossref, and other dependencies are correctly installed, as detailed in [Installing Dependencies](#installing-dependencies). Execute the following command in the terminal: | ||
|
||
```shell | ||
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile> | ||
``` | ||
|
||
Replace `<...>` in the command with the appropriate file paths or folder names. | ||
|
||
## Installing Dependencies | ||
|
||
You need to install Pandoc, Pandoc-Crossref, and related Python libraries. | ||
|
||
### Pandoc | ||
|
||
Install Pandoc by referring to the [Pandoc Official Documentation](https://github.com/jgm/pandoc/blob/main/INSTALL.md). It is recommended to download the latest installation package from [Pandoc Releases](https://github.com/jgm/pandoc/releases). | ||
|
||
### Pandoc-Crossref | ||
|
||
Install Pandoc-Crossref as detailed in the [Pandoc-Crossref Official Documentation](https://github.com/lierdakil/pandoc-crossref). Ensure you download the version of Pandoc-Crossref that matches your Pandoc installation and configure the path appropriately. | ||
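Before running a conversion it can help to confirm that both executables are reachable. The following is a minimal sketch, not part of this project: the helper name `find_missing_tools` is hypothetical, and it only checks that the tools are on `PATH`, not that their versions match.

```python
import shutil


def find_missing_tools(tools=("pandoc", "pandoc-crossref"), which=shutil.which):
    """Return the names of the tools that cannot be located on PATH.

    `which` is injectable so the check can be exercised without the tools installed.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pandoc and pandoc-crossref are both on PATH")
```

If `pandoc-crossref` is reported missing, either move the executable into a directory on `PATH` or pass its full path to Pandoc's `--filter` option.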
|
||
### Related Python Libraries | ||
|
||
Install Python dependencies: | ||
|
||
```shell | ||
pip install -e . | ||
``` | ||
|
||
## Usage and Examples | ||
|
||
The tool supports both command line and script usage. Ensure all required dependencies are installed first. | ||
|
||
### Command Line Usage | ||
|
||
Execute the following command in the terminal: | ||
|
||
```shell | ||
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile> | ||
``` | ||
|
||
The files used by this command can be found in this repository. | ||
The following sections address part of this predicament and explain how to understand and run the command. | ||
Parameter explanations: | ||
- `--input_texfile`: Specifies the path of the LaTeX file to convert. | ||
- `--multifig_dir`: Specifies the directory for temporarily storing generated multi-figures. | ||
- `--output_docxfile`: Specifies the path of the output Word document. | ||
- `--reference_docfile`: Specifies a Word reference document to ensure consistency in document styling. | ||
- `--bibfile`: Specifies the BibTeX file for document citations. | ||
- `--cslfile`: Specifies the Citation Style Language file to control the formatting of references. | ||
- `--debug`: Enables debug mode to output more run-time information, helpful for troubleshooting. | ||
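The parameters listed above map naturally onto an `argparse` parser. The sketch below is an illustration only, not the parser in `tex2docx.py`: which options are marked as required is an assumption, and the help strings are paraphrased from the list.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command line options."""
    parser = argparse.ArgumentParser(description="Convert a LaTeX file to a Word document.")
    # Assumption: only the input and output paths are treated as mandatory here.
    parser.add_argument("--input_texfile", required=True, help="LaTeX file to convert")
    parser.add_argument("--output_docxfile", required=True, help="Word document to write")
    parser.add_argument("--multifig_dir", help="directory for temporary multi-figure images")
    parser.add_argument("--reference_docfile", help="Word reference document that defines styles")
    parser.add_argument("--bibfile", help="BibTeX file used for citations")
    parser.add_argument("--cslfile", help="Citation Style Language file")
    parser.add_argument("--debug", action="store_true", help="print more run-time information")
    return parser


# Parse the same arguments used in the tests/en example from this README.
args = build_parser().parse_args([
    "--input_texfile", "./tests/en/main.tex",
    "--multifig_dir", "./tests/en/multifigs",
    "--output_docxfile", "./tests/en/main_cli.docx",
    "--reference_docfile", "./my_temp.docx",
    "--bibfile", "./tests/ref.bib",
    "--cslfile", "./ieee.csl",
])
print(args.input_texfile, "->", args.output_docxfile)
```

Using `action="store_true"` for `--debug` means the flag takes no value and defaults to `False`, which matches the `'debug': False` entry in the script configuration.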
|
||
For example, using the `tests/en` test case, execute the following command in the repository directory: | ||
|
||
## Installation | ||
```shell | ||
python ./src/tex2docx.py --input_texfile ./tests/en/main.tex --multifig_dir ./tests/en/multifigs --output_docxfile ./tests/en/main_cli.docx --reference_docfile ./my_temp.docx --bibfile ./tests/ref.bib --cslfile ./ieee.csl | ||
``` | ||
You will find the converted `main_cli.docx` file in the `tests/en` directory. | ||
|
||
### Script Usage | ||
|
||
Create a script named `my_convert.py` with the following code and run it: | ||
|
||
```python | ||
# my_convert.py | ||
from tex2docx import LatexToWordConverter | ||
|
||
config = { | ||
'input_texfile': '<your_texfile>', | ||
'output_docxfile': '<your_docxfile>', | ||
'multifig_dir': '<dir_saving_temporary_figs>', | ||
'reference_docfile': '<your_reference_docfile>', | ||
'cslfile': '<your_cslfile>', | ||
'bibfile': '<your_bibfile>', | ||
'debug': False | ||
} | ||
|
||
converter = LatexToWordConverter(**config) | ||
converter.convert() | ||
``` | ||
|
||
1. pandoc: Refer to the [official documentation](https://github.com/jgm/pandoc/blob/main/INSTALL.md) for installation instructions. It is recommended to download the latest deb package from [Releases · jgm/pandoc (github.com)](https://github.com/jgm/pandoc/releases) and install it with `sudo dpkg -i /path/to/the/deb/file`. | ||
2. pandoc-crossref: Refer to the [official documentation](https://github.com/lierdakil/pandoc-crossref) for installation instructions. **NOTE: Download the version that matches your pandoc version and move the `pandoc-crossref` executable to `/usr/bin`, or give the full path to the executable when using the above command.** | ||
Examples can be found in `tests/test_tex2docx.py`. | ||
|
||
## Usage | ||
## Implementation Principles and References | ||
|
||
1. `--filter pandoc-crossref` processes cross-references. | ||
2. `--reference-doc=my_temp.docx` styles the converted `output.docx` according to `my_temp.docx`. The repository [Mingzefei/latex2word](https://github.com/Mingzefei/latex2word) contains two template files, `TIE-temp.docx` and `my_temp.docx`. The former is the Word template for TIE journal submissions (two columns), and the latter is a Word template adjusted by the author (single column, large font, suitable for annotations). | ||
3. `--number-sections` adds numbers before (sub)section titles. | ||
4. `-M autoEqnLabels` and `-M tableEqns` set the numbering of equations, tables, etc. | ||
5. `-M reference-section-title=Reference` adds the section title "Reference" to the reference section. | ||
6. `--bibliography=my_ref.bib` generates the reference list using `my_ref.bib`. | ||
7. `--citeproc --csl ieee.csl` formats the references in the `ieee` style. | ||
The core of this project is the use of Pandoc and Pandoc-Crossref tools to convert LaTeX to Word, configured as follows: | ||
|
||
### Running Test | ||
```shell | ||
pandoc texfile -o docxfile \ | ||
--lua-filter resolve_equation_labels.lua \ | ||
--filter pandoc-crossref \ | ||
--reference-doc=temp.docx \ | ||
--number-sections \ | ||
-M autoEqnLabels \ | ||
-M tableEqns \ | ||
-M reference-section-title=Reference \ | ||
--bibliography=ref.bib \ | ||
--citeproc --csl ieee.csl | ||
``` | ||
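The same invocation can be assembled from Python and handed to `subprocess`. This is a minimal sketch under stated assumptions: the function names `build_pandoc_command` and `run_pandoc` are hypothetical and are not taken from `tex2docx.py`; only the Pandoc options shown in the command above are used.

```python
import subprocess


def build_pandoc_command(texfile, docxfile, reference_doc, bibfile, cslfile,
                         lua_filter="resolve_equation_labels.lua"):
    """Return the pandoc invocation as an argument list (no shell quoting needed)."""
    return [
        "pandoc", texfile, "-o", docxfile,
        "--lua-filter", lua_filter,          # listed before pandoc-crossref, as in the command above
        "--filter", "pandoc-crossref",
        f"--reference-doc={reference_doc}",
        "--number-sections",
        "-M", "autoEqnLabels",
        "-M", "tableEqns",
        "-M", "reference-section-title=Reference",
        f"--bibliography={bibfile}",
        "--citeproc", "--csl", cslfile,
    ]


def run_pandoc(command):
    """Run the command and raise CalledProcessError if pandoc exits with an error."""
    subprocess.run(command, check=True)


command = build_pandoc_command("main.tex", "main.docx", "temp.docx", "ref.bib", "ieee.csl")
print(" ".join(command))
```

Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces. Call `run_pandoc(command)` once Pandoc and Pandoc-Crossref are installed.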
|
||
Go to `./test` and run `bash ./run.sh`. | ||
However, when applied directly to LaTeX files that contain figures with multiple subfigures, this method may fail to import the images correctly or may number references incorrectly. To address this, the project extracts the multi-figure code from the LaTeX file and uses LaTeX's built-in `convert` and `pdftocairo` tools to compile each such figure into a single large PNG file. These PNG files then replace the original figure code in the LaTeX document, so that multi-figure images are imported smoothly. For implementation details, see `tex2docx.py`. | ||
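The extract-and-replace step can be illustrated with a small regular-expression sketch. This is not the code in `tex2docx.py`: the helper names, the `multifig_<n>.png` naming scheme, and the choice to detect subfigures via `\begin{subfigure}` or `\subfloat` are all assumptions, and the replacement here drops captions and labels for brevity. Compiling each block to PNG is left out.

```python
import re

# Matches a whole figure (or figure*) environment, non-greedily.
FIGURE_RE = re.compile(r"\\begin\{figure\*?\}.*?\\end\{figure\*?\}", re.DOTALL)


def is_multifigure(block):
    """Assumption: a figure counts as multi-figure if it contains subfigure markup."""
    return "\\begin{subfigure}" in block or "\\subfloat" in block


def find_multifigure_blocks(tex):
    """Return the source of every figure environment that contains subfigures."""
    return [m.group(0) for m in FIGURE_RE.finditer(tex) if is_multifigure(m.group(0))]


def replace_multifigure_blocks(tex, make_replacement):
    """Replace each multi-figure block with make_replacement(index, block)."""
    counter = 0

    def _substitute(match):
        nonlocal counter
        block = match.group(0)
        if not is_multifigure(block):
            return block  # single-image figures are left untouched
        counter += 1
        return make_replacement(counter, block)

    return FIGURE_RE.sub(_substitute, tex)


sample = r"""
\begin{figure}
  \includegraphics{single.png}
  \caption{A single image.}
\end{figure}
\begin{figure}
  \begin{subfigure}{0.5\textwidth}\includegraphics{a.png}\end{subfigure}
  \begin{subfigure}{0.5\textwidth}\includegraphics{b.png}\end{subfigure}
  \caption{Two panels.}\label{fig:two}
\end{figure}
"""

blocks = find_multifigure_blocks(sample)
result = replace_multifigure_blocks(
    sample,
    lambda n, block: "\\begin{figure}\n  \\includegraphics{multifig_%d.png}\n\\end{figure}" % n,
)
print(result)
```

A regular expression is enough for flat figure environments like these; nested or commented-out environments would need a more careful parser.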
|
||
## Outstanding Issues | ||
|
||
1. An error may appear when opening the converted docx document. This is likely due to the complexity of the tex file being converted. Try reducing the number of images and avoiding tikz, for example. | ||
2. Subfigures are poorly supported, particularly their numbering. If the tex file does not contain subfigures, use the `-t docx+native_numbering` option to improve the numbering of images and tables. | ||
3. References to equations appear in the form `[<label>]`. See [Equation numbering in MS Word · Issue](https://github.com/lierdakil/pandoc-crossref/issues/221) for more information. No solution has been found yet, but they can be fixed with a global find-and-replace in Word. | ||
4. The image size set in tex has no effect in the converted docx file. A way to set the caption style for images has not been found yet. | ||
1. Refer to subfigures uniformly using `\ref{<figure_lab>}(a)`, not `\ref{<subfigure_lab>}` (direct subfigure referencing will be supported in future updates); | ||
2. The formatting of image captions and author information in the exported Word document needs manual adjustment. | ||
|
||
## Other | ||
|
||
There are two kinds of people in the world, those who use LaTeX and those who do not. The latter often request Word versions of documents from the former. Thus, the following command was born: | ||
|
||
```bash | ||
pandoc input.tex -o output.docx \ | ||
--filter pandoc-crossref \ | ||
--reference-doc=my_temp.docx \ | ||
--number-sections \ | ||
-M autoEqnLabels -M tableEqns \ | ||
-M reference-section-title=Reference \ | ||
--bibliography=my_ref.bib \ | ||
--citeproc --csl ieee.csl | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,143 @@ | ||
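# LaTeX to Word Conversion Tool | ||
|
||
This project provides a Python script that uses Pandoc and Pandoc-Crossref to automatically convert LaTeX files into Word documents in a specified format. | ||
Note that there is still no perfect way to convert LaTeX to Word. The Word documents generated by this project are adequate for general review and revision, although about 5% of the content (such as author information) may need manual correction after conversion. | ||
|
||
## Features | ||
|
||
- Supports conversion of formulas; | ||
- Supports automatic numbering and cross-referencing of figures, tables, formulas, and references; | ||
- Supports figures composed of multiple subfigures; | ||
- Largely supports producing Word output in a specified format. | ||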
|
||
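## Quick Start | ||
|
||
Ensure that Pandoc, Pandoc-Crossref, and the other dependencies are correctly installed, as detailed in [Installing Dependencies](#installing-dependencies). Run the following command in the terminal: | ||
|
||
```shell | ||
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile> | ||
``` | ||
|
||
Replace each `<...>` in the command with the corresponding file path or folder name. | ||
|
||
## Installing Dependencies | ||
|
||
You need to install Pandoc, Pandoc-Crossref, and the related Python libraries. | ||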
|
||
### Pandoc | ||
|
||
Install Pandoc as described in the [Pandoc Official Documentation](https://github.com/jgm/pandoc/blob/main/INSTALL.md). It is recommended to download the latest installation package from [Pandoc Releases](https://github.com/jgm/pandoc/releases). | ||
|
||
### Pandoc-Crossref | ||
|
||
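Install Pandoc-Crossref as described in the [Pandoc-Crossref Official Documentation](https://github.com/lierdakil/pandoc-crossref). Make sure to download the Pandoc-Crossref version that matches your Pandoc version, and configure the path appropriately. | ||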
|
||
### Related Python Libraries | ||
|
||
Install the Python dependencies: | ||
|
||
```shell | ||
pip install -e . | ||
``` | ||
|
||
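## Usage and Examples | ||
|
||
The tool supports both command line and script usage. Ensure the required dependencies are installed first. | ||
|
||
### Command Line Usage | ||
|
||
Run the following command in the terminal: | ||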
|
||
```shell | ||
python ./src/tex2docx.py --input_texfile <your_texfile> --multifig_dir <dir_saving_temporary_figs> --output_docxfile <your_docxfile> --reference_docfile <your_reference_docfile> --bibfile <your_bibfile> --cslfile <your_cslfile> | ||
``` | ||
|
||
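Parameter explanations: | ||
- `--input_texfile`: Specifies the path of the LaTeX file to convert. | ||
- `--multifig_dir`: Specifies the directory for temporarily storing the generated multi-figure files. | ||
- `--output_docxfile`: Specifies the path of the output Word file. | ||
- `--reference_docfile`: Specifies the reference document for the Word output format, which helps keep document styling consistent. | ||
- `--bibfile`: Specifies the BibTeX file of references, used for citations in the document. | ||
- `--cslfile`: Specifies the citation style file (Citation Style Language), which controls the formatting of references. | ||
- `--debug`: Enables debug mode to output more run-time information, which helps with troubleshooting. | ||
|
||
|
||
For example, using the `tests/en` test case, run the following command in the repository directory: | ||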
|
||
```shell | ||
python ./src/tex2docx.py --input_texfile ./tests/en/main.tex --multifig_dir ./tests/en/multifigs --output_docxfile ./tests/en/main_cli.docx --reference_docfile ./my_temp.docx --bibfile ./tests/ref.bib --cslfile ./ieee.csl | ||
``` | ||
You will then find the converted `main_cli.docx` file in the `tests/en` directory. | ||
|
||
### Script Usage | ||
|
||
Create a script named `my_convert.py` with the following code and run it: | ||
|
||
```python | ||
# my_convert.py | ||
from tex2docx import LatexToWordConverter | ||
|
||
config = { | ||
'input_texfile': '<your_texfile>', | ||
'output_docxfile': '<your_docxfile>', | ||
'multifig_dir': '<dir_saving_temporary_figs>', | ||
'reference_docfile': '<your_reference_docfile>', | ||
'cslfile': '<your_cslfile>', | ||
'bibfile': '<your_bibfile>', | ||
'debug': False | ||
} | ||
|
||
converter = LatexToWordConverter(**config) | ||
converter.convert() | ||
``` | ||
|
||
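Examples can be found in `tests/test_tex2docx.py`. | ||
|
||
## Implementation Principles and References | ||
|
||
The core of this project is the use of Pandoc and Pandoc-Crossref to convert LaTeX to Word, configured as follows: | ||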
|
||
```shell | ||
pandoc texfile -o docxfile \ | ||
--lua-filter resolve_equation_labels.lua \ | ||
--filter pandoc-crossref \ | ||
--reference-doc=temp.docx \ | ||
--number-sections \ | ||
-M autoEqnLabels \ | ||
-M tableEqns \ | ||
-M reference-section-title=Reference \ | ||
--bibliography=ref.bib \ | ||
--citeproc --csl ieee.csl | ||
``` | ||
|
||
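Here: | ||
1. `--lua-filter resolve_equation_labels.lua` handles equation numbering and equation cross-references, inspired by Constantin Ahlmann-Eltze's [script](https://gist.githubusercontent.com/const-ae/752ad85c43d92b72865453ea3a77e2dd/raw/28c1815979e5d03cd9ab3638f9befd354797a72b/resolve_equation_labels.lua); | ||
2. `--filter pandoc-crossref` handles cross-references other than equations; | ||
3. `--reference-doc=my_temp.docx` generates the Word file according to the styles in `my_temp.docx`. The repository [Mingzefei/latex2word](https://github.com/Mingzefei/latex2word) provides two template files, `TIE-temp.docx` and `my_temp.docx`. The former is the Word template for TIE journal submissions (two columns), and the latter is a personally adjusted Word template (single column, convenient for annotations); | ||
4. `--number-sections` adds numbers before (sub)section titles; | ||
5. `-M autoEqnLabels` and `-M tableEqns` set the numbering of equations, tables, etc.; | ||
6. `-M reference-section-title=Reference` adds the section title "Reference" to the reference section; | ||
7. `--bibliography=ref.bib` generates the reference list from `ref.bib`; | ||
8. `--citeproc --csl ieee.csl` formats the references in the `ieee` style. | ||
|
||
However, when applied directly to LaTeX files that contain figures with multiple subfigures, the above method may fail to import the images correctly or may number references incorrectly. To address this, the project extracts the multi-subfigure code from the LaTeX file and uses LaTeX's built-in `convert` and `pdftocairo` tools to compile each such figure into a single large PNG file. These PNG files then replace the corresponding figure code in the original LaTeX document, so that multi-subfigure images are imported smoothly. For implementation details, see `tex2docx.py`. | ||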
|
||
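## Outstanding Issues | ||
|
||
1. Refer to subfigures uniformly using `\ref{<figure_lab>}(a)`, not `\ref{<subfigure_lab>}` (direct subfigure referencing will be supported in a future update); | ||
2. The image caption formatting and author information in the exported Word file need manual adjustment. | ||
|
||
## Other | ||
|
||
There are two kinds of people in the world: those who use LaTeX and those who do not. The latter often ask the former for Word versions of their files. Hence the following one-line command. | ||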
|
||
```bash | ||
pandoc input.tex -o output.docx \ | ||
--filter pandoc-crossref \ | ||
--reference-doc=my_temp.docx \ | ||
--number-sections \ | ||
-M autoEqnLabels -M tableEqns \ | ||
-M reference-section-title=Reference \ | ||
--bibliography=my_ref.bib \ | ||
--citeproc --csl ieee.csl | ||
``` |