Tom Rainsford, ILR Stuttgart, May 2023
The files provided in this repository may be used to process XML files downloaded from the FRANTEXT platform in the following ways:
- The files in the import directory allow TXM to import FRANTEXT XML files directly.
- The files in the conll directory convert the imported XML-TXM files into Conll format so that they can be parsed.
IMPORTANT NOTICE: This repository does not contain XML files from FRANTEXT. A subscription to FRANTEXT is required in order to obtain the source texts.
1. Download and install TXM version 0.8.2.
2. Create a corpus in FRANTEXT containing the texts that you want to include and download the XML files.
3. Export the metadata for your corpus from FRANTEXT as a CSV file using the "Exporter" function.
4. Create a new directory and copy the following files into it:
   - the XML files downloaded from FRANTEXT
   - the .csv file containing the metadata, which must be renamed metadata.csv
     - Ensure also that you remove the .xml suffix from the values in the id column.
   - the xsl directory contained in the import directory of this repository.
5. Launch TXM and select Import > XML-XTZ + CSV.
6. Select the directory you created in step 4 and launch the importer.
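The directory preparation described in step 4 can be scripted. The following Python sketch is only an illustration and is not part of this repository: the function names are hypothetical, and it assumes that the metadata export is UTF-8 encoded and comma-delimited (pass a different delimiter if your export uses one) and that the downloaded files end in .xml.

```python
import csv
import shutil
from pathlib import Path


def strip_xml_suffix(value: str) -> str:
    """Remove a trailing '.xml' from an identifier, if present."""
    return value[:-4] if value.endswith(".xml") else value


def write_metadata(src_csv: Path, dest_dir: Path, delimiter: str = ",") -> Path:
    """Write dest_dir/metadata.csv from the FRANTEXT export,
    stripping '.xml' from the values in the 'id' column.

    Assumption: the export is UTF-8 and uses `delimiter` as separator.
    """
    dest = dest_dir / "metadata.csv"
    with src_csv.open(newline="", encoding="utf-8") as fin, \
            dest.open("w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter=delimiter)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter=delimiter)
        writer.writeheader()
        for row in reader:
            row["id"] = strip_xml_suffix(row["id"])
            writer.writerow(row)
    return dest


def prepare_import_dir(xml_dir: Path, metadata_csv: Path, repo_dir: Path,
                       dest_dir: Path, delimiter: str = ",") -> None:
    """Assemble the import directory: XML files, metadata.csv and xsl/."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Copy the XML files downloaded from FRANTEXT.
    for xml_file in xml_dir.glob("*.xml"):
        shutil.copy2(xml_file, dest_dir / xml_file.name)
    # Copy and clean the metadata export.
    write_metadata(metadata_csv, dest_dir, delimiter)
    # Copy the xsl directory from import/ in this repository (Python 3.8+).
    shutil.copytree(repo_dir / "import" / "xsl", dest_dir / "xsl",
                    dirs_exist_ok=True)
```

The metadata export should be kept outside the target directory, since the script writes a fresh metadata.csv there.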
The original FRANTEXT tokenization is based on lexical units, and so FRANTEXT tokens may contain spaces, e.g. parce que, Louis XIV, or even au fur et à mesure.
By default, the import XSL retokenizes every token containing whitespace so that no token contains a space, modifying the pos and lemma tags as follows:
- the single pos tag is copied to all tokens, with an asterisk appended to the pos tag of all but the final word.
  - For example, the token parce que tagged CS becomes two tokens: parce tagged CS* and que tagged CS.
- where possible, the lemma tag is also retokenized.
  - For example, the token parce que, lemma parce que, becomes two tokens: the token parce with lemma parce and the token que with lemma que.
- where the lemma tag cannot be redistributed across the new tokens, the whitespace is replaced by a full stop and the tag is copied to all new tokens. As with the pos tag, an asterisk is appended to the lemma tag on all but the final token.
  - For example, the token pource que, lemma pour ce que, becomes two tokens: a token pource, lemma pour.ce.que*, and a second token que, lemma pour.ce.que.
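The retokenization rules above can be illustrated with a short Python sketch. This is not the XSL used by the importer, and the function name is hypothetical. In particular, it assumes that a lemma can be redistributed exactly when it has the same number of whitespace-separated parts as the token; the stylesheet's actual criterion may differ.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (form, pos, lemma)


def retokenize(form: str, pos: str, lemma: str) -> List[Token]:
    """Split a FRANTEXT token containing whitespace into one token per word."""
    words = form.split()
    if len(words) <= 1:
        # Nothing to split: the token is returned unchanged.
        return [(form, pos, lemma)]
    lemma_parts = lemma.split()
    # Assumption: the lemma is redistributable only if the counts match.
    redistributable = len(lemma_parts) == len(words)
    last = len(words) - 1
    tokens = []
    for i, word in enumerate(words):
        star = "" if i == last else "*"  # asterisk on all but the final word
        if redistributable:
            word_lemma = lemma_parts[i]
        else:
            # Whitespace replaced by a full stop, tag copied to every token.
            word_lemma = ".".join(lemma_parts) + star
        tokens.append((word, pos + star, word_lemma))
    return tokens


print(retokenize("parce que", "CS", "parce que"))
# [('parce', 'CS*', 'parce'), ('que', 'CS', 'que')]
```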
If you wish to retain the original FRANTEXT tokenization, simply modify the file import/xsl/2-front/xml-frantext_to_xml-txm-xtz.xsl before importing.
Replace the line:
<xsl:param name="retokenize" select="'yes'"/>
with the line:
<xsl:param name="retokenize" select="'no'"/>
When the texts are imported, TXM creates new XML-TXM files containing
unique identifiers for each token. These files can be found in the directory
TXM-0.8.2/corpora/<NAME OF YOUR CORPUS>/txm/<NAME OF YOUR CORPUS>.
To convert these files to Conll format while retaining the unique
identifiers, use the Saxon XSLT processor with the xml-txm_to_conll.xsl
stylesheet:
java -cp <PATH>/frantexttxm/saxonb.jar net.sf.saxon.Transform <XML-TXM FILE> <PATH>/frantexttxm/conll/xml-txm_to_conll.xsl > <OUTPUT_FILE>
If you want the Conll file to contain the FRANTEXT lemmas and pos tags, use the following command:
java -cp <PATH>/frantexttxm/saxonb.jar net.sf.saxon.Transform <XML-TXM FILE> <PATH>/frantexttxm/conll/xml-txm_to_conll.xsl include-annotation=yes > <OUTPUT_FILE>
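To convert a whole corpus rather than a single file, the command can be wrapped in a small script. The Python sketch below is an illustration only: the function names are hypothetical, and it assumes that the XML-TXM files end in .xml and that a .conll extension is acceptable for the output. It uses only the Saxon arguments shown in the commands above.

```python
import subprocess
from pathlib import Path
from typing import List


def saxon_command(saxon_jar, xml_txm_file, stylesheet,
                  include_annotation: bool = False) -> List[str]:
    """Build the Saxon command line for one XML-TXM file."""
    cmd = ["java", "-cp", str(saxon_jar), "net.sf.saxon.Transform",
           str(xml_txm_file), str(stylesheet)]
    if include_annotation:
        # Keep the FRANTEXT lemmas and pos tags in the output.
        cmd.append("include-annotation=yes")
    return cmd


def convert_all(txm_dir: Path, out_dir: Path, saxon_jar: Path,
                stylesheet: Path, include_annotation: bool = False) -> List[Path]:
    """Convert every XML-TXM file in txm_dir, one output file per input."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for xml_file in sorted(txm_dir.glob("*.xml")):
        # Assumption: '.conll' is used as the output extension.
        out_file = out_dir / (xml_file.stem + ".conll")
        with out_file.open("wb") as fout:
            # Saxon writes the result to standard output, as in the
            # shell commands with '> <OUTPUT_FILE>'.
            subprocess.run(
                saxon_command(saxon_jar, xml_file, stylesheet, include_annotation),
                stdout=fout, check=True)
        outputs.append(out_file)
    return outputs
```

With check=True, the script stops at the first file for which Saxon returns an error, which makes a failed conversion easy to spot.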
- FRANTEXT: https://www.frantext.fr/
- Saxon XSLT processor: https://www.saxonica.com
- TXM: https://txm.gitpages.huma-num.fr/textometrie/