Funnelling is a new ensemble method for heterogeneous transfer learning. The present Python implementation concerns the application of Funnelling to Polylingual Text Classification (PLC).
The two variants of Funnelling, Fun(KFCV) and Fun(TAT), are implemented by the FunnellingPolylingualClassifier class, and instantiated by setting folded_projections=k for Fun(KFCV) (with k>1 the number of folds) or folded_projections=1 for Fun(TAT).
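For example, a minimal instantiation sketch (the import path and any additional constructor arguments are assumptions; check learners.py for the actual signature):

from learners import FunnellingPolylingualClassifier

# Fun(TAT): the first-tier classifiers are trained and then applied ("train and test") to project the training documents
fun_tat = FunnellingPolylingualClassifier(folded_projections=1)

# Fun(KFCV): the projections of the training documents are produced via k-fold cross-validation (here k=10)
fun_kfcv = FunnellingPolylingualClassifier(folded_projections=10)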
This code has been used to produce all experimental results reported in the article "Esuli, A., Moreo, A., & Sebastiani, F. (2019). Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and Its Application to Cross-Lingual Text Classification. ACM Transactions on Information Systems (TOIS), 37(3), 37.".
This package also contains the code implementing all the baselines involved in the experimental evaluation. Some of these baselines may require external resources. All baselines are implemented in the learners.py script. The list of baselines includes: Naive, Lightweight Random Indexing (LRI), Cross-Lingual Explicit Semantic Analysis (CLESA), Kernel Canonical Correlation Analysis (KCCA), Distributional Correspondence Indexing (DCI), Poly-lingual Embeddings (MLE and MLE-LSTM), and UpperBound. Among these, CLESA, KCCA, MLE, and MLE-LSTM require the following additional resources:
- CLESA: the class CLESAPolylingualClassifier requires a processed version of a Wikipedia dump; see section Datasets for more information.
- KCCA: the class KCCAPolylingualClassifier also requires a processed version of Wikipedia. KCCA is built on top of a wrapper of pyrcca from the article Regularized kernel canonical correlation analysis in Python. If you intend to run KCCA, you should first fork the aforementioned project and make it accessible at the root of this project.
- MLE: the class PolylingualEmbeddingsClassifier uses the multilingual embeddings from the article Word Translation without Parallel Data, which can be downloaded from the MUSE repo (a loading sketch follows this list).
- MLE-LSTM: is implemented in LSTMclassifierKeras.py and requires:
- The availability of the polylingual embeddings (as in MLE).
- A Keras installation.
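As a reference, the MUSE embeddings come as plain-text .vec files in word2vec-like format (a header line followed by one word and its vector per line). A minimal loading sketch (file names and paths below are purely illustrative) could look like this:

import numpy as np

def load_muse_embeddings(path, max_words=None):
    """Load a MUSE .vec file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        next(f)  # skip the '<vocab_size> <dim>' header line
        for i, line in enumerate(f):
            if max_words is not None and i >= max_words:
                break
            word, *values = line.rstrip().split(' ')
            embeddings[word] = np.asarray(values, dtype=np.float32)
    return embeddings

# e.g., en_vectors = load_muse_embeddings('embeddings/wiki.multi.en.vec')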
The datasets we used to run our experiments include:
- RCV1/RCV2: a comparable corpus of Reuters news stories
- JRC-Acquis: a parallel corpus of legislative texts of the European Union
The datasets need to be built before running any experiment. This process requires downloading, parsing, preprocessing, splitting, and vectorizing. The datasets we generated and used in our experiments can be downloaded directly (in vector form) from here. Note that some methods (e.g., the PLE and PLE-LSTM methods) might require the original documents in raw form, which we are not allowed to distribute. The tools we used to build the datasets are also available in this repo and are explained below (feel free to skip the rest of this section if the pre-built version suits your needs).
The dataset generation relies on NLTK for text preprocessing. Make sure you have NLTK installed and you have downloaded the packages needed for enabling stopword removal and stemming (via SnowballStemmer) before building the datasets.
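For example:

import nltk

# stopword lists used for stopword removal
nltk.download('stopwords')

# the SnowballStemmer ships with NLTK itself and requires no extra download
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')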
A multilingual dump of Wikipedia is required during the generation of the datasets for the CLESA and KCCA baselines (see section Baselines). If you are not interested in running CLESA or KCCA, you can simply bypass this requirement by setting max_wiki=0 before running the script. Otherwise, you will have to go through the documentation, which contains some tools and explanations on how to prepare the Wikipedia dump (external tools might be required).
We adapted the Wikipedia_Extractor to extract a comparable set of documents for all of the 11 languages involved in our experiments. Technical details and ad-hoc tools can be found in wikipedia_tools.py (in this repo). The toolkit allows:
- Simplifying the (huge) JSON dump file
- Processing the JSON file as a stream and filtering out documents that do not satisfy certain conditions (e.g., documents lacking a version in all of the specified languages)
- Extracting clean versions of documents (see the Wikipedia_Extractor for more information)
- Creating multilingual maps of comparable documents, and pickling them for faster usage
The dataset splits are built once and for all using the dataset_builder.py script and then pickled for fast subsequent runs. JRC-Acquis is automatically downloaded the first time. RCV1/RCV2, despite being public, cannot be downloaded without formal permission. Please refer to RCV1's site and RCV2's site before proceeding.
Once the corpora are locally available, this script preprocesses the documents and vectorizes them; 10 random splits are generated for experimental purposes. The list of ids we ended up using is accessible (in pickle format) here.
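Each pickled split can then be loaded directly. A minimal sketch (the structure of the unpickled object is defined by dataset_builder.py and is not detailed here):

import pickle

path = '../Datasets/RCV2/rcv1-2_nltk_trByLang1000_teByLang1000_processed_run0.pickle'
with open(path, 'rb') as f:
    dataset = pickle.load(f)
print(type(dataset))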
Most of the experiments were run using the script polylingual_classification.py. This script can be run with different command line arguments to reproduce all multilabel experiments (with the exception of PLE-LSTM, see below).
Run it with -h or --help to display the following help message:
Usage: polylingual_classification.py [options]
Options:
-h, --help show this help message and exit
-d DATASET, --dataset=DATASET
Path to the multilingual dataset processed and stored
in .pickle format
-m MODE, --mode=MODE Model code of the polylingual classifier, valid ones
include ['fun-kfcv', 'fun-tat', 'naive', 'lri',
'clesa', 'kcca', 'dci', 'ple', 'upper', 'fun-mono']
-o OUTPUT, --output=OUTPUT
Result file
-n NOTE, --note=NOTE A description note to be added to the result file
-c, --optimc Optimize hyperparameters
-b BINARY, --binary=BINARY
Run experiments on a single category specified with
this parameter
-L LANG_ABLATION, --lang_ablation=LANG_ABLATION
Removes the specified language from the training set
-f, --force Run even if the result was already computed
-j N_JOBS, --n_jobs=N_JOBS
Number of parallel jobs (default is -1, all)
-s SET_C, --set_c=SET_C
Set the C parameter
-r KCCAREG, --kccareg=KCCAREG
Set the regularization parameter for KCCA
-w WE_PATH, --we-path=WE_PATH
Path to the polylingual word embeddings (required only
if --mode polyembeddings)
-W WIKI, --wiki=WIKI Path to Wikipedia raw documents
--calmode=CALMODE Calibration mode for the base classifiers (only for
class-based models). Valid ones are 'cal' (default,
calibrates the base classifiers and uses predict_proba
to project), 'nocal' (does not calibrate, uses the
decision_function to project), and 'sigmoid' (does not
calibrate, uses the sigmoid of the decision function to
project)
For example, the following command will produce the results for Fun(TAT) on the first random split of the RCV1/RCV2 dataset, optimizing the C parameter of the first-tier SVM classifiers:
$> python polylingual_classification.py -d "../Datasets/RCV2/rcv1-2_nltk_trByLang1000_teByLang1000_processed_run0.pickle" -o ./results.csv --mode fun-tat --optimc
Once the experiment is over, a summary of the results is displayed on the standard output:
evaluation (n_jobs=-1)
Lang nl: macro-F1=0.540 micro-F1=0.829
Lang es: macro-F1=0.582 micro-F1=0.843
Lang fr: macro-F1=0.499 micro-F1=0.765
Lang en: macro-F1=0.528 micro-F1=0.764
Lang sv: macro-F1=0.540 micro-F1=0.775
Lang it: macro-F1=0.511 micro-F1=0.789
Lang da: macro-F1=0.490 micro-F1=0.797
Lang pt: macro-F1=0.706 micro-F1=0.879
Lang de: macro-F1=0.416 micro-F1=0.741
Averages: MF1, mF1, MK, mK [0.53464632 0.79803785 0.5088316 0.75633335]
The complete record of the experiment is saved in the result file, which can be consulted with Pandas. For example, the following snippet will display the results for all languages:
import pandas as pd
results = pd.read_csv('results.csv', sep='\t')
pd.pivot_table(results, index=['method', 'lang'], values=['microf1','macrof1','microk','macrok'])
Out[11]:
macrof1 macrok microf1 microk
method lang
fun-tat da 0.490002 0.455626 0.796524 0.742877
de 0.415858 0.394820 0.741391 0.698547
en 0.528280 0.488883 0.764349 0.716628
es 0.581849 0.577447 0.842697 0.823296
fr 0.499307 0.477912 0.764876 0.704686
it 0.510546 0.471447 0.788944 0.751368
nl 0.540213 0.510137 0.828782 0.789040
pt 0.705507 0.698201 0.879412 0.850810
sv 0.540255 0.505011 0.775367 0.729749
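The per-language scores can also be averaged for each method directly from the same dataframe, e.g.:

results.groupby('method')[['macrof1', 'microf1', 'macrok', 'microk']].mean()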
The code to run PLE-LSTM is implemented in LSTMclassifierKeras.py. Note that you need the raw version of the documents to run it (see the Datasets section).
Other scripts used include:
- monolingual_classification.py runs the multilabel monolingual experiments.
- binary_classification.py runs the binary polylingual experiments.
- crosslingual_classification.py generates the learning curves simulating under-resourced languages.
- funemb_classification.py runs experiments using Fun(TAT)-PLE.