Metrics implementation (#27)
* Test for actions. (main unchanged)

* Code formatting.

* Jury metrics added.

* Test for actions. (main unchanged)

* Code formatting.

* Jury metrics added.

* JuryMetric class added and registered.

* Metric class implemented, supported through Jury.
- DataAdapter reworked to require a "split" parameter on call to ensure `label_list` contains validation samples (to be evaluated). Usage is seamless for the user (nothing additional required); only test cases changed.
- Test cases changed accordingly.

* QA Adapter: appended answers changed to a list from a str.

* jury added to requirements.txt (>=2.0.0).

* Update requirements.txt

* MetadataHandler introduced to save metadata from samples required for metric computation.

* (Planning phase: no tests implemented yet.) MetadataHandler component implemented.
- Test cases changed accordingly.
- Associated components changed accordingly.

* Small refactors for metadata handlers.

* Updates from main branch (conflicts resolved).
- Metadata handlers are introduced to handle metadata for metric evaluation (for tasks that require additional metadata for the final prediction output).
- Bug fix: `qa_id` for FilteredInstance is changed to type str (from int) on question_answering_processor.py.
- New requirement: jury>2.1.0.

* Code formatting.

* MetadataHandlerForPosTagging added for pos-tag example.
- Seqeval class is removed from metrics.py.
- Docstring added for MetadataHandler.

* Code formatting.

* README.md updated.
- Unused imports removed.

* Updates from main.
- Name changed to metric handler.

* Required changes for metric handler.

* Test for MetricHandler (default implementation) added.

* Metric handler test fixtures.

* trapper.__init__ updated.
- README.md updated.

* Question answering notebook updated.

* Unused class removed.

* Updates from the reviews.
- README.md updated.

* setup.py updated.

* setup.py corrected.

* Updates from review.

* README.md update.

* README.md update.

* Updates from review.

* Updates from review.

* Updates from reviews.
- Unused import removed in question_answering_adapter.py.
- Docstring added to question_answering_handler.py.
- Parts regarding metric handler updated in README.md.
- version updated to "0.0.5" from "0.0.4".

* Updated structure of metric handlers.
- MetricHandler is divided into MetricInputHandler and MetricOutputHandler, for handling metric inputs and manipulating the resulting output of metric computation, respectively.
- README.md updates.
- Docstrings updated.
- Test cases updated according to the changes.

* Version set back to 0.0.4 (from 0.0.5).

* Requested changes from review.

* Add `Why You Should Use Trapper` section to the README.md

* Update the `Why You Should Use Trapper` section

* Minor update

Co-authored-by: cemilcengiz <[email protected]>
devrimcavusoglu and cemilcengiz authored Nov 9, 2021
1 parent a45ce70 commit 13a2829
Showing 47 changed files with 744 additions and 244 deletions.
131 changes: 104 additions & 27 deletions README.md
@@ -13,7 +13,8 @@
</p>

Trapper is an NLP library that aims to make it easier to train transformer based
models on downstream tasks. It wraps the HuggingFace's `transformers` library to
models on downstream tasks. It
wraps [huggingface/transformers](http://github.com/huggingface/transformers) to
provide the transformer model implementations and training mechanisms. It defines
abstractions with base classes for common tasks encountered while using transformer
models. Additionally, it provides a dependency-injection mechanism and allows
@@ -24,13 +25,34 @@ changing the existing code. These features foster code reuse, less boiler-plate
code, as well as repeatable and better documented training experiments which is
crucial in machine learning.

## Why You Should Use Trapper

- You have been a `transformers` user for quite some time now. However, you started
to feel that some computation steps could be standardized through new
abstractions. You wish to reuse the scripts you write for data processing,
post-processing, etc. with different models/tokenizers easily. You would like to
separate the code from the experiment details, mix and match components through
configuration files while keeping your codebase clean and free of duplication.


- You are an `AllenNLP` user who is really happy with the dependency-injection
system, well-defined abstractions and smooth workflow. However, you would like to
use the latest transformer models without having to wait for the core developers
to integrate them. Moreover, the `transformers` community is scaling up rapidly,
and you would like to join the party while still enjoying an `AllenNLP` touch.


- You are an NLP researcher / practitioner, and you would like to give a shot to a
library aiming to support state-of-the-art models along with datasets, metrics and
more in unified APIs.

## Key Features

### Compatibility with HuggingFace Transformers

**trapper extends transformers!**

We implement the trapper components by trying to use the available components of the
While implementing the components of trapper, we try to reuse the classes from the
transformers library as much as we can. For example, trapper uses the models and
the trainer as they are in transformers. This makes it easy to use the models
trained with trapper on other projects or libraries that depend on transformers
@@ -42,46 +64,60 @@ pipeline (e.g. for training).
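
For example, a checkpoint produced by a trapper training run can be loaded back with the stock `transformers` factories; a minimal sketch, where the checkpoint path is a hypothetical placeholder:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical directory written by a trapper training run.
checkpoint_dir = "outputs/pos_tagging/checkpoint-best"

# Since trapper uses the transformers models and trainer as they are,
# the saved checkpoint can be consumed without importing trapper at all.
model = AutoModelForTokenClassification.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
```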

### Dependency Injection and Training Based on Configuration Files

We use `allennlp`'s registry mechanism to provide dependency injection and enable
reading the experiment details from training configuration files in `json`
We use the registry mechanism of [AllenNLP](http://github.com/allenai/allennlp) to
provide dependency injection and enable reading the experiment details from training
configuration files in `json`
or `jsonnet` format. You can look at the
[allennlp guide on dependency injection](https://guide.allennlp.org/using-config-files)
[AllenNLP guide on dependency injection](https://guide.allennlp.org/using-config-files)
to learn more about how the registry system and dependency injection works as well
as how to write configuration files. In addition, we strongly recommend reading the
remaining parts of the [allennlp guide](https://guide.allennlp.org/)
remaining parts of the [AllenNLP guide](https://guide.allennlp.org/)
to learn more about its design philosophy, the importance of abstractions, etc.
(especially Part 2: Abstraction, Design and Testing). As a warning, please note that
we do not use allennlp's abstractions and base classes in general, which means you
can not mix and match the trapper's and allennlp's components. Instead, we just use
we do not use AllenNLP's abstractions and base classes in general, which means you
can not mix and match the trapper's and AllenNLP's components. Instead, we just use
the class registry and dependency injection mechanisms and only adapt its very
limited set of components, first by wrapping and registering them as trapper
components. For example, we use the optimizers from allennlp since we can
components. For example, we use the optimizers from AllenNLP since we can
conveniently do so without hindering our full compatibility with transformers.

### Full Integration with HuggingFace datasets

In trapper, we officially use the format of the datasets from the HuggingFace's
`datasets` library and provide full integration with it. You can directly use all
datasets published in [datasets hub](https://huggingface.co/datasets) without doing
any extra work. You can write the dataset name and extra loading arguments (if there
are any) in your training config file, and trapper will automatically download the
dataset and pass it to the trainer. If you have a local or private dataset, you can
still use it after converting it to the HuggingFace `datasets` format by writing a
dataset loading script as explained
### Full Integration with HuggingFace Datasets

In trapper, we officially use the format of the datasets
from [datasets](http://github.com/huggingface/datasets) and provide full integration
with it. You can directly use all datasets published
in [datasets hub](https://huggingface.co/datasets) without doing any extra work. You
can write the dataset name and extra loading arguments (if there are any) in your
training config file, and trapper will automatically download the dataset and pass
it to the trainer. If you have a local or private dataset, you can still use it
after converting it to the HuggingFace `datasets` format by writing a dataset
loading script as explained
[here](https://huggingface.co/docs/datasets/dataset_script.html).
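
For instance, the underlying dataset access is just the regular `datasets` API; a short sketch of what trapper effectively does when the config file names a dataset (the local script path below is a placeholder):

```python
import datasets

# Any dataset from the hub can be named in the config; this is the
# equivalent call that downloads and caches it.
squad = datasets.load_dataset("squad")
print(squad["train"][0]["question"])

# A local or private dataset works the same way once it has a loading script.
# my_data = datasets.load_dataset("path/to/my_dataset_loading_script.py")
```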

### Support for Metrics through Jury

Trapper supports the common NLP metrics through
[jury](https://github.com/obss/jury). Jury is an NLP library dedicated to providing
metric implementations by adopting and extending the `datasets` library. For metric
computation during training, you can use jury-style metric
instantiation/configuration in your trapper configuration file to compute metrics
on the fly on the eval dataset with a specified `eval_steps` value. If your desired
metric is not yet available in jury or datasets, you can still create your own by
extending `trapper.Metric` and utilizing either `jury.Metric` or `datasets.Metric`
to handle a larger set of cases on predictions.
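
For reference, the jury API itself looks roughly like the following standalone sketch (metric names and input shapes should be checked against the jury documentation for your installed version):

```python
from jury import Jury

# A minimal, standalone sketch of jury usage, independent of trapper's trainer.
scorer = Jury(metrics=["bleu", "rouge"])
predictions = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]
scores = scorer(predictions=predictions, references=references)
print(scores)
```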

### Abstractions and Base Classes

Following allennlp, we implement our own registrable base classes to abstract away
Following AllenNLP, we implement our own registrable base classes to abstract away
the common operations for data processing and model training.

* Data reading and preprocessing base classes including

- The classes to be used directly: `DatasetReader`, `DatasetLoader`
and `DataCollator`.

- The classes that you may need to extend: `LabelMapper`,`DataProcessor`,
and `DataAdapter`.
- The classes that you may need to extend: `LabelMapper`, `DataProcessor`, and
  `DataAdapter`.

- `TokenizerWrapper` classes utilizing `AutoTokenizer` from transformers are
used as factories to instantiate wrapped tokenizers into which task-specific
@@ -92,8 +128,15 @@ the common operations for data processing and model training.
are used as factories to instantiate the actual task-specific models from the
configuration files dynamically.

* Optimizers from AllenNLP: Implemented as children of the base `Optimizer` class.

* Optimizers from allennlp: Implemented as children of the base `Optimizer` class.
* Metric computation is supported through `jury`. In order to make the metrics
  flexible enough to work with the trainer through a common interface, we
  introduced metric handlers. You may need to extend these classes accordingly (a
  minimal sketch is given right after this list):
    * For converting predictions and references to a suitable form for a
      particular metric or metric set: `MetricInputHandler`.
    * For manipulating the resulting score object containing the metric
      results: `MetricOutputHandler`.
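
A minimal sketch of a custom input handler is shown below; the import path, the registered key, and the method signatures are assumptions made for illustration (check the trapper source for the exact interfaces):

```python
from trapper.metrics import MetricInputHandler  # assumed import path

# Hypothetical handler; "my-task" and the signatures below are illustrative.
@MetricInputHandler.register("my-task")
class MetricInputHandlerForMyTask(MetricInputHandler):
    def _extract_metadata(self, instance) -> None:
        # Store per-instance information (e.g. the context string) that is
        # needed later to turn raw model outputs into metric inputs.
        ...

    def __call__(self, predictions, references):
        # Convert raw model outputs and label ids into the form the
        # configured metric expects.
        return predictions, references
```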

## Usage

@@ -192,7 +235,40 @@ already implemented one that matches your need.
your TokenizerWrapper subclass. Otherwise, you can directly use TokenizerWrapper.


5) **transformers.Pipeline**:
5) **MetricInputHandler**:
   This class is mainly responsible for the preprocessing applied to predictions and
   labels (references), transforming them into a format suitable to be fed into the
   metrics for computation. For example, while using BLEU in a language generation
   task, the predictions and labels need to be converted to a string or a list of
   strings. However, for an extractive question answering task in which the
   predictions are returned as start and end indices pointing to the answer within
   the context, additional information (e.g. the context in that case) may be
   needed; directly returning the start and end indices does not help, and an
   additional operation is required to convert the predictions into the actual
   answers extracted from the context. You can perform this kind of operation
   through `MetricInputHandler`: storing additional information, converting
   predictions and labels to a suitable format, and manipulating the resulting
   score. Furthermore, helper classes (e.g. `TokenizerWrapper`, `LabelMapper`) can
   also be implemented in child classes for the required tasks. This class provides
   the following main functionality:
   * `_extract_metadata()`: This method allows the user to extract metadata from
     dataset instances to be used later for preprocessing predictions and labels in
     the `preprocess()` method.
   * `__call__()`: This method converts predictions and labels into a suitable form
     for metric computation. The default behavior returns predictions and labels
     without manipulation, only applying `argmax()` to the predictions to convert
     the model outputs into prediction inputs for the metrics.

6) **MetricOutputHandler**:
   This class is intended to support manipulating the score object returned by the
   metric computation phase. Jury returns a well-constructed dictionary output for
   all metrics; however, to shorten dictionary items, manipulate the information
   within the output, or add additional information to the score dictionary, this
   class can be extended as desired.


7) **transformers.Pipeline**:
The pipeline mechanism from the transformers library has not been fully
integrated yet. For now, you should check the transformers library to find a
pipeline that is suitable for your needs and does the same pre-processing. If you could
@@ -312,15 +388,16 @@ thanks to configuration file based experiments.
### Training a POS Tagging Model on CONLL2003

Since the transformers library lacks direct support for POS tagging, we added an
[example project](./examples/pos_tagging) that trains a transformer model on `CONLL2003` POS tagging dataset
and perform inference using it. It is a
[example project](./examples/pos_tagging) that trains a transformer model
on the `CONLL2003` POS tagging dataset and performs inference using it. It is a
self-contained project including its own requirements file, therefore you can copy
the folder into another directory to use as a template for your own project. Please
follow its `README.md` to get started.

### Training a Question Answering Model on SQuAD Dataset

You can use the notebook in the [Example QA Project](./examples/question_answering) `examples/question_answering/question_answering.ipynb`
You can use the notebook in
the [Example QA Project](./examples/question_answering) `examples/question_answering/question_answering.ipynb`
to follow the steps while training a transformer model on SQuAD v1.

## Installation
2 changes: 2 additions & 0 deletions examples/pos_tagging/README.md
@@ -223,6 +223,7 @@ the root of the example project (i.e. the `pos_tagging` directory), you can run

```console
cd examples/pos_tagging
export PYTHONPATH=$PYTHONPATH:$PWD
python -m scripts.run_tests
```

@@ -233,6 +234,7 @@ HuggingFace's datasets library using the following command.

```console
cd examples/pos_tagging
export PYTHONPATH=$PYTHONPATH:$PWD
python -m scripts.cache_hf_datasets_fixtures
```

4 changes: 2 additions & 2 deletions examples/pos_tagging/experiments/roberta/experiment.jsonnet
@@ -21,8 +21,8 @@ local save_steps = 292;
},
"data_collator": {"type": "default"},
"model_wrapper": {"type": "token_classification", "num_labels": 47},
"compute_metrics": {"type": "seqeval",
"return_entity_level_metrics": false},
"compute_metrics": {"metric_params": "seqeval"},
"metric_handler": {"type": "pos-tagging"},
"label_mapper": {"type": "conll2003_pos_tagging_example"},
"args": {
"type": "default",
3 changes: 2 additions & 1 deletion examples/pos_tagging/scripts/cache_hf_datasets_fixtures.py
@@ -2,7 +2,8 @@
Caches the tests dataset to HuggingFace's `datasets` library's cache so that the
interpreter can find it when we try to load it through the `datasets` library.
"""
from examples.pos_tagging.src import POS_TAGGING_FIXTURES_ROOT
from src import POS_TAGGING_FIXTURES_ROOT

from trapper.common.testing_utils.hf_datasets_caching import (
renew_hf_datasets_fixtures_cache,
)
2 changes: 1 addition & 1 deletion examples/pos_tagging/scripts/run_tests.py
@@ -2,6 +2,6 @@

if __name__ == "__main__":
sts_tests = shell(
"pytest --cov trapper --cov-report term-missing --cov-report xml -vvv tests"
"pytest --cov src --cov-report term-missing --cov-report xml -vvv tests"
)
validate_and_exit(tests=sts_tests)
4 changes: 2 additions & 2 deletions examples/pos_tagging/src/__init__.py
@@ -1,7 +1,7 @@
from pathlib import Path

from examples.pos_tagging.src import data
from examples.pos_tagging.src.pipeline import ExamplePosTaggingPipeline
from src import data
from src.pipeline import ExamplePosTaggingPipeline

POS_TAGGING_PROJECT_ROOT = Path(__file__).parent.parent.resolve()
POS_TAGGING_TESTS_ROOT = POS_TAGGING_PROJECT_ROOT / "tests"
16 changes: 4 additions & 12 deletions examples/pos_tagging/src/data/__init__.py
@@ -1,12 +1,4 @@
from examples.pos_tagging.src.data.data_adapter import (
ExampleDataAdapterForPosTagging,
)
from examples.pos_tagging.src.data.data_processor import (
ExampleConll2003PosTaggingDataProcessor,
)
from examples.pos_tagging.src.data.label_mapper import (
ExampleLabelMapperForPosTagging,
)
from examples.pos_tagging.src.data.tokenizer_wrapper import (
ExamplePosTaggingTokenizerWrapper,
)
from src.data.data_adapter import ExampleDataAdapterForPosTagging
from src.data.data_processor import ExampleConll2003PosTaggingDataProcessor
from src.data.label_mapper import ExampleLabelMapperForPosTagging
from src.data.tokenizer_wrapper import ExamplePosTaggingTokenizerWrapper
11 changes: 5 additions & 6 deletions examples/pos_tagging/src/pipeline.py
@@ -19,6 +19,11 @@
from typing import List, Optional, Union

import numpy as np

# needed for registering the data-related classes
# noinspection PyUnresolvedReferences
# pylint: disable=unused-import
import src.data
import torch
from tokenizers.pre_tokenizers import Whitespace
from transformers import (
@@ -33,13 +38,7 @@
TokenClassificationArgumentHandler,
)

# needed for registering the data-related classes
# noinspection PyUnresolvedReferences
# pylint: disable=unused-import
import examples.pos_tagging.src.data
from trapper import PROJECT_ROOT
from trapper.data import LabelMapper
from trapper.pipelines.pipeline import create_pipeline_from_checkpoint


class ExamplePosTaggingPipeline(TokenClassificationPipeline):
3 changes: 1 addition & 2 deletions examples/pos_tagging/tests/conftest.py
@@ -1,6 +1,5 @@
import pytest

from examples.pos_tagging.src import POS_TAGGING_FIXTURES_ROOT
from src import POS_TAGGING_FIXTURES_ROOT

# noinspection PyUnresolvedReferences
# pylint: disable=unused-import
12 changes: 3 additions & 9 deletions examples/pos_tagging/tests/test_data_adapter.py
@@ -1,14 +1,8 @@
import pytest
from src.data.data_adapter import ExampleDataAdapterForPosTagging
from src.data.data_processor import ExampleConll2003PosTaggingDataProcessor
from src.data.tokenizer_wrapper import ExamplePosTaggingTokenizerWrapper

from examples.pos_tagging.src.data.data_adapter import (
ExampleDataAdapterForPosTagging,
)
from examples.pos_tagging.src.data.data_processor import (
ExampleConll2003PosTaggingDataProcessor,
)
from examples.pos_tagging.src.data.tokenizer_wrapper import (
ExamplePosTaggingTokenizerWrapper,
)
from trapper.common.constants import IGNORED_LABEL_ID
from trapper.data import InputBatch

9 changes: 2 additions & 7 deletions examples/pos_tagging/tests/test_data_processor.py
@@ -1,11 +1,6 @@
import pytest

from examples.pos_tagging.src.data.data_processor import (
ExampleConll2003PosTaggingDataProcessor,
)
from examples.pos_tagging.src.data.tokenizer_wrapper import (
ExamplePosTaggingTokenizerWrapper,
)
from src.data.data_processor import ExampleConll2003PosTaggingDataProcessor
from src.data.tokenizer_wrapper import ExamplePosTaggingTokenizerWrapper


@pytest.fixture(scope="module")
10 changes: 6 additions & 4 deletions examples/pos_tagging/tests/test_trainer.py
@@ -1,11 +1,12 @@
import datasets
import pytest
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

# needed for registering the data-related classes
# noinspection PyUnresolvedReferences
# pylint: disable=unused-import
import examples.pos_tagging.src
import src
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

from trapper.common import Params
from trapper.data.data_collator import DataCollator
from trapper.training import TransformerTrainer, TransformerTrainingArguments
@@ -35,8 +36,8 @@ def trainer_params(temp_output_dir, temp_result_dir, get_hf_datasets_fixture_pat
},
"data_collator": {},
"model_wrapper": {"type": "token_classification", "num_labels": 47},
"compute_metrics": {"type": "seqeval",
"return_entity_level_metrics": False},
"compute_metrics": {"metric_params": "seqeval"},
"metric_input_handler": {"type": "token-classification"},
"label_mapper": {"type": "conll2003_pos_tagging_example"},
"args": {
"type": "default",
@@ -58,6 +59,7 @@
"save_total_limit": 1,
"metric_for_best_model": "eval_loss",
"greater_is_better": False,
"seed": 100
},
"optimizer": {
"type": "huggingface_adamw",
8 changes: 8 additions & 0 deletions examples/question_answering/experiment.jsonnet
Expand Up @@ -26,6 +26,14 @@ local result_dir = std.extVar("OUTPUT_PATH");
"model_wrapper": {
"type": "question_answering"
},
"metric_input_handler": {
"type": "question-answering"
},
"compute_metrics": {
"metric_params": [
"squad"
]
},
"args": {
"type": "default",
"output_dir": checkpoint_dir,