Encoding USAS in TEI #827

TomazErjavec · 2023-11-12T10:51:57Z

This issue discusses the unresolved problems from #202 and #204. The current encoding of USAS in TEI is given in the guidelines and is arguably OK, even though other possibilities exist (in particular stand-off markup, which avoids problems with crossing XML tags but makes resolving the annotations more complicated). Also, it is not yet clear whether retaining per-word USAS tags is sensible in the context of MWEs. These dilemmas should be resolved here.

The conversion of CoNLL-U with USAS tags into TEI is done by the conllu2tei.pl script. This script is badly written: it first just inserts <name> and <phr> into a temporary TEI and only afterwards tries to resolve conflicts, and it does so poorly, i.e. it removes <phr> elements even in cases where it shouldn't, in particular phr/name, (arguably) name/phr, and when a phr is adjacent to a name, which is a definite bug. How to improve the script should also be discussed here.
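To make the cases above concrete, here is a minimal sketch in Python (not the Perl of conllu2tei.pl) of how the relation between a <phr> and a <name> could be classified before deciding whether to drop the <phr>. The token-index representation and the names `relation` and `keep_phr` are assumptions made for illustration, not code from the script; treating identical spans as nestable is also just one possible choice.

```python
from typing import Tuple

# Hypothetical representation: inclusive (first, last) token indices in a sentence.
Span = Tuple[int, int]

def relation(phr: Span, name: Span) -> str:
    """Classify how a <phr> span relates to a <name> span."""
    p0, p1 = phr
    n0, n1 = name
    if p1 < n0 or n1 < p0:
        # No shared tokens: spans that merely touch do not conflict.
        return "adjacent" if p1 + 1 == n0 or n1 + 1 == p0 else "disjoint"
    if (p0, p1) == (n0, n1):
        return "identical"
    if n0 <= p0 and p1 <= n1:
        return "phr_in_name"  # name/phr: the phr nests inside the name
    if p0 <= n0 and n1 <= p1:
        return "name_in_phr"  # phr/name: the name nests inside the phr
    return "crossing"         # partial overlap, cannot be nested in XML

def keep_phr(phr: Span, name: Span) -> bool:
    """Only a true crossing forces the <phr> to be removed."""
    return relation(phr, name) != "crossing"
```

Under this sketch a <phr> over tokens 0-1 next to a <name> over tokens 2-3 is classified as adjacent and kept; only partially overlapping spans are reported as crossing.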


TomazErjavec · 2023-11-12T10:57:59Z

Here is the alternative proposal on how to encode USAS in TEI:

<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
  <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
    <w xml:id="tok01" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr.">Mr.</w>
    <w xml:id="tok02" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" join="right">President</w>
    <pc xml:id="tok03" pos="Z" msd="UPosTag=PUNCT" join="right">-</pc>
    <!-- ... -->
    <spanGrp type="sem">
      <span target="#tok01 #tok02" type="Z1mf,Z3c" ana="sem:Z1"/>
      <span target="#tok03" type="Z9" ana="sem:Z9"/>
    </spanGrp>
  </s>
</seg>

My objections would be that:

it introduces a completely new construct for linguistic analysis, so far not used in ParlaMint or Parla-CLARIN
it just postpones the problem of conflict with <name> to whatever application or conversion will try to use both

So, I would advocate sticking to the current encoding but fixing the conversion script to do a better job.
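For reference, this is roughly what an application would have to do to resolve the stand-off spans of the alternative proposal back to tokens. It is only a sketch in Python using the standard library: the sample is a cut-down version of the example above with a shortened, hypothetical sentence ID and no TEI namespace (real ParlaMint files declare one, so the tag names would need qualifying), and `resolve_spans` is a name made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Cut-down sample; the sentence ID and the missing namespace are simplifications.
SAMPLE = """
<s xml:id="s1">
  <w xml:id="tok01">Mr.</w>
  <w xml:id="tok02">President</w>
  <pc xml:id="tok03">-</pc>
  <spanGrp type="sem">
    <span target="#tok01 #tok02" type="Z1mf,Z3c" ana="sem:Z1"/>
    <span target="#tok03" type="Z9" ana="sem:Z9"/>
  </spanGrp>
</s>
"""

def resolve_spans(sentence):
    """Return (ana, [token texts]) for every <span> in the sentence."""
    # Index the direct <w> and <pc> children by their xml:id.
    tokens = {el.get(XML_ID): el.text for el in sentence if el.tag in ("w", "pc")}
    resolved = []
    for span in sentence.iter("span"):
        # The target attribute is a space-separated list of "#id" pointers.
        ids = [ref.lstrip("#") for ref in span.get("target", "").split()]
        resolved.append((span.get("ana"), [tokens[i] for i in ids]))
    return resolved

spans = resolve_spans(ET.fromstring(SAMPLE.strip()))
```

Note that the sketch only recovers which tokens each span covers; any conflict between these spans and inline <name> elements would still have to be handled by the consuming application, which is the second objection above.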

TomazErjavec added the bug and enhancement labels Nov 12, 2023

TomazErjavec added this to the Future milestone Nov 12, 2023

TomazErjavec assigned matyaskopp and TomazErjavec Nov 12, 2023

This was referenced Nov 12, 2023

Annotating words with USAS #204

Closed

USAS taxonomy #202

Closed
