Recogito Tutorial: Download Options for Text

valeriavitale edited this page Jan 24, 2020 · 41 revisions

One of the benefits of using Recogito is that you can download the annotations created online in a number of formats, which can then be stored, analysed, visualised and post-processed with other applications. Download options can be accessed from the icon on the bar above the document.

  • ANNOTATIONS:
    • CSV: stands for “comma-separated values”, a simple spreadsheet format compatible with software like Excel or Numbers, as well as with a number of other applications, from QGIS to OpenRefine. The information generated during the annotation process is organised in 13 columns:
      • UUID: this is an automatically generated ID for each annotation. This implies that all annotations are citable in a non-ambiguous way.
      • FILE: this field stores the name and extension of the source file you have annotated, as it was uploaded to Recogito. It is particularly useful in the case of multiple-file uploads.
      • QUOTE TRANSCRIPTION: this field shows the actual word or words that have been highlighted and annotated in the text.
      • ANCHOR: this value refers to the position of the annotation in the text. It is calculated by counting the number of characters from the very first one in the digital text. This value is, therefore, not absolute like a standard citation, but dependent on the specific version of the file that has been uploaded.
      • TYPE: this field shows the category of the annotation, as selected from the annotation pop-up (or as automatically assigned by the NER algorithm). The possible values are “place”, “person”, or “event”. If no type has been selected by the user, this field remains empty.
      • URI: if the annotation has been associated with an entry in one of Recogito’s gazetteers, the URI will appear in this field.
      • VOCAB_LABEL: if the annotation has been resolved against an entry in a gazetteer, Recogito will inherit from the gazetteer the official name or label that is associated with the place.
      • LAT: if the annotation has been resolved against an entry in a gazetteer that has a location reference expressed in coordinates, Recogito will inherit the value of the latitude from the gazetteer.
      • LNG: if the annotation has been resolved against an entry in a gazetteer that has a location reference expressed in coordinates, Recogito will inherit the value of the longitude from the gazetteer.
      • PLACE_TYPE: if the annotation has been resolved against an entry in a gazetteer, Recogito will inherit from the gazetteer the “type” label that is attached to the place (for example “settlement” or “river” or “mountain”).
      • VERIFICATION_STATUS: this value identifies the place annotations that have been produced and checked by users (verified) and those that have been generated by automatic annotation (unverified). The third possible value is “not_identifiable”, and it applies when the user declares that the place mentioned in the text cannot be found in any of the available gazetteers (see Creating Place Annotations).
      • TAGS: this field lists all the tags that the user has created for each annotation, in the same order as in the original annotation, separated by commas.
      • COMMENTS: this field stores the comments for each annotation. It becomes especially useful when the “comments” field in the annotation interface is used to store external URIs (like Wikidata identifiers) or manually entered coordinates.
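As a minimal sketch of how the CSV export could be post-processed, the Python snippet below filters verified place annotations that carry coordinates. The sample rows are invented for illustration, and the header names (in particular `QUOTE_TRANSCRIPTION` with an underscore) and the lower-case status values are assumptions; check them against the header row of your own download.

```python
import csv
import io

# Hypothetical sample data; header names are assumed to mirror the 13 columns
# described above and may differ slightly in a real export.
SAMPLE_CSV = """\
UUID,FILE,QUOTE_TRANSCRIPTION,ANCHOR,TYPE,URI,VOCAB_LABEL,LAT,LNG,PLACE_TYPE,VERIFICATION_STATUS,TAGS,COMMENTS
a1,odyssey.txt,Ithaca,120,place,http://example.org/places/1,Ithaca,38.4,20.7,island,verified,"home,island",
a2,odyssey.txt,Troy,342,place,,,,,,unverified,,
a3,odyssey.txt,Odysseus,15,person,,,,,,,,
"""


def verified_places(csv_text):
    """Return (quote, lat, lng) for verified place rows that have coordinates."""
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Compare case-insensitively, since the exact casing is an assumption.
        is_place = row["TYPE"].strip().lower() == "place"
        is_verified = row["VERIFICATION_STATUS"].strip().lower() == "verified"
        if is_place and is_verified and row["LAT"] and row["LNG"]:
            result.append(
                (row["QUOTE_TRANSCRIPTION"], float(row["LAT"]), float(row["LNG"]))
            )
    return result


if __name__ == "__main__":
    for quote, lat, lng in verified_places(SAMPLE_CSV):
        print(quote, lat, lng)
```

To use it on a real download, read the file's contents (for example with `open(path, encoding="utf-8", newline="").read()`) and pass the text to `verified_places`.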

      Recogito’s annotations can also be downloaded in three different lightweight linked data formats that are both human and machine readable:

    • JSON-LD: based on the JSON format and very useful in programming environments.
    • RDF-TURTLE: a compact RDF syntax where all the information is expressed via URLs.
    • RDF-XML: an XML-based RDF syntax, based on the Web Annotation Data Model.
    • KML: stands for Keyhole Markup Language, a standard XML notation that is specific to places. It is also the file format used in Google Earth.

  • PLACES
    • GeoJSON: by its nature, only available for place annotations, which will be encoded as a FeatureCollection.
    • KML: the places, in a format compatible with tools like Google Earth or other geospatial visualization platforms.
    • RELATIONS: if you have created Relations annotations, you will be able to download them and visualise them as network graphs in other applications such as Gephi. The relations will be available as a simple CSV with three columns: one for the starting point of the relation (from_quote), one for the name of the relation (relation) and one for the target of the relation (to_quote). The annotations can also be downloaded as “nodes” and “edges” in two separate spreadsheets. It is important to note that, in Recogito, each annotation is an independent node. If the entity “Rome” is annotated twice in the text, each occurrence of “Rome” will receive a different ID and appear as a separate node. To obtain meaningful visualisations, it is necessary to consolidate the data first. More on the workflow with Gephi can be found in the Recogito Tips&Tricks.
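The consolidation step mentioned above can be sketched in Python as follows. This is only one possible approach, assuming the three-column relations CSV (from_quote, relation, to_quote) and treating annotations with identical quote text as the same entity; the sample rows are invented. Merging on the quote alone will also merge distinct entities that share a name, so a gazetteer URI would be a safer key where one is available.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the three-column layout described above.
SAMPLE_RELATIONS = """\
from_quote,relation,to_quote
Aeneas,travels to,Rome
Aeneas,travels to,Rome
Romulus,founds,Rome
"""


def consolidate(csv_text):
    """Merge identical quotes into one node and count repeated edges as weights."""
    nodes = set()
    edge_weights = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = row["from_quote"].strip()
        target = row["to_quote"].strip()
        nodes.update((source, target))
        # The same (source, relation, target) triple seen twice gets weight 2.
        edge_weights[(source, row["relation"].strip(), target)] += 1
    return sorted(nodes), dict(edge_weights)


if __name__ == "__main__":
    node_list, edges = consolidate(SAMPLE_RELATIONS)
    print(node_list)
    for (source, relation, target), weight in edges.items():
        print(source, relation, target, weight)
```

The resulting node list and weighted edges can then be written back out as two spreadsheets for import into a tool such as Gephi.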

  • ANNOTATED DOCUMENT:
    • TEI/XML: You can download annotations performed on text files as valid XML-TEI. The text will be converted to XML and formatted in generic TEI. The basic metadata will be embedded as TEI tags in the <teiHeader>. Place annotations will be exported as <placeName>, with the annotation’s unique identifier as the “xml:id” attribute and the gazetteer URI as the “ref” attribute (when present). The level of certainty is expressed with the “cert” attribute, which accepts two values: “high” (for verified annotations) and “low” (for automatic annotations). Person annotations will be encoded as <persName>, with the unique Recogito identifier as xml:id. When present, tags will also be included in the TEI export through the “ana” attribute of the corresponding annotations. Overlapping annotations will be discarded, as they are not supported by the XML standard.

      Relations will also be included in the TEI/XML document, wrapped in a <listRelation> tag, with “active” and “passive” attributes that identify the two components and the direction of the relation, and a “name” attribute holding the name that was assigned to the relation during annotation.
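To illustrate how the attributes described above might be read back, here is a small Python sketch that collects the <placeName> elements from a TEI document. The sample document is hand-written for this example and is not actual Recogito output; it assumes the export uses the standard TEI namespace, so adjust the lookup if your file differs.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace (assumed here)
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

# Hand-written illustration, not real Recogito output.
SAMPLE_TEI = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>
    <placeName xml:id="a1" ref="http://example.org/places/1" cert="high" ana="harbour">Ostia</placeName>
    lies near <placeName xml:id="a2" cert="low">Rome</placeName>,
    said <persName xml:id="a3">Livy</persName>.
  </p></body></text>
</TEI>
"""


def place_names(tei_text):
    """Return one dict per <placeName> with its id, text, ref and cert values."""
    root = ET.fromstring(tei_text)
    places = []
    for element in root.iter(f"{{{TEI_NS}}}placeName"):
        places.append({
            "id": element.get(XML_ID),
            "text": (element.text or "").strip(),
            "ref": element.get("ref"),    # None when no gazetteer URI is present
            "cert": element.get("cert"),  # "high" or "low", as described above
        })
    return places


if __name__ == "__main__":
    for place in place_names(SAMPLE_TEI):
        print(place)
```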

  • Other: we keep adding new export formats to support a growing number of uses. Most recently, we have added support for IOB, a format which allows you to use tagged plaintext documents as input to train your own Named Entity machine learning models.
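For readers unfamiliar with IOB tagging, the Python sketch below groups tagged tokens back into entities. It assumes one token and its tag per line, separated by whitespace, with `B-` marking the beginning of an entity, `I-` its continuation and `O` everything else; this page does not specify the exact layout or label names of Recogito's IOB export, so the sample lines and labels are hypothetical.

```python
# Hypothetical IOB lines: "<token> <tag>"; label names are illustrative only.
SAMPLE_IOB = [
    "Aeneas B-PERSON",
    "sailed O",
    "to O",
    "New B-LOCATION",
    "Carthage I-LOCATION",
    ". O",
]


def iob_entities(lines):
    """Group B-/I- tagged tokens into (entity text, label) pairs."""
    entities = []
    tokens, label = [], None

    def flush():
        # Close the entity being built, if any.
        if tokens:
            entities.append((" ".join(tokens), label))

    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, tag = parts[0], parts[-1]
        if tag.startswith("B-"):
            flush()
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
        else:
            # "O" tags (and stray I- tags) end the current entity.
            flush()
            tokens, label = [], None
    flush()
    return entities


if __name__ == "__main__":
    print(iob_entities(SAMPLE_IOB))
```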