AnnotationQueryPython provides a suite of composable functions for querying annotations stored as a parquet file. It is the Python counterpart of the Scala implementation of AnnotationQuery. While the annotations will typically be generated by popular text analytics tools such as Stanford CoreNLP or Genia, the only requirement is that they adhere to the AQAnnotation structure (a dataframe with a specific schema). The underlying implementation leverages Dataframes and Spark SQL. The Scala implementation performs better than the Python implementation because the Python implementation has to use UDFs. Unfortunately, vectorized UDFs do not yet support the map type. Once they do, the code will be updated to take advantage of the resulting performance improvements.
.. toctree::
   :maxdepth: 1
   :caption: Contents:

   api
   usage
   citing
   license
While it might seem odd that there are two different types of annotations, this was done on purpose. CATAnnotation is the archive format for the annotations, and AQAnnotation is the runtime format used by AnnotationQuery. Two separate classes were used to insulate the AnnotationQuery implementation from the archive format, which provides flexibility for future optimizations. AQAnnotation records should never be archived (only CATAnnotation records).
To use AnnotationQuery, the annotations need to be in a parquet file with the following record structure. We refer to this as a CATAnnotation or CATSchema. The other field provides the option of specifying name-value pair attributes for the annotation. For example, if the annotation had two attributes (color=red and size=xl), the other field would have the value color=red&size=xl.
docId: String, // Document Id (PII)
annotSet: String, // Annotation set (such as scnlp, ge)
annotType: String, // Annotation type (such as text, sentence)
startOffset: Long, // Starting offset for the annotation
endOffset: Long, // Ending offset for the annotation
annotId: Long, // Annotation Id
other: Option[String] = None) // Contains any attributes (name-value pairs ampersand delimited)
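As an illustration of the ampersand-delimited format of the other field, here is a minimal sketch in plain Python. The function name ``encode_other`` is hypothetical and not part of AnnotationQueryPython, and the sketch assumes that attribute names and values contain no ``&`` or ``=`` characters.

```python
from typing import Dict, Optional


def encode_other(attrs: Dict[str, str]) -> Optional[str]:
    """Join name-value pairs as name=value, separated by '&'.

    Hypothetical helper for illustration only. Assumes names and
    values contain no '&' or '=' characters.
    """
    if not attrs:
        # No attributes: mirror the schema's default of None.
        return None
    return "&".join(f"{name}={value}" for name, value in attrs.items())


print(encode_other({"color": "red", "size": "xl"}))  # color=red&size=xl
```

An empty attribute dictionary yields ``None``, matching the default value of the other field in the schema above.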
At runtime, AnnotationQuery uses the following record structure (we refer to this as an AQAnnotation or AQSchema). A utility function is provided to transform a CATAnnotation record into an AQAnnotation record. The name-value pairs defined in the other column (of CATAnnotation) will be converted to a Map of those name-value pairs.
docId: String, // Document Id (PII)
annotSet: String, // Annotation set (such as scnlp, ge)
annotType: String, // Annotation type (such as text, sentence)
startOffset: Long, // Starting offset for the annotation
endOffset: Long, // Ending offset for the annotation
annotId: Long, // Annotation Id
properties: Option[scala.collection.Map[String,String]] = None) // Properties
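The relationship between the two record structures can be sketched in plain Python. This is not the library's utility function: the names ``parse_other`` and ``cat_to_aq`` are hypothetical, the records are modeled as named tuples rather than Spark rows, and the parsing assumes simple ``name=value`` pairs separated by ``&`` with no escaping.

```python
from typing import Dict, NamedTuple, Optional


class CATAnnotation(NamedTuple):
    """Archive record, mirroring the CATAnnotation fields above."""
    docId: str
    annotSet: str
    annotType: str
    startOffset: int
    endOffset: int
    annotId: int
    other: Optional[str] = None


class AQAnnotation(NamedTuple):
    """Runtime record, mirroring the AQAnnotation fields above."""
    docId: str
    annotSet: str
    annotType: str
    startOffset: int
    endOffset: int
    annotId: int
    properties: Optional[Dict[str, str]] = None


def parse_other(other: Optional[str]) -> Optional[Dict[str, str]]:
    """Split 'a=1&b=2' into {'a': '1', 'b': '2'} (hypothetical helper)."""
    if not other:
        return None
    props: Dict[str, str] = {}
    for pair in other.split("&"):
        # partition keeps everything after the first '=' as the value.
        name, _, value = pair.partition("=")
        if name:
            props[name] = value
    return props or None


def cat_to_aq(cat: CATAnnotation) -> AQAnnotation:
    """Copy the shared fields and turn 'other' into a properties map."""
    return AQAnnotation(cat.docId, cat.annotSet, cat.annotType,
                        cat.startOffset, cat.endOffset, cat.annotId,
                        parse_other(cat.other))


cat = CATAnnotation("doc1", "scnlp", "sentence", 0, 10, 1, "color=red&size=xl")
print(cat_to_aq(cat).properties)  # {'color': 'red', 'size': 'xl'}
```

The six leading fields are identical in both records; only the last field changes from a delimited string to a map.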
We realize that in Python you can't have a Dataframe[AQAnnotation] the way you can have a Dataset[AQAnnotation] in Scala/Java. All you really have is a Dataset[Row]. With that said, whenever this document refers to a Dataframe[AQAnnotation], it means a Dataframe with the fields defined in the AQAnnotation schema described above.
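Because the Dataframe[AQAnnotation] convention is not enforced by the type system, one way to guard against mistakes is to check column names explicitly. The sketch below is an assumption-laden illustration, not part of AnnotationQueryPython: ``AQ_FIELDS`` and ``missing_aq_fields`` are hypothetical names, and the check covers field names only, not column types.

```python
from typing import Iterable, List

# Field names taken from the AQAnnotation schema described above.
AQ_FIELDS = ["docId", "annotSet", "annotType",
             "startOffset", "endOffset", "annotId", "properties"]


def missing_aq_fields(columns: Iterable[str]) -> List[str]:
    """Return the AQAnnotation field names absent from `columns`.

    Hypothetical helper: checks names only, not column types.
    """
    present = set(columns)
    return [field for field in AQ_FIELDS if field not in present]


# A CATAnnotation-shaped column list lacks the 'properties' field.
cat_columns = ["docId", "annotSet", "annotType",
               "startOffset", "endOffset", "annotId", "other"]
print(missing_aq_fields(cat_columns))  # ['properties']
```

With a PySpark Dataframe ``df``, the column names are available as ``df.columns``, so the check would read ``missing_aq_fields(df.columns)``.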