AnnotationQueryPython provides a suite of composable functions to query annotations stored as a parquet file. This implementation is the Python version of the Scala implementation of AnnotationQuery. While the annotations will typically be generated by popular text analytic tools such as Stanford Core or Genia, the only requirement is that the annotations adhere to the AQAnnotation structure (a dataframe with a specific schema). The underlying implementation leverages Dataframes and Spark (SQL). The Scala implementation provides better performance than the Python implementation because the Python implementation needs to use UDFs. Unfortunately, vectorized UDFs don't support the map type. Once they do, the code will be updated to leverage these performance improvements.
Read the complete API and usage documentation at: https://elsevierlabs-os.github.io/AnnotationQueryPython/
While it might seem odd that there are two different types of Annotations, this was done on purpose. CATAnnotation is the archive format for the annotations and AQAnnotation is the runtime format that is used by AnnotationQuery. The decision to use two separate classes was made to insulate the AnnotationQuery implementation from the archive format, providing flexibility for future optimizations. The AQAnnotation record should never be archived (only the CATAnnotation records).
To use AnnotationQuery, the annotations need to be in a parquet file with the following record structure. We refer to this as a CATAnnotation or CATSchema. The other field provides the option of specifying name-value pair attributes for the annotation. For example, if the annotation has 2 attributes (color=red and size=xl), the other field would have the value color=red&size=xl.
docId: String, // Document Id (PII)
annotSet: String, // Annotation set (such as scnlp, ge)
annotType: String, // Annotation type (such as text, sentence)
startOffset: Long, // Starting offset for the annotation
endOffset: Long, // Ending offset for the annotation
annotId: Long, // Annotation Id
other: Option[String] = None) // Contains any attributes (name-value pairs ampersand delimited)
At runtime, AnnotationQuery uses the following record structure (we refer to this as an AQAnnotation or AQSchema). A utility function is provided to transform a CATAnnotation record into an AQAnnotation record. The name-value pairs defined in the other column (of CATAnnotation) will be converted to a Map with the name-value pairs.
docId: String, // Document Id (PII)
annotSet: String, // Annotation set (such as scnlp, ge)
annotType: String, // Annotation type (such as text, sentence)
startOffset: Long, // Starting offset for the annotation
endOffset: Long, // Ending offset for the annotation
annotId: Long, // Annotation Id
properties: Option[scala.collection.Map[String,String]] = None) // Properties
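The relationship between the two record structures can be sketched in plain Python. This is only an illustration of the mapping described above, not the library's Spark-based implementation: the `cat_to_aq` helper is hypothetical, and the record classes are simple stand-ins that reuse the field names from the two schemas.

```python
from typing import Dict, NamedTuple, Optional


class CATAnnotation(NamedTuple):
    """Archive-format record; field names follow the CATSchema above."""
    docId: str
    annotSet: str
    annotType: str
    startOffset: int
    endOffset: int
    annotId: int
    other: Optional[str] = None


class AQAnnotation(NamedTuple):
    """Runtime-format record; field names follow the AQSchema above."""
    docId: str
    annotSet: str
    annotType: str
    startOffset: int
    endOffset: int
    annotId: int
    properties: Optional[Dict[str, str]] = None


def cat_to_aq(cat: CATAnnotation) -> AQAnnotation:
    """Hypothetical helper: split `other` on '&' and '=' to build the map."""
    properties = None
    if cat.other:
        properties = {}
        for pair in cat.other.split("&"):
            name, _, value = pair.partition("=")
            properties[name] = value
    # The first six fields are identical in both record structures.
    return AQAnnotation(*cat[:6], properties)


example = cat_to_aq(
    CATAnnotation("doc1", "ge", "word", 0, 5, 1, "color=red&size=xl")
)
print(example.properties)  # {'color': 'red', 'size': 'xl'}
```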
We realize that you can't have a Dataframe[AQAnnotation] like you can with a scala/java Dataset[AQAnnotation]. All you can really have is a Dataset[Row]. With that said, whenever we reference a Dataframe[AQAnnotation] in this document, what we are trying to convey is that it is a Dataframe with the fields defined in the AQAnnotation schema described above.
The GetAQAnnotation and GetCATAnnotation utility classes have been developed to create an AQAnnotation from the archive format (CATAnnotation) and vice versa. When creating the AQAnnotation, the ampersand-separated string of name-value pairs in the CATAnnotation other field is mapped to a Map in the AQAnnotation record. To minimize memory consumption and increase performance, you can specify which name-value pairs to include in the Map as well as which ones to decode or lowercase. If you want all name-value pairs to be included in the Map, simply specify a value of ["*"] for the parameter in the function. For more details on the implementation, view the corresponding class for each function in the AQPython Utilities module. For usage examples, view the GetAQAnnotation and GetCATAnnotation classes in the test_utilities module.
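To make the selection, decoding, and lowercasing options more concrete, here is a minimal pure-Python sketch. It is not the library code: `build_properties` and the `lc_props` parameter name are hypothetical, and the sketch assumes that "decode" means URL-decoding the value, which the text above does not spell out.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import unquote_plus


def build_properties(
    other: Optional[str],
    props: Iterable[str] = (),
    decode_props: Iterable[str] = (),
    lc_props: Iterable[str] = (),
) -> Dict[str, str]:
    """Parse 'name=value&name=value' and keep only the requested names.

    A list containing "*" selects every name, mirroring the ["*"] wildcard.
    """
    props, decode_props, lc_props = set(props), set(decode_props), set(lc_props)

    def selected(name: str, names: set) -> bool:
        return "*" in names or name in names

    result: Dict[str, str] = {}
    if not other:
        return result
    for pair in other.split("&"):
        name, _, value = pair.partition("=")
        if not selected(name, props):
            continue  # pairs that are not requested never enter the map
        if selected(name, decode_props):
            value = unquote_plus(value)  # assumption: values are URL-encoded
        if selected(name, lc_props):
            value = value.lower()
        result[name] = value
    return result


other = "orig=New%20York&lemma=New%20York&pos=NNP"
subset = build_properties(
    other, props=["orig", "pos"], decode_props=["orig"], lc_props=["orig"]
)
everything = build_properties(other, props=["*"])
print(subset)      # {'orig': 'new york', 'pos': 'NNP'}
print(everything)  # all three pairs, values left as stored
```

Restricting `props` to the names a query actually needs keeps each map small, which is the memory and performance benefit mentioned above.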
The following functions are currently provided by AnnotationQuery. Since functions return a Dataframe of AQAnnotations, it is possible to nest function calls. For more details on the implementation, view the corresponding class for each function in the AQPython Query module. For usage examples, view the test_query module.
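The nesting idea can be illustrated without Spark. The sketch below uses plain Python lists in place of Dataframes; `filter_type`, `contained_in`, and `before` are hypothetical simplified stand-ins for FilterType, ContainedIn, and Before, and the inclusive boundary comparisons are an assumption rather than a statement about the library's behavior.

```python
from typing import List, NamedTuple, Optional


class Annot(NamedTuple):
    docId: str
    annotSet: str
    annotType: str
    startOffset: int
    endOffset: int
    annotId: int


def filter_type(annots: List[Annot], annot_type: str) -> List[Annot]:
    """Keep annotations whose annotType equals annot_type."""
    return [a for a in annots if a.annotType == annot_type]


def contained_in(a_annots: List[Annot], b_annots: List[Annot]) -> List[Annot]:
    """Keep each A whose span lies inside some B of the same document."""
    return [
        a for a in a_annots
        if any(
            a.docId == b.docId
            and b.startOffset <= a.startOffset
            and a.endOffset <= b.endOffset
            for b in b_annots
        )
    ]


def before(a_annots: List[Annot], b_annots: List[Annot],
           dist: Optional[int] = None) -> List[Annot]:
    """Keep each A that ends before some B starts (same document).

    When dist is given, the gap between A's end and B's start must be
    at most dist characters.
    """
    return [
        a for a in a_annots
        if any(
            a.docId == b.docId
            and a.endOffset <= b.startOffset
            and (dist is None or b.startOffset - a.endOffset <= dist)
            for b in b_annots
        )
    ]


annots = [
    Annot("doc1", "om", "ce:para", 0, 100, 1),
    Annot("doc1", "om", "ce:para", 200, 300, 2),
    Annot("doc1", "ge", "sentence", 10, 40, 3),
    Annot("doc1", "ge", "sentence", 50, 90, 4),
    Annot("doc1", "ge", "sentence", 120, 150, 5),  # outside both paragraphs
    Annot("doc2", "ge", "sentence", 10, 40, 6),    # no paragraph in doc2
]
paras = filter_type(annots, "ce:para")

# Each function returns the same kind of collection it accepts, so calls nest.
contained = contained_in(filter_type(annots, "sentence"), paras)
near = before(contained_in(filter_type(annots, "sentence"), paras), paras, dist=120)

print([a.annotId for a in contained])  # [3, 4]
print([a.annotId for a in near])       # [4]
```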
FilterProperty - Provide the ability to filter a property field with a specified value in a Dataframe of AQAnnotations. A single value or an array of values can be used for the filter comparison.
RegexProperty - Provide the ability to filter a property field using a regex expression in a Dataframe of AQAnnotations.
FilterSet - Provide the ability to filter the annotation set field in a Dataframe of AQAnnotations.
FilterType - Provide the ability to filter the annotation type field in a Dataframe of AQAnnotations.
Contains - Provide the ability to find annotations that contain another annotation. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A that contain B. What that means is the start/end offset for an annotation from A must contain the start/end offset from an annotation in B. We of course have to also match on the document id. We ultimately return the container annotations (A) that meet this criteria. We also deduplicate the A annotations as there could be many annotations from B that could be contained by an annotation in A but it only makes sense to return the unique container annotations. There is also the option of negating the query (think Not Contains) so that we return only A where it does not contain B.
ContainedIn - Provide the ability to find annotations that are contained by another annotation. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A that are contained in B. What that means is the start/end offset for an annotation from A must be contained by the start/end offset from an annotation in B. We of course have to also match on the document id. We ultimately return the contained annotations (A) that meet these criteria. There is also the option of negating the query (think Not ContainedIn) so that we return only A where it is not contained in B.
ContainedInList - Provide the ability to find annotations that are contained by another annotation. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A that are contained in B. What that means is the start/end offset for an annotation from A must be contained by the start/end offset from an annotation in B. We of course have to also match on the document id. We ultimately return a Dataframe with 2 fields where the first field is an annotation from B and the second field is an array of entries from A that are contained in the first entry.
Before - Provide the ability to find annotations that are before another annotation. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A that are before B. What that means is the end offset for an annotation from A must be before the start offset from an annotation in B. We of course have to also match on the document id. We ultimately return the A annotations that meet this criteria. A distance operator can also be optionally specified. This would require an A annotation (endOffset) to occur n characters (or less) before the B annotation (startOffset). There is also the option of negating the query (think Not Before) so that we return only A where it is not before B.
After - Provide the ability to find annotations that are after another annotation. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A that are after B. What that means is the start offset for an annotation from A must be after the end offset from an annotation in B. We of course have to also match on the document id. We ultimately return the A annotations that meet this criteria. A distance operator can also be optionally specified. This would require an A annotation (startOffset) to occur n characters (or less) after the B annotation (endOffset). There is also the option of negating the query (think Not After) so that we return only A where it is not after B.
Between - Provide the ability to find annotations that are before one annotation and after another. The input is 3 Dataframes of AQAnnotations. We will call them A, B and C. The purpose is to find those annotations in A that are before B and after C. What that means is the end offset for an annotation from A must be before the start offset from an annotation in B and the start offset for A must be after the end offset from C. We of course have to also match on the document id. We ultimately return the A annotations that meet these criteria. A distance operator can also be optionally specified. This would require an A annotation (endOffset) to occur n characters (or less) before the B annotation (startOffset) and would require the A annotation (startOffset) to occur n characters (or less) after the C annotation (endOffset). There is also the option of negating the query (think Not Between) so that we return only A where it is not before B nor after C.
Sequence - Provide the ability to find annotations that are before another annotation. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A that are before B. What that means is the end offset for an annotation from A must be before the start offset from an annotation in B. We of course have to also match on the document id. We ultimately return the annotations that meet this criteria. Unlike the Before function, we adjust the returned annotation a bit. For example, we set the annotType to "seq" and we use the A startOffset and the B endOffset. A distance operator can also be optionally specified. This would require an A annotation (endOffset) to occur n characters (or less) before the B annotation (startOffset).
Or - Provide the ability to combine (union) annotations. The input is 2 Dataframes of AQAnnotations. The output is the union of these annotations.
And - Provide the ability to find annotations that are in the same document. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A and B that are in the same document.
MatchProperty - Provide the ability to find annotations (looking at their property) that are in the same document. The input is 2 Dataframes of AQAnnotations. We will call them A and B. The purpose is to find those annotations in A that are in the same document as B and also match values on the specified property.
Preceding - Return the preceding sibling annotations for every annotation in the anchor Dataframe[AQAnnotations]. The preceding sibling annotations can optionally be required to be contained in a container Dataframe[AQAnnotations]. The return type of this function is different from other functions. Instead of returning a Dataframe[AQAnnotation] this function returns a Dataframe[(AQAnnotation,Array[AQAnnotation])].
Following - Return the following sibling annotations for every annotation in the anchor Dataframe[AQAnnotations]. The following sibling annotations can optionally be required to be contained in a container Dataframe[AQAnnotations]. The return type of this function is different from other functions. Instead of returning a Dataframe[AQAnnotation] this function returns a Dataframe[(AQAnnotation,Array[AQAnnotation])].
TokensSpan - Provides the ability to create a string from a list of tokens that are contained in a span. The specified tokenProperty is used to extract the values from the tokens when creating the string. For SCNLP, this tokenProperty could be values like 'orig', 'lemma', or 'pos'. The spans would typically be a SCNLP 'sentence' or could even be things like an OM 'ce:para'. Returns a Dataframe[AQAnnotation] of spans with 3 new properties, all prefixed with the specified tokenProperty value followed by (ToksStr, ToksSpos, ToksEpos). The ToksStr property will be the concatenated string of token property values contained in the span. The ToksSpos and ToksEpos properties help determine the start/end offset for each of the individual tokens in the ToksStr. These helper properties are needed by the function RegexTokensSpan so we can generate accurate start/end offsets based on the str file.
RegexTokensSpan - Provides the ability to apply a regular expression to the concatenated string generated by TokensSpan. For the strings matching the regex, a Dataframe[AQAnnotations] will be returned. The AQAnnotation will correspond to the offsets within the concatenated string containing the match.
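The interplay between TokensSpan and RegexTokensSpan described above can be sketched in plain Python. Everything here is an assumption made for illustration: the helper names, the use of a single space as the token separator, and the representation of the start/end position helpers as dictionaries are not taken from the library, whose actual encoding of these properties may differ.

```python
import re
from typing import Dict, List, Tuple

Token = Tuple[int, int, str]  # (startOffset, endOffset, token property value)


def tokens_span(span: Tuple[int, int], tokens: List[Token]):
    """Join the values of tokens inside span and record where each one lands."""
    parts: List[str] = []
    spos: Dict[int, int] = {}  # index in joined string -> token startOffset
    epos: Dict[int, int] = {}  # index in joined string -> token endOffset
    cursor = 0
    for start, end, value in sorted(tokens):
        if span[0] <= start and end <= span[1]:
            spos[cursor] = start
            cursor += len(value)
            epos[cursor] = end
            parts.append(value)
            cursor += 1  # account for the joining space
    return " ".join(parts), spos, epos


def regex_tokens_span(pattern: str, toks_str: str,
                      spos: Dict[int, int], epos: Dict[int, int]):
    """Return document offsets for matches that align with token boundaries."""
    hits = []
    for m in re.finditer(pattern, toks_str):
        if m.start() in spos and m.end() in epos:
            hits.append((spos[m.start()], epos[m.end()]))
    return hits


# Lemma values for a sentence spanning document offsets 100..125.
lemma_tokens = [
    (100, 103, "the"),
    (105, 109, "cat"),
    (110, 114, "be"),
    (115, 123, "sleep"),
    (123, 124, "."),
]
toks_str, spos, epos = tokens_span((100, 125), lemma_tokens)
hits = regex_tokens_span(r"cat be", toks_str, spos, epos)

print(toks_str)  # the cat be sleep .
print(hits)      # [(105, 114)]
```

The match sits at positions 4 to 10 in the concatenated string but maps back to offsets 105 to 114 in the document, which is why helper positions are needed whenever the token property values differ in length from the original text.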
The following functions have proven useful when looking at AQAnnotations. When displaying an annotation, the starting text for the annotation will begin with a green ">" and end with a green "<". If you use the XMLConcordancer that outputs the original XML (from the OM annotations), the XML tags will be in orange (the XML may not be well-formed). When generating annotations, you sometimes may want to exclude some text. In AQAnnotations, this is done with the excludes property. When an annotation is encountered that has an excludes property, the text excluded will be highlighted in red. For more details on the implementation, view the corresponding class for each function in the AQPython Utilities module. For usage examples, view the test_utilities module.
Concordancer - Output the string of text identified by the AQAnnotation and highlight in 'red' the text that was ignored (excluded).
XMLConcordancer - Output the string of text identified by the AQAnnotation and highlight in 'red' the text that was ignored (excluded). Also add the XML tags (in 'orange') that would have occurred in this string. Note, there are no guarantees that the XML will be well-formed.
OrigPosLemConcordancer - Output the string of text identified by the AQAnnotation (typically a sentence annotation). Below the sentence (in successive rows) output the original terms, parts of speech, and lemma terms for the text identified by the AQAnnotation.
The typical usage pattern will be something like the following. In the below example, we are finding those sentences (identified by Genia) that are contained in a ce:para XML element. For more examples refer to the tests. We assume the AnnotationQueryPython wheel or egg has been installed.
from AQPython.Query import *
from AQPython.Utilities import *
import pyspark
from pyspark.storagelevel import StorageLevel
# Get a SparkSession
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# Set a reasonable value for shuffle partitions (helps with join performance)
spark.conf.set("spark.sql.shuffle.partitions",4)
# Read in some annotations. The below reads in the Original Markup annotations (think XML)
omAnnots = GetAQAnnotations(spark.read.parquet("./tests/resources/om/"),
props=["*"],
decodeProps=["*"],
numPartitions=int(spark.conf.get("spark.sql.shuffle.partitions"))) \
.persist(StorageLevel.DISK_ONLY)
# Read in some more annotations. The below reads in NLP annotations generated by Genia.
geAnnots = GetAQAnnotations(spark.read.parquet("./tests/resources/genia/"),
props=["orig", "lemma", "pos"],
decodeProps=["orig", "lemma"],
numPartitions=int(spark.conf.get("spark.sql.shuffle.partitions"))) \
.persist(StorageLevel.DISK_ONLY)
# Find those sentence annotations contained in a ce:para (XML element)
sentenceParaAnnots = ContainedIn(FilterType(geAnnots, 'sentence'), FilterType(omAnnots, 'ce:para'))
If you need to cite AnnotationQueryPython in your work, please use the following citation:
McBeath, Darin (2019). AnnotationQueryPython [Computer Software]; https://github.com/elsevierlabs-os/AnnotationQueryPython