Classification
Vincze et al.[1] trained a Linear Conditional Random Fields (CRF) model for detecting semantic uncertainty in English. The backbone of a Linear CRF is the principle of Maximum Entropy, which is also the basis for our equivalent Logistic Regression algorithm (from scikit-learn).
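For reference, the shared Maximum Entropy form can be written out as below. These are the standard textbook formulations, not equations taken from the cited work, and the symbols (feature functions $f_k$, weights $\lambda_k$, normalizer $Z$) are notation assumed here for illustration. A logistic regression classifier scores each word's label independently, while a linear-chain CRF uses the same exponential form but normalizes over the whole label sequence and lets features see the previous label.

```latex
% Multinomial logistic regression (maximum entropy classifier), one word at a time:
P(y \mid x) = \frac{\exp\left(\sum_k \lambda_k f_k(y, x)\right)}
                   {\sum_{y'} \exp\left(\sum_k \lambda_k f_k(y', x)\right)}

% Linear-chain CRF over a sentence of T words:
P(y_{1:T} \mid x) = \frac{1}{Z(x)}
    \exp\left(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right)
```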
The features used for training are explained in the aforementioned paper, and we have provided in-depth explanations and examples on our Feature Set page. Features are word-based with sentence-level context taken into consideration. Each word will have between `13` (sentence length of `1`) and `28` (sentence length of `5` or more) features. Features are then aggregated so that every word ends up with the same number of features. A feature can either be present in a word (denoted by `1.0`) or absent (`0.0`).
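The presence/absence encoding can be sketched as follows. This is a minimal illustration, not the project's actual aggregation code: the function name `encode_features` and the feature names in `VOCABULARY` are hypothetical, and mapping each word onto a shared feature vocabulary is only one plausible way to give every word the same number of features.

```python
from typing import Iterable, List, Sequence


def encode_features(present: Iterable[str], vocabulary: Sequence[str]) -> List[float]:
    """Return one value per vocabulary entry: 1.0 if the feature is present, else 0.0."""
    present_set = set(present)
    return [1.0 if name in present_set else 0.0 for name in vocabulary]


# Hypothetical feature names, for illustration only.
VOCABULARY = ["word=may", "word=rain", "prev_word=it", "suffix=ing", "is_capitalized"]

# A word for which only two of the five features fire.
vector = encode_features({"word=may", "prev_word=it"}, VOCABULARY)
print(vector)  # [1.0, 0.0, 1.0, 0.0, 0.0]
```

Because every word is projected onto the same vocabulary, all vectors have the same length regardless of how many features fired for a given word.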
Our model supports the original binary classification from Vincze et al.[1], where a label of `C` indicates certainty and a label of `U` indicates uncertainty. As in the original, a sentence is considered uncertain if any word in that sentence is uncertain.
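The sentence-level rule for the binary case can be sketched as below. The function name `sentence_label_binary` is hypothetical, and the sketch assumes the word-level predictions are already available as a sequence of `C`/`U` strings.

```python
from typing import Sequence


def sentence_label_binary(word_labels: Sequence[str]) -> str:
    """Label a sentence 'U' if any word is uncertain, otherwise 'C'."""
    return "U" if any(label == "U" for label in word_labels) else "C"


print(sentence_label_binary(["C", "U", "C"]))  # U
print(sentence_label_binary(["C", "C"]))       # C
```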
Our work extends this scheme by breaking the `U` class into five distinct classes of semantic uncertainty:
- `E`: Epistemic - the proposition is possible (based on our knowledge of the universe), but its truth-value cannot be determined.
  - Ex: It may be raining.
- `D`: Doxastic - the proposition is assumed to be true or false, but its truth-value cannot be determined.
  - Ex: He believes the Earth is flat.
- `I`: Investigation - the proposition is in the process of having its truth-value determined.
  - Ex: We examined the role of fire in the development of civilization.
- `N`: Condition - the proposition is true or false based on the truth-value of another proposition.
  - Ex: If it rains, we will stay inside.
The fifth class, `U`, is a little different from the other four. Given a sentence, if there are two or more labels of `E`, `D`, `I`, or `N` present, the label with the maximum occurrences is assigned to the sentence. If multiple uncertainty classes have the same maximum occurrences, the sentence is labeled as generally uncertain, `U`.
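The majority rule with a tie fallback can be sketched as below. The function name `sentence_label_multiclass` is hypothetical. The sketch also makes two assumptions that the text does not state explicitly: any word label other than `E`, `D`, `I`, or `N` counts as certain, and a sentence with no uncertain words is labeled `C`.

```python
from collections import Counter
from typing import Sequence

# The four specific uncertainty classes described above.
SPECIFIC_CLASSES = {"E", "D", "I", "N"}


def sentence_label_multiclass(word_labels: Sequence[str]) -> str:
    """Assign the most frequent uncertainty class; fall back to 'U' on a tie."""
    counts = Counter(label for label in word_labels if label in SPECIFIC_CLASSES)
    if not counts:
        return "C"  # assumption: no uncertain words means a certain sentence
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else "U"


print(sentence_label_multiclass(["C", "E", "E", "D"]))  # E
print(sentence_label_multiclass(["E", "D", "C"]))       # U
```

In the first call `E` occurs twice and `D` once, so the sentence is labeled `E`. In the second, `E` and `D` tie at one occurrence each, so the sentence falls back to `U`.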
📃 [1] Vincze, V. (2015). Uncertainty detection in natural language texts (Doctoral dissertation, SZTE).