Feature Set
The features used to train this implementation of the classifier were reverse-engineered from those provided in the Szeged Uncertainty Corpus. They are briefly described below with examples.
Token | Distinct | | |
---|---|---|---|
Features | Dis | Dist | Disti |
Formatted | prefix_3_Dis:1.0 | prefix_4_Dist:1.0 | prefix_5_Disti:1.0 |
Token | Distinct | | |
---|---|---|---|
Features | nct | inct | tinct |
Formatted | suffix_3_nct:1.0 | suffix_4_inct:1.0 | suffix_5_tinct:1.0 |
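
The prefix and suffix tables above suggest that these features are the leading and trailing 3, 4, and 5 characters of the token. A minimal sketch under that assumption follows; `affix_features` is a hypothetical helper name, not a function from the original code. Note that the sample row for `Cells` further down lists no length-5 affixes, so the original implementation evidently applies an additional length check that this sketch does not try to reproduce.

```python
def affix_features(token, lengths=(3, 4, 5)):
    """Return prefix and suffix feature names for one token.

    Assumption: plain slicing, so tokens shorter than n yield the whole token.
    """
    prefixes = [f"prefix_{n}_{token[:n]}" for n in lengths]
    suffixes = [f"suffix_{n}_{token[-n:]}" for n in lengths]
    return prefixes + suffixes


print(affix_features("Distinct"))
```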
The stem/lemma of the two previous tokens, the current token, and the two following tokens.
Substring | Cells | in | Regulating | Cellular | Immunity |
---|---|---|---|---|---|
Relative Index | -2 | -1 | 0 | 1 | 2 |
Features | cell | in | regulate | cellular | immun |
Formatted | lemma_-2_cell | lemma_-1_in | lemma_0_regulate | lemma_1_cellular | lemma_2_immun |
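
The windowed layout above can be sketched with one small helper. This is an illustration only: `window_features` is a hypothetical name, the stems are assumed to be precomputed, and positions that fall outside the sentence are assumed to be skipped, which matches the sample rows at the end of this page.

```python
def window_features(name, values, i, window=2):
    """Emit `name_offset_value` for every in-bounds offset around position i."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(values):  # assumption: out-of-sentence slots are dropped
            feats.append(f"{name}_{offset}_{values[j]}")
    return feats


stems = ["cell", "in", "regulate", "cellular", "immun"]
print(window_features("lemma", stems, 2))
```

Because the helper only depends on the feature name and a list of per-token values, the same sketch would apply to the POS and chunk windows by passing `"pos"` or `"chunk"` with the corresponding tag list.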
A string of characters representing the surface-pattern or shape of a word.
- `A` and `a` denote uppercase and lowercase character sequences, respectively.
- `0` denotes numerical sequences.
- `G` and `g` denote uppercase and lowercase Greek character sequences, respectively.
- `R` and `r` denote uppercase and lowercase Roman numeral sequences, respectively.
- `!` denotes the presence of non-alphanumeric characters.
Substring | in | Regulating | Cellular |
---|---|---|---|
Relative Index | -1 | 0 | 1 |
Features | a | Aa | Aa |
Formatted | pattern_-1_a | pattern_0_Aa | pattern_1_Aa |
The first character in the surface pattern of the current token.[1]
Substring | Regulating |
---|---|
Relative Index | 0 |
Features | Aa |
Formatted | pattern_prefix_A |
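
The surface-pattern scheme and its prefix feature can be sketched as follows. Several details are assumptions rather than facts from the original code: runs of the same character class are collapsed to a single symbol, Greek letters are detected by Unicode range, and a token counts as a Roman numeral only when the whole token matches a numeral regex. That last rule will also match ordinary words such as `mix`, and how the original handles such collisions is not known. The names `surface_pattern` and `pattern_prefix_feature` are hypothetical.

```python
import re
from itertools import groupby

# Assumption: strict Roman numerals, matched against the whole token.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def char_class(ch):
    """Map one character to its pattern symbol."""
    if "\u0391" <= ch <= "\u03a9":  # Greek capital letters
        return "G"
    if "\u03b1" <= ch <= "\u03c9":  # Greek small letters
        return "g"
    if ch.isdigit():
        return "0"
    if ch.isalpha():
        return "A" if ch.isupper() else "a"
    return "!"


def surface_pattern(token):
    """Collapse a token into its surface pattern, e.g. 'Regulating' -> 'Aa'."""
    if token and ROMAN.fullmatch(token):
        return "R"
    if token and token.islower() and ROMAN.fullmatch(token.upper()):
        return "r"
    classes = (char_class(ch) for ch in token)
    return "".join(symbol for symbol, _ in groupby(classes))


def pattern_prefix_feature(token):
    """First symbol of the current token's surface pattern."""
    return f"pattern_prefix_{surface_pattern(token)[:1]}"


print(surface_pattern("Regulating"), pattern_prefix_feature("Regulating"))
```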
Substring | Cells | in | Regulating | Cellular | Immunity |
---|---|---|---|---|---|
Relative Index | -2 | -1 | 0 | 1 | 2 |
Features | NNS | IN | VBG | JJ | NN |
Formatted | pos_-2_NNS | pos_-1_IN | pos_0_VBG | pos_1_JJ | pos_2_NN |
The chunk tags used in Vincze et al.[2] were obtained using the C&C Chunker. Because the C&C Chunker was unavailable to us, we trained our own chunker with NLTK, using NLTK's `treebank_chunk` and `conll2000` corpora.
Substring | Cells | in | Regulating | Cellular | Immunity |
---|---|---|---|---|---|
Relative Index | -2 | -1 | 0 | 1 | 2 |
Features | I-NP | B-PP | B-VP | B-NP | I-NP |
Formatted | chunk_-2_I-NP | chunk_-1_B-PP | chunk_0_B-VP | chunk_1_B-NP | chunk_2_I-NP |
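
The page does not say which chunker architecture was trained. One common baseline is a unigram chunker, which assigns each POS tag the IOB chunk tag it most often carries in the training data. The sketch below is a dependency-free stand-in for that idea, not the NLTK-based chunker described above; the function names are hypothetical and the three training sentences are toy data invented purely for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_chunker(train_sents):
    """Learn the most frequent chunk tag per POS tag.

    train_sents: list of sentences, each a list of (word, pos, chunk) triples.
    """
    counts = defaultdict(Counter)
    for sent in train_sents:
        for _word, pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def chunk_tags(model, pos_tags, default="O"):
    """Tag a POS sequence; unseen POS tags fall back to `default`."""
    return [model.get(pos, default) for pos in pos_tags]


# Toy IOB-tagged sentences (illustrative only, not corpus data).
toy_train = [
    [("The", "DT", "B-NP"), ("cells", "NNS", "I-NP"),
     ("in", "IN", "B-PP"), ("tissue", "NN", "B-NP")],
    [("regulating", "VBG", "B-VP"), ("cellular", "JJ", "B-NP"),
     ("immunity", "NN", "I-NP")],
    [("of", "IN", "B-PP"), ("immune", "JJ", "B-NP"),
     ("response", "NN", "I-NP")],
]

model = train_unigram_chunker(toy_train)
print(chunk_tags(model, ["NNS", "IN", "VBG", "JJ", "NN"]))
```

A chunker trained on real corpora would replace `toy_train` with the IOB triples read from those corpora; the tagging step stays the same.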
8-A) Stems/Lemmas w/ Current Chunk w/ a Window of Length 1[1]
Substring | in | Regulating | Cellular |
---|---|---|---|
Relative Index | -1 | 0 | 1 |
Features | in | B-VP, regulate | cellular |
Formatted | L_-1_in_C_0_B-VP | L_0_regulate_C_0_B-VP | L_1_cellular_C_0_B-VP |
8-B) Stems/Lemmas w/ Current POS w/ a Window of Length 1[1]
Substring | in | Regulating | Cellular |
---|---|---|---|
Relative Index | -1 | 0 | 1 |
Features | in | VBG, regulate | cellular |
Formatted | L_-1_in_P_0_VBG | L_0_regulate_P_0_VBG | L_1_cellular_P_0_VBG |
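
Features 8-A and 8-B both pair each stem in a window of length 1 with a tag of the current token. A minimal sketch of that pairing, assuming out-of-sentence positions are skipped; `combined_features` and its `tag_code` parameter (`"C"` for chunk, `"P"` for POS) are hypothetical names chosen to match the formatted strings above.

```python
def combined_features(stems, tags, i, tag_code, window=1):
    """Pair each stem around position i with the tag of the current token."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(stems):
            # The tag index is always 0: only the current token's tag is used.
            feats.append(f"L_{offset}_{stems[j]}_{tag_code}_0_{tags[i]}")
    return feats


stems = ["cell", "in", "regulate", "cellular", "immun"]
chunks = ["I-NP", "B-PP", "B-VP", "B-NP", "I-NP"]
pos = ["NNS", "IN", "VBG", "JJ", "NN"]

print(combined_features(stems, chunks, 2, "C"))
print(combined_features(stems, pos, 2, "P"))
```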
Unique ID | Raw Token | Stem | POS | Binary Label | Multiclass Label | Features |
---|---|---|---|---|---|---|
sent14token0000 | Cells | cell | NNP | X | X | L_0_cell_C_0_B-NP:1.0 L_0_cell_P_0_NNS:1.0 L_1_in_C_0_B-NP:1.0 L_1_in_P_0_NNS:1.0 chunk_0_B-NP:1.0 chunk_1_B-PP:1.0 chunk_2_B-VP:1.0 lemma_0_cell:1.0 lemma_1_in:1.0 lemma_2_regul:1.0 pattern_0_Aa:1.0 pattern_1_a:1.0 pattern_prefix_A:1.0 pos_0_NNS:1.0 pos_1_IN:1.0 pos_2_VBG:1.0 prefix_3_Cel:1.0 prefix_4_Cell:1.0 suffix_3_lls:1.0 suffix_4_ells:1.0 |
sent14token0001 | in | in | NN | X | X | L_-1_cell_C_0_B-PP:1.0 L_-1_cell_P_0_IN:1.0 L_0_in_C_0_B-PP:1.0 L_0_in_P_0_IN:1.0 L_1_regul_C_0_B-PP:1.0 L_1_regul_P_0_IN:1.0 chunk_-1_B-NP:1.0 chunk_0_B-PP:1.0 chunk_1_B-VP:1.0 chunk_2_B-NP:1.0 lemma_-1_cell:1.0 lemma_0_in:1.0 lemma_1_regul:1.0 lemma_2_cellular:1.0 pattern_-1_Aa:1.0 pattern_0_a:1.0 pattern_1_Aa:1.0 pattern_prefix_a:1.0 pos_-1_NNS:1.0 pos_0_IN:1.0 pos_1_VBG:1.0 pos_2_JJ:1.0 prefix_3_in:1.0 prefix_4_in:1.0 prefix_5_in:1.0 suffix_3_in:1.0 suffix_4_in:1.0 suffix_5_in:1.0 |
sent14token0002 | Regulating | regul | NNP | X | X | L_-1_in_C_0_B-VP:1.0 L_-1_in_P_0_VBG:1.0 L_0_regul_C_0_B-VP:1.0 L_0_regul_P_0_VBG:1.0 L_1_cellular_C_0_B-VP:1.0 L_1_cellular_P_0_VBG:1.0 chunk_-1_B-PP:1.0 chunk_-2_B-NP:1.0 chunk_0_B-VP:1.0 chunk_1_B-NP:1.0 chunk_2_I-NP:1.0 lemma_-1_in:1.0 lemma_-2_cell:1.0 lemma_0_regul:1.0 lemma_1_cellular:1.0 lemma_2_immun:1.0 pattern_-1_a:1.0 pattern_0_Aa:1.0 pattern_1_Aa:1.0 pattern_prefix_A:1.0 pos_-1_IN:1.0 pos_-2_NNS:1.0 pos_0_VBG:1.0 pos_1_JJ:1.0 pos_2_NN:1.0 prefix_3_Reg:1.0 prefix_4_Regu:1.0 prefix_5_Regul:1.0 suffix_3_ing:1.0 suffix_4_ting:1.0 suffix_5_ating:1.0 |
sent14token0003 | Cellular | cellular | NNP | X | X | L_-1_regul_C_0_B-NP:1.0 L_-1_regul_P_0_JJ:1.0 L_0_cellular_C_0_B-NP:1.0 L_0_cellular_P_0_JJ:1.0 L_1_immun_C_0_B-NP:1.0 L_1_immun_P_0_JJ:1.0 chunk_-1_B-VP:1.0 chunk_-2_B-PP:1.0 chunk_0_B-NP:1.0 chunk_1_I-NP:1.0 lemma_-1_regul:1.0 lemma_-2_in:1.0 lemma_0_cellular:1.0 lemma_1_immun:1.0 pattern_-1_Aa:1.0 pattern_0_Aa:1.0 pattern_1_Aa:1.0 pattern_prefix_A:1.0 pos_-1_VBG:1.0 pos_-2_IN:1.0 pos_0_JJ:1.0 pos_1_NN:1.0 prefix_3_Cel:1.0 prefix_4_Cell:1.0 prefix_5_Cellu:1.0 suffix_3_lar:1.0 suffix_4_ular:1.0 suffix_5_lular:1.0 |
sent14token0004 | Immunity | immun | PRP | X | X | L_-1_cellular_C_0_I-NP:1.0 L_-1_cellular_P_0_NN:1.0 L_0_immun_C_0_I-NP:1.0 L_0_immun_P_0_NN:1.0 chunk_-1_B-NP:1.0 chunk_-2_B-VP:1.0 chunk_0_I-NP:1.0 lemma_-1_cellular:1.0 lemma_-2_regul:1.0 lemma_0_immun:1.0 pattern_-1_Aa:1.0 pattern_0_Aa:1.0 pattern_prefix_A:1.0 pos_-1_JJ:1.0 pos_-2_VBG:1.0 pos_0_NN:1.0 prefix_3_Imm:1.0 prefix_4_Immu:1.0 prefix_5_Immun:1.0 suffix_3_ity:1.0 suffix_4_nity:1.0 suffix_5_unity:1.0 |
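
The Features column in the rows above appears to list each feature name with a value of `1.0`, separated by spaces and in plain lexicographic order (uppercase before lowercase, `-` before digits). A short sketch of that serialization, assuming exactly this ordering; `format_features` is a hypothetical helper name.

```python
def format_features(features, value=1.0):
    """Serialize feature names as space-separated `name:value` pairs."""
    # sorted() on str uses code-point order, which matches the sample rows.
    return " ".join(f"{name}:{value}" for name in sorted(set(features)))


print(format_features(["pos_0_VBG", "chunk_-2_B-NP", "chunk_-1_B-PP",
                       "L_0_regul_C_0_B-VP"]))
```
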
[1] This feature is present in the reverse-engineered dataset, but is not described in Vincze et al.[2]

[2] Vincze, V. (2015). Uncertainty detection in natural language texts (Doctoral dissertation, University of Szeged).