Feature Set

► Feature Set

The features used to train this implementation of the classifier were reverse-engineered from those provided in the Szeged Uncertainty Corpus. They are briefly described below with examples.

★ Surface Form of Tokens

1) Prefixes of Length 3-5

Token	`Distinct`
Features	`Dis`	`Dist`	`Disti`
Formatted	`prefix_3_Dis:1.0`	`prefix_4_Dist:1.0`	`prefix_5_Disti:1.0`

💾 Code
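
As an illustration only (not the repository's implementation), the prefix features above could be produced with plain string slicing. The function name `prefix_features` and its parameters are hypothetical. How tokens shorter than the prefix length are handled is not specified here, so this sketch simply lets slicing truncate.

```python
def prefix_features(token, min_len=3, max_len=5):
    """Return formatted prefix features for one token.

    Hypothetical helper: each feature looks like 'prefix_<n>_<chars>:1.0'.
    """
    return [f"prefix_{n}_{token[:n]}:1.0" for n in range(min_len, max_len + 1)]


print(prefix_features("Distinct"))
```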

2) Suffixes of Length 3-5

Token	`Distinct`
Features	`nct`	`inct`	`tinct`
Formatted	`suffix_3_nct:1.0`	`suffix_4_inct:1.0`	`suffix_5_tinct:1.0`

💾 Code
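
A matching sketch for suffixes, again with a hypothetical function name (`suffix_features`) and assuming negative slicing is an acceptable way to take the last `n` characters:

```python
def suffix_features(token, min_len=3, max_len=5):
    """Return formatted suffix features for one token.

    Hypothetical helper: each feature looks like 'suffix_<n>_<chars>:1.0'.
    """
    return [f"suffix_{n}_{token[-n:]}:1.0" for n in range(min_len, max_len + 1)]


print(suffix_features("Distinct"))
```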

3) Stems/Lemmas w/ a Window of Length 2

The stem/lemma of the two previous tokens, the current token, and the two following tokens.

Substring	`Cells`	`in`	`Regulating`	`Cellular`	`Immunity`
Relative Index	`-2`	`-1`	`0`	`1`	`2`
Features	`cell`	`in`	`regul`	`cellular`	`immun`
Formatted	`lemma_-2_cell`	`lemma_-1_in`	`lemma_0_regul`	`lemma_1_cellular`	`lemma_2_immun`

💾 Code
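
A minimal sketch of a windowed feature, assuming the stems have already been computed (the values below are copied from the example training data). The helper `window_features` is hypothetical. It skips offsets that fall outside the sentence, which appears consistent with the example training data, where the first token has no negative offsets.

```python
def window_features(values, index, name, size):
    """Format the values around `index` as '<name>_<offset>_<value>:1.0'.

    Hypothetical helper: offsets outside the sentence are skipped.
    """
    features = []
    for offset in range(-size, size + 1):
        j = index + offset
        if 0 <= j < len(values):
            features.append(f"{name}_{offset}_{values[j]}:1.0")
    return features


stems = ["cell", "in", "regul", "cellular", "immun"]
print(window_features(stems, 2, "lemma", 2))  # window around 'regul'
print(window_features(stems, 0, "lemma", 2))  # first token: no negative offsets
```

The same helper would presumably serve any other windowed feature by changing `name` and `size`, for example `"pos"` with a list of POS tags.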

4) Surface Patterns w/ a Window of Length 1

A string of characters representing the surface-pattern or shape of a word.

A and a denote uppercase and lowercase character sequences, respectively.
0 denotes numerical sequences.
G and g denote uppercase and lowercase Greek character sequences, respectively.
R and r denote uppercase and lowercase Roman Numeral sequences, respectively.
! denotes the presence of non-alphanumeric characters.

Substring	`in`	`Regulating`	`Cellular`
Relative Index	`-1`	`0`	`1`
Features	`a`	`Aa`	`Aa`
Formatted	`pattern_-1_a`	`pattern_0_Aa`	`pattern_1_Aa`

💾 Code
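
One possible reading of the rules above, sketched in Python. Several choices here are assumptions rather than documented behaviour: runs of the same character class collapse to one symbol, Greek letters are detected by the Unicode Greek and Coptic block, and a token counts as a Roman numeral only if it is entirely upper- or lowercase and matches a standard numeral regex (so single letters such as `I` or `c` are ambiguous). The names `char_class` and `surface_pattern` are hypothetical.

```python
import re

# Assumed definition of a Roman numeral (1..3999), case-insensitive.
ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})", re.IGNORECASE
)


def char_class(ch):
    """Map one character to a pattern symbol."""
    if "\u0370" <= ch <= "\u03ff" and ch.isalpha():
        return "G" if ch.isupper() else "g"
    if ch.isdigit():
        return "0"
    if ch.isalpha():
        return "A" if ch.isupper() else "a"
    return "!"


def surface_pattern(token):
    """Collapse a token into its surface pattern, e.g. 'Regulating' -> 'Aa'."""
    if token and (token.isupper() or token.islower()) and ROMAN.fullmatch(token):
        return "R" if token.isupper() else "r"
    pattern = []
    for ch in token:
        symbol = char_class(ch)
        if not pattern or pattern[-1] != symbol:  # collapse repeated symbols
            pattern.append(symbol)
    return "".join(pattern)


tokens = ["Cells", "in", "Regulating", "Cellular", "Immunity"]
i = 2  # current token: 'Regulating'
print([f"pattern_{off}_{surface_pattern(tokens[i + off])}" for off in (-1, 0, 1)])
```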

5) Pattern Prefix

The first character in the surface pattern of the current token.^[1]

Substring	`Regulating`
Relative Index	`0`
Features	`Aa`
Formatted	`pattern_prefix_A`

💾 Code
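
Given a surface pattern string, the prefix feature is just its first symbol. A small sketch, with the hypothetical name `pattern_prefix_feature` and the assumption that an empty pattern yields no feature:

```python
def pattern_prefix_feature(pattern):
    """Format the first symbol of a surface pattern, e.g. 'Aa' -> 'pattern_prefix_A:1.0'.

    Hypothetical helper: returns None for an empty pattern.
    """
    if not pattern:
        return None
    return f"pattern_prefix_{pattern[0]}:1.0"


print(pattern_prefix_feature("Aa"))
```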

★ Syntactic Properties of Tokens

6) Part-of-Speech Tags w/ a Window of Length 2

Substring	`Cells`	`in`	`Regulating`	`Cellular`	`Immunity`
Relative Index	`-2`	`-1`	`0`	`1`	`2`
Features	`NNS`	`IN`	`VBG`	`JJ`	`NN`
Formatted	`pos_-2_NNS`	`pos_-1_IN`	`pos_0_VBG`	`pos_1_JJ`	`pos_2_NN`

💾 Code

7) Syntactic Chunk w/ a Window of Length 2

The chunk tags used in Vincze et al.^[2] were obtained using the C&C Chunker. Because that chunker was not available to us, we trained our own NLTK chunker on NLTK's treebank_chunk and conll2000 corpora.

Substring	`Cells`	`in`	`Regulating`	`Cellular`	`Immunity`
Relative Index	`-2`	`-1`	`0`	`1`	`2`
Features	`B-NP`	`B-PP`	`B-VP`	`B-NP`	`I-NP`
Formatted	`chunk_-2_B-NP`	`chunk_-1_B-PP`	`chunk_0_B-VP`	`chunk_1_B-NP`	`chunk_2_I-NP`

💾 Code

8-A) Stems/Lemmas w/ Current Chunk w/ a Window of Length 1^[1]

Substring	`in`	`Regulating`	`Cellular`
Relative Index	`-1`	`0`	`1`
Features	`in`	`B-VP`, `regul`	`cellular`
Formatted	`L_-1_in_C_0_B-VP`	`L_0_regul_C_0_B-VP`	`L_1_cellular_C_0_B-VP`

💾 Code

8-B) Stems/Lemmas w/ Current POS w/ a Window of Length 1^[1]

Substring	`in`	`Regulating`	`Cellular`
Relative Index	`-1`	`0`	`1`
Features	`in`	`VBG`, `regul`	`cellular`
Formatted	`L_-1_in_P_0_VBG`	`L_0_regul_P_0_VBG`	`L_1_cellular_P_0_VBG`

💾 Code
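
The conjunction features in 8-A and 8-B pair each stem in the window with the tag of the current token. A minimal sketch, assuming stems and tags are already available as parallel lists (values copied from the example training data). The function name `conjunction_features` and the `tag_code` parameter are hypothetical.

```python
def conjunction_features(stems, tags, index, tag_code, size=1):
    """Pair each stem in the window with the current token's tag.

    Hypothetical helper: tag_code is 'C' for chunk tags or 'P' for POS tags.
    Offsets outside the sentence are skipped.
    """
    current_tag = tags[index]
    features = []
    for offset in range(-size, size + 1):
        j = index + offset
        if 0 <= j < len(stems):
            features.append(f"L_{offset}_{stems[j]}_{tag_code}_0_{current_tag}:1.0")
    return features


stems = ["cell", "in", "regul", "cellular", "immun"]
pos = ["NNS", "IN", "VBG", "JJ", "NN"]
chunks = ["B-NP", "B-PP", "B-VP", "B-NP", "I-NP"]

print(conjunction_features(stems, chunks, 2, "C"))  # 8-A for 'Regulating'
print(conjunction_features(stems, pos, 2, "P"))     # 8-B for 'Regulating'
```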

► Example Training Data

Unique ID	Raw Token	Stem	POS	Binary Label	Multiclass Label	Features
sent14token0000	Cells	cell	NNP	X	X	L_0_cell_C_0_B-NP:1.0 L_0_cell_P_0_NNS:1.0 L_1_in_C_0_B-NP:1.0 L_1_in_P_0_NNS:1.0 chunk_0_B-NP:1.0 chunk_1_B-PP:1.0 chunk_2_B-VP:1.0 lemma_0_cell:1.0 lemma_1_in:1.0 lemma_2_regul:1.0 pattern_0_Aa:1.0 pattern_1_a:1.0 pattern_prefix_A:1.0 pos_0_NNS:1.0 pos_1_IN:1.0 pos_2_VBG:1.0 prefix_3_Cel:1.0 prefix_4_Cell:1.0 suffix_3_lls:1.0 suffix_4_ells:1.0
sent14token0001	in	in	NN	X	X	L_-1_cell_C_0_B-PP:1.0 L_-1_cell_P_0_IN:1.0 L_0_in_C_0_B-PP:1.0 L_0_in_P_0_IN:1.0 L_1_regul_C_0_B-PP:1.0 L_1_regul_P_0_IN:1.0 chunk_-1_B-NP:1.0 chunk_0_B-PP:1.0 chunk_1_B-VP:1.0 chunk_2_B-NP:1.0 lemma_-1_cell:1.0 lemma_0_in:1.0 lemma_1_regul:1.0 lemma_2_cellular:1.0 pattern_-1_Aa:1.0 pattern_0_a:1.0 pattern_1_Aa:1.0 pattern_prefix_a:1.0 pos_-1_NNS:1.0 pos_0_IN:1.0 pos_1_VBG:1.0 pos_2_JJ:1.0 prefix_3_in:1.0 prefix_4_in:1.0 prefix_5_in:1.0 suffix_3_in:1.0 suffix_4_in:1.0 suffix_5_in:1.0
sent14token0002	Regulating	regul	NNP	X	X	L_-1_in_C_0_B-VP:1.0 L_-1_in_P_0_VBG:1.0 L_0_regul_C_0_B-VP:1.0 L_0_regul_P_0_VBG:1.0 L_1_cellular_C_0_B-VP:1.0 L_1_cellular_P_0_VBG:1.0 chunk_-1_B-PP:1.0 chunk_-2_B-NP:1.0 chunk_0_B-VP:1.0 chunk_1_B-NP:1.0 chunk_2_I-NP:1.0 lemma_-1_in:1.0 lemma_-2_cell:1.0 lemma_0_regul:1.0 lemma_1_cellular:1.0 lemma_2_immun:1.0 pattern_-1_a:1.0 pattern_0_Aa:1.0 pattern_1_Aa:1.0 pattern_prefix_A:1.0 pos_-1_IN:1.0 pos_-2_NNS:1.0 pos_0_VBG:1.0 pos_1_JJ:1.0 pos_2_NN:1.0 prefix_3_Reg:1.0 prefix_4_Regu:1.0 prefix_5_Regul:1.0 suffix_3_ing:1.0 suffix_4_ting:1.0 suffix_5_ating:1.0
sent14token0003	Cellular	cellular	NNP	X	X	L_-1_regul_C_0_B-NP:1.0 L_-1_regul_P_0_JJ:1.0 L_0_cellular_C_0_B-NP:1.0 L_0_cellular_P_0_JJ:1.0 L_1_immun_C_0_B-NP:1.0 L_1_immun_P_0_JJ:1.0 chunk_-1_B-VP:1.0 chunk_-2_B-PP:1.0 chunk_0_B-NP:1.0 chunk_1_I-NP:1.0 lemma_-1_regul:1.0 lemma_-2_in:1.0 lemma_0_cellular:1.0 lemma_1_immun:1.0 pattern_-1_Aa:1.0 pattern_0_Aa:1.0 pattern_1_Aa:1.0 pattern_prefix_A:1.0 pos_-1_VBG:1.0 pos_-2_IN:1.0 pos_0_JJ:1.0 pos_1_NN:1.0 prefix_3_Cel:1.0 prefix_4_Cell:1.0 prefix_5_Cellu:1.0 suffix_3_lar:1.0 suffix_4_ular:1.0 suffix_5_lular:1.0
sent14token0004	Immunity	immun	PRP	X	X	L_-1_cellular_C_0_I-NP:1.0 L_-1_cellular_P_0_NN:1.0 L_0_immun_C_0_I-NP:1.0 L_0_immun_P_0_NN:1.0 chunk_-1_B-NP:1.0 chunk_-2_B-VP:1.0 chunk_0_I-NP:1.0 lemma_-1_cellular:1.0 lemma_-2_regul:1.0 lemma_0_immun:1.0 pattern_-1_Aa:1.0 pattern_0_Aa:1.0 pattern_prefix_A:1.0 pos_-1_JJ:1.0 pos_-2_VBG:1.0 pos_0_NN:1.0 prefix_3_Imm:1.0 prefix_4_Immu:1.0 prefix_5_Immun:1.0 suffix_3_ity:1.0 suffix_4_nity:1.0 suffix_5_unity:1.0
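
For readers who want to load rows like these, here is a hedged sketch of a parser. It assumes the seven columns are tab-separated and that the features column is a space-separated list of `name:value` pairs, as the table above suggests. The function name `parse_row` and the dictionary keys are hypothetical.

```python
def parse_row(line):
    """Split one training row into its columns and a {feature: value} dict.

    Assumes tab-separated columns and space-separated 'name:value' features.
    """
    uid, token, stem, pos, binary, multiclass, feature_str = (
        line.rstrip("\n").split("\t", 6)
    )
    features = {}
    for item in feature_str.split():
        # Split on the last colon so feature names containing ':' survive.
        name, _, value = item.rpartition(":")
        features[name] = float(value)
    return {
        "id": uid,
        "token": token,
        "stem": stem,
        "pos": pos,
        "binary_label": binary,
        "multiclass_label": multiclass,
        "features": features,
    }


row = "sent14token0001\tin\tin\tNN\tX\tX\tlemma_0_in:1.0 pos_0_IN:1.0 pattern_0_a:1.0"
parsed = parse_row(row)
print(parsed["token"], parsed["features"])
```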

☕ Footnotes

📃 [1] This feature is present in the reverse-engineered dataset, but is not described within Vincze et al.^[2]

📃 [2] Vincze, V. (2015). Uncertainty detection in natural language texts (Doctoral dissertation, SZTE).
