
Feature Set

Benjamin Meyers edited this page May 22, 2017 · 5 revisions

► Feature Set

The features used to train this implementation of the classifier were reverse-engineered from those provided in the Szeged Uncertainty Corpus. They are briefly described below with examples.

★ Surface Form of Tokens

1) Prefixes of Length 3-5
Token Distinct
Features Dis Dist Disti
Formatted prefix_3_Dis:1.0 prefix_4_Dist:1.0 prefix_5_Disti:1.0

💾 Code

2) Suffixes of Length 3-5
Token Distinct
Features nct inct tinct
Formatted suffix_3_nct:1.0 suffix_4_inct:1.0 suffix_5_tinct:1.0

💾 Code
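
The prefix and suffix features above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function name `affix_features` is hypothetical, and it assumes an affix is emitted only when the token is longer than the affix (which matches "Cells" in the example training data having no length-5 affixes). Tokens shorter than three characters appear to be handled differently in the example training data, and this sketch does not reproduce that behavior.

```python
def affix_features(token, min_len=3, max_len=5):
    """Return prefix and suffix feature names for a single token.

    Assumption: an affix of length n is emitted only if the token is
    strictly longer than n characters.
    """
    features = []
    for n in range(min_len, max_len + 1):
        if len(token) > n:
            features.append(f"prefix_{n}_{token[:n]}")
    for n in range(min_len, max_len + 1):
        if len(token) > n:
            features.append(f"suffix_{n}_{token[-n:]}")
    return features


if __name__ == "__main__":
    for name in affix_features("Distinct"):
        print(name)
```

For the token "Distinct" this yields the same six feature names shown in the two tables above (without the `:1.0` value).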

3) Stems/Lemmas w/ a Window of Length 2

This feature records the stem/lemma of the two previous tokens, the current token, and the two following tokens.

Substring Cells in Regulating Cellular Immunity
Relative Index -2 -1 0 1 2
Features cell in regulate cellular immun
Formatted lemma_-2_cell lemma_-1_in lemma_0_regulate lemma_1_cellular lemma_2_immun

💾 Code
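
A windowed feature of this kind can be sketched as below. The helper `window_features` is a hypothetical name, and the stems/lemmas are passed in as a precomputed list rather than produced by a particular stemmer. Positions that fall outside the sentence are skipped, which is an assumption based on the example training data, where the first token of a sentence has no features at negative offsets.

```python
def window_features(values, index, name, window=2):
    """Return features for values[index - window .. index + window].

    values: per-token values for one sentence (e.g. stems/lemmas)
    index:  position of the current token
    name:   feature name used in the output (e.g. "lemma")
    Offsets that fall outside the sentence are skipped (assumption).
    """
    features = []
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(values):
            features.append(f"{name}_{offset}_{values[position]}")
    return features


if __name__ == "__main__":
    lemmas = ["cell", "in", "regulate", "cellular", "immun"]
    print(" ".join(window_features(lemmas, 2, "lemma")))
```

The same helper would apply to any per-token sequence with a symmetric window, such as POS tags or chunk tags, by changing `name` and `window`.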

4) Surface Patterns w/ a Window of Length 1

A string of characters representing the surface-pattern or shape of a word.

  • A and a denote uppercase and lowercase character sequences, respectively.
  • 0 denotes numerical sequences.
  • G and g denote uppercase and lowercase Greek character sequences, respectively.
  • R and r denote uppercase and lowercase Roman Numeral sequences, respectively.
  • ! denotes the presence of non-alphanumeric characters.
Substring in Regulating Cellular
Relative Index -1 0 1
Features a Aa Aa
Formatted pattern_-1_a pattern_0_Aa pattern_1_Aa

💾 Code
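
One possible way to compute such a surface pattern is to map each character to a class symbol and collapse consecutive repeats. The sketch below is an illustration under assumptions, not the project's implementation: the names `char_class` and `surface_pattern` are hypothetical, Greek letters are detected by the Unicode Greek block, and a token is treated as a Roman numeral only if the whole token is a well-formed numeral (so ordinary words such as "mix" would also be flagged).

```python
import re
from itertools import groupby

# Well-formed Roman numerals up to 3999 (assumption: whole-token match only).
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def char_class(ch):
    """Map a single character to its pattern symbol."""
    if "\u0370" <= ch <= "\u03ff" and ch.isalpha():
        return "G" if ch.isupper() else "g"
    if ch.isdigit():
        return "0"
    if ch.isalpha():
        return "A" if ch.isupper() else "a"
    return "!"


def surface_pattern(token):
    """Return the collapsed surface pattern of a token."""
    if token and token.isascii() and token.isalpha():
        if token.isupper() and ROMAN.fullmatch(token):
            return "R"
        if token.islower() and ROMAN.fullmatch(token.upper()):
            return "r"
    # Collapse runs of the same class into one symbol.
    return "".join(cls for cls, _ in groupby(char_class(ch) for ch in token))


if __name__ == "__main__":
    for tok in ["in", "Regulating", "Cellular"]:
        print(tok, surface_pattern(tok))
```

With these rules, "in" maps to `a` and both "Regulating" and "Cellular" map to `Aa`, in line with the table above.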

5) Pattern Prefix

The first character in the surface pattern of the current token.[1]

Substring Regulating
Relative Index 0
Features Aa
Formatted pattern_prefix_A

💾 Code

★ Syntactic Properties of Tokens

6) Part-of-Speech Tags w/ a Window of Length 2
Substring Cells in Regulating Cellular Immunity
Relative Index -2 -1 0 1 2
Features NNS IN VBG JJ NN
Formatted pos_-2_NNS pos_-1_IN pos_0_VBG pos_1_JJ pos_2_NN

💾 Code

7) Syntactic Chunk w/ a Window of Length 2

The chunk tags used in Vincze et al.[2] were obtained using the C&C Chunker. Because that chunker was not available to us, we trained our own NLTK chunker using NLTK's treebank_chunk and conll2000 corpora.

Substring Cells in Regulating Cellular Immunity
Relative Index -2 -1 0 1 2
Features I-NP B-PP B-VP B-NP I-NP
Formatted chunk_-2_I-NP chunk_-1_B-PP chunk_0_B-VP chunk_1_B-NP chunk_2_I-NP

💾 Code

8-A) Stems/Lemmas w/ Current Chunk w/ a Window of Length 1[1]
Substring in Regulating Cellular
Relative Index -1 0 1
Features in B-VP, regulate cellular
Formatted L_-1_in_C_0_B-VP L_0_regulate_C_0_B-VP L_1_cellular_C_0_B-VP

💾 Code

8-B) Stems/Lemmas w/ Current POS w/ a Window of Length 1[1]
Substring in Regulating Cellular
Relative Index -1 0 1
Features in VBG, regulate cellular
Formatted L_-1_in_P_0_VBG L_0_regulate_P_0_VBG L_1_cellular_P_0_VBG

💾 Code
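
Features 8-A and 8-B both pair the stem/lemma at each window position with a tag of the current token. A minimal sketch, assuming the `L_<offset>_<lemma>_<code>_0_<tag>` naming seen in the formatted rows above (the function name `conjoined_features` and its parameters are hypothetical):

```python
def conjoined_features(lemmas, tags, index, tag_code, window=1):
    """Pair each lemma in the window with the tag of the current token.

    lemmas:   stems/lemmas for one sentence
    tags:     per-token tags (chunk tags or POS tags)
    index:    position of the current token
    tag_code: "C" for chunk tags, "P" for POS tags
    Offsets that fall outside the sentence are skipped (assumption).
    """
    current_tag = tags[index]
    features = []
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(lemmas):
            features.append(
                f"L_{offset}_{lemmas[position]}_{tag_code}_0_{current_tag}"
            )
    return features


if __name__ == "__main__":
    lemmas = ["cell", "in", "regulate", "cellular", "immun"]
    chunks = ["I-NP", "B-PP", "B-VP", "B-NP", "I-NP"]
    pos_tags = ["NNS", "IN", "VBG", "JJ", "NN"]
    print(" ".join(conjoined_features(lemmas, chunks, 2, "C")))
    print(" ".join(conjoined_features(lemmas, pos_tags, 2, "P")))
```

Called for the token "Regulating", the two calls produce the formatted rows shown for 8-A and 8-B respectively.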

► Example Training Data

Unique ID Raw Token Stem POS Binary Label Multiclass Label Features
sent14token0000 Cells cell NNP X X L_0_cell_C_0_B-NP:1.0
L_0_cell_P_0_NNS:1.0
L_1_in_C_0_B-NP:1.0
L_1_in_P_0_NNS:1.0
chunk_0_B-NP:1.0
chunk_1_B-PP:1.0
chunk_2_B-VP:1.0
lemma_0_cell:1.0
lemma_1_in:1.0
lemma_2_regul:1.0
pattern_0_Aa:1.0
pattern_1_a:1.0
pattern_prefix_A:1.0
pos_0_NNS:1.0
pos_1_IN:1.0
pos_2_VBG:1.0
prefix_3_Cel:1.0
prefix_4_Cell:1.0
suffix_3_lls:1.0
suffix_4_ells:1.0
sent14token0001 in in NN X X L_-1_cell_C_0_B-PP:1.0
L_-1_cell_P_0_IN:1.0
L_0_in_C_0_B-PP:1.0
L_0_in_P_0_IN:1.0
L_1_regul_C_0_B-PP:1.0
L_1_regul_P_0_IN:1.0
chunk_-1_B-NP:1.0
chunk_0_B-PP:1.0
chunk_1_B-VP:1.0
chunk_2_B-NP:1.0
lemma_-1_cell:1.0
lemma_0_in:1.0
lemma_1_regul:1.0
lemma_2_cellular:1.0
pattern_-1_Aa:1.0
pattern_0_a:1.0
pattern_1_Aa:1.0
pattern_prefix_a:1.0
pos_-1_NNS:1.0
pos_0_IN:1.0
pos_1_VBG:1.0
pos_2_JJ:1.0
prefix_3_in:1.0
prefix_4_in:1.0
prefix_5_in:1.0
suffix_3_in:1.0
suffix_4_in:1.0
suffix_5_in:1.0
sent14token0002 Regulating regul NNP X X L_-1_in_C_0_B-VP:1.0
L_-1_in_P_0_VBG:1.0
L_0_regul_C_0_B-VP:1.0
L_0_regul_P_0_VBG:1.0
L_1_cellular_C_0_B-VP:1.0
L_1_cellular_P_0_VBG:1.0
chunk_-1_B-PP:1.0
chunk_-2_B-NP:1.0
chunk_0_B-VP:1.0
chunk_1_B-NP:1.0
chunk_2_I-NP:1.0
lemma_-1_in:1.0
lemma_-2_cell:1.0
lemma_0_regul:1.0
lemma_1_cellular:1.0
lemma_2_immun:1.0
pattern_-1_a:1.0
pattern_0_Aa:1.0
pattern_1_Aa:1.0
pattern_prefix_A:1.0
pos_-1_IN:1.0
pos_-2_NNS:1.0
pos_0_VBG:1.0
pos_1_JJ:1.0
pos_2_NN:1.0
prefix_3_Reg:1.0
prefix_4_Regu:1.0
prefix_5_Regul:1.0
suffix_3_ing:1.0
suffix_4_ting:1.0
suffix_5_ating:1.0
sent14token0003 Cellular cellular NNP X X L_-1_regul_C_0_B-NP:1.0
L_-1_regul_P_0_JJ:1.0
L_0_cellular_C_0_B-NP:1.0
L_0_cellular_P_0_JJ:1.0
L_1_immun_C_0_B-NP:1.0
L_1_immun_P_0_JJ:1.0
chunk_-1_B-VP:1.0
chunk_-2_B-PP:1.0
chunk_0_B-NP:1.0
chunk_1_I-NP:1.0
lemma_-1_regul:1.0
lemma_-2_in:1.0
lemma_0_cellular:1.0
lemma_1_immun:1.0
pattern_-1_Aa:1.0
pattern_0_Aa:1.0
pattern_1_Aa:1.0
pattern_prefix_A:1.0
pos_-1_VBG:1.0
pos_-2_IN:1.0
pos_0_JJ:1.0
pos_1_NN:1.0
prefix_3_Cel:1.0
prefix_4_Cell:1.0
prefix_5_Cellu:1.0
suffix_3_lar:1.0
suffix_4_ular:1.0
suffix_5_lular:1.0
sent14token0004 Immunity immun PRP X X L_-1_cellular_C_0_I-NP:1.0
L_-1_cellular_P_0_NN:1.0
L_0_immun_C_0_I-NP:1.0
L_0_immun_P_0_NN:1.0
chunk_-1_B-NP:1.0
chunk_-2_B-VP:1.0
chunk_0_I-NP:1.0
lemma_-1_cellular:1.0
lemma_-2_regul:1.0
lemma_0_immun:1.0
pattern_-1_Aa:1.0
pattern_0_Aa:1.0
pattern_prefix_A:1.0
pos_-1_JJ:1.0
pos_-2_VBG:1.0
pos_0_NN:1.0
prefix_3_Imm:1.0
prefix_4_Immu:1.0
prefix_5_Immun:1.0
suffix_3_ity:1.0
suffix_4_nity:1.0
suffix_5_unity:1.0
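
In the example rows above, each feature is written as `name:1.0` and the features of a token appear in plain string-sorted order (uppercase `L_` features first, and `chunk_-1` before `chunk_-2`). A sketch of a formatter that reproduces this layout follows. The function name `format_features` is hypothetical, and the sort order is inferred from the example rather than confirmed from the original code.

```python
def format_features(feature_names, value=1.0):
    """Return 'name:value' strings in plain string-sorted order.

    Assumption: every feature is binary and stored with the value 1.0,
    as in the example training data.
    """
    return [f"{name}:{value:.1f}" for name in sorted(feature_names)]


if __name__ == "__main__":
    names = ["pos_0_VBG", "chunk_-2_B-NP", "chunk_-1_B-PP", "L_0_regul_C_0_B-VP"]
    for line in format_features(names):
        print(line)
```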

☕ Footnotes

📃 [1] This feature is present in the reverse-engineered dataset, but is not described within Vincze et al.[2]

📃 [2] Vincze, V. (2015). Uncertainty detection in natural language texts (Doctoral dissertation, University of Szeged).