<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="vorträge-049">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Classification of Literary Subgenres</title>
<author>
<name>
<surname>Hettinger</surname>
<forename>Lena</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
<author>
<name>
<surname>Reger</surname>
<forename>Isabella</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
<author>
<name>
<surname>Jannidis</surname>
<forename>Fotis</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
<author>
<name>
<surname>Hotho</surname>
<forename>Andreas</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
</titleStmt>
<editionStmt>
<edition>
<date>2015-10-15T13:39:00Z</date>
</edition>
</editionStmt>
<publicationStmt>
<publisher>Elisabeth Burr, Universität Leipzig</publisher>
<address>
<addrLine>Beethovenstr. 15</addrLine>
<addrLine>04107 Leipzig</addrLine>
<addrLine>Deutschland</addrLine>
<addrLine>Elisabeth Burr</addrLine>
</address>
</publicationStmt>
<sourceDesc>
<p>Converted from a Word document </p>
</sourceDesc>
</fileDesc>
<encodingDesc>
<appInfo>
<application ident="DHCONVALIDATOR" version="1.17">
<label>DHConvalidator</label>
</application>
</appInfo>
</encodingDesc>
<profileDesc>
<textClass>
<keywords scheme="ConfTool" n="category">
<term>Vortrag</term>
</keywords>
<keywords scheme="ConfTool" n="subcategory">
<term></term>
</keywords>
<keywords scheme="ConfTool" n="keywords">
<term>Genreklassifikation</term>
<term>Topic Modelling</term>
</keywords>
<keywords scheme="ConfTool" n="topics">
<term>Inhaltsanalyse</term>
<term>Netzwerkanalyse</term>
<term>Stilistische Analyse</term>
<term>Literatur</term>
<term>Text</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<text>
<body>
<div type="div1" rend="DH-Heading1">
<head>Introduction</head>
<p>Literary scholars and common readers use labels like educational novel, crime
novel or adventure novel to organize the large domain of fiction. In both
discourses the use of these categories is well-established even though they are
evolving and tend to be inconsistent. The classification of genres is one of the
standard tasks in document classification and has been researched intensively
(cf. Biber 1989; Santini 2004; Freund et al. 2006; Sharoff et al. 2010). Some
results seem impressive, for example distinguishing clear-cut genres like poetry
from fiction (Underwood 2014), but most texts on literary genre classification
emphasize, as does the literature on genre classification in general, the variability
of genre signals (Allison et al. 2011: 19; Underwood et al. 2013; Underwood
2014). The scores for genre classification over all categories are therefore
often not very high. Jockers, for example, reports an accuracy of 67% (Jockers
2013: 81). Genre classification in general works best with most frequent words,
all words or character tetragrams (Freund et al. 2006; Sharoff et al. 2010) and
most of the reported experiments for literary genre classification also use all
words or only the <hi rend="italic">n</hi> most frequent words (sometimes
including punctuation) as features. In a series of experiments we examine
whether it is possible to enhance these results for the classification of
subgenres of novels. Our research is motivated by an understanding of novel
genres as concepts which are differentiated by style, settings, character
constellations and plots. We use most frequent words as an indicator for style
and network characteristics as an indicator for character constellations.
Setting is partially covered by topic models which also represent information on
typical ways of telling a story, narrative <hi rend="italic">topoi</hi>. We have
to omit plot, as we don’t have a reliable way to represent plot by any
indicators yet. </p>
</div>
<div type="div1" rend="DH-Heading1">
<head>Setting</head>
<p>In the following we will describe the corpus and the features we use for the task of subgenre classification. </p>
<div type="div2" rend="DH-Heading2">
<head>Corpus</head>
<p>Our corpus consists of 628 German novels mainly from the 19th century
(roughly 1745 to 1935) obtained from sources like the <ref
target="textgrid.de/digitale-bibliothek">TextGrid Digital Library</ref>
or the <ref target="http://gutenberg.spiegel.de">German Projekt Gutenberg</ref>. The novels have been manually labeled according to their subgenre
after research in literary lexica and handbooks. The corpus contains 221
adventure novels, 88 social novels and 86 educational novels; the rest are
novels from different subgenres. </p>
</div>
<div type="div2" rend="DH-Heading2">
<head>Features</head>
<p>As mentioned in section 1, we use three types of features (stylometric, topic-based and network features), which are described in more detail in Hettinger et al. (2015). Features are extracted and normalized to the range [0,1] based on the whole corpus of 628 novels. </p>
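<p>As an illustration, a minimal sketch of such a corpus-wide scaling to [0,1], assuming the extracted features have already been collected into a matrix with one row per novel; the file name and the use of scikit-learn are assumptions for this sketch, not part of the original pipeline.</p>
<p><code lang="python" xml:space="preserve">
# Scale every feature column to [0, 1] over the whole corpus of 628 novels.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.load("novel_features.npy")            # hypothetical matrix: 628 novels x n features
X_scaled = MinMaxScaler().fit_transform(X)   # column-wise (x - min) / (max - min)
</code></p>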
<div type="div3" rend="DH-Heading3">
<head>Stylometric features</head>
<p>We use word frequencies as well as character tetragrams to represent stylometric features. We tested different numbers of most frequent words and decided to work with the top 3000 (mfw3000). Additionally, we use the top 1000 character tetragrams (4gram). </p>
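<p>One possible way to extract these two feature sets is sketched below with scikit-learn; the extraction code actually used in the study is not part of this abstract, and <code>novel_texts</code> is an assumed list of the raw novel strings.</p>
<p><code lang="python" xml:space="preserve">
# Counts of the 3000 most frequent words (mfw3000) and the 1000 most
# frequent character tetragrams (4gram); normalization follows separately.
from sklearn.feature_extraction.text import CountVectorizer

mfw_vec  = CountVectorizer(max_features=3000)
gram_vec = CountVectorizer(analyzer="char", ngram_range=(4, 4), max_features=1000)

mfw3000 = mfw_vec.fit_transform(novel_texts)   # 628 x 3000 word counts
gram4   = gram_vec.fit_transform(novel_texts)  # 628 x 1000 tetragram counts
</code></p>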
</div>
<div type="div3" rend="DH-Heading3">
<head>Topic features</head>
<p>We use Latent Dirichlet Allocation (LDA) by Blei et al. (2003) to extract topics from our data. In literary texts topics sometimes represent themes, but more often they represent topoi, i.e. conventional ways of telling a story or parts of it. For each novel we derive a topic distribution, i.e. we calculate how strongly each topic is associated with that novel. We try different preprocessing approaches and topic numbers and build ten models for each setting to reduce the influence of randomness in LDA models. In every setting we first remove a set of predefined stop words from the novels and then apply LDA to our corpus of 628 novels. The different forms of preprocessing we use are the following (a code sketch of the topic-feature step follows the list): </p>
<list type="unordered">
<item>no additional preprocessing</item>
<item><ref target="https://github.com/MarkusKrug/NERDetection">removal
of Named Entities</ref> (Jannidis et al. 2015) </item>
<item><ref target="http://snowball.tartarus.org/">word stemming</ref>
</item>
<item><ref target="http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/">word lemmatization</ref>
</item>
<item>word lemmatization + removal of Named Entities </item>
</list>
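<p>A minimal sketch of the topic-feature step, using gensim as one possible LDA implementation; tokenization, stop-word removal and the preprocessing variants listed above are assumed to have been applied already.</p>
<p><code lang="python" xml:space="preserve">
# Train an LDA model and derive one topic distribution per novel.
from gensim import corpora, models

# tokenized_novels: list of token lists, preprocessed as described above (assumed)
dictionary = corpora.Dictionary(tokenized_novels)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_novels]

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=100, passes=10)

# Topic distribution per novel: probability of each of the 100 topics.
topic_features = [
    [prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in bow_corpus
]
</code></p>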
</div>
<div type="div3" rend="DH-Heading3">
<head>Network features</head>
<p>We use the character recognition system described in Jannidis et al. (2015) to identify the characters of each novel. Although the NER tool can be employed with co-reference resolution, we do not make use of this option here. We extract proper names to build a network in which each node is a character and the weight of the edge between two characters is the number of their co-occurrences in the same paragraph. The network of each novel is reduced to the most central characters and the most frequent interactions in order to bring out its basic shape. The network feature set consists of the total number of characters in a novel and six network measures: maximum degree centrality, global efficiency, transitivity, average clustering coefficient, central point dominance and density.</p>
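<p>How such measures could be computed with networkx is sketched below; the paragraph-wise co-occurrence counting and the pruning to the most central characters are assumed to have happened beforehand, and central point dominance is derived here from betweenness centrality (Freeman's formulation), which is an assumption about the exact computation.</p>
<p><code lang="python" xml:space="preserve">
# Build a weighted character network and compute the six measures plus character count.
import networkx as nx

G = nx.Graph()
for (char_a, char_b), count in cooccurrence.items():  # paragraph co-occurrence counts (assumed)
    G.add_edge(char_a, char_b, weight=count)

n_chars = G.number_of_nodes()
betw = nx.betweenness_centrality(G)
max_betw = max(betw.values())

features = {
    "num_characters":        n_chars,
    "max_degree_centrality": max(nx.degree_centrality(G).values()),
    "global_efficiency":     nx.global_efficiency(G),
    "transitivity":          nx.transitivity(G),
    "avg_clustering":        nx.average_clustering(G),
    # central point dominance: average gap between the most central node and the rest
    "central_point_dominance": sum(max_betw - b for b in betw.values()) / (n_chars - 1),
    "density":               nx.density(G),
}
</code></p>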
</div>
</div>
</div>
<div type="div1" rend="DH-Heading1">
<head>Evaluation</head>
<p>Classification is done by means of a linear Support Vector Machine (SVM), as we have already shown in Hettinger et al. (2015) that it works best in this setting (see also Yu 2008). In each experiment we apply 100 iterations of 10-fold cross-validation to account for the small data sets. The depicted results are averages over 1000 classification accuracy values. We want to investigate the following subgenre constellations: </p>
<list type="unordered">
<item>Adventure versus non-Adventure novels</item>
<item>Educational versus non-Educational novels</item>
<item>Social versus non-Social novels </item>
<item>Adventure versus Educational novels</item>
<item>Educational versus Social novels </item>
</list>
<p>Depending on the setting, the label distribution is often imbalanced. To make results comparable we use undersampling: in each of the 100 iterations a new sample is drawn from the larger class, while all instances of the smaller class are used. This results in a majority vote (MV) baseline that always yields an accuracy score of 0.50, as both classes have equal size.</p>
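<p>A condensed sketch of this evaluation protocol (linear SVM, undersampling of the larger class, repeated 10-fold cross-validation), using scikit-learn for illustration; the feature matrices <code>X_small</code> and <code>X_large</code> for the two classes are assumptions, and the original experiments are those described in Hettinger et al. (2015).</p>
<p><code lang="python" xml:space="preserve">
# 100 iterations: undersample the larger class, then run 10-fold CV with a linear SVM.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def evaluate(X_small, X_large, n_iter=100):
    scores = []
    for i in range(n_iter):
        # draw a fresh sample from the larger class, keep all of the smaller class
        X_sub = resample(X_large, replace=False, n_samples=len(X_small), random_state=i)
        X = np.vstack([X_small, X_sub])
        y = np.array([1] * len(X_small) + [0] * len(X_sub))
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
        scores.extend(cross_val_score(LinearSVC(), X, y, cv=cv, scoring="accuracy"))
    return np.mean(scores)  # average over 100 x 10 = 1000 accuracy values
</code></p>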
<p>To determine the influence of the LDA topic parameter
<hi rend="italic">t</hi> and different preprocessing procedures we report accuracy for
<hi rend="italic">t </hi>=10, 70, 100 and 200 (see figure 1). Differences between different preprocessing categories are minimal, but removal of named entities seems to improve results overall. We observe comparable result for
<hi rend="italic">t</hi> =10, 70 and 100 and a drop in performance for
<hi rend="italic">t</hi> = 200. Therefore, we will use LDA with lemmatized words and named entities removed for
<hi rend="italic">t</hi> = 100 in the following experiments.
</p>
<figure>
<graphic n="1001" width="10.668230555555555cm" height="6.892397222222222cm" url="049-image1.png" rend="inline"/>
</figure>
<p>a) adventure/non-adventure </p>
<figure>
<graphic n="1002" width="10.67cm" height="6.89cm" url="049-image2.png" rend="inline"/>
</figure>
<p>b) educational/non-educational</p>
<figure>
<graphic n="1003" width="10.67cm" height="6.89cm" url="049-image3.png" rend="inline"/>
</figure>
<p>c) social/non-social</p>
<figure>
<graphic n="1004" width="10.67cm" height="6.89cm" url="049-image4.png" rend="inline"/>
</figure>
<p>d) adventure/educational</p>
<figure>
<graphic n="1005" width="10.67cm" height="6.89cm" url="049-image5.png" rend="inline"/>
</figure>
<p>e) educational/social</p>
<p><hi rend="bold">Fig. 1</hi>: Classification results for different subgenre
settings in terms of accuracy using LDA with topic size <hi rend="italic">t</hi>
= 10, 70, 100, 200 and five different preprocessing procedures on 628 German
novels </p>
<p>When comparing different feature sets across our subgenre constellations, we can
see that semantic-based features (mfw, 4grams, lda) all perform quite well, while
network features perform rather poorly (see figure 2). With an accuracy score of
more than 90%, adventure novels seem to be fairly easy to differentiate from
other genres. In contrast, the other genres don’t show such a distinct signal
using surface features. </p>
<figure>
<graphic n="1006" width="15.138125cm" height="8.845930555555556cm" url="049-image6.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 2</hi>: Accuracy for different scenarios and feature sets
including the majority vote baseline. </p>
<p>As the classification performance of adventure/educational is quite impressive, we
take a closer look at the discriminating words of these genres (figure 3). Some
of the most typical words of adventure novels include <hi rend="italic">captain,
shore, (on) board, help, danger, immediately</hi>. On the other hand, words
like <hi rend="italic">soul, love, world, heart, (at) home, mother, joy, often,
father, emotion, human, to love</hi> are characteristic of educational
novels. </p>
<figure>
<graphic n="1007" width="14.16843888888889cm" height="15.886536111111111cm" url="049-image7.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 3</hi>: Discriminating words for adventure (preferred) and
educational (avoided) novels (Craig’s Zeta)</p>
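<p>Such preference lists can be approximated with a Zeta-style contrast score that compares, for every word, the proportion of text segments of one genre containing the word with the corresponding proportion in the other genre. The sketch below uses the simple difference of these proportions; segment size and the exact normalization differ between Burrows' and Craig's variants, so this is only an approximation of the measure behind figure 3.</p>
<p><code lang="python" xml:space="preserve">
# Zeta-style preference score: proportion of adventure segments containing a word
# minus the proportion of educational segments containing it (range -1 to 1).
def zeta_scores(adventure_segments, educational_segments):
    # each argument: list of token sets, one per text segment (assumed prepared elsewhere)
    vocab = set().union(*adventure_segments, *educational_segments)
    def doc_proportion(word, segments):
        return sum(1 for seg in segments if word in seg) / len(segments)
    return {
        w: doc_proportion(w, adventure_segments) - doc_proportion(w, educational_segments)
        for w in vocab
    }

# Words with the highest scores are "preferred" by adventure novels,
# those with the lowest scores are "avoided" (i.e. typical of educational novels).
</code></p>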
<p>To test whether the authorship of the novels influences our results, we remove the
author signal by allowing only one document per author. In this way we construct a new
dataset called ‘uni’. The sampling is done once, so that the same novels are used
in each setting. As shown in figure 4, we observe a much lower classification quality
after removing the authorship information, indicating that features and models focus
too strongly on the hidden authorship signal. The effect varies between settings:
adventure/educational shows a loss of 0.09 (blue lines), while educational/social
loses 0.23 (red lines). The relatively small loss in the first setting is remarkable,
as it contains 8 novels per author on average. One would expect the opposite given
the weaker author signal of two novels per author for the other categories. </p>
<figure>
<graphic n="1008" width="15.896166666666666cm" height="8.6995cm" url="049-image8.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 4</hi>: The influence of authorship </p>
<p>In the next experiment we test whether the combination of feature sets changes
our classification results. To balance the size of the feature sets, we use
Principal Component Analysis (PCA) and construct 100 features each from the 3000 mfw
and the 1000 4gram features. As shown in figure 5, some feature sets improve
when combined (e.g. 4gram100 and lda100), while for others (e.g. lda100 and
network) performance decreases. However, classification results vary greatly in this
setting, as signaled by the standard deviation bars, so these differences should not
be overrated.</p>
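<p>A sketch of this dimensionality reduction step, again using scikit-learn; here <code>mfw3000_scaled</code>, <code>gram4_scaled</code> and <code>lda100</code> are assumed to be the normalized feature matrices described in section 2.2, held as dense arrays.</p>
<p><code lang="python" xml:space="preserve">
# Reduce mfw3000 and 4gram features to 100 principal components each,
# then concatenate feature sets for the combined classification runs.
import numpy as np
from sklearn.decomposition import PCA

mfw100   = PCA(n_components=100).fit_transform(mfw3000_scaled)  # 628 x 100
gram100  = PCA(n_components=100).fit_transform(gram4_scaled)    # 628 x 100
combined = np.hstack([gram100, lda100])  # e.g. the 4gram100 + lda100 combination
</code></p>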
<figure>
<graphic n="1009" width="15.191041666666667cm" height="9.714894444444445cm" url="049-image9.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 5</hi>: Classification results for combinations of different
feature sets including bars for standard deviation</p>
</div>
<div type="div1" rend="DH-Heading1">
<head>Conclusion</head>
<p>In this work we classified subgenres of German novels using different feature sets (mfw3000, 4gram, lda, etc.). Some subgenres, like adventure novels, are much easier to classify than others. Most of the applied feature sets showed varying but comparable performance, with the exception of the network features. Their weak performance might be caused by a weak link between novel subgenre and character constellation. The variability of the subgenre signal could not be countered by using higher-level features like topics and network characteristics. Interestingly, the author signal has a strong influence on the classification quality. The strength of this influence seems to depend on the category but is visible in all experiments. In the future, we would like to extend our work by using different network features, working on advanced topic models and finding a reliable indicator for plot. Another challenge we have not yet addressed is the development of subgenres over time. </p>
</div>
</body>
<back>
<div type="bibliogr">
<listBibl>
<head>Bibliographie</head>
<bibl>
<hi rend="bold">Allison, Sarah / Heuser, Ryan / Jockers, Matthew / Moretti,
Franco / Witmore, Michael</hi> (2011): <hi rend="italic">Quantitative
Formalism</hi>. An Experiment (= Stanford Literary Lab Pamphlet 1) <ref
target="http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf"
>http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf</ref> [letzter
Zugriff 09. Februar 2016]. </bibl>
<bibl>
<hi rend="bold">Biber, Douglas</hi> (1989): "A typology of English texts",
in: <hi rend="italic">Linguistics </hi>27: 3-43. </bibl>
<bibl>
<hi rend="bold">Blei, David / Ng, Andrew / Jordan, Michael</hi> (2003):
"Latent Dirichlet allocation", in: <hi rend="italic">The Journal of Machine
Learning Research</hi> 3: 993-1022. </bibl>
<bibl>
<hi rend="bold">Finn, Aidan / Kushmerick, Nicholas</hi> (2006): "Learning to
classify documents according to genre", in: <ref
target="http://dx.doi.org/10.1002/asi.20427"><hi rend="italic">Journal
of the American Society for Information Science and
Technology</hi></ref><hi rend="italic"> (JASIST)</hi>. Special Issue on
Computational Analysis of Style 57, 11: 1506-1518.</bibl>
<bibl>
<hi rend="bold">Freund, Luanne / Clarke, Charles L. A. / Toms, Elaine
G.</hi> (2006): "Towards genre classification for IR in the workplace",
in: <hi rend="italic">Proceedings of the 1st international conference on
Information interaction in context</hi> (IIiX). New York, NY: ACM 30-36.
<ref target="http://dx.doi.org/10.1145/1164820.1164829"
>http://dx.doi.org/10.1145/1164820.1164829 </ref> [letzter Zugriff 09.
Februar 2016].</bibl>
<bibl>
<hi rend="bold">Hettinger, Lena / Becker, Martin / Reger, Isabella /
Jannidis, Fotis / Hotho, Andreas</hi> (2015): "Genre classification on
German novels", in: <hi rend="italic">Proceedings of the 12th International
Workshop on Text-based Information Retrieval</hi>. </bibl>
<bibl>
<hi rend="bold">Jannidis, Fotis / Krug, Markus / Reger, Isabella / Toepfer,
Martin / Weimer, Lukas / Puppe, Frank</hi> (2015): "Automatische
Erkennung von Figuren in deutschsprachigen Romanen", in: <hi rend="italic"
>DHd-Tagung 2015</hi>. Von Daten zu Erkenntnissen, 23. bis 27. Februar
2015, Graz. </bibl>
<bibl>
<hi rend="bold">Jockers, Matthew L.</hi> (2013): <hi rend="italic"
>Macroanalysis</hi>. Digital Methods and Literary History. Champaign:
University of Illinois Press. </bibl>
<bibl><hi rend="bold">Krug, Markus</hi> (2015): <hi rend="italic"
>NERDetection</hi>
<ref target="https://github.com/MarkusKrug/NERDetection"
>https://github.com/MarkusKrug/NERDetection</ref> [letzter Zugriff 09.
Februar 2016].</bibl>
<bibl>
<hi rend="bold">Petrenz, Philipp / Webber, Bonnie</hi> (2011): "Stable
classification of text genres", in: <hi rend="italic"><ref
target="http://dx.doi.org/10.1162/COLI_a_00052%20">Computational
Linguistics</ref></hi> 37, 2: 385-393. </bibl>
<bibl><hi rend="bold">Porter, Martin / Boulton, Richard</hi> (2001-2014): <hi
rend="italic">Snowball</hi>
<ref target="http://snowball.tartarus.org/"
>http://snowball.tartarus.org/</ref> [letzter Zugriff 09. Februar
2016].</bibl>
<bibl><hi rend="bold">Projekt Gutenberg-DE</hi> (1994-): <hi rend="italic"
>Projekt Gutenberg-DE</hi>
<ref target="http://gutenberg.spiegel.de/"
>http://gutenberg.spiegel.de/</ref> [letzter Zugriff 09. Februar
2016].</bibl>
<bibl>
<hi rend="bold">Santini, Marina</hi> (2004): "State-of-the-art on Automatic
Genre Identification", in: <hi rend="italic">Technical Report ITRI-04-03,
ITRI, University of Brighton (UK)</hi>. </bibl>
<bibl><hi rend="bold">Schmid, Helmut</hi>(1994-): <hi rend="italic"
>TreeTagger</hi>. A language independent part-of-speech tagger <ref
target="http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/"
>http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/</ref> [letzter
Zugriff 09. Februar 2016].</bibl>
<bibl>
<hi rend="bold">Sharoff, Serge / Wu, Zhili / Markert, Katja</hi> (2010):
"The Web Library of Babel: evaluating genre collections", in: <hi
rend="italic">Proceedings of the conference on Language Resources and
Evaluation (LREC), Malta</hi> 3063-3070. </bibl>
<bibl><hi rend="bold">TextGrid</hi> (2006-2015): <hi rend="italic">Die Digitale
Bibliothek bei TextGrid</hi>
<ref target="textgrid.de/digitale-bibliothek"
>textgrid.de/digitale-bibliothek</ref> [letzter Zugriff 09. Februar
2016].</bibl>
<bibl>
<hi rend="bold">Underwood, Ted</hi> (2014): <hi rend="italic">Understanding
Genre in a Collection of a Million Volumes</hi>. White Paper Report
29.12.2014 <ref
target="http://files.figshare.com/1857045/UnderstandingGenreInterimReport.pdf"
> http://files.figshare.com/1857045/UnderstandingGenreInterimReport.pdf
</ref> [letzter Zugriff 09. Februar 2016].</bibl>
<bibl>
<hi rend="bold">Underwood, Ted / Black, Michael L. / Auvil, Loretta /
Capitanu, Boris</hi> (2013): "Mapping Mutable Genres in Structurally
Complex Volumes", in: <hi rend="italic">2013 IEEE International Conference
on Big Data</hi>
<ref target="http://arxiv.org/abs/1309.3323v2"
>http://arxiv.org/abs/1309.3323v2</ref> [letzter Zugriff 09. Februar
2016]. </bibl>
<bibl>
<hi rend="bold">Yu, Bei</hi> (2008): "An Evaluation of Text Classification
Methods for Literary Study", in: <hi rend="italic">Literary and Linguistic
Computing</hi> 23: 327-343 <ref
target="http://dx.doi.org/10.1093/llc/fqn015"
>http://dx.doi.org/10.1093/llc/fqn015</ref> [letzter Zugriff 09. Februar
2016]. </bibl>
</listBibl>
</div>
</back>
</text>
</TEI>