<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="vorträge-049">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Classification of Literary Subgenres</title>
<author>
<name>
<surname>Hettinger</surname>
<forename>Lena</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
<author>
<name>
<surname>Reger</surname>
<forename>Isabella</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
<author>
<name>
<surname>Jannidis</surname>
<forename>Fotis</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
<author>
<name>
<surname>Hotho</surname>
<forename>Andreas</forename>
</name>
<affiliation>Universität Würzburg, Deutschland</affiliation>
<email>[email protected]</email>
</author>
</titleStmt>
<editionStmt>
<edition>
<date>2015-10-15T13:39:00Z</date>
</edition>
</editionStmt>
<publicationStmt>
<publisher>Elisabeth Burr, Universität Leipzig</publisher>
<address>
<addrLine>Beethovenstr. 15</addrLine>
<addrLine>04107 Leipzig</addrLine>
<addrLine>Deutschland</addrLine>
<addrLine>Elisabeth Burr</addrLine>
</address>
</publicationStmt>
<sourceDesc>
<p>Converted from a Word document </p>
</sourceDesc>
</fileDesc>
<encodingDesc>
<appInfo>
<application ident="DHCONVALIDATOR" version="1.17">
<label>DHConvalidator</label>
</application>
</appInfo>
</encodingDesc>
<profileDesc>
<textClass>
<keywords scheme="ConfTool" n="category">
<term>Vortrag</term>
</keywords>
<keywords scheme="ConfTool" n="subcategory">
<term></term>
</keywords>
<keywords scheme="ConfTool" n="keywords">
<term>Genreklassifikation</term>
<term>Topic Modelling</term>
</keywords>
<keywords scheme="ConfTool" n="topics">
<term>Inhaltsanalyse</term>
<term>Netzwerkanalyse</term>
<term>Stilistische Analyse</term>
<term>Literatur</term>
<term>Text</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<text>
<body>
<div type="div1" rend="DH-Heading1">
<head>Introduction</head>
<p>Literary scholars and common readers use labels like educational novel, crime
novel or adventure novel to organize the large domain of fiction. In both
discourses the use of these categories is well-established even though they are
evolving and tend to be inconsistent. The classification of genres is one of the
standard tasks in document classification and has been researched intensively
(cf. Biber 1989; Santini 2004; Freund et al. 2006; Sharoff et al. 2010). Some
results seem impressive, for example distinguishing clear-cut genres like poetry
from fiction (Underwood 2014), but most texts on literary genre classification
emphasize, as does the literature on genre classification in general, the variability
of genre signals (Allison et al. 2011: 19; Underwood et al. 2013; Underwood
2014). The scores for genre classification over all categories are therefore
often not very high. Jockers, for example, reports an accuracy of 67% (Jockers
2013: 81). Genre classification in general works best with most frequent words,
all words or character tetragrams (Freund et al. 2006; Sharoff et al. 2010) and
most of the reported experiments for literary genre classification also use all
words or only the <hi rend="italic">n</hi> most frequent words (sometimes
including punctuation) as features. In a series of experiments we examine
whether it is possible to enhance these results for the classification of
subgenres of novels. Our research is motivated by an understanding of novel
genres as concepts which are differentiated by style, settings, character
constellations and plots. We use most frequent words as an indicator for style
and network characteristics as an indicator for character constellations.
Setting is partially covered by topic models which also represent information on
typical ways of telling a story, narrative <hi rend="italic">topoi</hi>. We have
to omit plot, as we don’t have a reliable way to represent plot by any
indicators yet. </p>
</div>
<div type="div1" rend="DH-Heading1">
<head>Setting</head>
<p>In the following we will describe the corpus and the features we use for the task of subgenre classification. </p>
<div type="div2" rend="DH-Heading2">
<head>Corpus</head>
<p>Our corpus consists of 628 German novels mainly from the 19th century
(roughly 1745 to 1935) obtained from sources like the <ref
target="textgrid.de/digitale-bibliothek">TextGrid Digital Library</ref>
or the <ref target="http://gutenberg.spiegel.de">German Projekt Gutenberg</ref>. The novels have been manually labeled according to their subgenre
after research in literary lexica and handbooks. The corpus contains 221
adventure novels, 88 social novels and 86 educational novels; the rest are
novels from different subgenres. </p>
</div>
<div type="div2" rend="DH-Heading2">
<head>Features</head>
<p>As mentioned in section 1, we use three types of features (stylometric, topic-based and network features), which are described in more detail in Hettinger et al. (2015). Features are extracted and normalized to the range [0,1] based on the whole corpus of 628 novels. </p>
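<p>As an illustration, a minimal sketch of such a corpus-wide scaling to [0,1], assuming the extracted features have already been collected into a matrix with one row per novel; the file name and the use of scikit-learn are assumptions for this sketch, not part of the original pipeline.</p>
<p><code lang="python" xml:space="preserve">
# Scale every feature column to [0, 1] over the whole corpus of 628 novels.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.load("novel_features.npy")            # hypothetical matrix: 628 novels x n features
X_scaled = MinMaxScaler().fit_transform(X)   # column-wise (x - min) / (max - min)
</code></p>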
<div type="div3" rend="DH-Heading3">
<head>Stylometric features</head>
<p>We use word frequencies as well as character tetragrams to represent stylometric features. We tested different numbers of most frequent words and decided to work with the top 3000 (mfw3000). Additionally, we use the top 1000 character tetragrams (4gram). </p>
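<p>One possible way to extract these two feature sets is sketched below with scikit-learn; the extraction code actually used in the study is not part of this abstract, and <code>novel_texts</code> is an assumed list of the raw novel strings.</p>
<p><code lang="python" xml:space="preserve">
# Counts of the 3000 most frequent words (mfw3000) and the 1000 most
# frequent character tetragrams (4gram); normalization follows separately.
from sklearn.feature_extraction.text import CountVectorizer

mfw_vec  = CountVectorizer(max_features=3000)
gram_vec = CountVectorizer(analyzer="char", ngram_range=(4, 4), max_features=1000)

mfw3000 = mfw_vec.fit_transform(novel_texts)   # 628 x 3000 word counts
gram4   = gram_vec.fit_transform(novel_texts)  # 628 x 1000 tetragram counts
</code></p>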
</div>
<div type="div3" rend="DH-Heading3">
<head>Topic features</head>
<p>We use Latent Dirichlet Allocation (LDA) by Blei et al. (2003) to extract topics from our data. In literary texts topics sometimes represent themes, but more often they represent topoi, i.e. conventional ways of telling a story or parts of it. For each novel we derive a topic distribution, i.e. we calculate how strongly each topic is associated with that novel. We try different preprocessing approaches and topic numbers and build ten models for each setting to reduce the influence of randomness in LDA models. In every setting we first remove a set of predefined stop words from the novels and then apply LDA to our corpus of 628 novels. The different forms of preprocessing we use are the following (a code sketch of the topic-feature step follows the list): </p>
<list type="unordered">
<item>no additional preprocessing</item>
<item><ref target="https://github.com/MarkusKrug/NERDetection">removal
of Named Entities</ref> (Jannidis et al. 2015) </item>
<item><ref target="http://snowball.tartarus.org/">word stemming</ref>
</item>
<item><ref target="http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/">word lemmatization</ref>
</item>
<item>word lemmatization + removal of Named Entities </item>
</list>
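<p>A minimal sketch of the topic-feature step, using gensim as one possible LDA implementation; tokenization, stop-word removal and the preprocessing variants listed above are assumed to have been applied already.</p>
<p><code lang="python" xml:space="preserve">
# Train an LDA model and derive one topic distribution per novel.
from gensim import corpora, models

# tokenized_novels: list of token lists, preprocessed as described above (assumed)
dictionary = corpora.Dictionary(tokenized_novels)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_novels]

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=100, passes=10)

# Topic distribution per novel: probability of each of the 100 topics.
topic_features = [
    [prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in bow_corpus
]
</code></p>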
</div>
<div type="div3" rend="DH-Heading3">
<head>Network features</head>
<p>We use the character recognition system described in Jannidis et al. (2015) to identify the characters of each novel. Although the NER tool can be employed with co-reference resolution, we do not make use of this option here. We extract proper names to build a network in which each node is a character and the weight of the edge between two characters is the number of their co-occurrences in the same paragraph. The network of each novel is reduced to the most central characters and the most frequent interactions in order to bring out its basic shape. The network feature set consists of the total number of characters in a novel and six network measures: maximum degree centrality, global efficiency, transitivity, average clustering coefficient, central point dominance and density.</p>
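<p>How such measures could be computed with networkx is sketched below; the paragraph-wise co-occurrence counting and the pruning to the most central characters are assumed to have happened beforehand, and central point dominance is derived here from betweenness centrality (Freeman's formulation), which is an assumption about the exact computation.</p>
<p><code lang="python" xml:space="preserve">
# Build a weighted character network and compute the six measures plus character count.
import networkx as nx

G = nx.Graph()
for (char_a, char_b), count in cooccurrence.items():  # paragraph co-occurrence counts (assumed)
    G.add_edge(char_a, char_b, weight=count)

n_chars = G.number_of_nodes()
betw = nx.betweenness_centrality(G)
max_betw = max(betw.values())

features = {
    "num_characters":        n_chars,
    "max_degree_centrality": max(nx.degree_centrality(G).values()),
    "global_efficiency":     nx.global_efficiency(G),
    "transitivity":          nx.transitivity(G),
    "avg_clustering":        nx.average_clustering(G),
    # central point dominance: average gap between the most central node and the rest
    "central_point_dominance": sum(max_betw - b for b in betw.values()) / (n_chars - 1),
    "density":               nx.density(G),
}
</code></p>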
</div>
</div>
</div>
<div type="div1" rend="DH-Heading1">
<head>Evaluation</head>
<p>Classification is done by means of a linear Support Vector Machine (SVM), as we have already shown in Hettinger et al. (2015) that it works best in this setting (see also Yu 2008). In each experiment we apply 100 iterations of 10-fold cross-validation to account for the small data sets. The depicted results are averages over 1000 classification accuracy values. We want to investigate the following subgenre constellations: </p>
<list type="unordered">
<item>Adventure versus non-Adventure novels</item>
<item>Educational versus non-Educational novels</item>
<item>Social versus non-Social novels </item>
<item>Adventure versus Educational novels</item>
<item>Educational versus Social novels </item>
</list>
<p>Depending on the setting, the label distribution is often imbalanced. To make results comparable we use undersampling: in each of the 100 iterations a new sample is drawn from the larger class, while all instances of the smaller class are used. This results in a majority vote (MV) baseline that always yields an accuracy score of 0.50, as both classes have equal size.</p>
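<p>A condensed sketch of this evaluation protocol (linear SVM, undersampling of the larger class, repeated 10-fold cross-validation), using scikit-learn for illustration; the feature matrices <code>X_small</code> and <code>X_large</code> for the two classes are assumptions, and the original experiments are those described in Hettinger et al. (2015).</p>
<p><code lang="python" xml:space="preserve">
# 100 iterations: undersample the larger class, then run 10-fold CV with a linear SVM.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC
from sklearn.utils import resample

def evaluate(X_small, X_large, n_iter=100):
    scores = []
    for i in range(n_iter):
        # draw a fresh sample from the larger class, keep all of the smaller class
        X_sub = resample(X_large, replace=False, n_samples=len(X_small), random_state=i)
        X = np.vstack([X_small, X_sub])
        y = np.array([1] * len(X_small) + [0] * len(X_sub))
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
        scores.extend(cross_val_score(LinearSVC(), X, y, cv=cv, scoring="accuracy"))
    return np.mean(scores)  # average over 100 x 10 = 1000 accuracy values
</code></p>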
<p>To determine the influence of the LDA topic parameter
<hi rend="italic">t</hi> and different preprocessing procedures we report accuracy for
<hi rend="italic">t </hi>=10, 70, 100 and 200 (see figure 1). Differences between different preprocessing categories are minimal, but removal of named entities seems to improve results overall. We observe comparable result for
<hi rend="italic">t</hi> =10, 70 and 100 and a drop in performance for
<hi rend="italic">t</hi> = 200. Therefore, we will use LDA with lemmatized words and named entities removed for
<hi rend="italic">t</hi> = 100 in the following experiments.
</p>
<figure>
<graphic n="1001" width="10.668230555555555cm" height="6.892397222222222cm" url="049-image1.png" rend="inline"/>
</figure>
<p>a) adventure/non-adventure </p>
<figure>
<graphic n="1002" width="10.67cm" height="6.89cm" url="049-image2.png" rend="inline"/>
</figure>
<p>b) educational/non-educational</p>
<figure>
<graphic n="1003" width="10.67cm" height="6.89cm" url="049-image3.png" rend="inline"/>
</figure>
<p>c) social/non-social</p>
<figure>
<graphic n="1004" width="10.67cm" height="6.89cm" url="049-image4.png" rend="inline"/>
</figure>
<p>d) adventure/educational</p>
<figure>
<graphic n="1005" width="10.67cm" height="6.89cm" url="049-image5.png" rend="inline"/>
</figure>
<p>e) educational/social</p>
<p><hi rend="bold">Fig. 1</hi>: Classification results for different subgenre
settings in terms of accuracy using LDA with topic size <hi rend="italic">t</hi>
= 10, 70, 100, 200 and five different preprocessing procedures on 628 German
novels </p>
<p>When comparing different feature sets across our subgenre constellations, we can
see that semantic-based features (mfw, 4grams, lda) all perform quite well, while
network features perform rather poorly (see figure 2). With an accuracy score of
more than 90%, adventure novels seem to be fairly easy to differentiate from
other genres. In contrast, the other genres don’t show such a distinct signal
using surface features. </p>
<figure>
<graphic n="1006" width="15.138125cm" height="8.845930555555556cm" url="049-image6.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 2</hi>: Accuracy for different scenarios and feature sets
including the majority vote baseline. </p>
<p>As the classification performance of adventure/educational is quite impressive, we
take a closer look at the discriminating words of these genres (figure 3). Some
of the most typical words of adventure novels include <hi rend="italic">captain,
shore, (on) board, help, danger, immediately</hi>. On the other hand, words
like <hi rend="italic">soul, love, world, heart, (at) home, mother, joy, often,
father, emotion, human, to love</hi> are characteristic of educational
novels. </p>
<figure>
<graphic n="1007" width="14.16843888888889cm" height="15.886536111111111cm" url="049-image7.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 3</hi>: Discriminating words for adventure (preferred) and
educational (avoided) novels (Craig’s Zeta)</p>
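<p>Such preference lists can be approximated with a Zeta-style contrast score that compares, for every word, the proportion of text segments of one genre containing the word with the corresponding proportion in the other genre. The sketch below uses the simple difference of these proportions; segment size and the exact normalization differ between Burrows' and Craig's variants, so this is only an approximation of the measure behind figure 3.</p>
<p><code lang="python" xml:space="preserve">
# Zeta-style preference score: proportion of adventure segments containing a word
# minus the proportion of educational segments containing it (range -1 to 1).
def zeta_scores(adventure_segments, educational_segments):
    # each argument: list of token sets, one per text segment (assumed prepared elsewhere)
    vocab = set().union(*adventure_segments, *educational_segments)
    def doc_proportion(word, segments):
        return sum(1 for seg in segments if word in seg) / len(segments)
    return {
        w: doc_proportion(w, adventure_segments) - doc_proportion(w, educational_segments)
        for w in vocab
    }

# Words with the highest scores are "preferred" by adventure novels,
# those with the lowest scores are "avoided" (i.e. typical of educational novels).
</code></p>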
<p>To test whether the authorship of the novels influences our results, we remove the
author signal by allowing only one document per author. In this way we construct a new
dataset called ‘uni’. The sampling is done once, so that the same novels are used
in each setting. As shown in figure 4, we observe a much lower classification quality
after removing the authorship information, indicating that features and models focus
too strongly on the hidden authorship signal. The effect varies between settings:
adventure/educational shows a loss of 0.09 (blue lines), while educational/social
loses 0.23 (red lines). The relatively small loss in the first setting is remarkable,
as it contains 8 novels per author on average. One would expect the opposite given
the weaker author signal of two novels per author for the other categories. </p>
<figure>
<graphic n="1008" width="15.896166666666666cm" height="8.6995cm" url="049-image8.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 4</hi>: The influence of authorship </p>
<p>In the next experiment we test whether the combination of feature sets changes
our classification results. To balance the size of the feature sets, we use
Principal Component Analysis (PCA) and construct 100 features each from the 3000 mfw
and the 1000 4gram features. As shown in figure 5, some feature sets improve
when combined (e.g. 4gram100 and lda100), while for others (e.g. lda100 and
network) performance decreases. However, classification results vary greatly in this
setting, as signaled by the standard deviation bars, so these differences should not
be overrated.</p>
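<p>A sketch of this dimensionality reduction step, again using scikit-learn; here <code>mfw3000_scaled</code>, <code>gram4_scaled</code> and <code>lda100</code> are assumed to be the normalized feature matrices described in section 2.2, held as dense arrays.</p>
<p><code lang="python" xml:space="preserve">
# Reduce mfw3000 and 4gram features to 100 principal components each,
# then concatenate feature sets for the combined classification runs.
import numpy as np
from sklearn.decomposition import PCA

mfw100   = PCA(n_components=100).fit_transform(mfw3000_scaled)  # 628 x 100
gram100  = PCA(n_components=100).fit_transform(gram4_scaled)    # 628 x 100
combined = np.hstack([gram100, lda100])  # e.g. the 4gram100 + lda100 combination
</code></p>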
<figure>
<graphic n="1009" width="15.191041666666667cm" height="9.714894444444445cm" url="049-image9.png" rend="inline"/>
</figure>
<p><hi rend="bold">Fig. 5</hi>: Classification results for combinations of different
feature sets including bars for standard deviation</p>
</div>
<div type="div1" rend="DH-Heading1">
<head>Conclusion</head>
<p>In this work we classified subgenres of German novels using different feature sets (mfw3000, 4gram, lda, etc.). Some subgenres, like adventure novels, are much easier to classify than others. Most of the applied feature sets showed varying but comparable performance, with the exception of the network features. Their weak performance might be caused by a weak link between novel subgenre and character constellation. The variability of the subgenre signal could not be countered by using higher-level features like topics and network characteristics. Interestingly, the author signal has a strong influence on the classification quality. The strength of this influence seems to depend on the category but is visible in all experiments. In the future, we would like to extend our work by using different network features, working on advanced topic models and finding a reliable indicator for plot. Another challenge we have not yet addressed is the development of subgenres over time. </p>
</div>
</body>
<back>
<div type="bibliogr">
<listBibl>
<head>Bibliographie</head>
<bibl>
<hi rend="bold">Allison, Sarah / Heuser, Ryan / Jockers, Matthew / Moretti,
Franco / Witmore, Michael</hi> (2011): <hi rend="italic">Quantitative
Formalism</hi>. An Experiment (= Stanford Literary Lab Pamphlet 1) <ref
target="http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf"
>http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf</ref> [letzter
Zugriff 09. Februar 2016]. </bibl>
<bibl>
<hi rend="bold">Biber, Douglas</hi> (1989): "A typology of English texts",
in: <hi rend="italic">Linguistics </hi>27: 3-43. </bibl>
<bibl>
<hi rend="bold">Blei, David / Ng, Andrew / Jordan, Michael</hi> (2003):
"Latent Dirichlet allocation", in: <hi rend="italic">The Journal of Machine
Learning Research</hi> 3: 993-1022. </bibl>
<bibl>
<hi rend="bold">Finn, Aidan / Kushmerick, Nicholas</hi> (2006): "Learning to
classify documents according to genre", in: <ref
target="http://dx.doi.org/10.1002/asi.20427"><hi rend="italic">Journal
of the American Society for Information Science and
Technology</hi></ref><hi rend="italic"> (JASIST)</hi>. Special Issue on
Computational Analysis of Style 57, 11: 1506-1518.</bibl>
<bibl>
<hi rend="bold">Freund, Luanne / Clarke, Charles L. A. / Toms, Elaine
G.</hi> (2006): "Towards genre classification for IR in the workplace",
in: <hi rend="italic">Proceedings of the 1st international conference on
Information interaction in context</hi> (IIiX). New York, NY: ACM 30-36.
<ref target="http://dx.doi.org/10.1145/1164820.1164829"
>http://dx.doi.org/10.1145/1164820.1164829 </ref> [letzter Zugriff 09.
Februar 2016].</bibl>
<bibl>
<hi rend="bold">Hettinger, Lena / Becker, Martin / Reger, Isabella /
Jannidis, Fotis / Hotho, Andreas</hi> (2015): "Genre classification on
German novels", in: <hi rend="italic">Proceedings of the 12th International
Workshop on Text-based Information Retrieval</hi>. </bibl>
<bibl>
<hi rend="bold">Jannidis, Fotis / Krug, Markus / Reger, Isabella / Toepfer,
Martin / Weimer, Lukas / Puppe, Frank</hi> (2015): "Automatische
Erkennung von Figuren in deutschsprachigen Romanen", in: <hi rend="italic"
>DHd-Tagung 2015</hi>. Von Daten zu Erkenntnissen, 23. bis 27. Februar
2015, Graz. </bibl>
<bibl>
<hi rend="bold">Jockers, Matthew L.</hi> (2013): <hi rend="italic"
>Macroanalysis</hi>. Digital Methods and Literary History. Champaign:
University of Illinois Press. </bibl>
<bibl><hi rend="bold">Krug, Markus</hi> (2015): <hi rend="italic"
>NERDetection</hi>
<ref target="https://github.com/MarkusKrug/NERDetection"
>https://github.com/MarkusKrug/NERDetection</ref> [letzter Zugriff 09.
Februar 2016].</bibl>
<bibl>
<hi rend="bold">Petrenz, Philipp / Webber, Bonnie</hi> (2011): "Stable
classification of text genres", in: <hi rend="italic"><ref
target="http://dx.doi.org/10.1162/COLI_a_00052%20">Computational
Linguistics</ref></hi> 37, 2: 385-393. </bibl>
<bibl><hi rend="bold">Porter, Martin / Boulton, Richard</hi> (2001-2014): <hi
rend="italic">Snowball</hi>
<ref target="http://snowball.tartarus.org/"
>http://snowball.tartarus.org/</ref> [letzter Zugriff 09. Februar
2016].</bibl>
<bibl><hi rend="bold">Projekt Gutenberg-DE</hi> (1994-): <hi rend="italic"
>Projekt Gutenberg-DE</hi>
<ref target="http://gutenberg.spiegel.de/"
>http://gutenberg.spiegel.de/</ref> [letzter Zugriff 09. Februar
2016].</bibl>
<bibl>
<hi rend="bold">Santini, Marina</hi> (2004): "State-of-the-art on Automatic
Genre Identification", in: <hi rend="italic">Technical Report ITRI-04-03,
ITRI, University of Brighton (UK)</hi>. </bibl>
<bibl><hi rend="bold">Schmid, Helmut</hi>(1994-): <hi rend="italic"
>TreeTagger</hi>. A language independent part-of-speech tagger <ref
target="http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/"
>http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/</ref> [letzter
Zugriff 09. Februar 2016].</bibl>
<bibl>
<hi rend="bold">Sharoff, Serge / Wu, Zhili / Markert, Katja</hi> (2010):
"The Web Library of Babel: evaluating genre collections", in: <hi
rend="italic">Proceedings of the conference on Language Resources and
Evaluation (LREC), Malta</hi> 3063-3070. </bibl>
<bibl><hi rend="bold">TextGrid</hi> (2006-2015): <hi rend="italic">Die Digitale
Bibliothek bei TextGrid</hi>
<ref target="textgrid.de/digitale-bibliothek"
>textgrid.de/digitale-bibliothek</ref> [letzter Zugriff 09. Februar
2016].</bibl>
<bibl>
<hi rend="bold">Underwood, Ted</hi> (2014): <hi rend="italic">Understanding
Genre in a Collection of a Million Volumes</hi>. White Paper Report
29.12.2014 <ref
target="http://files.figshare.com/1857045/UnderstandingGenreInterimReport.pdf"
> http://files.figshare.com/1857045/UnderstandingGenreInterimReport.pdf
</ref> [letzter Zugriff 09. Februar 2016].</bibl>
<bibl>
<hi rend="bold">Underwood, Ted / Black, Michael L. / Auvil, Loretta /
Capitanu, Boris</hi> (2013): "Mapping Mutable Genres in Structurally
Complex Volumes", in: <hi rend="italic">2013 IEEE International Conference
on Big Data</hi>
<ref target="http://arxiv.org/abs/1309.3323v2"
>http://arxiv.org/abs/1309.3323v2</ref> [letzter Zugriff 09. Februar
2016]. </bibl>
<bibl>
<hi rend="bold">Yu, Bei</hi> (2008): "An Evaluation of Text Classification
Methods for Literary Study", in: <hi rend="italic">Literary and Linguistic
Computing</hi> 23: 327-343 <ref
target="http://dx.doi.org/10.1093/llc/fqn015"
>http://dx.doi.org/10.1093/llc/fqn015</ref> [letzter Zugriff 09. Februar
2016]. </bibl>
</listBibl>
</div>
</back>
</text>
</TEI>