Towards a Computational Study of German Book Reviews: A Comparison between Emotion Dictionaries and Transfer Learning in Sentiment Analysis

Rebora, Simone; Messerli, Thomas; Herrmann, J. Berenike

This poster reports on the groundwork for a computational study of evaluative practices in German-language book reviews. We trained sentence-level classifiers for evaluation and sentiment on the LOBO corpus, which comprises ~1.3 million book reviews downloaded from the social reading platform LovelyBooks.

For the two classification tasks, we compared the performance of dictionary-based and transfer-learning-based (TL) sentiment analysis (SA). To apply dictionary-based SA systematically, we created a repository of twelve open-source German-language SA lexicons (see Table 1). The lexicon formats were unified so that reviews could be annotated for sentiment automatically in a single processing pipeline. For the TL approaches, we chose BERT and FastText, both of which are based on distributional representations of natural language (see Devlin et al., 2019; Mikolov et al., 2017).

Tab. 1: Overview of German-language sentiment dictionaries
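For illustration, the sketch below shows a minimal Python version of the dictionary-based annotation step. The unified tab-separated lexicon format, the file name, and the mean-polarity aggregation are assumptions made here for readability, not the authors’ exact implementation.

```python
import csv

def load_lexicon(path):
    """Load a lexicon in the (assumed) unified format:
    one tab-separated 'word<TAB>polarity' entry per line."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for word, polarity in csv.reader(f, delimiter="\t"):
            lexicon[word.lower()] = float(polarity)
    return lexicon

def score_sentence(tokens, lexicon):
    """Mean polarity of the tokens covered by the lexicon (0.0 if none match)."""
    hits = [lexicon[t.lower()] for t in tokens if t.lower() in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical usage with one of the twelve unified lexicons:
lexicon = load_lexicon("lexicons/sentiws_unified.tsv")
print(score_sentence("Dieses Buch hat mich restlos begeistert".split(), lexicon))
```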

The dictionary-based and TL approaches were evaluated on two manually annotated datasets, produced by two annotators: in the first dataset (~21,000 sentences), the annotation task was to identify evaluative (as opposed to descriptive) language; in the second dataset (~13,500 sentences), the task focused on the distinction between positive and negative sentiment. These two classification tasks form the basis for a large-scale analysis of the LOBO corpus that segments reviews into evaluative and descriptive passages in order to describe differences in evaluative practices across genres (e.g., romance, science fiction) and ratings (1–5 stars).

For the creation of the Gold Standard for Task 1 (evaluation classification), the reliability of the manual annotation was evaluated on a subset of 250 reviews (~4,000 sentences). Cohen’s Kappa (0.76) indicated strong agreement between the annotators. Overall, 66% of all sentences were annotated as “evaluation”. Training an SVM classifier on the features generated by the twelve sentiment dictionaries yielded a macro F1 score of 0.75 (see Table 2 for details).

Tab. 2: Performance of the dictionary-based SVM on Task 1
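This evaluation step can be sketched with scikit-learn as follows; the synthetic stand-in data, the linear SVM variant, and the one-score-per-dictionary feature matrix are assumptions for illustration, since the poster does not specify these details.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-ins: in the study, the labels come from the two annotators
# and X holds one aggregated sentiment score per dictionary and sentence.
rng = np.random.default_rng(42)
labels_a = rng.integers(0, 2, 4000)                 # annotator A: 0=description, 1=evaluation
agree = rng.random(4000) < 0.9                      # annotator B agrees ~90% of the time
labels_b = np.where(agree, labels_a, 1 - labels_a)
print("Cohen's Kappa:", cohen_kappa_score(labels_a, labels_b))

X = rng.normal(size=(4000, 12))                     # 12 dictionary-based features per sentence
X_train, X_test, y_train, y_test = train_test_split(X, labels_a, test_size=0.2, random_state=0)
clf = LinearSVC().fit(X_train, y_train)
print("Macro F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```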

To compare the dictionaries among themselves, the same classifier was trained separately on each individual dictionary. Fig. 1 shows the results, with AffectiveNorms as the best-performing dictionary (macro F1 score of 0.67).

Fig. 1: Performance of individual German-language SA dictionaries (Task 1)

By contrast, the TL methods proved substantially more effective, with macro F1 scores of 0.83 for FastText and 0.89 for BERT (results obtained via five-fold cross-validation, repeated five times to average out variance; see Table 3 for details on BERT).

Tab. 3: Performance of BERT on Task 1
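A minimal sketch of this cross-validation setup with Hugging Face transformers and scikit-learn follows. The checkpoint name (bert-base-german-cased), the epoch count, and the remaining hyperparameters are assumptions, as the poster does not specify them.

```python
import numpy as np
import torch
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-german-cased"  # assumption: the checkpoint is not named in the poster

class SentenceDataset(torch.utils.data.Dataset):
    """Wraps tokenized sentences and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

def cross_validate(sentences, labels, n_splits=5, n_repeats=5):
    """Five-fold CV repeated five times, as in the evaluation; returns the mean macro F1."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    encode = lambda idx: tokenizer([sentences[i] for i in idx], truncation=True, padding=True)
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
    scores = []
    for train_idx, test_idx in cv.split(sentences, labels):
        model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
        trainer = Trainer(
            model=model,
            # epoch count and other hyperparameters are placeholder assumptions
            args=TrainingArguments(output_dir="out", num_train_epochs=3, report_to="none"),
            train_dataset=SentenceDataset(encode(train_idx), [labels[i] for i in train_idx]),
        )
        trainer.train()
        preds = trainer.predict(SentenceDataset(encode(test_idx), [labels[i] for i in test_idx]))
        y_true = [labels[i] for i in test_idx]
        scores.append(f1_score(y_true, preds.predictions.argmax(-1), average="macro"))
    return float(np.mean(scores))
```

FastText could be plugged into the same cross-validation loop, e.g. via fasttext.train_supervised on files written from the corresponding splits.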

The evaluation procedure was repeated for Task 2 (positive vs. negative sentiment). Again, inter-annotator agreement on the manual annotation of the Gold Standard was strong (Cohen’s Kappa = 0.79). Annotation percentages are shown in Fig. 2 (where the “other” category covers both mixed feelings and the absence of evaluation).

Fig. 2: Percentages of annotations for Task 2

The dictionary-based SVM classifier reached a macro F1 score of 0.64, while among the individual dictionaries the best performance was obtained by SentiArt (see Fig. 3). Performance was again higher for FastText (macro F1 score = 0.72) and best for BERT (macro F1 score = 0.83). However, the learning curve for BERT shows that there is still room for improvement, with performance not yet reaching a plateau (see Fig. 4).

Fig. 3: Performance of individual German-language SA dictionaries on Task 2

Fig. 4: Learning curve of BERT for Task 2 (with increasing amounts of training material)
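The learning-curve procedure itself is straightforward to sketch: train on growing fractions of the training split and track macro F1 on a fixed test set. The snippet below uses a linear SVM on synthetic data as a runnable stand-in for the BERT classifier.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC

# Synthetic stand-in data; in the study these would be the Task 2 sentences,
# with BERT as the classifier. The procedure, not the model, is the point here.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 12))
y = (X[:, 0] + rng.normal(scale=1.5, size=10_000) > 0).astype(int)
X_train, y_train = X[:8_000], y[:8_000]
X_test, y_test = X[8_000:], y[8_000:]

for frac in (0.1, 0.25, 0.5, 0.75, 1.0):
    n = int(frac * len(X_train))                     # growing training subset
    clf = LinearSVC().fit(X_train[:n], y_train[:n])
    macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
    print(f"{n:>5} training sentences -> macro F1 = {macro_f1:.3f}")
```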

Tab. 4: Overview of results for lexicon-based and TL-based approaches

Our results highlight the superior performance of the TL methods (see Table 4) and of dictionaries based on vector space models (such as SentiArt and AffectiveNorms). They show that computational methods can reliably identify the sentiment of German-language book reviews. To fruitfully apply a similar methodology to types of reviewer engagement with literature beyond the descriptive/evaluative and positive/negative dichotomies, a useful next step will be to design TL tasks for the identification of more fine-grained evaluative practices. These include the construction of, and orientation to, particular evaluative scales (e.g. reading pleasure, literary quality) and particular subjects of evaluation (e.g. novels, authors, characters).


Bibliography